RISENT October 3, 2017 at 15:43

Writing a Telegram bot that parses job vacancies in JS

Building bots for Telegram is becoming increasingly popular, and more and more programmers are trying their hand at it. Every developer occasionally has ideas and tasks that a purpose-built bot could solve. For me, as a JS programmer, one such pressing task is monitoring the job market in my area.

Python is one of the most popular languages for writing bots: it offers a huge number of good libraries for processing and parsing all kinds of text sources. But I wanted to do it in JavaScript, one of my favorite languages.

Task

The main goal is to create a detailed job feed with tagging and clean visual formatting. It breaks down into separate subtasks:

interaction with Telegram API;
parsing RSS feeds of sites with vacancies;
parsing a single vacancy;
thematic tagging;
visual presentation of information;
duplication prevention.

At first I considered using a ready-made universal bot such as @TheFeedReaderBot. On closer inspection, though, it turned out to have no tagging at all and very limited options for customizing how content is displayed. Fortunately, modern JavaScript has plenty of libraries that help solve these problems. But first things first.

Bot skeleton

Of course, it would be possible to talk to the Telegram REST API directly, but ready-made solutions take less effort. I chose the slimbot npm package, which is referenced in the official bot creation tutorials. Although we will only be sending messages, the package makes life much easier and lets us wrap the bot in a small internal API of our own:

const Slimbot = require('slimbot');
const config = require('./config.json');
const bot = new Slimbot(config.TELEGRAM_API_KEY);
bot.startPolling();
function logMessageToAdmin(message, type='Error') {
    bot.sendMessage(config.ADMIN_USER, `${type}\n${message}`, {
        parse_mode: 'HTML'
    });
}
function postVacancy(message) {
    bot.sendMessage(config.TARGET_CHANNEL, message, {
        parse_mode: 'HTML',
        disable_web_page_preview: true,
        disable_notification: true
    });
}
module.exports = {
    postVacancy,
    logMessageToAdmin
};
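
For comparison, talking to the Bot API directly mostly comes down to POSTing JSON to the sendMessage method. The sketch below only builds such a request and does not send it. The helper name buildSendMessageRequest, the sample token and the sample channel are made up for illustration and are not part of slimbot or of the original project.

```javascript
// Builds (but does not send) an HTTP request for the Bot API sendMessage method.
// buildSendMessageRequest is a hypothetical helper, not part of slimbot.
function buildSendMessageRequest(token, chatId, text, options = {}) {
    return {
        url: `https://api.telegram.org/bot${token}/sendMessage`,
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        // Extra options (parse_mode and so on) are merged into the JSON body.
        body: JSON.stringify(Object.assign({ chat_id: chatId, text }, options))
    };
}

// Placeholder token and channel, for illustration only.
const req = buildSendMessageRequest('123:ABC', '@some_channel', '<b>Hi</b>', {
    parse_mode: 'HTML',
    disable_web_page_preview: true
});
console.log(req.url);
```

The resulting object can be handed to any HTTP client. A library such as slimbot hides this plumbing behind a single sendMessage call.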

We will use a plain setInterval as the scheduler and feed-read for parsing RSS. The vacancy sources are Moi Krug ("My Circle") and hh.ru.

const feed = require("feed-read");
const config = require('./config.json');
const HhAdapter = require('./adapters/hh');
const MoikrugAdapter = require('./adapters/moikrug');
const bot = require('./bot');
const { FeedItemModel } = require('./lib/models');
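function processFeed(articles, adapter) {
    articles.forEach(article => {
        if (!adapter.isValid(article)) {
            return;
        }
        const key = adapter.getKey(article);
        new FeedItemModel({
            key,
            data: article
        }).save().then(
            // Saved for the first time: parse the vacancy page and post it.
            () => adapter.parseItem(article)
                .then(bot.postVacancy)
                .catch(bot.logMessageToAdmin),
            // Saving failed, typically because the key already exists: skip.
            () => {}
        );
    });
}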
setInterval(() => {
    feed(config.HH_FEED, function (err, articles) {
        if (err) {
            bot.logMessageToAdmin(err);
            return;
        }
        processFeed(articles, HhAdapter);
    });
    feed(config.MOIKRUG_FEED, function (err, articles) {
        if (err) {
            bot.logMessageToAdmin(err);
            return;
        }
        processFeed(articles, MoikrugAdapter);
    });
}, config.REQUEST_PERIOD_TIME);

Parsing a single job

Because vacancy pages are structured differently on each source site, the parsing logic differs too. That is why I used adapters that expose a unified interface. To work with the DOM on the server I picked the jsdom library, which supports the standard operations we rely on heavily: finding an element by CSS selector and reading its contents.
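
The adapter contract is small: the feed loop only calls isValid, getKey and parseItem. A minimal sketch of that contract with a stub adapter might look like this. The names ADAPTER_METHODS, assertAdapter and stubAdapter are hypothetical and do not appear in the project.

```javascript
// The three methods the feed-processing loop calls on every adapter.
const ADAPTER_METHODS = ['isValid', 'getKey', 'parseItem'];

// Hypothetical guard: throws early if an adapter misses part of the contract.
function assertAdapter(adapter) {
    const missing = ADAPTER_METHODS.filter(
        name => typeof adapter[name] !== 'function'
    );
    if (missing.length > 0) {
        throw new TypeError(`Adapter is missing: ${missing.join(', ')}`);
    }
    return adapter;
}

// Stub adapter that skips the network and returns title plus link as the message.
const stubAdapter = assertAdapter({
    isValid: item => Boolean(item && item.link),
    getKey: item => item.link,
    parseItem: item => Promise.resolve(`${item.title}\n${item.link}`)
});
```

A stub like this is handy for exercising the feed loop without hitting the real sites.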

Moikrug adapter

const request = require('superagent');
const jsdom = require('jsdom');
const { JSDOM } = jsdom;
const { getTags } = require('../lib/tagger');
const { getJobType } = require('../lib/jobType');
const { render } = require('../lib/render');
function parseItem(item) {
    return new Promise((resolve, reject) => {
        request
            .get(item.link)
            .end(function(err, res) {
                if(err) {
                    console.log(err);
                    reject(err);
                    return;
                }
                const dom = new JSDOM(res.text);
                const element = dom.window.document.querySelector(".vacancy_description");
                const salaryElem =  dom.window.document.querySelector(".footer_meta .salary");
                // 'Не указана.' is Russian for "Not specified."
                const salary = salaryElem ? salaryElem.textContent : 'Не указана.';
                const locationElem =  dom.window.document.querySelector(".footer_meta .location");
                const location = locationElem && locationElem.textContent;
                const title =  dom.window.document.querySelector(".company_name").textContent;
                const titleFooter =  dom.window.document.querySelector(".footer_meta").textContent;
                const pureContent = element.textContent;
                resolve(render({
                    tags: getTags(pureContent),
                    // "ЗП" is the Russian abbreviation for "salary".
                    salary: `ЗП: ${salary}`,
                    location,
                    title,
                    link: item.link,
                    description: element.innerHTML,
                    jobType: getJobType(titleFooter),
                    important: Array.from(element.querySelectorAll('strong')).map(e => e.textContent)
                }))
            });
    });
}
function getKey(item) {
    return item.link;
}
function isValid() {
    return true
}
module.exports = {
    getKey,
    isValid,
    parseItem
};

hh adapter

const request = require('superagent');
const jsdom = require('jsdom');
const { JSDOM } = jsdom;
const { getTags } = require('../lib/tagger');
const { getJobType } = require('../lib/jobType');
const { render } = require('../lib/render');
function parseItem(item) {
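    // The feed item body is expected to hold, in order: title, date, region, salary.
    const parts = item.content.split(/\n|<\/p>/).filter(i => i);
    // Only region and salary are used; the title is read from the vacancy page below.
    const [, , region, salary] = parts;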
    return new Promise((resolve, reject) => {
        request
            .get(item.link)
            .end(function(err, res) {
                if(err) {
                    console.log(err);
                    reject(err);
                    return;
                }
                const dom = new JSDOM(res.text);
                const element = dom.window.document.querySelector('.b-vacancy-desc-wrapper');
                const title = dom.window.document.querySelector('.companyname').textContent;
                const pureContent = element.textContent;
                const tags = getTags(pureContent);
                resolve(render({
                    title,
                    location: region.split(': ')[1] || region,
                    // "ЗП" is the Russian abbreviation for "salary".
                    salary: `ЗП: ${salary.split(': ')[1] || salary}`,
                    tags,
                    description: element.innerHTML,
                    link: item.link,
                    jobType: getJobType(pureContent),
                    important: Array.from(element.querySelectorAll('strong')).map(e => e.textContent)
                }))
            });
    });
}
function getKey(item) {
    return item.link;
}
function isValid() {
    return true
}
module.exports = {
    getKey,
    isValid,
    parseItem
};
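
To make the splitting step in the hh adapter concrete, here is how an equivalent expression behaves on a made-up feed item body. The sample string is an assumption about the feed layout, not a real hh.ru response, and valueOf is a hypothetical helper name.

```javascript
// Hypothetical RSS item body: four paragraphs in the order the adapter expects.
const sampleContent =
    '<p>Frontend developer</p>\n<p>Date: 03.10.2017</p>\n' +
    '<p>Region: Moscow</p>\n<p>Salary: 100000</p>';

// Equivalent to the split in the adapter: break on newlines and closing </p>
// tags, then drop the empty strings left between them.
const parts = sampleContent.split(/\n|<\/p>/).filter(i => i);
const [, , region, salary] = parts;

// Keep the text after "label: " when there is a label, else the whole string.
const valueOf = field => field.split(': ')[1] || field;

console.log(valueOf(region), valueOf(salary));
```

Note that each part still carries its opening tag, which is one reason the adapter takes only the text after the colon.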

Formatting

After parsing, the information has to be presented in a readable form, but the Telegram API offers few options for this: messages can contain only a handful of tags and Unicode symbols (emoji and stickers do not count). The input is a couple of semantic fields plus the description itself as raw HTML. A short search turns up a solution: the html-to-text library. After studying its API and implementation in detail, I could not help wondering why the formatting functions are called through a closure rather than looked up from the dynamic config, which throws away many of the advantages that configuration parameters would provide. So to render nice bullets instead of li elements in lists, I had to cheat a little:

const htmlToText = require('html-to-text');
const whiteSpaceRegex = /^\s*$/;
function render({
    title, location, salary, tags, description, link, important = [], jobType='' 
}) {
    let formattedDescription = htmlToText
        .fromString(description, {
            wordwrap: null,
            noLinkBrackets: true,
            hideLinkHrefIfSameAsText: true,
            format: {
                unorderedList: function formatUnorderedList(elem, fn, options) {
                    let result = '';
                    const nonWhiteSpaceChildren = (elem.children || []).filter(
                        c => c.type !== 'text' || !whiteSpaceRegex.test(c.data)
                    );
                    nonWhiteSpaceChildren.forEach(function(elem) {
                        result += ' ● ' + fn(elem.children, options) + '\n';
                    });
                    return '\n' + result + '\n';
                }
            }
        })
        .replace(/\n\s*\n/g, '\n');
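    // Emphasize the "Key: value" phrases that were <strong> on the source page.
    // Assumption: the original wrapped each match in a bold tag that was lost
    // when this listing was extracted; Telegram's HTML mode supports <b>.
    important.filter(text => text.includes(':')).forEach(text => {
        // Escape regex metacharacters so the phrase is matched literally.
        const escaped = text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
        formattedDescription = formattedDescription.replace(
            new RegExp(escaped, 'g'),
            `<b>${text}</b>`
        )
    });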
    const formattedTags = tags.map(t => '#' + t).join(' ');
    const locationFormatted = location ? `#${location.replace(/ |-/g, '_')} `: '';
    return `${title}\n${locationFormatted}#${jobType}\n${salary}\n${formattedTags}\n${formattedDescription}\n${link}`;
}
module.exports = {
    render
};
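
One related detail: with parse_mode set to 'HTML', Telegram expects every <, > and & that is not part of a supported tag or entity to be escaped. A minimal sketch of such a helper follows. escapeHtml is a hypothetical name, and where exactly it would be called inside render is left open.

```javascript
// Escapes the three characters that Telegram's HTML parse mode treats specially.
// "&" is replaced first so that the "&" inside the "&lt;" and "&gt;" produced
// by the later steps is not escaped a second time.
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

console.log(escapeHtml('C++ & <template> experience'));
```

Escaping has to happen before any formatting tags are added, otherwise the tags themselves would be escaped too.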

Tagging

Suppose we now have nicely formatted job descriptions but still no tagging. To solve this, I tokenized the natural-language Russian text with the az library. That let me filter the words in the token stream and replace them with tags whenever the tag dictionary has a matching entry.

const Az = require('az');
const namesMap = require('../resources/tagNames.json');
function onlyUnique(value, index, self) {
    return self.indexOf(value) === index;
}
function getTags(pureContent) {
    const tokens = Az.Tokens(pureContent).done();
    const tags = tokens.filter(t => t.type.toString() === 'WORD')
        .map(t => t.toString().toLowerCase().replace('-', '_'))
        .map(name => namesMap[name])
        .filter(t => t)
        .filter(onlyUnique);
    return tags;
}
module.exports = {
    getTags
};

Dictionary format

{
  "js": "JS",
  "javascript": "JS",
  "sql": "SQL",
  "ангуляр": "Angular",
  "angular": "Angular",
  "angularjs": "Angular",
  "react": "React",
  "reactjs": "React",
  "реакт": "React",
  "node": "NodeJS",
  "nodejs": "NodeJS",
  "linux": "Linux",
  "ubuntu": "Ubuntu",
  "unix": "UNIX",
  "windows": "Windows"
   ....
}
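
To show how the dictionary drives tagging, here is a dependency-free sketch. It swaps Az.Tokens for a simple regular-expression word match, which is a much cruder tokenizer, and uses a small hypothetical dictionary instead of tagNames.json. The name getTagsSketch is made up for this example.

```javascript
// Tiny stand-in for resources/tagNames.json (hypothetical subset).
// A Map avoids accidental hits on inherited keys such as "constructor".
const namesMap = new Map([
    ['js', 'JS'],
    ['javascript', 'JS'],
    ['react', 'React'],
    ['реакт', 'React'],
    ['nodejs', 'NodeJS']
]);

function getTagsSketch(text) {
    // Crude tokenizer: runs of Unicode letters, digits, underscores and hyphens.
    const words = text.toLowerCase().match(/[\p{L}\p{N}_-]+/gu) || [];
    const tags = words
        .map(word => namesMap.get(word.replace(/-/g, '_')))
        .filter(Boolean);
    // A Set keeps the first occurrence of each tag and drops repeats.
    return [...new Set(tags)];
}

console.log(getTagsSketch('Ищем JavaScript-разработчика: React (реакт), NodeJS, JS.'));
```

In this sample the hyphenated word "JavaScript-разработчика" stays a single token and finds no dictionary entry, which illustrates why a proper tokenizer matters for real text.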

Deployment and everything else

To publish each vacancy only once, I used MongoDB and reduced the problem to the uniqueness of the vacancy links themselves. To monitor the processes and their logs on the server I chose the pm2 process manager, and deployment is done with a plain bash script. The server, by the way, is the simplest Droplet from DigitalOcean.
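
The idea of relying on key uniqueness can be sketched without a database. Below, an in-memory Set stands in for the MongoDB collection. The names seenKeys and markIfNew and the sample links are hypothetical, and how the original FeedItemModel schema enforces uniqueness is not shown in the article.

```javascript
// In-memory stand-in for a collection that allows each key only once.
const seenKeys = new Set();

// Returns true the first time a key is seen and false on every repeat.
function markIfNew(key) {
    if (seenKeys.has(key)) {
        return false;
    }
    seenKeys.add(key);
    return true;
}

// Hypothetical feed items; the first and third share a link.
const items = [
    { link: 'https://example.com/vacancy/1' },
    { link: 'https://example.com/vacancy/2' },
    { link: 'https://example.com/vacancy/1' }
];

// Only items whose link has not been seen before would be posted.
const toPost = items.filter(item => markIfNew(item.link));
console.log(toPost.map(item => item.link));
```

Unlike this sketch, a database keeps the set of published links across restarts, so the bot does not repost old vacancies after a redeploy.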

Deployment script

#!/usr/bin/env bash
# rs is an alias for the server access configuration
rsync ./ rs:/var/www/js_jobs_bot --delete -r --exclude=node_modules
ssh rs "
. ~/.nvm/nvm.sh
cd /var/www/js_jobs_bot/ 
mv prod-config.json config.json
npm i && pm2 restart processes.json
"

Conclusions

Writing simple bots turned out to be not hard at all: all you need is the desire, knowledge of a programming language (preferably Python or JS) and a couple of days of free time. You can see the results of my bot's work, and the thematic job feed itself, in the corresponding channel: @jsjobs.

P.S. The full source code can be found in my repository.
