How I monitored Avito via SMS

As you know, items of very good quality at a very low price do appear on Avito. But they appear rarely, don't stay up for long, and disappear quickly.

So I had an idea: why not find a service that checks the listings every few minutes and notifies me if something interesting appears? Ideally the notification should come via SMS, since I don't always check my mail promptly.

Google turned up several such services, priced at "only" 3 rubles per SMS or 4 rubles per day and up.

In the end, I decided to write such a service myself, but more on that later ...

For fun, I signed up for one of these services. Yesterday it was checking my links every 15 minutes and, if something changed, sending notifications by e-mail. As for SMS, the site casually mentioned that mail.ru can forward mail as SMS. In practice it turned out that mail.ru can only send to MegaFon numbers, which I don't even have... And if you need Beeline or MTS, the service will gladly help, for a fee.

I should also mention that I have long been a user of a very convenient free service (which I wrote about here on Habr earlier) that lets you send an e-mail with a specific subject to a specific mailbox and have the message body delivered to me as an SMS. I wanted to give the monitoring service my_yaschik@sms.ru as the address for its notifications, but could not figure out how to change the subject of its letters, without which the SMS never arrives.

On top of that, the service's demo period ended today, and the check frequency dropped to once every 720 minutes.

In the end, having decided that paying for a "service" of this level (excuse me) is like paying Windows for air, I figured the easiest thing was to spend 3 hours of my valuable time and put together such a service myself: parsing the Avito page is trivial and, accordingly, took me exactly 1 line of code.

I run the script on a VPS. Regular web hosting will also do, as long as it has Perl, outbound network access, and a scheduler. In a pinch, any computer connected to the Internet will work. I think many people have something like that.

What the script is written in


I decided to write it in Perl: although I know Perl rather superficially, it is the best fit for scripts of this kind. Where I was too lazy to figure out the Perl way, I didn't strain myself and just called shell commands via system(). Still, in my opinion it turned out quite decent, and I'm not even embarrassed to show my creation to the public.

The logic, briefly


- Run the script every xxx minutes;
- Download the page with wget;
- Keep the page downloaded last time and compare it with the freshly downloaded one; if any ads have changed or new ones have appeared, send an SMS about it (a condensed sketch of this cycle is shown right below).
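In Perl the whole cycle boils down to a handful of calls. Here is a simplified sketch (the full script is at the end of the post; diff_pages is a made-up placeholder for the comparison logic that the real script spells out inline):

# Simplified sketch of one monitoring cycle
my %page_old = parse_page($filename);               # ads we saw last time
system("wget '$url' -O $filename");                 # fetch the current page
my %page_new = parse_page($filename);               # ads that are there now
my $smstext  = diff_pages(\%page_old, \%page_new);  # placeholder for building the change summary
sendsms($smstext) if $smstext ne "";                # notify only when something changed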

The info that is pulled from ads is:

1. URL of the ad (which I use as a unique identifier for the ad);
2. Name;
3. Price.

There is also a safeguard: if one of the page downloads suddenly fails, the old list is kept, the page simply gets downloaded again next time, and the SMS about any changes arrives then.

In more detail


Before using the script, check the paths and names of the mailer and wget and make sure they are present and working. In particular, on my CentOS the mailer is called mutt; mail or sendmail with the same syntax is more common. You may also need to replace wget with /usr/local/bin/wget, and so on.
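For example, a quick check like this (not part of the script itself) tells you whether both binaries are reachable:

# Sanity check: make sure the mailer and wget can actually be found in PATH
for my $tool ("mutt", "wget") {
    system("which $tool >/dev/null 2>&1") == 0
        or die "Required tool '$tool' not found in PATH\n";
}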

You should also set the mailbox and the phone number that will receive the notifications.

Run the script with the command: ./avito.pl url_of_page_with_ads.

Note that the page URL should be for the "list with photos" view. In other words, the URL must not contain &view=list or &view=gallery.

Example URL: www.avito.ru/moskva?q=%D1%80%D0%B5%D0%B7%D0%B8%D0%BD%D0%BE%D0%B2%D1%8B%D0%B9+%D1%81%D0%BB%D0%BE%D0%BD
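If you want, you can guard against that explicitly; the script itself does not perform this check:

# Optional guard: the parser expects the default "list with photos" page,
# so refuse URLs that force a different view.
if ($url =~ /[?&]view=(?:list|gallery)/) {
    die "Remove the view=list / view=gallery parameter from the URL\n";
}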

The page is downloaded into a file whose name is derived from the URL, with all invalid characters replaced by underscores, like this:

https___www.avito.ru_moskva_q__D1_80_D0_B5_D0_B7_D0_B8_D0_BD_D0_BE_D0_B2_D1_8B_D0_B9__D1_81_D0_BB_D0_BE_D0_BD

Such a name is unique, works on both Linux and Windows, and is still reasonably readable.
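In the script this is a single substitution over the URL (the same line you will see in the full code at the end):

my $filename = $url;
$filename =~ s#[^A-Za-z0-9\.]#_#g;   # everything except letters, digits and dots becomes "_"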

If such a file already exists, the script tries to pull the ads from it. If no ads are found in it, the script simply calls wget, overwriting the file. If ads are found, the file is first saved under a name with the suffix -1:

https___www.avito.ru_moskva_q__D1_80_D0_B5_D0_B7_D0_B8_D0_BD_D0_BE_D0_B2_D1_8B_D0_B9__D1_81_D0_BB_D0_BE_D0_BD-1

Next, the page is downloaded again and the following situations are checked in it:

1. If no ads are found in the freshly downloaded page, the script simply exits; the old page remains under the -1 suffix. This covers the case where the network suddenly went down or hung: the previous list of ads is not lost.
2. If the script is run for the first time (no previously downloaded page was found), a message simply arrives with the number of ads found:
Found 25 items, page www.avito.ru/moskva?q=%D1%80%D0%B5%D0%B7%D0%B8%D0%BD%D0%BE%D0%B2%D1%8B%D0%B9+%D1%81%D0%BB%D0%BE%D0%BD monitoring started

If this message arrives, the system is up and running; it is mainly a check that everything works.

Since the shorter an SMS is the better, all messages are kept very terse.

3. If a new ad has appeared, the info about it is appended to the text of the future SMS; info about all such ads then arrives as a single SMS (see the code excerpt right after this list).
4. If the price or the name of an item has changed, the info arrives in the form old_price -> new_price name link, or new_name link.

I don't know whether the name can actually change, but the extra check cost nothing.

5. The console prints, as separate text, a list of everything that was found. This is mostly for debugging, because the parser works today, but tomorrow, when they change the markup, it will stop and the parsing will have to be updated.
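For reference, this is how the SMS text is assembled in the script (an excerpt from the full code at the end of the post):

if(!defined($page_old{"price"}{$uri})){
    $smstext.="New: ".$page_new{"price"}{$uri}." ".$page_new{"name"}{$uri}." $site$uri\n";
}
elsif($page_new{"price"}{$uri} ne $page_old{"price"}{$uri}){
    $smstext.="Price ".$page_old{"price"}{$uri}." -> ".$page_new{"price"}{$uri}." ".$page_new{"name"}{$uri}." $site$uri\n";
}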

About parsing and nuances


Actually, all the parsing sits in this one line (the tag patterns are tied to Avito's markup at the time of writing, so they will need adjusting whenever the layout changes):

while($text=~/<a[^>]*href="(.*?)"[^>]*>\n(.*?)\n.*?<div class="about">\n\s*(\S*)/gs)

Admittedly, the price also contains a space (an nbsp in the markup), which I strip with another regexp:

$price=~s/ //g

So, strictly speaking, the parsing takes not one line but two.

g is the global-match modifier, which lets the match live inside the while condition and return the next ad on each iteration;
s lets the dot match newlines, so a single regexp can span several lines (on Avito the URL, name and price sit on 4 lines, at least for now, until they change the layout).
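A toy illustration of how /gs behaves inside a while condition (made-up input, not real Avito markup):

my $text = "item\nfirst\n100\nitem\nsecond\n200\n";
while ($text =~ /item\n(.*?)\n(\d+)/gs) {
    print "name=$1 price=$2\n";    # /g resumes after the previous match,
}                                  # /s lets the dot cross newlines when needed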

I also note that, so the whole file can be read in one go, the beginning of the script sets:

undef $/;

This is so that my $text=<MYFILE>; reads the entire file.
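In other words, with $/ undefined a single readline returns the whole file at once, which is exactly the pattern parse_page uses below (page.html here is just an example filename):

undef $/;                                          # disable the input record separator ("slurp" mode)
open(MYFILE, "<", "page.html") or die "cannot open page.html: $!";
my $text = <MYFILE>;                               # one read now returns the entire file
close(MYFILE);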

Another nuance: I make the URLs in every SMS clickable. I have a normal smartphone that lets me tap a URL inside an SMS and land on the right page, which is very convenient. But for some reason sms.ru mangles such an innocent character as the underscore, replacing it with %C2%A7. I can't influence that, but I can replace the underscore with its percent-encoded form, which gets through fine: the URL stays clickable for sms.ru and unchanged for regular mail: $text=~s/_/%5F/g;

Add the task to the scheduler


#crontab -e
*/20    *       *       *       *       cd /scripts/avito && ./avito.pl 'https://www.avito.ru/moskva?q=%D1%80%D0%B5%D0%B7%D0%B8%D0%BD%D0%BE%D0%B2%D1%8B%D0%B9+%D1%81%D0%BB%D0%BE%D0%BD'

This calls the script every 20 minutes to check the page. Don't forget to wrap the URL in single quotes.

You can add as many such tasks as you like; they all run independently of each other.
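For example, two independent searches could be scheduled like this (the second URL is just an illustrative placeholder):

*/20    *       *       *       *       cd /scripts/avito && ./avito.pl 'https://www.avito.ru/moskva?q=%D1%80%D0%B5%D0%B7%D0%B8%D0%BD%D0%BE%D0%B2%D1%8B%D0%B9+%D1%81%D0%BB%D0%BE%D0%BD'
*/30    *       *       *       *       cd /scripts/avito && ./avito.pl 'https://www.avito.ru/moskva?q=another+search'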

What I didn't do for a production version, and what would be easy to add


1. A web front end for adding/removing users and tasks, with each user's URLs, check frequency, mailbox and sms.ru phone number stored in a MySQL database. The script would then run every minute, check which URLs are due, and send the SMS not to my hard-coded number but to the one set by the user.

Then it would be possible to charge users 8 rubles a day or something like that. Maybe I should? Anyone willing to pay for such a thing?

2. A price filter: ignore prices above or below a set threshold. It's trivial, just one more if: next if($page_new{"price"}{$uri}>$max_price or $page_new{"price"}{$uri}<$min_price). I simply didn't need it.

3. By analogy with Avito, add auto.ru, irr and other sites.

That is also elementary: next to the existing while(...){...} just add a few more while loops, one per site. The main thing is that each of them fills $page{"name"}{$uri} and $page{"price"}{$uri} (a rough sketch is shown below).

On any given site only its own while will match; the others simply produce an empty result.
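As a rough sketch, a second parser loop could look like the following; the markup pattern here is completely made up (each real site needs its own regexp written against its actual HTML), the only requirement being that it fills the same hashes:

# Hypothetical pattern for some other site; the real regexp has to be
# written against that site's actual markup.
while($text=~/<a href="(.*?)" class="listing-title">(.*?)<\/a>.*?class="listing-price">\s*([\d ]+)/gs){
    my ($uri,$name,$price)=($1,$2,$3);
    $price=~s/ //g;                 # same cleanup as for Avito
    $page{"name"}{$uri}=$name;      # same structure as for Avito,
    $page{"price"}{$uri}=$price;    # so the rest of the script needs no changes
}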

And finally, the script code itself


#!/usr/bin/perl
use strict;
undef $/;
my $url=$ARGV[0];
my $mailer="mutt";
my $wget="wget";
if($url eq ""){
    print "Usage: avito.pl ";
    exit;
}
my $filename=$url;
$filename=~s#[^A-Za-z0-9\.]#_#g;
$url=~m#(^.*?://.*?)/#;
my $site=$1;
print "site:".$site."\n";
sub sendsms {
    my $text=shift;
    $text=~s/_/%5F/g;
    $text=~s/&/%26/g;
    system("echo '$text' | $mailer -s 79xxxxxxxxx xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\@sms.ru");
}
sub parse_page {
    open(MYFILE,"<".shift);
    my $text=<MYFILE>;
    close(MYFILE);
    my %page;
    # NB: the tag patterns below are tied to Avito's markup at the time of
    # writing and will need adjusting if the layout changes.
    while($text=~/<a[^>]*href="(.*?)"[^>]*>\n(.*?)\n.*?<div class="about">\n\s*(\S*)/gs) {
        my $uri=$1;
        my $name=$2;
        my $price=$3;
        $uri=~s/^\s+|\s+$//g;      # trim whitespace
        $name=~s/^\s+|\s+$//g;
        $price=~s/^\s+|\s+$//g;
        $price=~s/ //g;            # drop the space used as a thousands separator
        $page{"name"}{$uri}=$name;
        $page{"price"}{$uri}=$price;
    }
    return %page;
}
my %page_old=parse_page($filename);
if(scalar keys %{$page_old{"name"}}>0){
    system("cp $filename ${filename}-1");    # keep a backup of the last good page
}
else{
    %page_old=parse_page("${filename}-1");   # fall back to the previous backup
}
system("$wget '$url' -O $filename");
my %page_new=parse_page($filename);
if(scalar keys %{$page_old{"name"}}>0){
    # already have a previous successful search
    if(scalar keys %{$page_new{"name"}}>0){
        # both searches have been successful
        my $smstext="";
        foreach my $uri(keys %{$page_new{"name"}}) {
            if(!defined($page_old{"price"}{$uri})){
                $smstext.="New: ".$page_new{"price"}{$uri}." ".$page_new{"name"}{$uri}." $site$uri\n";
            }
            elsif($page_new{"price"}{$uri} ne $page_old{"price"}{$uri}){
                $smstext.="Price ".$page_old{"price"}{$uri}." -> ".$page_new{"price"}{$uri}." ".$page_new{"name"}{$uri}." $site$uri\n";
            }
            if(!defined($page_old{"name"}{$uri})){
                # already handled above as a new ad
            }
            elsif($page_new{"name"}{$uri} ne $page_old{"name"}{$uri}){
                $smstext.="Name changed from ".$page_old{"name"}{$uri}." to ".$page_new{"name"}{$uri}." for $site$uri\n";
            }
        }
        if($smstext ne ""){
            sendsms($smstext);
        }
    }
    else{
        # previous search succeeded but the current one failed:
        # do nothing, probably a temporary problem
    }
}
else{
    # this is a new search
    if(scalar keys %{$page_new{"name"}}<=0){
        # both this and the previous search failed
        sendsms("Error, nothing found for page '$url'");
    }
    else{
        # successful search and items found
        sendsms("Found ".(scalar keys %{$page_new{"name"}})." items, page '$url' monitoring started");
    }
}
# debug output to the console
foreach my $uri(keys %{$page_new{"name"}}) {
    print "uri: $uri, name: ".$page_new{"name"}{$uri}.", price: ".$page_new{"price"}{$uri}."\n";
    if($page_new{"price"}{$uri} eq $page_old{"price"}{$uri}){print "old price the same\n";}
    else{print "old price = ".$page_old{"price"}{$uri}."\n";}
    if($page_new{"name"}{$uri} eq $page_old{"name"}{$uri}){print "old name the same\n";}
    else{print "old name = ".$page_old{"name"}{$uri}."\n";}
}

