Do-it-yourself position monitoring

  • Tutorial

Let's build our own monitoring of query positions in a search engine, starting with the basics.


Usually we are interested in getting more customers.
And to increase something, you first have to measure it.
Historically, a fair share of online-store customers come from search engines.
(I'll write about working with contextual advertising and price aggregators in the following articles, if anyone is interested.)
And to evaluate how you are doing in the search engines, you usually need to collect statistics on where your queries rank in their search results.

Our tool will consist of 2 parts:
  • a script for parsing the search results, using Curl and lxml
  • a web interface for management and display, built with Django


Finding out our position for a query on yandex.ru.


I want to make clear right away that this article covers the basics: we will build the simplest possible version, which we will improve later.

To begin with, let's write a function that returns the HTML for a given URL.

We will load the page using pycurl.
import pycurl    
c = pycurl.Curl()

Set the url we will load
url = 'ya.ru'
c.setopt(pycurl.URL, url)

To return the body of the page, curl uses a callback function to which it passes the received data as strings.
We will use a StringIO string buffer: it has a write() method for input, and we can get the whole contents back with getvalue()
from StringIO import StringIO
c.bodyio = StringIO()
c.setopt(pycurl.WRITEFUNCTION, c.bodyio.write)
c.get_body = c.bodyio.getvalue

Just in case, let's make our curl look like a browser: set timeouts, a user agent, headers, and so on.
c.setopt(pycurl.FOLLOWLOCATION, 1)
c.setopt(pycurl.MAXREDIRS, 5)
c.setopt(pycurl.CONNECTTIMEOUT, 60)
c.setopt(pycurl.TIMEOUT, 120)
c.setopt(pycurl.NOSIGNAL, 1)
c.setopt(pycurl.USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:13.0) Gecko/20100101 Firefox/13.0')
httpheader = [
    'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language: ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3',
    'Accept-Charset: utf-8;q=0.7,*;q=0.5',
    'Connection: keep-alive',
    ]
c.setopt(pycurl.HTTPHEADER, httpheader)

Now load the page
c.perform()

That's it, the page has been fetched and we can print its HTML
print c.get_body()

We can also check the HTTP status code returned by the server
print c.getinfo(pycurl.HTTP_CODE)
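
If we also need the raw response headers, the same buffer trick works through HEADERFUNCTION (it has to be set before perform()); the final version of the function below does exactly this
c.headio = StringIO()
c.setopt(pycurl.HEADERFUNCTION, c.headio.write)
c.get_head = c.headio.getvalue
# after c.perform() the raw headers are available:
# print c.get_head()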

And if the server returned something other than 200, we can handle it. For now we simply raise an exception; we will deal with error handling in the following articles
if c.getinfo(pycurl.HTTP_CODE) != 200:
    raise Exception('HTTP code is %s' % c.getinfo(pycurl.HTTP_CODE))


Wrapping everything up into a single function, we end up with
import pycurl
try:
    from cStringIO import StringIO
except ImportError:
    from StringIO import StringIO
def get_page(url, *args, **kargs):
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.bodyio = StringIO()
    c.setopt(pycurl.WRITEFUNCTION, c.bodyio.write)
    c.get_body = c.bodyio.getvalue
    c.headio = StringIO()
    c.setopt(pycurl.HEADERFUNCTION, c.headio.write)
    c.get_head = c.headio.getvalue
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    c.setopt(pycurl.CONNECTTIMEOUT, 60)
    c.setopt(pycurl.TIMEOUT, 120)
    c.setopt(pycurl.NOSIGNAL, 1)
    c.setopt(pycurl.USERAGENT, 'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:13.0) Gecko/20100101 Firefox/13.0')
    httpheader = [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ru-ru,ru;q=0.8,en-us;q=0.5,en;q=0.3',
        'Accept-Charset: utf-8;q=0.7,*;q=0.5',
        'Connection: keep-alive',
        ]
    c.setopt(pycurl.HTTPHEADER, httpheader)
    c.perform()
    if c.getinfo(pycurl.HTTP_CODE) != 200:
        raise Exception('HTTP code is %s' % c.getinfo(pycurl.HTTP_CODE))
    return c.get_body()

Check the function
print get_page('ya.ru')
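
For now get_page() simply raises on anything other than 200; proper error handling is left for the following articles, but a minimal sketch of guarding a call could look like this
try:
    html = get_page('ya.ru')
except pycurl.error, e:
    # network-level problems: DNS failures, timeouts, refused connections, ...
    print 'curl error:', e
except Exception, e:
    # our own exception for non-200 responses
    print e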


Now let's extract a list of sites with their positions from the search results page

Let's construct the search query.
yandex.ru/yandsearch needs 3 GET parameters:
'text' - the query, 'lr' - the search region, 'p' - the results page
import urllib
import urlparse
key='кирпич'
region=213
page=1
params = ['http', 'yandex.ru', '/yandsearch', '', '', '']
params[4] = urllib.urlencode({
    'text':key,
    'lr':region,
    'p':page-1,
})
url = urlparse.urlunparse(params)

Print the url and check in the browser
print url 
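
Assuming the source file is UTF-8 encoded, the printed url should look roughly like http://yandex.ru/yandsearch?p=0&lr=213&text=%D0%BA%D0%B8%D1%80%D0%BF%D0%B8%D1%87 (the parameter order produced by urlencode may differ).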

Fetch the results page using the function we wrote earlier
html = get_page(url)

Now we will parse it as a DOM using lxml
import lxml.html
site_list = []
for h2 in lxml.html.fromstring(html).find_class('b-serp-item__title'):
    b = h2.find_class('b-serp-item__number')
    if len(b):
        num = b[0].text.strip()
        url = h2.find_class('b-serp-item__title-link')[0].attrib['href']
        site = urlparse.urlparse(url).hostname
        site_list.append((num, site, url))

Here is what happens in more detail:
  • lxml.html.fromstring(html) builds an HTML document object from the html string.
  • .find_class('b-serp-item__title') finds every element in the document with the class 'b-serp-item__title'; these H2 elements contain the position information we are interested in, and we loop over them.
  • b = h2.find_class('b-serp-item__number') looks inside each H2 for the element holding the site's position number; if it is there, we take the position with b[0].text.strip() and the site's url from the href attribute of the 'b-serp-item__title-link' element.
  • urlparse.urlparse(url).hostname extracts the domain name from that url.
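
If find_class is new to you, here is a tiny self-contained example on made-up markup (the class names and url are invented purely for illustration)
import lxml.html

snippet = '<div><h2 class="title"><a class="link" href="http://example.ru/">Example</a></h2></div>'
doc = lxml.html.fromstring(snippet)
for h2 in doc.find_class('title'):      # every element with class="title"
    a = h2.find_class('link')[0]        # nested lookup, same pattern as in the parser above
    print a.attrib['href'], a.text      # -> http://example.ru/ Example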

Check the resulting list
print site_list

And now let's collect everything into a function
def site_list(key, region=213, page=1):
    params = ['http', 'yandex.ru', '/yandsearch', '', '', '']
    params[4] = urllib.urlencode({
        'text':key, 
        'lr':region,
        'p':page-1,
    })
    url = urlparse.urlunparse(params)
    html = get_page(url)
    site_list = []
    for h2 in lxml.html.fromstring(html).find_class('b-serp-item__title'):
        b = h2.find_class('b-serp-item__number')
        if len(b):
            num = b[0].text.strip()
            url = h2.find_class('b-serp-item__title-link')[0].attrib['href']
            site = urlparse.urlparse(url).hostname
            site_list.append((num, site, url))
    return site_list

Check the function
print site_list('кирпич', 213, 2)
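
The function should return a list of (position, domain, url) tuples; the values below are made up just to show the shape, real results will differ
# [('11', 'example.ru', 'http://example.ru/kirpich/'),
#  ('12', 'example2.ru', 'http://example2.ru/'),
#  ...]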


Find our site in the list of sites

We need a helper function that strips a leading 'www.' from the domain
def cut_www(site):
    if site.startswith('www.'):
        site = site[4:]
    return site
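
A quick check of the helper
print cut_www('www.habrahabr.ru')  # habrahabr.ru
print cut_www('habrahabr.ru')      # habrahabr.ru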


Get the list of sites and compare each entry with our site
site = 'habrahabr.ru'
for pos, s, url in site_list('python', 213, 1):
    if cut_www(s) == site:
        print pos, url

Hmm, Habr is not on the first page of results for 'python', so let's walk deeper through the results in a loop.
We need a limit, max_position - up to which position we will check;
let's wrap it all in a function (we also need math to compute the number of pages), and if nothing is found we return None, None
import math

def site_position(site, key, region=213, max_position=10):
    site = cut_www(site)
    for page in range(1, int(math.ceil(max_position/10.0))+1):
        for pos, s, url in site_list(key, region, page):
            if cut_www(s) == site:
                return pos, url
    return None, None

Check
print site_position('habrahabr.ru', 'python', 213, 100)


So we actually got our position.

Please write whether this topic is interesting and whether I should continue.

What should I write about in the next article?
- make a web interface for this function, with tables and charts, and put the script on cron
- add captcha handling for this function, both manually and through a special api
- make a monitoring script for something, with multithreading
- describe how to work with Yandex.Direct through its api, for example generating and filling in ads, or managing prices
