Simple metasearch algorithm in Python

Lyrical digression


As part of my research work at the university, I ran into the task of classifying textual information. In essence, I needed to build an algorithm that, given a text document as input, returns an array whose elements measure (as a probability or degree of confidence) how strongly the text belongs to each of a set of given topics.
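To make the shape of that output concrete, here is a minimal hypothetical sketch (the topic dictionaries and the classify function are my own illustration, not the classifier discussed in this article); each topic is backed by a word list like the ones built below:

# Hypothetical sketch of the desired interface; the real classifier is out of scope here.
TOPIC_DICTS = {
    "religion": {"god", "faith", "church"},
    "sport": {"match", "team", "goal"},
}

def classify(text: str) -> dict:
    """Return, for each topic, the share of its dictionary words found in the text."""
    tokens = set(text.lower().split())
    return {topic: len(words & tokens) / len(words)
            for topic, words in TOPIC_DICTS.items()}

print(classify("In the beginning God created the heaven and the earth"))
# e.g. {'religion': 0.33, 'sport': 0.0}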

This article is not about solving the classification problem itself, but about an attempt to automate the most tedious stage of building a rubricator: creating the training sample.

When you are too lazy to do it by hand


My first and most obvious idea was to write a simple metasearch algorithm in Python. In other words, all the automation comes down to leaning on someone else's search engine (Google Search), since I have no databases of my own. I should note right away that there are ready-made libraries that solve a similar problem, for example pygoogle.

Get to the point


I used requests for HTTP requests and the BeautifulSoup parsing library to extract links from the search results. Here is what I ended up with:

from bs4 import BeautifulSoup
import requests

query = input('What are you searching for?: ')
url = 'http://www.google.com/search'
# Let requests handle URL-encoding of the query string
page = requests.get(url, params={'q': query})
soup = BeautifulSoup(page.text, 'html.parser')
# Each result title on the results page sits inside an <h3 class="r"> tag
h3 = soup.find_all('h3', class_='r')
for elem in h3:
    elem = elem.contents[0]
    link = 'https://www.google.com' + elem['href']
    print(link)

I pull out only the links that sit inside the <h3 class="r"> tags of the Google search results page (easy to see in Chrome's developer tools).

Fine, now let's try to collect links while walking through several result pages:

from bs4 import BeautifulSoup
import requests

query = input('What are you searching for?: ')
number = input('How many pages: ')
url = 'http://www.google.com/search'
page = requests.get(url, params={'q': query})
for index in range(int(number)):
    soup = BeautifulSoup(page.text, 'html.parser')
    # The "Next" link of the results page is stored in an <a class="fl"> tag
    next_page = soup.find('a', class_='fl')
    next_link = 'https://www.google.com' + next_page['href']
    h3 = soup.find_all('h3', class_='r')
    for elem in h3:
        elem = elem.contents[0]
        link = 'https://www.google.com' + elem['href']
        print(link)
    page = requests.get(next_link)

The address of the next results page is stored in an <a class="fl"> tag.

And finally, let's try to pull the text from one of the pages and build a word list (dictionary) for a future rubric out of it. We will collect the material from good old Wikipedia:

from bs4 import BeautifulSoup
import requests

words = ''
query = input('What are you searching for?: ')
url = 'http://www.google.com/search'
page = requests.get(url, params={'q': query})
soup = BeautifulSoup(page.text, 'html.parser')
h3 = soup.find_all('h3', class_='r')
for elem in h3:
    elem = elem.contents[0]
    href = elem['href']
    # Take the first result that points to Wikipedia
    if 'wikipedia' in href:
        link = 'https://www.google.com' + href
        break

page = requests.get(link)
soup = BeautifulSoup(page.text, 'html.parser')
# The article body on Wikipedia lives inside the element with id="mw-content-text"
text = soup.find(id='mw-content-text')
p = text.find('p')
while p is not None:
    words += p.get_text() + '\n'
    p = p.find_next('p')
words = words.split()

For the query “god” we get a decent dictionary of about 3,500 terms, which, admittedly, still needs some manual polishing: removing punctuation marks, links, stop words and other “garbage”.
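As a rough sketch of that cleanup (the stop-word list here is just an example of my own, not part of the original script), something like this could be used:

import re

# Hypothetical cleanup step: drop links, strip punctuation, remove stop words
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}

def clean(tokens):
    result = []
    for token in tokens:
        if token.startswith("http"):              # skip bare links
            continue
        token = re.sub(r"[^\w]", "", token.lower())  # strip punctuation
        if token and token not in STOP_WORDS:
            result.append(token)
    return result

words = clean(words)   # 'words' is the list built by the script above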

Conclusion


Summing up the work done, it should be noted that the dictionary did, of course, turn out “raw”, and adapting the parser to a specific resource takes time to study its structure. This leads to a simple conclusion: it is worth either building the training sample yourself or using ready-made databases.

On the other hand, with due care in writing the parser (cleaning the HTML markup of unnecessary tags is not hard) and with a large number of classes, even this degree of automation can give the rubricator the flexibility it needs.
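For example, a minimal sketch of such HTML cleanup with BeautifulSoup (the set of tags to drop is my own choice and would depend on the resource):

from bs4 import BeautifulSoup

def page_text(html):
    """Return the visible text of a page with service tags removed."""
    soup = BeautifulSoup(html, 'html.parser')
    # Drop tags that never carry useful text for the dictionary
    for tag in soup(['script', 'style', 'nav', 'footer']):
        tag.decompose()
    return soup.get_text(separator='\n')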

Links to the tools used


BeautifulSoup: www.crummy.com/software/BeautifulSoup
Requests: docs.python-requests.org/en/latest
