rndr July 11, 2012 at 12:50

SPARQL queries to the content of HTML pages


Hello.
After attending a conference, I had an idea, and this post presents my implementation of it.
It shows an example of working with the grab and rdflib libraries and provides a ready-made class for running SPARQL queries against the contents of web pages.

The tool is intended to turn information from sites that do not publish it in a structured form (RDF triples, XML, JSON) into a machine-readable form.

To run SPARQL queries against HTML content, you need to create a local RDF store and fill it with information extracted from the HTML page using grab.

First, we import the necessary libraries and register the SPARQL plugin.

# -*- coding: utf-8 -*-
import grab
import rdflib
from rdflib import *
from rdflib import plugin
plugin.register(
    'sparql', rdflib.query.Processor,
    'rdfextras.sparql.processor', 'Processor')
plugin.register(
    'sparql', rdflib.query.Result,
    'rdfextras.sparql.query', 'SPARQLQueryResult')

Next, we define our class and write its constructor.
The constructor optionally accepts the URL of a page to parse and load into the store, and a custom URI for the namespace.

class SQtHD():
    '''
    sparql query to html documents
    '''
    def __init__(self,url=None,htmlNamespace='http://localhost/rdf/html#'):
        '''
        Constructor
        '''
        self.__grab__=grab.Grab()  # Our parser
        self.__storage__=Graph()  # Our store
        self.__namespace__=Namespace(htmlNamespace)  # Create the namespace
        self.__storage__.bind('html', URIRef(htmlNamespace))  # Give the namespace a short prefix
        self.__initnamespace__=dict(self.__storage__.namespace_manager.namespaces())
        if url:  # If a url was given, load the contents of that page into the store.
            self.__store__(url)
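
The `html:` and `rdf:` prefixes used in the queries later in the post are shorthand for full namespace URIs, which is why the constructor binds `html` and saves the prefix map for `initNs`. The sketch below is a standard-library illustration of that expansion in Python 3, not rdflib's own code. The helper name `expand_prefixed_name` is hypothetical, and it assumes the `rdf` prefix is already bound in the graph, as rdflib graphs typically do by default.

```python
def expand_prefixed_name(name, prefixes):
    """Expand a prefixed name such as 'html:img' into a full URI string."""
    prefix, _, local_name = name.partition(":")
    return prefixes[prefix] + local_name


# Assumed prefix map: 'html' is the constructor's default htmlNamespace,
# 'rdf' is the standard RDF namespace.
prefixes = {
    "html": "http://localhost/rdf/html#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

print(expand_prefixed_name("html:img", prefixes))
# http://localhost/rdf/html#img
print(expand_prefixed_name("rdf:type", prefixes))
# http://www.w3.org/1999/02/22-rdf-syntax-ns#type
```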

Next, we define a helper method that fills the store with the contents of a page.

    def __store__(self,url):
        self.__storage__.remove((None,None,None))  # Clear the store
        self.__grab__.go(url)  # Fetch the given address using grab
        root=self.__grab__.tree.getroottree().getroot()
        self.__parse__(root)  # Parse the contents of the page.

The following information about each element is stored in the local store:

The type of the element (which html tag it is)
Its parent
Its position among its siblings
Its nesting level
The text contained in the element
The number of children
References to its child elements
The values of its attributes

The next helper method recursively walks the element tree, collecting the necessary information and adding it to the store.

    def __parse__(self,element,parent=None,children_position=None,children_level=0):
        current_element=BNode()
        children_elements=element.getchildren()
        if callable(element.tag):  # Comments (and processing instructions) have a factory function as their tag, not a string
            self.__storage__.add((current_element, RDF.type, 
                                  self.__namespace__['comment']))
        else:
            self.__storage__.add((current_element, RDF.type, 
                                  self.__namespace__[element.tag]))
        if parent is not None:
            self.__storage__.add((current_element,self.__namespace__['parent'],parent))   
            self.__storage__.add((parent,self.__namespace__['children'],
                                  current_element))    
            self.__storage__.add((current_element,self.__namespace__['children_position'],
                                  Literal(children_position)))     
        self.__storage__.add((current_element,self.__namespace__['children_level'],
                              Literal(children_level)))     
        if element.text and len(element.text.strip())>0:
            self.__storage__.add((current_element,self.__namespace__['text'],
                                  Literal(element.text.strip())))
        if element.text_content() and len(element.text_content().strip())>0:
            self.__storage__.add((current_element,self.__namespace__['text_content'],
                                  Literal(element.text_content().strip())))
        self.__storage__.add((current_element,self.__namespace__['children_count'],
                              Literal(len(children_elements))))
        for i in element.attrib:
            self.__storage__.add((current_element,self.__namespace__[i],
                                  Literal(element.attrib[i])))
        for i in range(len(children_elements)):
            self.__parse__(children_elements[i],current_element,i,children_level+1)
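
To show which triples this traversal produces, here is a minimal Python 3 sketch of the same recursion using only the standard library. It is an illustration under assumptions, not the class above: `xml.etree.ElementTree` stands in for the lxml tree that grab provides, plain tuples stand in for the rdflib graph, strings such as `_:b0` stand in for `BNode` objects, and the function name `element_to_triples` is hypothetical. It omits `text_content` and comment handling.

```python
import itertools
import xml.etree.ElementTree as ET


def element_to_triples(element, triples, ids, parent=None, position=None, level=0):
    """Recursively append (subject, predicate, object) tuples for one element."""
    node = "_:b%d" % next(ids)  # stand-in for an rdflib BNode
    triples.append((node, "rdf:type", "html:" + element.tag))
    if parent is not None:
        triples.append((node, "html:parent", parent))
        triples.append((parent, "html:children", node))
        triples.append((node, "html:children_position", position))
    triples.append((node, "html:children_level", level))
    text = (element.text or "").strip()
    if text:
        triples.append((node, "html:text", text))
    children = list(element)
    triples.append((node, "html:children_count", len(children)))
    for name, value in element.attrib.items():
        triples.append((node, "html:" + name, value))
    for index, child in enumerate(children):
        element_to_triples(child, triples, ids, node, index, level + 1)
    return node


triples = []
root = ET.fromstring('<div class="post"><img src="/a.png"/><a href="/x">link</a></div>')
element_to_triples(root, triples, itertools.count())
for triple in triples:
    print(triple)
```

Each element becomes one anonymous node, and every fact about it (type, parent, position, attributes) is a separate triple, which is what makes the page queryable by pattern.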

This method runs a SPARQL query against the local store.

    def executeQuery(self,query,url=None):
        '''
        execute query on storage
        '''
        if url:  # If a url was given, load the contents of that page into the store first.
            self.__store__(url)
        return self.__storage__.query(query,
                                  initNs=self.__initnamespace__)  # Return the result of the query.
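
Conceptually, a query with two patterns that share a variable, such as `?a rdf:type html:img. ?a html:src ?src.`, is a join on that variable. The Python 3 sketch below imitates that join over plain tuples. It is only an illustration of the idea, not how the SPARQL processor is implemented; the function name `select_by_type_and_attr` and the sample triples are made up for the example.

```python
def select_by_type_and_attr(triples, type_name, attr_name):
    """Return the distinct attr_name values of subjects whose rdf:type is type_name."""
    # First pattern: ?a rdf:type <type_name>
    subjects = {s for (s, p, o) in triples if p == "rdf:type" and o == type_name}
    # Second pattern: ?a <attr_name> ?value, joined on the shared variable ?a
    return sorted({o for (s, p, o) in triples if p == attr_name and s in subjects})


# Hypothetical sample data in the same shape the store holds.
sample_triples = [
    ("_:b0", "rdf:type", "html:div"),
    ("_:b1", "rdf:type", "html:img"),
    ("_:b1", "html:src", "/a.png"),
    ("_:b2", "rdf:type", "html:script"),
    ("_:b2", "html:src", "/all.js"),
    ("_:b3", "rdf:type", "html:img"),
    ("_:b3", "html:src", "/a.png"),
]

print(select_by_type_and_attr(sample_triples, "html:img", "html:src"))
# ['/a.png']
```

The set comprehension plays the role of `DISTINCT`: the two `img` elements share one source, so it is reported once, and the `script` source is excluded by the type pattern.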

This method fills the store with the contents of the specified page.

    def loadStoradge(self,url):
        '''
        load and parse html page to local rdf storage
        '''
        self.__store__(url)

And finally, some simple query examples.

if __name__ == "__main__":
    endPoint = SQtHD()  # Create an instance of the SQtHD class
    endPoint.loadStoradge('http://habrahabr.ru')  # Load the page into the store
    print "All sources for images given by tag <img>:"  # Print all unique image urls
    q=endPoint.executeQuery('SELECT DISTINCT ?src { ?a rdf:type html:img. ?a html:src ?src. }')
    for row in q.result:
        print row
    print
    print "All link urls:"  # Print all unique link urls
    q=endPoint.executeQuery('SELECT DISTINCT ?href { ?a rdf:type html:a. ?a html:href ?href. }')
    for row in q.result:
        print row
    print
    print "All class names for elements:"  # Print all unique class names
    q=endPoint.executeQuery('SELECT DISTINCT ?class { ?a html:class ?class. }')
    for row in q.result:
        print row
    print
    '''
    print "All scripts (without loaded by src):"  # Text of all inline scripts.
    q=endPoint.executeQuery('SELECT ?text { ?a rdf:type html:script. ?a html:text ?text. }')
    for row in q.result:
        print row
    print'''
    print "All script srcs:"  # All script urls.
    q=endPoint.executeQuery('SELECT ?src { ?a rdf:type html:script. ?a html:src ?src. }')
    for row in q.result:
        print row
    print

The result of running the query that lists all script URLs:

All script srcs:
/javascripts/1341931979/all.js
/javascripts/1341931979/_parts/posts.js
/javascripts/1341931979/_parts/to_top.js
/javascripts/1341931979/_parts/shortcuts.js
/javascripts/1341931979/libs/jquery.form.js
/javascripts/1341931979/facebook_reader.js
/js/1341931979/adriver.core.2.js
/javascripts/1341931979/libs/highlight.js
/javascripts/1341931979/hubs/all.js
/javascripts/1341931979/posts/all.js
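
The values come back exactly as written in the page's `src` attributes, so here they are relative paths. If absolute URLs are needed, they can be resolved against the address of the loaded page. This is a small Python 3 sketch using the standard library, offered as an optional post-processing step and not part of the class above; the variable names are illustrative.

```python
from urllib.parse import urljoin

base_url = "http://habrahabr.ru"  # the page that was loaded into the store
relative_srcs = [
    "/javascripts/1341931979/all.js",
    "/js/1341931979/adriver.core.2.js",
]

# Resolve each relative path against the page address.
absolute_srcs = [urljoin(base_url, src) for src in relative_srcs]
for src in absolute_srcs:
    print(src)
# http://habrahabr.ru/javascripts/1341931979/all.js
# http://habrahabr.ru/js/1341931979/adriver.core.2.js
```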

Thus, there are three ways to fill the store:

When the class is instantiated, by passing a url to the constructor
Via the loadStoradge method
On each query, by passing a url to executeQuery

The Gist for this project also contains an XML definition of the namespace. It defines what a tag is, declares the necessary properties and relationships, and defines the HTML 4 tags.
Recommended reading:
“Programming the Semantic Web” by Toby Segaran, Colin Evans, Jamie Taylor
