Examples of XPath queries for HTML

    XPath is a query language for selecting elements in an XML or XHTML document. Like SQL, XPath is declarative: to get the data you are interested in, you only need to write a query that describes it. The XPath interpreter does all the dirty work for you.
    Very convenient, isn't it? Let's see what XPath offers for accessing the nodes of a web page.

    Creating queries against a web page


    I offer you a small lab exercise in which I will demonstrate how to build XPath queries for a web page. You will be able to repeat my queries and, most importantly, try out your own. I hope this makes the article equally interesting for beginners and for programmers who already know XPath from XML.

    For the lab we need:
    - an XHTML web page;
    - the Mozilla Firefox browser with the add-ons:
        - Firebug;
        - FirePath;
      (you can use any other browser with visual XPath support)
    - a bit of time.

    As the web page for the experiment, I suggest the main page of the World Wide Web Consortium website (http://w3.org). It is this organization that develops the XQuery and XPath languages, the XHTML specification, and many other Internet standards.

    Task

    Using XPath queries, extract information about the consortium's conferences from the XHTML code of the w3.org main page.
    Let's start writing XPath queries.


    First XPath query

    Open the FirePath tab in Firebug, pick the element to analyze with the selector, and click it: FirePath generates an XPath query for the selected element.

    If you select the title of the first event, then the query will look like this:

    .//*[@id='w3c_home_upcoming_events']/ul/li[1]/div[2]/p[1]/a

    After the redundant indexes are removed, the query matches all elements of the "title" type:

    .//*[@id='w3c_home_upcoming_events']/ul/li/div/p/a
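
    The effect of removing the indexes can be checked outside the browser. Below is a minimal sketch in PHP. The `$html` fragment is a hypothetical, simplified copy of the events block (the real w3.org markup may differ), and the conference names are made up.

```php
<?php
// Hypothetical miniature of the events block; the real w3.org markup may differ.
$html = <<<'HTML'
<div id="w3c_home_upcoming_events">
  <ul>
    <li>
      <div>Date A</div>
      <div>
        <p><a href="/events/a">Conference A</a></p>
        <p>Paris, France</p>
        <p>Sponsor A</p>
      </div>
    </li>
    <li>
      <div>Date B</div>
      <div>
        <p><a href="/events/b">Conference B</a></p>
        <p>Tokyo, Japan</p>
        <p>Sponsor B</p>
      </div>
    </li>
  </ul>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// With indexes: only the title link of the first event.
$first = $xpath->query(".//*[@id='w3c_home_upcoming_events']/ul/li[1]/div[2]/p[1]/a");
// Without indexes: the title links of every event.
$all = $xpath->query(".//*[@id='w3c_home_upcoming_events']/ul/li/div/p/a");

echo $first->length, " match with indexes\n";
foreach ($all as $a) {
    echo $a->nodeValue, "\n";
}
```

    In this sample the indexed query matches one link and the query without indexes matches both.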

    FirePath highlights the elements that match the query, so you can see in real time which nodes of the document it selects.

    Let's move on. We create the queries for the conference venues and their sponsors either with the selector or by modifying the first query.

    Query for the conference venues:

    .//*[@id='w3c_home_upcoming_events']/ul/li/div/p[2]

    And this is how we get the list of sponsors:

    .//*[@id='w3c_home_upcoming_events']/ul/li/div/p[3]

    XPath syntax


    Let's go back to the queries we created and see how they work.
    Consider the first query in detail:

    .//*[@id='w3c_home_upcoming_events']/ul/li/div/p/a

    I split this query into three parts to demonstrate the capabilities of XPath. (The division into parts is arbitrary.)

    The first part
    .// - recursive descent through zero or more levels of the hierarchy, starting from the current context. In our case, the current context is the root of the document.

    The second part
    * matches any element,
    [@id='w3c_home_upcoming_events'] is a predicate that selects the node whose id attribute equals 'w3c_home_upcoming_events'. XHTML element identifiers must be unique, so the query "any element with this specific id" should return the single node we are looking for.

    In this query we can replace * with the exact element name, div:
    div[@id='w3c_home_upcoming_events']

    Thus, we descend the document tree to the div[@id='w3c_home_upcoming_events'] node we need. We do not care at all which nodes the DOM tree consists of or how many hierarchy levels lie above it.

    The third part
    /ul/li/div/p/a - the XPath path to a specific element. The path consists of location steps and node tests (ul, li, and so on). Steps are separated by "/" (slash).
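
    The three parts can also be evaluated separately. The sketch below assumes a hypothetical, simplified page fragment in `$html`. It checks that `*` and `div` find the same node, and then runs the third part as a relative path by passing that node as the context to `DOMXPath::query()`.

```php
<?php
// Hypothetical fragment: the events block nested inside other containers.
$html = <<<'HTML'
<div id="page"><div id="main">
  <div id="w3c_home_upcoming_events">
    <ul>
      <li><div><p><a href="/events/a">Conference A</a></p></div></li>
      <li><div><p><a href="/events/b">Conference B</a></p></div></li>
    </ul>
  </div>
</div></div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Parts one and two: descend from the root to the element with the given id.
$anyName = $xpath->query(".//*[@id='w3c_home_upcoming_events']");
$divName = $xpath->query(".//div[@id='w3c_home_upcoming_events']");
$sameNode = $anyName->item(0)->isSameNode($divName->item(0));

// Part three: location steps evaluated relative to the node found above.
$links = $xpath->query("ul/li/div/p/a", $anyName->item(0));

echo $anyName->length, ' ', var_export($sameNode, true), ' ', $links->length, "\n";
```

    The nesting depth above the events block does not matter to the query, which is the point of the recursive descent.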

    XPath collections

    It is not always possible to reach the node of interest with a predicate or with location steps alone. Very often several nodes of the same type sit at one level of the hierarchy, and you need to select “only the first” or “only the second” of them. Collections exist for such cases.

    XPath collections let you access an element by its index. The indexes follow the order in which the elements appear in the source document. Numbering in collections starts from one.

    Since the “venue” is always the second paragraph, right after the “conference name”, we get the following query:
    .//*[@id='w3c_home_upcoming_events']/ul/li/div/p[2]
    Here p[2] is the second p element within each node matched by /ul/li/div.

    Similarly, we can get the list of sponsors with the query:
    .//*[@id='w3c_home_upcoming_events']/ul/li/div/p[3]
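
    A short PHP sketch of the two indexed queries. The markup in `$html` and the helper `texts()` are hypothetical and exist only for this illustration. Note that the index is applied inside each parent div, so each query returns one paragraph per event.

```php
<?php
// Hypothetical sample: every event has a title, a venue and a sponsor paragraph.
$html = <<<'HTML'
<div id="w3c_home_upcoming_events"><ul>
  <li><div><p><a href="/a">Conference A</a></p><p>Paris, France</p><p>Sponsor A</p></div></li>
  <li><div><p><a href="/b">Conference B</a></p><p>Tokyo, Japan</p><p>Sponsor B</p></div></li>
</ul></div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Hypothetical helper: run a query and return the trimmed text of each match.
function texts(DOMXPath $xpath, $query)
{
    $out = [];
    foreach ($xpath->query($query) as $node) {
        $out[] = trim($node->nodeValue);
    }
    return $out;
}

$venues   = texts($xpath, ".//*[@id='w3c_home_upcoming_events']/ul/li/div/p[2]");
$sponsors = texts($xpath, ".//*[@id='w3c_home_upcoming_events']/ul/li/div/p[3]");

echo implode(' | ', $venues), "\n";   // Paris, France | Tokyo, Japan
echo implode(' | ', $sponsors), "\n"; // Sponsor A | Sponsor B
```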

    Some XPath functions

    XPath has many functions for working with the elements of a collection. I will show only a few of them.

    last():
    Returns the number of the last item in the collection, so [last()] selects the last item.
    The query ul/li/div/p[last()] returns the last paragraph of each div in the ul list.
    There is no first() function. To access the first item, use the index "1".
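
    A minimal sketch of `[last()]` next to `[1]`, assuming a hypothetical list in `$html` whose div elements hold a different number of paragraphs. The queries are prefixed with `.//` because `DOMDocument::loadHTML()` wraps the fragment in html and body elements.

```php
<?php
// Hypothetical sample: the first div has three paragraphs, the second has two.
$html = <<<'HTML'
<ul>
  <li><div><p>A1</p><p>A2</p><p>A3</p></div></li>
  <li><div><p>B1</p><p>B2</p></div></li>
</ul>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$last = [];
foreach ($xpath->query(".//ul/li/div/p[last()]") as $p) {
    $last[] = $p->nodeValue; // last paragraph of each div
}

$first = [];
foreach ($xpath->query(".//ul/li/div/p[1]") as $p) {
    $first[] = $p->nodeValue; // first paragraph of each div
}

echo implode(', ', $last), "\n";  // A3, B2
echo implode(', ', $first), "\n"; // A1, B1
```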

    text():
    Selects the text content of an element.
    .//a[text() = 'Archive'] - we get all the links with the text "Archive".
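
    A sketch of the `text()` comparison in PHP, assuming a hypothetical set of links in `$html`. The comparison is exact, so a link whose text merely contains the word "Archive" is not selected.

```php
<?php
// Hypothetical links; only two of them have exactly the text "Archive".
$html = <<<'HTML'
<div>
  <a href="/news">News</a>
  <a href="/news/archive">Archive</a>
  <a href="/blog/archive">Archive</a>
  <a href="/about">Archive of news</a>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$hrefs = [];
foreach ($xpath->query(".//a[text() = 'Archive']") as $a) {
    $hrefs[] = $a->getAttribute('href');
}

echo implode(', ', $hrefs), "\n"; // /news/archive, /blog/archive
```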

    position() and mod:
    position() - returns the position of the element in the set.
    mod - the remainder of a division.

    By combining the two we can get:
    - odd elements: ul/li[position() mod 2 = 1]
    - even elements: ul/li[position() mod 2 = 0]
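
    The odd and even selections can be tried on a small list. The five-item list in `$html` is a hypothetical sample, and the queries get a `.//` prefix so that they work from the document root.

```php
<?php
// Hypothetical five-item list.
$html = '<ul><li>one</li><li>two</li><li>three</li><li>four</li><li>five</li></ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$odd = [];
foreach ($xpath->query(".//ul/li[position() mod 2 = 1]") as $li) {
    $odd[] = $li->nodeValue; // items 1, 3, 5
}

$even = [];
foreach ($xpath->query(".//ul/li[position() mod 2 = 0]") as $li) {
    $even[] = $li->nodeValue; // items 2, 4
}

echo implode(', ', $odd), "\n";  // one, three, five
echo implode(', ', $even), "\n"; // two, four
```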

    Comparison operations
    • < - "less than"
    • > - "greater than"
    • <= - "less than or equal to"
    • >= - "greater than or equal to"

    ul/li[position() > 2] and ul/li[position() <= 2] - the list items starting from the 3rd and the first two items, respectively.
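
    The same kind of check for the comparison operators, again a sketch over a hypothetical five-item list. Inside a PHP string the `<` and `>` characters need no escaping.

```php
<?php
// Hypothetical five-item list.
$html = '<ul><li>one</li><li>two</li><li>three</li><li>four</li><li>five</li></ul>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$fromThird = [];
foreach ($xpath->query(".//ul/li[position() > 2]") as $li) {
    $fromThird[] = $li->nodeValue; // everything after the second item
}

$firstTwo = [];
foreach ($xpath->query(".//ul/li[position() <= 2]") as $li) {
    $firstTwo[] = $li->nodeValue; // the first two items
}

echo implode(', ', $fromThird), "\n"; // three, four, five
echo implode(', ', $firstTwo), "\n";  // one, two
```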

    Full feature list

    On your own


    Try to get:
    - the URLs of the even-numbered links in the left "Standards" menu;
    - the headlines of all news items except the first one on the w3c.org main page.

    XPath in PHP5


    	// $HTMLCode is assumed to hold the source of the page as a string
    	$dom = new DOMDocument();
    	$dom->loadHTML( $HTMLCode );
    	$xpath = new DOMXPath( $dom );
    	// Select the title links of all upcoming events
    	$_res = $xpath->query(".//*[@id='w3c_home_upcoming_events']/ul/li/div/p/a");
    	foreach( $_res as $obj ) {
    		echo 'URL: '.$obj->getAttribute('href');
    		echo $obj->nodeValue;
    	}
    


    Conclusion


    With a simple example, we saw what XPath can do to access the nodes of a web page.
    XPath is the industry standard for addressing elements of XML and XHTML documents and for XSLT transformations.
    You can use it to parse any HTML page. If the source HTML contains significant markup errors, pass it through tidy first and the errors will be fixed.

    Try giving up regular expressions in favor of XPath when parsing web pages.
    This will make your code simpler and easier to understand. You will make fewer mistakes and spend less time debugging.

    Resources


    FirePath add-on for Mozilla Firefox.
    Wikipedia: a brief description of the language.
    A good XPath reference. Do not mind that it is written for the .NET Framework: XPath works the same in all environments, apart from a couple of specific functions.
    XPath 1.0 specification.
    XPath 1.0 specification in Russian.
    XQuery 1.0 and XPath 2.0.
    Tidy.
    PHP5 tidy::repairFile.
