maeln0r April 14, 2013 at 18:25

PHP HTML DOM parser with jQuery-like selectors

Good afternoon, dear Habr readers. This post is about PHP Simple HTML DOM Parser, a joint project by SC Chen and John Schlick that is hosted on SourceForge.

The idea of the project is to provide a tool for working with HTML code through jQuery-like selectors. The original idea belongs to Jose Solorzano, whose implementation targeted PHP 4. This project is a more advanced version built for PHP 5 and later.

This overview gives brief excerpts from the official manual, followed by an example parser for Twitter. In fairness, a similar post already exists on Habrahabr, but in my opinion it contains too little information. If the topic interests you, read on.

Getting the HTML code of a page

$html = file_get_html('http://habrahabr.ru/'); // also works with https://

Habr user Fedcomp made a useful remark about file_get_contents and 404 responses: the original script returns nothing when the requested page answers with 404. To fix this, I added a check based on get_headers. The modified script can be downloaded here.
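A check of this kind could look roughly like the sketch below. It is not the modified script itself: the helper names http_status_code() and fetch_html_or_null() are made up for illustration, and the only assumption about the server is that get_headers() returns at least one "HTTP/..." status line.

```php
<?php
// Hypothetical helper: pull the numeric status code out of a status line
// such as "HTTP/1.1 404 Not Found". Returns 0 for any other header line.
function http_status_code(string $statusLine): int
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return 0;
}

// Hypothetical wrapper: download the page body only when the final
// response status is 2xx; otherwise return null instead of failing silently.
function fetch_html_or_null(string $url): ?string
{
    $headers = @get_headers($url);
    if ($headers === false) {
        return null; // host unreachable or malformed URL
    }

    // After redirects get_headers() lists several status lines;
    // the last one describes the page that is actually served.
    $status = 0;
    foreach ($headers as $line) {
        $code = is_string($line) ? http_status_code($line) : 0;
        if ($code !== 0) {
            $status = $code;
        }
    }
    if ($status < 200 || $status >= 300) {
        return null; // 404 and other non-success responses end up here
    }

    $body = @file_get_contents($url);
    return $body === false ? null : $body;
}
```

A non-null result can then be handed to str_get_html(), while null tells the caller that the page is missing and there is nothing to parse.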

Finding elements by tag name

foreach ($html->find('img') as $element) { // select every img tag on the page
    echo $element->src . '<br>'; // print each src attribute on its own line
}

Modifying HTML elements

$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>'); // read HTML from a string (file_get_html() reads from a file or URL)
$html->find('div', 1)->class = 'bar'; // set class "bar" on the div at index 1 (the second div)
$html->find('div[id=hello]', 0)->innertext = 'foo'; // put the text foo into the div with id="hello"
echo $html; // prints <div id="hello">foo</div><div id="world" class="bar">World</div>

Getting the text content of an element (plaintext)

echo file_get_html('http://habrahabr.ru/')->plaintext;

The purpose of this article is not to provide comprehensive documentation for the script. A detailed description of all its features can be found in the official manual. If the community is interested, I will gladly translate the entire manual into Russian. For now, here is the Twitter parser example promised at the beginning of the article.

Example: a parser for Twitter messages

require_once 'simple_html_dom.php'; // parsing library
$username = 'habrahabr'; // Twitter account name
$maxpost = 5; // number of posts to collect
$articles = array(); // collected messages
$html = file_get_html('https://twitter.com/' . $username);
$i = 0;
foreach ($html->find('li.expanding-stream-item') as $article) { // select the li of every message
    $item = array();
    $item['text'] = $article->find('p.js-tweet-text', 0)->innertext; // message text as HTML
    $item['time'] = $article->find('small.time', 0)->innertext; // timestamp as HTML
    $articles[] = $item; // store the message in the array
    $i++;
    if ($i == $maxpost) break; // stop once enough posts are collected
}

Printing the messages

for ($j = 0; $j < count($articles); $j++) { // loop only over the messages actually collected
    echo '';
    echo '' . $articles[$j]['text'] . '';
    echo '' . $articles[$j]['time'] . '';
    echo '';
}

Thank you for your attention. I hope the post was not too heavy and was easy to follow.

Similar libraries

htmlSQL - thanks to Chesnovich
Zend_Dom_Query - thanks to majesty
phpQuery - thanks to theRavel
QueryPath - thanks to ZonD80
DomCrawler (Symfony) - thanks to choor
CDOM - thanks to its author, samally
the well-known XPath - thanks to kandy for the reminder
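
For comparison, the img selection shown earlier can be approximated with plain XPath and PHP's bundled DOM extension, with no third-party library. This is only a sketch: image_sources() and the $sample markup are invented for the illustration, and it assumes the dom extension is enabled.

```php
<?php
// Hypothetical helper: return the src attribute of every img tag in $html,
// using the built-in DOMDocument and DOMXPath classes.
function image_sources(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally so sloppy real-world markup
    // does not flood the output, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $sources = array();
    // "//img[@src]" matches img elements anywhere in the document
    // that carry a src attribute.
    foreach ($xpath->query('//img[@src]') as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

// Made-up sample markup: two images with src and one without.
$sample = '<div><img src="a.png"><p>text</p><img src="b.png"><img alt="no source"></div>';
print_r(image_sources($sample));
```

The selector syntax is more verbose than find('img'), but nothing has to be downloaded or included.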

PS
Habr user Groove pointed out that similar material has already been posted.
PPS
In my spare time I will try to collect all these libraries and compile summary data on their performance and usability.
