
PHP HTML DOM parser with jQuery-like selectors
Good afternoon, dear Habr readers. This post covers PHP Simple HTML DOM Parser, a joint project by SC Chen and John Schlick (hosted on SourceForge).
The idea of the project is to provide a tool for working with HTML code through jQuery-like selectors. The original idea belongs to Jose Solorzano, whose implementation targeted PHP 4. This project is a more advanced version built for PHP 5+.
This review presents brief excerpts from the official manual, as well as an example implementation of a parser for Twitter. In fairness, a similar post already exists on habrahabr, but in my opinion it contains too little information. If the topic interests you, read on.
Getting the HTML code of a page
$html = file_get_html('http://habrahabr.ru/'); // also works with https://
User Fedcomp made a useful comment about file_get_contents and 404 responses: the original script returns nothing when the requested page responds with 404. To fix this, I added a check based on get_headers. The modified script can be taken here.
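The get_headers check mentioned above could look roughly like the sketch below. This is not the author's modified script: the helper names final_status_code and fetch_html_or_null are hypothetical, and the choice to accept only a 200 response is an assumption. Only get_headers (built into PHP) and file_get_html (from simple_html_dom.php) are real functions.

```php
<?php
// Hypothetical helper: extract the final HTTP status code from the list
// returned by get_headers(). When redirects are followed, several status
// lines are present; the last one describes the final response.
function final_status_code(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical wrapper: call file_get_html() only when the page answers 200.
// Assumes simple_html_dom.php has already been loaded by the caller.
function fetch_html_or_null(string $url)
{
    $headers = @get_headers($url);
    if ($headers === false || final_status_code($headers) !== 200) {
        return null; // unreachable host, 404, or another non-200 response
    }
    return file_get_html($url);
}
```

With this in place, a caller can test the result against null before calling find(), instead of failing on an empty result. Note that the check costs one extra HTTP request per page.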
Finding elements by tag name
foreach ($html->find('img') as $element) { // select every img tag on the page
    echo $element->src . '<br>'; // print the src attribute of each match, one per line
}
Modifying HTML elements
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>'); // read HTML from a string (file_get_html() reads from a file or URL)
$html->find('div', 1)->class = 'bar'; // give the div at index 1 (the second div) the class "bar"
$html->find('div[id=hello]', 0)->innertext = 'foo'; // set the content of the div with id="hello" to foo
echo $html; // prints <div id="hello">foo</div><div id="world" class="bar">World</div>
Getting the text content of an element (plaintext)
echo file_get_html('http://habrahabr.ru/')->plaintext;
This article does not aim to be comprehensive documentation for the script; a detailed description of all features is in the official manual. If the community is interested, I will gladly translate the entire manual into Russian. For now, here is the Twitter parser example promised at the beginning of the article.
Example: parsing messages from Twitter
require_once 'simple_html_dom.php'; // parsing library
$username = 'habrahabr'; // Twitter user name
$maxpost = 5; // number of posts to collect
$html = file_get_html('https://twitter.com/' . $username);
$articles = []; // collected messages
$i = 0;
foreach ($html->find('li.expanding-stream-item') as $article) { // select the li of every message
    $item['text'] = $article->find('p.js-tweet-text', 0)->innertext; // message text as HTML
    $item['time'] = $article->find('small.time', 0)->innertext; // timestamp as HTML
    $articles[] = $item; // append to the result array
    $i++;
    if ($i == $maxpost) break; // stop once enough posts are collected
}
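Printing the messages
for ($j = 0; $j < $maxpost && isset($articles[$j]); $j++) { // stop early if fewer posts were found
    echo '<div>';
    echo '<p>' . $articles[$j]['text'] . '</p>'; // message text
    echo '<p>' . $articles[$j]['time'] . '</p>'; // timestamp
    echo '</div>';
}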
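The parsing loop above assumes that every list item contains both p.js-tweet-text and small.time. When find() is called with an index and nothing matches, the library returns null, so reading innertext from the result fails. A more defensive variant might look like the sketch below. The helper names first_inner_text and collect_tweets are hypothetical, and skipping items without a text element is an assumption about the desired behaviour. The helpers only rely on objects exposing find() and innertext.

```php
<?php
// Hypothetical helper: innertext of the first element matching $selector,
// or null when the selector matches nothing.
function first_inner_text($node, string $selector): ?string
{
    $match = $node->find($selector, 0);
    return $match === null ? null : $match->innertext;
}

// Hypothetical helper: collect at most $limit messages from the given items,
// skipping items that have no text element.
function collect_tweets(iterable $items, int $limit): array
{
    $tweets = [];
    foreach ($items as $item) {
        if (count($tweets) >= $limit) {
            break; // enough messages collected
        }
        $text = first_inner_text($item, 'p.js-tweet-text');
        if ($text === null) {
            continue; // not a regular message item
        }
        $tweets[] = [
            'text' => $text,
            'time' => first_inner_text($item, 'small.time'), // may be null
        ];
    }
    return $tweets;
}
```

Usage would be along the lines of collect_tweets($html->find('li.expanding-stream-item'), 5), after which the result can be printed with a plain foreach.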
Thank you for your attention. I hope the article turned out easy to follow rather than heavy going.
Similar libraries
htmlSQL - thanks Chesnovich
Zend_Dom_Query - thanks majesty
phpQuery - thanks theRavel
QueryPath - thanks ZonD80
DomCrawler (Symfony) - thanks choor
CDOM - thanks to its author samally
The well-known XPath - thanks to kandy for the reminder
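For comparison, the img example from earlier can be written without any third-party library, using the XPath support built into PHP's DOM extension. This is only a sketch: the function name image_sources and the sample markup are made up for illustration, and it assumes the dom extension is enabled.

```php
<?php
// Hypothetical helper: return the src attribute of every img element
// in the given markup, using DOMDocument and DOMXPath.
function image_sources(string $markup): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $sources = [];
    foreach ($xpath->query('//img[@src]') as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

$sample = '<div><img src="a.png"><p>text</p><img src="b.png"></div>';
echo implode("\n", image_sources($sample)), "\n";
```

The XPath expression //img[@src] plays the role of the 'img' selector; the jQuery-like syntax is shorter, which is the main convenience the libraries in this list offer.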
PS
Habr user Groove pointed out that such materials have already been
PPS
In my spare time I will try to collect all of these libraries and compile summary data on their performance and usability.