The story of parsing an aspx site

Background


There is an online system for handling customer requests that my boyfriend has to work with. The system is probably functional, convenient for administrators and efficient for management, but how inconvenient it is in daily use!
  1. It does not remember the login, password or city, so after logging in you have to wait for all the requests for the default city to load, and only then switch to your own.
  2. Not all the necessary information is available from the general list of requests. For some of it you have to open the request itself, and each one opens in a new window (via javascript, without even a normal href attribute, can you imagine?).
  3. This beauty is built on asp, so every navigation drags its viewstate back and forth over the network.
  4. And the site's minimum width of a bit over one and a half thousand pixels is no joy either.

The nature of the job sometimes means logging into the system from a mobile phone, over the mobile Internet.
If I had to work with it myself, nothing would have come of this: I would have got used to it and adapted, and besides, the boss is the boss... But I felt sorry for my loved one, and so the idea of writing a parser for the requests was born.

History


I’m actually an HTML markup coder. A web developer too, but my skills in that direction are modest: mostly I build decent websites on wordpress. I had never dealt with serious curl requests before, nor with aspx sites.
But it’s interesting!
(It turned into a month of evenings with php, a few sleepless nights, and a lot of fun, of course.)
At first there were attempts at cross-domain requests with javascript, but nothing came of it.
Then came some timid digging into phantomjs and other ways of emulating user behavior, but it turned out that my js skills were still lacking.
In the end, everything runs on curl requests sent from a php page.

Receiving the information

Authorization came together fairly quickly and worked more or less without problems.
The nastiest part was the limit on incorrect password attempts: two wrong tries and you're calling the administrator to restore access...

But switching to the desired city stubbornly refused to work. The switch happened, but ended up somewhere wrong, even though the POST request was built by all the rules.
It turned out that preg_match does not cope correctly with very large numbers of characters.
This directive saves the day:

ini_set("pcre.backtrack_limit", 10000000);
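
If you run into the same symptom, preg_last_error() lets you confirm that the backtracking limit, and not the pattern, is to blame. A small sketch, the variable names are just for illustration:

	// sketch: preg_match_all() returns false on a PCRE error, preg_last_error() says why
	if (preg_match_all("/id=\"__VIEWSTATE\" value=\"(.*?)\"/", $content, $m) === false
		&& preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR) {
		// the huge viewstate made PCRE run out of backtracking steps
		echo 'raise pcre.backtrack_limit and try again';
	}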

First, we fetch the initial state of the page (since we are not logged in yet, we land on the login page) and rip the viewstate out of it:

	$url = 'http://***/Default.aspx';
	$content = curlFunction($url);
	preg_match_all("/id=\"__VIEWSTATE\" value=\"(.*?)\"/", $content, $arr_viewstate);
	$viewstate = urlencode($arr_viewstate[1][0]);
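
The curlFunction used here is my own little wrapper, and its code is not shown above. Roughly it does something like the sketch below; the cookie file path and the exact set of options are assumptions, the important parts are that cookies survive between calls and that redirects are followed:

	// a sketch of the curlFunction wrapper: GET when $postdata is omitted, POST otherwise
	function curlFunction($url, $postdata = null) {
		$ch = curl_init($url);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);	// return the page as a string
		curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);	// follow redirects
		curl_setopt($ch, CURLOPT_COOKIEFILE, 'cookies.txt');	// assumption: keep the asp.net session between calls
		curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
		if ($postdata !== null) {
			curl_setopt($ch, CURLOPT_POST, 1);
			curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
		}
		$content = curl_exec($ch);
		curl_close($ch);
		return $content;
	}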

Now, with an up-to-date snapshot of the page state in hand, we submit the username and password.
(postdata is the body of the POST request to the page; you can peek at it in the same firebug.)

	$url = 'http://***/Default.aspx?ReturnUrl=%2fHome%2fRoutes.aspx';
	$postdataArr = array(
		'__LASTFOCUS=',
		'__EVENTTARGET=',
		'__EVENTARGUMENT=',
		'__VIEWSTATE='.$viewstate,
		'ctl00$cphMainContent$loginBox$loginBox$UserName='.$login,
		'ctl00$cphMainContent$loginBox$loginBox$Password='.$password,
		'ctl00$cphMainContent$loginBox$loginBox$LoginButton=Войти',	// «Войти» ("Log in") is the button's literal value
		);
	$postdata = implode('&',$postdataArr);
	$content = curlFunction($url, $postdata);
	preg_match_all("/id=\"__VIEWSTATE\" value=\"(.*?)\"/iu", $content, $arr_viewstate);
	$viewstate = urlencode($arr_viewstate[1][0]);
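
Since two wrong attempts mean a call to the administrator, it is worth failing fast if the response still looks like the login page. A sketch; the marker string is an assumption about this particular site:

	// sketch: if the username field is still present, the login most likely did not go through
	if (strpos($content, 'loginBox$UserName') !== false) {
		die('Login failed, not retrying so as not to burn the second attempt');
	}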

Since the initial URL responds with a redirect, and curl has the option

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);	// follow redirects

we end up with the viewstate of the page we actually need.

It was at this point that the problem with preg_replace falling over showed up, but a solution - thanks to Habr - was found.
Done! Now we can switch to the requests for the desired city and get parsing.

	$url = 'http://***/Home/Routes.aspx';
	$postdataArr = array(
			'__EVENTTARGET=ctl00$cphMainContent$ddlCityID',
			'__EVENTARGUMENT=',
			'__LASTFOCUS=',
			'__VIEWSTATE='.$viewstate,
			'ctl00$cphMainContent$ddlCityID='.$city,
			'ctl00$cphMainContent$tbConnectionDate='.$date,
			);
	$postdata = implode('&',$postdataArr);
	$content = curlFunction($url, $postdata);

When you finally understand what you are doing, everything turns out to be quite simple: you just request the page whose viewstate you obtained in the previous step.

Data processing

Got the page; time to parse.
The first attempt was with regular expressions. Unfortunately, php on the hosting behaved very strangely with multi-line expressions and refused to pull out a whole select (with all of its options), no matter how I tried to persuade it (locally everything worked fine).

The next step was the Simple Html Dom library. Everything was fine: we get in, follow the links and parse the information... except that fetching one page takes 0.9 seconds, and pulling the data out of five inputs on it takes another 5 seconds. When you have to walk through nine such links, things get very sad.

We google, we think, we read. We find Nokogiri. Light and worthwhile, a really fast and pleasant thing to work with:

	$html = new nokogiri($content);
	// grab the value of the input
	$RepairNumber = $html->get('#ctl00_cphMainContent_tbRepairNumber')->toArray();
	$result['RepairNumber'] = $RepairNumber[0]['value'];
	// grab the selected option of the select
	$ConnectionTimeArr = $html->get('#ctl00_cphMainContent_ddlConnectionTime')->toArray();
	foreach($ConnectionTimeArr as $e) {
		foreach($e['option'] as $el) {
			if(isset($el['selected'])) {
				$result['ConnectionTime'] = $el['#text'][0];
			}
		}
	}


Beauty and Design

Then a rather strange problem surfaced: the customer, with obvious displeasure, was using the developer version without css, js and other bells and whistles. More precisely, he could not figure out how to use it at all.

So we read up on XHR requests.

//grab the data needed for the POST request
	var login = $('#login').val();
	var password = $('#password').val();
	var date = $('#datePicker').val();
//build the request body
	var params = 	'login=' + encodeURIComponent(login) + 
					'&password=' + encodeURIComponent(password) + 
					'&date=' + encodeURIComponent(date) +
					'&firstlogin=true';
//open the connection and hand the data to the server-side script, which logs in, follows the links and collects the needed information
	var req = getXmlHttp();	// getXmlHttp() is a cross-browser XMLHttpRequest helper
	req.open('POST', 'script.php', true);
	req.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
	req.send(params);
//dim the screen - we do need our little goodies!
	$('.dark').fadeIn();
	req.onreadystatechange = function() {
		if (req.readyState == 4) {
			if(req.status == 200) {
//got the data: display it and turn the dimming off
				$('.dark').fadeOut();
				$('#worker').html(req.responseText);
			}
		}
	}
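
On the server side, script.php is the same php page described in the previous sections: it reads the POST fields and walks through the same steps. A rough sketch with an assumed structure; authorize(), switchCity() and parseRequests() stand in for the code shown above:

	// a sketch of script.php: the glue between the XHR request and the curl/nokogiri code above
	$login    = isset($_POST['login'])    ? $_POST['login']    : '';
	$password = isset($_POST['password']) ? $_POST['password'] : '';
	$date     = isset($_POST['date'])     ? $_POST['date']     : '';

	$viewstate = authorize($login, $password);	// log in and grab the fresh viewstate
	$content   = switchCity($viewstate, $date);	// the city-switch POST (the city is hard-coded here)
	$result    = parseRequests($content);	// the nokogiri parsing of inputs and selects

	// whatever is echoed here ends up inside #worker on the client
	echo '<p>'.$result['RepairNumber'].' / '.$result['ConnectionTime'].'</p>';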


Profit! The user is happy, the user's mobile phone no longer has to haul tons of viewstate over the mobile Internet, and managing the design of a hand-written page is somehow simpler.

PS I was just asked whether this client could also be used to change data in the request-handling system. That sounded like a threat...
