Web Scraping and .Net
Recently, I have been interested in web scraping (aka web mining) and as a result I decided to write an article for those who have already heard that it exists, but have not tried it yet.
As I understand it, web scraping is the transfer of data published on the Internet as HTML pages into some kind of storage: a plain text file, an XML file or a database (DB). In other words, it is the reverse of the usual process, since a web application normally takes data from a database and renders it as pages.
From theory to practice
As an example, let's take a simple case: parsing a page of the site auto.ru. Following the link http://vin.auto.ru/resolve.html?vin=TMBBD41Z57B150932, we see the information decoded from the identification number TMBBD41Z57B150932 (make, model, modification, etc.). Suppose we need to display this information in a window of, say, a Windows desktop application. Working with databases in .NET is widely documented, so we will not dwell on storage and will concentrate on the essence.
So, let's create a WinForms application project and drop three controls onto the form: a TextBox named tbLink, which will hold our address (the link); a button named btnStart, which sends a request to that address when clicked; and a ListBox named lbConsole, where the received data will be displayed. In a real application the links would be taken from some external source, but remember that this is just an example.
That is all for the interface. Now let's create the method that is called in response to the button click.
In this method, we need to do the following things:
1. Send a request to the address entered in our TextBox
2. Get the page
3. Select the necessary data from the page
4. Display the data on the form
Making the request
First, declare a variable that will hold the page received in response to the request:
- string AutoResult = String.Empty;
Next, create a request by passing the link we know as a parameter:
- var autoRequest = (HttpWebRequest)WebRequest.Create(tbLink.Text);
Next we set some properties of the request that help us pass for a browser. This does not matter in our case, but some sites analyze the request headers, so keep it in mind for the future.
- autoRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)";
- autoRequest.Headers.Add("Accept-Language", "ru-RU");
- autoRequest.Accept = "image/gif, image/jpeg, image/pjpeg, image/pjpeg, application/x-shockwave-flash, application/x-ms-application, application/x-ms-xbap, application/vnd.ms-xpsdocument, application/xaml+xml, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*";
We also indicate that the GET method will be used.
- autoRequest.Method = "GET";
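Put together, the request setup so far might look like the sketch below. The `RequestFactory` class and the `CreateBrowserLikeRequest` method are names made up for this illustration, and the shortened Accept value is an assumption; nothing is sent over the network until `GetResponse()` is called. Keep in mind that recent .NET versions mark `WebRequest.Create` as obsolete in favor of `HttpClient`, so treat this as a sketch of the approach used in this article rather than a recommendation.

```csharp
using System.Net;

static class RequestFactory
{
    // Hypothetical helper: builds (but does not send) a GET request
    // that carries browser-like headers.
    internal static HttpWebRequest CreateBrowserLikeRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.UserAgent = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)";
        // Accept is a restricted header, so it is set through the property.
        request.Accept = "text/html, */*"; // shortened for the example (assumption)
        request.Headers.Add("Accept-Language", "ru-RU");
        return request;
    }
}
```

In the button handler this would be called as `RequestFactory.CreateBrowserLikeRequest(tbLink.Text)`.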
Now we execute the request and move on to the next step.
Getting the page
- HttpWebResponse autoResponse = (HttpWebResponse)autoRequest.GetResponse();
The server's response, and therefore the page itself, is now held in our autoResponse variable. We need to check this response, and if everything is OK, we can read the page into a string:
- if (autoResponse.StatusCode == HttpStatusCode.OK)
- {
-     // The page is served in windows-1251, so decode it explicitly.
-     using (Stream autoStream = autoResponse.GetResponseStream())
-     using (var reader = new StreamReader(autoStream, Encoding.GetEncoding("windows-1251")))
-     {
-         AutoResult = reader.ReadToEnd();
-     }
- }
If everything really is OK, the AutoResult variable now holds the same text that the browser shows through its "Page Source" menu, though perhaps not as nicely formatted.
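One practical detail about the decoding step: on .NET Core and .NET 5 or later, `Encoding.GetEncoding("windows-1251")` throws unless the code pages provider has been registered first. Below is a minimal sketch, assuming such a runtime; `PageDecoder` and `ReadAll` are hypothetical names introduced only for this example.

```csharp
using System.IO;
using System.Text;

static class PageDecoder
{
    // Hypothetical helper: reads the whole stream as text in the named encoding.
    internal static string ReadAll(Stream stream, string encodingName)
    {
        // Makes legacy code pages such as windows-1251 available on
        // .NET Core / .NET 5+; registering the same provider again is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        using (var reader = new StreamReader(stream, Encoding.GetEncoding(encodingName)))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With this helper, the body of the status check could be reduced to `AutoResult = PageDecoder.ReadAll(autoResponse.GetResponseStream(), "windows-1251");`.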
This is all great, of course, but we would like to pick exactly what we need out of this mess of tags. Here regular expressions come to our aid, and we will use them through extension methods. As a reminder, an extension method is a static method of a static class whose first parameter is marked with the this keyword; it can be called as if it were an instance method of that parameter's type. An example makes this clearer. Suppose we have a StringWithEq method in the StringOperations class:
- static class StringOperations
- {
-     internal static string StringWithEq(this string s) { return string.Format("{0} = ", s); }
- }
Then we can call this method both in the usual way (1) and as an extension method (2):
- string test = "Test";
- Console.Write(StringOperations.StringWithEq(test)); // (1)
- Console.Write(test.StringWithEq()); // (2)
If you look at the HTML source of the page in a browser, you will notice that the data we need is contained inside a <dl> tag, which is not used anywhere else on the page:
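- <dl>
- <dt>Идентификационный номер</dt>
- <dd>TMBBD41Z57B150932</dd>
- <dt>Марка</dt>
- <dd>SKODA</dd>
- <dt>Модель</dt>
- <dd>Octavia II (A5)</dd>
- <dt>Модификация</dt>
- <dd>Elegance</dd>
- <dt>Модельный год</dt>
- <dd>2007</dd>
- <dt>Тип кузова</dt>
- <dd>седан</dd>
- <dt>Количество дверей</dt>
- <dd>5-дверный</dd>
- <dt>Объем двигателя, куб.см.</dt>
- <dd>2000</dd>
- <dt>Описание двигателя</dt>
- <dd>150лс</dd>
- <dt>Серия двигателя</dt>
- <dd>BLR, BLX, BLY</dd>
- <dt>Система пассивной безопасности</dt>
- <dd>подушки безопасности водителя и переднего пассажира</dd>
- <dt>Сборочный завод</dt>
- <dd>Solomonovo</dd>
- <dt>Страна сборки</dt>
- <dd>Украина</dd>
- <dt>Страна происхождения</dt>
- <dd>Чехия</dd>
- <dt>Производитель</dt>
- <dd>Skoda Auto a.s.</dd>
- <dt>Серийный номер</dt>
- <dd>50932</dd>
- <dt>Контрольный символ</dt>
- <dd>NOT OK!</dd>
- </dl>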
We will take advantage of this: first we extract the contents of this tag, then we parse them and put the result into, say, a Dictionary object. After that, we display the data in the ListBox lbConsole. I would like the final code to look like this:
- string BetweenDL = AutoResult.BetweenDL();
- Dictionary<string, string> d = BetweenDL.BetweenDTDD();
- foreach (var s in d)
- {
-     lbConsole.Items.Add(string.Format("{0}={1}", s.Key, s.Value));
- }
The first line gives us a string containing the data we need. It uses the following extension method:
- internal static string BetweenDL(this string dumpFile)
- {
-     var _regex = new Regex(@"<dl[^>]*>(?<value>[\s\S]+?)</dl>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
-     Match _match = _regex.Match(dumpFile);
-     return _match.Success ? _match.Groups["value"].Value : string.Empty;
- }
Next, another extension method picks out the individual values and writes them into a Dictionary object:
- internal static Dictionary<string, string> BetweenDTDD(this string dumpFile)
- {
-     // \s* tolerates whitespace between the closing </dt> and the opening <dd>
-     var _regex = new Regex(@"<dt>(?<valDT>[\s\S]+?)</dt>\s*<dd[^>]*>(?<valDD>[\s\S]+?)</dd>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
-     MatchCollection matches = _regex.Matches(dumpFile);
-     Dictionary<string, string> d = new Dictionary<string, string>();
-     foreach (Match match in matches)
-     {
-         GroupCollection groups = match.Groups;
-         d.Add(groups["valDT"].Value, groups["valDD"].Value);
-     }
-     return d;
- }
Finally, the foreach loop displays the resulting data in the ListBox.
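To try the two extraction steps without the form or the network, something along the lines of the following sketch could be used. The sample HTML is a shortened, hand-written stand-in for the real page, the `DlParser` class and its `SampleHtml` constant are invented for the example, and the indexer assignment (instead of `Add`) together with `Trim()` is my own choice so that a repeated label or stray whitespace does not get in the way.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class DlParser
{
    // Hand-written sample standing in for the downloaded page (assumption).
    internal const string SampleHtml =
        "<html><body><dl class=\"x\">\n" +
        "<dt>Марка</dt>\n<dd>SKODA</dd>\n" +
        "<dt>Модельный год</dt><dd>2007</dd>\n" +
        "</dl></body></html>";

    // Returns the contents of the first <dl>...</dl> block, or "" if there is none.
    internal static string BetweenDL(this string html)
    {
        Match m = Regex.Match(html, @"<dl[^>]*>(?<value>[\s\S]+?)</dl>", RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["value"].Value : string.Empty;
    }

    // Collects every <dt>label</dt><dd>value</dd> pair into a dictionary.
    internal static Dictionary<string, string> BetweenDTDD(this string fragment)
    {
        var result = new Dictionary<string, string>();
        var regex = new Regex(
            @"<dt[^>]*>(?<valDT>[\s\S]+?)</dt>\s*<dd[^>]*>(?<valDD>[\s\S]+?)</dd>",
            RegexOptions.IgnoreCase);
        foreach (Match m in regex.Matches(fragment))
        {
            // The indexer overwrites a repeated label instead of throwing.
            result[m.Groups["valDT"].Value.Trim()] = m.Groups["valDD"].Value.Trim();
        }
        return result;
    }
}
```

Calling `DlParser.SampleHtml.BetweenDL().BetweenDTDD()` yields a dictionary with two entries, which can then be written out with `Console.WriteLine` in the same kind of foreach loop.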
Of course, we could have used only the second extension method, and the result would have been the same. In real applications, though, it is sometimes more convenient to first cut out the part of the text that contains the data and then work with it. Other improvements and changes can be made to this code, but I hope this short article has achieved its goal: to give you an idea of what web scraping is.