Top comments of Habr: service, implementation details, and some statistics (C#)
Some time ago the "Best Comments" page was removed from Habr (details here: habrahabr.ru/qa/18401).
Still, I liked looking at that page, both for the lulz and because it sometimes surfaced interesting articles that I had missed in my feed. So I decided to build my own little service. I hope the administration will not mind.
Current service URL: habrastats.comyr.com
The first "version" displayed the top comments from the last N posts; it was written in LINQPad in two hours and fit on one screen (pastebin). It quickly became clear that generating results on demand is unrealistic even for the posts of the past 24 hours (the download speed is 1-2 posts per second), so a periodic update is needed. That led to the idea of running the service on a home machine (which is always on) and uploading the static results to a free hosting.
Briefly about the implementation
Code: code.google.com/p/habra-stats
- Windows Service in C# (.NET 4.5, VS2012; no new features are used, so it can be built for 4.0)
- Parsing with regular expressions (and yes, I know: you can't parse HTML with regex, but it is fine here)
- MS SQL Express + Entity Framework (a very convenient ORM)
- XSLT for HTML generation (I took the CSS and layout from Habr; may the administration forgive me again)
Every two hours the service wakes up, parses the page habrahabr.ru/posts/collective/new, and gets the Id of the newest post. It then downloads posts in reverse order until the publication date passes the threshold (older than 3 days). The posts are parsed and stored in the database.
Before that, all existing posts were loaded into the database in a one-time run (it took two days).
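The update cycle above can be sketched roughly as follows. This is a minimal sketch, not the actual service code: `Post`, `Crawler`, `CollectRecent`, and the `fetchPost` delegate are hypothetical names, and downloading and parsing are hidden behind the delegate. It also assumes that a missing Id is reported as null and skipped, and that publication dates decrease together with Ids.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal post model; the real entities hold more fields.
public class Post
{
    public int Id { get; set; }
    public DateTime Published { get; set; }
}

public static class Crawler
{
    // Walks post Ids downwards from newestId and collects posts until one
    // is older than maxAge. fetchPost downloads and parses a single post
    // and returns null when no post exists under that Id.
    public static List<Post> CollectRecent(
        int newestId, Func<int, Post> fetchPost, DateTime now, TimeSpan maxAge)
    {
        var result = new List<Post>();
        for (int id = newestId; id > 0; id--)
        {
            Post post = fetchPost(id);
            if (post == null)
                continue;               // gap in the Id sequence, skip it
            if (now - post.Published > maxAge)
                break;                  // reached the threshold, stop
            result.Add(post);
        }
        return result;
    }
}
```

Passing the downloader in as a delegate keeps the loop testable without network access; the one-time initial load would be the same loop with a much larger `maxAge`.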
Then "reports" are generated from the database, such as "best of the day", "worst of the month with a picture", and so on. The data for a report is simply a collection of Comment objects, which is serialized and transformed with XSLT. The results are uploaded to the hosting via FTP.
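The serialize-then-transform step could look something like this. It is a sketch under assumptions: the `Comment` fields and the `ReportRenderer` class are made up for illustration, and the real stylesheet is not shown in the article. Only standard .NET classes (`XmlSerializer`, `XslCompiledTransform`) are used.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Serialization;
using System.Xml.Xsl;

// Hypothetical comment model with only the fields used for rendering.
public class Comment
{
    public string Author { get; set; }
    public int Score { get; set; }
    public string Text { get; set; }
}

public static class ReportRenderer
{
    // Serializes the comments to XML and applies the given stylesheet.
    public static string Render(List<Comment> comments, string xslt)
    {
        // Step 1: objects -> XML (the root element is <ArrayOfComment>).
        var serializer = new XmlSerializer(typeof(List<Comment>));
        var xml = new StringWriter();
        serializer.Serialize(xml, comments);

        // Step 2: XML -> HTML via XSLT.
        var transform = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(xslt)))
            transform.Load(styleReader);

        var html = new StringWriter();
        using (var input = XmlReader.Create(new StringReader(xml.ToString())))
            transform.Transform(input, null, html);
        return html.ToString();
    }
}
```

A stylesheet for this sketch would match `/ArrayOfComment` and iterate over its `Comment` elements. The resulting string is what would be written to a file and uploaded.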
There is a little trick to generating reports and navigating between them: each of the filtering methods (ZaDen, ZaNedelya, Best, Worst, etc.) is marked with the attribute:
[CommentReport(Category = "Время", Name = "За сутки", CategoryOrder = 0)] // Category: "Time", Name: "Last 24 hours"
Using reflection, we collect all combinations of such methods across the categories, fetch the data from the database, and generate the navigation. So to add another "report" (for example, "last three days"), you just need to add a method with the attribute. Glory to LINQ and Entity Framework.
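Here is a rough sketch of how such discovery might work. The article shows only the attribute usage, so the attribute definition, the method signatures, the "Rating" category, the English labels, and the `ReportDiscovery` class are all assumptions made for illustration. Only the method names ZaDen, ZaNedelya, Best, and Worst come from the text above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Assumed shape of the attribute; only its usage appears in the article.
[AttributeUsage(AttributeTargets.Method)]
public class CommentReportAttribute : Attribute
{
    public string Category { get; set; }
    public string Name { get; set; }
    public int CategoryOrder { get; set; }
}

// Hypothetical comment model with only the fields the filters need.
public class Comment
{
    public int Score { get; set; }
    public DateTime Date { get; set; }
}

public static class Filters
{
    [CommentReport(Category = "Time", Name = "Day", CategoryOrder = 0)]
    public static IQueryable<Comment> ZaDen(IQueryable<Comment> q)
    {
        DateTime from = DateTime.Now.AddDays(-1);
        return q.Where(c => c.Date >= from);
    }

    [CommentReport(Category = "Time", Name = "Week", CategoryOrder = 0)]
    public static IQueryable<Comment> ZaNedelya(IQueryable<Comment> q)
    {
        DateTime from = DateTime.Now.AddDays(-7);
        return q.Where(c => c.Date >= from);
    }

    [CommentReport(Category = "Rating", Name = "Best", CategoryOrder = 1)]
    public static IQueryable<Comment> Best(IQueryable<Comment> q)
    {
        return q.OrderByDescending(c => c.Score);
    }

    [CommentReport(Category = "Rating", Name = "Worst", CategoryOrder = 1)]
    public static IQueryable<Comment> Worst(IQueryable<Comment> q)
    {
        return q.OrderBy(c => c.Score);
    }
}

public static class ReportDiscovery
{
    // Returns every combination that takes one method from each category,
    // with categories ordered by CategoryOrder.
    public static List<List<MethodInfo>> Combinations(Type filters)
    {
        var groups = filters.GetMethods(BindingFlags.Public | BindingFlags.Static)
            .Select(m => new { Method = m, Attr = m.GetCustomAttribute<CommentReportAttribute>() })
            .Where(x => x.Attr != null)
            .GroupBy(x => new { x.Attr.CategoryOrder, x.Attr.Category })
            .OrderBy(g => g.Key.CategoryOrder)
            .Select(g => g.Select(x => x.Method).ToList())
            .ToList();

        // Cartesian product of the categories, built one category at a time.
        var result = new List<List<MethodInfo>> { new List<MethodInfo>() };
        foreach (var group in groups)
            result = result
                .SelectMany(prefix => group.Select(m => prefix.Concat(new[] { m }).ToList()))
                .ToList();
        return result;
    }

    // Applies the chosen filters, in order, to a comment query.
    public static IQueryable<Comment> Apply(IEnumerable<MethodInfo> combo, IQueryable<Comment> source)
    {
        return combo.Aggregate(source,
            (q, m) => (IQueryable<Comment>)m.Invoke(null, new object[] { q }));
    }
}
```

With two categories of two methods each, `Combinations(typeof(Filters))` yields four reports. Because the filters take and return `IQueryable<Comment>`, the chain can stay a single database query when the source is an Entity Framework set.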
A bit about problems and solutions
Initially I wanted to do without a database and keep everything as simple as possible: store the raw HTML on disk, load it into memory, and process it there. But I underestimated the scale of the disaster: 150 thousand posts in HTML took more than 10 gigabytes. Even on an SSD, the loading and parsing times are unacceptable.
Then I tried SQL Compact Edition (an in-process database that supports Entity Framework) and ran into the 4 GB limit on the size of the database file. At that time there was only one Comments table with duplicated (denormalized) data. After switching to SQL Express, I removed part of the duplication by adding a Posts table and deleted the comments without votes (about 30% of them). As a result, the database is now about 2 GB.
While working on the parsing, I found out that careless use of RegexOptions.IgnoreCase makes matching several times slower.
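A quick way to check the effect on your own patterns is a small timing harness like the one below. It is only a sketch: the class name, the sample markup, and the pattern are made up for the measurement and are not Habr's real HTML. The size of the gap depends on the pattern and on the runtime version.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class RegexCaseBenchmark
{
    // Counts the matches of pattern in input and reports the elapsed time.
    public static int CountMatches(
        string pattern, RegexOptions options, string input, out TimeSpan elapsed)
    {
        var regex = new Regex(pattern, options);
        var watch = Stopwatch.StartNew();
        int count = regex.Matches(input).Count;   // Count forces a full scan
        watch.Stop();
        elapsed = watch.Elapsed;
        return count;
    }

    // Builds a synthetic page; the markup is invented for the benchmark.
    public static string BuildSample(int comments)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < comments; i++)
        {
            sb.Append("<div class=\"comment\"><span class=\"score\">+")
              .Append(i % 50)
              .Append("</span><p>Some comment text number ")
              .Append(i)
              .Append("</p></div>\n");
        }
        return sb.ToString();
    }
}
```

Run `CountMatches` twice on the same `BuildSample` output, once with `RegexOptions.None` and once with `RegexOptions.IgnoreCase`, using a pattern such as `<span class="score">\+(\d+)</span>`. The match counts should be equal, so only the timings differ.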
Some statistics
At the time of writing, the database contains:
90619 posts
18 comments per post on average (comments without votes are not in the database)
15 of them with positive ratings
1676593 comments in total
721 comments per day
Average number of comments by day of the week
Comments per week: time dynamics
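For reference, a per-weekday average like the one in the first chart can be computed with a short LINQ query. This is an in-memory sketch with a hypothetical `CommentStats` class, not the query actually used for the charts. It assumes that calendar days without any comments are simply left out of the average.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CommentStats
{
    // Average number of comments per calendar day, grouped by day of week.
    public static Dictionary<DayOfWeek, double> AverageByDayOfWeek(
        IEnumerable<DateTime> commentDates)
    {
        return commentDates
            .GroupBy(d => d.Date)                                   // one group per calendar day
            .GroupBy(day => day.Key.DayOfWeek, day => day.Count())  // daily totals per weekday
            .ToDictionary(g => g.Key, g => g.Average());
    }
}
```

For example, two comments on one Monday and four on the next give a Monday average of 3.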
Finally, the links!
Website:
http://habrastats.comyr.com/
Once again: code.google.com/p/habra-stats
In the plans
RSS with the best of the previous day
P.S. Suggest more interesting queries.
P.P.S. The famous comment on the famous post about pornolab is not displayed, because the author of the post is blocked.
The comment is by nForce; you can see it here: habrahabr.ru/users/nforce/comments/page2