
GrabDuck: How We Turn Bookmarks into Articles
Greetings, reader. Not so long ago, a new parser / article extractor appeared on our GrabDuck service. Internally, we call it GrabDuck Article Extractor 2.0, or GAE 2.0 for short. Why such a big deal? The thing is, there were so many changes and improvements that we had to throw out the old parser we had lived with for the past year and a half entirely and write a new one from scratch. And yes, for us this is a big and important change. What we didn't like, and what we ended up doing about it, is described under the cut.

So, we lived for a long time with an old parser, adopted from outside as a fork of an open source project. To be fair, it was good and tried to do its job at 100% (in one of our first articles we gave a link to it; if you're interested, take a look). And for anyone whose requirements don't exceed the average, we still recommend it: it will certainly cope.

But over time we ran into more and more limitations. We all know that websites come in every flavor, and we still stumble upon the terrible legacy of the 2000s, when there were no real standards. This is where our library kept breaking down, and we had to dig deeper and deeper into someone else's code and patch it to our needs.

Over time, perhaps the main complaint became that the library was like a Swiss Army knife: a good one, but it did everything itself. It downloaded the document by URL, tracked redirects, knew how to resolve various link shorteners, tried to detect the encoding even when it was not declared explicitly, parsed the document, identified images, and even tried to guess the article's publication date. Not a library but a fairy tale... at least until something needed to be fixed or slightly changed. Then we faced a choice: either edit the code directly, or run a second pass of our own after the document was processed, inevitably duplicating some of the original parser's logic here and there. And that decision was not easy: the author of this open source library was clearly no fan of testing, so everything worked exactly as long as you didn't touch it or make significant changes to the code.
Bear in mind that article parsing and extraction, for those who have never dealt with it, relies entirely on statistics rather than on clear-cut rules. It was enough to slightly change the weights of the statistical model, and we immediately ran the risk that some classes of sites would simply stop being processed correctly. After all, there is no common format: the whole article is just a big blob of HTML, with the few paragraphs of text we actually need buried somewhere inside. So over time we ended up with a parallel world of our own, where an already processed and seemingly finished article was run once more through our own secondary parser.
How this open source library behaved in multithreaded mode was a separate, very sad story. On large imports, when tens of thousands of bookmarks were queued up for processing at once, everything in our kingdom simply froze.
And that was lesson number one for us: when building your system, use independent components, bricks, and assemble what you need out of them. If something goes wrong, or an interesting new project appears that does the job better, you can always switch off the old component and try the new one without breaking the system or risking the whole thing crumbling at some stage. And believe our experience here: if something in the digital world can go wrong, it will go wrong, almost immediately.
So in the end we decided: enough, it's time to take matters into our own hands and write something of our own, built to our requirements and to a quality that satisfies us. And so a new component appeared on our architecture diagram: GAE 2.0.
First of all, we wanted to build it as a set of independent components. Some steps called for parallel processing on the principle of "the more, the better"; somewhere a single thread was enough; and somewhere we wanted the speedup of parallel work but faced serious limits on how many elements could be processed at once.
As a result, a sort of conveyor, or pipeline, emerged, in which each bookmark turns into a full-fledged article, with each step enriching it with data relevant to the user.
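To make the idea concrete, here is a minimal Python sketch of how such a stage-by-stage pipeline can be wired together out of queues and worker pools, each stage with its own degree of parallelism. The stage names, handlers and worker counts are illustrative assumptions for the example, not our actual implementation:

```python
import queue
import threading

def run_stage(in_q, out_q, handler, workers):
    """Start `workers` threads that apply `handler` to items from in_q
    and pass non-None results on to out_q. None is the shutdown signal."""
    def loop():
        while True:
            item = in_q.get()
            if item is None:
                in_q.put(None)   # re-post the signal so sibling workers stop too
                break
            result = handler(item)
            if result is not None and out_q is not None:
                out_q.put(result)
    threads = [threading.Thread(target=loop) for _ in range(workers)]
    for t in threads:
        t.start()
    return threads

# Dummy handlers standing in for the real stages described below.
def fetch(url):
    return {"url": url, "html": "<html>...</html>"}

def check(doc):
    return doc                     # errors and duplicates would be dropped here

def index(doc):
    print("indexed", doc["url"])   # extraction / indexing, end of the line

fetch_q, check_q, index_q = queue.Queue(), queue.Queue(), queue.Queue()
stages = [
    (fetch_q, run_stage(fetch_q, check_q, fetch, workers=8)),  # I/O-bound: many
    (check_q, run_stage(check_q, index_q, check, workers=1)),  # one is enough
    (index_q, run_stage(index_q, None, index, workers=2)),     # CPU-bound: a few
]

for url in ["https://example.com/a", "https://example.com/b"]:
    fetch_q.put(url)

for q, threads in stages:          # shut down stage by stage, in order
    q.put(None)
    for t in threads:
        t.join()
```

The point of this shape is exactly the lesson above: each stage is an independent brick, and any of them can be swapped out or re-tuned without touching the others.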

So, what steps does it take to turn a link into a full-fledged article that can be shown to the user?
After some reflection, the areas of responsibility shook out as follows. First, the URL Fetcher itself, responsible for actually downloading the article behind the URL. It must understand all kinds of redirects and be able to work with SSL and link shorteners. And it needs to be parallelized, because just waiting for a server to respond eats up years of computer time, so something has to be done about it. But the "more is better" strategy doesn't work here either: bombard the same site with requests and we will simply get banned. So we need some happy medium that keeps everyone satisfied.
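As an illustration of that happy medium, here is a rough Python sketch of a fetcher that runs many downloads in parallel but throttles requests per host. The one-second delay, the thread count, and the use of the `requests` library are assumptions for the example, not our production setup:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

import requests  # pip install requests

MIN_DELAY = 1.0                  # seconds between two requests to the same host
_host_locks = {}                 # host -> lock serializing requests to that host
_last_hit = {}                   # host -> timestamp of the last request
_registry = threading.Lock()

def polite_get(url):
    """Fetch a URL, following redirects (which also resolves most link
    shorteners), while never hitting one host more often than MIN_DELAY."""
    host = urlparse(url).netloc
    with _registry:
        lock = _host_locks.setdefault(host, threading.Lock())
    with lock:                   # requests to the same host wait their turn
        wait = MIN_DELAY - (time.monotonic() - _last_hit.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)
        _last_hit[host] = time.monotonic()
    return requests.get(url, timeout=10, allow_redirects=True)

urls = ["https://example.com/a", "https://example.com/b"]
with ThreadPoolExecutor(max_workers=16) as pool:     # parallel across hosts
    for resp in pool.map(polite_get, urls):
        print(resp.url, resp.status_code)            # resp.url is the final URL
```

Downloads from different hosts proceed in parallel, while requests to the same host queue up behind its lock, which is the compromise described above.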
The result must then be checked for errors. It may also turn out to be a duplicate of an article that already exists on GrabDuck, in which case it is enough to simply attach the new user to that existing article.
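One common way to implement such a duplicate check is to normalize the URL to a canonical form and look it up before doing any heavy work. The sketch below is a simplified illustration, with in-memory dictionaries standing in for a real database and deliberately naive normalization rules:

```python
import hashlib
from urllib.parse import urlparse, urlunparse

known_articles = {}   # canonical url -> article id (a database in real life)
article_users = {}    # article id -> set of user ids

def canonical(url):
    """Normalize a URL so trivially different spellings compare equal.
    Real rules are messier (tracking parameters, mobile subdomains...)."""
    p = urlparse(url)
    return urlunparse((p.scheme.lower(), p.netloc.lower(),
                       p.path.rstrip("/") or "/", "", p.query, ""))

def register_bookmark(url, user_id):
    key = canonical(url)
    if key in known_articles:
        article_id = known_articles[key]        # duplicate: just attach the
        article_users[article_id].add(user_id)  # new user to the old article
        return article_id
    article_id = hashlib.sha1(key.encode()).hexdigest()[:12]
    known_articles[key] = article_id
    article_users[article_id] = {user_id}
    # ...otherwise the bookmark continues down the pipeline for extraction...
    return article_id

print(register_bookmark("https://Example.com/post/", 1))
print(register_bookmark("https://example.com/post", 2))  # same article id
```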
Only after that does the time come to extract the data and prepare the final article that we will show to our users. What does that include? Collecting meta-information: the title, images, the computed tags and the language of the document. Of course we also need the content itself for full-text search, and we need to generate a text snippet that briefly represents the document.
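For a rough idea of what the meta-information step involves, here is a simplified sketch using BeautifulSoup. The fallback order shown (Open Graph title, then `<title>`; meta description, then first paragraph) is our guess for the example, not necessarily what GAE 2.0 actually does:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_metadata(html, max_snippet=200):
    """Pull a title, the document language and a short snippet
    out of already-fetched HTML."""
    soup = BeautifulSoup(html, "html.parser")

    og_title = soup.find("meta", property="og:title")
    title = (og_title["content"] if og_title and og_title.get("content")
             else soup.title.string.strip() if soup.title and soup.title.string
             else "")

    # <html lang="en-US"> -> "en"
    lang = (soup.html.get("lang") or "").split("-")[0] if soup.html else ""

    meta_desc = soup.find("meta", attrs={"name": "description"})
    if meta_desc and meta_desc.get("content"):
        snippet = meta_desc["content"]
    else:
        first_p = soup.find("p")
        snippet = first_p.get_text(" ", strip=True) if first_p else ""
    if len(snippet) > max_snippet:
        snippet = snippet[:max_snippet].rsplit(" ", 1)[0] + "…"

    return {"title": title, "lang": lang, "snippet": snippet}

html = """<html lang="en"><head><title>Hello</title>
<meta name="description" content="A short test page."></head>
<body><p>Body text.</p></body></html>"""
print(extract_metadata(html))
```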
After that, the document is ready for use and searchable on GrabDuck.
So the new parser is ready, and hooray, all new bookmarks will go through it, and we will finally get what we wanted! But the big question a reader may have is: what happens to the existing bookmarks? After all, they were ALREADY processed and ALREADY saved in the system! Will they really remain untouched? Our answer is no! First of all, the user can always select a bookmark and force an update. To do this, just pick the corresponding item from the context menu. It looks something like this.

Or just wait a little. One of GrabDuck's great features is that we periodically check all bookmarks: is everything all right, are the sites still alive, have new comments appeared on the page, and so on. So sooner or later your bookmarks will be updated and pass through GAE 2.0.
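A simplified sketch of how such a periodic re-check can be driven: pick the bookmarks whose last check is older than some interval, oldest first, and re-queue them into the pipeline. The interval and the field names here are made up for the example:

```python
import time

RECHECK_INTERVAL = 30 * 24 * 3600   # e.g. revisit each bookmark monthly

def due_for_recheck(bookmarks, now=None):
    """Return bookmarks whose last check is older than the interval,
    oldest first, so the whole collection eventually cycles through."""
    now = now or time.time()
    stale = [b for b in bookmarks if now - b["checked_at"] > RECHECK_INTERVAL]
    return sorted(stale, key=lambda b: b["checked_at"])

bookmarks = [
    {"url": "https://example.com/a", "checked_at": time.time() - 40 * 24 * 3600},
    {"url": "https://example.com/b", "checked_at": time.time()},
]
for b in due_for_recheck(bookmarks):
    print("re-queue", b["url"])     # would go back into the GAE 2.0 pipeline
```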
That's all we wanted to tell you today. Leave your comments, and see you soon.
