
2000 hours alone, or how an RSS reader was made / I'm a robocop

I am going to share the technical side of how I built a new web RSS reader in 16 weeks and almost went crazy doing it.
Skipping the long back-story, let's say it all started in February this year, when David (dmiloshev, UI designer) and I decided to build a prototype of our brainchild together.
"Alone" because there were no scrums, no meetings, no "collective mind": the entire technical part was on me.
If I were asked to describe the whole article in one sentence, it would come out as: No-SQL, MongoDB, node.js, my brain, evented I/O, queues, git, nginx, memcached, Google Reader, Atom, TTL, PHP, ZF, jQuery, conclusions.
I. Technology
1. PHP / ZendFramework + something else
All I had at the very beginning was a small framework that makes working with ZF a little more convenient. It has Dependency Injection, Table Data Gateway, Transfer Object, more convenient config handling, and Phing set up with tasks for almost every occasion. All in all, working with it is very pleasant.
Architecturally, the PHP application consists of the following layers (a rough sketch follows the list):
- Routing / Controller / View: self-explanatory
- Service: ACL, validation, caching and logging live here; you could easily bolt REST onto it and get a decent API
- Gateway: the body of the business logic; each entity in the system has its own gateway, completely abstracted from the database
- Mapper: the actual work with the database happens here
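Just to make the layering concrete, here is a rough sketch; the real code is PHP/ZF, so this JavaScript and every name in it are purely illustrative, not the actual classes:

    // Service -> Gateway -> Mapper; each layer knows nothing about the one above it.
    var feedMapper = {                        // Mapper: the only place that touches the DB
      findByUrl: function (url, cb) { cb(null, { id: 1, url: url }); } // stand-in for a real query
    };

    var feedGateway = {                       // Gateway: business logic, abstracted from the DB
      subscribe: function (userId, url, cb) {
        feedMapper.findByUrl(url, function (err, feed) {
          if (err) return cb(err);
          cb(null, { userId: userId, feedId: feed.id });
        });
      }
    };

    var feedService = {                       // Service: ACL, validation, caching, logging
      subscribe: function (user, url, cb) {
        if (!user) return cb(new Error('not allowed'));   // a token ACL check
        feedGateway.subscribe(user.id, url, cb);
      }
    };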
The principles I tried to stick to:
- KISS
- No reinventing the wheel
- Any logical part of the system must be able to scale horizontally
- Anything that deals with external sources must run in the background and never freeze the user interface
- Any high load should be absorbed by queues, CPU power and timeouts, not by the number of processes or open connections
- Any piece of data must be recoverable
- Absolutely everything needs to be logged, not just errors
- "It can be cached"
- David: "This block needs to go 1 pixel to the left..." =)
2. nginx
No comments.
3. git
It once made it clear to me that I was not as smart as I thought.
4. Mongodb
We had already used it in production on another project, but very cautiously, so we never got to try it out in full. Lately the No-SQL, sharding, map-reduce and no-SPOF trends have been gaining a lot of momentum, and I decided it was time to grow out of my short pants. At the very least, it broke up the routine and gave me a good shake.
The documentation is very detailed, so I managed to dig into the depths of MongoDB within the first two weeks. After years of working with relational databases I had to turn my brain slightly inside out, but every non-trivial task turned out to be solvable on my own, without resorting to questions on the forums.
I am still a little afraid of running it in production. To stay on top of possible problems, I regularly read the groups and study the issues other people run into.
At the moment it is configured as master-master, which is not fully supported, but in our case it should work as intended. Later we will shard, and that will definitely be easier than with, say, MySQL.
5. Memcached
There is nothing much to say: simple as a door. Although at some point I want to try it over UDP... just for fun.
6. Memcacheq
There are many alternatives to it today, but it proved itself very well in production on the previous project.
The nice thing is that it needs no special driver: it works on top of the memcached protocol (which came in handy in the next section).
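Since it speaks the plain memcached text protocol, a minimal sketch of enqueueing a message needs nothing but a raw TCP socket; the queue name, message and port below are just for illustration:

    var net = require('net');

    function enqueue(queueName, message, cb) {
      var sock = net.connect(22201, '127.0.0.1');   // 22201 is MemcacheQ's usual port
      sock.on('connect', function () {
        // with MemcacheQ, "set <queue>" pushes a message and "get <queue>" pops one
        sock.write('set ' + queueName + ' 0 0 ' + Buffer.byteLength(message) + '\r\n' +
                   message + '\r\n');
      });
      sock.once('data', function (chunk) {          // expect "STORED\r\n"
        sock.end();
        cb(null, chunk.toString().trim());
      });
      sock.on('error', cb);
    }

    enqueue('feed-pull', JSON.stringify({ url: 'http://example.com/rss' }), console.log);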
7. node.js
This is probably the most interesting thing that happened to me over these four months. Server-side evented I/O is very exciting, even more so than a differential. I immediately wanted to rewrite all the PHP onto it, and Ruby too while the mood lasted. But those are just dreams.
The thing is, I discovered it only recently and quite by accident. After that, a great many things fell into place, both in the system and in my head. I had to rewrite quite a lot, but the result warms the soul and, I hope, will delight future users.
I smoked this page right down to the filter; at the moment I use: mongoose, kiwi, step, memcache, streamlogger, hashlib, consolelog, eyes, daemon.
Of my own libraries I wrote jsonsocket, which, I think, speaks for itself; I have not gotten around to pushing it to GitHub yet. Now I dream of turning it into bsonsocket. I also had to write things for working with the queues, plus a layer for talking to the Gateway layer in PHP (more on that later).
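The general idea is roughly newline-delimited JSON over a plain TCP socket; the sketch below is my own simplification of that idea, not the library's actual code:

    var net = require('net');

    // Each line on the wire is one JSON message; onMessage fires per parsed object.
    function createJsonServer(onMessage) {
      return net.createServer(function (socket) {
        var buffer = '';
        socket.on('data', function (chunk) {
          buffer += chunk.toString('utf8');
          var lines = buffer.split('\n');
          buffer = lines.pop();                      // keep the unfinished tail
          lines.forEach(function (line) {
            if (!line.trim()) return;
            try { onMessage(JSON.parse(line), socket); }
            catch (e) { /* ignore malformed frames */ }
          });
        });
      });
    }

    // Usage: echo every message back with an "ok" flag
    createJsonServer(function (msg, socket) {
      socket.write(JSON.stringify({ ok: true, echo: msg }) + '\n');
    }).listen(8124);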
I also hooked up prowl: once an hour the background now sends a push message to my phone with a random quote from bashorg (along with some stats such as memory usage).
Many of the libraries (modules) are very raw, so I sometimes had to patch other people's code by hand, right in place (no time to prepare proper patches). And the node.js folks, dear gentlemen, don't give a damn about backward compatibility, so you often run into libraries that simply don't work.
8. jQuery
For me, this is almost a synonym for client-side javascript.
Used plugins: blockUI, validate, form, tooltip, hotkeys, easing, scrollTo, text-overflow and a couple more small ones.
II. Development
I will not delve into the specifics of the service itself; technically it is almost a Google Reader (GR).
While David was pushing gray squares around Photoshop and thinking through the business logic, I started with the basic modeling and then switched straight to the feed-fetching system.
1. Feed Pull
It would seem that everything is simple here: take the address, download the XML, parse it, write it to the database. But there are nuances.
- Each feed needs to be uniquely identified so it can be stored in the system
- Each entry also needs to be identified, to avoid duplicates
- Support for things like If-Modified-Since and ETag (see the sketch after this list)
- Redirect handling
- Different versions of RSS / Atom
- Extensions of various services, for example gr:date-published
- The HTML inside each entry needs to be cleaned, though not completely: keep the good tags and filter out any heresy
- Finding and processing the icon turned out to be not the most pleasant thing... LiveJournal, for example, does not return a content-type, so you have to resort to magic.mime
- Apparently few people read the specification, so the XML may be not quite valid, or not valid AT ALL
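To give an idea of the conditional-request part, here is a minimal sketch of pulling a feed with If-Modified-Since / If-None-Match and a naive redirect follow; it is an illustration, not the production fetcher, and the stored lastModified/etag values are assumed to come from the previous pull:

    var http = require('http');
    var url = require('url');

    function pullFeed(feedUrl, lastModified, etag, cb) {
      var opts = url.parse(feedUrl);
      opts.headers = {};
      if (lastModified) opts.headers['If-Modified-Since'] = lastModified;
      if (etag)         opts.headers['If-None-Match']     = etag;

      http.get(opts, function (res) {
        if (res.statusCode === 304) return cb(null, null);        // nothing new since last pull
        if (res.statusCode >= 300 && res.headers.location) {       // naive redirect handling
          return pullFeed(res.headers.location, lastModified, etag, cb);
        }
        var body = '';
        res.on('data', function (chunk) { body += chunk; });
        res.on('end', function () {
          cb(null, {
            xml: body,                                             // still has to be parsed and cleaned
            lastModified: res.headers['last-modified'],
            etag: res.headers.etag
          });
        });
      }).on('error', cb);
    }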
External sources vary wildly, and many of them simply spit on the standards. They cannot be trusted: content validation has to be as strict as it is for user input.
The perfect code never happened; it became overgrown with conditions and exceptions.
Each item on that list took a lot of time, more than it might seem at first glance.
2. Updater
Next, I wanted all existing feeds to update automatically, preferably taking into account the TTL (refresh rate) of each individual feed, and ideally spreading that TTL across the hours of the day. I did not rely on the TTL from the protocol because, according to my research, it is either missing altogether or does not match reality. Either way, it is not enough.
So I started thinking about my own system for determining feed update frequency, and here is what came out of it (a code sketch follows the list):
- TTL is the average distance in seconds between entries in a feed within one hour (minimum 2 minutes, maximum 1 hour)
- Each feed keeps a list of average TTLs for each of the 24 hours over the last 10 days
- From the actual data for the last 10 days, a forecast for the next day is built: an average TTL value for each hour
- Every time a feed is updated, the system recalculates its actual average TTL for the current hour (0-23)
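Here is a sketch of how those two pieces fit together; the function and field names are mine, not from the real code:

    var MIN_TTL = 2 * 60;     // 2 minutes
    var MAX_TTL = 60 * 60;    // 1 hour

    // Average gap in seconds between the entry timestamps seen within one hour
    function measureTtl(entryDates) {
      if (entryDates.length < 2) return MAX_TTL;
      var sorted = entryDates.slice().sort(function (a, b) { return a - b; });
      var total = 0;
      for (var i = 1; i < sorted.length; i++) total += (sorted[i] - sorted[i - 1]) / 1000;
      var avg = total / (sorted.length - 1);
      return Math.min(MAX_TTL, Math.max(MIN_TTL, Math.round(avg)));
    }

    // Forecast for a given hour: the average of that hour's TTLs over the last 10 days
    function forecastTtl(ttlHistory, hour) {       // ttlHistory[hour] = [ttl, ttl, ...]
      var samples = ttlHistory[hour] || [];
      if (!samples.length) return MAX_TTL;
      var sum = samples.reduce(function (a, b) { return a + b; }, 0);
      return Math.round(sum / samples.length);
    }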
The update procedure itself is the Feed Pull described earlier, which does not much care what exactly it is updating.
And here we smoothly arrive at wanting to push all of this through a queue. I will talk about how the queues are organized a little later.
By the way, there are plans to bolt on PubSub, and to launch our own hub as well.
3. Discovery
The skill set of a convenient RSS reader definitely has to include finding RSS/Atom feeds on any HTML page. When a user simply types in a website address (for example, www.pravda.ru), the system should go there and look for feeds that can actually be subscribed to.
The procedure is complicated by the fact that it cannot be done right inside the user's request: that is not the web server's job at all, so it has to happen asynchronously. On the user's request we first check whether such a feed already exists in the database, then look into the discovery cache (which lives for 2 hours), and if nothing is found we put the task into the queue and wait at most 5 seconds (I will explain how exactly we wait later). If the task does not finish in that time, we end the script by returning JSON along the lines of {wait: true}; after a timeout the client side repeats the same request. As soon as the task completes in the background, its result appears in the discovery cache.
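The flow on the request side looks roughly like this; the real handler is PHP, so this is only a sketch and every helper name in it (findFeedInDb, discoveryCache, queue, waitForResult) is illustrative:

    function discover(siteUrl, respond) {
      findFeedInDb(siteUrl, function (feed) {
        if (feed) return respond({ feed: feed });            // already known, nothing to do

        discoveryCache.get(siteUrl, function (cached) {       // the cache lives ~2 hours
          if (cached) return respond({ feed: cached });

          queue.push('discovery', { url: siteUrl });          // hand the job to the background
          waitForResult(siteUrl, 5000, function (result) {    // wait at most 5 seconds
            if (result) respond({ feed: result });
            else respond({ wait: true });                     // the client will retry after a timeout
          });
        });
      });
    }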
A few nuances associated with this procedure:
- Different encodings: sometimes the encoding is specified neither in the headers nor in the document itself... you have to detect it from the bytes (which does not always work)
- One page may offer two identical feeds, one RSS and one Atom; in that case you have to pick one of them
- Each discovered feed has to be requested additionally, to make sure it works and to take its real title and description
- Redirects
- Icons (the same problems)
- Standards and validity (the same again)
By the way, it often turns out that this particular page has no alternate links, but other pages of the same site do. There was an idea to write a back-up crawler that would quietly look for RSS/Atom feeds on the sites users enter most often.
Conclusions:
When you deal with external sources of every kind, it feels like digging through a giant dumpster in search of a document somebody happened to throw away.
This part still needs specific refinement: from a usability point of view, simply looking for alternate links on the given page is not enough. Something more universal is needed.
4. Interface
The next thing I really wanted was an interface where I could subscribe to a feed, add it as a bookmark in the left-hand column, click it and read its contents.
I will not go into the implementation details of the interface; I will just say that I did all the layout and UI myself. It was very inefficient and distracted me from other tasks, but jQuery saved time.
On the reader and the general interface I spent about two weeks in total (not counting the fairly intense modifications and reworkings later). After that we had a rather cute toy glowing on our monitors, pleasing to the eye and the soul.
Folders
We are minimalists, of course, but I cannot imagine working with a reader without folders. And, gentlemen, sorry, but in Google Reader their usability is poor. We tried to implement them as accessibly and simply as possible.
I did not expect, though, that technically this could be such a problem. The interface is just the interface; on the server side I had to work pretty hard to make it behave as it should (see the next section).
Wherever possible I tried to use CSS sprites (where it worked out).
All JS and CSS are combined into single files, minified and gzip-compressed. The average page (with all the static assets) weighs 300 KB; with a warm cache, 100 KB.
And for IE6 we have a special page.
Conclusions:
The interface itself looks very light, but I would not say the same about its implementation.
Ultimately, with everything compressed and Firebug turned off, it runs briskly.
In total I count 28 screens at the moment, and a million use cases.
5. Read / unread entries
This turned out to be a rather non-trivial task for a system where there may potentially be hundreds of feeds and even more subscribers. Most importantly, it has to scale horizontally.
In each Entry entity I keep the list of users who have read it. That list could potentially hold a million identifiers without hurting performance, thanks to MongoDB's architecture. A separate collection stores extra information such as the read time; it is not indexed and exists purely for statistics, so everything stays quite fast.
For each user we store the date his counters were last updated, across all the feeds he is subscribed to.
When the user refreshes the page, the system finds, for each feed, the number of new entries that appeared after that date and adds it to the unread count (a simple increment). When the user reads an entry, a simple decrement happens.
Fetching the unread entries of a single feed is also very simple.
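In mongo-shell syntax the idea looks roughly like this; the collection and field names are illustrative, not the real schema:

    // An entry keeps the ids of the users who have read it:
    //   { _id: ..., feedId: ..., published: ..., readBy: [userId, userId, ...] }

    // New entries in a feed since the user's counters were last updated:
    db.entries.count({ feedId: feedId, published: { $gt: lastCounterUpdate } });

    // Unread entries of a single feed for a given user:
    db.entries.find({ feedId: feedId, readBy: { $ne: userId } }).sort({ published: -1 });

    // Marking an entry as read: remember the user and decrement his counter
    db.entries.update({ _id: entryId }, { $addToSet: { readBy: userId } });
    db.counters.update({ userId: userId, feedId: feedId }, { $inc: { unread: -1 } });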
But selecting only the unread entries within a folder is already a problem. I will spare you the details, but it comes down to the fact that MongoDB simply has no joins. It cannot be solved with one query, or even a few, only through CodeWScope, which cannot be indexed, and scaling it means map/reduce. Right now this is a potential bottleneck.
5.1. Unread on top
If any of you used Google Reader, you probably know its "show only unread" option. If a feed has no entries you haven't read yet, you end up looking at a blank page. At first we did the same, but testing showed that users do not even realize the option is turned on: they do not understand why the feed is empty, where its entries are and where they went.
David proposed a very interesting solution in which unread entries simply float to the top and read ones sink down. It cost me several days of brain-breaking to work out how to implement it efficiently, in folders especially.
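One way to approximate this for a single feed is to take unread entries first and then top the page up with already-read ones; a mongo-shell-style sketch, certainly not the exact production query (and the folder case is where it gets painful):

    function fetchStream(feedId, userId, limit) {
      var unread = db.entries.find({ feedId: feedId, readBy: { $ne: userId } })
                             .sort({ published: -1 }).limit(limit).toArray();
      if (unread.length >= limit) return unread;
      var read = db.entries.find({ feedId: feedId, readBy: userId })
                           .sort({ published: -1 }).limit(limit - unread.length).toArray();
      return unread.concat(read);
    }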
Conclusions:
No-SQL is good in terms of speed and scalability, but some seemingly trivial things turned out to be quite difficult to do with it.
Denormalization is good. A knocked-out counter is not a tragedy, but for any denormalized data you need a full recount procedure (running in the background, of course).
Map/reduce in MongoDB is still too raw for production: a little testing showed that it locks everything to hell while it runs. The developers promise to improve it in version 1.6, so for now we do without it.
Schema-less is a big win.
8. Sharing
This is the feature that lets you pin any entry from a feed you read onto your own page. In short, any authoritative guy A, while reading various feeds, can instantly save particular entries (the most interesting and useful ones, naturally) into his own stream(s), similar to Shared Items in GR. Other users can subscribe directly to his "Shared Items" stream, just as they would to any feed.
One of the core concepts of our service is convenient distribution of information. Building the sharing chains turned out to be quite an interesting technical task for me, and MongoDB with its schema-less nature helped a lot.
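Purely as an illustration of why schema-less helps here, a share chain can simply live inside the shared entry's document; this shape is my guess, not the actual production schema:

    {
      _id:        ObjectId("..."),
      entryId:    ObjectId("..."),           // the original entry
      sharedBy:   userC,                     // whose stream it now appears in
      chain:      [userA, userB, userC],     // the path the entry travelled
      sharedAt:   new Date(),
      annotation: "worth reading"            // optional; no schema migration needed
    }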
An interesting point:
Recently Google announced the new Reshare feature in Buzz. In that article (linked), under "A little more background", I came across the very points David and I had discussed closely 4 months ago; we reached the same conclusions, and our sharing implementation is very close.
9. Node.js, background, queues
Initially the daemons were written in PHP, and it was all very crooked. Apart from MongoDB, this was the murkiest spot in the application for me, since PHP is simply not meant for such things.
But when I stumbled upon node.js (only two weeks ago), my soul began to sing and I could once again "sleep" peacefully. The problem was that there was no time to rewrite onto it all the background code already implemented in PHP (feed-pull, discovery, feed-info).
A very brief dig through node's capabilities led me to a compromise solution: child processes.
9.1. Queue Manager
This is the first node daemon. Its job is to read the queues, hand tasks out to workers and supervise their work.
- One manager can serve many queues
- Any number of managers can be run in the system, one per server
- Each one can be configured in its own way; for example, different managers can work with different sets of queues
- The configuration of each queue can differ and has the following parameters (a sample config sketch follows this list):
- The maximum number of workers running at the same time (the actual number is adjusted depending on the load)
- The size of the task buffer (needs tuning depending on the type of task and the number of workers)
- Maximum worker idle time (an idle worker is automatically killed to free memory)
- Maximum worker lifetime (if it is php-cli, it should not live long; it is better to restart it now and then)
- Maximum worker memory usage (as soon as it is exceeded, the worker is killed)
- Task timeout (if a worker gets stuck on a task, kill it and return the task to the queue)
- The number of times a task is allowed to fail
- When a task is taken from the queue, a lock is set on it (memcache is used for locks)
- If a task produces a result, it is saved in memcache
- Each queue has its own worker, which must be a JS class with a specific interface
- At the moment only one such worker exists: import (more about it below)
- There is also WorkerPhp.js, which runs php-cli as a child process and talks to it in JSON
- The life of such a worker (process) does not end with a single task: it keeps taking tasks one after another until the manager notices that it has noticeably "grown fat" and fires it
- In practice, no more than 4 PHP processes per queue run simultaneously
- It understands POSIX signals
- On a graceful shutdown (i.e. not kill -9) it neatly returns all in-memory tasks back to the queue
- Each manager opens a port with a REPL interface; you can log into it and ask how things are going, and also change its configuration on the fly, without a restart.
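To give a feel for it, here is what a queue configuration might look like; the parameter names are mine and the real config surely differs:

    var queues = {
      'feed-pull': {
        worker:       'WorkerPhp',            // proxies tasks to a php-cli child process
        maxWorkers:   4,                      // never more than 4 php processes per queue
        bufferSize:   20,                     // how many tasks to prefetch from the queue
        maxIdleTime:  30 * 1000,              // kill an idle worker after 30 seconds
        maxLifetime:  10 * 60 * 1000,         // restart php-cli every 10 minutes
        maxMemory:    64 * 1024 * 1024,       // kill the worker once it exceeds 64 MB
        taskTimeout:  60 * 1000,              // stuck task: kill the worker, requeue the task
        maxFailures:  3                       // give up on a task after 3 failures
      },
      'import': {
        worker:       'Import',               // a pure JS worker
        maxWorkers:   10
      }
    };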
And all of this is 500 lines of code (comments included).
Conclusions:
Evented I/O is how most server-side applications ought to work. Blocking should exist only where it is genuinely needed.
Proxying PHP through node showed good results and saved time.
A whole pile of work is handled by a single process (not counting php-cli). The JS workers run inside it, asynchronously and very briskly.
9.2. Controller - Publish / Subscribe hub
It often happens that you need to run a batch of tasks (say, 100) in parallel and asynchronously. But a queue is a black hole: throwing 100 tasks into it and then polling memcache for results every second is wasteful.
You could bypass the queue and use a socket to contact a manager directly, asking it to run the tasks and waiting for the answers on the same connection. But that does not work either, because there may be a dozen managers and we do not know which one to talk to... in short, it is just wrong.
So I wrote the Controller (node). There is exactly one of them in the whole system, and it is as simple as a stool (a toy sketch follows the list):
- All managers keep a permanent connection to the controller
- Whenever a task produces a result or fails, the manager reports it to the controller in detail
- You can connect to it "from the other side" and subscribe to a specific task or a list of tasks
- As information about tasks arrives, the controller notifies all subscribers
- If a subscriber is waiting for many tasks, the controller notifies it as they come in
- There is a (blocking) PHP client
- Garbage collection
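A toy version of the hub fits in a couple of dozen lines; this is only the idea, not the real Controller, and the message format is invented:

    var net = require('net');

    var subscribers = {};                          // taskId -> [sockets waiting for it]

    net.createServer(function (socket) {
      var buf = '';
      socket.on('data', function (chunk) {
        buf += chunk;
        var lines = buf.split('\n');
        buf = lines.pop();
        lines.forEach(function (line) {
          if (!line.trim()) return;
          var msg = JSON.parse(line);              // managers and clients speak line-based JSON
          if (msg.type === 'subscribe') {
            (subscribers[msg.taskId] = subscribers[msg.taskId] || []).push(socket);
          } else if (msg.type === 'result') {      // a manager reports a finished (or failed) task
            (subscribers[msg.taskId] || []).forEach(function (s) {
              s.write(JSON.stringify(msg) + '\n');
            });
            delete subscribers[msg.taskId];        // crude garbage collection
          }
        });
      });
    }).listen(8200);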
Conclusions:
The Publish/Subscribe scheme is very effective in non-blocking environments.
A one-hundred-percent result is not required: if in the end 5 tasks out of 100 fail for some reason, that is usually not a problem and we carry on.
9.3. Feed-updater (background updater)
A node process, one for the entire system. It periodically queries the database for the list of feeds due for an update (based on the TTL data) and throws them into the queue.
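A sketch of one updater tick, reusing the hypothetical enqueue() and forecastTtl() from the earlier sketches; fetchDueFeeds() stands in for the real database query:

    function updaterTick(dueFeeds) {
      dueFeeds.forEach(function (feed) {
        // the system queue, so user-triggered pulls keep their priority
        enqueue('feed-pull-sys', JSON.stringify({ feedId: feed.id, url: feed.url }), function () {});
      });
    }

    // Every minute: ask the database which feeds' forecast TTL has expired and queue them
    setInterval(function () {
      fetchDueFeeds(function (feeds) { updaterTick(feeds); });
    }, 60 * 1000);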
9.4. Queues
To avoid race conditions, a unique md5 identifier is generated for each task. That identifier goes into the queue, while the task data itself goes into memcached, because almost all tasks have a non-fixed size and memcacheq is not friendly with that (nor should it be). When a manager takes a task, it sets a lock on it, which is also just a memcached entry. This prevents identical tasks from re-entering the queue while they are being executed.
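The enqueue side in a sketch; storeInMemcache() and pushToMemcacheq() stand in for the real clients:

    var crypto = require('crypto');

    // The queue carries only the md5 id; the (arbitrarily sized) task data lives in memcached.
    function enqueueTask(queueName, taskData, cb) {
      var json = JSON.stringify(taskData);
      var id = crypto.createHash('md5').update(json).digest('hex');  // same data => same id

      storeInMemcache('task:' + id, json, function () {
        // the manager will set something like 'lock:' + id when it takes the task,
        // so an identical task cannot re-enter the queue while this one is running
        pushToMemcacheq(queueName, id, cb);
      });
    }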
I plan to look at Redis as an alternative to all of this, because memcached is being used here for things it was not meant for: if it goes down, the whole queue is lost.
I also split the queues into two groups, user and system, with the user queues taking priority.
In practice this simply meant adding a feed-pull-sys queue, which the background updater uses without getting in the way of user tasks.
Conclusions:
This implementation is still very crude.
The queue must be recoverable after any crash.
A more advanced locking scheme is needed (a mutex?).
User and background processes must have different priorities.
10. Import / Export
Here is another point worth mentioning. Any decent reader has to support import/export in the OPML format. The catch is that a user may upload an OPML file with hundreds of feeds that are not in our system yet, and would then have to wait until they all load; and there may easily be a dozen such users at the same time.
Node to the rescue. There is a new worker called "import" (at the moment up to 10 can run at once). Once the OPML file is uploaded and validated, PHP throws a task into the queue and returns the user to the interface, to a progress bar. Meanwhile "import" picks the task up and scatters smaller tasks into the "feed-pull" queue, then waits for them through the controller, updating the counter as they finish. The user watches a smoothly creeping progress bar; he can even leave the page, take a walk and come back. It's nice.
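The core of the import worker, sketched; the helper names are illustrative, and task completion would really be reported through the controller described above:

    function runImport(feedUrls, reportProgress, done) {
      var finished = 0;

      feedUrls.forEach(function (url) {
        var taskId = enqueueFeedPull(url);             // one small task per feed in "feed-pull"
        subscribeToTask(taskId, function onDone() {    // the controller tells us when it completes
          finished++;
          reportProgress(finished / feedUrls.length);  // this is what the progress bar shows
          if (finished === feedUrls.length) done();
        });
      });
    }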
III. Conclusions
- Don't reinvent the wheel. For almost any task a ready-made solution already exists that only needs a little adaptation.
- The simpler the product looks to the user, the harder it is to implement. The consumer, by the way, is unlikely to ever notice this.
- Don't overestimate yourself. It is impossible to single-handedly build a grown-up product "in a week" (though I do not like that word).
- Motivation, however, sometimes does the impossible.
- The product will never be perfect. A working application is always a compromise between time and quality.
- If you work on your own, you really miss brainstorming with a team. Use the collective mind whenever you can.
- Context switching eats a lot of time. It is much more effective when one developer works on a batch of similar tasks.
- If you intend to do your own startup and get further than ideas, forget about personal life and Friday beer.
In the meantime, I want to announce a semi-closed launch next week.
One of these days my colleague will write about the project itself in a separate article.
I would be grateful for any technical comments, advice and constructive criticism.