Filtering the Habr RSS feed through Yahoo.Pipes



    Commenters often complain about the abundance of unwanted content on the main page. No post can please everyone at once. Sigh... There is
    only one conclusion: the feed has to be filtered, as Captain Obvious suggests. Intuition tells us that we will filter using Yahoo.Pipes.

    With pictures.


    Instead of an introduction


    Y!P has already been written about: one, two, and more. We will write about it again; it will not be superfluous. The tool is truly impressive.
    So why choose Y!P when this functionality is already implemented in many RSS readers?

    Simply because it is very interesting to do something new that is also so close to the *nix philosophy. In addition, after filtering through Y!P you receive only the useful part of the feed, without loading your connection with unnecessary data.

    New engine


    Not so long ago, Y!P got a second-version engine: Yahoo! Pipes V2 engine.

    In short:
    • engine V1 will cease to exist after a while
    • V2 will add new features
    • V2 is faster
    • V2 most likely has a bunch of bugs (this is an unofficial release)


    Fetching the feed


    It all starts with an RSS feed. However, the Habr feed itself is not suitable for Y!P. Without going into the details, we take the FeedBurner feed instead: the picture shows the Fetch Feed module. Its job is to fetch the feed from the specified address and pass it further down the pipe. Then, as you can see, comes the Split module, which divides one feed into two completely identical ones.




    Is that redundancy for fault tolerance?

    No, this is needed for clever filtering: in the left channel everything unwanted is cut out, and in the right channel only what is wanted is kept.

    Filtering




    The top module blocks entries that match two regular expressions, and the bottom one permits entries that match two other regular expressions.
    What is a RegExp?

    About RegExps: in English, in Russian, as well as a good guide on the topic.

    To keep our filter modules from turning into huge monsters, the filtering keywords are entered separately (the text fields marked [wired], with the suspicious gray connections). For now, let us look at the left side of the module. From the drop-down list you can select the entry field of interest. The item.title field is perfect for our task of blocking entries, because entries are eliminated primarily by their title. The item.description field contains the body of the entry and is used in the permitting filter to keep the topics that interest us. For example, you, %username%, have blocked the word Microsoft but allowed the word Linux. In this case, if a post titled “Microsoft takes new heights” says “linux is still cooler” in its body, the entry will go straight to your RSS reader.
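
    The two branches can be imitated outside Y!P. The following is only a rough Python sketch of the idea, not how Y!P itself works: the sample entries, the function names block and permit, and the case-insensitive matching are all assumptions made for illustration.

```python
import re

# Hypothetical sample entries; the keys mirror item.title and item.description.
ENTRIES = [
    {"title": "Microsoft / Microsoft takes new heights",
     "description": "By the way, linux is still cooler."},
    {"title": "Microsoft / Another press release",
     "description": "Nothing of interest here."},
    {"title": "Python / Generators explained",
     "description": "A short tour of yield."},
]


def block(entries, pattern, field="title"):
    """Left branch: drop every entry whose field matches the pattern."""
    rx = re.compile(pattern, re.IGNORECASE)  # ignoring case is an assumption
    return [e for e in entries if not rx.search(e[field])]


def permit(entries, pattern, field="description"):
    """Right branch: keep only entries whose field matches the pattern."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [e for e in entries if rx.search(e[field])]


left = block(ENTRIES, r"microsoft")   # the blocked blog is cut out by title
right = permit(ENTRIES, r"linux")     # the interesting post is rescued by body
```

    Here the “Microsoft takes new heights” post is missing from left but present in right, so it survives once the two branches are merged.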

    Reporting the result




    After cutting and keeping entries, the two feeds have to be combined into one again. This is done by the Union module, which has as many as 5 inputs (it is a pity to leave the unused ones idle). Now we again have a single feed, in which duplicates may well have appeared. The Unique module helps us remove these unwanted parasites: using the item.link field, it looks through all the entries and removes the extra ones. All that remains is to tidy things up by sorting the entries by some criterion. In the picture, entries are sorted by publication date in descending order (newest first). The most important module at the end of any pipe is Output; this is where our project has to end. Well, that is actually it. By clicking on the Output module, you can enjoy the result of your efforts:
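
    The merge, de-duplication and sorting steps can be sketched in Python as follows. This is only an illustration under assumptions: the function names, the "published" key and the sample data are hypothetical and are not part of Y!P.

```python
from datetime import datetime

# Hypothetical outputs of the two filter branches.
left = [
    {"link": "http://example.com/1", "published": datetime(2010, 1, 1)},
    {"link": "http://example.com/2", "published": datetime(2010, 1, 3)},
]
right = [
    {"link": "http://example.com/2", "published": datetime(2010, 1, 3)},
    {"link": "http://example.com/3", "published": datetime(2010, 1, 2)},
]


def union(*feeds):
    """Concatenate any number of feeds into one list (like the Union module)."""
    merged = []
    for feed in feeds:
        merged.extend(feed)
    return merged


def unique(entries, key="link"):
    """Keep the first entry seen for each value of the key field."""
    seen = set()
    result = []
    for entry in entries:
        if entry[key] not in seen:
            seen.add(entry[key])
            result.append(entry)
    return result


def sort_newest_first(entries, key="published"):
    """Sort by publication date, newest entries first."""
    return sorted(entries, key=lambda e: e[key], reverse=True)


result = sort_newest_first(unique(union(left, right)))
```

    The entry with link http://example.com/2 arrives from both branches but appears only once in result.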



    So, where are the patterns for filtering the entries?


    Patterns





    This is what the patterns for removing posts look like. Three large String Builders are used purely for neatness (everything could be put into one). The separation also helps to keep the patterns organized. All items are separated from each other by a vertical bar, and this is a very important point: String Builder simply takes all of its fields and concatenates them into one string. The small String Builder collects the output of the large ones and wraps it in a suitable wrapper. As you know, post titles on Habr look like “blog / post”, so in this case all the patterns are aimed at filtering out specific blogs. The word "stub" saves us from the situation where nothing follows the last bar in the OR block (microsoft|linux|freebsd|): an empty alternative matches everything and would filter out absolutely all posts.
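
    The role of the stub is easy to check with ordinary regular expressions. Below is a small Python sketch; the keyword groups, the build_pattern helper and the "^(...)" wrapper are hypothetical stand-ins, since the actual wrapper is visible only in the picture.

```python
import re

# Hypothetical keyword groups; each ends with "|", as in the String Builder fields.
GROUPS = ["microsoft|windows|", "apple|iphone|", "politics|"]


def build_pattern(groups, stub="stub"):
    """Concatenate the groups, close the OR block with the stub, and wrap it."""
    alternatives = "".join(groups) + stub
    return "^(" + alternatives + ")"  # assumed wrapper: match at the title start


def is_blocked(pattern, title):
    """True if the title matches the deny pattern (case ignored, an assumption)."""
    return re.search(pattern, title, re.IGNORECASE) is not None


pattern = build_pattern(GROUPS)          # ends with "...|politics|stub)"
broken = build_pattern(GROUPS, stub="")  # ends with "...|politics|)"

blocked_ok = is_blocked(pattern, "Microsoft / Microsoft takes new heights")
kept_ok = not is_blocked(pattern, "Python / Generators explained")
# Without the stub the empty alternative matches any title at all.
everything_blocked = is_blocked(broken, "Python / Generators explained")
```

    With the stub in place only the listed blogs are blocked; without it the empty last alternative matches every title, so the deny filter would swallow the whole feed.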



    The second pattern for the deny filter is a little simpler; it blocks specific posts (from any blog). The permitting filter looks noticeably smaller but is no less important: it searches for the keywords from its String Builder in the title and body of the post. These elements help us not to miss an important topic because of strict filtering.








    And then what?

    We subscribe to the RSS or Atom version of the resulting feed and, of course, PROFIT.

    Instead of a conclusion


    I hope this article helps you, %username%, to master Yahoo.Pipes and implement your own ideas there.
    Link to the pipe from the article: habrapipe
