Vindicar October 15, 2014 at 21:53

rawdog - an RSS aggregator without excessive demands

Lyrical introduction

When a new site recently budded off from Habrahabr, I needed a convenient way to read both. The first thought, of course, was RSS, since the engine behind both sites supports it. All that remained was a mere trifle: finding a good RSS aggregator that could be installed on a weak VPS (the fate of Google Reader had somewhat cooled my desire to rely on a third-party service).

At first, a tip from Tsyganov_Ivan led me to the Tiny Tiny RSS aggregator, which looked like a real "silver bullet". A closer look at the system requirements, however, dampened my enthusiasm: pile a full-fledged LAMP stack onto a machine with, at best, 256 MB of free memory, and all that for a resource used by literally one person? On top of that, the FAQ linked to openly mocking answers on the project's forum, which finally killed any desire to deal with tt-rss.

The first round of searching failed, since the alternatives (such as FeedHQ) required roughly the same thing. In desperation, I was about to write the tool myself and had started looking for suitable Python libraries (I have a weakness for Python) when I came across almost exactly what I needed.

The name itself, RAWDOG (RSS Aggregator Without Delusions Of Grandeur), hints that its author was overwhelmed by similar feelings when writing it. The utility is designed to be run manually or from cron and can do only one thing: parse the specified RSS feeds and write new items to an output file according to a given template.

Installation and setup

Since rawdog is in the Ubuntu repository, getting the package is straightforward. Configuration, however, has its quirks.
First, you have to add the rawdog call yourself, either to crontab or to one of the cron.* directories. It will look something like this:

rawdog --dir WORKDIR --log /var/log/rawdog/rawdog.log --no-lock-wait --update --write

where the --no-lock-wait switch keeps a second copy of rawdog from running while the first one is still at work, and WORKDIR is the utility's working directory.
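
As a rough sketch of what such a cron.* entry might look like, the wrapper below runs rawdog with the same switches as above. The function name run_rawdog and the step that creates the log directory are my own assumptions for illustration, not something rawdog itself requires.

```sh
#!/bin/sh
# Sketch of a wrapper for one of the cron.* directories.
# run_rawdog is a hypothetical helper name; the rawdog switches are the ones shown above.
run_rawdog() {
    workdir="$1"    # rawdog working directory (WORKDIR above)
    logfile="$2"    # log file passed to --log
    # Assumption: the log directory may not exist yet, so create it first.
    mkdir -p "$(dirname "$logfile")" || return 1
    rawdog --dir "$workdir" --log "$logfile" --no-lock-wait --update --write
}
```

Saved as an executable script in, say, /etc/cron.hourly, the file would end with a call such as run_rawdog WORKDIR /var/log/rawdog/rawdog.log.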

The point is that rawdog looks for its configuration file and keeps all of its temporary files in a single working directory, ~/.rawdog by default. That may be convenient on a workstation, but it goes against normal practice on a server. If you, like me, love order and uniformity, you can specify a different working directory with the --dir switch; I used it to move the working directory to /var/cache/rawdog (since its main content, apparently, is the cache of downloaded feeds). Since the configuration file is also looked up there (the --config switch lets you specify an additional configuration file, but does not change the basic lookup), I replaced it with a symbolic link, and the file itself went to /etc along with the templates.
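
The layout described above can be sketched roughly as follows. The helper name setup_rawdog_layout, the optional prefix argument (handy for trying it out without root), and the exact location /etc/rawdog/config are assumptions for illustration; the text only says that the real file lives somewhere under /etc.

```sh
#!/bin/sh
# Sketch: keep the real configuration under /etc and expose it to rawdog
# through a symbolic link in the working directory.
# setup_rawdog_layout and its prefix argument are hypothetical.
setup_rawdog_layout() {
    prefix="${1:-}"                       # empty prefix means the real filesystem
    etc_dir="$prefix/etc/rawdog"          # assumed home of the real config
    work_dir="$prefix/var/cache/rawdog"   # directory later passed to --dir
    mkdir -p "$etc_dir" "$work_dir" || return 1
    # Create an empty configuration file if none exists yet.
    [ -e "$etc_dir/config" ] || : > "$etc_dir/config"
    # rawdog looks for "config" inside its working directory, so link it there.
    ln -s -f "$etc_dir/config" "$work_dir/config"
}
```

Calling setup_rawdog_layout with no argument (as root) would act on /etc and /var directly; passing a scratch directory lets you inspect the result first.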

A well-documented sample configuration file can be found on the Web, so I will only briefly outline the main directives:

maxarticles N sets the length of the resulting feed (the output is a single page, which can be inconvenient);
maxage T sets the time interval for which entries are shown in the resulting feed;
expireage T sets how long entries that have disappeared from the original RSS feed are kept. If this interval is shorter than maxage, then for a frequently updated feed, older entries will disappear from the results before the usual term expires;
pagetemplate FILEPATH and itemtemplate FILEPATH specify the template file for the page as a whole and for an individual entry, respectively. The value default selects a simple built-in template;
outputfile FILEPATH sets where the results are written. Setting up a web server to serve this static page is beyond the scope of this article (I use lighttpd, for example). Just make sure that rawdog has write access to this file (not a problem if the utility is run from cron with root privileges) and that the web server has read access;
feed INTERVAL URL [PARAMS] adds an RSS feed to be polled at the given interval (since rawdog is usually called from cron, it simply skips feeds that are not yet due if it is called earlier). Among the parameters, id (see below) and http_proxy are worth highlighting; the latter sets a proxy server for a specific feed (in case you want something unusual, such as aggregating RSS feeds from Tor, or simply from a site that has been run over by Roskomnadzor);
include FILEPATH includes another configuration file.
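
To make the directives above more concrete, here is a minimal sketch that writes a sample configuration file. The helper name write_example_config, all URLs, paths and ids are made up for illustration. The time suffixes (h, d) and the two ways of passing feed parameters (key=value on the feed line, or indented lines below it) are given as I remember them from the documented sample configuration, so check them against that file before relying on them.

```sh
#!/bin/sh
# Sketch: write a small rawdog configuration to the given path.
# write_example_config is a hypothetical helper; URLs and paths are placeholders.
write_example_config() {
    conf="$1"
    cat > "$conf" <<'EOF'
# Show at most 200 entries, none older than a week.
maxarticles 200
maxage 7d
# Keep entries for a week after they vanish from the source feed.
expireage 7d
pagetemplate default
itemtemplate default
outputfile /var/www/rawdog/index.html
# Poll hourly; the id is given in key=value form on the feed line.
feed 1h http://example.org/feed.rss id=example
# Poll every three hours through a local proxy; parameters on indented lines.
feed 3h http://example.net/feed.rss
    id examplenet
    http_proxy http://127.0.0.1:8118/
EOF
}
```

Here expireage is deliberately set no shorter than maxage, so entries do not drop out of the results early.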

Configure logrotate

Since rawdog is usually called several times a day and writes about a kilobyte of logs each time, it makes sense either to disable logging altogether (by removing the --log option) or to configure logrotate. For the latter, it is enough to put something like this in /etc/logrotate.d/ (assuming you chose the same log file path as I did):

/var/log/rawdog/rawdog.log {
	weekly
	missingok
	rotate 5
	compress
	delaycompress
	notifempty
}

Making it pretty

rawdog's built-in template is minimalistic, to put it mildly, so it makes sense to supply your own template files. The most important one is pagetemplate, because that is where you can set styles and pull in any scripts you need. To see the default page template, use the following command (be sure to add --dir WORKDIR if you, like me, moved the working directory):

rawdog -s pagetemplate > template.html

Any built-in template can be viewed with a similar command by replacing pagetemplate with the name of the template. Templating is implemented as simple search and replace, although there is a conditional operator that lets you insert a stub when a value is missing. By the way, you can define your own variables with the define VARNAME VALUE directive (globally) or the define_VARNAME=VALUE parameter (for an individual RSS feed).
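
A custom item template might look roughly like the sketch below. The placeholder names (__title__, __description__, __feed_id__) and the __if_NAME__ ... __endif__ form of the conditional are written as I recall them from the built-in templates, so compare them with the output of rawdog -s itemtemplate before use. The variable siteicon is purely hypothetical: it would have to be set with define siteicon VALUE or with define_siteicon=VALUE on a feed.

```sh
#!/bin/sh
# Sketch: write a custom item template to the given path.
# write_item_template and the siteicon variable are hypothetical;
# verify the placeholder names against "rawdog -s itemtemplate".
write_item_template() {
    tmpl="$1"
    cat > "$tmpl" <<'EOF'
<div class="item feed-__feed_id__">
__if_siteicon__<img class="siteicon" src="__siteicon__" alt="">__endif__
<span class="itemtitle">__title__</span>
__if_description__<div class="itemdescription">__description__</div>__endif__
</div>
EOF
}
```

The resulting file would then be referenced from the configuration with itemtemplate FILEPATH.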

Note that by default each entry is marked with the CSS class feed-FEEDID, where FEEDID is the source id given in the feed parameters described above. This lets you style posts from different sources differently (for example, display the site icon next to the title).
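
As an illustration of per-source styling, the small helper below prints a CSS rule for one feed id. The function name feed_css_rule, the icon path, and the particular CSS properties are assumptions of mine; only the feed-FEEDID class name comes from rawdog.

```sh
#!/bin/sh
# Sketch: print a CSS rule that puts an icon next to entries of one feed.
# feed_css_rule is a hypothetical helper; the properties are just one possible design.
feed_css_rule() {
    feed_id="$1"     # the id parameter of the feed
    icon_url="$2"    # URL of the icon to show
    printf '.feed-%s {\n' "$feed_id"
    printf '    background: url("%s") no-repeat left top;\n' "$icon_url"
    printf '    padding-left: 24px;\n'
    printf '}\n'
}
```

For example, feed_css_rule example /icons/example.png >> style.css would append a rule for the feed whose id is example.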

Grouping feeds into separate editions

Offhand, I can think of one approach that makes it relatively easy to maintain several coexisting feed collections, each with its own set of subscriptions, output file, and design.

To do this, put something like the following in cron.* instead of the call described above:

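#!/bin/sh
# Run rawdog once for every *.conf file in $CONFIGS, each in its own working directory.
WORKDIRS=/var/cache/rawdog
CONFIGS=/etc/rawdog
PLUGINS=/usr/share/rawdog/plugins
LOGS=/var/log/rawdog
for CFG in "$CONFIGS"/*.conf
do
    # Skip the unexpanded pattern when there are no *.conf files.
    [ -f "$CFG" ] || continue
    # The working directory is named after the configuration file.
    WORKDIR="$WORKDIRS/$(basename "$CFG" .conf)"
    [ -d "$WORKDIR" ] || mkdir -p "$WORKDIR"
    # rawdog reads the file named "config" in its working directory.
    [ -f "$WORKDIR/config" ] || ln -s -f "$CFG" "$WORKDIR/config"
    # Share one plugin directory between all collections, if it exists.
    if [ -d "$PLUGINS" ]
    then
        [ -d "$WORKDIR/plugins" ] || ln -s -f "$PLUGINS" "$WORKDIR/plugins"
    fi
    rawdog --dir "$WORKDIR" --log "$LOGS/rawdog" --no-lock-wait --update --write
done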

The principle of operation is simple: for each *.conf file in /etc/rawdog, a corresponding working subdirectory is created in /var/cache/rawdog (if necessary), and a link to the configuration file is placed in it. A link to the directory with shared plugins is placed there as well (if it is not there already).
For greater convenience, you can move the common settings into a separate file (/etc/rawdog/config or /etc/default/rawdog) and pull it into the *.conf files with the include directive.
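
A minimal sketch of that arrangement is shown below. The helper name write_collection, the collection name, the output path and the feed URL are all hypothetical, and the directive values are only examples.

```sh
#!/bin/sh
# Sketch: one shared settings file plus a per-collection *.conf that includes it.
# write_collection is a hypothetical helper; paths and URLs are placeholders.
write_collection() {
    etc_dir="$1"   # e.g. /etc/rawdog
    name="$2"      # collection name, e.g. news
    mkdir -p "$etc_dir" || return 1
    # Shared settings; the file name has no .conf suffix, so the loop above skips it.
    cat > "$etc_dir/config" <<'EOF'
maxarticles 200
maxage 7d
expireage 7d
EOF
    # Per-collection file: pull in the shared settings, then add what differs.
    cat > "$etc_dir/$name.conf" <<EOF
include $etc_dir/config
outputfile /var/www/rawdog/$name.html
feed 1h http://example.org/feed.rss id=example
EOF
}
```

The include path is written as an absolute path here so that it does not depend on which working directory rawdog was started with.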

Extending with plugins

rawdog looks for Python scripts in the plugins subdirectory of its working directory. A number of ready-made plugins (in particular, for multi-page output and for output in RSS format) can be found on the author's website.
