Parsing the LostFilm RSS feed with grep and sending it to download via wget

At some point I got tired of manually checking LostFilm for newly released episodes and decided to automate the process. Many BitTorrent clients have a so-called watch directory in their settings: as soon as a new torrent file appears in that folder, the client immediately starts downloading it. A common practice, for example, is to create such a folder and open FTP write access to it. All that remains is to automatically drop the torrent file into that folder whenever a new episode is released, so that the download starts on its own. I'll show you how to do just that.

For reference, the watch folder in Transmission is set with the options watch-dir-enabled and watch-dir, and in rTorrent you need to add the following line to the configuration file:
schedule = watch_directory,5,5,load_start=./watch/*.torrent

Point one

So, first of all, we need to get the RSS feed from LostFilm. To do this, we use the wget command:

wget -qO - http://www.lostfilm.tv/rssdd.xml

here: the option " -q" tells wget not to display information about its work, i.e. "be quiet";
" -O -" makes wget write the downloaded feed to the standard output stream instead of a file. This is done so that the received data can be piped to the input of the grep filter.

Point two

Now we need to select all links to torrent files from the received feed. To do this, we ask grep to search for substrings matching the regular expression 'http.*torrent'. Here the dot means "any character" and the asterisk means "repeated any number of times". That is, we find every string that starts with "http" and ends with "torrent", which will be the links to torrent files. The command itself looks like this:

grep -ioe 'http.*torrent'

where " -i" makes the search case-insensitive,
" -o" prints only the matched part of the line (this strips the tags that surround the link),
" -e" specifies the regular expression to search for.

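To see what this filter does without touching the network, here is a small sketch run against a made-up feed fragment. The item markup and the second URL are assumptions for illustration; the first URL is the one quoted in this article. Each link sits on its own line, which matters because ".*" is greedy: two links on one line would come out as a single match.

```shell
# Made-up feed fragment (assumption): one <item> per line.
sample_feed='<item><title>The Oscars</title><link>http://lostfilm.tv/download.php/2030/The.Oscars.The.Red.Carpet.2010.rus.LostFilm.TV.torrent</link></item>
<item><title>Lost</title><link>http://lostfilm.tv/download.php/2031/Lost.s06e07.rus.LostFilm.TV.torrent</link></item>'

# Same filter as in the article: keep only the part from "http" to "torrent".
links=$(printf '%s\n' "$sample_feed" | grep -ioe 'http.*torrent')

# Prints one bare URL per line, without the surrounding tags.
printf '%s\n' "$links"
```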
Point three

Once we have found all the links to torrent files, we need to keep only the ones we are interested in. For example, I like the series Lost, House MD, Lie to Me and Spartacus, so I will use them to show how the filtering works. All links to torrent files in the LostFilm RSS feed have the form:

http://lostfilm.tv/download.php/2030/The.Oscars.The.Red.Carpet.2010.rus.LostFilm.TV.torrent

Thus, to select the series I am interested in, I use the following regular expression: '[0-9]{4}/(lost|house|lie|spartacus)'. It looks for 4 digits in a row ("[0-9]{4}", where the number of repetitions is given in curly brackets), followed by a slash, and then one of the four series names ("(lost|house|lie|spartacus)", where the character "|" reads as OR). However, grep uses basic regular expression syntax by default, in which these special characters must be escaped with "\". In total, we have:

grep -ie '[0-9]\{4\}/\(lost\|house\|lie\|spartacus\)'
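As a possible alternative to the backslashes, grep's -E option switches to extended regular expressions, where braces, parentheses and "|" work unescaped. This is a swapped-in variant, not the command used in this article; it may also be the more portable one, since alternation with "\|" is not part of POSIX basic regular expressions, though GNU grep accepts it. The sketch below applies it to a hypothetical list of links (only the first URL is taken from the article).

```shell
# Hypothetical input: the second and third URLs are invented for illustration.
links='http://lostfilm.tv/download.php/2030/The.Oscars.The.Red.Carpet.2010.rus.LostFilm.TV.torrent
http://lostfilm.tv/download.php/2031/Lost.s06e07.rus.LostFilm.TV.torrent
http://lostfilm.tv/download.php/2032/House.MD.s06e15.rus.LostFilm.TV.torrent'

# Extended syntax (-E): the same pattern as in the article, without backslashes.
matched=$(printf '%s\n' "$links" | grep -iE '[0-9]{4}/(lost|house|lie|spartacus)')

# The Oscars link is dropped because "2030/" is not followed by a listed name.
printf '%s\n' "$matched"
```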

Point four

Now we are left with only the links to the torrent files of the series we are interested in, and we just have to download them into the watch folder of our torrent client. The catch is that LostFilm will not let you download files without authorization: you need to send cookies with the authorization information along with the GET request. Fortunately, wget can load cookies from a specified file. Let's look at the wget call:

wget -nc -qi - -P ~/.config/watch_dir --load-cookies=~/.config/cookies.txt

where the option " -nc" tells wget not to download files again if we already have them on disk,
" -q" is the option discussed above, telling wget to "be quiet",
" -i -" reads the list of URLs to download from the standard input stream,
" -P ~/.config/watch_dir" specifies our watch folder, where the files will be saved,
" --load-cookies=~/.config/cookies.txt" uses the cookies from the specified file.

The cookie file has the following format:

.lostfilm.tv TRUE / FALSE 2147483643 pass <your data>
.lostfilm.tv TRUE / FALSE 2147483643 uid <your data>

Note that neither the password nor the uid is transmitted in clear text! Their values can be seen in your browser's cookie viewer; alternatively, you can use a Firefox plugin to export all cookies to a file, and that is the file to pass to wget.
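If you would rather create the cookie file by hand, the following sketch writes the two lines in the cookies.txt layout, which separates its seven fields (domain, subdomain flag, path, secure flag, expiry, name, value) with tabs. The function name write_cookie_file and its arguments are hypothetical; the cookie values themselves have to come from your own browser session.

```shell
# Hypothetical helper: $1 = target path, $2 = uid cookie value, $3 = pass cookie value.
write_cookie_file() {
    (
        umask 077   # a newly created file is readable by its owner only
        # printf reuses the format for each name/value pair, producing two lines.
        printf '.lostfilm.tv\tTRUE\t/\tFALSE\t2147483643\t%s\t%s\n' \
            pass "$3" uid "$2" > "$1"
    )
}

# Example call with placeholder values:
# write_cookie_file "$HOME/.config/cookies.txt" 'YOUR_UID' 'YOUR_PASS_VALUE'
```

The umask only affects a file that does not exist yet; permissions of an existing cookies.txt are left as they were.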

Last item

And now all together:

wget -qO - http://www.lostfilm.tv/rssdd.xml | grep -ioe 'http.*torrent' | grep -ie '[0-9]\{4\}/\(lost\|house\|lie\|spartacus\)' | wget -nc -qi - -P ~/.config/watch_dir --load-cookies=~/.config/cookies.txt
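The one-liner can also be kept in a small script, which makes the list of series and the paths easier to change. The sketch below is one way to arrange it, not the article's own method: the variable and function names are assumptions, the series filter uses grep -E instead of escaped basic syntax, and $HOME replaces "~" so that the result does not depend on how the shell expands a tilde inside --load-cookies=.

```shell
#!/bin/sh
# Hypothetical wrapper around the one-liner; all names and defaults are assumptions.
FEED_URL=${FEED_URL:-http://www.lostfilm.tv/rssdd.xml}
WATCH_DIR=${WATCH_DIR:-$HOME/.config/watch_dir}
COOKIES=${COOKIES:-$HOME/.config/cookies.txt}
SERIES=${SERIES:-'lost|house|lie|spartacus'}

# Step one: print the RSS feed to standard output.
fetch_feed() { wget -qO - "$FEED_URL"; }

# Step two: keep only the torrent links.
extract_links() { grep -ioe 'http.*torrent'; }

# Step three: keep only the links of the series listed in SERIES.
filter_series() { grep -iE "[0-9]{4}/($SERIES)"; }

# Step four: download the links read from standard input into the watch folder.
download_all() { wget -nc -qi - -P "$WATCH_DIR" --load-cookies="$COOKIES"; }

run_pipeline() { fetch_feed | extract_links | filter_series | download_all; }

# Uncomment to run the pipeline when this file is executed as a script:
# run_pipeline
```

Saved under a name of your choice and made executable, the script can be called from any place where the one-liner would otherwise be pasted.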

The final item :)

And now, to complete the automation, we put all of this into cron:

*/15 * * * * wget -qO - http://www.lostfilm.tv/rssdd.xml | grep -ioe 'http.*torrent' | grep -ie '[0-9]\{4\}/\(lost\|house\|lie\|spartacus\)' | wget -nc -qi - -P ~/.config/watch_dir --load-cookies=~/.config/cookies.txt > /dev/null 2>&1

where " > /dev/null 2>&1" suppresses the output of the command, so that cron does not generate an email with the output.

UPD. Added a follow-up article that discusses RSS feeds which do not contain direct links to torrent files.

UPD2. As was rightly noted in the comments, this implementation contacts the server every time, even if no new data has appeared on it.

Habr user nebulosa suggested in a comment "checking for the existence of files so that wget does not hit the server every time",

and Guria, in turn, recommends: "so as not to parse and download the same thing, and not to bother the server for nothing, save the value of the Last-Modified header and pass it in the If-Modified-Since header. The server may also support ETag."
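Both suggestions can be sketched as small helper functions. Everything here is an assumption made for illustration: the function names, the idea of keeping the last Last-Modified value in a state file, and the premise that the torrent client leaves loaded torrent files in the watch folder (if it removes or renames them, a separate list of already fetched names would be needed).

```shell
#!/bin/sh
# Sketch only: function names and the state-file idea are assumptions.

# Read URLs on stdin; print only those whose file name is not yet in directory $1.
skip_existing() {
    dir=$1
    while IFS= read -r url; do
        [ -e "$dir/${url##*/}" ] || printf '%s\n' "$url"
    done
}

# Read HTTP response headers on stdin; print the Last-Modified value, if any.
extract_last_modified() {
    awk '{ sub(/\r$/, "") }
         tolower($0) ~ /^[ \t]*last-modified:/ { sub(/^[^:]*:[ \t]*/, ""); v = $0 }
         END { if (v != "") print v }'
}

# Print an If-Modified-Since header built from state file $1, if it has content.
if_modified_since_header() {
    [ -s "$1" ] && printf 'If-Modified-Since: %s\n' "$(cat "$1")"
    return 0
}
```

skip_existing would sit in the pipeline just before the final wget. The header line printed by if_modified_since_header can be passed to wget through its --header option, and wget's -S (--server-response) option prints the response headers from which extract_last_modified can pick the new value. How wget reports a "304 Not Modified" answer is worth checking in its manual before relying on this in cron.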

UPD3. If you have trouble passing cookies, there is another way. Replace the last wget call in the pipeline with this:

wget -nc -qi - -P ~/ --header "Cookie: uid=***; pass=***"
