We continue to parse RSS, now Kinozal's, using grep and wget/curl
- Tutorial

In my previous post about automating the download of new episodes from the LostFilm RSS feed, Habr user AmoN raised a fair point: my method cannot handle trackers whose RSS feeds contain no direct links to the torrent files, so those still have to be downloaded by hand. The Kinozal tracker is one example. This post is dedicated to solving that problem ;)
Instead of an introduction
Let me briefly recap the essence of the last post. Many popular torrent clients let you configure watch folders in their settings; the client monitors these folders and automatically starts a download whenever a new file appears in them. The shell script written earlier periodically reads the tracker's RSS feed, selects the releases we are interested in, and saves their torrent files into the watch folder.
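For instance (an assumed example, not from the original post), in classic rTorrent such a watch folder is declared in ~/.rtorrent.rc roughly like this; the path reuses this post's examples, so check your own client's documentation for the exact syntax:

# Every 5 seconds, load and start any .torrent file that has appeared
# in the watch folder (path reused from this post's examples).
schedule = watch_directory,5,5,load_start=~/.config/watch_dir/*.torrent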
What's in a name?
In the previous method, selecting and filtering the RSS feed came down to matching the torrent file links against a regular expression. For example, even a brief glance at a link like
http://www.lostfilm.tv/download.php/2035/Lost.s06e07.rus.PROPER.LostFilm.TV.torrent
immediately tells you which show, season, and episode it is. However, as AmoN correctly noted, not all tracker RSS feeds contain direct links to torrent files, which somewhat complicates our task of automating downloads. That is exactly what prompted this post :)
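For reference, the earlier approach boiled down to something like the sketch below. It is an illustrative reconstruction rather than the exact script from that post: the feed URL, the episode pattern, and the watch-folder path are placeholders.

# Sketch of the previous method: the feed already carries direct
# .torrent links, so a single grep pass over it is enough.
curl -s http://www.lostfilm.tv/rss.xml \
  | grep -ioe 'http.*Lost\.s06e[0-9]*\..*\.torrent' \
  | wget -nc -qi - -P ~/.config/watch_dir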
Well then, let's get started
To begin with, I took a careful look at the format of the feed in question, and here is what I saw: the link not only omits the name of the release, it is not even a direct link to the torrent file. So, to get the torrent file itself, you have to follow the link and pick up a direct file link from the loaded page. A typical feed item looks like this:
The 3 Great Tenors - VA / Classic / 2002 / MP3 / 320 kbps
Раздел: Музыка - Буржуйская (Section: Music - Foreign)
http://kinozal.tv/details.php?id=546381
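By the way, to study the feed format yourself, it is enough to dump its beginning; head is only there to keep the output short:

# Peek at the first feed entries to see what the items look like.
curl -s http://kinozal.tv/rss.xml | head -n 40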
We develop a plan
With a little thought, I came up with the following algorithm:
- we read the RSS feed http://kinozal.tv/rss.xml and use grep to select the releases we are interested in by their description:

curl -s http://kinozal.tv/rss.xml | grep -iA 2 'MP3'

where "-s" tells curl to be quiet (no progress output),
"-i" makes the search case-insensitive,
"-A 2" tells grep to print, along with each matched line, the two lines that follow it (it is exactly those lines that contain the link we need)
- among the lines selected by grep, we keep only the links:

grep -ioe 'http.*[0-9]'

- we open a loop over all the links found:

for i in ... ; do ... ; done

where in place of the list, using the "magic" backquotes `...`, we substitute the combined result of the two previous steps:

for i in `curl -s http://kinozal.tv/rss.xml | grep -iA 2 'MP3' | grep -ioe 'http.*[0-9]'`; do ... ; done

- inside the loop, we load the page behind each link and, once again with grep, pull the link to the torrent file out of it:

curl -sb "uid=***; pass=***; countrys=ua" $i | grep -m 1 -ioe 'download.*\.torrent'

where "-b "uid=***; pass=***; countrys=ua"" passes the cookies that carry the authorization information,
"-m 1" keeps only the first of the two direct links to the torrent file (yes, on Kinozal's release description pages the link to the same file occurs twice).
Note that neither the password nor the uid is transmitted in clear text here! Their values can be looked up in your browser's cookie viewer or, for example, with a FireFox plugin.
- we download the torrent files with wget:

wget -nc -qi - -B "http://kinozal.tv/" -P ~/.config/watch_dir --header "Cookie: uid=***; pass=***; countrys=ua"

where I will single out "-B "http://kinozal.tv/"", which sets the base prefix/domain for downloading relative links (and relative links are exactly what the release description pages contain),
and "--header "Cookie: uid=***; pass=***; countrys=ua"", which sets a header for the GET request (this time I felt like passing the cookies this way rather than through a file :)).
Of the remaining options, "-nc" skips files that already exist, "-q" keeps wget quiet, "-i -" reads the list of URLs from standard input, and "-P" sets the directory the files are saved into.
- we go back to the beginning of the loop
And what do we have
And in the end we get this "simple" command:

for i in `curl -s http://kinozal.tv/rss.xml | grep -iA 2 'mp3' | grep -ioe 'http.*[0-9]'`; do curl -sb "uid=***; pass=***; countrys=ua" $i | grep -m 1 -ioe 'download.*\.torrent' | wget -nc -qi - -B "http://kinozal.tv/" -P ~/.config/watch_dir --header "Cookie: uid=***; pass=***; countrys=ua"; done

And for complete happiness, put this command on a cron schedule:

*/15 * * * * our_command > /dev/null 2>&1

That is all for now; allow me to take my leave :)
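For readability, the same pipeline can also be written out as a small script. It is equivalent to the one-liner above; the cookie values and the watch-folder path remain placeholders:

#!/bin/sh
# The pipeline from the one-liner above, spread out for readability.
FEED='http://kinozal.tv/rss.xml'
COOKIES='uid=***; pass=***; countrys=ua'
WATCH_DIR="$HOME/.config/watch_dir"

# 1. Fetch the feed and keep only the links of releases whose
#    description mentions mp3.
for i in $(curl -s "$FEED" | grep -iA 2 'mp3' | grep -ioe 'http.*[0-9]'); do
    # 2. Load the description page, take the first .torrent link on it,
    #    and let wget resolve that relative link against the site root
    #    and save the file into the watch folder (skipping existing ones).
    curl -sb "$COOKIES" "$i" \
        | grep -m 1 -ioe 'download.*\.torrent' \
        | wget -nc -qi - -B 'http://kinozal.tv/' -P "$WATCH_DIR" \
               --header "Cookie: $COOKIES"
done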
UPD. In the comments to the previous post in this series, several interesting suggestions were made for reducing the load on the server:
habrahabr.ru/blogs/p2p/87042/#comment_2609116 (checking for the existence of files)
habrahabr.ru/blogs/p2p/87042/#comment_2609714 (using Last-Modified and ETag)
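As a rough illustration of the Last-Modified idea (my own sketch, not taken from those comments), curl's time-condition option can skip re-downloading the feed when it has not changed; the cache path is arbitrary:

# Re-download the feed only when the server reports a newer version:
# -z sends an If-Modified-Since header based on the local file's mtime.
RSS="$HOME/.cache/kinozal-rss.xml"
if [ -f "$RSS" ]; then
    curl -s -z "$RSS" -o "$RSS.new" http://kinozal.tv/rss.xml
    # Keep the cached copy unless something was actually downloaded.
    [ -s "$RSS.new" ] && mv "$RSS.new" "$RSS"
    rm -f "$RSS.new"
else
    curl -s -o "$RSS" http://kinozal.tv/rss.xml
fi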
UPD2. On the advice of apatrushev, replaced "head -1" with grep's "-m 1" option.