First Steps in Python Programming
A couple of months ago I started learning Python. After reading about data structures, string handling, generators, and the basics of OOP, I started thinking about a useful program I could write to apply all of this to a real task.
By a happy coincidence, friends contacted me with a request to download the cartoon TaleSpin (Russian title: “Чудеса на виражах”).
Get to the point
Having visited one of the popular trackers in UA-IX, I found the cartoon, but each episode was uploaded separately, and I did not want to click the “Download” button 65 times. That was when I remembered Python.
I immediately started looking for information on how to fetch files from a site. Google and the well-known Stack Overflow quickly provided the answer: you can “pull” files by importing a library and adding a couple of lines. After testing on small files to see how it all worked, I moved on to the next stage: collecting all the download links and their corresponding file names.
The links and file names did not appear inside the same tag anywhere, so I collected them separately.
To collect the links I used the lxml library, which has already been covered on this site. After downloading and installing it, I got to work on the program itself. The code is shown below:
#!/usr/bin/env python
import urllib
import lxml.html

load = 'load'
page = urllib.urlopen('http://www.***.ua/view/12345678')
doc = lxml.html.document_fromstring(page.read())
for link in doc.cssselect('p span.r_button_small a'):
    # Skip anchors with no text and anchors that are not download links
    if link.text is None:
        continue
    if load not in link.get('href'):
        continue
    print 'http://***.ua' + link.get('href')
All collected links were saved to a file for further work. The if statements filtered the data, so I ended up with only the links actually used to download files to the computer.
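The filtering step can be shown on its own. Below is a minimal Python 3 sketch of the same idea; the (text, href) pairs are invented for illustration and stand in for the anchors the script extracts from the page:

```python
# Keep only anchors that have link text and whose href contains 'load',
# mirroring the two `if ... continue` filters in the script above.
anchors = [
    ('Episode 1', '/load/111'),
    (None, '/load/222'),      # no link text: skipped
    ('Forum', '/forum/333'),  # not a download link: skipped
    ('Episode 2', '/load/444'),
]

links = ['http://***.ua' + href
         for text, href in anchors
         if text is not None and 'load' in href]

print(links)
```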
The original file names were not very convenient, so as the program read each name it immediately replaced it with a friendlier one. As a result, every file got a name of the form “Чудеса на виражах. Серия XX” (“TaleSpin. Episode XX”), where XX is the episode number.
Program code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib
import lxml.html

file_name = u'Чудеса на виражах. Серия '
episode = 0
page = urllib.urlopen('http://www.***.ua/view/12345678')
doc = lxml.html.document_fromstring(page.read())
for name in doc.cssselect('tr td a'):
    if name.text is None:
        continue
    if not name.text.endswith('.avi'):
        continue
    # Replace the original name with "Серия NN" plus the ".avi" extension
    name.text = file_name + str(episode) + name.text[-4:]
    print name.text.encode('utf8')
    episode += 1
Since I was using the Python 2.6 interpreter, I had to call the encode method for the Cyrillic text to come out correctly. The collected names were also saved to a file.
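For comparison, in Python 3 strings are Unicode by default, so the explicit encode call is only needed when raw bytes are wanted; a small sketch:

```python
# In Python 2 a unicode string had to be encoded before printing:
#     print name.encode('utf8')
# In Python 3, str already holds Unicode text; encode() produces bytes.
name = u'Чудеса на виражах. Серия 7'
data = name.encode('utf8')

assert isinstance(data, bytes)
assert data.decode('utf8') == name
```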
After running both programs, there were two text files on the hard drive: one held the download links, the other the episode names.
To tie each link to its file name I used a dictionary: the link was the key, and the file name was stored as its value. After that, all that remained was to take each key, pass it to the download function, and specify the destination path and file name.
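The pairing step can be shown in isolation. In the sketch below the two lists of lines are hypothetical and stand in for what iterating over the two text files would yield:

```python
# Each line of the links file becomes a key; the matching line of the
# names file becomes its value, exactly what dict(zip(...)) produces.
link_lines = ['http://***.ua/load/111\n', 'http://***.ua/load/444\n']
name_lines = [u'Серия 1.avi\n', u'Серия 2.avi\n']

download = dict(zip(link_lines, name_lines))

# Values are stripped with rstrip() right before use, as in the script.
first_name = download['http://***.ua/load/111\n'].rstrip()
```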
The code that performs these actions:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib

path = '/media/6A9F550C59BC1824/TaleSpin/'
links = open('link', 'r')
names = open('file_name', 'r')
# Pair each link line with the matching name line
download = dict(zip(links, names))
loadf = []  # names already downloaded in this run
for link in download.iterkeys():
    name = download[link].rstrip()
    if name not in loadf:
        # Strip the trailing newline from the link before requesting it
        urllib.urlretrieve(link.rstrip(), path + name)
        loadf.append(name)
The script also keeps a list of the episode names that have already been downloaded, so that if the download is interrupted and restarted, the episodes already on the hard disk are not fetched again.
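Since that list lives only in memory, it is lost when the script itself restarts. One way to make resumption survive a restart (not in the original code, just a sketch) is to check the destination directory directly:

```python
import os

def need_download(name, directory):
    # Skip the download if a file with this name is already on disk;
    # unlike an in-memory list, this check survives a script restart.
    return not os.path.exists(os.path.join(directory, name))
```

The main loop would then call need_download(name, path) before urlretrieve.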
Conclusion
Writing all this code may well have taken longer than clicking the Download button 65 times by hand, but a working program brought far more pleasure, and some new knowledge besides.
Used materials
- "LXML" or how to parse HTML with ease
- Official lxml documentation
- Urllib library documentation
- Python Tips, Tricks, and Hacks (Part 2)
Thank you for your attention.
The site address is hidden so as not to be considered advertising.