Four Methods for Downloading Images from a Website Using Python
Recently, for work, I had to write a simple parser in Python that downloads images from a site and saves them to disk (in principle, the same parser can download files of other formats as well). In total, I found four methods on the Internet, and in this article I decided to collect them all in one place.
These methods are:
1st method
The first method uses the urllib module (or urllib2). Suppose img holds a link to some image. The method is as follows:
import urllib
# Open the URL; img holds the link to the image
resource = urllib.urlopen(img)
# Write the raw bytes to disk in binary mode
with open("...\img.jpg", 'wb') as out:
    out.write(resource.read())
Note that the file must be opened in binary write mode, 'wb', and not just 'w'.
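For readers on Python 3, where urlopen lives in urllib.request, the same idea might look like the sketch below. It is not the original code: the helper name download_with_urlopen and its parameters are hypothetical, and shutil.copyfileobj is swapped in for a single read() call so that the image is copied in chunks rather than held in memory all at once.

```python
import shutil
import urllib.request


def download_with_urlopen(img_url, dest_path):
    """Save the resource at img_url to dest_path (hypothetical helper)."""
    # urlopen returns a file-like response that works as a context manager
    with urllib.request.urlopen(img_url) as resource:
        # 'wb' opens the target file for binary writing
        with open(dest_path, "wb") as out:
            # Copy chunk by chunk instead of reading the whole body at once
            shutil.copyfileobj(resource, out)
    return dest_path
```

The function returns dest_path only so that a caller can chain it; the download itself happens as a side effect.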
2nd method
The second method uses the same urllib. As the measurements below show, this method is slightly slower than the first (whether a small loss of speed really counts against a parser is debatable), but it deserves attention for its brevity:
import urllib
urllib.urlretrieve(img, "...\img.jpg")
It is also worth noting that the urllib2 library has no urlretrieve function, for reasons unknown to me (perhaps someone can explain why).
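In Python 3 the function is available as urllib.request.urlretrieve, although the documentation describes it as a legacy interface. A minimal sketch, assuming Python 3; the helper name download_with_urlretrieve and the on_progress parameter are hypothetical and are there only to show the optional reporthook callback:

```python
import urllib.request


def download_with_urlretrieve(img_url, dest_path, on_progress=None):
    """Save img_url to dest_path and return the local filename."""
    # urlretrieve returns a (filename, headers) tuple; the optional
    # reporthook is called as reporthook(block_num, block_size, total_size)
    filename, _headers = urllib.request.urlretrieve(
        img_url, dest_path, reporthook=on_progress
    )
    return filename
```

Passing a callback such as a list's append wrapped in a lambda is one way to watch the progress of a large download; leaving on_progress as None keeps the call as short as the original one-liner.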
3rd method
The third method uses the requests module. Its download speed is of the same order as that of the first two methods:
import requests
p = requests.get(img)
# p.content is the response body as raw bytes
with open("...\img.jpg", "wb") as out:
    out.write(p.content)
Incidentally, for working with the web in Python, requests is recommended over the urllib and httplib families because it is more concise and easier to use.
4th method
The fourth method differs fundamentally in speed from the previous ones (by a whole order of magnitude). It is based on the httplib2 module:
import httplib2
# Responses are cached in the .cache directory
h = httplib2.Http('.cache')
response, content = h.request(img)
with open('...\img.jpg', 'wb') as out:
    out.write(content)
Caching is used explicitly here. Without caching (h = httplib2.Http()), the method runs 6-9 times slower than the previous ones.
Speed was tested by downloading images with the .jpg extension from the lenta.ru news site. The images matching this criterion were selected, and the program's execution time was measured, as follows:
import re, time, urllib2
url = "http://lenta.ru/"
content = urllib2.urlopen(url).read()
# Collect the src attribute of every img tag on the page
imgUrls = re.findall('img .*?src="(.*?)"', content)
start = time.time()
for img in imgUrls:
    if img.endswith(".jpg"):
        """implementation of the method that downloads the image from the url"""
# Total time spent on all downloads, in seconds
print time.time() - start
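The same measurement loop can be sketched in Python 3 with the download step passed in as a function, so that each of the four methods is timed by identical code. This is an illustration under stated assumptions, not the script used for the measurements: the names IMG_SRC_PATTERN, extract_jpg_urls and time_downloads are hypothetical, and time.perf_counter is swapped in for time.time because it is intended for measuring intervals.

```python
import re
import time

# Same pattern as in the script above: capture the src of each img tag
IMG_SRC_PATTERN = re.compile(r'img .*?src="(.*?)"')


def extract_jpg_urls(html):
    """Return the src of every img tag whose URL ends with .jpg."""
    return [src for src in IMG_SRC_PATTERN.findall(html) if src.endswith(".jpg")]


def time_downloads(img_urls, download):
    """Call download(url) for each URL and return the elapsed seconds."""
    start = time.perf_counter()
    for img in img_urls:
        download(img)
    return time.perf_counter() - start
```

Here download stands for any one of the four methods wrapped in a function that takes a single URL; the page text would be fetched once and handed to extract_jpg_urls before timing starts.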
The constantly changing images on the site did not distort the measurements, because the methods were run one right after another. The results are as follows:
Method 1, s | Method 2, s | Method 3, s | Method 4, s (without caching, s) |
---|---|---|---|
0.823 | 0.908 | 0.874 | 0.089 (7.625) |
The figures are averages over seven measurements.
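To put the "whole order" claim in numbers, the ratios can be worked out from the averaged figures in the table. A small sketch; the variable names are hypothetical and the values are copied from the table:

```python
# Averaged times in seconds, copied from the results table
cached_httplib2 = 0.089
uncached_httplib2 = 7.625
others = {"Method 1": 0.823, "Method 2": 0.908, "Method 3": 0.874}

ratios = {}
for name, seconds in others.items():
    speedup = seconds / cached_httplib2       # cached httplib2 vs. this method
    slowdown = uncached_httplib2 / seconds    # uncached httplib2 vs. this method
    ratios[name] = (speedup, slowdown)
    print(f"{name}: cached is {speedup:.1f}x faster, uncached is {slowdown:.1f}x slower")
```

On these averages, cached httplib2 comes out roughly 9 to 10 times faster than each of the other three methods.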
A request to those who have worked with the Grab library (or others): please post in the comments a similar method for downloading images with it or with other libraries.