Four Methods for Downloading Images from a Website Using Python

    Recently, for work, I had to write a simple parser in Python that downloads images from a site (in principle, the same parser can download not only images but also files of other formats) and saves them to disk. In total, I found four methods on the Internet, and in this article I have collected them all together.

    These methods are:

    1st method

    The first method uses the urllib module (or urllib2). Suppose img holds a link to some image. The method looks like this:

    import urllib
    resource = urllib.urlopen(img)
    out = open("...\img.jpg", 'wb')
    out.write(resource.read())
    out.close()
    


    Note here that the file must be opened in binary write mode, 'wb', and not just 'w'.
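
    The snippets in this article are written for Python 2; purely as a sketch (not part of the original measurements), the same approach in Python 3, where urlopen has moved to urllib.request, would look roughly like this:

    # Minimal Python 3 sketch of the first method; "img.jpg" is a placeholder path
    from urllib.request import urlopen

    resource = urlopen(img)
    with open("img.jpg", "wb") as out:  # binary write mode, as noted above
        out.write(resource.read())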

    2nd method

    The second method uses the same urllib. As will be shown below, it is slightly slower than the first (though how much that speed difference really matters for parsing is debatable), but it deserves attention for its brevity:

    import urllib
    urllib.urlretrieve(img, "...\img.jpg")
    


    It is also worth noting that the urlretrieve function is missing from the urllib2 library, for reasons unknown to me (perhaps someone can explain why in the comments).
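
    For what it's worth, in Python 3 the same one-liner is available as urllib.request.urlretrieve (kept there as a legacy interface); a minimal sketch, with "img.jpg" again a placeholder path:

    # Python 3 sketch of the second method via the legacy urlretrieve helper
    from urllib.request import urlretrieve

    urlretrieve(img, "img.jpg")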

    3rd method

    The third method uses the requests module. Its download speed is of the same order as that of the first two methods:

    import requests
    p = requests.get(img)
    out = open("...\img.jpg", "wb")
    out.write(p.content)
    out.close()
    

    In general, when working with the web in Python, it is often recommended to use requests instead of the urllib and httplib families because of its conciseness and ease of use.
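
    For larger files, a common variation on this method (not measured here, shown only as a sketch) is to stream the response in chunks so the whole body is not held in memory at once:

    import requests

    # Download the image in 8 KB chunks instead of reading it all at once;
    # "img.jpg" is a placeholder output path
    r = requests.get(img, stream=True)
    r.raise_for_status()
    with open("img.jpg", "wb") as out:
        for chunk in r.iter_content(chunk_size=8192):
            out.write(chunk)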

    4th method

    The fourth method differs from the previous ones in speed by a whole order of magnitude. It is based on the httplib2 module and looks like this:

    import httplib2
    h = httplib2.Http('.cache')
    response, content = h.request(img)
    out = open('...\img.jpg', 'wb')
    out.write(content)
    out.close()
    


    Caching is used explicitly here. Without caching (h = httplib2.Http()), the method runs 6-9 times slower than the previous ones.
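
    To make the comparison explicit, the only difference between the two variants is whether a cache directory is passed to the constructor; a short sketch:

    import httplib2

    # With a disk cache: repeated requests for the same URL can be served
    # from the '.cache' directory, which is what makes this method so fast here
    h_cached = httplib2.Http('.cache')

    # Without a cache: every request goes over the network
    h_plain = httplib2.Http()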

    Speed testing was done by downloading images with the *.jpg extension from the lenta.ru news site. The images matching this criterion were selected, and the program's execution time measured, as follows:

    import re, time, urllib2
    url = "http://lenta.ru/"
    content = urllib2.urlopen(url).read()
    imgUrls = re.findall('img .*?src="(.*?)"', content)
    start = time.time()
    for img in imgUrls:
        if img.endswith(".jpg"):
            """implementation of the method that downloads the image from the url"""
    print time.time()-start
    


    The constantly changing pictures on the site did not affect the fairness of the measurements, since the methods were run one right after another. The results are as follows:

    Method speed comparison table

    Method 1, s    Method 2, s    Method 3, s    Method 4, s (without caching, s)
    0.823          0.908          0.874          0.089 (7.625)

    Each value is the average of seven measurements.
    I would ask those who have worked with the Grab library (and others) to post in the comments a similar way of downloading images with that and other libraries.
