Glitches in Python libraries or not?

Published on July 11, 2008

    The other day I wrote a web spider in Python. The task is simple in general, but the load is serious, so I actually have to run five spiders (in five threads), and several initial conditions complicate things further. The solution turned out to be interesting, and along the way I got to dig through the guts of the standard Python libraries socket, httplib and urllib2 (if anyone is interested, I can describe that experience).



    What I want to talk about now is where the habit of not keeping track of the objects you create, a habit instilled by garbage-collected languages, can lead. While monitoring my spider, I noticed many sockets in the CLOSE_WAIT state on the system. This state means that the server has already closed its side of the connection, but the socket is still open on our side. Roughly speaking, the close method was never called on the socket, and the object itself is still hanging around somewhere in memory.
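
    To make the symptom concrete, here is a minimal sketch that reproduces the situation with plain sockets on the loopback interface. It is not taken from the spider; the function name half_closed_demo is made up for illustration, and the code is written for Python 3. The peer closes first, recv reports end of stream, yet our descriptor stays open until we call close ourselves. During that window the operating system keeps the connection in CLOSE_WAIT.

```python
import socket


def half_closed_demo():
    """Return (data_read, client_fd_still_open) after the peer closes first."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    server_side.close()                  # the "server" closes its end first
    data = client.recv(1024)             # an empty result means the peer has closed
    still_open = client.fileno() != -1   # our own descriptor has not been released

    client.close()                       # only an explicit close releases it
    listener.close()
    return data, still_open


data, still_open = half_closed_demo()
print(repr(data), still_open)
```

    Putting a pause before client.close() and looking at netstat in another terminal should show the connection sitting in CLOSE_WAIT, which is the state observed on the spider's machine.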

    After rummaging through urllib2, httplib and socket, I pieced together the following picture of how they work:

    1. To load a page, urllib2.OpenerDirector.open is called.
    2. It calls the method urllib2.HTTPHandler.http_open, which in turn calls urllib2.AbstractHTTPHandler.do_open.
    3. In do_open, an object h of type httplib.HTTPConnection is created to do the actual communication. An important point: this object disappears when do_open returns!
    4. h creates and opens a socket, storing it in its attribute self.sock.
    5. h sends the request to the server.
    6. do_open asks h for the server's response and receives an object r of type httplib.HTTPResponse.
    7. When this object is created from the socket h.sock, it creates a file object self.fp by calling h.sock.makefile; the application will read the data through it. Again, the important point is that the socket object passed to the constructor is not saved anywhere.
    8. do_open wraps the received HTTPResponse in a service object and returns it to the application.
    9. The application reads the data and closes the HTTPResponse.
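
    Step 7 is the crux, so here is a small sketch of the general idea using plain sockets rather than the urllib2 internals. It assumes Python 3's socket module, and the function name read_after_wrapper_closed is made up for illustration. A file object obtained from makefile keeps the connection usable even after the socket object it came from has been closed or dropped, so whoever holds the file object also holds the connection.

```python
import socket


def read_after_wrapper_closed():
    """Read through a makefile() object after its socket wrapper is closed."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    server_side.sendall(b"hello")
    server_side.close()

    fp = client.makefile("rb")           # plays the role of HTTPResponse.fp
    client.close()                       # the wrapper is gone, like h.sock after do_open
    body = fp.read()                     # data is still readable through the file object
    fp.close()                           # in Python 3 this releases the connection
    listener.close()
    return body


body = read_after_wrapper_closed()
print(repr(body))
```

    The sketch only shows that the lifetime of the connection follows the file object, not the socket wrapper. How and when the 2.5 library objects let go of that file object is a separate matter that this example does not model.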


    Thus, the socket object itself (the wrapper around the real socket) may no longer exist. At least, there are no references to it anywhere. But the socket itself is still alive! Nobody called close on it! So far only one option has come to mind: after reading is finished, close the socket manually through a handful of internal references, with "self-explanatory" code like this:
    tf.fp._sock.fp._sock.close()
    

    Here tf is the object returned by urllib2.urlopen. So it goes :) By the way, this is already in 2.5; 2.4 has a couple of even worse bugs. I would be glad of any tips on how to deal with this behavior properly.