Reading large amounts of data in Python / Postgresql
Technology stack under consideration: PostgreSQL 9.3, Python 2.7 with the psycopg2 module installed.
Problem:
How often in your practice have you had to deal with processing large tables (more than 10 million records)? I think you will agree that this task is quite resource-intensive, both in terms of processing time and the system resources involved. Today I will try to show an alternative way to solve the problem.
Proposal:
The PostgreSQL DBMS has an excellent statement for working with large volumes of data: COPY. It allows us to read and write huge volumes of data to and from a table; in this article we will consider the reading direction.
According to the COPY documentation, several output modes (to a file or to the STDOUT stream) and various formats, including csv, are available to us. We will try to use them to maximum benefit.
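Just to give an idea of both directions, here is a minimal sketch of how they look through psycopg2's copy_expert; the table demo_table, its columns and the connection string are placeholders of mine, not from the article:

# Minimal sketch of both COPY directions via psycopg2's copy_expert.
# "demo_table", its columns and the connection string are placeholders.
import psycopg2
from StringIO import StringIO

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()

# Reading: dump the result of an arbitrary query into an in-memory buffer as csv
buf = StringIO()
cur.copy_expert("COPY (SELECT * FROM demo_table) TO STDOUT (FORMAT 'csv', HEADER true)", buf)

# Writing: load csv data from a file-like object back into a table
data = StringIO("1,foo\n2,bar\n")
cur.copy_expert("COPY demo_table (id, fname) FROM STDIN (FORMAT 'csv')", data)
conn.commit()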
Preparation:
As a guinea pig we will create a table with 1 million records and write a small script that reflects the essence of the method. The sql file can be found in my git repository (the link is at the bottom of the article).
Also, do not forget to install the psycopg2 package!
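The real table definition lives in the sql file in the repository; purely as a sketch, the preparation might look roughly like this (the columns are my own guess, except fname, which the query below relies on):

# Sketch of preparing the test data; the column set is only an illustration.
# psycopg2 itself can be installed with: pip install psycopg2
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder DSN
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS big_data (
        id      serial PRIMARY KEY,
        fname   text,
        payload text
    )
""")
# Fill the table with 1 million generated rows
cur.execute("""
    INSERT INTO big_data (fname, payload)
    SELECT 'name_' || g, md5(g::text)
    FROM generate_series(1, 1000000) AS g
""")
conn.commit()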
Implementation:
To fetch the data we will use the wonderful copy_expert method, which allows us to execute COPY statements from the Python client.
from StringIO import StringIO

query = """
    SELECT * FROM big_data INNER JOIN big_data AS t1 USING (fname)
"""
# In-memory buffer that receives the raw csv produced by COPY
output = StringIO()
self.cursor.copy_expert("COPY (%s) TO STDOUT (FORMAT 'csv', HEADER true)" % query, output)
data = output.getvalue()
output.close()

result = list()
for item in getResults(data):
    # do whatever we need with the row
    item = {k: None if v == "" else v for k, v in item.items()}
    result.append(item)
Code Explanations:
- In the query we join the table to itself on purpose, to make it more complex (it has been observed that the gain in speed is directly proportional to the complexity of the query);
- As a buffer we use a StringIO object, into which we write the data from the cursor (a minimal sketch of how the connection and cursor might be set up follows this list);
- We parse the resulting string with the getResults generator;
- For ease of interpretation, I convert all empty strings to None, because after COPY we get only string values;
- Note that I use the csv format with a leading header line; why will become clear a bit later.
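The snippet above keeps the cursor on self, i.e. it lives inside a class; a minimal sketch of the surrounding setup might look like this (the class name and connection parameters are placeholders, not from the repository):

# Sketch of the class that owns self.cursor; names and DSN are placeholders.
import psycopg2


class BulkReader(object):
    def __init__(self, dsn="dbname=test user=postgres"):
        self.connection = psycopg2.connect(dsn)
        self.cursor = self.connection.cursor()

    def close(self):
        self.cursor.close()
        self.connection.close()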
Generator Code:
import csv
from StringIO import StringIO


def getResults(stream):
    """
    Generator yielding the csv rows as dictionaries
    """
    f = StringIO(stream)
    reader = csv.DictReader(f, restkey=None)
    for item in reader:
        yield item
    f.close()
Explanations:
- As can be seen from the listing, we again use the familiar StringIO buffer;
- To turn the csv rows into dictionaries we use the DictReader class from the standard csv library. By default it takes the first line as the list of field names, i.e. the dictionary keys (a tiny standalone illustration follows this list).
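To make the "first line becomes the field names" behaviour concrete, a tiny standalone illustration (the sample data is made up):

import csv
from StringIO import StringIO

sample = "id,fname\n1,alice\n2,\n"  # the first line is the header
for row in csv.DictReader(StringIO(sample), restkey=None):
    print row
# prints dicts like {'id': '1', 'fname': 'alice'} and {'id': '2', 'fname': ''}
# note that every value comes back as a string, hence the None conversion above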
That, in fact, is all we need!
My configuration: MacBook Air 2013, 1.3 GHz Intel Core i5, 4 GB 1600 MHz DDR3 RAM, SSD.
PS:
I want to note that this approach to speeding up reads does not always pay off: if you have a fairly simple table of 3-5 fields, you will not notice any appreciable difference (at least up to about 1 million records). With complex queries, however, the method gives a crazy speed-up, reaching 10-20 times! The hardware the script runs on also has a very strong effect.
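If you want to check on your own data where the crossover point is, a rough timing harness might look like this (the connection string is a placeholder; the query is the one from the listing above):

# Rough comparison of a plain fetchall() against COPY ... TO STDOUT.
import time
import psycopg2
from StringIO import StringIO

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder DSN
cursor = conn.cursor()
query = "SELECT * FROM big_data INNER JOIN big_data AS t1 USING (fname)"

start = time.time()
cursor.execute(query)
rows = cursor.fetchall()
print "plain fetchall(): %.1f s" % (time.time() - start)

start = time.time()
buf = StringIO()
cursor.copy_expert("COPY (%s) TO STDOUT (FORMAT 'csv', HEADER true)" % query, buf)
print "copy_expert():    %.1f s" % (time.time() - start)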
All of the code can be found in the git repository: https://github.com/drizgolovicha/python_bulk_read
I will be glad to hear comments and suggestions for optimization!
Thank you for reading to the end.
UPD:
Measurement results on a sample of 14k records (a sketch of the materialized view used below follows the list):
- Direct SELECT with a WHERE condition on a non-indexed field - 21.4 s
- COPY of the previous query - 13.1 s
- The same SELECT, but from a materialized view with an index on that field - 12.6 s
- COPY from the materialized view - 1.8 s
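For completeness, a sketch of what such a materialized view with an index might look like; the view name, index name, filtered column (fname) and its value are my own placeholders, not taken from the repository:

# Sketch of a materialized view with an index, read via COPY as usual.
import psycopg2
from StringIO import StringIO

conn = psycopg2.connect("dbname=test user=postgres")  # placeholder DSN
cursor = conn.cursor()

cursor.execute("CREATE MATERIALIZED VIEW big_data_mv AS SELECT * FROM big_data")
cursor.execute("CREATE INDEX big_data_mv_fname_idx ON big_data_mv (fname)")
conn.commit()

# COPY then reads from the view exactly as from an ordinary table
buf = StringIO()
cursor.copy_expert(
    "COPY (SELECT * FROM big_data_mv WHERE fname = 'name_42') TO STDOUT (FORMAT 'csv', HEADER true)",
    buf)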