Like PostgreSQL and ClickHouse in Python a lot, quickly and immediately in numpy

From the sandbox

Smashed a lot of mugs in search of a solution to quickly get long price histories for a large number of assets in Python. I also had the courage to work with prices in numpy-arrays, or better immediately in pandas.

Standard approaches in the forehead worked disappointingly, which led to the execution of a query to the database for 30 seconds or more. Not wanting to put up, I found several solutions that completely satisfied me.

Feet grow out of the object nature of Python. After all, he even has integers as objects, which extremely negatively affects the speed of work. I absolutely did not want to change the language.

The first solution was to group the price history by PostgreSQL, which led to a slight performance drop on the database side, but it accelerated the task by about 3 times. The method is described in more detail in another article.

As a result, an understanding emerged that in Python it is necessary to somehow get the entire data set in one piece, at least in a string. And disassembled by numpy-arrays or immediately in pandas.

Final results:

PostgreSQL forehead solution

We do grouping of the data in a sql-request. Example:

SELECT
    string_agg(symbol::text, ',') AS symbol_list
    , string_agg(dt::text, ',') AS dt_list
    , string_agg(open::text, ',') AS open_list
    , string_agg(high::text, ',') AS high_list
    , string_agg(low::text, ',') AS low_list
    , string_agg("close"::text, ',') AS close_list
    , string_agg(volume::text, ',') AS volume_list
    , string_agg(adj::text, ',') AS adj_list
FROM v_prices_fast
WHERE symbol IN ('{symbols}')

Parse data is easy:

{
    'symbol': np.array(r[0].split(',')),  # str'dt': np.array(r[1].split(','), dtype='datetime64'),  # str w/type'open': np.fromstring(r[2], sep=','),  # numbers# ...
}

Productivity on ~ 1.7 million lines:

%timeit get_prices_fast(is_adj=False)  # 11.9s

Ready Python Packages

Python is good for its community, which faces similar problems. The following will suit our purpose:

odo - designed to optimize the speed of data transfer from one source to another. Fully in Python. It interacts with PostgreSQL through SQLAlchemy.
warp_prism - C-extension used by the Quantopian project to get data from PostgreSQL. It is based on odo functional.

Both packages use the PostgreSQL feature to copy data to CSV:

COPY {query} TO :path
    WITH (
        FORMAT CSV,
        HEADER :header,
        DELIMITER :delimiter,
        QUOTE :quotechar,
        NULL :na_value,
        ESCAPE :escapechar,
        ENCODING :encoding
    )

At the output, the data is parsed in pandas.DataFrame () or numpy.ndarray ().

Since warp_prism is written in C, it has a significant advantage in the speed of data parsing. But at the same time has a significant drawback - limited support for data types. That is, it parses int, float, date, and str, but not numeric. Odo has no such restrictions.

To use, you need to describe the structure of the table and the query using the sqlalchemy package:

tbl_prices = sa.Table(
    'prices', metadata,
    sa.Column('symbol', sa.String(16)),
    sa.Column('dt', sa.Date),
    sa.Column('open', sa.FLOAT),
    sa.Column('high', sa.FLOAT),
    sa.Column('low', sa.FLOAT),
    sa.Column('close', sa.FLOAT),
    sa.Column('volume', sa.BIGINT),
    sa.Column('adj', sa.FLOAT),
)
query = sa.select(tbl_prices.c).where(
    tbl_prices.c.symbol.in_(SYMBOLS)
).order_by('symbol', 'dt')

Speed tests:

%timeit odo(query, pd.DataFrame, bind=engine)  # 13.8s
%timeit warp_prism.to_dataframe(query, bind=engine)  # 8.4s
%timeit warp_prism.to_arrays(query, bind=engine)  # 8.0s

warp_prism.to_arrays () - preparing a python dictionary with numpy arrays.

What can be done with ClickHouse?

PostgreSQL is all good, except for appetite with the size of the storage and the need to adjust sharding for large tables. ClickHouse itself shorts, stores everything compactly, and works with lightning speed. For example, a PostgreSQL table with a size of ~ 5Gb in ClickHouse fits into ~ 1Gb. Using ClickHouse to store prices is described in another article.

To my chagrin, odo did not help, even though there is a clickhouse extension for sqlalchemy. Memories of the speed of the clickhouse in the console led me to the idea of accessing the database through the creation of a separate process. I know that this is a long and resource-intensive, but the results were beyond all praise.

sql = 'SELECT days.symbol, days.date, days.open/10000, days.high/10000, days.low/10000, days.close/10000, days.volume FROM days ' \
      'WHERE days.symbol IN (\'{0}\') ORDER BY days.symbol, days.date;'.format("','".join(SYMBOLS))
cmd = 'clickhouse-client --query="{0}"'.format(sql)
defch_pandas(cmd):
    p = subprocess.Popen([cmd], stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    return pd.io.parsers.read_csv(p.stdout, sep="\t", names=['symbol', 'date', 'open', 'high', 'low', 'close', 'volume'])

Result:

%timeit ch_pandas(cmd)  # 1.6s

ClickHouse HTTP Port Request

The results slightly worsened when accessing directly to port 8123, where the database responds:

import urllib
%timeit pd.io.parsers.read_csv('http://localhost:8123/?{0}'.format(urllib.parse.urlencode({'query': sql})), sep="\t", names=['symbol', 'date', 'open', 'high', 'low', 'close', 'volume'])
# 1.9s

But not without a spoon of tar.

Spoon of tar with ClickHouse

The database was impressive on large samples, but on small results they were disappointing. ~ 20 times worse than odo. But this is the cost of an additional body kit with the launch of the process or appeal to HTTP.

Results:

Conclusion

This article is the quest for accelerating the interaction between Python and databases is over. For PostgreSQL, with standard fields and the need for universal access to prices, the best way is to use Quantopian's warp_prism package. If you need to store large amounts of history and a high frequency of requests for a large number of lines, ClickHouse is ideal.

Tags: