KDB + Database: From Finance to Formula 1
KDB +, KX product- This is a widely known in narrow circles, extremely fast, column database designed to store time series and analytical calculations based on them. Initially, it enjoyed (and enjoys) great popularity in the finance industry - it is used by all the top 10 investment banks and many well-known hedge funds, exchanges and other organizations. Recently, KX decided to expand its client base and now offer solutions in other areas where there is a large amount of data sorted by time or in another way - telecom, bioinformatics, production, etc. In particular, they became a partner of the Aston Martin Red Bull Racing team in Formula 1, where they help to collect and process data from car sensors and analyze tests in a wind tunnel. In this article I want to tell
In this article I will try to tell in general what KDB + is, what features and limitations it has, what is its benefit for companies wishing to process large volumes of data. I will not go into the details of the implementation of KDB + and the details of its Q programming language. Both of these topics are very extensive and deserve separate articles. A lot of information on these topics can be found at code.kx.com, including a book on Q - Q For Mortals (see link below).
KX was founded in 1993 by Arthur Whitney, who previously worked at Morgan Stanley Bank on A +, the successor to APL, a very original and popular language in the financial world. Of course, in KX, Arthur continued in the same spirit and created the vector-functional language K, guided by the ideas of radical minimalism. K programs look like a messy set of punctuation marks and special characters, the meaning of characters and functions depends on the context, and each operation carries a lot more meaning than it does in the usual programming languages. Due to this, the K program takes up a minimum of space - several lines can replace pages of text of a verbose language such as Java - and is a super-concentrated implementation of the algorithm.
A function on K that implements most of the LL1 parser generator according to a given grammar:
Arthur also embodied this philosophy of extreme efficiency with a minimum of body movements in KDB +, which appeared in 2003 (I think it’s now clear where the letter K comes from the name) and there is nothing more than an interpreter of the fourth version of the language K. A version more pleasant to the user's eye is added to K K called Q. Q also added support for a specific SQL dialect, QSQL, and in the interpreter, support for tables as a system data type, tools for working with tables in memory and on disk, etc.
Thus, from the user's point of view, KDB + is just a Q interpreter with support for tables and SQL-like LINQ-style expressions from C #. This is the most important difference between KDB + and other databases and its main competitive advantage, which is often overlooked. This is not a database + an auxiliary language that is disabled, but a full-fledged powerful programming language + built-in support for database functions. This difference will play a decisive role in listing all the benefits of KDB +. For instance…
By modern standards, KDB + is just a microscopic size. This is literally one executable file smaller than a megabyte and one small text file with some system functions. Actually - less than one megabyte and for this program companies pay tens of thousands of dollars per year for one processor on the server.
Q is perfect for a variety of tasks. The Q process can serve as a historical database and provide quick access to terabytes of information. For example, we have dozens of historical databases, in some of which one uncompressed day of data takes more than 100 gigabytes. However, with reasonable restrictions, the database query will be executed in tens to hundreds of milliseconds. In general, we have a universal timeout for user requests - 30 seconds - and it works very rarely.
With the same ease, Q can be an in-memory database. Adding new data to tables in memory is so fast that user queries are a limiting factor. The data in the tables is stored in columns, which means that any operation in the column will use the processor cache at full capacity. In addition to this, KX tried to implement all basic operations such as arithmetic through vector processor instructions, maximizing their speed. Q can also perform tasks that are not characteristic of databases - for example, process streaming data and calculate in "real time" (with a delay from tens of milliseconds to several seconds depending on the task) various aggregate functions for financial instruments for different time intervals or build a model of the impact of a completed transaction on the market and carry out its profiling almost immediately after its completion . In such problems, most often the main time delay is not Q, but the need to synchronize data from different sources. High speed is achieved due to the fact that the data and functions that process them are in the same process, and processing is reduced to the execution of several QSQL expressions and joins that are not interpreted, but are executed in binary code. and the need to synchronize data from different sources. High speed is achieved due to the fact that the data and functions that process them are in the same process, and processing is reduced to the execution of several QSQL expressions and joins that are not interpreted, but are executed in binary code. and the need to synchronize data from different sources. High speed is achieved due to the fact that the data and functions that process them are in the same process, and processing is reduced to the execution of several QSQL expressions and joins that are not interpreted, but are executed in binary code.
Finally, any service processes can also be written in Q. For example, Gateway processes that automatically distribute user requests to the necessary databases and servers. The programmer has complete freedom to implement any algorithm for balancing, prioritizing, fault tolerance, access rights, quotas, and generally whatever your heart desires. The main problem here is that you have to implement all this yourself.
For an example, I will list what types of processes we have. All of them are actively used and work together, combining dozens of different databases, processing data from many sources and serving hundreds of users and applications.
Typical architecture for one data type:
Although Q is an interpreted language, it is simultaneously a vector language. This means that many built-in functions, in particular arithmetic, accept arguments of any form - numbers, vectors, matrices, lists, and the programmer is expected to implement the program as operations on arrays. In such a language, if you add two vectors in a million elements, it no longer matters that the language is interpreted, the addition will be performed by a superoptimized binary function. Since the lion's share of the time in Q programs is spent on operations with tables using these basic vectorized functions, the output has a very decent speed that allows you to process a huge amount of data even in one process. This is similar to the math libraries in python - although python itself is a very slow language,
In addition, KX very carefully approached the design of tables and optimizing the work with them. Firstly, several types of indexes are supported, which are supported by built-in functions and can be applied not only to table columns, but also to any vectors - grouping, sorting, uniqueness attribute and special grouping for historical databases. The index is superimposed elementarily and is automatically adjusted when adding elements to the column / vector. Indexes can equally well overlap table columns both in memory and on disk. When executing a QSQL query, indexes are used automatically, if possible. Secondly, work with historical data is done through the OS file mapping mechanism (memory map). Large tables are never loaded into memory, instead, the necessary columns are mapped directly to the memory and only that part of them is actually loaded (indices also help here), which is necessary. There is no difference for the programmer whether the data is in memory or not, the mechanism for working with mmap is completely hidden in the bowels of Q.
KDB + is not a relational database, tables can contain arbitrary data, while the order of the rows in the table does not change when new elements are added and can and should be used when writing queries. This feature is urgently needed for working with time series (data from exchanges, telemetry, event logs), because if the data is sorted by time, then the user does not need to use any SQL tricks to find the first or last row or N rows in the table , determine which line follows the Nth line, etc. Joins of tables are even more simplified, for example, finding for 16,000 transactions VOD.L (Vodafone) the last quote in a table of 500 million elements takes about a second on disk and a dozen milliseconds in memory.
An example of a time join is the quote table is mapped to memory, so there is no need to specify VOD.L in where, the index on the sym column is implicitly used and the fact that the data is sorted by time. Almost all joins in Q are ordinary functions, not part of the select statement:
Finally, it is worth noting that the engineers at KX, starting with Arthur Whitney himself, are really obsessed with efficiency and are making every effort to get the most out of the standard Q functions and optimize the most common usage patterns.
KDB + is popular among businesses primarily due to its exceptional versatility - it serves equally well both as an in-memory base, and as a base for storing terabytes of historical data, and as a platform for data analysis. Due to the fact that data processing occurs directly in the database, a high speed of operation and resource saving are achieved. A full-fledged programming language, integrated with the database functions, allows you to implement on the same platform the whole stack of necessary processes - from receiving data to processing user requests.
A significant drawback of KDB + / Q is its high entry threshold. The language has a strange syntax, some functions are heavily overloaded (value, for example, has about 11 use cases). Most importantly, it requires a radically different approach to writing programs. In a vector language, you need to think all the time in terms of array transformations, implement all cycles through several options of map / reduce functions (called adverbs in Q), never try to save money by replacing vector operations with atomic ones. For example, to find the index of the Nth occurrence of an element in an array, write:
although this looks terribly inefficient by C / Java standards (= creates a Boolean vector, where returns the indices of the true elements in it). But such a record makes the meaning of the expression more understandable and you use fast vector operations instead of slow atomic ones. The conceptual difference between the vector language and the rest is comparable to the difference between the imperative and functional approaches to programming, and you need to be prepared for this.
Some users are also unhappy with QSQL. The fact is that it only looks like real SQL. In fact, it's just an interpreter of SQL-like expressions that does not support query optimization. The user himself must write the optimal queries, and on Q, for which many are not ready. On the other hand, of course, you can always write your own optimal query yourself, and not rely on a black box-optimizer.
As a plus, a book on Q - Q For Mortals is available for free on the company's website , and there are also many other useful materials.
Another big minus is the cost of the license. This is tens of thousands of dollars per year for one CPU. Only large firms can afford such expenses. Recently, KX has made the licensing policy more flexible and provides the ability to pay only for the time of use or rent KDB + in the Google and Amazon clouds. KX also offers to download a free version for non-commercial purposes (32-bit version or 64-bit upon request).
There are quite a few specialized databases built on similar principles - columnar, in-memory, focused on very large amounts of data. The problem is that these are specialized databases. A prime example is Clickhouse. This database has a very similar to KDB + principle of storing data on disk and building the index; it performs some queries faster than KDB +, although not significantly. But even as the Clickhouse database is more specialized than KDB + - web analytics vs arbitrary time series (this difference is very important - because of it, for example, there is no way to use record ordering in Clickhouse). But, most importantly, Clickhouse does not have the universality of KDB +, a language that would allow processing data directly in the database, rather than loading them previously into a separate application, build arbitrary SQL expressions, apply arbitrary functions in the query, create processes not related to the execution of the functions of the historical base. Therefore, it is difficult to compare KDB + with other databases, they may be better in separate use cases or simply better if we are talking about the tasks of classic databases, but I don’t know another equally effective and universal tool for processing temporary data.
To make KDB + easier for people new to technology, KX created libraries for tight integration with Python in a single process. You can either call any python function from Q, or vice versa - call any Q function from Python (in particular QSQL expressions). Libraries convert if necessary (for the sake of efficiency not always) data from the format of one language to the format of another. As a result, Q and Python live in such close symbiosis that the boundaries between them are erased. As a result, a programmer, on the one hand, has full access to numerous useful Python libraries, on the other hand, he gets a fast base for working with big data integrated in Python, which is especially useful for those involved in machine learning or modeling.
Working with Q in Python:
Company website - https://kx.com/
Website for developers - https://code.kx.com/v2/
Q For Mortals book (in English) - https://code.kx.com/q4m3/
Articles on the topic of KDB + / Q applications from kx employees is https://code.kx.com/v2/wp/
In this article I will try to tell in general what KDB + is, what features and limitations it has, what is its benefit for companies wishing to process large volumes of data. I will not go into the details of the implementation of KDB + and the details of its Q programming language. Both of these topics are very extensive and deserve separate articles. A lot of information on these topics can be found at code.kx.com, including a book on Q - Q For Mortals (see link below).
Some terms
- In-memory database. A database that stores data in RAM for faster access. The advantages of such a database are understandable, and the disadvantages are the possibility of data loss, the need to have a lot of memory on the server.
- Column database. A database where data is stored in series, not record by record. The main advantage of such a database is that data from one column is stored together on disk and in memory, which greatly speeds up access to them. There is no need to load columns that are not used in the request. The main disadvantage is that it is difficult to modify and delete records.
- Time series. Data with a column such as date or time. As a rule, ordering in time is important for such data, so that you can easily determine which record precedes or follows the current one, or to apply functions whose result depends on the order of the records. Classic databases are built on a completely different principle - representing a set of records as a set, where the order of records is not defined in principle.
- Vector. In the context, KDB + is a list of elements of the same atomic type, for example, numbers. In other words, an array of elements. Arrays, unlike lists, can be stored compactly and processed using vector processor instructions.
Historical reference
KX was founded in 1993 by Arthur Whitney, who previously worked at Morgan Stanley Bank on A +, the successor to APL, a very original and popular language in the financial world. Of course, in KX, Arthur continued in the same spirit and created the vector-functional language K, guided by the ideas of radical minimalism. K programs look like a messy set of punctuation marks and special characters, the meaning of characters and functions depends on the context, and each operation carries a lot more meaning than it does in the usual programming languages. Due to this, the K program takes up a minimum of space - several lines can replace pages of text of a verbose language such as Java - and is a super-concentrated implementation of the algorithm.
A function on K that implements most of the LL1 parser generator according to a given grammar:
1. pp:{q:{(x;p3(),y)};r:$[-11=@x;$x;11=@x;q[`N;$*x];10=abs@@x;q[`N;x]
2. ($)~*x;(`P;p3 x 1);(1=#x)&11=@*x;pp[{(1#x;$[2=#x;;,:]1_x)}@*x]
3. (?)~*x;(`Q;pp[x 1]);(*)~*x;(`M;pp[x 1]);(+)~*x;(`MP;pp[x 1]);(!)~*x;(`Y;p3 x 1)
4. (2=#x)&(@x 1)in 100 101 107 7 -7h;($[(@x 1)in 100 101 107h;`Ff;`Fi];p3 x 1;pp[*x])
5. (|)~*x;`S,(pp'1_x);2=#x;`C,{@[@[x;-1+#x;{x,")"}];0;"(",]}({$[".s.C"~4#x;6_-2_x;x]}'pp'x);'`pp];
6. $[@r;r;($[1<#r;".s.";""],$*r),$[1<#r;"[",(";"/:1_r),"]";""]]}
Arthur also embodied this philosophy of extreme efficiency with a minimum of body movements in KDB +, which appeared in 2003 (I think it’s now clear where the letter K comes from the name) and there is nothing more than an interpreter of the fourth version of the language K. A version more pleasant to the user's eye is added to K K called Q. Q also added support for a specific SQL dialect, QSQL, and in the interpreter, support for tables as a system data type, tools for working with tables in memory and on disk, etc.
Thus, from the user's point of view, KDB + is just a Q interpreter with support for tables and SQL-like LINQ-style expressions from C #. This is the most important difference between KDB + and other databases and its main competitive advantage, which is often overlooked. This is not a database + an auxiliary language that is disabled, but a full-fledged powerful programming language + built-in support for database functions. This difference will play a decisive role in listing all the benefits of KDB +. For instance…
The size
By modern standards, KDB + is just a microscopic size. This is literally one executable file smaller than a megabyte and one small text file with some system functions. Actually - less than one megabyte and for this program companies pay tens of thousands of dollars per year for one processor on the server.
- This size allows KDB + to feel great on any hardware - from the Pi microcomputer to servers with terabytes of memory. This does not affect the functionality in any way; moreover, Q starts instantly, which allows it to be used including as a scripting language.
- With this size, the Q interpreter is completely placed in the processor cache, which speeds up the execution of programs.
- With this size of the executable file, the Q process takes up negligible memory space; you can run them in hundreds. At the same time, if necessary, Q can operate with tens or hundreds of gigabytes of memory in one process.
Versatility
Q is perfect for a variety of tasks. The Q process can serve as a historical database and provide quick access to terabytes of information. For example, we have dozens of historical databases, in some of which one uncompressed day of data takes more than 100 gigabytes. However, with reasonable restrictions, the database query will be executed in tens to hundreds of milliseconds. In general, we have a universal timeout for user requests - 30 seconds - and it works very rarely.
With the same ease, Q can be an in-memory database. Adding new data to tables in memory is so fast that user queries are a limiting factor. The data in the tables is stored in columns, which means that any operation in the column will use the processor cache at full capacity. In addition to this, KX tried to implement all basic operations such as arithmetic through vector processor instructions, maximizing their speed. Q can also perform tasks that are not characteristic of databases - for example, process streaming data and calculate in "real time" (with a delay from tens of milliseconds to several seconds depending on the task) various aggregate functions for financial instruments for different time intervals or build a model of the impact of a completed transaction on the market and carry out its profiling almost immediately after its completion . In such problems, most often the main time delay is not Q, but the need to synchronize data from different sources. High speed is achieved due to the fact that the data and functions that process them are in the same process, and processing is reduced to the execution of several QSQL expressions and joins that are not interpreted, but are executed in binary code. and the need to synchronize data from different sources. High speed is achieved due to the fact that the data and functions that process them are in the same process, and processing is reduced to the execution of several QSQL expressions and joins that are not interpreted, but are executed in binary code. and the need to synchronize data from different sources. High speed is achieved due to the fact that the data and functions that process them are in the same process, and processing is reduced to the execution of several QSQL expressions and joins that are not interpreted, but are executed in binary code.
Finally, any service processes can also be written in Q. For example, Gateway processes that automatically distribute user requests to the necessary databases and servers. The programmer has complete freedom to implement any algorithm for balancing, prioritizing, fault tolerance, access rights, quotas, and generally whatever your heart desires. The main problem here is that you have to implement all this yourself.
For an example, I will list what types of processes we have. All of them are actively used and work together, combining dozens of different databases, processing data from many sources and serving hundreds of users and applications.
- Connectors (feedhandler) to data sources. These processes usually use external libraries that are loaded in Q. The C-interface in Q is extremely simple and allows you to easily create proxy functions for any C / C ++ library. Q is fast enough to handle, for example, processing the flow of FIX messages from all European stock exchanges simultaneously.
- Tickerplant distributors, which serve as an intermediate link between connectors and consumers. At the same time, they write incoming data in a special binary log, providing resistance for consumers to connection loss or restart.
- In-memory database (rdb). These databases provide the fastest access to raw, fresh data, storing them in memory. As a rule, they accumulate data in tables during the day and zero them at night.
- Persist Database (pdb). These databases provide data storage today in the historical database. As a rule, unlike rdb, they do not store data in memory, but use a special cache on disk for a day and copy the data at midnight to the historical database.
- Historical bases (hdb). These databases provide access to data for previous days, months, and years. Their size (in days) is limited only by the size of the hard drives. Data can be located anywhere, in particular on different disks for faster access. It is possible to compress data using several algorithms to choose from. The database structure is well-documented and simple, the data is stored on a per-unit basis in ordinary files, so that they can be processed, including using the operating system.
- Databases with aggregated information. Various aggregations are stored, usually with, grouped by instrument name and time interval. In-memory databases update their status with every incoming message, and historical ones store pre-calculated data to speed up access to historical data.
- Finally, gateway processes serving applications and users. Q allows you to implement completely asynchronous processing of incoming messages, distributing them among databases, checking access rights, etc. I note that messages are not limited to and most often are not SQL statements, as is the case in other databases. Most often, the SQL expression is hidden in a special function and is constructed based on the parameters requested by the user - time is converted, filtered, data is normalized (for example, the price of shares is equalized if dividends were paid), etc.
Typical architecture for one data type:
Speed
Although Q is an interpreted language, it is simultaneously a vector language. This means that many built-in functions, in particular arithmetic, accept arguments of any form - numbers, vectors, matrices, lists, and the programmer is expected to implement the program as operations on arrays. In such a language, if you add two vectors in a million elements, it no longer matters that the language is interpreted, the addition will be performed by a superoptimized binary function. Since the lion's share of the time in Q programs is spent on operations with tables using these basic vectorized functions, the output has a very decent speed that allows you to process a huge amount of data even in one process. This is similar to the math libraries in python - although python itself is a very slow language,
In addition, KX very carefully approached the design of tables and optimizing the work with them. Firstly, several types of indexes are supported, which are supported by built-in functions and can be applied not only to table columns, but also to any vectors - grouping, sorting, uniqueness attribute and special grouping for historical databases. The index is superimposed elementarily and is automatically adjusted when adding elements to the column / vector. Indexes can equally well overlap table columns both in memory and on disk. When executing a QSQL query, indexes are used automatically, if possible. Secondly, work with historical data is done through the OS file mapping mechanism (memory map). Large tables are never loaded into memory, instead, the necessary columns are mapped directly to the memory and only that part of them is actually loaded (indices also help here), which is necessary. There is no difference for the programmer whether the data is in memory or not, the mechanism for working with mmap is completely hidden in the bowels of Q.
KDB + is not a relational database, tables can contain arbitrary data, while the order of the rows in the table does not change when new elements are added and can and should be used when writing queries. This feature is urgently needed for working with time series (data from exchanges, telemetry, event logs), because if the data is sorted by time, then the user does not need to use any SQL tricks to find the first or last row or N rows in the table , determine which line follows the Nth line, etc. Joins of tables are even more simplified, for example, finding for 16,000 transactions VOD.L (Vodafone) the last quote in a table of 500 million elements takes about a second on disk and a dozen milliseconds in memory.
An example of a time join is the quote table is mapped to memory, so there is no need to specify VOD.L in where, the index on the sym column is implicitly used and the fact that the data is sorted by time. Almost all joins in Q are ordinary functions, not part of the select statement:
1. aj[`sym`time;select from trade where date=2019.03.26, sym=`VOD.L;select from quote where date=2019.03.26]
Finally, it is worth noting that the engineers at KX, starting with Arthur Whitney himself, are really obsessed with efficiency and are making every effort to get the most out of the standard Q functions and optimize the most common usage patterns.
Total
KDB + is popular among businesses primarily due to its exceptional versatility - it serves equally well both as an in-memory base, and as a base for storing terabytes of historical data, and as a platform for data analysis. Due to the fact that data processing occurs directly in the database, a high speed of operation and resource saving are achieved. A full-fledged programming language, integrated with the database functions, allows you to implement on the same platform the whole stack of necessary processes - from receiving data to processing user requests.
additional information
disadvantages
A significant drawback of KDB + / Q is its high entry threshold. The language has a strange syntax, some functions are heavily overloaded (value, for example, has about 11 use cases). Most importantly, it requires a radically different approach to writing programs. In a vector language, you need to think all the time in terms of array transformations, implement all cycles through several options of map / reduce functions (called adverbs in Q), never try to save money by replacing vector operations with atomic ones. For example, to find the index of the Nth occurrence of an element in an array, write:
1. (where element=vector)[N]
although this looks terribly inefficient by C / Java standards (= creates a Boolean vector, where returns the indices of the true elements in it). But such a record makes the meaning of the expression more understandable and you use fast vector operations instead of slow atomic ones. The conceptual difference between the vector language and the rest is comparable to the difference between the imperative and functional approaches to programming, and you need to be prepared for this.
Some users are also unhappy with QSQL. The fact is that it only looks like real SQL. In fact, it's just an interpreter of SQL-like expressions that does not support query optimization. The user himself must write the optimal queries, and on Q, for which many are not ready. On the other hand, of course, you can always write your own optimal query yourself, and not rely on a black box-optimizer.
As a plus, a book on Q - Q For Mortals is available for free on the company's website , and there are also many other useful materials.
Another big minus is the cost of the license. This is tens of thousands of dollars per year for one CPU. Only large firms can afford such expenses. Recently, KX has made the licensing policy more flexible and provides the ability to pay only for the time of use or rent KDB + in the Google and Amazon clouds. KX also offers to download a free version for non-commercial purposes (32-bit version or 64-bit upon request).
Competitors
There are quite a few specialized databases built on similar principles - columnar, in-memory, focused on very large amounts of data. The problem is that these are specialized databases. A prime example is Clickhouse. This database has a very similar to KDB + principle of storing data on disk and building the index; it performs some queries faster than KDB +, although not significantly. But even as the Clickhouse database is more specialized than KDB + - web analytics vs arbitrary time series (this difference is very important - because of it, for example, there is no way to use record ordering in Clickhouse). But, most importantly, Clickhouse does not have the universality of KDB +, a language that would allow processing data directly in the database, rather than loading them previously into a separate application, build arbitrary SQL expressions, apply arbitrary functions in the query, create processes not related to the execution of the functions of the historical base. Therefore, it is difficult to compare KDB + with other databases, they may be better in separate use cases or simply better if we are talking about the tasks of classic databases, but I don’t know another equally effective and universal tool for processing temporary data.
Python Integration
To make KDB + easier for people new to technology, KX created libraries for tight integration with Python in a single process. You can either call any python function from Q, or vice versa - call any Q function from Python (in particular QSQL expressions). Libraries convert if necessary (for the sake of efficiency not always) data from the format of one language to the format of another. As a result, Q and Python live in such close symbiosis that the boundaries between them are erased. As a result, a programmer, on the one hand, has full access to numerous useful Python libraries, on the other hand, he gets a fast base for working with big data integrated in Python, which is especially useful for those involved in machine learning or modeling.
Working with Q in Python:
1. >>> q()
2.q)trade:([]date:();sym:();qty:())
3. q)\
4. >>> q.insert('trade', (date(2006,10,6), 'IBM', 200))
5. k(',0')
6. >>> q.insert('trade', (date(2006,10,6), 'MSFT', 100))
7. k(',1')
References
Company website - https://kx.com/
Website for developers - https://code.kx.com/v2/
Q For Mortals book (in English) - https://code.kx.com/q4m3/
Articles on the topic of KDB + / Q applications from kx employees is https://code.kx.com/v2/wp/