inv2004 October 24, 2013 at 19:18

Programming language and database Q: syntax does not play a role in enterprise

From the sandbox
Tutorial

There was a need to choose a new tariff plan for cellular. To spend about 30 minutes with excel and google-docs, it became clear that nothing sensible could come of this and you could not do without db.

After a little thought, the hand itself typed “q”, as it was the only one available on the computer here and now. What he knew about him: that he launched the first and last time a year ago, for about 30 minutes, for a simple task of parsing and searching a file.

Then there will be a lot of q, namely the ascii follower of a subset of the APL and Scheme languages, namely k and its extension k-sql, degenerated into a product with the name Q - a tight connection of the language and the database built into it.

C:\Users\unknown\Dropbox\j>q
KDB+ 3.0 2013.02.06 Copyright (C) 1993-2013 Kx Systems
w32/ 2()core 2972MB unknown win-d2om7les24v 192.168.1.2 PLAY 2013.05.07

A bit of lyrics: I download the report from the operator’s site in csv and slightly adjust the title:

Сервис;Дата звонка;tel;time;Длит-ть;Баланс до;money;Баланс после;
Входящий звонок внутри группы;22.02.2013 20:38:14;79064014328;00:00:13;0;114,9175;0,0000;114,9175;
Входящий звонок;22.02.2013 20:03:49;79094445182;00:12:05;0;114,9175;0,0000;114,9175;
Исходящий звонок внутри группы;22.02.2013 17:04:39;79064014328;00:01:15;0;115,8175;-0,9000;114,9175;
Исходящий звонок внутри группы;22.02.2013 13:18:22;79064014328;00:01:36;0;116,7175;-0,9000;115,8175;
Списание за услугу Сообщники;22.02.2013 01:35:00;;00:00:00;0;119,3675;-2,6500;116,7175;
Запрос информации;21.02.2013 23:40:42;*102;00:00:01;0;119,3675;0,0000;119,3675;

Then we read the manual, reading a text file 0:, in parallel, you can divide the file into fields by specifying the column format and the separator on the left side, if the separator is a list, then the table columns will be named from the first line. We select only the necessary fields, two times and the table is ready.

Those who are bored with reading about data preparation can jump right into data analysis .

q)clog:select tel,time,money from ("SSSTSSSS";enlist ";") 0: `:tel.txt
q)clog
tel         time         money
---------------------------------
79064014328 00:00:31.000 0,0000
79263883922 00:02:06.000 0,0000
79064014328 00:01:15.000 -0,9000
79064014328 00:01:36.000 -0,9000
            00:00:00.000 -2,6500
*102        00:00:01.000 0,0000
..

Since q is a vector database, then essentially clog is a dictionary, the column name is a list of values.

q)clog.money
`0,0000`0,0000`-0,9000`-0,9000`-2,6500`0,0000`0,0000`0,0000`0,0000`0,0000`-0,..

I’ll prepare the data a bit. It can be seen that money is not in a numerical format, it would be necessary to convert to a number: ssr is an oracle replace. The term $ (cast) deals with various conversions and type conversions, in this case it reads a number from a string:

each is a map

{"F"$ssr[string x;",";"."]} each clog.money

Well, we’ll write it all down to the table using update. There is a small feature. If you use the clog table name, the result of the update function will be a new table with updated values. but you can specify the table name as `clog, then the changes will be saved. We will also make the phone a string, initially “S” is not a string, but a character type.

q)update string tel, {"F"$ssr[string x;",";"."]} each money from `clog
`clog

Almost all the words in this q-sql are ordinary functions with a small portion of syntactic sugar. They can be used separately, for example, where simply converts the bit string to a list of indexes.

comparison works with lists. the result is a bit string from which where it extracts indices, well, and select from these indices extracts the corresponding list items from the table.

q)15<40 10 20 30
1011b
q)where 15<40 10 20 30
0 2 3

The file contains both outgoing and incoming calls and payments of various services, I select the lines where the money was debited and there is a phone number and assign this old name:

q)clog:select from clog where money<0,not tel like ""
q)clog
tel           time         money
--------------------------------
"79064014328" 00:01:15.000 -0.9
"79064014328" 00:01:36.000 -0.9
"79064014328" 00:01:33.000 -0.9
"79104652109" 00:01:23.000 -11.9
"79265996349" 00:00:12.000 -5.95
..

We determine the codes that are used to subsequently classify by operators:

a little later I realized that I needed to pull the code extraction into the function.

q)gcode:1_ 4# / get code from tel
q)gcode each clog.tel
"906"
"906"
"906"
"910"
"926"
..

q)distinct gcode each clog.tel
"906"
"910"
"926"
..

here is a more sql-like entry with exec. exec is the same select, but which does not return the dictionary of the table, but returns the values or value of the result of the query or table.

q)codes:exec distinct gcode each tel from clog
q)codes
"906"
"910"
"926"
..

Next, go to the dictionaries, they are described simply <keys>! <values>. I create a dictionary code <> operator.

q)ops:codes ! `beeline`mts`megafon`beeline`mts`beeline`beeline`mts`moscow
q)ops
"906"| beeline
"910"| mts
"926"| megafon
"909"| beeline
"495"| moscow
..

Many tariffs round up a minute to full, I enter a field for convenience, which will be just a whole number of minutes. I do not save it to the table, I just get the result, since later I will create a view with this field. Time in milliseconds, so I divide by 1000.

q)update ctime:ceiling (time%1000)%60 from clog
tel           time         money ctime
--------------------------------------
"79064014328" 00:01:15.000 -0.9  2
"79064014328" 00:01:36.000 -0.9  2
"79064014328" 00:01:33.000 -0.9  2
"79104652109" 00:01:23.000 -11.9 2
"79265996349" 00:00:12.000 -5.95 1
..

I create a view with an operator and whole minutes, if I wrote t: I would create a table t. I remind you that update saves the original columns.

q)t::update op:ops@gcode each tel, ctime:ceiling (time%1000)%60 from clog
q)t
tel           time         money op      ctime
----------------------------------------------
"79064014328" 00:01:15.000 -0.9  beeline 2
"79064014328" 00:01:36.000 -0.9  beeline 2
"79064014328" 00:01:33.000 -0.9  beeline 2
"79104652109" 00:01:23.000 -11.9 mts     2
"79265996349" 00:00:12.000 -5.95 megafon 1
..

I’ve typed everything I need, I save the result of t to a file just in case, it would be more correct of course to save the clog and description of view`t, but laziness:

q)save `:t
`:t

All of the above is just data preparation, now a little more interesting: parsing.

Let's see who called the most, then the group begins. Grouping is a parameter of the select function, which creates lists for each key entry:

q)select ctime by tel from t
tel          | ctime                                                         ..
-------------| --------------------------------------------------------------..
"74956471602"| ,1                                                            ..
"79031398210"| 7 3                                                           ..
"7903X"      | ,2                                                            ..
"79064014328"| 2 2 2 2 1 2 1 1 1 3 1 2 2 3 1 1 1 1 2 2 3 3 3 1 3 2 1 1 0 2 1 ..
..

then we execute the functions with the parameter in the form of this list, desc is the reverse sorting function, it sorts both the regular list and the table, by default it is sorted by the last column.

q)desc select sum ctime by tel from t
tel          | ctime
-------------| -----
"79064014328"| 126
"79094445182"| 36
"79652650530"| 30
..

Noting that there were many calls to one number, I added the “favorite number” column, a little later I decided to simply indicate this in the operator’s field, assigned a new name to the old view, and “t” is now a new view based on the old one:

q)t2::update op:ops@gcode each tel, ctime:ceiling (time%1000)%60 from clog
q)t::update op:`lub from t2 where tel like "79064014328"
q)t
tel           time         money op      ctime
----------------------------------------------
"79064014328" 00:01:15.000 -0.9  lub     2
"79064014328" 00:01:36.000 -0.9  lub     2
"79064014328" 00:01:33.000 -0.9  lub     2
"79104652109" 00:01:23.000 -11.9 mts     2
"79265996349" 00:00:12.000 -5.95 megafon 1
..

Now it's time to think about money, specifically about megaphone tariffs.

There are 3 kopecks there, it’s easy to describe the function:

q)meg3:{0.03*sum x}

let's see what's with the money for each operator:

q)select meg3 time%1000 by op from t
op     | time
-------| ------
beeline| 111.93
lub    | 148.05
megafon| 29.1
moscow | 0.93
mts    | 24.45

You need to enter tariff options, if the number is `lub, then divide the price by two and add 30p.

q)lub:{$[x=`lub;30+y%2;y]} / [op;time]

That's all, the function for calculating will be the following, here for lub uses currying:

q){lub[x] meg3[y]}

Unfortunately, I did not find how to pass the key and the value of the “by” result to the function, so I format this as a subquery. Since op and time are not two values of a row from a table as in regular db, whole lists (in this case, a list and a list of lists) will be transferred to the function, but the function described by me expects only two parameters: the operator and a list of times, so it is passed to use the eachboth function, which is denoted by '(quotation mark), in essence it is zipWith , but without limiting the number of lists. The request, unlike regular db, is only complicated by ':

q)select money:{lub[x] meg3[y]}'[op;time] from select time%1000 by op from t
money
------
259.98
29.1
0.93
24.45

Summarizing, here you can write both exec sum and sum exec - the result is summed exec or exec sums up the result - it does not play a role:

q)exec sum {lub[x] meg3[y]}'[op;time] from select time%1000 by op from t
314.46

It’s clear how much I would spend switching to this tariff. Now let's calculate another one, where the minute is rounded, and then on the second. You have to count for each specified time, which I do using each:

q)mego:sum {1.20+$[x<=60;0;1.20*(x-60)%60]} each
q)exec sum {lub[x] mego[y]}'[op;time] from select time%1000 by op from t
258.06

Full code:

clog:select tel,time,money from ("SSSTSSSS";enlist ";") 0: `:tel.txt
{"F"$ssr[string x;",";"."]} each clog.money
update string tel, {"F"$ssr[string x;",";"."]} each money from `clog
clog:select from clog where money<0,not tel like ""
gcode:1_ 4#
codes:exec distinct gcode each tel from clog
ops:codes ! `beeline`mts`megafon`beeline`mts`beeline`beeline`mts`moscow
t2::update op:ops@gcode each tel, ctime:ceiling (time%1000)%60 from clog
t::update op:`lub from t2 where tel like "79064014328"
meg3:{0.03*sum x}
mego:sum {1.20+$[x<=60;0;1.20*(x-60)%60]} each
lub:{$[x=`lub;30+y%2;y]}
exec sum {lub[x] meg3[y]}'[op;time] from select time%1000 by op from t
exec sum {lub[x] mego[y]}'[op;time] from select time%1000 by op from t

Making this text was much more difficult than writing these 14 lines. It’s clear that there are no things that are impossible for any other database, but the ease of use and the obviousness of writing some constructions made me write it. At the beginning it was a little difficult to switch from regular sql, but after realizing that the table here stores the data in lists, and functions, as a rule, work with almost any built-in data types, it became much more clear. It is the idiomatic simplicity and ease of implementation of this db, and in fact it is a mixture of scheme and APL that allows you to use this tool effectively. Impressions of this are the APL and functionality, shifted towards sql and databases.

The entire database consists of one q.exe file, ~ 400kb in size. Skeptics will smile after this, but then look at the customer list of this product http://kx.com/end-user-customers.php .
You can play with this by downloading here http://kx.com/software-download.php

Tags:

Programming language and database Q: syntax does not play a role in enterprise

Also popular now: