petropavel October 18, 2016 at 20:41

MariaDB on Google Summer of Code: GSoC16 Summary

A somewhat belated report about MariaDB, our projects at the latest GSoC, our students, ~~kings and cabbage~~ .

Last - 2015th - GSoC we turned out to be very unsuccessful. There were eight students in total, but many failed in mid-summer (at midterm evaluation), and three were from the same university in Cameroon (and obviously from the same course), with excellent applications, but they didn’t do anything, from the word “completely” , well, maybe one line of comment was corrected in a month and a half. And after the failure on midterm, they tried to appeal our decision to Google, and even sent us an email with foggy threats. Like, it’s not good to fail so many students, to spoil the image for yourself, next year Google will not give places.

But Google did not obey and gave them. And this year, probably in contrast, turned out to be extremely successful.

This year we were allocated 10 slots for which 54 students claimed. Considering the sad experience of the past year, we made sure that all the accepted students were from different universities - if they were to hack, then independently of each other. And it was as always - some students wrote to our mailing list in advance, appeared on the IRC, and discussed projects. Two students even made working prototypes. Such active ones, of course, were all accepted. The rest were left overboard, the competition was big. From the accepted ten students it turned out that:

One project was for MHA . MariaDB acts as an umbrella organization and projects can be any in the MariaDB / MySQL / Percona ecosystem
One project was for MaxScale, that is, also not for the server. See above
One student did nothing and was dropped out in July
Something strange happened to the three - they worked well, no problems were foreseen. And suddenly at the end of August Google wrote that some inconsistencies were found in their data, they were suspected of cheating and thrown out of the program. It turned out very ugly, people tried, worked all summer. But we could not influence the situation, although we tried.
Five students went to the end and completed their projects. 50% is a great success for us. About these five - below and in more detail.

Daniil Medvedev: NO PAD collations

As written in the documentation (actually briefly and between the lines), when comparing strings of different lengths, MariaDB Server (and Percona Server and MySQL) supplement the shorter string with spaces until the lengths are equal (that is, of course, nothing is really complemented, but the result is as if supplemented). It looks like this:

MariaDB [test]> select "abcd" > "abc";
+----------------+
| "abcd" > "abc" |
+----------------+
|              1 |
+----------------+
1 row in set (0.00 sec)
MariaDB [test]> select "abc\t" > "abc";
+-----------------+
| "abc\t" > "abc" |
+-----------------+
|               0 |
+-----------------+
1 row in set (0.00 sec)

In the second case, the long line is lexicographically smaller than the short, because there is '\t'less space. This is all well and standard. But not really. In the SQL standard, collation has this property; it can be PADSPACE or NOPAD. In the first case, a short line is achieved by spaces when comparing, in the second, respectively, this does not happen. So it turns out that so far all comparisons have been made as if they were PADSPACE. And now, Daniel Medvedev has made us NO PAD collations. The effect is not always noticeable, but it is:

MariaDB [test]> set collation_connection=utf8_general_nopad_ci;
Query OK, 0 rows affected (0.00 sec)
MariaDB [test]> select "abc\t" > "abc";
+-----------------+
| "abc\t" > "abc" |
+-----------------+
|               1 |
+-----------------+
1 row in set (0.00 sec)

This project was completely completed and now it is already in MariaDB 10.2.2.

Galina Shalygina: Condition pushdown into non-mergeable views

Galina is not a newcomer to us. She is well acquainted with one of the most complex subsystems of MariaDB - the query optimizer and is the author of the implementation of the operator WITH, recursive and non-recursive CTEs .

For this GSoC, Galina chose to optimize the views for which it is impossible to merge with the main query. It’s best, of course, to execute a query with views using the merge method, in which the view is substituted as a macro into the main query. For example, a query like this:

CREATE VIEW v1 AS SELECT a1+b1 AS c1, d1 FROM t1 WHERE e1>5;
SELECT x2 FROM v1, t2 WHERE c1=y2 AND d1=10;

Can be converted to:

SELECT x2 FROM t1, t2 WHERE a1+b1=y2 AND d1=10 AND e1>5 ;

And here the optimizer sees all the tables used, all conditions, and can choose the best plan. But not all views can be optimized in this way, for example, merging cannot be done if the view is used GROUP BY. In such cases, you have to perform the presentation as a separate query, saving the result in a temporary table, and then use it in the main query. As a result, the optimizer does not see the “whole picture” and is forced to consider both requests separately. That is, in the previous example, he would not be able to use the condition d1=10when fetching from the table t1.

This is where the new optimization comes into effect. If the view cannot be substituted into the main query and the optimizer is forced to read it separately using a temporary table, he will now try to use the conditions from the main query when filling out this temporary table, pushing them into the view in some way. That is, in the example above, the optimizer will save the result of such a query in a temporary table:

SELECT a1+b1 AS c1, d1 FROM t1 WHERE e1>5 AND d1=10;

Which can speed up the execution of the request by orders of magnitude.

This work was also completely finished, and it is already available in 10.2.3

Shubham Barai, Sachin Setiya: Arbitrary length UNIQUE constraints

The task was to implement uniqueness constraints of arbitrary length. Usually, when someone writes a tableUNIQUE(a,b,c), MariaDB (and Percona, and MySQL) create an index on the fields a, b, c. Hence the limitation is obtained - the uniqueness of the combination of fields can be guaranteed only when the total length of these fields does not exceed the maximum permissible length of the index key. But this is illogical! An "index" is a way to optimize access to data; it is not at all in the SQL standard. And UNIQUE is a logical restriction of data, directly from the SQL standard, and it is completely unclear why it should suffer due to some limitations of a particular implementation of query acceleration. Our solution was to create another field, hidden, users do not need to show it, and write a hash there of what we are doing UNIQUE for. And in this field you can already create a regular index. When writing in it, you need to look for conflicts and if there is - pull out conflicting records and directly compare the values, the hashes may indeed coincide. Moreover, MyISAM already knows how to do it, even does it - that’s how it worksSELECT DISTINCT. But for user tables this was not available. What I wanted to fix.

A difficult situation has arisen with this task. We have been offering it at GSoC for several years, no one wanted it. And this year, two applications came to her, and both are pretty sensible. Then both students wrote a working (!) Prototype - the first time in my eight years as a mentor at GSoC. And then again, both (!) Said that, they say, it’s too simple, and let’s do it at InnoDB. I had to take both. But I slightly corrected one of them.

Shubham dived into the jungle of InnoDB, and began to deal with its indexes, logs, transactions, recovery, and other magic. And Sachin pretended to forget that inside MyISAM this functionality was already there, threw out its prototype, and began to implement everything at the server level, in a motor-independent way. As a result, both completed the task successfully, the low-level implementation is slightly faster, and the higher-level one works with different engines, allows the optimizer to use these invisible fields to optimize queries, and brings with it other interesting goodies. We have not yet decided what we take - everything is so tasty ... So in 10.2 it most likely will not get there. It will be at 10.3, probably.

Sachin Setiya: Hidden columns

And this is the aforementioned bun. In order to make an invisible field (and it is also virtual, this hash can not be written to disk), we decided to add what is called Invisible columns in Oracle, and Implicitly Hidden in DB2. If it’s simple, then when creating a field, you can indicate that it is “invisible”. After that SELECT *, INSERT INTO table VALUE (...)they will not notice him either. Do not show or, accordingly, do not write the specified values into it. But if you mention this field by name - in any team - then it behaves like a normal visible field. It was conceived to expand the scheme so as not to break working (possibly also with closed source) applications. Hidden fields are added, old queries don’t see them, and new ones call them by name, and everything works.

Only we have somewhat expanded this idea by adding different “levels of invisibility”. The first level is as above, both in Oracle and DB2. Well, there is still zero - everything is visible, it is not interesting. The second level - the field is visible only in SELECTand only if explicitly mentioned by name. That is, it is not visible in INSERT/ UPDATE- its value cannot be changed. And it is not visible in CREATE/ ALTER- it cannot be created or deleted. It is created automatically. An example is a ROWID and other pseudo-fields. And the third level, the field is not visible at all anywhere. This was exactly what was needed to create invisible hash fields. And you can also use them for functional indexes of the type INDEX(a+b)- also, create an invisible virtual field and index it. And you can ... In general, when this bun appeared, the ideas were not long in coming.

As part of the previous task, this, too, is likely to hit only 10.3

Varun Gupta: SQL aggregate functions

Also an idea that has been considered for a long time. But really managed to formulate only to this GSOC. The standard says how to create custom SQL functions, and MariaDB, of course, knows how to:

CREATE FUNCTION COUNT_X(x INT) RETURNS INT
  RETURN (SELECT COUNT(*) FROM data_table WHERE value=x);

But these are ordinary functions. But you cannot create your aggregate functions in SQL. At least not in the standard. There is in Oracle, and in PostgreSQL, and, for example, in HSQLDB too. I wanted MariaDB to be. We thought about the syntax for a long time - there is no standard, everyone has their own way. Compared like others. As a result, they decided not to show off, and did as everyone else - that is, in their own way. Of course, our option is more natural and easiest. Probably those who invented this syntax in Oracle / PostgreSQL / HSQLDB also think that their option is the best.

It will work like this:

CREATE AGGREGATE FUNCTION agg_and(x INT) RETURNS INT
BEGIN
  DECLARE z INT DEFAULT 65535;
  DECLARE CONTINUE HANDLER FOR NOT FOUND RETURN z;
  LOOP
    FETCH GROUP NEXT ROW;
    SET z= (z&x);
  END LOOP;
END

For comparison, in order to understand where the legs grow from, here is a completely written standard non-aggregate function that calculates AND of all values of a column in the table:

CREATE FUNCTION col_and() RETURNS INT
BEGIN
  DECLARE x INT;
  DECLARE z INT DEFAULT 65535;
  DECLARE CONTINUE HANDLER FOR NOT FOUND RETURN z;
  DECLARE cur CURSOR FOR SELECT col FROM data_table;
  OPEN cur;
  LOOP
    FETCH cur INTO x;
    SET z= (z&x);
  END LOOP;
END

That is, the logic is the same, just a special cursor is passed to the aggregate function that goes through all the values of the group.

Unfortunately, this is also unlikely to fall into 10.2. Although the project is also almost complete.

This year, for some reason, there was not a single project completed, say, by three quarters. So to throw and feel sorry and inconvenient. All that they wanted was done. And that's great!

Looking forward to next year ...

Tags: