Solving the problem of the second competition CUBRID it!

    Hi, Habrauser! I bring to your attention a solution to the problem that won the second competition CUBRID it! The essence of the competition is to find the most optimal solution to the SQL problem using Java or PHP. The solution is purely algorithmic, so even if you are not connected with CUBRID and the CUBRID it! Contest, then look under the cut anyway - it can be simply interesting and even useful. Go!

    Challenge of the competition

    There is a database, of course, CUBRID . The task of the competition is to find the value that is most often found in the database. Database tables use data types from a limited list. The value must be cast to string using standard database functions and the value should not consist only of numbers. A detailed statement of the problem can be found here .


    First, we need to get a list of columns and their types in all the tables in our database. To do this, we can use the PDO functions or the "native" CUBRID drivers .

    Getting a list of tables:
    cubrid_schema( $this->conn , CUBRID_SCH_CLASS );

    Having selected the necessary tables, we will make a request for each of them:
    $result = cubrid_execute( $this->conn , "SELECT * FROM {$table['NAME']} LIMIT 0;");
    $column_names = cubrid_column_names($result);
    $column_types = cubrid_column_types($result);

    $column_namesand $column_typesrespectively contain arrays with the names of the table fields and the data type of each of them.

    Next, we need to make a selection for each column. In general, the request will look like this:
    SELECT TO_CHAR(`fieldname`), count(*)
    FROM `tablename`
    GROUP BY `fieldname`

    We must make such a request for each column in the database. The query may vary slightly, depending on the data type of the column. As a result, we get a table of the form “ Value ” - “ Number of values ​​in a column ” sorted in descending order of the number of values. Example query result:

    This means that the value “ CUBRID ” appears in the table field 159 times, and “ 14.3 ” is only 50. The
    query we use is quite simple and very fast, with almost no checks, except in some cases. There are many such requests - one for each column in the database, but don’t be scared, we won’t load all these values ​​into PHP memory, we will work with a pointer to the result of the request. For convenience, we will save pointers to the results of queries in the form of an array.

    Now we must calculate what value is most often found in the tables of our database. To do this, we will take one query result from our array of pointers in a loop and extract the current row from it.

    For example, take three columnsField1 , Field2 , Field3 , from which the query resulted in the following data:

    The query result for field Field1 :

    Query result for field2 :

    Query result for field3 :

    The tables can be of any size, and accordingly the result that is obtained after the query can be a very large table.
    At each step of the cycle, we will read one row of these tables and enter them into a special array, so after the first traversal of the tables the result will look like this:
    Array('CUBRID' => 159, '14,3' => 160, 'HOLA!' => 100)

    At this stage, it is checked that the string does not consist only of numbers, according to the condition of the task.
    The resulting data we put in an array. For clarity, the current result will be shown in the form of a pivot table, which after the first crawl looks like this:
    ValueField1Field2Field3AmountPotential amount

    The data is entered into the array at each new iteration of the loop. A hyphen means that this value has not yet been encountered in this column, and accordingly it can still occur. But if it does occur, then its quantity will not be greater than the number of the previous one found in the same column, since our result tables are sorted in descending order of the number of values. This means that even if the “ CUBRID ” value is next in the “ Field2 ” column , its number will not be more than 160. “Potential amount” is how much this value can maximize, i.e. essentially a “hyphen sum”. The potential quantity for " CUBRID " is determined on the assumption that this value will be found in the next iteration in the fields " Field2"And" Field3 "with the maximum possible number (160 and 100), respectively.
    What do these data give us? They tell us how much this or that value can collect in total. Amount + Potential amount - this is the amount that theoretically could be in the end for the current value. Based on this number, we can make an assumption whether it can be the first in this list, or whether it can already be deleted and not be taken into account in the future. Next we will see how it works.

    After the next step (“ HOLA! ” - 80, “ CUBRID ” - 100, “ 14.3 ” - 10), the table looks as follows:
    ValueField1Field2Field3AmountPotential amount

    Pay attention to the value of “ 14.3 ”: its sum is 170 and potentially it can gain only 80. That is, in the most successful scenario, its number will be 250, which is already less than the current leader ( CUBRID - 259). So this value will never take first place in our list. It serves him right. We can safely remove the “14.3” value from the list and no longer take it into account. This is what we will do with all the meanings that cannot become leaders. Thus, we form the “Top” from those values ​​that have a chance to take first place.

    At the next step (“ 14.3 ” - 50, “ xyz ” - 90, “ xyz”- 5) we will see that now, any value of which there is no“ Top ”will not be able to score more than 145 (50 + 90 + 5), and therefore take the first place. We can stop our search on this and not look at the remaining values ​​in the tables, since these values ​​simply won’t get into the “Top”.
    Now that we have an exact list of candidates for leadership, it’s enough for us to find their number in the columns in which we have a hyphen, i.e. in those in which we may still have meaning. This is done with simple queries like:

    SELECT `Field3`, count(*)
    FROM `tablename`
    WHERE `Field3` = `CUBRID`
    GROUP BY `Field3`;

    By summing the number in different columns for each value, we find out the most common value in the database. Which, in fact, is what we need from the conditions of the problem.

    Thank you for your attention, some interesting links:
    The source code of the solution (GPL v2 license)
    An interesting decision of the winner of the Ekstazi
    competition The task of the competition (Eng)
    The announcement of the competition in Habré (Rus)
    The winners of the contest ( Eng , Rus )
    Documentation CUBRID

    Also popular now: