Group with condition

    Periodically, a task arises that requires grouping the data set with the condition that for attributes not participating in the grouping, you need to take a tuple with a maximum value in one of the fields.

    Let's look at a simple example.
    There is a table:
      article INT(4) ZEROFILL NOT NULL,
      dealer VARCHAR(45) NOT NULL,
      price DECIMAL(8,2) NOT NULL,
      PRIMARY KEY (id))
    ENGINE = InnoDB;

    It is necessary for all article to find a dealer with the maximum price.

    There are several obvious and simple solutions for this task, but I know one of them, which is significantly superior to all others.
    Faced this challenge? Want to see a new way to solve it? I ask for cat.

    Even the official documentation of did not pass this task , and 3 solutions are proposed:
    Before each request I will indicate the index and the time of its execution. The table is filled in for 100,000 entries
            DECLARE i INT;
            SET i = 1;
            WHILE i <= 100000 DO
                INSERT INTO shop (article, dealer, price) VALUES (CEIL(RAND() * 9999), CEIL(RAND() * 999), RAND() * 9999);
                SET i = i + 1;
            END WHILE;

    First idx (article) 2,169 c:

    SELECT article, dealer, price
    FROM   shop s1
    WHERE  price=(SELECT MAX(s2.price)
        FROM shop s2
        WHERE s1.article = s2.article);

    Second idx (article, price) 0.203 c

    SELECT s1.article, dealer, s1.price
    FROM shop s1
    JOIN (
        SELECT article, MAX(price) AS price
        FROM shop
        GROUP BY article
    ) AS s2 ON s1.article = s2.article AND s1.price = s2.price;

    Third idx (article, price) 0.593 c

    SELECT s1.article,, s1.price
    FROM shop s1
    LEFT JOIN shop s2 ON s1.article = s2.article AND s1.price < s2.price
    WHERE s2.article IS NULL;

    Well, now my solution:

    Attention! Use this method at your own risk! In future versions of MySQL, grouping behavior may change.

    This decision is based on the fact that group by for attributes that are not specified group operations and which do not participate in the grouping, takes the first value encountered. Thus, if the data set is pre-sorted, then we get tuples with the desired maximum value.

    4. idx (price) 0.328 c

    SELECT article, dealer, price
    FROM (
        SELECT article, dealer, price 
        FROM shop
        ORDER BY price desc) 
    as t
    GROUP BY article

    Because Since the previous examples were without any sorting, and group by automatically adds it, you must specify ORDER BY NULL so that the data is not further sorted, otherwise the results will not be comparable.
    But why do we need to create an intermediate table, because we can get the sorted data using the index:
    5. idx (article, price) 0.110 c

    SELECT article, dealer, price
    FROM shop use index (idx)
    GROUP BY article DESC

    Bonus Solution:

    The solution was found on the Mitch Dickinson blog . It does not claim to be the fastest, but very original.

    6. idx (article) 0.202 s

    SELECT article, 
           SUBSTRING_INDEX(GROUP_CONCAT(dealer ORDER BY price DESC),',',1) AS dealer,
           MAX(price) AS price
    FROM shop
    GROUP BY article;

    In the comments, dm9 gave another 1 solution, which was described in the documentation for earlier versions:
    SELECT article,
        SUBSTRING( MAX( CONCAT(LPAD(price,6,'0'),dealer) ), 7) AS dealer,
        0.00+LEFT(MAX( CONCAT(LPAD(price,6,'0'),dealer) ), 6) AS price
    FROM   shop
    GROUP BY article;

    In general, the article is devoted not so much to solving a specific problem, but to examining an example of how you can go beyond the framework of generally accepted solutions and try to find alternative ways. For me, databases have always been interesting precisely because there is the opportunity to find interesting optimization methods that can significantly save time and effort.

    PS: An attentive reader probably noticed that methods 4-6 for each article give only 1 supplier with the maximum price, unlike the first methods in which all suppliers are returned. But in solving this problem, I was interested in any of the suppliers, so this problem was insignificant.

    PPS: An alternative method, which is proposed in this article, performs well with average table sizes. With the number of records more than a million, method 2 would be most optimal. Moreover, if the number of records is already so large, I highly recommend that this information be precalculated in separate tables.

    Also popular now: