Dzirtik March 12, 2014 at 11:40

Group with condition

Periodically, a task arises that requires grouping the data set with the condition that for attributes not participating in the grouping, you need to take a tuple with a maximum value in one of the fields.

Let's look at a simple example.
There is a table:

CREATE TABLE IF NOT EXISTS shop (
  id INT NOT NULL AUTO_INCREMENT,
  article INT(4) ZEROFILL NOT NULL,
  dealer VARCHAR(45) NOT NULL,
  price DECIMAL(8,2) NOT NULL,
  PRIMARY KEY (id))
ENGINE = InnoDB;

It is necessary for all article to find a dealer with the maximum price.

There are several obvious and simple solutions for this task, but I know one of them, which is significantly superior to all others.
Faced this challenge? Want to see a new way to solve it? I ask for cat.

Even the official documentation of mysql.com did not pass this task , and 3 solutions are proposed:
Before each request I will indicate the index and the time of its execution. The table is filled in for 100,000 entries

DELIMITER $$
CREATE PROCEDURE InsertRand()
    BEGIN
        DECLARE i INT;
        SET i = 1;
        START TRANSACTION;
        WHILE i <= 100000 DO
            INSERT INTO shop (article, dealer, price) VALUES (CEIL(RAND() * 9999), CEIL(RAND() * 999), RAND() * 9999);
            SET i = i + 1;
        END WHILE;
        COMMIT;
    END$$
DELIMITER ;

First idx (article) 2,169 c:

SELECT article, dealer, price
FROM   shop s1
WHERE  price=(SELECT MAX(s2.price)
    FROM shop s2
    WHERE s1.article = s2.article);

Second idx (article, price) 0.203 c

SELECT s1.article, dealer, s1.price
FROM shop s1
JOIN (
    SELECT article, MAX(price) AS price
    FROM shop
    GROUP BY article
) AS s2 ON s1.article = s2.article AND s1.price = s2.price;

Third idx (article, price) 0.593 c

SELECT s1.article, s1.dealer, s1.price
FROM shop s1
LEFT JOIN shop s2 ON s1.article = s2.article AND s1.price < s2.price
WHERE s2.article IS NULL;

Well, now my solution:

Attention! Use this method at your own risk! In future versions of MySQL, grouping behavior may change.

This decision is based on the fact that group by for attributes that are not specified group operations and which do not participate in the grouping, takes the first value encountered. Thus, if the data set is pre-sorted, then we get tuples with the desired maximum value.

4. idx (price) 0.328 c

SELECT article, dealer, price
FROM (
    SELECT article, dealer, price 
    FROM shop
    ORDER BY price desc) 
as t
GROUP BY article
ORDER BY NULL;

Because Since the previous examples were without any sorting, and group by automatically adds it, you must specify ORDER BY NULL so that the data is not further sorted, otherwise the results will not be comparable.
But why do we need to create an intermediate table, because we can get the sorted data using the index:

5. idx (article, price) 0.110 c

SELECT article, dealer, price
FROM shop use index (idx)
GROUP BY article DESC
ORDER BY NULL;

Bonus Solution:

The solution was found on the Mitch Dickinson blog . It does not claim to be the fastest, but very original.

6. idx (article) 0.202 s

SELECT article, 
       SUBSTRING_INDEX(GROUP_CONCAT(dealer ORDER BY price DESC),',',1) AS dealer,
       MAX(price) AS price
FROM shop
GROUP BY article;

In the comments, dm9 gave another 1 solution, which was described in the documentation for earlier versions:

SELECT article,
    SUBSTRING( MAX( CONCAT(LPAD(price,6,'0'),dealer) ), 7) AS dealer,
    0.00+LEFT(MAX( CONCAT(LPAD(price,6,'0'),dealer) ), 6) AS price
FROM   shop
GROUP BY article;

In general, the article is devoted not so much to solving a specific problem, but to examining an example of how you can go beyond the framework of generally accepted solutions and try to find alternative ways. For me, databases have always been interesting precisely because there is the opportunity to find interesting optimization methods that can significantly save time and effort.

PS: An attentive reader probably noticed that methods 4-6 for each article give only 1 supplier with the maximum price, unlike the first methods in which all suppliers are returned. But in solving this problem, I was interested in any of the suppliers, so this problem was insignificant.

PPS: An alternative method, which is proposed in this article, performs well with average table sizes. With the number of records more than a million, method 2 would be most optimal. Moreover, if the number of records is already so large, I highly recommend that this information be precalculated in separate tables.

Tags: