找回密码
 新猫注册
查看: 562|回复: 0

How to select the first/least/max row per group in SQL

[复制链接]
kernel 发表于 2008-8-9 09:34:27 | 显示全部楼层 |阅读模式
Here are some common SQL problems, all of whichhave related solutions: how do I find the most recent log entry foreach program? How do I find the most popular item from each category?How do I find the top score for each player? In general, these types of“select the extreme from each group” queries can be solved with thesame techniques. I’ll explain how to do that in this article, includingthe harder problem of selecting the top N entries, not just the top 1.
This topic is related to numbering rows, which I just wrote about (see my articles about MySQL-specific and generic techniquesto assign a number to each row in a group). Therefore I’ll use nearlythe same table and data as I used in those articles, with the additionof a price column:
+--------+------------+-------+
| type   | variety    | price |
+--------+------------+-------+
| apple  | gala       |  2.79 |
| apple  | fuji       |  0.24 |
| apple  | limbertwig |  2.87 |
| orange | valencia   |  3.59 |
| orange | navel      |  9.36 |
| pear   | bradford   |  6.05 |
| pear   | bartlett   |  2.14 |
| cherry | bing       |  2.55 |
| cherry | chelan     |  6.33 |
+--------+------------+-------+
Selecting the one maximum row from each groupLet’s say I want to select the most recent log entry for eachprogram, or the most recent changes in an audit table, or something ofthe sort. This question comes up over and over on IRC channels andmailing lists. I’ll re-phrase the question in terms of fruits. I wantto select the cheapest fruit from each type. Here’s the desired result:
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | fuji     |  0.24 |
| orange | valencia |  3.59 |
| pear   | bartlett |  2.14 |
| cherry | bing     |  2.55 |
+--------+----------+-------+There are a few common solutions to this problem.  All involve two steps: finding the desired value of price, and then selecting the rest of the row based on that.
One common solution is a so-called self-join. Step one is to groupthe fruits by type (apple, cherry etc) and choose the minimum price:
select type, min(price) as minprice
from fruits
group by type;
+--------+----------+
| type   | minprice |
+--------+----------+
| apple  |     0.24 |
| cherry |     2.55 |
| orange |     3.59 |
| pear   |     2.14 |
+--------+----------+Step two is to select the rest of the row by joining these resultsback to the same table. Since the first query is grouped, it needs tobe put into a subquery so it can be joined against the non-groupedtable:
select f.type, f.variety, f.price
from (
   select type, min(price) as minprice
   from fruits group by type
) as x inner join fruits as f on f.type = x.type and f.price = x.minprice;
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | fuji     |  0.24 |
| cherry | bing     |  2.55 |
| orange | valencia |  3.59 |
| pear   | bartlett |  2.14 |
+--------+----------+-------+Another common way to do this is with a correlated subquery. Thiscan be much less efficient, depending on how good your system’s queryoptimizer is. You might find it clearer, though.
select type, variety, price
from fruits
where price = (select min(price) from fruits as f where f.type = fruits.type);
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | fuji     |  0.24 |
| orange | valencia |  3.59 |
| pear   | bartlett |  2.14 |
| cherry | bing     |  2.55 |
+--------+----------+-------+Both queries are logically equivalent, though they may not perform the same.
Select the top N rows from each groupThis is a slightly harder problem to solve.  Finding a single row from each group is easy with SQL’s aggregate functions (MIN(), MAX(),and so on). Finding the first several from each group is not possiblewith that method because aggregate functions only return a singlevalue. Still, it’s possible to do.
Let’s say I want to select the two cheapest fruits from each type.  Here’s a first try:
select type, variety, price
from fruits
where price = (select min(price) from fruits as f where f.type = fruits.type)
   or price = (select min(price) from fruits as f where f.type = fruits.type
      and price > (select min(price) from fruits as f2 where f2.type = fruits.type));
+--------+----------+-------+
| type   | variety  | price |
+--------+----------+-------+
| apple  | gala     |  2.79 |
| apple  | fuji     |  0.24 |
| orange | valencia |  3.59 |
| orange | navel    |  9.36 |
| pear   | bradford |  6.05 |
| pear   | bartlett |  2.14 |
| cherry | bing     |  2.55 |
| cherry | chelan   |  6.33 |
+--------+----------+-------+Yuck! That can be written as a self-join, but it’s just as bad (Ileave it as an exercise for the reader). This gets worse as you go tohigher numbers (top 3, top 4…). There are other ways to phrase thestatement, but they all boil down to the same thing, and they’re allpretty unwieldy and inefficient.
There’s a better way:  select the variety from each type where the variety is no more than the second-cheapest of that type.
select type, variety, price
from fruits
where (
   select count(*) from fruits as f
   where f.type = fruits.type and f.price < fruits.price
) <= 2;This is elegant, and lets you vary N without rewriting your query (avery good thing!), but it’s functionally the same as the previousquery. Both are essentially a quadratic algorithm relative to thenumber of varieties in each type. And again, some query optimizers maynot do well with this and make it quadratic with respect to the numberof rows in the table overall (especially if no useful index isdefined), and the server might get clobbered. Are there better ways?Can it be done with one pass through the data, instead of the manypasses required by a correlated subquery? You know it can, or Iwouldn’t be writing this, now would I?
Use UNIONIf there’s an index on (type, price), and there are many morerecords to eliminate than to include in each group, a more efficientsingle-pass method (especially for MySQL, but also for some other RDBMSs) is to break the queries out separately and put a limit on each, then UNION them all back together.  Here’s the syntax you need for MySQL:
(select * from fruits where type = 'apple' order by price limit 2)
union all
(select * from fruits where type = 'orange' order by price limit 2)
union all
(select * from fruits where type = 'pear' order by price limit 2)
union all
(select * from fruits where type = 'cherry' order by price limit 2)Peter Zaitsev has written in detail about this technique, so I won’t go into it too much more here.  If it suits your purposes, it can be a very good solution.
One note: use UNION ALL, not just UNION.It prevents the server sorting the results to eliminate duplicatesbefore returning them. In this case there will be no duplicates, so I’mtelling the server to skip that (useless, expensive) step.
Do it with user variables on MySQLThe UNION trick is an especially good idea when theresults are a small fraction of the rows in the table and there is anindex that can be used for sorting the rows. Another linear-timetechnique, which might be a good option in cases where you areselecting most of the rows from the table anyway, is user variables.This is MySQL-specific. Please refer to my previous post on how to number rows in MySQL for the gory details of why this works:
set @num := 0, @type := '';

select type, variety, price
from (
   select type, variety, price,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
  from fruits
  order by type, price
) as x where x.row_number <= 2;This isn’t one pass through the table, by the way. The subquery isimplemented as a temporary table behind the scenes, so filling it withdata is one pass; then selecting every row from it and applying the WHEREclause is another. However, twice through is still O(n) with respect tothe table size. That’s a lot better than correlated subqueries, whichare O(n2) with respect to the group size — even moderategroup sizes cause bad performance (say there are five varieties of eachfruit. That’s on the order of 25 passes through the table, all told).
One-pass technique on MySQL… maybe?If you want to leave your queries up the the query optimizer’swhims, you can try this technique, which builds no temporary tables andmakes just one pass through:
set @num := 0, @type := '';

select type, variety, price,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
from fruits
group by type, price, variety
having row_number <= 2;This theoretically ought to work if MySQL orders by the GROUP BYcriteria, which it sometimes does for efficiency and to produce theexpected results. Does it work? Here’s what it returns on MySQL 5.0.27on Windows:
+--------+----------+-------+------------+--------+
| type   | variety  | price | row_number | dummy  |
+--------+----------+-------+------------+--------+
| apple  | gala     |  2.79 |          1 | apple  |
| apple  | fuji     |  0.24 |          3 | apple  |
| orange | valencia |  3.59 |          1 | orange |
| orange | navel    |  9.36 |          3 | orange |
| pear   | bradford |  6.05 |          1 | pear   |
| pear   | bartlett |  2.14 |          3 | pear   |
| cherry | bing     |  2.55 |          1 | cherry |
| cherry | chelan   |  6.33 |          3 | cherry |
+--------+----------+-------+------------+--------+Look closely… it’s returning rows one and three from each group, andthey’re not numbered in order of increasing price? Huh? But the HAVING clause says the row_number should be no greater than 2!  Here’s what it returns on version 5.0.24a on Ubuntu:
+--------+------------+-------+------------+--------+
| type   | variety    | price | row_number | dummy  |
+--------+------------+-------+------------+--------+
| apple  | fuji       |  0.24 |          1 | apple  |
| apple  | gala       |  2.79 |          1 | apple  |
| apple  | limbertwig |  2.87 |          1 | apple  |
| cherry | bing       |  2.55 |          1 | cherry |
| cherry | chelan     |  6.33 |          1 | cherry |
| orange | valencia   |  3.59 |          1 | orange |
| orange | navel      |  9.36 |          1 | orange |
| pear   | bartlett   |  2.14 |          1 | pear   |
| pear   | bradford   |  6.05 |          1 | pear   |
+--------+------------+-------+------------+--------+Look, this time everything is numbered 1 and every row is returned.  Wonky.  This is exactly what the MySQL manual page on user variables warns about.
This technique is pretty much non-deterministic, because it relieson things that you and I don’t get to control directly, such as whichindexes MySQL decides to use for grouping. However, if you need to useit — and I know there are some folks out there who do, because I’veconsulted for them — you can still tweak it. We’re getting into therealm of really bastardizing SQL, but the results above came from atable without indexes other than the primary key on (type, variety).  What happens if I add an index MySQL can use for grouping?
alter table fruits add key(type, price);Nothing changes, and EXPLAIN shows why: the querydoesn’t use the index I just added. Why? Because the grouping is onthree columns, and the index is only on two. In fact, the query isusing a temp table and filesort anyway, so this is still not achievingthe once-through goal. I can force it to use the index:
set @num := 0, @type := '';

select type, variety, price,
      @num := if(@type = type, @num + 1, 1) as row_number,
      @type := type as dummy
from fruits force index(type)
group by type, price, variety
having row_number <= 2;Let’s see if that works:
+--------+----------+-------+------------+--------+
| type   | variety  | price | row_number | dummy  |
+--------+----------+-------+------------+--------+
| apple  | fuji     |  0.24 |          1 | apple  |
| apple  | gala     |  2.79 |          2 | apple  |
| cherry | bing     |  2.55 |          1 | cherry |
| cherry | chelan   |  6.33 |          2 | cherry |
| orange | valencia |  3.59 |          1 | orange |
| orange | navel    |  9.36 |          2 | orange |
| pear   | bartlett |  2.14 |          1 | pear   |
| pear   | bradford |  6.05 |          2 | pear   |
+--------+----------+-------+------------+--------+Ah, now we’re cooking! It did what I wanted, without a filesort ortemporary table. Another way to do this, by the way, is to take variety out of the GROUP BY so it uses the index on its own.  Because this selects a non-grouped column from a grouped query, this only works if you are running with ONLY_FULL_GROUP_BY mode turned off, which I hope you are not doing without good reason.
Other methodsBe sure to check the comments for user-contributed methods. Thereare some really novel approaches. I always learn so much from yourcomments… thank you!
ConclusionWell, that’s it. I’ve shown you several ways of solving the common“get the extreme row from each group” query, and then moved on to howyou can get the top N rows from each group in various ways. Then I doveinto MySQL-specific techniques which some (including myself, dependingon my mood) would regard as mildly foolish to utterly stupid. But ifyou need the last bit of speed out of your server, you sometimes haveto know when to break the rules. And for those who think this is justMySQL foolishness, it’s not; I’ve seen people desperately do thesetypes of things on other platforms too, such as SQL Server. There arehacks and tweaks on every platform, and people who need to use them.
I hope you’ve enjoyed and profited from this discussion.  If you did, maybe you’d like to subscribe for convenient notification of future articles.  Happy coding!



http://www.xaprb.com/blog/2006/1 ... w-per-group-in-sql/
您需要登录后才可以回帖 登录 | 新猫注册

本版积分规则

手机版|小黑屋|[漫猫]动漫论坛

GMT+8, 2024-11-23 17:31

Powered by Discuz! X3.5

© 2001-2024 Discuz! Team.

快速回复 返回顶部 返回列表