e-nformation
Don't know yet
Content > Programming > SQL > MySQL Top N in each group (group inner limit)

MySQL Top N in each group (group inner limit)

MySQL Top N in each group (group inner limit)

While working on this site , I was confronted with a problem.
In the "Browse Content" I wanted to display a list of categories with each having up to 3
of the latest articles related to it , this turned out to be a harder task than I originally thought.
I'll go ahead and display what I did and link to some solution I encountered on the way(not all fit MySql unfortuneately).

Assuming our DB layout is pretty simple regarding content and categories:
Each article has it's own ID , a field marking the date it was published and a field indicating under which category it was published.
From that we can create the following query:

SELECT
  article_id,
  article_parent_category,
  article_date
FROM
  articles AS a
WHERE
  (
    SELECT
      COUNT(*)
    FROM
      articles AS a2
    WHERE
      a2.article_parent_category, = a.article_parent_category
      AND a2.article_date > a.article_date
   ) < 3


I'll go ahead and explain.
Basically wer'e querying for all the required fields(id, parent category ,date),
Going over the first row wer'e doing a second count query that matches the first row's parent category
and has a larger date value(more recently published), if we have less than 3 rows left that match the same category
with more recent dates we return that row, this process is repeated.
For better understanding let me illustrate.

Let's look at 5 rows with the parent category of 1:
ID | Parent Category | Date
1  | 1               | 3/5/09
2  | 1               | 4/5/09
3  | 1               | 5/5/09
4  | 1               | 6/5/09
5  | 1               | 7/5/09

Going over the first row we try to count how many rows are there that have the same parent category but are more recent , we find 4
this doesn't match our filter since we specified < 3, we move to the second row and count how many rows are there with the same parent category
that are more recent and we get 3, still not good. On the third row we all only have two left (IDs: 4 and 5) , this matches our condition so we return the row,
the same is done with the following rows , the fourth returns 1 as the count query and the fifth returns 0, so we get 3 rows from this category that are the most recent from the whole subset,
This process is done for each category assuming we have more than 5 rows and more than one parent category.

I must note that I have no idea how it scales since it wasn't really tested on large subsets of data , and I know it's pretty basic but it serves my needs , I hope it serves yours as you can taylor it to your needs.

Some recommendations I found while looking for an answer:

  • This site contains more suggestions on how to perform a "Top N Per Group" query , and eventually the one I based my query on Xaprb
  • Another site describing how it can be done on SQL Server 2005 - SQLTeam
Comments