如何从每个组中选择前5% ?

时间:2022-04-12 12:02:58

I have a sample table like this:

我有这样一个示例表:

CREATE TABLE #TEMP(Category VARCHAR(100), Name VARCHAR(100))

INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'John')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Adam')
INSERT INTO #TEMP VALUES('A', 'Lisa')
INSERT INTO #TEMP VALUES('A', 'Lisa')
INSERT INTO #TEMP VALUES('A', 'Bucky')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Lily')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Tom')
INSERT INTO #TEMP VALUES('B', 'Ross')
INSERT INTO #TEMP VALUES('B', 'Ross')
INSERT INTO #TEMP VALUES('B', 'Ross')

SELECT Category, Name, COUNT(Name) Total
FROM #TEMP
GROUP BY Category, Name
ORDER BY Category, Total DESC

DROP TABLE #TEMP

Gives me the following:

给我以下几点:

A   John    6
A   Adam    4
A   Lisa    2
A   Bucky   1
B   Lily    5
B   Tom     4
B   Ross    3

Now, how do I select the TOP 5 PERCENT records from each category assuming each category has more than 100 records (did not show in sample table here)? For instance, in my actual table, it should remove the John record from A and Lily record from B as appropriate (again, I did not show the full table here) to get:

现在,我如何从每个类别中选择前5%的记录,假设每个类别有超过100条记录(在这里的样本表中没有显示)?例如,在我的实际表中,它应该将John记录从A和Lily记录中删除(再次,我没有在这里显示完整的表):

A   Adam    4
A   Lisa    2
A   Bucky   1
B   Tom     4
B   Ross    3

I have been trying to use CTEs and PARTITION BY clauses but cannot seem to achieve what I want. It removes the TOP 5 PERCENT from the overall result but not from each category. Any suggestions?

我一直在尝试使用cte和按子句划分,但似乎无法实现我想要的。它从总体结果中除去前5%,但不是每个类别。有什么建议吗?

4 个解决方案

#1


14  

You could use a CTE (Common Table Expression) paired with the NTILE windowing function - this will slice up your data into as many slices as you need, e.g. in your case, into 20 slices (each 5%).

您可以使用CTE(公共表表达式)和NTILE窗口功能(这将把您的数据分割成您需要的许多片),例如在您的情况中,将数据分割成20片(每片5%)。

;WITH SlicedData AS
(
   SELECT Category, Name, COUNT(Name) Total,
            NTILE(20) OVER(PARTITION BY Category ORDER BY COUNT(Name) DESC) AS  'NTile'
   FROM #TEMP
   GROUP BY Category, Name
)
SELECT *
FROM SlicedData
WHERE NTile > 1

This basically groups your data by Category,Name, orders by something else (not sure if COUNT(Name) is really the thing you want here), and then slices it up into 20 pieces, each representing 5% of your data partition. The slice with NTile = 1 is the top 5% slice - just ignore that when selecting from the CTE.

它基本上是按类别、名称、订单等其他东西(不确定计数(名称)是否真的是您想要的)对数据进行分组,然后将其分割成20个部分,每个部分代表您的数据分区的5%。具有NTile = 1的切片是前5%的切片——在CTE中选择时忽略它。

See:

看到的:

for more info

更多信息

#2


1  

Edit: I have added the second solution

编辑:我添加了第二个解决方案

SELECT   b.Id
        ,b.Category
        ,b.Name
        ,b.CategoryNameCount
FROM
(
        SELECT   a.Id   
                ,a.Category
                ,a.Name
                ,COUNT(*)OVER(PARTITION BY a.Category, a.Name) CategoryNameCount
                ,COUNT(*)OVER(PARTITION BY a.Category) CategoryCount
        FROM    #TEMP a
) b
WHERE   b.CategoryCount*5.0/100 > b.CategoryCount*b.CategoryNameCount*1.0/100
ORDER BY b.Category, b.CategoryNameCount DESC, b.Name

Results:

结果:

Id          Category Name       CategoryNameCount
----------- -------- ---------- -----------------
7           A        Adam       4
8           A        Adam       4
9           A        Adam       4
10          A        Adam       4
11          A        Lisa       2
12          A        Lisa       2
13          A        Bucky      1
19          B        Tom        4
20          B        Tom        4
21          B        Tom        4
22          B        Tom        4
23          B        Ross       3
24          B        Ross       3
25          B        Ross       3

or

SELECT   b.Category, b.Name, b.CategoryNameCount
FROM
(
        SELECT  
                 a.Category
                ,a.Name
                ,COUNT(*)OVER(PARTITION BY a.Category, a.Name) CategoryNameCount
                ,COUNT(*)OVER(PARTITION BY a.Category) CategoryCount
        FROM    #TEMP a
) b
WHERE   b.CategoryCount*5.0/100 > b.CategoryCount*b.CategoryNameCount*1.0/100
GROUP BY b.Category, b.Name, b.CategoryNameCount
ORDER BY b.Category, b.CategoryNameCount DESC, b.Name

Results:

结果:

Category Name       CategoryNameCount
-------- ---------- -----------------
A        Adam       4
A        Lisa       2
A        Bucky      1
B        Tom        4
B        Ross       3

#3


1  

select Category,name,CountTotal,RankSeq,(50*CountTotal)/100 from (
select Category,name,COUNT(*)
over (partition by Category,name ) as CountTotal,
ROW_NUMBER()
over (partition by Category,name order by Category) RankSeq from #TEMP
--group by Category,Name 
) temp
where RankSeq <= ((50*CountTotal)/100)
order by Category,Name,RankSeq

Output:

输出:

Category    name     CountTotal RankSeq     50*CountTotal)/100
A           Adam     4          1           2
A           Adam     4          2           2
A           John     6          1           3
A           John     6          2           3
A           John     6          3           3
A           Lisa     2          1           1
B           Lily     5          1           2
B           Lily     5          2           2
B           Ross     3          1           1
B           Tom      4          1           2
B           Tom      4          2           2

I hope this helps :)

我希望这能有所帮助。

#4


0  

;WITH SlicedData AS
(
   SELECT Category, Name, COUNT(Name) Total,
            **PERCENT_RANK() OVER(PARTITION BY Category ORDER BY COUNT(Name) DESC) * 100** AS  'Percent'
   FROM #TEMP
   GROUP BY Category, Name
)
SELECT *
FROM SlicedData
WHERE Percent < 5

NTile will not work if number of records is less than your tile number.

如果记录的数量小于磁砖的数量,则NTile不能工作。

#1


14  

You could use a CTE (Common Table Expression) paired with the NTILE windowing function - this will slice up your data into as many slices as you need, e.g. in your case, into 20 slices (each 5%).

您可以使用CTE(公共表表达式)和NTILE窗口功能(这将把您的数据分割成您需要的许多片),例如在您的情况中,将数据分割成20片(每片5%)。

;WITH SlicedData AS
(
   SELECT Category, Name, COUNT(Name) Total,
            NTILE(20) OVER(PARTITION BY Category ORDER BY COUNT(Name) DESC) AS  'NTile'
   FROM #TEMP
   GROUP BY Category, Name
)
SELECT *
FROM SlicedData
WHERE NTile > 1

This basically groups your data by Category,Name, orders by something else (not sure if COUNT(Name) is really the thing you want here), and then slices it up into 20 pieces, each representing 5% of your data partition. The slice with NTile = 1 is the top 5% slice - just ignore that when selecting from the CTE.

它基本上是按类别、名称、订单等其他东西(不确定计数(名称)是否真的是您想要的)对数据进行分组,然后将其分割成20个部分,每个部分代表您的数据分区的5%。具有NTile = 1的切片是前5%的切片——在CTE中选择时忽略它。

See:

看到的:

for more info

更多信息

#2


1  

Edit: I have added the second solution

编辑:我添加了第二个解决方案

SELECT   b.Id
        ,b.Category
        ,b.Name
        ,b.CategoryNameCount
FROM
(
        SELECT   a.Id   
                ,a.Category
                ,a.Name
                ,COUNT(*)OVER(PARTITION BY a.Category, a.Name) CategoryNameCount
                ,COUNT(*)OVER(PARTITION BY a.Category) CategoryCount
        FROM    #TEMP a
) b
WHERE   b.CategoryCount*5.0/100 > b.CategoryCount*b.CategoryNameCount*1.0/100
ORDER BY b.Category, b.CategoryNameCount DESC, b.Name

Results:

结果:

Id          Category Name       CategoryNameCount
----------- -------- ---------- -----------------
7           A        Adam       4
8           A        Adam       4
9           A        Adam       4
10          A        Adam       4
11          A        Lisa       2
12          A        Lisa       2
13          A        Bucky      1
19          B        Tom        4
20          B        Tom        4
21          B        Tom        4
22          B        Tom        4
23          B        Ross       3
24          B        Ross       3
25          B        Ross       3

or

SELECT   b.Category, b.Name, b.CategoryNameCount
FROM
(
        SELECT  
                 a.Category
                ,a.Name
                ,COUNT(*)OVER(PARTITION BY a.Category, a.Name) CategoryNameCount
                ,COUNT(*)OVER(PARTITION BY a.Category) CategoryCount
        FROM    #TEMP a
) b
WHERE   b.CategoryCount*5.0/100 > b.CategoryCount*b.CategoryNameCount*1.0/100
GROUP BY b.Category, b.Name, b.CategoryNameCount
ORDER BY b.Category, b.CategoryNameCount DESC, b.Name

Results:

结果:

Category Name       CategoryNameCount
-------- ---------- -----------------
A        Adam       4
A        Lisa       2
A        Bucky      1
B        Tom        4
B        Ross       3

#3


1  

select Category,name,CountTotal,RankSeq,(50*CountTotal)/100 from (
select Category,name,COUNT(*)
over (partition by Category,name ) as CountTotal,
ROW_NUMBER()
over (partition by Category,name order by Category) RankSeq from #TEMP
--group by Category,Name 
) temp
where RankSeq <= ((50*CountTotal)/100)
order by Category,Name,RankSeq

Output:

输出:

Category    name     CountTotal RankSeq     50*CountTotal)/100
A           Adam     4          1           2
A           Adam     4          2           2
A           John     6          1           3
A           John     6          2           3
A           John     6          3           3
A           Lisa     2          1           1
B           Lily     5          1           2
B           Lily     5          2           2
B           Ross     3          1           1
B           Tom      4          1           2
B           Tom      4          2           2

I hope this helps :)

我希望这能有所帮助。

#4


0  

;WITH SlicedData AS
(
   SELECT Category, Name, COUNT(Name) Total,
            **PERCENT_RANK() OVER(PARTITION BY Category ORDER BY COUNT(Name) DESC) * 100** AS  'Percent'
   FROM #TEMP
   GROUP BY Category, Name
)
SELECT *
FROM SlicedData
WHERE Percent < 5

NTile will not work if number of records is less than your tile number.

如果记录的数量小于磁砖的数量,则NTile不能工作。