Recently I had to provide a script to convert a dataset of raw assessment scores into an Oracle table with the scores ordered by percentile rank. This is a common request so I figured a short, non-technical post on percentile ranking might be helpful to a lot of people.
There are three main ways to calculate percentile ranks in Oracle 9i or later:
- (a) Calculate them manually with Joins and Sub-queries;
- (b) Use the CUME_DIST function and format the results as a percentage;
- (c) Use the PERCENT_RANK function.
In this post I'm going to focus on options (b) and (c) because the only people who would reinvent the wheel and use (a) are SQL programmers who don't know (b) and (c) already exist.
CUME_DIST and PERCENT_RANK are built-in Oracle mathematical functions that allow you to rank a value based on its relative standing within a set of values. For example, if you hear someone say that a 1600 SAT score was in the 99th percentile (meaning 99% of all the other scores in that administration of the test were lower) a ranking formula is what tells you so.
The two functions take different approaches to determining a percentile rank and you should understand the basic difference.
CUM_DIST determines a percentile rank by calculating the ratio of the number of rows that have a lesser or equal ranking to the total number of rows in the partition.
PERCENT_RANK determines a percentile rank by setting the lowest value (that is, the first row returned by the query) equal to 0 and assigning all the remaining rows with this formula:
(n-1)/(m-1) where n is the nth row in a partition of m records.
There are other differences between the two functions but a detailed review is way beyond the promised scope of this post. However, you will do well to remember four points:
- The two functions are similar, but (unlike many books lead you to believe) they are not identical. That's why they return different answers when you use them side-by-side in the same query.
- CUME_DIST returns a position of a row and PERCENT_RANK returns a rank of a row.
- CUME_DIST always excludes 0 and PERCENT_RANK always includes it.
- In most cases where your goal is to rank records by percentile, PERCENT_RANK is the function you want to use.
Both functions can be used in two forms: aggregate or analytic. Use the aggregate form when you want to find the percentile rank of one particular record in the database according to some criteria you specify. Use the analytic form when you want to find the percentile ranks of a group of records in the database. You can spot the aggregate form because the SELECT statement will contain a WITHIN GROUP outer table join. The analytic form will use a PARTITION BY clause instead.
The aggregate form uses this syntax:
PERCENT_RANK (expression) WITHIN GROUP
(ORDER BY order_by_clause [ASC|DESC] [NULLS FIRST|LAST] );
Here's a simple example:
SELECT PERCENT_RANK (100000000, 1000000)
WITHIN GROUP (ORDER BY total_gross, star_salary) "Percentile Rank"
from movies WHERE movie_year IN ('2017');
The above SQL statement will return (from a table called Movies) the percentile rank of a particular 2017 movie that grossed $100 million and paid its starring actor $1 million.
The analytic form uses this syntax:
PERCENT_RANK () OVER
([PARTITION BY query_partition_clause] ORDER BY order_by_clause);
Here's an example of that:
SELECT movie_name, movie_year, movie_type, total_gross,
PERCENT_RANK () OVER (PARTITION BY movie_type
ORDER BY total_gross DESC) "Percentile Rank"
WHERE movie_year IN ('2017');
The above SQL statement will return a table listing the percentile rankings of all movies released in 2017 according to their total box office gross.
As a closing note, remember that when using the aggregate form the number and datatypes of expressions inside the first parenthesis must match the number and datatypes of expressions inside the second parenthesis.
Now go rank some data.