Calculating mode in Hive

Mean is overrated. As a statistical measure, and as a way of being (it’s easiest to just be nice!). Just as mean is overrated, mode is overlooked. Sometimes I don’t want the mean of my data, I want the mode, or the median. And for some reason, this is hard to do with Apache Hive. To explain what I mean by this, let’s get our hands dirty with a dataset. (Should I say let’s clean our teeth with a dataset, since the picture I chose below is toothpaste? Nah, that would be weird).

Cleaning our teeth (?) with some data

One of my favourite toothpastes. Sponsor me please, Tom’s!

Consider the case where we have order data for a particular item. Specifically, we have a dataframe with 2 columns: number_of_orders, and item_id. Let’s say for simplicity’s sake that we’re dealing with a single item, say, toothpaste (to yes, clean your teeth with), and each (number_of_orders,item_id) pair tells us the number of times someone has purchased that item. If our population includes myself and my friend Hanna, and I have purchased toothpaste 13 times, and Hanna has purchased toothpaste 55 times, then two rows of our dataframe using Pandas would look like:

(I use Redshift to change the colour of my screen- it seems to affect screenshots too!).

The entire distribution looks like:

The mode in our distribution is the peak we see in the bar chart. This value is 191, so let’s see how we can go about calculating this in Hive.

It’s simple (with Pandas) yet tricky (with Hive)

Why would we want to know the mode of the dataset? Because we want to know what the most common number of purchases is for toothpaste, and we want to avoid our statistic being affected by outliers.

We could easily find the mode if we were using Python and Pandas, by simply doing: df.number_of_orders.mode().

However, if we want to work with a table in Hive, it’s a bit more difficult. Hive has nice functions such as FLOOR(), EXP(), SUM(), SIN(), AVG(), STDDEV(), PERCENTILE() but no “MODE()“. I had to figure it out myself, so here’s what I did.

First, I created a view that gets the count of each occurrence of number of orders, for each item (recall that for simplicity’s sake we’re just working with a single item: toothpaste). This was obtained via the following query:

CREATE OR REPLACE TEMP VIEW new_data AS
SELECT
     COUNT(*) as occurrence,
number_of_orders,
item_id

FROM data
GROUP BY number_of_orders, item_id

Recall that in the example, I’ve bought toothpaste 13 times, and Hanna has bought toothpaste 55 times. Let’s say that her brother has also bought toothpaste 13 times. Assuming these are the only cases where someone has bought toothpaste 13 and 55 times, then 2 rows of our view would look like:

Next we have a query where we select the max number of times of occurrences of a particular value of number_of_orders (in the table above our max should be 2, corresponding to 13 number_of_orders). This first select is:

SELECT
     item_id,
     MAX(occurrence) AS max_occurrence
FROM new_data
GROUP BY item_id

However, just this MAX() select alone would tell us that the max value in the “occurrence” column, but this doesn’t help us. This select alone would give us:

Similarly,

SELECT
     item_id,
     MAX(occurrence) AS max_occurrence,
     number_of_orders
FROM new_data
GROUP BY item_id, number_of_orders

would give us:

This isn’t what we want.

To get our desired result, we perform an inner join of this data with the original data itself, joining on this max occurrence value, so that we can get the value of “number_of_orders” that corresponds to this max. The full query is:

SELECT
     max_df.item_id,
     d.occurrence,
     d.number_of_orders
FROM
    (
    SELECT
         item_id,
         MAX(occurrence) AS max_occurrence
    FROM new_data
    GROUP BY item_id
    ) max_df
INNER JOIN new_data d
ON d.item_id = m.item_id
AND d.occurrence = max_df.max_occurrence

and the resultant table is:

So the most most common number of orders of toothpaste in our dataset is 7, and we’ve seen this 191 times.

Why so difficult?

When I was searching around for a way to calculate mode in Hive, I was surprised at just how difficult it was! But why is it so hard? Maybe it’s because Hive is a relatively new tool, and the developers haven’t gotten to it yet. Maybe we’ll see MODE() as a function in the near future. That’d be great!

Leave a Reply

Your email address will not be published. Required fields are marked *