Target once told a man that his high-school-aged daughter was pregnant. Their model told them, based on the man’s purchases, that someone in his household was pregnant. And they wanted to make money off of him: they sent him coupons in the mail for baby products, hoping he’d buy some. He was furious, but Target was right.
This is a classic case of machine learning gone wrong. Not because the model was wrong, but because of how it was used. The steps leading to this family feud were simple:
- Target’s statistician analyzes the purchases of pregnant women.
- He (his last name is Pole) determines that 25 products are enough to assign a shopper a pregnancy prediction score.
- Target puts the model into production, with plans to send coupons to shoppers who are pregnant.
- The daughter goes to Target and buys some of these 25 items. She uses her dad’s loyalty card.
- Target labels the father’s account with a high pregnancy score.
- Target sends coupons to the father, and makes him very, very angry. But did he use the coupons?
This was 2012, and while this sort of thing is still done today, companies are a lot more careful about what they recommend. It’s a win-win situation. When companies recommend relevant, inoffensive items, shoppers are delighted, they buy more, and they become more loyal.
You can know a lot about someone based on what they buy.
Building a vegetarian classifier
In this post, I’ll share some of my findings from an analysis of Instacart’s 3 million orders dataset. I was interested in looking at the behaviour of vegetarians, and seeing if I could build a vegetarian model similar to Target’s pregnancy model.
I’ll give a rather abridged version of the analysis, but if you’re interested, the full Jupyter notebook can be found here.
Step 1: Label the data
The perfect dataset never exists. In a perfect world, I’d have a dataset with user IDs labelled as “vegetarian” or “not vegetarian.” Unfortunately, the world is very much not perfect, so I had to label the data myself.
I noticed there was a meat/seafood department, so I explicitly labelled shoppers as not vegetarian if they had ever bought from it. There was also a “deli” department, but I wasn’t sure that every item there was not vegetarian, and I didn’t want to spend the time to look at every product in the deli department.
Every shopper that was not labelled as “not vegetarian” was then labelled as “vegetarian.” This is a huge assumption, for a few reasons. First, just because someone hasn’t bought from the meat/seafood category at one of Instacart’s retailers doesn’t mean they don’t buy their meat somewhere else. Second, I didn’t filter for shoppers who simply haven’t shopped enough to have bought meat/seafood. I should have, but I didn’t. It’s a proof of concept after all 🙂
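As a rough sketch, the labelling rule could look like this in pandas. The toy `purchases` table, the `label_vegetarians` helper, and the column names are assumptions for illustration, not the notebook's actual code.

```python
import pandas as pd

# Toy stand-in for orders joined to products and departments (hypothetical rows).
purchases = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3],
    "department": ["produce", "meat seafood", "snacks", "beverages", "produce"],
})

def label_vegetarians(purchases: pd.DataFrame,
                      meat_department: str = "meat seafood") -> pd.Series:
    """Return a boolean Series indexed by user_id.

    True means the user never bought from `meat_department`.
    """
    # For each user: did any of their purchases come from the meat department?
    bought_meat = (
        purchases["department"].eq(meat_department)
        .groupby(purchases["user_id"])
        .any()
    )
    # Everyone who was not flagged as a meat buyer is labelled vegetarian.
    return (~bought_meat).rename("is_vegetarian")

labels = label_vegetarians(purchases)
print(labels)
```

A stricter version could also require a minimum number of orders before labelling anyone, which would address the second caveat above.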
Step 2: EDA
With the data labelled, it was time to do some fun stuff. Exploratory data analysis. EDA! (Yes, I know my plotting could improve). Here are some plots that I found interesting. I looked into a lot more, but these ones jumped out at me. For more figures, feel free to make your way over to the Jupyter notebook.
Number of purchases per day of week, for vegetarians vs non-vegetarians
Instacart doesn’t specify what “order_dow” (order day of week) 0 through 6 are. I don’t want to make any assumptions, but it’s interesting to note that vegetarians order on Day 1 much more than non-vegetarians.
Day of week vs hour of day: a heat map
It seems that vegetarians show a high affinity for shopping around 9 or 10 am on Day 0 and Day 1. Are vegetarians morning people? I can rationalize this: being a morning person and being a vegetarian are both viewed as healthy traits. Maybe these traits go hand in hand.
Department purchase proportions
Here we have the proportion of each group’s purchases, per department. We can see that about 13% of vegetarians’ purchases are from the snack department, and about 12.5% from the beverage department. Non-vegetarians, as a group, make less than 10% of their purchases from these two departments. I guess vegetarians like their snacks and beverages! It’s interesting to see here that non-vegetarians buy more produce than vegetarians. (Note that I excluded the meat/seafood department here. Vegetarians haven’t purchased from it, by definition).
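Proportions like these can be computed with a grouped, normalized count. Here is a minimal sketch on made-up rows; the `group` and `department` column names are assumptions, and the numbers it produces are toy values, not the figures quoted above.

```python
import pandas as pd

# One row per purchased item, already tagged with the shopper's group (toy data).
purchases = pd.DataFrame({
    "group": ["vegetarian"] * 4 + ["non-vegetarian"] * 4,
    "department": [
        "snacks", "snacks", "beverages", "produce",    # vegetarian purchases
        "produce", "produce", "snacks", "dairy eggs",  # non-vegetarian purchases
    ],
})

# Share of each group's purchases that falls in each department.
# Rows are groups, columns are departments, and each row sums to 1.
proportions = (
    purchases.groupby("group")["department"]
    .value_counts(normalize=True)
    .unstack(fill_value=0.0)
)
print(proportions.round(3))
```

Normalizing within each group matters here: the two groups have different sizes, so raw counts would not be comparable.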
Number of unique departments purchased from
This plot shows the number of departments vegetarians and non-vegetarians in the dataset have purchased from. We can see that vegetarians tend to purchase from fewer departments (the mean is around 9) compared to non-vegetarians. This makes sense, as vegetarians are inherently restricted in their purchases: you wouldn’t see a vegetarian buying from the meat department, obviously!
Average time between orders
It seems like non-vegetarians place orders more frequently than vegetarians. This is something I wouldn’t necessarily expect. Note the jump at 30 days in the figure. I think Instacart’s data caps average days between orders at 30, so “30 days” can be thought of as “>=30 days”.
Step 3: Building a Model
EDA told us that there really are some subtle differences between vegetarians and non-vegetarians. In terms of what vegetarians do “less” of compared to non-vegetarians, we know that they buy from fewer departments and shop less frequently. In terms of what they do “more” of compared to their omnivore counterparts, they buy more snacks and beverages, shop more on “Day 1” (whatever this day indicates), and shop earlier (can I count this as “more” still?). This makes me wonder if a model would be able to classify someone as a vegetarian based on these sorts of characteristics.
I used both individual user and user-department features. User features:
- Number of departments purchased from
- Number of unique departments purchased from
- Average time between orders
- Average time of day of order
- Average day of week of order
- Average order size
- Standard deviation of order size
- Reorder rate
User-department features (this results in a unique feature per department):
- Number of reorders from each department
- Average time between purchases in each department
I found other user-department features to be not so interesting. The reorder user-department feature is correlated with several other similar features (such as number of items purchased from each department).
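To make the two feature families concrete, here is a hedged sketch of how a few of them might be derived with pandas. The toy `orders` and `items` tables and all column names are assumptions for illustration (only `order_dow` is named in this post), and only a subset of the listed features is shown.

```python
import pandas as pd

# Toy order history (hypothetical rows and column names).
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_id": [10, 11, 12, 20, 21],
    "order_dow": [0, 1, 1, 3, 5],
    "order_hour_of_day": [9, 10, 11, 18, 20],
    "days_since_prior_order": [None, 7.0, 14.0, None, 30.0],
})
# One row per item in an order, with its department and a reorder flag.
items = pd.DataFrame({
    "order_id": [10, 10, 11, 12, 20, 21, 21],
    "department": ["produce", "snacks", "snacks", "produce",
                   "dairy eggs", "dairy eggs", "beverages"],
    "reordered": [0, 0, 1, 1, 0, 1, 0],
})

# User-level features straight from the orders table.
user_features = orders.groupby("user_id").agg(
    avg_days_between_orders=("days_since_prior_order", "mean"),
    avg_hour_of_day=("order_hour_of_day", "mean"),
    avg_day_of_week=("order_dow", "mean"),
)

# Attach user_id to every item so item-level data can be grouped per user.
items_with_user = items.merge(orders[["order_id", "user_id"]], on="order_id")
order_sizes = items_with_user.groupby(["user_id", "order_id"]).size()
by_user = items_with_user.groupby("user_id")

user_features["avg_order_size"] = order_sizes.groupby(level="user_id").mean()
user_features["std_order_size"] = order_sizes.groupby(level="user_id").std()
user_features["n_unique_departments"] = by_user["department"].nunique()
user_features["reorder_rate"] = by_user["reordered"].mean()

# User-department features: one column per department.
dept_reorders = items_with_user.pivot_table(
    index="user_id", columns="department", values="reordered",
    aggfunc="sum", fill_value=0,
).add_prefix("reorders_")

features = user_features.join(dept_reorders)
print(features.T)
```

The pivot is what produces "a unique feature per department": each department becomes its own `reorders_<department>` column.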
The only pre-processing that’s really important here is removing the “meat seafood” department from the data. We remove it because we used this column to define if someone is a vegetarian or not. Leaving this column in would let the model learn the label directly: it would learn that someone is a vegetarian if they have never purchased from the meat seafood department. We want to uncover other information in the data, so we’ll remove any trace of this department. Using this column to define our shoppers as vegetarian or not is likely problematic, but we have nothing else to work with. Labelled data would be nice!
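Removing "any trace" of the department means dropping every feature derived from it, not just one column. A small sketch, assuming the per-department feature names embed the department name (the column names below are hypothetical):

```python
import pandas as pd

# Hypothetical feature table: some columns are derived from the meat department.
features = pd.DataFrame({
    "reorders_produce": [3, 0],
    "reorders_meat seafood": [2, 0],
    "avg_days_between_meat seafood": [12.0, float("nan")],
    "reorder_rate": [0.6, 0.4],
})

# Any column built from the labelling department would leak the label.
leaky_columns = [col for col in features.columns if "meat seafood" in col]
X = features.drop(columns=leaky_columns)

print(leaky_columns)
print(list(X.columns))
```

Aggregate features such as the number of unique departments need the same care: they should be computed after the meat department's rows are excluded, otherwise they still carry a trace of it.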
44% of shoppers are vegetarian, as defined by having not purchased from the meat-seafood department. This is very high, but again, it’s a proof of concept. Let’s pretend that ~40% of the population here is actually vegetarian. I undersampled the majority class, the non-vegetarians, just for fun. SMOTEing (Using the Synthetic Minority Oversampling Technique) also gave similar results (keep reading for these results!).
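Random undersampling just means throwing away majority-class rows until the classes are the same size. Here is a minimal NumPy sketch; the `undersample_majority` helper is hypothetical and not necessarily how the notebook does it. (SMOTE itself is available in the separate imbalanced-learn package and is not shown here.)

```python
import numpy as np

def undersample_majority(X, y, seed=0):
    """Randomly drop rows so that every class has as many rows as the smallest one."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    # Sample n_keep row indices from each class without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=n_keep, replace=False)
        for cls in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

# Toy data: 4 "vegetarians" (1) and 6 "non-vegetarians" (0).
X = np.arange(20).reshape(10, 2)
y = np.array([1] * 4 + [0] * 6)
X_balanced, y_balanced = undersample_majority(X, y)
print(np.bincount(y_balanced))
```

Resampling should only ever touch the training split, so that the test set keeps the original class balance.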
Choosing a model + the results
I decided to use logistic regression for this binary classification problem. It’s a classic! I’m a believer of Occam’s Razor: the simplest solution is often the best solution. So let’s see how we did.
78% of our vegetarian/non-vegetarian shoppers in the test set are classified correctly. Considering we removed any data that indicates a purchase from the meat department, these results are good. This means that there are other details in the data, we could call them latent signals, that indicate that someone is a vegetarian. Maybe the deli department contains mostly meat products?
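For readers who haven't done this before, a fit-and-score loop with Scikit-learn might look like the sketch below. Everything here is synthetic: the two features, their distributions, and the class balance are assumptions made up for the example, so the accuracy it prints says nothing about the Instacart results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_shoppers = 400
y = rng.integers(0, 2, size=n_shoppers)  # 1 = vegetarian, 0 = non-vegetarian

# Synthetic features (assumed values): vegetarians get fewer departments
# and longer gaps between orders, loosely mirroring the EDA.
n_departments = rng.normal(loc=np.where(y == 1, 9, 12), scale=2.0)
days_between_orders = rng.normal(loc=np.where(y == 1, 18, 14), scale=4.0)
X = np.column_stack([n_departments, days_between_orders])

# Hold out a stratified test set so both classes appear in it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scale the features, then fit a plain logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)  # fraction of test shoppers classified correctly
print(f"test accuracy: {accuracy:.2f}")
```

Scaling inside the pipeline keeps the test set out of the scaler's statistics, which avoids a subtle form of leakage.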
As for other metrics, our recall (of all the positive cases, how many did the classifier identify or recall?) is 76%, and our precision is 79%. Interesting.
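If precision and recall are new to you, both fall out of three counts: true positives, false positives, and false negatives. A small worked example on made-up predictions (the labels below are toy values, not the model's output):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: of everyone predicted vegetarian, how many really are?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all actual vegetarians, how many did the classifier find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# 1 = vegetarian, 0 = non-vegetarian (toy labels).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 3 TP, 1 FP, 1 FN -> 0.75 0.75
```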
I built the logistic regression model using Scikit-learn and StatsModels, so that I could use both LIME and StatsModels’ “Summary” functionality for interpretation purposes. Some features that appear to be important (and both interpretability methods agree) are average days between orders, number of departments purchased from, and shopper reorder rate. That’s definitely interesting, given that we highlighted these as potentially important features during our EDA process.
It’s definitely interesting that a model as simple as the one outlined here is able to detect if someone is a vegetarian or not. I imagine this would be very beneficial information to have for product recommendations and targeted marketing purposes.
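A lighter-weight way to get a first look at feature importance, swapped in here in place of LIME and the StatsModels summary, is to rank standardized logistic regression coefficients by magnitude. The sketch below uses synthetic data where one feature drives the label and the other is pure noise; the feature names are only labels for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_shoppers = 500

# Synthetic data: the label depends on the first feature only.
informative = rng.normal(size=n_shoppers)
noise = rng.normal(size=n_shoppers)
y = (informative + 0.5 * rng.normal(size=n_shoppers) > 0).astype(int)

feature_names = ["avg_days_between_orders", "pure_noise"]  # illustrative names
# Standardize so coefficient magnitudes are comparable across features.
X = StandardScaler().fit_transform(np.column_stack([informative, noise]))

model = LogisticRegression().fit(X, y)

# Rank features by the absolute size of their coefficient.
ranked = sorted(
    zip(feature_names, model.coef_[0]),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, coef in ranked:
    print(f"{name}: {coef:+.2f}")
```

Coefficient size is only a rough guide when features are correlated with each other, which is one reason to cross-check it against a second method.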
There are several things I’d do with this project if I were to work on it more:
- Label the shoppers better. Could I use a survey to ask shoppers if they’re vegetarian or not? Is there any research I could leverage that would tell me what tendencies vegetarians have? This would help me better label the data.
- Minimize false negatives. I’d rather mistake an omnivore for a vegetarian, than think a vegetarian is a meat eater. Think about the case of product recommendations. I don’t want to recommend a steak to someone who doesn’t eat steak. But recommending broccoli to someone won’t offend them, unless they really hate vegetables.
- If time were not an issue, I’d spend more time feature engineering and investigating correlation between variables, and try out some different models. I tried XGBoost in another notebook, and found a 3% increase in accuracy. Not bad, but not great. Maybe I could augment the data with some other data source.
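On the false-negatives point in the list above: one simple lever is the decision threshold. Lowering it below the default 0.5 makes the model call more people vegetarian, which trades false negatives for false positives. A toy sketch with made-up probabilities (the helper names and numbers are assumptions for illustration):

```python
import numpy as np

def predict_with_threshold(probabilities, threshold=0.5):
    """Label a shopper vegetarian (1) when P(vegetarian) >= threshold."""
    return (np.asarray(probabilities) >= threshold).astype(int)

def count_errors(y_true, y_pred):
    """Return (false_negatives, false_positives) with 1 as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    false_negatives = int(np.sum((y_true == 1) & (y_pred == 0)))
    false_positives = int(np.sum((y_true == 0) & (y_pred == 1)))
    return false_negatives, false_positives

# Made-up predicted probabilities of being vegetarian.
y_true = np.array([1, 1, 1, 0, 0, 0])
p_vegetarian = np.array([0.90, 0.45, 0.35, 0.60, 0.40, 0.10])

fn_default, fp_default = count_errors(y_true, predict_with_threshold(p_vegetarian, 0.5))
fn_lenient, fp_lenient = count_errors(y_true, predict_with_threshold(p_vegetarian, 0.3))

print(fn_default, fp_default)  # threshold 0.5 -> 2 false negatives, 1 false positive
print(fn_lenient, fp_lenient)  # threshold 0.3 -> 0 false negatives, 2 false positives
```

In the recommendation setting this is the trade described above: a few more omnivores get broccoli suggestions so that fewer vegetarians get steak suggestions.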