Do you like snacks? You may be a vegetarian

Target once told a man that his high-school-aged daughter was pregnant. Their model told them, based on purchases made on the man’s account, that someone in his household was pregnant. And they wanted to make money off of him: they sent him coupons in the mail for baby products, hoping he’d buy some. He was furious, but Target was right.

This is a classic case of machine learning gone wrong. Not because the model was wrong, but because of how it was used. The steps leading to this family feud were simple:

  1. Target’s statistician analyzes the purchases of pregnant women.
  2. He (his last name is Pole) determines that 25 products are enough to assign a shopper a pregnancy prediction score.
  3. Target puts the model into production, with plans to send coupons to shoppers who are pregnant.
  4. The daughter goes to Target and buys some of these 25 items, using her dad’s loyalty card.
  5. Target labels the father’s account with a high pregnancy score.
  6. Target sends coupons to the father, and makes him very, very angry. But did he use the coupons?
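
The steps above boil down to a weighted score over a shopper’s basket. Here is a minimal sketch of that idea. The product names, weights, and baseline are made up for illustration, and the logistic form is an assumption; none of this is Target’s actual model.

```python
import math

# Hypothetical weights for a few "signal" products (not Target's real list).
SIGNAL_WEIGHTS = {
    "unscented lotion": 1.2,
    "prenatal vitamins": 2.5,
    "cotton balls": 0.6,
    "zinc supplement": 0.9,
}
BIAS = -3.0  # hypothetical baseline: most shoppers are not pregnant


def pregnancy_score(purchases):
    """Return a score in (0, 1) from the distinct signal products in a basket."""
    z = BIAS + sum(SIGNAL_WEIGHTS.get(item, 0.0) for item in set(purchases))
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing of the weighted sum


# Every purchase on the card counts towards the account holder's score,
# no matter who in the household actually made it.
dads_card = ["beer", "unscented lotion", "prenatal vitamins", "cotton balls"]
print(round(pregnancy_score(dads_card), 2))
```

The score attaches to the loyalty card, not to a person, which is exactly how the father ended up with the coupons.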

This was 2012, and while this sort of thing is still done today, companies are a lot more careful about what they recommend. It’s a win-win situation. When companies recommend relevant, inoffensive items, shoppers are delighted, they buy more, and they become more loyal.

You can learn a lot about someone from what they buy.

Continue reading “Do you like snacks? You may be a vegetarian”

Made to Stick by Chip & Dan Heath [Book Review]

This is a New York Times best-selling book written by two brothers, Dan and Chip Heath. In this book they answer a question that a lot of us think about: why do some ideas gain traction, while others that seem better get ignored and fall by the wayside?

Here’s what you need to make your ideas SUCCESful and sticky. Make your ideas:

  1. Simple,
  2. Unexpected,
  3. Concrete,
  4. Credible,
  5. Emotional,
  6. and tell a Story.

Use SUCCES as your acronym and guide, and your ideas will be sticky: memorable, interesting, and understandable. They will have a long-lasting impact, and may even change your audience’s behaviour or opinion.

Continue reading “Made to Stick by Chip & Dan Heath [Book Review]”

Stop mocking me! Unit tests in PySpark using Python’s mock library


Unit testing is fundamental in software development and often overlooked by data scientists, but it is important. In this post, I’ll show how to do unit testing in PySpark using Python’s unittest.mock library. I’ll do this from a data scientist’s perspective, which to me means that I won’t go into the software engineering details. I present just what you need to know.
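
To give a flavour of the approach, here is a minimal sketch. The function `count_big_orders`, the file name, and the column `total` are hypothetical and only there for illustration. The idea is that a `MagicMock` stands in for the SparkSession, so the test checks your logic without starting Spark at all.

```python
from unittest.mock import MagicMock


def count_big_orders(spark, path, min_total):
    """Read orders from `path` and count those with total >= min_total."""
    orders = spark.read.parquet(path)
    return orders.filter(f"total >= {min_total}").count()


def test_count_big_orders():
    # MagicMock stands in for the SparkSession, so no cluster is needed.
    spark = MagicMock()
    filtered = spark.read.parquet.return_value.filter.return_value
    filtered.count.return_value = 3  # pretend three rows survive the filter

    result = count_big_orders(spark, "orders.parquet", 100)

    # The function returns whatever the (mocked) DataFrame reports...
    assert result == 3
    # ...and it called Spark with the arguments we expect.
    spark.read.parquet.assert_called_once_with("orders.parquet")
    spark.read.parquet.return_value.filter.assert_called_once_with("total >= 100")


test_count_big_orders()
```

A test like this only verifies how your code talks to Spark, not that the filter expression itself is right, so it complements rather than replaces a test on a small real DataFrame.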

First, a (semi) relevant clip from Family Guy:

Continue reading “Stop mocking me! Unit tests in PySpark using Python’s mock library”

Customer Lifetime Value in PySpark

Customer lifetime value (CLV) is a metric that represents the monetary value of a customer relationship. It’s used to estimate the total net profit a company can make from a given customer. Some sources say customer lifetime value is the “single most important metric for understanding your customers.” (Ok, the site that says that is “”; they’re not biased at all!) Other sources (yes, I’ll count myself as a source) say that it’s not as clear. Modelling CLV can be very simple if you want it to be: it could simply be the product of average purchase frequency, average purchase value, and length of the customer relationship. The assumption here is that all these values are static throughout a customer’s lifetime. Assumptions always make things easier, but often too simple. A more complex way of calculating CLV requires calculating a retention rate and inferring a decay rate. This more complex equation is:

where n is the length of the relationship (in units of t), t is the time period, r is the customer retention rate, P is the profit a customer contributes in a period, and d is the discount rate (I’m not an expert, but it seems like the discount rate depends on company size). The customer retention rate r is unique to each customer, and can be calculated from a shopper’s recency (R), monetary (M), and frequency (F) values.
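
To make the two approaches concrete, here is a minimal sketch in plain Python (not PySpark). The discounted version assumes the commonly used form CLV = sum over t = 1..n of P * r^t / (1 + d)^t with constant P and r; the starting index and the function names are my assumptions for illustration.

```python
def simple_clv(avg_purchase_value, purchases_per_period, n_periods):
    """Static CLV: value per purchase x purchases per period x number of periods."""
    return avg_purchase_value * purchases_per_period * n_periods


def discounted_clv(profit, retention, discount, n_periods):
    """Sum of P * r**t / (1 + d)**t for t = 1..n (indexing from 1 is assumed)."""
    return sum(
        profit * retention**t / (1 + discount) ** t
        for t in range(1, n_periods + 1)
    )


# A customer worth 100 per period, over 5 periods, under each model.
print(simple_clv(50, 2, 5))
print(discounted_clv(profit=100, retention=0.8, discount=0.1, n_periods=5))
```

With a retention rate of 1 and a discount rate of 0, the discounted version collapses to profit times number of periods, which is a handy sanity check.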

Continue reading “Customer Lifetime Value in PySpark”

Some Thoughts On Similarity Metrics

I spend a lot of time thinking about similarity.

Think about when you meet someone. You want to make an impression on them. You want them to like you. Here’s a tip: find something in common with them. Show them that you two are similar. Or to put it another way, you want to show them that you’re not different in some respect. (Oftentimes in math the notion of “similar” is thought of as being “not different”.) It doesn’t matter that they’re right wing and you lean left. You both love hockey and soccer, so there you go! You’re still similar. And what if your only thing in common is that you’ve both been to Paris? That could still work: you just need to think of “similarity” as based on shared travel destinations.

Mathematically inclined or not, when you meet someone, you essentially go through countless similarity measures, compute, compare, and take the best result. Similarity is a fun concept. You can define it however you want. But some ways are better than others.
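
Here is a minimal sketch of that “compute, compare, take the best” routine. I’m using Jaccard similarity (size of the overlap divided by size of the union) as one possible measure; the topics and interests are made up for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here: two empty sets share nothing
    return len(a & b) / len(a | b)


# Hypothetical profiles for two people who have just met.
you = {
    "politics": {"left"},
    "sports": {"hockey", "soccer"},
    "travel": {"Paris", "Tokyo"},
}
them = {
    "politics": {"right"},
    "sports": {"hockey", "soccer"},
    "travel": {"Paris"},
}

# One similarity score per topic, then pick the most flattering one.
scores = {topic: jaccard(you[topic], them[topic]) for topic in you}
best_topic = max(scores, key=scores.get)
print(scores, best_topic)
```

Changing the definition of similarity (which topic, which measure) changes the answer, which is the whole point.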

Continue reading “Some Thoughts On Similarity Metrics”

Ontario Politics: Sentiment Analysis Predicts Horwath Win

The candidates: Kathleen Wynne (Liberal), Doug Ford (PC Party) and Andrea Horwath (NDP).

Last year I successfully predicted the election of John Horgan of the BC NDP. This prediction was based on sentiment analysis of Reddit users in the r/Vancouver and r/BritishColumbia subreddits, using the VADER package in Python. With the Ontario provincial election coming up on June 7th, I thought it’d be fun to predict who will be elected, according to sentiment analysis. So instead of r/Vancouver and r/BritishColumbia, I took a look at r/Toronto and r/Ontario. Other than that, nothing has changed. (Not even my figures, which maybe should change!) For more details on sentiment analysis and the caveats of this method, please see my post from last year.
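
The general shape of this kind of prediction can be sketched as follows. This is not my actual pipeline: the comments are invented, the candidates are placeholders, and `toy_score` is a tiny stand-in for a real sentiment scorer so that the example runs without extra packages.

```python
from collections import defaultdict
from statistics import mean


def mean_sentiment(comments, score):
    """comments: iterable of (candidate, text); score: text -> float in [-1, 1]."""
    by_candidate = defaultdict(list)
    for candidate, text in comments:
        by_candidate[candidate].append(score(text))
    return {candidate: mean(values) for candidate, values in by_candidate.items()}


# Toy stand-in scorer: counts a few positive and negative words.
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "terrible", "awful"}


def toy_score(text):
    words = text.lower().split()
    hits = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return max(-1.0, min(1.0, hits / 3))  # clamp to [-1, 1]


# Invented comments about placeholder candidates.
comments = [
    ("Candidate A", "great plan and good debate"),
    ("Candidate A", "bad answer"),
    ("Candidate B", "terrible idea"),
]
averages = mean_sentiment(comments, toy_score)
predicted = max(averages, key=averages.get)
print(averages, predicted)
```

With the vaderSentiment package installed, the scorer could instead return `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`, which is also a value between -1 and 1.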

Continue reading “Ontario Politics: Sentiment Analysis Predicts Horwath Win”

Group bike(share) Rides in Toronto

This post is a continuation of bike share data analysis in Toronto. See the first post, here. The Jupyter notebook for this post is here.

When I travel to a new city, I always try to take advantage of the bike share program there. Biking is a great way to get around and discover a city as a tourist, and bike shares are so convenient and affordable. I have always travelled with a partner, never on my own. I thought it’d be interesting to take a look at bike share ridership in Toronto, to see if there are tourists who travel like me: as pairs or groups who explore the city by bike share.

Yonge-Dundas square: a popular bike share stop.

Continue reading “Group bike(share) Rides in Toronto”

China’s Social Credit System: Coming for us all?

In 2014, China proposed something that would be a great hackathon project. But this isn’t just a fun data science experiment, something that I’d love to work on. This is a real proposal by the government that will be put into action in 2020. This is their Social Credit System. Think credit score, but not just for loan applications. This is an everything score, and will affect all aspects of life, ranging from buying vacation packages and paying private school tuition, to boarding planes and high-speed trains. With a bad credit score, you’ll be barred from doing any of these.

24 million people live in Shanghai. Imagine how much data that is to deal with! And that’s just about 1.7% of China’s 1.4 billion.

I became extremely aware of the state of technology in China, in particular smartphone use, through interactions with my Chinese friends and colleagues in Hong Kong, and through my trips to the Mainland. In my opinion, they’re ahead of us. There’s a bit of a smartphone obsession, something unmatched in the West. This obsession isn’t a bad thing. In fact, I think it’s great (sometimes). Imagine paying for everything from parking and car insurance to electricity, to groceries and knick-knacks at the market, all with your phone, all with one app. In China, this isn’t just possible, it’s the norm. It’s a great use of technology. We all carry our phones with us every day, so why not just do away with our wallets? One less thing to remember when leaving the house.

Continue reading “China’s Social Credit System: Coming for us all?”

Nightlife in Toronto, According to Bike Share Ridership

I came across the blog I Quant NY while listening to an episode of Partially Derivative, where the author, Ben Wellington, was interviewed. He has quite a few posts based on Citi Bike data in New York. Turns out you can learn a lot from bike share ridership!

Being new to Toronto, I thought I’d follow the lead of I Quant NY and do my own analysis based on Toronto’s Bike Share data, which is openly available on Toronto’s Open Data Catalogue. The question I want to answer, as a new Torontonian? Where do people go to party?

Continue reading “Nightlife in Toronto, According to Bike Share Ridership”

Life and Death in Shanghai by Nien Cheng [Book Review]

You need to read this book. This isn’t made up. It’s a real account of Nien Cheng’s life during the Cultural Revolution. Nien Cheng was kept imprisoned for over 6 years at Number One Detention House in Shanghai, from 1966 to 1973. During that time she was interrogated, beaten, and left isolated, all while being pushed to “confess” to being an “imperialist spy.” Why this treatment? Because she was western educated, and the widow of the former manager of a foreign firm in Shanghai. She wasn’t guilty of anything, and refused to produce a false confession, despite the quite literal torture. She made it out alive, miraculously, but her daughter wasn’t so lucky: she was beaten to death on the streets of Shanghai by the Red Guard. All because of who her family was. Nien Cheng was never the same after the death of her daughter. It’s so sad.

Continue reading “Life and Death in Shanghai by Nien Cheng [Book Review]”