Target once told a man that his high-school-aged daughter was pregnant. Based on the man’s purchases, their model predicted that someone in his household was pregnant. And they wanted to make money off of him: they sent him coupons in the mail for baby products, hoping he’d buy some. He was furious, but Target was right.
This is a classic case of machine learning gone wrong. Not because the model was wrong, but because of how it was used. The steps leading to this family feud were simple:
Target’s statistician analyzes the purchases of pregnant women.
He (his last name is Pole) determines that 25 products are enough to assign a shopper a pregnancy prediction score.
Target puts the model into production, with plans to send coupons to shoppers who are pregnant.
The daughter goes to Target and buys some of these 25 items. She uses her dad’s loyalty card.
Target labels the father’s account with a high pregnancy score.
Target sends coupons to the father, and makes him very, very angry. But did he use the coupons?
This was 2012, and while this sort of thing is still done today, companies are a lot more careful about what they recommend. It’s a win-win situation. When companies recommend relevant, inoffensive items, shoppers are delighted, buy more, and become more loyal.
You can know a lot about someone based on what they buy.
After answering some wine-related questions with data (aka Part 1), I wanted to see if I could use linear models from scikit-learn to predict wine prices. The features I chose to use were the description (encoded with term frequency-inverse document frequency), the points, the province the wine is from, and its variety.
I used ridge regression (alpha = 0.1), lasso (alpha = 0.5) and regression with 16 principal components. Ridge regression, with and without standardizing the variables, ended up being the best model of all of these, with a variance score of 0.31 in both cases.
However, the residuals showed heteroskedasticity, so the linear models weren’t able to capture everything that was going on. I’ve embedded the Jupyter notebook below, but you can also check it out here. It’ll be formatted better.
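For reference, here is a minimal sketch of how a variance score like the 0.31 above can be computed. It assumes the score is the explained variance (the quantity scikit-learn exposes as `explained_variance_score`), which is a guess, since the metric isn't named here. The prices are made up purely for illustration.

```python
from statistics import pvariance


def explained_variance(y_true, y_pred):
    """Return 1 - Var(residuals) / Var(y_true), using population variances."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / pvariance(y_true)


# Hypothetical wine prices and model predictions (not from the real data set).
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

score = explained_variance(actual, predicted)
print(round(score, 3))
```

A score of 1 means the predictions account for all of the variation in price, while a score of 0 is what you would get by always predicting the average price.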
Another week, another data set to play around with. This time I took a look at a data set from Kaggle that was originally from the site “Wine Enthusiast.” There were a few questions I wanted to answer. I’ll list them below, with their answers and the corresponding plots.
What country has the best wine? What country has the worst?
England has the best wines on average. I didn’t expect this! I definitely thought France or the States would come first. Isn’t Californian wine supposed to be great? As for the worst wines, it seems like South Korea isn’t that great at winemaking. Still, an average of just above 80 points is pretty good.
My uncle is a physicist, and recently was asked to speak with a girl who just graduated with a Bachelor’s degree in Engineering Physics. It sounds super impressive, but this girl had no idea what sort of jobs to look for, given her background. It made me think, what if there was a tool that would let you upload your resume, and then you would be recommended jobs you should apply to, based on your qualifications? This would definitely help someone like the girl my uncle talked to. (This may already exist, but it’s a great project regardless!)
The idea of this is two-fold:
First, an application like this could help those on their job hunt, in particular young university graduates who aren’t sure how to make use of their degree. It could help them know what sort of role to look into, and also where to look, geographically.
Second, this could help hiring managers, who know how to describe the role they want to fill, but don’t know the name of the role. This would avoid companies asking for a software engineer, for example, when the job description is really asking for a data analyst, or something along those lines.
This is preliminary work, where I explore the first part of the idea. I use web scraping to create a recommendation system for data scientists seeking new employment. I do this because I’m an aspiring data scientist, so it’s interesting for me to see what jobs “match” my resume best. I’ll use term frequency-inverse document frequency (TF-IDF) vectors, with cosine similarity as my measure of how close a job posting is to my resume.
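To make the matching idea concrete, here is a small sketch of TF-IDF with cosine similarity in plain Python. It is an illustration only, not the code from the project: the resume and job postings are invented, the tokenizer just splits on whitespace, and the smoothed IDF formula is one common choice among several.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical resume and job postings.
resume = "python machine learning statistics data analysis"
postings = [
    "data scientist python machine learning",
    "line cook kitchen experience required",
    "data analyst statistics sql",
]

vectors = tfidf_vectors([resume] + postings)
scores = [cosine_similarity(vectors[0], v) for v in vectors[1:]]
# Posting indices ordered from best match to worst.
ranked = sorted(range(len(postings)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

The postings that share the most distinctive words with the resume score highest, and a posting with no words in common scores zero.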
I’m a fan of real world applications of mathematical concepts. Abstract algebra? Data security. Navier-Stokes equations? Hurricane modelling. Machine learning? Churn prediction.
Ok, so maybe it isn’t hard to see that ML would be useful in the real world. In fact, the reason I’m interested in ML and data science in general is that working in this field is a great way to apply math concepts to real world problems. I’m always looking out for projects I can do to build my data science skills, and churn prediction is one I’ve been wanting to do for a while. I was fortunate to get some real (and very messy!) data from the Vancouver Symphony Orchestra. I decided to see if I could use ML to predict customer churn.
One problem with machine learning is that it’s hard to explain what’s going on in the models under the hood. Imagine you’re a business analyst trying to explain the outcome of a model to your not-so-techy boss. If your model is a decision tree, that’s fine. It’s essentially a bunch of if-else statements, and that makes it easy for your boss to follow and understand how the model arrived at its given outcome. But a support vector machine? A neural network? Not so much.
Fortunately, there’s LIME, or Local Interpretable Model-Agnostic Explanations. As the documentation states, “Lime is able to explain any black box text classifier, with two or more classes.”
So let’s see how LIME helps us in the case of churn prediction.
Part 1: How good is a guru? A sort of thought experiment
There are so many stock gurus nowadays, whether that’s someone with a following on the order of O(10) or O(10000) people. It’s easy to make a blog post or write an article stating a long or short position on a given stock, and it’s easy for a reader, or thousands of readers, to take said position without a grain of salt. How do we know these gurus are right?
I propose a project where we check the accuracy of so-called guru forecasts. This post is taken from my projects page, but I decided to include it on the blog since it’s quite blog-like.
The idea of the project is as follows. For a given guru, get all the forecasts they have made over the entire history of their forecasting career. Get the historical prices of the stocks in question. Check if their predictions were correct. Easy enough, right? Actually, not really.
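The naive version of that check might look something like the sketch below. Everything in it is hypothetical: the `Forecast` record, the tickers, and the prices are invented for illustration, and it simply takes the evaluation horizon as given, which is one of the choices a real analysis would have to justify.

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    """One guru call; all fields are hypothetical illustration data."""
    ticker: str
    position: str  # "long" or "short"
    price_at_call: float
    price_at_horizon: float


def was_correct(forecast):
    """A long call is right if the price rose; a short call if it fell."""
    if forecast.position == "long":
        return forecast.price_at_horizon > forecast.price_at_call
    if forecast.position == "short":
        return forecast.price_at_horizon < forecast.price_at_call
    raise ValueError(f"unknown position: {forecast.position!r}")


def hit_rate(forecasts):
    """Fraction of forecasts that turned out to be correct."""
    if not forecasts:
        raise ValueError("need at least one forecast")
    return sum(was_correct(f) for f in forecasts) / len(forecasts)


# Made-up tickers and prices.
calls = [
    Forecast("AAA", "long", 100.0, 120.0),   # price rose: correct
    Forecast("BBB", "short", 50.0, 40.0),    # price fell: correct
    Forecast("CCC", "long", 30.0, 25.0),     # price fell: wrong
    Forecast("DDD", "short", 80.0, 70.0),    # price fell: correct
]

rate = hit_rate(calls)
print(rate)
```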
I’ve seen sentiment analysis of Trump’s tweets done all over the internet. It makes sense: no president, political figure, or anyone else for that matter has used Twitter the way Trump does. According to these analyses of Trump’s tweets, his sentiment is just all over the place. But you don’t need a text processing tool to see that; you can tell just by listening to his speeches or reading the news. It made me think, though: what do the tweets of my Prime Minister, Justin Trudeau, look like? Are his tweets more balanced in their sentiment? Is there a difference between his tweets in English versus his tweets in French? Let’s take a look!
For this analysis I’ll look at some basic tweet features: source, “retweet” and “favourite” counts, and overall sentiment, for both English and French tweets. While the source and sentiment of a tweet give us insight into Trudeau, the retweet and favourite counts speak on behalf of Canadians. Through this analysis we’ll get to learn a bit more about our Prime Minister, and a bit more about ourselves.
This post originally appeared on Medium.com on May 5, 2017.
The candidates: The NDP’s John Horgan, the Liberals’ Christy Clark and the Green Party’s Andrew Weaver. Adapted from the Vancouver Courier. Photo by Dan Toulgoet.
Sentiment analysis of Reddit posts using machine learning techniques suggests that NDP leader John Horgan will win the B.C. election this Tuesday, May 9th. The study suggests that while Reddit users in the r/Vancouver and r/BritishColumbia subreddits post more about Christy Clark and the B.C. Liberal Party, they hold a significantly more positive sentiment towards Horgan and the B.C. New Democratic Party.
What is sentiment analysis?
Sentiment analysis, also known as opinion mining, aims to identify the feeling and attitude of a speaker or writer in a given text. The technique can determine whether a text is positive, negative, or neutral based on the words used by the writer. Words with positive sentiment push the text toward a positive score; words with negative sentiment push it toward a negative one. A sentiment analysis program can therefore quickly and efficiently scan vast numbers of comments, posts, or pages and classify them by the sentiment they express towards a topic.
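As a rough illustration of the word-scoring idea, here is a minimal lexicon-based sketch. It is not the method used in the study, which relied on machine learning techniques; the six-word lexicon and the plus-one/minus-one weights are invented for the example.

```python
import re

# Tiny hypothetical lexicon: +1 for positive words, -1 for negative words.
LEXICON = {
    "great": 1, "good": 1, "love": 1,
    "bad": -1, "terrible": -1, "hate": -1,
}


def sentiment_score(text):
    """Sum the lexicon weights of every word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0) for word in words)


def classify(text):
    """Label a text as positive, negative, or neutral from its total score."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("I love this great plan"))
print(classify("A terrible, bad idea"))
print(classify("The election is on Tuesday"))
```

Real tools use far larger lexicons or trained models and handle negation ("not good"), which this sketch ignores.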