I’ve been reading up on investment banking lately, both for work and out of interest. The first chapter in Investment Banking: Valuation, Leveraged Buyouts, and Mergers and Acquisitions by Joshua Pearl and Joshua Rosenbaum talks about comparable companies analysis, also called trading comps. It’s one way that investment bankers go about valuing private companies that are about to go public, or companies that may merge with or be acquired by another company, for example.
Automating comparable companies analysis
The process seems straightforward: it involves data collection, then essentially comparing the “target” (the company you’d like to value) to those in a “comparables universe.” It’s all manual work, with the most difficult part being amassing the universe of comparable companies. The rest essentially involves simple addition, subtraction, multiplication, and division. And it makes me think: why can’t we just automate it with machine learning?
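To make the arithmetic concrete, here’s a toy sketch of the multiples step of a trading comps analysis in Python. The companies, the numbers, and the choice of EV/EBITDA as the multiple are all invented for illustration, not taken from the book:

```python
# Toy sketch of a trading-comps calculation (illustrative numbers only).
# Each comparable is valued with an EV/EBITDA multiple; the target is
# then valued by applying the universe's median multiple to its EBITDA.

from statistics import median

# Hypothetical comparables universe: (enterprise value, EBITDA), in $M
comps = {
    "CompA": (1200, 150),
    "CompB": (980, 140),
    "CompC": (2100, 240),
}

multiples = {name: ev / ebitda for name, (ev, ebitda) in comps.items()}
median_multiple = median(multiples.values())

target_ebitda = 180  # hypothetical target, in $M
implied_ev = median_multiple * target_ebitda

print(round(median_multiple, 2), round(implied_ev, 1))  # 8.0 1440.0
```

The mechanical part really is this simple; the judgment calls are in picking the universe and the multiple.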
In 2003, Amazon revealed their secret to the world: their item-item based collaborative filtering algorithm! In their paper, Amazon.com Recommendations Item-to-Item Collaborative Filtering, which has ~6000 citations on Google Scholar as of October 2019, they share the details on their recommender system that drove their website’s revenue for years.
There are 3 interesting problems solved in this paper:
how to represent items
how to compute similarity of items, and
how to scale the solution.
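As a preview of the first two sub-problems, here’s a minimal sketch of the core idea: represent each item by the set of users who bought it, and compare items with cosine similarity. The purchase data is invented, and Amazon’s production system layers heavy optimizations for scale on top of this:

```python
# Minimal item-to-item similarity sketch: items are binary vectors over
# the users who bought them, compared with cosine similarity.

import math

# Hypothetical purchase data: item -> set of user ids who bought it
purchases = {
    "book": {1, 2, 3, 4},
    "bookmark": {2, 3, 4, 5},
    "blender": {6, 7},
}

def cosine_sim(item_a, item_b):
    """Cosine similarity of two items' binary user vectors."""
    a, b = purchases[item_a], purchases[item_b]
    overlap = len(a & b)
    return overlap / (math.sqrt(len(a)) * math.sqrt(len(b)))

print(cosine_sim("book", "bookmark"))  # 0.75: three shared buyers
print(cosine_sim("book", "blender"))   # 0.0: no shared buyers
```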
And, the most interesting part of this is: they patented it.
Before we discuss the patent, and ask some questions, let’s step through each sub-problem and how they solved it.
Before I became a data scientist, I spent a lot of time Googling “How to get a job as a data scientist” and browsing r/DataScience. Maybe you’re trying to become a data scientist, too, and you’ve somehow landed on my blog. Welcome!
I’m writing this article for 2 reasons:
I want to help you! I was lucky to receive a lot of help from generous members of the data science community when I was first starting out. This article is one way I can give back to people who are in the position I was in. We all have to start somewhere!
I find I often give the same advice over and over again to people who reach out to me to chat, so I’d like to put my thoughts down in one place. Hopefully this article can be helpful to you in some way.
Feel free to reach out to me if you have any further questions and if I can help in any way!
I’m an organizer of A.I. Socratic Circles, a machine learning discussion group based in Toronto, focused on reviewing highly technical advances in machine learning literature.
We hold events one to two times a week, and stream our sessions online on our YouTube channel. Our sessions run about an hour and a half, which makes for very long content. As part of an effort to produce shorter, more digestible content for YouTube, we came up with The 5 Minute Paper Challenge.
Participating in The 5 Minute Paper Challenge is simple in its description, but complicated in its execution. Simply put, participants are asked to create videos of 5 minutes in length, describing their favourite machine learning paper. It may sound easy, but it really isn’t. I learned that the hard way, spending a few days putting together a “sample” video for our participants to refer to. I thought it’d be a matter of writing a script, pointing a camera at my face, and doing some quick editing. I was wrong.
Take a look at my video below, if you’d like. The paper I chose to explain is called Folding: Why Good Models Sometimes Make Spurious Recommendations. It discusses how matrix-based collaborative filtering systems often embed unrelated groups of users close to each other in the embedding space. The way to rectify it, they say, is to use some sort of goodness metric (like RMSE) along with some badness metric (they propose a “folding” metric), when training your model and evaluating its performance.
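To make the goodness-plus-badness idea concrete, here’s a rough sketch, not the paper’s actual folding metric: RMSE on observed ratings as the goodness metric, and, as a stand-in badness metric, the mean predicted score on user-item pairs known to be irrelevant (which should be low for a model that doesn’t fold unrelated groups together). All numbers are invented:

```python
# Sketch of evaluating a recommender on both a "goodness" metric (RMSE)
# and a simple "badness" proxy (mean score on known-irrelevant pairs).
# The paper's folding metric is more involved than this.

import math

def rmse(predictions, actuals):
    """Root mean squared error over observed ratings."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)
    )

def badness(irrelevant_scores):
    """Mean predicted score on user-item pairs known to be irrelevant."""
    return sum(irrelevant_scores) / len(irrelevant_scores)

# Hypothetical model outputs
preds, actuals = [4.1, 3.9, 2.2], [4.0, 4.0, 2.0]
irrelevant = [0.3, 0.1, 0.2]  # low is good: the model isn't "folding"

print(rmse(preds, actuals), badness(irrelevant))
```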
I’m the only data scientist at my company. It allows me to have a huge amount of breadth in my work, which is great, but it leaves me with few people to really nerd out with. I mean the type of nerding out that’s specific to data science: there’s definitely a lot of nerding out that goes on with respect to other topics, which is also great.
In an effort that appears selfless but is really quite selfish, I decided to start a data science lunch and learn series. The goal of the series is to help my company enhance its data-driven decision making across all teams, not just the data science team. The more people thinking critically about data and highlighting potential datasets and business opportunities, the better. That’s the selfless part: me volunteering my time to teach my coworkers. The selfish part is that, really, I just want people to be on the same page as me when it comes to data science, so I can bounce ideas off of them, and so they can highlight opportunities in our business that may be useful to explore. It’d be even better if they started to do their own analyses. I wouldn’t mind having my 60 talented colleagues do my work for me.
My idea was to have 5 sessions:
The first would be an intro to the whole topic, since many people don’t even know what data science really means (myself included, sometimes?).
The second lesson (which at the time of this writing has yet to be taught) will be a discussion of case studies of basic models, and some more cutting-edge stuff, with example code in Python. The point being: while the math may be complicated, often an off-the-shelf implementation will do the trick.
The third will be an optional lesson with more math and more discussion of models. Apparently people want more math (see the image below).
The penultimate lesson will be hands-on, with all of us doing some exploratory data analysis together, likely on some grocery dataset, since our company is in the grocery space.
The final lesson will involve us all building a model together, probably to predict if someone will buy peanut butter or not in a given week, or something like that. The reason I chose peanut butter is that I used this example repeatedly in the first lesson, and it seemed to go over well. It’d be nice to tie things together by referring back to the beginning.
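As a taste of what that final session might look like, here’s a toy sketch of a peanut butter predictor. The shoppers, the data, and the naive frequency-based rule are all invented; in the actual session we’d reach for an off-the-shelf classifier instead:

```python
# Toy peanut-butter predictor: estimate a shopper's probability of
# buying next week from their historical purchase frequency.

# Hypothetical history: 1 = bought peanut butter that week, 0 = didn't
history = {
    "hanna": [1, 0, 1, 1, 0, 1],
    "justin": [0, 0, 1, 0, 0, 0],
}

def buy_probability(shopper):
    """Naive estimate: fraction of past weeks with a purchase."""
    weeks = history[shopper]
    return sum(weeks) / len(weeks)

def will_buy(shopper, threshold=0.5):
    """Predict a purchase if the historical frequency clears a threshold."""
    return buy_probability(shopper) >= threshold

print(will_buy("hanna"), will_buy("justin"))  # True False
```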
Mean is overrated. As a statistical measure, and as a way of being (it’s easiest to just be nice!). Just as mean is overrated, mode is overlooked. Sometimes I don’t want the mean of my data, I want the mode, or the median. And for some reason, this is hard to do with Apache Hive. To explain what I mean by this, let’s get our hands dirty with a dataset. (Should I say let’s clean our teeth with a dataset, since the picture I chose below is toothpaste? Nah, that would be weird).
Cleaning our teeth (?) with some data
Consider the case where we have order data for a particular item. Specifically, we have a dataframe with 2 columns: number_of_orders, and item_id. Let’s say for simplicity’s sake that we’re dealing with a single item, say, toothpaste (to yes, clean your teeth with), and each (number_of_orders,item_id) pair tells us the number of times someone has purchased that item. If our population includes myself and my friend Hanna, and I have purchased toothpaste 13 times, and Hanna has purchased toothpaste 55 times, then two rows of our dataframe using Pandas would look like:
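Here’s that two-row dataframe built in Pandas, along with the median and mode the post is after; in Pandas these are one-liners, which is exactly what’s hard to come by in Hive:

```python
# The toy toothpaste dataframe from the example above, in Pandas.
import pandas as pd

df = pd.DataFrame(
    {"number_of_orders": [13, 55], "item_id": ["toothpaste", "toothpaste"]}
)

print(df["number_of_orders"].median())  # 34.0
print(df["item_id"].mode()[0])          # toothpaste
```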
(I use Redshift to change the colour of my screen; it seems to affect screenshots too!)
Target once told a man that his high-school-aged daughter was pregnant. Their model told them, based on the man’s purchases, that someone in his household was pregnant. And they wanted to make money off of him: they sent him coupons in the mail for baby products, hoping he’d buy some. He was furious, but Target was right.
This is a classic case of machine learning gone wrong. Not because the model was wrong, but because of how it was used. The steps leading to this family feud were simple:
Target’s statistician analyzes the purchases of pregnant women.
He (his last name is Pole) determines that 25 products are enough to assign a shopper a pregnancy prediction score.
Target puts the model into production, with plans to send coupons to shoppers who are pregnant.
The daughter goes to Target, buys some of these 25 items. She uses her Dad’s loyalty card.
Target labels the father’s account with a high pregnancy score.
Target sends coupons to the father, making him very, very angry. But did he use the coupons?
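Target’s actual model is private, but as a toy sketch you can imagine the pregnancy prediction score as a weighted sum over a short list of signal products (Pole’s list had 25). The products and weights below are entirely invented:

```python
# Toy pregnancy-score sketch: sum the weights of "signal" products that
# appear in a shopper's basket. Products and weights are made up.

signal_weights = {
    "unscented lotion": 0.3,
    "prenatal vitamins": 0.5,
    "cotton balls": 0.1,
}

def pregnancy_score(basket):
    """Sum the weights of signal products present in the basket."""
    return sum(w for product, w in signal_weights.items() if product in basket)

basket = {"unscented lotion", "prenatal vitamins", "bread"}
print(pregnancy_score(basket))  # 0.8
```

A shopper whose score clears some threshold would get flagged, and, in 2012 at least, get baby coupons in the mail.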
This was 2012, and while this sort of thing is still done today, companies are a lot more careful about what they recommend. It’s a win-win situation. By recommending relevant, inoffensive items to shoppers, they’ll be delighted, they’ll buy more, and they’ll be more loyal.
You can know a lot about someone based on what they buy.
Unit testing is fundamental in software development, and often overlooked by data scientists, but it’s important. In this post, I’ll show how to do unit testing in PySpark using Python’s unittest.mock library. I’ll do this from a data scientist’s perspective: to me, that means I won’t go into the software engineering details. I present just what you need to know.
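Here’s a minimal sketch of the core trick, with made-up function and path names: because the function under test takes the spark session as an argument, a test can pass in a MagicMock and never touch a real cluster.

```python
# Sketch of unit testing PySpark code with unittest.mock: the spark
# session is an argument, so the test substitutes a MagicMock for it.

from unittest.mock import MagicMock

def count_orders(spark, path):
    """Load a dataframe of orders from parquet and count the rows."""
    df = spark.read.parquet(path)
    return df.count()

# In a real suite this would live in a unittest.TestCase method.
spark = MagicMock()
spark.read.parquet.return_value.count.return_value = 42

assert count_orders(spark, "s3://bucket/orders") == 42
spark.read.parquet.assert_called_once_with("s3://bucket/orders")
```

The test runs in milliseconds and verifies both the return value and that the function read from the path it was given.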
Customer lifetime value (CLV) is a metric that represents the monetary value of a customer relationship. It’s used to estimate the total net profit a company can make from a given customer. Some sources say customer lifetime value is the “single most important metric for understanding your customers.” (Ok, the site that says that is “customerlifetimevalue.co”, so they’re not biased at all!). Other sources (yes, I’ll count myself as a source) say it’s not so clear. Modelling CLV can be very simple if you want it to be: it could simply be the product of average purchase frequency, purchase value, and length of customer relationship. The assumption here is that all these values are static throughout a customer’s lifetime. Assumptions always make things easier, but they can make things too simple. A more complex way of calculating CLV requires calculating retention rate and inferring a decay rate. This more complex equation is:

CLV = Σ_{t=1..n} P · r^t / (1 + d)^t

where n is the length of the relationship (in t units), t is the time period, r is the customer retention rate, P is the profit a customer contributes in a period, and d is the discount rate (I’m not an expert, but it seems like the discount rate depends on company size). Customer retention rate r is unique to each customer, and can be calculated from a shopper’s recency (R), monetary (M), and frequency (F) values.
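The more complex CLV calculation translates to a few lines of Python. The inputs below are invented, and the sketch assumes a per-customer retention rate has already been estimated:

```python
# Sketch of the retention-and-discount CLV formula: sum the profit per
# period, scaled by retention decay and discounted over time.

def clv(profit_per_period, retention_rate, discount_rate, n_periods):
    """CLV = sum over t = 1..n of P * r^t / (1 + d)^t."""
    return sum(
        profit_per_period * retention_rate**t / (1 + discount_rate) ** t
        for t in range(1, n_periods + 1)
    )

# e.g. $100 profit per period, 80% retention, 10% discount, 5 periods
print(round(clv(100, 0.8, 0.1, 5), 2))  # roughly 212.41
```

Note how sensitive the result is to r: with perfect retention and no discounting, the same customer would be worth the full 100 per period.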
Think about when you meet someone. You want to make an impression on them. You want them to like you. Here’s a tip: find something in common with them. Show them that you two are similar. Or to put it another way, show them that you’re not different in some respect. (Oftentimes in math, the notion of “similar” is thought of as being “not different”). It doesn’t matter that they’re right-wing and you lean left: you both love hockey and soccer, so there you go! You’re still similar. And what if your only thing in common is that you’ve both been to Paris? That could still work: you just need to think of “similarity” as based on shared travel destinations.
Mathematically inclined or not, when you meet someone, you essentially run through countless similarity measures: you compute, compare, and take the best result. Similarity is a fun concept. You can define it however you want. But some ways are better than others.
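As one concrete way to define it, here’s Jaccard similarity applied to the travel-destinations example; the names and cities are made up:

```python
# Jaccard similarity: size of the intersection over size of the union.

def jaccard(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    return len(a & b) / len(a | b)

you = {"Paris", "Toronto", "Tokyo"}
them = {"Paris", "Berlin"}

print(jaccard(you, them))  # 0.25: one shared city out of four total
```

Swap the sets for favourite sports, purchases, or anything else, and you’ve chosen a different notion of similarity; that choice is the interesting part.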