Black Lives Matter! As the diversity and inclusion lead at Aggregate Intellect, I felt it was important to make our stance clear. At Aggregate Intellect, we recognize that this is not an issue for the Black community to solve alone. This is an issue for all of us to solve together. As a community of machine learning practitioners, we are all equipped with a unique skill set that allows us to make a difference. This post outlines ways you can contribute as a technologist, as well as how you can contribute as a general member of society. Special thanks to Suhas Pai, our NLP Stream Owner and the director of our weekly newsletter, for putting together a large portion of these resources!
I’ve been reading up on investment banking lately, both for work and out of interest. The first chapter of Investment Banking: Valuation, Leveraged Buyouts, and Mergers and Acquisitions by Joshua Rosenbaum and Joshua Pearl covers comparable companies analysis, also called trading comps. It’s one way investment bankers go about valuing private companies that are about to go public, or companies that may merge with or be acquired by another, for example.
Automating comparable companies analysis
The process itself seems straightforward: it involves data collection, then essentially comparing the “target” (the company you’d like to value) to those in a “comparables universe.” It’s all manual work, with the most difficult part being amassing the universe of comparable companies. The rest essentially involves simple addition, subtraction, multiplication, and division, as in the toy sketch below. And it makes me think: why can’t we just automate it with machine learning?
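To make that arithmetic concrete, here’s a toy sketch of the core comps calculation (all names and numbers are made up): compute a valuation multiple like EV/EBITDA across the comparables universe, then apply its median to the target’s own EBITDA.

```python
# Toy sketch of the arithmetic behind trading comps (all numbers invented).
# Value the target by applying the comparables' median EV/EBITDA multiple
# to the target's own EBITDA.
comparables = {
    "CompA": {"ev": 1200.0, "ebitda": 150.0},  # enterprise value & EBITDA, $M
    "CompB": {"ev": 950.0, "ebitda": 100.0},
    "CompC": {"ev": 2100.0, "ebitda": 300.0},
}

multiples = sorted(c["ev"] / c["ebitda"] for c in comparables.values())
median_multiple = multiples[len(multiples) // 2]  # median of an odd-length list

target_ebitda = 180.0  # $M
implied_ev = median_multiple * target_ebitda
print(f"median EV/EBITDA: {median_multiple:.1f}x, implied EV: ${implied_ev:,.0f}M")
```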
In 2003, Amazon revealed their secret to the world: their item-to-item collaborative filtering algorithm! In their paper, Amazon.com Recommendations: Item-to-Item Collaborative Filtering, which has ~6000 citations on Google Scholar as of October 2019, they share the details of the recommender system that drove their website’s revenue for years.
There are 3 interesting problems solved in this paper:
- how to represent items
- how to compute similarity of items, and
- how to scale the solution.
And the most interesting part of it all: they patented it.
Before we discuss the patent, and ask some questions, let’s step through each sub-problem and how they solved it.
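As a warm-up, here’s a minimal sketch of the general item-to-item idea on toy data (my own illustration, not Amazon’s actual implementation): represent each item by the vector of users who bought it, then rank other items by cosine similarity.

```python
import numpy as np

# Rows = users, columns = items; 1 means "this user bought this item" (toy data).
purchases = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Represent each item by its column: the vector of users who bought it.
item_vectors = purchases.T
normed = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)
similarity = normed @ normed.T  # pairwise cosine similarity between items

# "Customers who bought item 0 also bought": other items, most similar first.
ranked = np.argsort(-similarity[0])
print([int(i) for i in ranked if i != 0])
```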
I wrote a blog post describing Alibaba’s new recommender system, which leverages the popular Transformer architecture. It’s a great example of the intersection of NLP and recommender systems. I originally posted it on the AI Socratic Circles blog, which I manage, so feel free to navigate there if you’d like a full-page experience or want to check out some of the other content. For those who prefer to stay on this page, I’ve embedded the original article below. Thanks to the two editors of the post, Susan Shu and Omar Nada.
Enjoy the post!
Before I became a data scientist, I spent a lot of time Googling “How to get a job as a data scientist” and browsing r/DataScience. Maybe you’re trying to become a data scientist, too, and you’ve somehow landed on my blog. Welcome!
I’m writing this article for 2 reasons:
- I want to help you! I was lucky to receive a lot of help from generous members of the data science community when I was first starting out. This article is one way I can give back to people who are in the position I was in. We all have to start somewhere!
- I find I give the same advice over and over to people who reach out to me to chat, so I’d like to put my thoughts down in one place. Hopefully this article can be helpful to you in some way.
Feel free to reach out to me if you have any further questions and if I can help in any way!
I’m an organizer of A.I. Socratic Circles, a machine learning discussion group based in Toronto, focused on reviewing highly technical advances in the machine learning literature.
We hold events one to two times a week and stream our sessions online on our YouTube channel. Our sessions run about an hour and a half, which makes for very long videos. As part of an effort to produce shorter, more digestible content for YouTube, we came up with The 5 Minute Paper Challenge.
Participating in The 5 Minute Paper Challenge is simple in its description, but complicated in its execution. Simply put, participants are asked to create a 5-minute video describing their favourite machine learning paper. It may sound easy, but it really isn’t. I learned that the hard way, spending a few days putting together a “sample” video for our participants to refer to. I thought it’d be a matter of writing a script, pointing a camera at my face, and doing some quick editing. I was wrong.
Take a look at my video below, if you’d like. The paper I chose to explain is called Folding: Why Good Models Sometimes Make Spurious Recommendations. It discusses how matrix-based collaborative filtering systems often embed unrelated groups of users and items close to each other in the embedding space. The way to rectify this, the authors say, is to use some goodness metric (like RMSE) alongside a badness metric (they propose a “folding” metric) when training your model and evaluating its performance.
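To get a feel for the phenomenon, here’s a toy illustration of my own (a sketch of the setup, not the paper’s actual metric): fit a plain matrix factorization on two communities that never interact, then look at the predicted scores for cross-community pairs the model never saw.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k, lr = 20, 20, 4, 0.01

# Two disjoint communities: users 0-9 only rate items 0-9, and
# users 10-19 only rate items 10-19 (every observed rating is a 5).
observed = [(u, i) for u in range(10) for i in range(10)]
observed += [(u, i) for u in range(10, 20) for i in range(10, 20)]

U = rng.normal(scale=0.1, size=(n_users, k))
V = rng.normal(scale=0.1, size=(n_items, k))

# Plain matrix factorization, fit by SGD on the observed entries only.
for _ in range(500):
    for u, i in observed:
        err = 5.0 - U[u] @ V[i]
        u_old = U[u].copy()
        U[u] += lr * err * V[i]
        V[i] += lr * err * u_old

scores = U @ V.T
# Cross-community pairs were never observed, and nothing in the training
# objective constrains them, so their predicted scores can come out
# spuriously large. That overlap is the "folding" the paper describes.
print(f"in-community mean score:   {scores[:10, :10].mean():.2f}")
print(f"cross-community max score: {scores[:10, 10:].max():.2f}")
```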
I’m shocked, and I’m flattered.
I’ve noticed that certain posts I’ve made on this blog are routinely read, or at least looked at, over and over again. It’s surprising, and my popular posts are never the ones I expect to be a hit. Those posts are: English to Cantonese Translation, Analyzing Justin Trudeau’s Twitter Sentiment, The Math Behind Bitcoin, Match Your Resume with a Job on Glassdoor, and my most popular post, Fluid Dynamics = Financial Mathematics, also known as “How I am Convincing Myself that my Master’s Degree was Useful.” (Just kidding. I really value my Master’s and enjoyed it a lot).
A lot of people have read my fluid dynamics post, and even commented on it, which I really appreciate. It’s quite humbling to hear that my ideas have resonated with people. What’s not humbling, though, is having someone plagiarize my work. Rather than calling it humbling, I’d call it violating. It’s not a nice feeling. And quite frankly, it makes for an emotional experience.
The animal rights movement is nothing new, but it’s only in recent years that veganism, the most overt form of animal rights activism, has become more mainstream. A seemingly unrelated topic, especially prevalent nowadays, is that of human privacy, particularly as it relates to data-driven machine learning algorithms. Data is the new oil in this “machine learning first” paradigm we now live in. And data is like oil, it’s true: the more data you have on someone, the better a model can “know” them. Even better if you know intimate, personal details about their lives. Data is what fuels our models. Yes, fed with more data, models will generally improve. But is it ethical? No, say many experts and pundits in the field. But what I wonder is: when will this discussion turn to include our animal friends?
I’m the only data scientist at my company. It allows me to have a huge amount of breadth in my work, which is great, but it leaves me with few people to really nerd out with. I mean the type of nerding out that’s specific to data science; there’s definitely a lot of nerding out that goes on with respect to other topics, which is also great.
In an effort that appears selfless but is really quite selfish, I decided to start a data science lunch-and-learn series. The goal of the series is to help my company enhance its data-driven decision making across all teams, not just the data science team. The more people thinking critically about data and highlighting potential datasets and business opportunities, the better. That’s the selfless part: me volunteering my time to teach my coworkers. The selfish part is that, really, I just want people to be on the same page as me when it comes to data science, so I can bounce ideas off of them and they can highlight opportunities in our business that may be worth exploring. It’d be even better if they started doing their own analyses. I wouldn’t mind having my 60 talented colleagues do my work for me.
My idea was to have 5 sessions:
- The first would be an intro to the whole topic, since many people don’t even know what data science really means (myself included, sometimes?).
- The second lesson (which at the time of this writing has yet to be taught) will be a discussion of case studies of basic models, plus some more cutting-edge stuff, with example code in Python. The point being: while the math may be complicated, oftentimes an off-the-shelf implementation will do the trick.
- The third will be an optional lesson with more math, talking through models in more depth. Apparently people want more math (see the image below).
- The penultimate lesson will be hands-on, with all of us doing some exploratory data analysis together, likely on a grocery dataset, since our company is in the grocery space.
- The final lesson will involve us all building a model together, probably to predict whether someone will buy peanut butter in a given week, or something like that (a toy version is sketched after this list). The reason I chose peanut butter is that I used the example repeatedly in the first lesson, and it seemed to go over well. It’d be nice to tie things together by referring back to the beginning.
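For a flavour of what that final session might look like, here’s a minimal sketch on made-up data; the features, labels, and numbers are all invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000

# Hypothetical features: weeks since the last peanut butter purchase,
# total orders this week, and whether bread was in the basket.
X = np.column_stack([
    rng.integers(0, 12, n),  # weeks_since_last_pb
    rng.poisson(3, n),       # weekly_order_count
    rng.integers(0, 2, n),   # bought_bread
])

# Synthetic label, loosely tied to the features purely for the demo.
logits = -0.3 * X[:, 0] + 0.2 * X[:, 1] + 0.4 * X[:, 2]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```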
Mean is overrated. As a statistical measure, and as a way of being (it’s easiest to just be nice!). Just as mean is overrated, mode is overlooked. Sometimes I don’t want the mean of my data; I want the mode, or the median. And for some reason, this is hard to do in Apache Hive. To explain what I mean, let’s get our hands dirty with a dataset. (Should I say let’s clean our teeth with a dataset, since the picture I chose below is toothpaste? Nah, that would be weird.)
Cleaning our teeth (?) with some data
Consider the case where we have order data for a particular item. Specifically, we have a dataframe with 2 columns: number_of_orders and item_id. Let’s say for simplicity’s sake that we’re dealing with a single item, say, toothpaste (to, yes, clean your teeth with), and each (number_of_orders, item_id) pair tells us the number of times someone has purchased that item. If our population includes me and my friend Hanna, and I have purchased toothpaste 13 times and Hanna has purchased it 55 times, then two rows of our dataframe in Pandas would look something like this:
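```python
import pandas as pd

# The two example rows described above ("toothpaste" is just a stand-in
# for a real item_id).
df = pd.DataFrame({
    "number_of_orders": [13, 55],
    "item_id": ["toothpaste", "toothpaste"],
})
print(df)
```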