In 2003, Amazon revealed their secret to the world: their item-to-item collaborative filtering algorithm! In their paper, Amazon.com Recommendations: Item-to-Item Collaborative Filtering, which has ~6000 citations on Google Scholar as of October 2019, they share the details of the recommender system that drove their website’s revenue for years.
The paper solves three interesting problems:
how to represent items
how to compute similarity of items, and
how to scale the solution.
And, the most interesting part of this is: they patented it.
Before we discuss the patent, and ask some questions, let’s step through each sub-problem and how they solved it.
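As a rough sketch of the first two sub-problems, here is one way to represent each item as the set of customers who bought it and to compare items with cosine similarity. The purchase data and the function names (item_vectors, cosine, similar_items) are made up for illustration, and this is a toy version under those assumptions, not Amazon's actual implementation.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical purchase history: customer -> set of items bought.
purchases = {
    "alice": {"book", "lamp"},
    "bob": {"book", "lamp", "pen"},
    "carol": {"pen"},
}


def item_vectors(purchases):
    """Represent each item as the set of customers who bought it."""
    vectors = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            vectors[item].add(customer)
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two binary vectors stored as sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def similar_items(purchases):
    """Build a table: item -> other items, sorted by descending similarity."""
    vectors = item_vectors(purchases)
    table = {}
    for item, buyers in vectors.items():
        scores = {
            other: cosine(buyers, vectors[other])
            for other in vectors
            if other != item
        }
        table[item] = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return table


if __name__ == "__main__":
    for item, neighbours in similar_items(purchases).items():
        print(item, neighbours)
```

In this toy data, "book" and "lamp" were bought by exactly the same customers, so their similarity is 1.0, while "book" and "pen" share only one of their two buyers each and score 0.5. Building the whole table ahead of time, rather than at request time, is one simple way to think about the scaling problem: a lookup in a precomputed table is cheap.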
A lot of people have read my fluid dynamics post, and even commented on it, which I really appreciate. It’s quite humbling to hear that my ideas have resonated with people. What’s not humbling though, is having someone plagiarize my work. Rather than calling it humbling, I’d call it violating. It’s not a nice feeling. And quite frankly, it makes for quite an emotional experience.
The animal rights movement is nothing new, but it’s only in recent years that veganism, the most overt form of animal rights activism, has become more mainstream. A seemingly unrelated topic, especially prevalent nowadays, is that of human privacy, particularly as it relates to data-driven machine learning algorithms. Data is the new oil in this “machine learning first” paradigm we now live in. And data is like oil, it’s true: the more data you have on someone, the better a model can “know” them. Even better if you know intimate, personal details about their lives. Data is what fuels our models. Yes, fed with more data, models will generally improve. But is it ethical? No, say many experts and pundits in the field. But what I wonder is, when will this discussion turn to include our animal friends?
I’m the only data scientist at my company. It allows me to have a huge amount of breadth in my work, which is great, but it leaves me few people to really nerd out with. I mean the type of nerding out that’s specific to data science- there’s definitely a lot of nerding out that goes on with respect to other topics, which is also great.
In an effort that appears selfless but is really quite selfish, I decided to start a data science lunch and learn series. The goal of the series is to help my company enhance its data driven decision making across all teams, not just the data science team. The more people thinking critically about data and highlighting potential datasets and business opportunities the better. That’s the selfless part, me volunteering my time to teach my coworkers. The selfish part is that really, I just want people to be on the same page as me when it comes to data science, so I can bounce ideas off of them, and they highlight opportunities in our business that may be useful to explore. It’d be even better if they started to do their own analyses. I wouldn’t mind having my 60 talented colleagues do my work for me.
My idea was to have 5 sessions:
The first would be an intro to the whole topic, since many people don’t even know what data science really means (myself included, sometimes?).
The second lesson (which at the time of this writing has yet to be taught) will be a discussion of case studies of basic models, and some more cutting-edge stuff, with example code in Python. The point being, while the math may be complicated, oftentimes an off-the-shelf implementation will do the trick.
The third will be an optional lesson with more math, and talking about models. Apparently people want more math (see the image below).
The penultimate lesson will be hands on, with all of us doing some exploratory data analysis together, likely on some grocery dataset, since our company is in the grocery space.
The final lesson will involve us all building a model together, probably to predict if someone will buy peanut butter or not in a given week, or something like that. The reason I choose peanut butter is because I used this example repeatedly in the first lesson, and it seemed to go over well. It’d be nice to tie things together by referring back to the beginning.
Last year I successfully predicted the election of John Horgan of the BC NDP. This prediction was based on sentiment analysis of Reddit users in the r/Vancouver and r/BritishColumbia subreddits, using the VADER package in Python. With the Ontario provincial election coming up on June 7th, I thought it’d be fun to predict who will be elected, according to sentiment analysis. So instead of r/Vancouver and r/BritishColumbia, I took a look at r/Toronto and r/Ontario. Other than that, nothing has changed. (Not even my figures, which maybe should change!) For more details on sentiment analysis and the caveats of this method, please see my post from last year.
In 2014, China proposed something that would be a great hackathon project. But this isn’t just a fun data science experiment, something that I’d love to work on. This is a real proposal by the government that will be put into action in 2020. This is their Social Credit System. Think credit score, but not just for loan applications. This is an everything score, and will affect all aspects of life, ranging from buying vacation packages and private school tuition, to boarding planes and high speed trains. With a bad credit score, you’ll be barred from doing any of these.
I became extremely aware of the state of technology in China, in particular smart phone use, through interactions with my Chinese friends and colleagues in Hong Kong, and through my trips to the Mainland. In my opinion, they’re ahead of us. There’s a bit of a smart phone obsession, something unmatched in the West. This obsession isn’t a bad thing. In fact, I think it’s great (sometimes). Imagine paying for everything from parking and car insurance to electricity, to groceries and knick knacks at the market, all with your phone, all with one app. In China, this isn’t just possible but it’s the norm. It’s a great use of technology. We all carry our phones with us everyday, why not just do away with our wallet? One less thing to remember when leaving the house.
You need to read this book. This isn’t made up. It’s a real account of Nien Cheng’s life during the Cultural Revolution. Nien Cheng was imprisoned for over 6 years at the Number One Detention House in Shanghai, from 1966 to 1973. During that time she was interrogated, beaten, and left isolated, all while being pushed to “confess” to being an “imperialist spy.” Why this treatment? Because she was Western educated, and the widow of the former manager of a foreign firm in Shanghai. She wasn’t guilty of anything, and refused to produce a false confession, despite the quite literal torture. She made it out alive, miraculously, but her daughter wasn’t so lucky: she was beaten to death on the streets of Shanghai by the Red Guard. All because of who her family was. Nien Cheng was never the same after the death of her daughter. It’s so sad.
This post originally appeared on Medium.com on May 5, 2017.
The candidates: the NDP’s John Horgan, the Liberals’ Christy Clark and the Green Party’s Andrew Weaver. Adapted from the Vancouver Courier. Photo by Dan Toulgoet.
Sentiment analysis of Reddit posts using machine learning techniques suggests that NDP leader John Horgan will win the B.C. election this Tuesday, May 9th. The study suggests that while Reddit users in the r/Vancouver and r/BritishColumbia subreddits post more about Christy Clark and the B.C. Liberal Party, they hold a significantly more positive sentiment towards Horgan and the B.C. New Democratic Party.
What is sentiment analysis?
Sentiment analysis, also known as opinion mining, aims to identify the feeling and attitude of a speaker or writer in a given text. The technique can determine whether a text is positive, negative, or neutral based on the words used by the writer. Words with positive sentiment push the text towards a positive score; words with negative sentiment push it towards a negative one. A sentiment analysis program can therefore quickly and efficiently scan vast numbers of comments, posts, or pages and classify each one by the sentiment it expresses towards a topic.
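To make the word-scoring idea concrete, here is a minimal sketch of a lexicon-based scorer. The tiny LEXICON, its weights, the threshold, and the function names are all invented for illustration; this is not the VADER package, which uses a much larger weighted lexicon plus rules for things like negation and emphasis.

```python
import re

# Hypothetical toy lexicon: word -> sentiment weight.
# Real tools ship far larger, carefully rated word lists.
LEXICON = {
    "good": 1.0,
    "great": 2.0,
    "love": 2.0,
    "bad": -1.0,
    "terrible": -2.0,
    "hate": -2.0,
}


def sentiment_score(text):
    """Sum the weights of every known word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0.0) for word in words)


def classify(text, threshold=0.5):
    """Label a text as positive, negative, or neutral by its total score."""
    score = sentiment_score(text)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for comment in [
        "I love this great plan",
        "What a terrible idea",
        "The election is on Tuesday",
    ]:
        print(comment, "->", classify(comment))
```

A comment containing none of the lexicon words scores zero and is treated as neutral, which is one reason a toy scorer like this misses sarcasm and context.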