Wine Exploration + Predicting Price with Linear Models

Abstract:

Exploratory data analysis and visualization of wine from a Kaggle data set. Predicting prices using linear models (lasso, ridge, PCR).

Skills:

Python-based data exploration using pandas, seaborn, matplotlib, and scikit-learn. Prediction with lasso, ridge regression, and PCR. Term frequency-inverse document frequency (TF-IDF) for the text features.

Data:

From a Kaggle data set.

Results of Exploration:

There were a few questions I was curious about and wanted to answer. The Jupyter notebook is here. Check out the results in my blog post.

Results of Linear Models:

Jupyter notebook for the code.

I used ridge regression (alpha = 0.1), lasso (alpha = 0.5), and principal component regression (PCR) with 16 principal components. Ridge regression ended up being the best of these models, with a variance score of 0.31 both with and without standardizing the variables.
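As a rough illustration of how these three models can be set up in scikit-learn, here is a minimal sketch. It runs on synthetic data from make_regression, not the wine data, and it assumes "variance score" means scikit-learn's explained_variance_score; the sample sizes and train/test split are also assumptions. Only the alpha values and the 16 components come from the write-up above.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the wine features (assumption: not the real data).
X, y = make_regression(
    n_samples=300, n_features=40, n_informative=10, noise=20.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "ridge": Ridge(alpha=0.1),
    "lasso": Lasso(alpha=0.5),
    # PCR: project onto 16 principal components, then ordinary least squares.
    "pcr": make_pipeline(PCA(n_components=16), LinearRegression()),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = explained_variance_score(y_test, model.predict(X_test))
    print(f"{name}: explained variance = {scores[name]:.2f}")
```

With sparse TF-IDF input, PCA would need a dense matrix, so something like TruncatedSVD might be the more practical choice for the PCR step.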

The residuals for our “best” result, ridge regression, show heteroskedasticity, which suggests that there is structure in the data that the linear models have failed to capture. This could mean that more feature engineering is needed, such as transformations, or that we need some new predictors.
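To make the heteroskedasticity point concrete, here is a small sketch on made-up data where the price noise grows with the price level. It compares the residual spread above and below the median fitted value, before and after a log transform of the target. The data-generating formula and the spread_ratio helper are hypothetical, chosen only for illustration; this is not the analysis from the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up data: price grows exponentially with points, with multiplicative noise.
points = rng.uniform(80, 100, size=500)
price = np.exp(0.1 * points - 6 + rng.normal(0, 0.4, size=500))
X = points.reshape(-1, 1)


def spread_ratio(fitted, residuals):
    """Residual std above the median fitted value divided by the std below it."""
    upper = fitted > np.median(fitted)
    return residuals[upper].std() / residuals[~upper].std()


# Fit on the raw price: residual spread should widen as fitted price grows.
raw_model = LinearRegression().fit(X, price)
raw_fitted = raw_model.predict(X)
raw_ratio = spread_ratio(raw_fitted, price - raw_fitted)

# Fit on log(price): the multiplicative noise becomes roughly constant.
log_model = LinearRegression().fit(X, np.log(price))
log_fitted = log_model.predict(X)
log_ratio = spread_ratio(log_fitted, np.log(price) - log_fitted)

print(f"spread ratio, raw price: {raw_ratio:.2f}")
print(f"spread ratio, log price: {log_ratio:.2f}")
```

A ratio well above 1 means the residuals fan out at higher fitted values. A ratio near 1 after the transform would suggest that a log target is one transformation worth trying.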

Data to Add

Useful data to add could be the wine’s vintage (although I think it wouldn’t be too hard to go out and fetch that manually), as well as the sentiment of the description. I suppose the sentiment would be all neutral or positive, but it could still have an effect on the price.

Limitations

Another thing to consider could be a separate model for each price range of wine. Since I’m not an expert, a wine connoisseur could tell me what price ranges are typical for wines. That brings up another limitation: I don’t know that much about wine, and I’m not even 100% sure what one of the features in the data set, designation, means. As a result, I left it out of the linear models.

Standardizing the features

It’s interesting to me that standardizing doesn’t have much of an effect, but this makes sense because points is the only integer-valued feature; the rest of the features are either TF-IDF features or one-hot vectors. Lasso even did worse with scaling. Furthermore, when you scale data, you assume the features are on different scales to start with. Of course, if there’s only one quantitative variable, then it will be on the same scale as itself!
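One way to act on this observation is to standardize only the points column and pass the TF-IDF and one-hot columns through untouched. The sketch below does that with scikit-learn's ColumnTransformer on a tiny made-up matrix. The column layout and values are assumptions for illustration, not the notebook's actual feature matrix.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up rows: [points, one-hot flag, TF-IDF weight] (hypothetical layout).
X = np.array(
    [
        [85.0, 1.0, 0.00],
        [90.0, 0.0, 0.52],
        [95.0, 0.0, 0.31],
        [88.0, 1.0, 0.77],
    ]
)

# Standardize column 0 (points) only; leave the other columns as they are.
scale_points_only = ColumnTransformer(
    transformers=[("points", StandardScaler(), [0])],
    remainder="passthrough",
)
X_scaled = scale_points_only.fit_transform(X)

# Transformed columns come first in the output, then the passthrough columns.
print(X_scaled)
```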

The Business Side

Now another question I always like to think about: Why even do this? How can you make money from predicting wine prices? A winery can’t change where it’s located; all it can change is its description and the points its wine receives. Points are a bit out of its control too, since wine critics are the ones who assign this value. So this comes down to the description. An idea could be to look into the relationship between description and price, and get into the text analytics of it. If we can identify certain words or characteristics of a description associated with expensive wine, then a winery could price its wine accordingly (that is, make it more expensive!) and make money from this.
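A first pass at that text-analytics idea could be to fit a ridge model on TF-IDF features of the descriptions and rank words by their coefficients. The sketch below does this on six invented descriptions with invented prices, so the numbers mean nothing; it only shows the mechanics. The alpha value is reused from the ridge model above, and everything else is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Invented descriptions and prices, for illustration only.
descriptions = [
    "rich oak, dark cherry",
    "elegant oak, firm tannins",
    "simple, fruity, light",
    "light, simple citrus",
    "dark cherry, firm tannins",
    "fruity citrus, light",
]
prices = [60.0, 75.0, 12.0, 10.0, 45.0, 14.0]

# Turn each description into a TF-IDF weighted bag of words.
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(descriptions).toarray()

model = Ridge(alpha=0.1).fit(X_text, prices)

# Map each word to its coefficient; vocabulary_ gives word -> column index.
coef = {word: model.coef_[idx] for word, idx in vectorizer.vocabulary_.items()}
ranked = sorted(coef.items(), key=lambda item: item[1], reverse=True)

for word, weight in ranked:
    print(f"{word:>8}: {weight:+.1f}")
```

Words with large positive coefficients are the ones associated with higher prices in the toy data. On real data, this would only show association, not that changing a description would change what a wine can sell for.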