Information

Section: Practical task
Goal: Practice the content that was covered in this chapter, in Python.
Time needed: 60 min
Prerequisites: Chapter 1 - Numerical data

Practical task

You can now practice everything covered in this chapter on a new dataset.

The dataset is called “Wine quality” and was downloaded from http://archive.ics.uci.edu/ml/datasets/Wine+Quality.

It contains the characteristics of 6497 wines and their quality, a grade between 0 and 10 given by 3 wine experts. On the website, the data is split into white and red wine, but we have merged the two sets for the purpose of the exercise. We added an additional attribute, type, containing the type of wine (red or white).

So that you can practice handling missing values, some values in the original dataset have been deleted or modified.

  1. Import the dataset, which is called wine.csv.

  2. Get to know the dataset: what it looks like, its attributes, and their distributions.

  3. Detect the missing values and think about how to deal with them.

  4. Use the dataset with all the attributes to predict the attribute quality with a regression model (you can use the functions we used throughout the chapter).

  5. Get the MAE and plot the predictions vs. the true labels.

  6. Build other models with different attributes. Try to find the combination of attributes which gives the best performance.

  7. Predict the attribute quality with a classification model, get the MAE and plot the results.

  8. Split the dataset on a well-chosen attribute, and predict the attribute quality separately on each of the resulting datasets.
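
As a starting point for steps 1 to 3, here is a minimal sketch using pandas. It is not the reference solution: the column names, the "?" marker and the five rows below are assumptions made up for illustration, and an in-memory string stands in for wine.csv so that the snippet runs on its own. With the real file you would call pd.read_csv("wine.csv") and adapt the column lists to what you find.

```python
import io

import pandas as pd

# Stand-in for wine.csv (assumption): with the real file, use pd.read_csv("wine.csv").
# The column names and the "?" marker are hypothetical and only serve the example.
raw = io.StringIO(
    "alcohol,pH,type,quality\n"
    "9.4,3.51,red,5\n"
    ",3.20,white,6\n"
    "10.1,?,white,6\n"
    "11.2,3.16,red,\n"
    "9.8,3.30,white,7\n"
)
df = pd.read_csv(raw)

# Step 2: first look at the size, the column types and summary statistics.
print(df.shape)
print(df.dtypes)
print(df.describe(include="all"))

# Step 3: a numeric attribute that was read as text hints at modified entries.
numeric_cols = ["alcohol", "pH", "quality"]
for col in numeric_cols:
    # Entries that cannot be parsed as numbers are turned into NaN.
    df[col] = pd.to_numeric(df[col], errors="coerce")

missing_per_column = df.isna().sum()
print(missing_per_column)

# One possible treatment among others: drop rows without a target value,
# then fill the remaining gaps in the features with the column median.
clean = df.dropna(subset=["quality"]).copy()
feature_cols = ["alcohol", "pH"]
clean[feature_cols] = clean[feature_cols].fillna(clean[feature_cols].median())
print(clean)
```

Dropping rows and median filling are only two options. Whether they are appropriate depends on how many values are missing and in which attributes, which is what step 3 asks you to think about.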

Open questions:

  • What can we do with the missing values?

  • How well does the regression model perform?

  • Explain the difference in performance between the regression and the classification models.

  • Explain the performance we obtain by splitting the dataset.
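
When comparing models, it can help to recall what the MAE measures. The sketch below computes it by hand in plain Python; the helper name mean_absolute_error and all grades and predictions are made up for illustration and do not come from the wine data.

```python
def mean_absolute_error(y_true, y_pred):
    """Return the average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up grades and predictions, for illustration only.
y_true = [5, 6, 6, 7]
y_reg = [5.5, 5.5, 6.5, 6.5]  # regression: continuous outputs
y_clf = [5, 6, 6, 6]          # classification: one of the existing grades

mae_reg = mean_absolute_error(y_true, y_reg)
mae_clf = mean_absolute_error(y_true, y_clf)
print(f"regression MAE: {mae_reg:.2f}")
print(f"classification MAE: {mae_clf:.2f}")
```

A regression model can be slightly off on every wine, whereas a classifier is either exactly right or wrong by at least one whole grade. Keep this in mind when you interpret the numbers obtained on the real dataset, which may well differ from this toy example.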