Information

Section: Conclusion
Goal: Get an overview of what was covered in this chapter.
Time needed: 20 min
Prerequisites: Chapter 1 - Numerical data

Conclusion

Checklist

This chapter covered the most common ways numerical data are misused. Along the way, we saw how to use the data we have properly and which preprocessing steps to apply before using them in a data science project.

The following checklist can help you verify that your data are of good quality and that you are using them properly:

  • Data collection

    • You know where the data come from.

    • You know how, where and when the data have been collected.

    • You know what changes were made to the dataset before you used it.

    • You know the meaning of each attribute in the dataset.

  • Data preprocessing

    • You understand the distribution of the attributes in the dataset.

    • There are no (obvious) wrong values in the dataset.

    • There are no missing values in the dataset.

    • Each attribute is stored with the right type (numerical, categorical, etc.).

  • Data usage

    • You know the type of problem that you want to solve (regression, classification or other).

    • You know which attributes are important for the problem you want to solve.

    • The dataset corresponds to the environment the model will be used in (time period, geographical area, representation of the different classes, etc.).
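The "Data preprocessing" items above can be partly automated. The following is a minimal sketch using only the Python standard library, not a definitive implementation: the toy `rows`, the `expected` specification (types and plausible ranges) and the `check_dataset` helper are all hypothetical names and values introduced for illustration. It counts missing values, flags values of the wrong type or outside a plausible range, and gives a rough (min, median, max) summary of each numerical attribute.

```python
from statistics import median

# Hypothetical toy dataset: each row is a dict, None marks a missing value.
rows = [
    {"age": 34, "height_cm": 172.0, "city": "Paris"},
    {"age": 29, "height_cm": None, "city": "Lyon"},
    {"age": -5, "height_cm": 181.5, "city": "Paris"},
    {"age": 41, "height_cm": 165.0, "city": "Nice"},
]

# Assumed expectations per attribute: type and, for numbers, a plausible range.
expected = {
    "age": {"type": int, "min": 0, "max": 120},
    "height_cm": {"type": float, "min": 50.0, "max": 250.0},
    "city": {"type": str},
}


def check_dataset(rows, expected):
    """Report missing, mistyped and out-of-range values for each attribute."""
    report = {}
    for name, spec in expected.items():
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        numeric = [v for v in present if isinstance(v, (int, float))]
        wrong_type = [v for v in present if not isinstance(v, spec["type"])]
        out_of_range = [
            v for v in numeric
            if ("min" in spec and v < spec["min"])
            or ("max" in spec and v > spec["max"])
        ]
        report[name] = {
            "missing": len(values) - len(present),
            "wrong_type": len(wrong_type),
            "out_of_range": out_of_range,
            # Rough view of the distribution, only for numerical attributes.
            "summary": (min(numeric), median(numeric), max(numeric)) if numeric else None,
        }
    return report


report = check_dataset(rows, expected)
for name, result in report.items():
    print(name, result)
```

A report like this only catches what the specification describes. Understanding the distribution of an attribute still requires looking at the data, for example with a histogram.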

If you can check every item on this list, you are likely starting your data science project on a solid basis. Congratulations!

Quiz

from IPython.display import IFrame
# Embed the interactive H5P quiz in the notebook output.
IFrame("https://h5p.org/h5p/embed/784273", width=900, height=800)

References:

The datasets used in this quiz are:

  • UFO sightings: https://www.kaggle.com/NUFORC/ufo-sightings#scrubbed.csv

  • Wine reviews (1): https://www.kaggle.com/zynicide/wine-reviews

  • Wine quality (2): http://archive.ics.uci.edu/ml/datasets/Wine+Quality