Information

Section: The Pandas library
Goal: Get a first idea of the basics of the Pandas library in Python.
Time needed: 10 min
Prerequisites: Curiosity

The Pandas library

The pandas library provides data structures and data analysis tools in Python. We will mainly use it for our machine learning experiment.

All the functions, operations, etc. used in the exercises are explained in the exercises themselves. This notebook is just a short introduction to what the library looks like.

The official documentation of pandas is available here: https://pandas.pydata.org/pandas-docs/stable/index.html. If you are ever in doubt about how to use a function or which function to use, you can always refer to the documentation.

1 Import the library

Before using the library, we need to import it. This is usually done as follows:

import pandas as pd

We can now access all the pandas tools with the prefix pd.

2 The DataFrame type

In the exercises, we will store and use the datasets using the DataFrame type. A dataframe is a two-dimensional data structure with labeled rows and columns. We will refer to the columns as attributes and to the rows as indexes.

import numpy as np # we use the numpy library to create a fake dataframe of random numbers

df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df
          A         B         C         D
0 -1.142820 -1.078205  1.142203  0.137574
1 -0.041211 -1.114113 -1.032257  1.108194
2 -1.818744 -0.577860  0.609326  0.396509
3 -0.169994  1.381553 -0.238063  1.766694
4 -0.635453 -0.666143 -0.330815 -1.404346
5  0.257478 -0.578234 -0.691365  0.038256
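To make the idea of labeled rows and columns more concrete, here is a minimal sketch that builds a small dataframe from a dictionary instead of random numbers. The name people and the values in it are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
people = pd.DataFrame(
    {"name": ["Ana", "Ben", "Chloe"], "age": [31, 25, 47]}
)

print(people.columns.tolist())  # column labels (the attributes)
print(people.index.tolist())    # row labels (the indexes)
print(people.shape)             # (number of rows, number of columns)
```

Each dictionary key becomes a column, and since no row labels are given, pandas numbers the rows starting from 0.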

We can perform many operations on dataframes. For example, the info() method prints basic information about the dataframe: its index range, its columns with their non-null counts and types, and its memory usage.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       6 non-null      float64
 1   B       6 non-null      float64
 2   C       6 non-null      float64
 3   D       6 non-null      float64
dtypes: float64(4)
memory usage: 320.0 bytes
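info() is only one of many operations. As a rough sketch of a few others that are often useful for a first look at the data, the example below uses head(), describe() and mean(). The dataframe demo and the seed value are assumptions made for this illustration, not part of the exercises.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((6, 4)), columns=list("ABCD"))

first_rows = demo.head(3)      # first 3 rows of the dataframe
summary = demo.describe()      # count, mean, std, min, quartiles, max per column
column_means = demo.mean()     # mean of each column, returned as a Series

print(first_rows)
print(summary)
print(column_means)
```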

3 Attributes

We can access a column of the dataframe (an attribute) by its name:

df['A']
0   -1.142820
1   -0.041211
2   -1.818744
3   -0.169994
4   -0.635453
5    0.257478
Name: A, dtype: float64
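The same bracket syntax also accepts a list of names to select several attributes at once, and rows can be selected by label with loc. The sketch below illustrates this on a hypothetical dataframe called toy, which is an assumption for this example.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {"A": [1.5, 2.5, 3.5], "B": [10, 20, 30], "C": ["x", "y", "z"]}
)

one_column = toy["A"]           # a single name returns a Series
two_columns = toy[["A", "B"]]   # a list of names returns a DataFrame
row_1 = toy.loc[1]              # the row with index label 1, as a Series

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(row_1["B"])                  # 20
```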

It is possible to perform operations on a single attribute of the dataframe. For example, we can change the type of an attribute with the astype() method. Note that converting floats to integers drops the fractional part rather than rounding.

# If you only want to display the column 'A' as integers (without modifying df), do the following:
df['A'].astype('int')
0   -1
1    0
2   -1
3    0
4    0
5    0
Name: A, dtype: int64
df['A']
0   -1.142820
1   -0.041211
2   -1.818744
3   -0.169994
4   -0.635453
5    0.257478
Name: A, dtype: float64
# If you want to change the type of the column 'A' in the 'df' dataframe, do the following:
df['A'] = df['A'].astype('int')
df['A']
0   -1
1    0
2   -1
3    0
4    0
5    0
Name: A, dtype: int64

We get the type of the attribute like this:

df['A'].dtype
dtype('int64')
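The outputs above show that astype('int') turned -1.818744 into -1 and not -2: the conversion truncates toward zero. If rounding to the nearest integer is what you want, one possible approach is to call round() first. The series values below is a hypothetical example, and 'int64' is spelled out so that the result does not depend on the platform.

```python
import pandas as pd

# Hypothetical values, used only for illustration
values = pd.Series([-1.8, -0.2, 0.7, 1.5], name="A")

truncated = values.astype("int64")        # drops the fractional part
rounded = values.round().astype("int64")  # rounds first, then converts

print(truncated.tolist())  # [-1, 0, 0, 1]
print(rounded.tolist())    # [-2, 0, 1, 2]
```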

4 Copy a dataframe

Sometimes, we want to copy an existing dataframe into a new one, so that we can perform operations on it without modifying the initial data. Use the copy() method to copy a dataframe.

df2 = df.copy()
df2
   A         B         C         D
0 -1 -1.078205  1.142203  0.137574
1  0 -1.114113 -1.032257  1.108194
2 -1 -0.577860  0.609326  0.396509
3  0  1.381553 -0.238063  1.766694
4  0 -0.666143 -0.330815 -1.404346
5  0 -0.578234 -0.691365  0.038256

Now you can try anything on df2 without fear of losing the initial data, which is still in df.
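To see why copy() matters, compare it with a plain assignment, which only creates a second name for the same dataframe. The sketch below uses hypothetical names (original, alias, independent) chosen for this illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
original = pd.DataFrame({"A": [1, 2, 3]})

alias = original               # plain assignment: same object, no copy
independent = original.copy()  # a separate dataframe with the same content

independent.loc[0, "A"] = 100  # changes the copy only
alias.loc[1, "A"] = 200        # changes original too, since alias is original

print(original["A"].tolist())     # [1, 200, 3]
print(independent["A"].tolist())  # [100, 2, 3]
```

The change made through alias shows up in original, while the change made on the copy does not.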

5 More insight

Here is a very helpful cheat sheet about Pandas and Python, developed by datacamp.com, that contains most of the tools we will need in this course.