Information
Section: The Pandas library
Goal: Get a first idea of the basics of the Pandas library in Python.
Time needed: 10 min
Prerequisites: Curiosity
The Pandas library
The pandas library provides data structures and data analysis tools in Python. We will mainly use it for our machine learning experiment.
All the functions, operations, etc. used in the exercises are explained in the exercises themselves. This notebook is just a short introduction to what the library looks like.
The official documentation of pandas is available here: https://pandas.pydata.org/pandas-docs/stable/index.html. If you are ever in doubt about how to use a function or which function to use, you can always refer to the documentation.
1 Import the library
Before using the library, we need to import it. This is usually done as follows:
import pandas as pd
We can now access all the pandas tools with the prefix pd.
2 The DataFrame type
In the exercises, we will store and use the datasets using the DataFrame type. A dataframe is a two-dimensional data structure with labeled rows and columns. We will refer to the columns as attributes and to the rows as indexes.
import numpy as np # we use the numpy library to create a fake dataframe of random numbers
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
df
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | -1.142820 | -1.078205 | 1.142203 | 0.137574 |
| 1 | -0.041211 | -1.114113 | -1.032257 | 1.108194 |
| 2 | -1.818744 | -0.577860 | 0.609326 | 0.396509 |
| 3 | -0.169994 | 1.381553 | -0.238063 | 1.766694 |
| 4 | -0.635453 | -0.666143 | -0.330815 | -1.404346 |
| 5 | 0.257478 | -0.578234 | -0.691365 | 0.038256 |
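To connect this vocabulary with the object itself, here is a minimal sketch that builds a similar dataframe and inspects its attributes (columns) and indexes (rows). The variable name demo and the seed value are illustrative assumptions and do not come from the exercises.

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((6, 4)), columns=list('ABCD'))

print(demo.shape)          # (number of rows, number of columns) -> (6, 4)
print(list(demo.columns))  # the attributes -> ['A', 'B', 'C', 'D']
print(list(demo.index))    # the indexes -> [0, 1, 2, 3, 4, 5]
```

When no index is given, pandas numbers the rows from 0, which is why the first column of the displayed table has no header.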
We can perform many operations on dataframes. For example, the info() method prints basic information about the dataframe: the index range, the name and type of each column, the number of non-null values, and the memory usage.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 6 non-null float64
1 B 6 non-null float64
2 C 6 non-null float64
3 D 6 non-null float64
dtypes: float64(4)
memory usage: 320.0 bytes
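As a hedged illustration of a few other common operations, the sketch below uses head(), describe() and mean() on a dataframe built the same way. The variable names are assumptions chosen for this example, and the printed numbers will differ on every run because the data are random.

```python
import numpy as np
import pandas as pd

# Same kind of random dataframe as above (values change at each run)
demo = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))

first_rows = demo.head(3)  # the first three rows
summary = demo.describe()  # count, mean, std, min, quartiles and max per column
col_means = demo.mean()    # mean of each column, returned as a Series

print(first_rows)
print(summary)
print(col_means)
```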
3 Attributes
We can access a column of the dataframe (an attribute) by its name:
df['A']
0 -1.142820
1 -0.041211
2 -1.818744
3 -0.169994
4 -0.635453
5 0.257478
Name: A, dtype: float64
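Selecting one attribute by name gives a one-dimensional Series, as the output above shows. The following sketch, which uses a small made-up dataframe (its values are assumptions for illustration only), also shows that passing a list of names selects several attributes at once and gives back a dataframe.

```python
import pandas as pd

# Small hand-written dataframe, for illustration only
demo = pd.DataFrame({'A': [1.5, 2.5, 3.5],
                     'B': [10, 20, 30],
                     'C': ['x', 'y', 'z']})

one_column = demo['A']          # a single name returns a Series
two_columns = demo[['A', 'B']]  # a list of names returns a DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
```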
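It is possible to perform operations on a single attribute of the dataframe. For example, we can change the type of an attribute with the astype() method. Note that astype() returns a new object and leaves the dataframe unchanged unless the result is assigned back.
# If you only want to print the column 'A' as integers, do the following: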
df['A'].astype('int')
0 -1
1 0
2 -1
3 0
4 0
5 0
Name: A, dtype: int64
df['A']
0 -1.142820
1 -0.041211
2 -1.818744
3 -0.169994
4 -0.635453
5 0.257478
Name: A, dtype: float64
# If you want to change the type of the column 'A' in the 'df' dataframe, do the following:
df['A'] = df['A'].astype('int')
df['A']
0 -1
1 0
2 -1
3 0
4 0
5 0
Name: A, dtype: int64
We get the type of the attribute like this:
df['A'].dtype
dtype('int64')
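Note that in the outputs above, -1.818744 became -1 and not -2: converting floats to integers with astype() drops the fractional part instead of rounding. The sketch below illustrates the difference on a small made-up Series (the values are assumptions for illustration), using round() before the conversion when rounding is wanted.

```python
import pandas as pd

# Made-up values, for illustration only
values = pd.Series([-1.8, -0.4, 0.6, 1.7], name='A')

truncated = values.astype('int')        # drops the fractional part (toward zero)
rounded = values.round().astype('int')  # rounds to the nearest integer first

print(truncated.tolist())  # [-1, 0, 0, 1]
print(rounded.tolist())    # [-2, 0, 1, 2]
```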
4 Copy a dataframe
Sometimes, we want to copy an existing dataframe into a new one, so that we can perform operations on it without modifying the initial data. Use the copy() method to copy a dataframe.
df2 = df.copy()
df2
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | -1 | -1.078205 | 1.142203 | 0.137574 |
| 1 | 0 | -1.114113 | -1.032257 | 1.108194 |
| 2 | -1 | -0.577860 | 0.609326 | 0.396509 |
| 3 | 0 | 1.381553 | -0.238063 | 1.766694 |
| 4 | 0 | -0.666143 | -0.330815 | -1.404346 |
| 5 | 0 | -0.578234 | -0.691365 | 0.038256 |
Now you can try anything on df2 without fear of losing the initial data, which is still in df.
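To see why copy() matters, the sketch below contrasts it with a plain assignment, which only creates a second name for the same dataframe. The variable names and values are assumptions chosen for this illustration.

```python
import pandas as pd

# Made-up data, for illustration only
original = pd.DataFrame({'A': [1, 2, 3]})

alias = original               # plain assignment: a second name for the same object
independent = original.copy()  # copy(): a separate dataframe with its own data

alias.loc[0, 'A'] = 100        # also changes original, since both names refer to one object
independent.loc[1, 'A'] = -5   # changes only the copy

print(alias is original)          # True
print(original['A'].tolist())     # [100, 2, 3]
print(independent['A'].tolist())  # [1, -5, 3]
```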
5 More insight
A very helpful cheat sheet about Pandas and Python, developed by datacamp.com, contains most of the tools we will need in this course.