```{admonition} Information
__Section__: Understand the data  
__Goal__: Understand the attributes of the AIS data.  
__Time needed__: 30 min  
__Prerequisites__: Introduction about machine learning experiments
```

# Understand the data

## Import the data <a class="anchor" id="import-data"></a>

```{toggle} Advanced level
Before anything else, we will take a look at the data we will work with in this section. Let's start by importing the datasets, using the function [read_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) from the library [Pandas](https://pandas.pydata.org/). We will use two datasets containing AIS data.
```

In [4]:
import pandas as pd

dynamic_data = pd.read_csv('./dynamic_data.csv')
static_data = pd.read_csv('./static_data.csv')
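
The import above keeps the pandas defaults, under which timestamp columns are read as plain strings. As a minimal sketch, assuming the files contain the timestamp columns shown in the previews later in this section, the `parse_dates` parameter of `read_csv()` can convert them while loading. The inline `csv_text` below is a cut-down excerpt written for illustration, not the course file itself.

```python
import io

import pandas as pd

# Cut-down excerpt (a few columns, two rows) standing in for dynamic_data.csv
csv_text = """MMSI,BaseDateTime,LAT,LON,SOG,TripID,DepTime,ArrTime
367114690,2017-01-01 00:00:06,48.51094,-122.60705,0.0,1,2017-01-01 00:00:06,2017-01-01 02:40:45
367479990,2017-01-01 00:00:03,48.15891,-122.67268,0.1,2,2017-01-01 00:00:03,2017-01-01 02:40:44
"""

# parse_dates turns the listed columns into datetime values instead of strings
sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["BaseDateTime", "DepTime", "ArrTime"],
)

print(sample.dtypes)
```

With datetime columns, operations such as subtracting two timestamps or sorting messages by time work directly.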

The data come from the U.S. Marine Cadastre website and are publicly available here: https://marinecadastre.gov/ais/. The datasets used here are slightly modified versions of the raw datasets available on the website, adapted to the needs of this course. For more information about the modifications made to these datasets, visit [this page](./../../appendix/A-1-1-AIS-modifications.html).

When dealing with any data-related problem, the first step is to fully understand the data we are using. This means knowing what the data represent and understanding the meaning of each attribute.

In addition, the scientist using the data for a problem must know some important points, specifically:
* where the data come from: which open organization, company, or person collected the data in the first place.
* when the data were collected: most data can vary over time (for example, the number of passengers using a transportation service will vary over the years in an expanding city).
* the range of the data: for example, for geographical data, the area of collection has to be known.
* how the data were acquired: the data might have been recorded with sensors or hand tools, collected by a person through forms, or generated automatically by software.
* whether the data have been previously modified: the organization that collected the data in the first place might have preprocessed them before handing them over to other partners.

## AIS data in general <a class="anchor" id="ais-data"></a>

Most vessels navigating the world today must be equipped with an AIS (Automatic Identification System) transponder, which sends information about the ship's status and position at regular time intervals (every 1 to 3 minutes). The data are collected by coastal stations and other ships, message after message. Each message is sent at a certain timestamp by a certain ship, and can contain either static or dynamic information about the ship and the trip it is currently on.

Dynamic information may change every time a new message is sent: for example the speed of the ship, its heading, or its position (latitude and longitude). Static information stays the same for the whole trip of a ship: for example the identification of the ship (MMSI), the destination information, or the type of the ship.

In [1]:
from IPython.display import IFrame
IFrame("https://www.youtube.com/embed/LQyGorGLYGU", "560", "315")

In this part, you play the role of a data scientist in charge of examining and working with AIS data collected from a coastal station next to Seattle (United States). The data you are using were collected by the U.S. Marine Cadastre on the 1st of January 2017, in the area [UTM10](https://marinecadastre.gov/AIS/AIS%20Documents/UTMZoneMap2014.png) (west coast of the United States and Canada). The exact latitude range is ``[32.20937 ; 49.89074]`` and the longitude range is ``[-125.99859 ; -120.00242]``, which lies within this area:

![text](1-1-data_area.JPG)

This area, being on the coast, contains several harbours, which makes it interesting for our use.
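
Knowing the range of the data is one of the points listed earlier, and it can be checked in code. The sketch below tests whether a handful of positions fall inside the latitude and longitude ranges quoted above. The five positions are copied from the dataset preview shown later in this section, and the names `LAT_RANGE`, `LON_RANGE` and `positions` are chosen for this illustration.

```python
import pandas as pd

# Ranges quoted in the text for this dataset
LAT_RANGE = (32.20937, 49.89074)
LON_RANGE = (-125.99859, -120.00242)

# Five positions taken from the preview of the dynamic dataset
positions = pd.DataFrame({
    "LAT": [48.51094, 48.15891, 43.34576, 46.74264, 48.5132],
    "LON": [-122.60705, -122.67268, -124.32142, -124.93125, -122.60718],
})

# between() is inclusive on both bounds by default
inside = positions["LAT"].between(*LAT_RANGE) & positions["LON"].between(*LON_RANGE)

print("latitude span:", positions["LAT"].min(), "to", positions["LAT"].max())
print("longitude span:", positions["LON"].min(), "to", positions["LON"].max())
print("all positions inside the stated area:", inside.all())
```

On the full dataset, the same `min()` and `max()` calls on the `LAT` and `LON` columns should return the bounds of the collection area.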

## Dynamic data <a class="anchor" id="dynamic-data"></a>

We start by printing a part of the dataset, to visualize it. The data you have to work with look like this:

```{toggle} Advanced level
The function [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) allows you to print the first elements of a pandas DataFrame.
```

In [5]:
dynamic_data.head()

Unnamed: 0,MMSI,BaseDateTime,LAT,LON,SOG,COG,Heading,VesselName,IMO,CallSign,VesselType,Status,Length,Width,Draft,TripID,DepTime,ArrTime,DepLat,DepLon,ArrLat,ArrLon,DepCountry,DepCity,ArrCountry,ArrCity
0,367114690,2017-01-01 00:00:06,48.51094,-122.60705,0.0,-49.6,511.0,,,,,under way using engine,,,,1,2017-01-01 00:00:06,2017-01-01 02:40:45,48.51094,-122.60705,48.51095,-122.60705,US,Anacortes,US,Anacortes
1,367479990,2017-01-01 00:00:03,48.15891,-122.67268,0.1,10.1,353.0,WSF KENNEWICK,IMO9618331,WDF6991,1012.0,moored,83.39,19.5,3.2,2,2017-01-01 00:00:03,2017-01-01 02:40:44,48.15891,-122.67268,48.11099,-122.75885,US,Coupeville,US,Port Townsend
2,368319000,2017-01-01 00:00:08,43.34576,-124.32142,0.0,32.8,173.0,,,,,engaged in fishing,,,,3,2017-01-01 00:00:08,2017-01-01 02:44:48,43.34576,-124.32142,43.34578,-124.32141,US,Barview,US,Barview
3,367154100,2017-01-01 00:00:15,46.74264,-124.93125,6.8,6.0,352.0,,,,,undefined,,,,4,2017-01-01 00:00:15,2017-01-01 02:33:28,46.74264,-124.93125,47.02928,-124.95153,US,Ocean Shores,US,Ocean Shores
4,367446870,2017-01-01 00:00:59,48.5132,-122.60718,0.0,23.2,511.0,,,,,,,,,5,2017-01-01 00:00:59,2017-01-01 02:42:54,48.5132,-122.60718,48.51318,-122.60699,US,Anacortes,US,Anacortes


This dataset contains a mix of dynamic and static attributes, as explained earlier. The label _artificially created from original data_ marks the attributes that we added to the downloaded data.

Static attributes:

+ __MMSI__: unique 9-digit [identification code](https://www.navcen.uscg.gov/?pageName=mtMmsi) of the ship - numeric
+ __VesselName__: name of the ship - string
+ __IMO__: unique 7-digit [international identification number](https://imonumbers.lrfairplay.com/), that remains unchanged after the transfer of the ship's registration to another country - string in this dataset (the number is stored with an ``IMO`` prefix, for example ``IMO9618331``)
+ __CallSign__: unique callsign of the ship - string
+ __VesselType__: type of the ship, numerically coded, see [here](https://coast.noaa.gov/data/marinecadastre/ais/VesselTypeCodes2018.pdf) for details - numeric
+ __Length__: length of the ship, in meters - numeric
+ __Width__: width of the ship, in meters - numeric
+ __Draft__: vertical distance between the waterline and the bottom of the hull of the ship, in meters. For a given ship, it varies with the load of the ship and the density of the water - numeric
+ __TripID__: (_artificially created from original data_) unique ID for the trip - numeric
+ __DepTime__: (_artificially created from original data_) departure time for the trip - datetime
+ __ArrTime__: (_artificially created from original data_) arrival time for the trip - datetime
+ __DepLat__: (_artificially created from original data_) departure latitude for the trip - numeric
+ __DepLon__: (_artificially created from original data_) departure longitude for the trip - numeric
+ __ArrLat__: (_artificially created from original data_) arrival latitude for the trip - numeric
+ __ArrLon__: (_artificially created from original data_) arrival longitude for the trip - numeric
+ __DepCountry__: (_artificially created from original data_) departure country for the trip - string
+ __DepCity__: (_artificially created from original data_) departure city for the trip - string
+ __ArrCountry__: (_artificially created from original data_) arrival country for the trip - string
+ __ArrCity__: (_artificially created from original data_) arrival city for the trip - string

Dynamic attributes:

+ __BaseDateTime__: timestamp of the AIS message - datetime
+ __LAT__: latitude of the ship (in degrees: [-90 ; 90], a negative value represents South, 91 indicates 'not available') - numeric
+ __LON__: longitude of the ship (in degrees: [-180 ; 180], a negative value represents West, 181 indicates 'not available') - numeric
+ __SOG__: speed over ground, in knots - numeric
+ __COG__: course over ground, direction relative to the absolute North (in degrees: [0 ; 359]) - numeric
+ __Heading__: heading of the ship (in degrees: [0 ; 359], 511 indicates 'not available') - numeric
+ __Status__: status of the ship - string
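
Several attributes above use a reserved number to signal 'not available' (91 for latitude, 181 for longitude, 511 for heading). If left as they are, these values would distort statistics such as a mean heading. The sketch below shows one possible way to turn them into missing values with pandas. The small `messages` table reuses values from the preview above, and the `NOT_AVAILABLE` mapping is a name introduced for this illustration.

```python
import pandas as pd

# A few values copied from the preview of the dynamic dataset
messages = pd.DataFrame({
    "LAT": [48.51094, 48.15891, 43.34576],
    "LON": [-122.60705, -122.67268, -124.32142],
    "Heading": [511.0, 353.0, 173.0],
})

# Reserved 'not available' codes, as documented in the attribute list
NOT_AVAILABLE = {"LAT": 91, "LON": 181, "Heading": 511}

cleaned = messages.copy()
for column, code in NOT_AVAILABLE.items():
    # mask() replaces the entries where the condition is True with NaN
    cleaned[column] = cleaned[column].mask(cleaned[column] == code)

print(cleaned)
print("mean heading without the 511 code:", cleaned["Heading"].mean())
```

Here only the first heading is replaced, so the mean is computed from the two real headings (353 and 173).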

In [2]:
from IPython.display import IFrame
IFrame("https://www.youtube.com/embed/3M7mxBimT70", "560", "315")

## Static data <a class="anchor" id="static-data"></a>

From the AIS data downloaded from the Marine Cadastre website, we have created a smaller dataset containing only the static data of the trips. This will allow you to compare and analyze the trips as entities. Let's have a look at these data.

In [6]:
static_data.head()

Unnamed: 0,TripID,MMSI,MeanSOG,VesselName,IMO,CallSign,VesselType,Length,Width,Draft,Cargo,DepTime,ArrTime,DepLat,DepLon,ArrLat,ArrLon,DepCountry,DepCity,ArrCountry,ArrCity,Duration
0,1,367114690,0.0,,,,,,,,,2017-01-01 00:00:06,2017-01-01 02:40:45,48.51094,-122.60705,48.51095,-122.60705,US,Anacortes,US,Anacortes,0 days 02:40:39
1,2,367479990,6.536585,WSF KENNEWICK,IMO9618331,WDF6991,1012.0,83.39,19.5,3.2,,2017-01-01 00:00:03,2017-01-01 02:40:44,48.15891,-122.67268,48.11099,-122.75885,US,Coupeville,US,Port Townsend,0 days 02:40:41
2,3,368319000,0.000758,,,,,,,,,2017-01-01 00:00:08,2017-01-01 02:44:48,43.34576,-124.32142,43.34578,-124.32141,US,Barview,US,Barview,0 days 02:44:40
3,4,367154100,6.871111,,,,,,,,,2017-01-01 00:00:15,2017-01-01 02:33:28,46.74264,-124.93125,47.02928,-124.95153,US,Ocean Shores,US,Ocean Shores,0 days 02:33:13
4,5,367446870,0.0,,,,,,,,,2017-01-01 00:00:59,2017-01-01 02:42:54,48.5132,-122.60718,48.51318,-122.60699,US,Anacortes,US,Anacortes,0 days 02:41:55


As you can see, most of the attributes are the same as for the dynamic data. Only the dynamic attributes (like timestamp, latitude, longitude, speed and heading) have disappeared.
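
The comparison of the two sets of attributes can also be done in code rather than by eye. The sketch below works on the column names copied from the two previews in this section (the leading index column is left out), and lists the names that appear in only one of the two tables.

```python
# Column names copied from the dynamic_data.head() preview
dynamic_columns = [
    "MMSI", "BaseDateTime", "LAT", "LON", "SOG", "COG", "Heading",
    "VesselName", "IMO", "CallSign", "VesselType", "Status",
    "Length", "Width", "Draft", "TripID", "DepTime", "ArrTime",
    "DepLat", "DepLon", "ArrLat", "ArrLon",
    "DepCountry", "DepCity", "ArrCountry", "ArrCity",
]

# Column names copied from the static_data.head() preview
static_columns = [
    "TripID", "MMSI", "MeanSOG", "VesselName", "IMO", "CallSign",
    "VesselType", "Length", "Width", "Draft", "Cargo", "DepTime", "ArrTime",
    "DepLat", "DepLon", "ArrLat", "ArrLon",
    "DepCountry", "DepCity", "ArrCountry", "ArrCity", "Duration",
]

# Keep the original order so the result is easy to read
only_dynamic = [name for name in dynamic_columns if name not in static_columns]
only_static = [name for name in static_columns if name not in dynamic_columns]

print(only_dynamic)  # ['BaseDateTime', 'LAT', 'LON', 'SOG', 'COG', 'Heading', 'Status']
print(only_static)   # ['MeanSOG', 'Cargo', 'Duration']
```

With the DataFrames loaded, `dynamic_data.columns.difference(static_data.columns)` gives the same kind of answer. Going by the previews, the static table also carries a `Cargo` column that does not appear in the dynamic preview.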

Two attributes are new here:
+ __MeanSOG__: the mean of the value of the SOG attribute for all the points in the trip - numeric
+ __Duration__: the total duration of the tracked trip - time duration (for example ``0 days 02:40:39``)
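
To make these two attributes concrete, the sketch below derives a per-trip mean speed and duration from a handful of messages with `groupby()`. The messages are invented for this illustration, and the approach (duration taken as last timestamp minus first timestamp) is an assumption, not necessarily how the course dataset was built.

```python
import pandas as pd

# Invented messages for two hypothetical trips (illustrative values only)
messages = pd.DataFrame({
    "TripID": [1, 1, 1, 2, 2],
    "BaseDateTime": pd.to_datetime([
        "2017-01-01 00:00:00", "2017-01-01 00:02:00", "2017-01-01 00:04:00",
        "2017-01-01 00:01:00", "2017-01-01 00:03:00",
    ]),
    "SOG": [0.0, 3.0, 6.0, 5.0, 7.0],
})

# One row per trip: mean speed, first and last timestamp
per_trip = messages.groupby("TripID").agg(
    MeanSOG=("SOG", "mean"),
    DepTime=("BaseDateTime", "min"),
    ArrTime=("BaseDateTime", "max"),
)

# Subtracting two datetime columns gives a timedelta column
per_trip["Duration"] = per_trip["ArrTime"] - per_trip["DepTime"]

print(per_trip)
```

The result has one row per `TripID`, which is the shape of the static dataset: many messages collapse into a single summary per trip.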

## Comparison of dynamic and static datasets <a class="anchor" id="compare-data"></a>

The static dataset is built from the same AIS messages as the dynamic dataset and therefore describes the same ships and trips.

The only difference is that it contains one instance for each trip, instead of one instance for each AIS message.

We can verify that by printing the information of the dataset:

In [7]:
static_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1520 entries, 0 to 1519
Data columns (total 22 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   TripID      1520 non-null   int64  
 1   MMSI        1520 non-null   int64  
 2   MeanSOG     1520 non-null   float64
 3   VesselName  1442 non-null   object 
 4   IMO         538 non-null    object 
 5   CallSign    1137 non-null   object 
 6   VesselType  1287 non-null   float64
 7   Length      1220 non-null   float64
 8   Width       911 non-null    float64
 9   Draft       496 non-null    float64
 10  Cargo       378 non-null    float64
 11  DepTime     1520 non-null   object 
 12  ArrTime     1520 non-null   object 
 13  DepLat      1520 non-null   float64
 14  DepLon      1520 non-null   float64
 15  ArrLat      1520 non-null   float64
 16  ArrLon      1520 non-null   float64
 17  DepCountry  1520 non-null   object 
 18  DepCity     1520 non-null   object 
 19  ArrCountry  1520 non-null  

The output shows that the dataset contains 1520 entries, which is the number of different trips in the dynamic dataset.
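
The `Non-Null Count` column of this output also tells us how complete each attribute is. As a small sketch, the counts below are copied from the output above for the columns that have missing values, and converted into a number and a percentage of missing entries. The names `non_null` and `summary` are chosen for this illustration.

```python
import pandas as pd

TOTAL_TRIPS = 1520  # number of entries reported by info()

# Non-null counts copied from the info() output above
non_null = pd.Series({
    "VesselName": 1442,
    "IMO": 538,
    "CallSign": 1137,
    "VesselType": 1287,
    "Length": 1220,
    "Width": 911,
    "Draft": 496,
    "Cargo": 378,
})

missing = TOTAL_TRIPS - non_null
summary = pd.DataFrame({
    "missing": missing,
    "missing_pct": (missing / TOTAL_TRIPS * 100).round(1),
}).sort_values("missing", ascending=False)

print(summary)
```

With the DataFrame loaded, `static_data.isna().sum()` returns the number of missing values per column directly. Attributes such as `Cargo`, `Draft` and `IMO` are missing for most trips, which matters when choosing attributes for an experiment.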

```{toggle} Advanced level
We can verify that by comparing the MMSIs of the ships tracked in both datasets. For that, we collect the unique values of the attribute MMSI in both datasets with the function [unique()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html).

We could then print the two lists and visually compare them, but they both contain 1520 values and that would be tedious. Instead, we will check the difference between them by using two loops and printing the elements that are present in one list and not in the other.

We do that twice: ``diff1`` contains the elements present in the dynamic dataset but not in the static one, and ``diff2`` contains the elements present in the static dataset but not in the dynamic one.
```

In [8]:
# Get the unique MMSI values in both datasets
dynamic_mmsi = dynamic_data['MMSI'].unique()
static_mmsi = static_data['MMSI'].unique()

# Print the two lists
print('dynamic: ' + str(dynamic_mmsi))
print('static: ' + str(static_mmsi))

# Check for elements in dynamic_mmsi that are not in static_mmsi
diff1 = []
for element in dynamic_mmsi:
    if element not in static_mmsi:
        diff1.append(element)
print('diff1: ' + str(diff1))
        
# Check for elements in static_mmsi that are not in dynamic_mmsi
diff2 = []
for element in static_mmsi:
    if element not in dynamic_mmsi:
        diff2.append(element)
print('diff2: ' + str(diff2))

dynamic: [367114690 367479990 368319000 ... 367417230 316021258 316021259]
static: [367114690 367479990 368319000 ... 367417230 316021258 316021259]
diff1: []
diff2: []


```{toggle} Advanced level
As we see, the two lists containing the differences are empty, meaning that the dynamic and static datasets contain exactly the same MMSIs.

To easily understand the difference between the datasets, let's have a look at their lengths with the function [len()](https://docs.python.org/3/library/functions.html#len).
```

In [9]:
len(dynamic_data)

100000

In [10]:
len(static_data)

1520

```{toggle} Advanced level
Finally, we can check that each row of the static dataset holds the information of one trip in the dynamic data by printing the number of different MMSI values in the dynamic dataset:
```

In [11]:
len(dynamic_data['MMSI'].unique())

1520

```{toggle} Advanced level
The number of unique MMSIs in the dynamic dataset is the same as the number of instances in the static dataset.
```
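
The checks in this section can be written more compactly. The sketch below is an alternative to the two loops, using NumPy's `setdiff1d()` to compare the MMSI values and the pandas method `nunique()` to count distinct values. The two small tables are stand-ins built for this illustration (the MMSI values come from the previews, but the number of rows is invented).

```python
import numpy as np
import pandas as pd

# Stand-in for the dynamic dataset: several messages per ship and trip
dynamic = pd.DataFrame({
    "MMSI": [367114690, 367114690, 367479990, 368319000, 368319000],
    "TripID": [1, 1, 2, 3, 3],
})

# Stand-in for the static dataset: one row per trip
static = pd.DataFrame({
    "TripID": [1, 2, 3],
    "MMSI": [367114690, 367479990, 368319000],
})

# setdiff1d(a, b) returns the sorted unique values of a that are not in b
diff1 = np.setdiff1d(dynamic["MMSI"].unique(), static["MMSI"].unique())
diff2 = np.setdiff1d(static["MMSI"].unique(), dynamic["MMSI"].unique())
print("diff1:", diff1)
print("diff2:", diff2)

# nunique() counts distinct values without building the list first
print("ships in dynamic:", dynamic["MMSI"].nunique())
print("trips in dynamic:", dynamic["TripID"].nunique())
print("rows in static:", len(static))
print("one row per trip in static:", static["TripID"].is_unique)
```

Both approaches give the same answer. The set-based version avoids the nested lookup inside the loops, which becomes noticeable as the number of ships grows.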

## More information and quiz <a class="anchor" id="quiz"></a>

For more information about AIS data:
* [US coast guard page about AIS](https://www.navcen.uscg.gov/?pageName=AISmain)
* [all about AIS](http://www.allaboutais.com/index.php/en/)
* [ITU-R Recommendation M.1371-5, the technical specification of AIS](https://www.itu.int/dms_pubrec/itu-r/rec/m/R-REC-M.1371-5-201402-I!!PDF-E.pdf)

In [3]:
from IPython.display import IFrame
IFrame("https://h5p.org/h5p/embed/741872", "694", "600")