Information
Section: Understand the data
Goal: Understand the attributes of the AIS data.
Time needed: 30 min
Prerequisites: Introduction about machine learning experiments
Understand the data
Import the data
Before anything else, we will take a look at the data we will work with in this section. Let’s start by importing the datasets, using the function read_csv() from the pandas library. We will use two datasets containing AIS data.
import pandas as pd
dynamic_data = pd.read_csv('./dynamic_data.csv')
static_data = pd.read_csv('./static_data.csv')
The data are imported from the U.S. Marine Cadastre website and are publicly available here: https://marinecadastre.gov/ais/. The datasets used here are a slightly modified version of the raw datasets available on the website, to match the needs of this course. For more information about the modifications done to these datasets, visit this page.
When dealing with any data-related problem, the first step is to fully understand the data we are using. This means knowing what the data represent and understanding the meaning of each attribute.
In addition, anyone using the data to solve a problem should know a few important points, specifically:
where the data come from: which organization, company, or person collected the data in the first place.
when the data were collected: most data can vary over time (for example, the number of passengers using a transportation service will vary over the years in an expanding city).
the range of the data: for example, for geographical data, the area of collection has to be known.
how the data were acquired: the data might have been recorded with sensors or hand tools, collected from people through forms, generated automatically by software, …
whether the data have been previously modified: the organization that collected the data in the first place might have preprocessed them before handing them over to other parties.
AIS data in general
Most vessels navigating in the world today must be equipped with an AIS (Automatic Identification System) transponder, which sends information about the ship’s status and position at regular time intervals (every 1 to 3 minutes). The data are collected by coastal stations and other ships, message after message. A message is sent at a certain timestamp by a certain ship, and can contain either static or dynamic information about the ship and the trip it is currently on.
Dynamic information may change every time a new message is sent: for example the speed of the ship, its heading, or its position (latitude and longitude). Static information stays the same for the whole trip of one ship: for example the identification of the ship (MMSI), the destination information, the type of the ship, …
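The distinction between static and dynamic information can be checked directly on a table of messages. The sketch below is only an illustration: it uses a tiny made-up sample rather than the course files, and it assumes a trip identifier column (called TripID in the datasets of this section). A column is treated as static when it takes at most one distinct value within each trip.

```python
import pandas as pd

# Hypothetical sample of AIS messages (not the real course data)
messages = pd.DataFrame({
    "TripID": [1, 1, 1, 2, 2],
    "MMSI": [367479990, 367479990, 367479990, 367154100, 367154100],
    "VesselName": ["WSF KENNEWICK"] * 3 + [None] * 2,
    "SOG": [0.1, 9.5, 10.0, 6.8, 6.9],
    "LAT": [48.15891, 48.15000, 48.11099, 46.74264, 47.02928],
})

# Number of distinct (non-missing) values of each column within each trip
values_per_trip = messages.groupby("TripID").nunique()

# A column is static if it never takes more than one value inside a trip
is_static = (values_per_trip <= 1).all()

print("static: ", is_static[is_static].index.tolist())
print("dynamic:", is_static[~is_static].index.tolist())
```

Missing values are ignored by nunique(), so a column that is entirely empty for a trip still counts as static here.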
from IPython.display import IFrame
IFrame("https://www.youtube.com/embed/LQyGorGLYGU", "560", "315")
In this part, you play the role of a data scientist in charge of examining and working with AIS data collected from a coastal station next to Seattle (United States). The data you are using were collected by the U.S. Marine Cadastre on the 1st of January 2017, in the UTM10 zone (west coast of the United States and Canada). The exact latitude range is [32.20937 ; 49.89074] and the longitude range is [-125.99859 ; -120.00242], which lies within this area:
This area, being on the coast, contains several harbours, which makes it interesting for our use.
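The collection date and the latitude and longitude ranges given above are the kind of facts that are worth checking against the data themselves. Here is a minimal sketch of such a check, using a small illustrative sample in place of the real file; the column names BaseDateTime, LAT and LON are those of the dynamic dataset, and the constant names are hypothetical.

```python
import pandas as pd

# Illustrative sample standing in for the dynamic dataset
sample = pd.DataFrame({
    "BaseDateTime": ["2017-01-01 00:00:06", "2017-01-01 00:00:03", "2017-01-01 00:00:08"],
    "LAT": [48.51094, 48.15891, 43.34576],
    "LON": [-122.60705, -122.67268, -124.32142],
})
sample["BaseDateTime"] = pd.to_datetime(sample["BaseDateTime"])

# Ranges stated in the text for this extract
LAT_RANGE = (32.20937, 49.89074)
LON_RANGE = (-125.99859, -120.00242)

# Observed extent of the sample
print(sample["BaseDateTime"].min(), sample["BaseDateTime"].max())
print(sample["LAT"].min(), sample["LAT"].max())
print(sample["LON"].min(), sample["LON"].max())

# Every position should fall inside the stated area, on the stated day
in_area = sample["LAT"].between(*LAT_RANGE) & sample["LON"].between(*LON_RANGE)
on_day = sample["BaseDateTime"].dt.normalize() == pd.Timestamp("2017-01-01")
print(in_area.all(), on_day.all())
```

On the real data, the same lines can be run with dynamic_data in place of sample.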
Dynamic data
We start by printing part of the dataset to get a first look at it. The function head() prints the first rows of a pandas DataFrame.
The data you will work with look like this:
dynamic_data.head()
| | MMSI | BaseDateTime | LAT | LON | SOG | COG | Heading | VesselName | IMO | CallSign | ... | DepTime | ArrTime | DepLat | DepLon | ArrLat | ArrLon | DepCountry | DepCity | ArrCountry | ArrCity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 367114690 | 2017-01-01 00:00:06 | 48.51094 | -122.60705 | 0.0 | -49.6 | 511.0 | NaN | NaN | NaN | ... | 2017-01-01 00:00:06 | 2017-01-01 02:40:45 | 48.51094 | -122.60705 | 48.51095 | -122.60705 | US | Anacortes | US | Anacortes |
1 | 367479990 | 2017-01-01 00:00:03 | 48.15891 | -122.67268 | 0.1 | 10.1 | 353.0 | WSF KENNEWICK | IMO9618331 | WDF6991 | ... | 2017-01-01 00:00:03 | 2017-01-01 02:40:44 | 48.15891 | -122.67268 | 48.11099 | -122.75885 | US | Coupeville | US | Port Townsend |
2 | 368319000 | 2017-01-01 00:00:08 | 43.34576 | -124.32142 | 0.0 | 32.8 | 173.0 | NaN | NaN | NaN | ... | 2017-01-01 00:00:08 | 2017-01-01 02:44:48 | 43.34576 | -124.32142 | 43.34578 | -124.32141 | US | Barview | US | Barview |
3 | 367154100 | 2017-01-01 00:00:15 | 46.74264 | -124.93125 | 6.8 | 6.0 | 352.0 | NaN | NaN | NaN | ... | 2017-01-01 00:00:15 | 2017-01-01 02:33:28 | 46.74264 | -124.93125 | 47.02928 | -124.95153 | US | Ocean Shores | US | Ocean Shores |
4 | 367446870 | 2017-01-01 00:00:59 | 48.51320 | -122.60718 | 0.0 | 23.2 | 511.0 | NaN | NaN | NaN | ... | 2017-01-01 00:00:59 | 2017-01-01 02:42:54 | 48.51320 | -122.60718 | 48.51318 | -122.60699 | US | Anacortes | US | Anacortes |
5 rows × 26 columns
This dataset contains a mix of dynamic and static attributes, as explained earlier. The mention “artificially created from original data” means that we added the attribute to the downloaded data.
Static attributes:
MMSI: unique 9-digit identification code of the ship - numeric
VesselName: name of the ship - string
IMO: unique 7-digit international identification number, which remains unchanged after the transfer of the ship’s registration to another country. In this dataset it is stored with an IMO prefix (for example IMO9618331) - string
CallSign: unique callsign of the ship - string
VesselType: type of the ship, numerically coded, see here for details - numeric
Length: length of the ship, in meters - numeric
Width: width of the ship, in meters - numeric
Draft: vertical distance between the waterline and the bottom of the hull of the ship, in meters. For one ship, varies with the load of the ship and the density of the water - numeric
TripID: (artificially created from original data) unique ID for the trip - numeric
DepTime: (artificially created from original data) departure time for the trip - datetime
ArrTime: (artificially created from original data) arrival time for the trip - datetime
DepLat: (artificially created from original data) departure latitude for the trip - numeric
DepLon: (artificially created from original data) departure longitude for the trip - numeric
ArrLat: (artificially created from original data) arrival latitude for the trip - numeric
ArrLon: (artificially created from original data) arrival longitude for the trip - numeric
DepCountry: (artificially created from original data) departure country for the trip - string
DepCity: (artificially created from original data) departure city for the trip - string
ArrCountry: (artificially created from original data) arrival country for the trip - string
ArrCity: (artificially created from original data) arrival city for the trip - string
Dynamic attributes:
BaseDateTime: timestamp of the AIS message - datetime
LAT: latitude of the ship (in degrees: [-90 ; 90], a negative value represents South, 91 indicates ‘not available’) - numeric
LON: longitude of the ship (in degrees: [-180 ; 180], a negative value represents West, 181 indicates ‘not available’) - numeric
SOG: speed over ground, in knots - numeric
COG: course over ground, direction relative to true North (in degrees: [0 ; 359]) - numeric
Heading: heading of the ship (in degrees: [0 ; 359], 511 indicates ‘not available’) - numeric
Status: status of the ship - string
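Several attributes above use a special code (91, 181 or 511) to mean ‘not available’. Left as they are, these codes would be treated as real measurements in any statistic. One possible way of handling them, shown here on a made-up sample and not necessarily the approach used later in this course, is to turn them into missing values. The dictionary name NOT_AVAILABLE is hypothetical.

```python
import pandas as pd

# Made-up sample containing some 'not available' codes
sample = pd.DataFrame({
    "LAT": [48.51094, 91.0, 43.34576],
    "LON": [-122.60705, 181.0, -124.32142],
    "Heading": [511.0, 353.0, 173.0],
})

# 'Not available' codes listed in the attribute descriptions
NOT_AVAILABLE = {"LAT": 91, "LON": 181, "Heading": 511}

cleaned = sample.copy()
for column, code in NOT_AVAILABLE.items():
    # mask() replaces the values where the condition holds with NaN
    cleaned[column] = cleaned[column].mask(cleaned[column] == code)

# Number of missing values per column after the replacement
print(cleaned.isna().sum())
```

After this step, functions such as mean() skip the missing values instead of averaging in a heading of 511 degrees.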
from IPython.display import IFrame
IFrame("https://www.youtube.com/embed/3M7mxBimT70", "560", "315")
Static data
From the AIS data collected on the Marine Cadastre website, we have created a smaller dataset containing only the static data of the trips. This will allow you to compare and analyze the trips as entities. Let’s have a look at these data.
static_data.head()
| | TripID | MMSI | MeanSOG | VesselName | IMO | CallSign | VesselType | Length | Width | Draft | ... | ArrTime | DepLat | DepLon | ArrLat | ArrLon | DepCountry | DepCity | ArrCountry | ArrCity | Duration |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 367114690 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2017-01-01 02:40:45 | 48.51094 | -122.60705 | 48.51095 | -122.60705 | US | Anacortes | US | Anacortes | 0 days 02:40:39 |
1 | 2 | 367479990 | 6.536585 | WSF KENNEWICK | IMO9618331 | WDF6991 | 1012.0 | 83.39 | 19.5 | 3.2 | ... | 2017-01-01 02:40:44 | 48.15891 | -122.67268 | 48.11099 | -122.75885 | US | Coupeville | US | Port Townsend | 0 days 02:40:41 |
2 | 3 | 368319000 | 0.000758 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2017-01-01 02:44:48 | 43.34576 | -124.32142 | 43.34578 | -124.32141 | US | Barview | US | Barview | 0 days 02:44:40 |
3 | 4 | 367154100 | 6.871111 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2017-01-01 02:33:28 | 46.74264 | -124.93125 | 47.02928 | -124.95153 | US | Ocean Shores | US | Ocean Shores | 0 days 02:33:13 |
4 | 5 | 367446870 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2017-01-01 02:42:54 | 48.51320 | -122.60718 | 48.51318 | -122.60699 | US | Anacortes | US | Anacortes | 0 days 02:41:55 |
5 rows × 22 columns
As you can see, most of the attributes are the same as in the dynamic data; only the dynamic attributes (like timestamp, latitude, longitude, speed and heading) have disappeared.
Two attributes are new here:
MeanSOG: the mean of the value of the SOG attribute for all the points in the trip - numeric
Duration: the total duration of the tracked trip, shown as days and hh:mm:ss (for example 0 days 02:40:39) - timedelta
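To make these two attributes concrete, here is a sketch of how such per-trip values could be derived from per-message data. It is an assumption about the construction, not the exact procedure used to build the course dataset: MeanSOG is taken as the mean of SOG over the messages of a trip, and Duration as ArrTime minus DepTime, which matches the rows displayed above. The sample values are made up.

```python
import pandas as pd

# Made-up messages for two trips
messages = pd.DataFrame({
    "TripID": [1, 1, 2, 2, 2],
    "SOG": [0.0, 0.0, 0.1, 9.5, 10.0],
    "DepTime": ["2017-01-01 00:00:06"] * 2 + ["2017-01-01 00:00:03"] * 3,
    "ArrTime": ["2017-01-01 02:40:45"] * 2 + ["2017-01-01 02:40:44"] * 3,
})
for column in ("DepTime", "ArrTime"):
    messages[column] = pd.to_datetime(messages[column])

# One row per trip: mean speed, plus the (constant) departure and arrival times
trips = messages.groupby("TripID").agg(
    MeanSOG=("SOG", "mean"),
    DepTime=("DepTime", "first"),
    ArrTime=("ArrTime", "first"),
).reset_index()

# Duration of the tracked trip
trips["Duration"] = trips["ArrTime"] - trips["DepTime"]
print(trips)
```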
Comparison of dynamic and static datasets
The static dataset is built from the same data as the dynamic dataset and therefore describes the same ships and trips.
The only difference is that it contains one instance for each trip, instead of one instance for each AIS message.
We can verify that by printing the information of the dataset:
static_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1520 entries, 0 to 1519
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TripID 1520 non-null int64
1 MMSI 1520 non-null int64
2 MeanSOG 1520 non-null float64
3 VesselName 1442 non-null object
4 IMO 538 non-null object
5 CallSign 1137 non-null object
6 VesselType 1287 non-null float64
7 Length 1220 non-null float64
8 Width 911 non-null float64
9 Draft 496 non-null float64
10 Cargo 378 non-null float64
11 DepTime 1520 non-null object
12 ArrTime 1520 non-null object
13 DepLat 1520 non-null float64
14 DepLon 1520 non-null float64
15 ArrLat 1520 non-null float64
16 ArrLon 1520 non-null float64
17 DepCountry 1520 non-null object
18 DepCity 1520 non-null object
19 ArrCountry 1520 non-null object
20 ArrCity 1520 non-null object
21 Duration 1520 non-null object
dtypes: float64(10), int64(2), object(10)
memory usage: 261.4+ KB
We see that the dataset contains 1520 entries, which is the number of different trips in the dynamic dataset.
We can check this further by comparing the MMSIs of the ships tracked in both datasets. To do so, we collect the unique values of the attribute MMSI in both datasets with the function unique().
We could then print the two lists and compare them visually, but they both contain 1520 values and that would take a long time. Instead, we check the difference between them by using two loops and printing the elements that are present in one list and not in the other.
We do that twice: diff1 contains the elements present in the dynamic dataset but not in the static one, and diff2 contains the elements present in the static dataset but not in the dynamic one.
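The info() output also lists DepTime, ArrTime and Duration with the dtype object, which means that read_csv() loaded them as plain text. Before doing any arithmetic on them, they can be converted to proper time types. The following is a minimal sketch on a made-up two-row sample, not a required step of this section.

```python
import pandas as pd

# Made-up sample with time columns stored as text, as after read_csv()
trips = pd.DataFrame({
    "DepTime": ["2017-01-01 00:00:06", "2017-01-01 00:00:03"],
    "ArrTime": ["2017-01-01 02:40:45", "2017-01-01 02:40:44"],
    "Duration": ["0 days 02:40:39", "0 days 02:40:41"],
})

# Convert the text columns to datetime and timedelta types
trips["DepTime"] = pd.to_datetime(trips["DepTime"])
trips["ArrTime"] = pd.to_datetime(trips["ArrTime"])
trips["Duration"] = pd.to_timedelta(trips["Duration"])

print(trips.dtypes)

# Durations can now be used in computations, for example in minutes
duration_minutes = trips["Duration"].dt.total_seconds() / 60
print(duration_minutes)
```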
# Get the unique MMSI values in both datasets
dynamic_mmsi = dynamic_data['MMSI'].unique()
static_mmsi = static_data['MMSI'].unique()

# Print the two lists
print('dynamic: ' + str(dynamic_mmsi))
print('static: ' + str(static_mmsi))

# Check for elements in dynamic_mmsi that are not in static_mmsi
diff1 = []
for element in dynamic_mmsi:
    if element not in static_mmsi:
        diff1.append(element)
print('diff1: ' + str(diff1))

# Check for elements in static_mmsi that are not in dynamic_mmsi
diff2 = []
for element in static_mmsi:
    if element not in dynamic_mmsi:
        diff2.append(element)
print('diff2: ' + str(diff2))
dynamic: [367114690 367479990 368319000 ... 367417230 316021258 316021259]
static: [367114690 367479990 368319000 ... 367417230 316021258 316021259]
diff1: []
diff2: []
As we can see, the two lists of differences are empty, meaning that the dynamic and static datasets contain exactly the same MMSIs.
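The two loops make the logic explicit, but the same comparison can also be written in one line per direction with NumPy's setdiff1d(), which returns the sorted unique values of its first argument that are absent from the second. The sketch below uses short made-up arrays in place of the real MMSI lists.

```python
import numpy as np

# Made-up MMSI arrays standing in for the two lists of the text
dynamic_mmsi = np.array([367114690, 367479990, 368319000, 367114690])
static_mmsi = np.array([367479990, 367114690, 368319000])

# Values present in the first array but not in the second
diff1 = np.setdiff1d(dynamic_mmsi, static_mmsi)
diff2 = np.setdiff1d(static_mmsi, dynamic_mmsi)

print('diff1: ' + str(diff1))
print('diff2: ' + str(diff2))
```

Both results are empty arrays here, whatever the order or repetition of the values in the inputs.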
To easily understand the difference between the datasets, let’s have a look at their lengths with the function len().
len(dynamic_data)
100000
len(static_data)
1520
Finally, we can confirm that each row in the static dataset holds the information of one trip in the dynamic data by printing the number of different MMSI values in the dynamic dataset:
len(dynamic_data['MMSI'].unique())
1520
The number of unique MMSIs in the dynamic dataset is the same as the number of instances in the static dataset, which suggests that each ship appears in exactly one trip in this extract.
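Comparing two counts shows that the totals agree, but it does not show directly how trips are distributed over ships. A more direct check, sketched here on a made-up sample and assuming the TripID column described earlier, is to count the number of distinct trips for each MMSI.

```python
import pandas as pd

# Made-up messages: three ships, one trip each
messages = pd.DataFrame({
    "MMSI": [367114690, 367114690, 367479990, 368319000, 368319000],
    "TripID": [1, 1, 2, 3, 3],
})

# Number of distinct trips recorded for each ship
trips_per_ship = messages.groupby("MMSI")["TripID"].nunique()
print(trips_per_ship)

# True if every ship has exactly one trip
print((trips_per_ship == 1).all())

# Same count of ships and trips, as in the comparison above
print(messages["MMSI"].nunique() == messages["TripID"].nunique())
```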
More information and quiz
For more information about AIS data:
from IPython.display import IFrame
IFrame("https://h5p.org/h5p/embed/741872", "694", "600")