Modifications to raw AIS data from the U.S. Marine Cadastre
Description of the modifications made to the raw data
The original AIS data were downloaded through the open access website of the U.S. Marine Cadastre (https://marinecadastre.gov/ais/).
The raw dataset looks like this:
The raw data contain only individual AIS messages: nothing indicates which messages belong to the same trip, so we cannot yet work with separate trips.
The first step is to create the attribute TripID. For that, we group the AIS messages by MMSI, so that all messages in a group belong to the same ship. As one ship might travel through the area on several days, messages recorded on different days are assigned to different trips.
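The grouping rule described above (one trip per ship and per calendar day) can also be sketched without an explicit loop. This is a minimal illustration on made-up rows, not the code used to build the dataset; the toy values are assumptions, and the IDs it produces are numbered in order of first appearance, which may differ from the numbering of the loop-based code further down.

```python
import pandas as pd

# Toy AIS messages (made-up values); column names follow the raw dataset.
messages = pd.DataFrame({
    'MMSI': [111, 111, 111, 222, 222],
    'BaseDateTime': pd.to_datetime([
        '2017-01-01 08:00', '2017-01-01 09:00', '2017-01-02 07:30',
        '2017-01-01 10:00', '2017-01-01 11:00',
    ]),
})

# Calendar day of each message (timestamp truncated to midnight).
day = messages['BaseDateTime'].dt.normalize()

# One trip per (ship, day): number the groups in order of first appearance.
messages['TripID'] = messages.groupby(['MMSI', day], sort=False).ngroup() + 1
```

Here ship 111 gets two trips because its messages fall on two different days, while ship 222 gets a single trip.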
Once we have the attribute TripID, we sort the messages of each trip by timestamp and collect the departure and arrival information (time, latitude and longitude) from the first and last messages. This information is saved in the attributes DepTime, ArrTime, DepLat, DepLon, ArrLat, ArrLon.
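As an illustration of this step, the first and last message of each trip can be picked with a sort followed by a group-wise aggregation. The rows below are made up, and the approach is a sketch of an alternative to the loop used later in this notebook, not a description of it.

```python
import pandas as pd

# Toy messages (made-up values), deliberately not in chronological order.
messages = pd.DataFrame({
    'TripID': [1, 1, 1, 2, 2],
    'BaseDateTime': pd.to_datetime([
        '2017-01-01 09:00', '2017-01-01 08:00', '2017-01-01 10:00',
        '2017-01-02 07:00', '2017-01-02 07:45',
    ]),
    'LAT': [30.1, 30.0, 30.2, 41.0, 41.1],
    'LON': [-81.1, -81.0, -81.2, -70.0, -70.1],
})

# Sort by time so that 'first' is the departure and 'last' is the arrival.
ordered = messages.sort_values('BaseDateTime')
trips = ordered.groupby('TripID').agg(
    DepTime=('BaseDateTime', 'first'),
    ArrTime=('BaseDateTime', 'last'),
    DepLat=('LAT', 'first'),
    DepLon=('LON', 'first'),
    ArrLat=('LAT', 'last'),
    ArrLon=('LON', 'last'),
)

# Copy the per-trip values back onto every message of the trip.
messages = messages.join(trips, on='TripID')
```

Note that the 'first' and 'last' aggregations skip missing values, so rows with a missing LAT or LON would need to be handled beforehand.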
Finally, we use the latitude and longitude information to retrieve the country and city of departure and arrival. This creates the attributes DepCountry, DepCity, ArrCountry, ArrCity.
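Since many trips may start or end at the same coordinates, one possible refinement is to look up each distinct position only once and in a single batched call. The helper below is hypothetical (it is not part of the original code) and takes the lookup function as a parameter; it assumes that function accepts a list of (lat, lon) tuples and returns one record per tuple in the same order, which is how `rg.search` from reverse_geocoder is understood to behave. A stand-in function is used here so that the sketch runs without the library.

```python
def geocode_unique(points, search):
    """Look up each distinct (lat, lon) pair once and map the answers back.

    `search` is assumed to take a list of (lat, lon) tuples and to return
    one record per tuple, in the same order.
    """
    unique_points = list(dict.fromkeys(points))  # de-duplicate, keep order
    results = search(unique_points)              # one batched lookup
    lookup = dict(zip(unique_points, results))
    return [lookup[point] for point in points]


def fake_search(coordinates):
    # Stand-in for a real reverse geocoder, for illustration only.
    return [{'name': 'city%d' % i, 'cc': 'US'}
            for i, _ in enumerate(coordinates)]


departures = [(30.0, -81.0), (41.0, -70.0), (30.0, -81.0)]
records = geocode_unique(departures, fake_search)
```

With the real library, `fake_search` would be replaced by `rg.search`, and the city and country code would be read from the 'name' and 'cc' keys of each record.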
With all these modifications, the dynamic dataset is created and now looks like this:
From this dynamic dataset, we create the static dataset: we retrieve the static information of each trip and create one row per trip in the static dataset. We simply reuse the information for the attributes TripID, MMSI, VesselName, IMO, CallSign, VesselType, Length, Width, Draft, Cargo, DepTime, ArrTime, DepLat, DepLon, ArrLat, ArrLon, DepCountry, DepCity, ArrCountry, ArrCity.
The attributes MeanSOG and Duration are calculated afterwards. MeanSOG is the mean of the SOG attribute over all the AIS messages of the trip. Duration is the difference between ArrTime and DepTime.
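The two derived attributes can be illustrated on a small made-up table. The sketch below builds one row per trip with a group-wise aggregation, which is an alternative to the row-by-row loop used in the code further down; only a few of the static attributes are included to keep it short.

```python
import pandas as pd

# Toy dynamic dataset (made-up values): two trips, two messages each.
dynamic = pd.DataFrame({
    'TripID': [1, 1, 2, 2],
    'VesselName': ['A', 'A', 'B', 'B'],
    'SOG': [10.0, 12.0, 0.0, 4.0],
    'DepTime': pd.to_datetime(['2017-01-01 08:00'] * 2 + ['2017-01-02 07:00'] * 2),
    'ArrTime': pd.to_datetime(['2017-01-01 10:30'] * 2 + ['2017-01-02 07:45'] * 2),
})

# One row per trip: static values come from the first message,
# MeanSOG is the mean speed over ground across the whole trip.
static = dynamic.groupby('TripID').agg(
    VesselName=('VesselName', 'first'),
    MeanSOG=('SOG', 'mean'),
    DepTime=('DepTime', 'first'),
    ArrTime=('ArrTime', 'first'),
).reset_index()

# Duration is a Timedelta: arrival time minus departure time.
static['Duration'] = static['ArrTime'] - static['DepTime']
```

Because DepTime and ArrTime are datetime columns, the subtraction yields a Timedelta per trip (here 2 h 30 min and 45 min).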
The static dataset looks like this:
Code for creation of dynamic dataset from raw data
# Add the filename to the variable file_in
file_in = ''
# The path of the new file
file_out = ''
import pandas as pd
import reverse_geocoder as rg
def reverseGeocode(lat, lon):
    '''
    This function returns the reverse-geocoding record for the given latitude and
    longitude: the city is under the key 'name' and the country code under 'cc'.
    '''
    coordinates = (lat, lon)
    result = rg.search(coordinates)
    # result[0] is an OrderedDict containing 'lat', 'lon', 'name', 'admin1', 'admin2', 'cc'
    return result[0]
# Load raw data
data = pd.read_csv(file_in, nrows = 100000)
# Transform timestamp into usable datetime type
data['BaseDateTime'] = pd.to_datetime(data['BaseDateTime'])
# Create a list of different MMSI
MMSI_list = data['MMSI'].unique()
# Initiate TripID attribute to 0 for all rows
data['TripID'] = 0
'''
The following loop iterates over all the rows of the dataset to add the TripID information:
A new TripID is created for every different MMSI, and within one MMSI, a trip is split up if the ship
travels on two different days.
'''
tripid = 0
for MMSI in MMSI_list:
    date = pd.to_datetime('01.01.2000', format = '%d.%m.%Y') # fake date to have the first row enter the if
    # iterate over the messages of one MMSI in chronological order,
    # so that each calendar day produces exactly one trip
    for index, row in data.loc[data['MMSI'] == MMSI].sort_values('BaseDateTime').iterrows():
        if row['BaseDateTime'].date() != date.date(): # different day: different trip
            date = row['BaseDateTime'] # keep the date to compare for later
            tripid = tripid + 1
        data.loc[index, 'TripID'] = tripid # add the TripID number to the row
TripID_list = data['TripID'].unique()
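# Initiate the following attributes with empty values of the right type
# (timestamps for the times, floats for the coordinates)
data['DepTime'] = pd.NaT
data['ArrTime'] = pd.NaT
data['DepLat'] = 0.0
data['DepLon'] = 0.0
data['ArrLat'] = 0.0
data['ArrLon'] = 0.0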
'''
The following loop iterates over each trip (with different value of TripID) to add the departure and
arrival information. Sorting with sort_values() gives easy access to the first and last timestamps
of the trip.
'''
for TripID in TripID_list: # iterate over each trip
    this_trip = data.loc[data['TripID'] == TripID].sort_values('BaseDateTime')
    departure_time = this_trip.iloc[0]['BaseDateTime']
    departure_index = this_trip.index[0]
    arrival_time = this_trip.iloc[-1]['BaseDateTime']
    arrival_index = this_trip.index[-1]
    data.loc[data['TripID'] == TripID, 'DepTime'] = departure_time
    data.loc[data['TripID'] == TripID, 'ArrTime'] = arrival_time
    data.loc[data['TripID'] == TripID, 'DepLat'] = data.loc[departure_index, 'LAT']
    data.loc[data['TripID'] == TripID, 'DepLon'] = data.loc[departure_index, 'LON']
    data.loc[data['TripID'] == TripID, 'ArrLat'] = data.loc[arrival_index, 'LAT']
    data.loc[data['TripID'] == TripID, 'ArrLon'] = data.loc[arrival_index, 'LON']
# Initiate the following attributes
data['DepCountry'] = '?'
data['DepCity'] = '?'
data['ArrCountry'] = '?'
data['ArrCity'] = '?'
'''
The following loop iterates over each trip (with different value of TripID), gets the departure and
arrival latitude and longitude values and looks up the corresponding city and country. This information
is added in the attributes 'DepCountry', 'DepCity', 'ArrCountry', 'ArrCity'.
'''
for TripID in TripID_list: # iterate over each trip
    this_trip = data.loc[data['TripID'] == TripID] # initiate the values of the variables for this trip
    # the departure and arrival attributes are identical on every row of a trip,
    # so reading them from the first row is enough
    dep_lat = this_trip.iloc[0]['DepLat']
    dep_lon = this_trip.iloc[0]['DepLon']
    departure = reverseGeocode(dep_lat, dep_lon)
    data.loc[data['TripID'] == TripID, 'DepCity'] = departure['name']
    data.loc[data['TripID'] == TripID, 'DepCountry'] = departure['cc']
    arr_lat = this_trip.iloc[0]['ArrLat']
    arr_lon = this_trip.iloc[0]['ArrLon']
    arrival = reverseGeocode(arr_lat, arr_lon)
    data.loc[data['TripID'] == TripID, 'ArrCity'] = arrival['name']
    data.loc[data['TripID'] == TripID, 'ArrCountry'] = arrival['cc']
# Save new dataset
data.to_csv(file_out)
Code for creation of static dataset from dynamic dataset
file_in = '' # the input file is the dynamic dataset
file_out = ''
import pandas as pd
columns = ['TripID', 'MMSI', 'MeanSOG', 'VesselName', 'IMO', 'CallSign', 'VesselType',
           'Length', 'Width', 'Draft', 'Cargo', 'DepTime', 'ArrTime', 'DepLat', 'DepLon',
           'ArrLat', 'ArrLon', 'DepCountry', 'DepCity', 'ArrCountry', 'ArrCity', 'Duration']
# Create new DataFrame with the wanted columns
static_data = pd.DataFrame(columns = columns)
# Remove MeanSOG and Duration from columns because we have to create these attributes in the following loop
columns.remove('MeanSOG')
columns.remove('Duration')
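# Load the dynamic dataset (index_col = 0 reads back the index column written by to_csv)
data = pd.read_csv(file_in, index_col = 0)
# Change DepTime and ArrTime type to be able to calculate the Duration later
data['DepTime'] = pd.to_datetime(data['DepTime'])
data['ArrTime'] = pd.to_datetime(data['ArrTime'])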
i = 0
for tripid in data['TripID'].unique(): # iterate over trips and create one row for each trip
    first_row = data.loc[data['TripID'] == tripid].iloc[0]
    for attribute in columns:
        # Fill the new dataset with the value of the attribute for the first row
        # (the static attributes don't change for one trip)
        static_data.loc[i, attribute] = first_row[attribute]
    # For MeanSOG: take the mean of all the rows of the same trip
    df_tripid = data.loc[data['TripID'] == tripid]
    static_data.loc[i, 'MeanSOG'] = df_tripid['SOG'].mean()
    i = i + 1
static_data['Duration'] = static_data['ArrTime'] - static_data['DepTime']
# Save new dataset
static_data.to_csv(file_out)