Modifications to raw AIS data from the U.S. Marine Cadastre

Description of the modifications made to the raw data

The original AIS data were downloaded through the open access website of the U.S. Marine Cadastre (https://marinecadastre.gov/ais/).

The raw dataset looks like this:

text

The raw data contains only the individual AIS messages: it lacks the information needed to work with separate trips.

The first step is to create the attribute TripID. We group the AIS messages by MMSI, so that all messages in a group belong to the same ship. As one ship might travel in the area on several days, we start a new trip whenever a message is recorded on a different calendar day than the previous one.
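The same grouping rule can be sketched without an explicit loop. This is only an illustration under stated assumptions: the function name add_trip_id and the helper column name Day are hypothetical, and the IDs are numbered in sorted (MMSI, day) order, which may differ from the numbering produced by the loop further down (the IDs are labels only).

```python
import pandas as pd

def add_trip_id(df):
    """Return a copy of df with a TripID column: one ID per (MMSI, calendar day)."""
    out = df.copy()
    out['BaseDateTime'] = pd.to_datetime(out['BaseDateTime'])
    # Truncate each timestamp to midnight so that messages of the same day compare equal
    day = out['BaseDateTime'].dt.normalize().rename('Day')
    # ngroup() numbers the (MMSI, Day) groups from 0 in sorted key order; shift to start at 1
    out['TripID'] = out.groupby(['MMSI', day]).ngroup() + 1
    return out

# Small made-up sample: ship 111 sails on two days, ship 222 on one
messages = pd.DataFrame({
    'MMSI': [222, 111, 111, 111],
    'BaseDateTime': ['2017-01-01T09:00:00', '2017-01-01T10:00:00',
                     '2017-01-02T01:00:00', '2017-01-01T12:00:00'],
})
print(add_trip_id(messages)[['MMSI', 'BaseDateTime', 'TripID']])
```

Because the grouping works on the calendar day itself, the result does not depend on the order of the rows in the input file.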

Once we have the attribute TripID, we can sort the values of each trip according to their timestamp, and collect the departure and arrival information (time, latitude and longitude). This information is saved in the attributes DepTime, ArrTime, DepLat, DepLon, ArrLat, ArrLon.
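As a rough sketch of this step, the departure and arrival values can be taken as the first and last message of each trip after sorting by timestamp. The function name add_departure_arrival is hypothetical, and the column names LAT, LON and BaseDateTime are assumed to match the raw file. Note that the 'first' and 'last' aggregations skip missing values, so the result can differ from a plain first or last row if a coordinate is empty.

```python
import pandas as pd

def add_departure_arrival(df):
    """Return a copy of df with DepTime, ArrTime, DepLat, DepLon, ArrLat, ArrLon per trip."""
    # Sort chronologically inside each trip; groupby keeps this row order within groups
    out = df.sort_values(['TripID', 'BaseDateTime']).copy()
    mapping = {
        'DepTime': ('BaseDateTime', 'first'), 'ArrTime': ('BaseDateTime', 'last'),
        'DepLat': ('LAT', 'first'), 'DepLon': ('LON', 'first'),
        'ArrLat': ('LAT', 'last'), 'ArrLon': ('LON', 'last'),
    }
    for new_col, (source_col, how) in mapping.items():
        # transform() broadcasts the per-trip value back onto every row of the trip
        out[new_col] = out.groupby('TripID')[source_col].transform(how)
    # Restore the original row order
    return out.sort_index()

# Small made-up sample: trip 1 has two messages stored out of order
sample = pd.DataFrame({
    'TripID': [1, 1, 2],
    'BaseDateTime': pd.to_datetime(['2017-01-01 12:00', '2017-01-01 10:00',
                                    '2017-01-01 09:00']),
    'LAT': [30.0, 29.0, 40.0],
    'LON': [-80.0, -81.0, -70.0],
})
print(add_departure_arrival(sample))
```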

Finally, we use the latitude and longitude information to retrieve the country and city of departure and arrival. This creates the attributes DepCountry, DepCity, ArrCountry, ArrCity.
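Since many trips start or end at the same position, one possible refinement is to look up each distinct coordinate pair only once and in a single batch. The sketch below is an assumption, not the method used in the script: the function name geocode_trips and the layout of the trips dictionary are hypothetical. The search argument is any callable that takes a list of (lat, lon) tuples and returns one dictionary with 'name' and 'cc' keys per tuple, in the same order; rg.search from reverse_geocoder is expected to fit this description.

```python
def geocode_trips(trips, search):
    """
    trips:  dict mapping TripID -> ((dep_lat, dep_lon), (arr_lat, arr_lon))
    search: callable taking a list of (lat, lon) tuples and returning one dict
            per tuple with at least the keys 'name' (city) and 'cc' (country code)
    """
    # Collect every distinct departure or arrival point
    points = sorted({point for endpoints in trips.values() for point in endpoints})
    if not points:
        return {}
    # One batch call instead of two calls per trip
    lookup = dict(zip(points, search(points)))
    return {
        trip_id: {
            'DepCity': lookup[dep]['name'], 'DepCountry': lookup[dep]['cc'],
            'ArrCity': lookup[arr]['name'], 'ArrCountry': lookup[arr]['cc'],
        }
        for trip_id, (dep, arr) in trips.items()
    }

# Possible usage with the library used in the script (requires reverse_geocoder):
# import reverse_geocoder as rg
# places = geocode_trips({1: ((25.77, -80.19), (32.08, -81.09))}, rg.search)
```

Passing the lookup function as an argument keeps the sketch independent of the geocoding library, so it can be tried with a stand-in function first.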

With all these modifications, the dynamic dataset is created and now looks like this:

text

From this dynamic dataset, we create the static dataset: we retrieve the static information of each trip and create a new row for each trip in the static dataset. We simply reuse the information for the attributes TripID, MMSI, VesselName, IMO, CallSign, VesselType, Length, Width, Draft, Cargo, DepTime, ArrTime, DepLat, DepLon, ArrLat, ArrLon, DepCountry, DepCity, ArrCountry, ArrCity.

The attributes MeanSOG and Duration are calculated afterwards. MeanSOG is the mean of the SOG attribute over all the AIS messages of the trip. Duration is simply the difference between ArrTime and DepTime.
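A compact way to express the construction of the static dataset is sketched below. It assumes that DepTime and ArrTime are already datetime columns; the function name build_static and its static_cols parameter are hypothetical and only serve the illustration.

```python
import pandas as pd

def build_static(dynamic, static_cols):
    """Return one row per TripID with the static columns, MeanSOG and Duration.

    static_cols lists the columns to copy from the first row of each trip;
    it must contain 'DepTime' and 'ArrTime' and must not contain 'TripID'.
    """
    # Keep the first row of every trip (static attributes do not change within a trip)
    first_rows = dynamic.drop_duplicates('TripID').set_index('TripID')
    static = first_rows[static_cols].copy()
    # Mean speed over ground across all messages of the trip, aligned on TripID
    static['MeanSOG'] = dynamic.groupby('TripID')['SOG'].mean()
    # Trip duration as a timedelta
    static['Duration'] = static['ArrTime'] - static['DepTime']
    return static.reset_index()

# Small made-up dynamic dataset with two trips
dynamic = pd.DataFrame({
    'TripID': [1, 1, 2],
    'MMSI': [111, 111, 222],
    'SOG': [10.0, 14.0, 5.0],
    'DepTime': pd.to_datetime(['2017-01-01 10:00'] * 2 + ['2017-01-02 08:00']),
    'ArrTime': pd.to_datetime(['2017-01-01 12:00'] * 2 + ['2017-01-02 08:30']),
})
print(build_static(dynamic, ['MMSI', 'DepTime', 'ArrTime']))
```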

The static dataset looks like this:

text

Code for creation of dynamic dataset from raw data

# Add the filename to the variable file_in
file_in = ''
# The path of the new file
file_out = ''


import pandas as pd
import reverse_geocoder as rg

def reverseGeocode(lat, lon):
    '''
    This function returns the city and country information of the place closest to
    the given latitude and longitude coordinates.
    '''
    coordinates = (lat, lon)
    result = rg.search(coordinates)
    
    # result[0] is an OrderedDict containing 'lat', 'lon', 'name', 'admin1', 'admin2', 'cc'
    return result[0]


# Load raw data
data = pd.read_csv(file_in, nrows = 100000)

# Transform timestamp into usable datetime type
data['BaseDateTime'] = pd.to_datetime(data['BaseDateTime'])

# Create a list of different MMSI
MMSI_list = data['MMSI'].unique()

# Initiate TripID attribute to 0 for all rows
data['TripID'] = 0

'''
The following loop iterates over all the rows of the dataset to add the TripID information:
A new TripID is created for every different MMSI, and within one MMSI, a trip is split up if the ship
travels on two different days.
'''
tripid = 0
for MMSI in MMSI_list:
    date = None # no previous date yet, so the first row always starts a new trip
    # iterate over the messages of one MMSI in chronological order, so that a day is never split
    for index, row in data.loc[data['MMSI'] == MMSI].sort_values('BaseDateTime').iterrows():
        if row['BaseDateTime'].date() != date: # different day: different trip
            date = row['BaseDateTime'].date() # keep the date to compare for later
            tripid = tripid + 1
        data.loc[index, 'TripID'] = tripid # add the TripID number to the row
        

TripID_list = data['TripID'].unique()

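# Initiate the time attributes to NaT and the coordinate attributes to 0.0,
# so that each column already has the type of the values assigned below
data['DepTime'] = pd.NaT
data['ArrTime'] = pd.NaT
data['DepLat'] = 0.0
data['DepLon'] = 0.0
data['ArrLat'] = 0.0
data['ArrLon'] = 0.0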

'''
The following loop iterates over each trip (with different value of TripID) to add the departure and
arrival information. Sorting with sort_values() gives easy access to the first and last timestamps
of the trip.
'''
for TripID in TripID_list: # iterate over each trip
    this_trip = data.loc[data['TripID'] == TripID].sort_values('BaseDateTime')
    
    departure_time = this_trip.iloc[0]['BaseDateTime']
    departure_index = this_trip.index[0]
    arrival_time = this_trip.iloc[-1]['BaseDateTime']
    arrival_index = this_trip.index[-1]
    
    data.loc[data['TripID'] == TripID, 'DepTime'] = departure_time
    data.loc[data['TripID'] == TripID, 'ArrTime'] = arrival_time
    data.loc[data['TripID'] == TripID, 'DepLat'] = data.loc[departure_index, 'LAT']
    data.loc[data['TripID'] == TripID, 'DepLon'] = data.loc[departure_index, 'LON']
    data.loc[data['TripID'] == TripID, 'ArrLat'] = data.loc[arrival_index, 'LAT']
    data.loc[data['TripID'] == TripID, 'ArrLon'] = data.loc[arrival_index, 'LON']

# Initiate the following attributes
data['DepCountry'] = '?'
data['DepCity'] = '?'
data['ArrCountry'] = '?'
data['ArrCity'] = '?'

'''
The following loop iterates over each trip (with different value of TripID), gets the departure and
arrival latitude and longitudes values and gets the corresponding city and country. This information
is added in the attributes 'DepCountry', 'DepCity', 'ArrCountry', 'ArrCity'.
'''
for TripID in TripID_list: # iterate over each trip
    this_trip = data.loc[data['TripID'] == TripID] # initiate the values of the variables for this trip
    
    dep_lat = this_trip.iloc[0]['DepLat']
    dep_lon = this_trip.iloc[0]['DepLon']
    departure = reverseGeocode(dep_lat, dep_lon)    
    data.loc[data['TripID'] == TripID, 'DepCity'] = departure['name']
    data.loc[data['TripID'] == TripID, 'DepCountry'] = departure['cc']
    
    arr_lat = this_trip.iloc[0]['ArrLat']
    arr_lon = this_trip.iloc[0]['ArrLon']
    arrival = reverseGeocode(arr_lat, arr_lon)
    data.loc[data['TripID'] == TripID, 'ArrCity'] = arrival['name']
    data.loc[data['TripID'] == TripID, 'ArrCountry'] = arrival['cc']

    
# Save new dataset
data.to_csv(file_out)

Code for creation of static dataset from dynamic dataset

file_in = '' # the input file is the dynamic dataset
file_out = ''


import pandas as pd

columns = ['TripID', 'MMSI', 'MeanSOG', 'VesselName', 'IMO', 'CallSign', 'VesselType',
           'Length', 'Width', 'Draft', 'Cargo', 'DepTime', 'ArrTime', 'DepLat', 'DepLon',
           'ArrLat', 'ArrLon', 'DepCountry', 'DepCity', 'ArrCountry', 'ArrCity', 'Duration']

# Create new DataFrame with the wanted columns
static_data = pd.DataFrame(columns = columns)

# Remove MeanSOG and Duration from columns because we have to create these attributes in the following loop
columns.remove('MeanSOG')
columns.remove('Duration')

# Load the dynamic dataset
data = pd.read_csv(file_in)

# Change DepTime and ArrTime type to be able to calculate the Duration later
data['DepTime'] = pd.to_datetime(data['DepTime'])
data['ArrTime'] = pd.to_datetime(data['ArrTime'])

i = 0
for tripid in data['TripID'].unique(): # iterate over trips and create one row for each trip
    
    first_row = data.loc[data['TripID'] == tripid].iloc[0]
    
    for attribute in columns:
        # Fill the new dataset with the value of the attribute for the first row
        # (the static attributes don't change for one trip)
        static_data.loc[i, attribute] = first_row[attribute]
        
    # For MeanSOG: take the mean of all the rows of the same trip
    df_tripid = data.loc[data['TripID'] == tripid]
    static_data.loc[i, 'MeanSOG'] = df_tripid['SOG'].mean()
    
    i = i + 1

static_data['Duration'] = static_data['ArrTime'] - static_data['DepTime']


# Save new dataset
static_data.to_csv(file_out)