I was assigned this project in a university machine learning course during my Erasmus exchange at Åbo Akademi University in Finland.

Ton201/ML_Miniproject1

Miniproject 1

Introduction

In this project I was asked to create a model that predicts the fuel consumption of a cruise ship based on data from several sensors. The data were provided as CSV files of raw sensor outputs. The units of fuel consumption are left open.

Data preprocessing

Each CSV file contains two columns: time and value. Time is stored in .NET DateTime.Ticks format and needs to be converted to a human-readable format in order to properly aggregate the data from all sensors.
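A minimal sketch of the time conversion, using only the Python standard library. .NET ticks count 100-nanosecond intervals since 0001-01-01 00:00:00; the function name `ticks_to_datetime` is hypothetical and not taken from the project code.

```python
from datetime import datetime, timedelta

# .NET ticks count 100-nanosecond intervals since 0001-01-01 00:00:00.
TICKS_PER_MICROSECOND = 10
DOTNET_EPOCH = datetime(1, 1, 1)


def ticks_to_datetime(ticks: int) -> datetime:
    """Convert a .NET DateTime.Ticks value to a naive Python datetime."""
    # Python datetimes resolve to microseconds, so the last tick digit is dropped.
    return DOTNET_EPOCH + timedelta(microseconds=ticks // TICKS_PER_MICROSECOND)
```

For example, `ticks_to_datetime(621355968000000000)` corresponds to 1970-01-01 00:00:00, the Unix epoch.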

Defining the target variable

To determine the optimal time window for predicting the fuel consumption, I aggregated the records by day and counted them. That gave me 54 days with records from all sensors. The minimum number of records per day for a single sensor was 5894. I chose a prediction window of 1 hour. I will consider the fuel density and fuel volumetric flow to be constant.
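The per-day record count could be sketched as follows. This assumes the timestamps have already been converted to `datetime` objects and that each sensor's records are held in a plain list; the function names and the dictionary layout are illustrative, not the project's actual code.

```python
from collections import Counter


def records_per_day(timestamps):
    """Count how many records fall on each calendar day."""
    return Counter(ts.date() for ts in timestamps)


def days_with_all_sensors(sensor_timestamps):
    """Return {day: smallest per-sensor record count} for days covered by every sensor.

    sensor_timestamps maps a (hypothetical) sensor name to a list of datetimes.
    """
    per_sensor = {name: records_per_day(ts) for name, ts in sensor_timestamps.items()}
    common_days = set.intersection(*(set(counts) for counts in per_sensor.values()))
    return {
        day: min(counts[day] for counts in per_sensor.values())
        for day in sorted(common_days)
    }
```

The number of keys in the result is the number of usable days, and the smallest value is the worst-case record count per day for a single sensor.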

Feature extraction

Most of the features are represented as mean values over the one-hour time window. For GPS data a mean would not make much sense, so the GPS data are represented by their minimum and maximum values over the interval. The GPS values also need to be converted to numeric values. Signs are added according to the following table.
Direction Sign
North +
South -
East +
West -
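The sign table above can be applied with a small lookup. The raw GPS format is not described here, so this sketch assumes each reading is available as a non-negative magnitude plus a direction such as "N" or "West"; `signed_coordinate` is a hypothetical helper.

```python
# Sign convention from the table: North/East positive, South/West negative.
SIGN_BY_DIRECTION = {"N": 1.0, "S": -1.0, "E": 1.0, "W": -1.0}


def signed_coordinate(magnitude: float, direction: str) -> float:
    """Attach the hemisphere sign to a coordinate magnitude.

    Assumes `direction` starts with N, S, E or W (case-insensitive).
    """
    key = direction.strip().upper()[0]
    return SIGN_BY_DIRECTION[key] * abs(magnitude)
```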

Joining the fuel consumption table with the sensor data gives a final dataset of 682 data points.
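A rough sketch of the hourly aggregation and the join, under the assumption that each sensor is a list of `(datetime, value)` pairs and that only hours present in every table are kept (an inner join). All names here are illustrative.

```python
from collections import defaultdict
from statistics import fmean


def hourly_buckets(records):
    """Group (datetime, value) pairs by the hour they fall in."""
    buckets = defaultdict(list)
    for ts, value in records:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return buckets


def hourly_mean(records):
    """Mean value per hour, used for most sensors."""
    return {hour: fmean(vals) for hour, vals in hourly_buckets(records).items()}


def hourly_min_max(records):
    """(min, max) per hour, used for GPS coordinates."""
    return {hour: (min(vals), max(vals)) for hour, vals in hourly_buckets(records).items()}


def join_on_hour(target, features):
    """Inner join: keep hours present in the target and in every feature table."""
    hours = set(target)
    for table in features.values():
        hours &= set(table)
    return [
        {"hour": h, "target": target[h], **{name: t[h] for name, t in features.items()}}
        for h in sorted(hours)
    ]
```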

Preparing data for training models

I decided not to shuffle the data before splitting. Keeping the subsets chronological keeps the validation and testing metrics more representative of use on new data. The training subset consists of 477 data points. The validation and testing subsets consist of 102 and 103 data points respectively.
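A chronological split can be sketched as plain slicing. The fractions below are an assumption: 70 % / 15 % / 15 % happens to reproduce the 477 / 102 / 103 sizes for 682 rows, but the exact fractions used in the project are not stated.

```python
def chronological_split(rows, train_frac=0.70, val_frac=0.15):
    """Split time-ordered rows into train/validation/test without shuffling.

    The fractions are assumed values; the test set takes whatever remains.
    """
    n = len(rows)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]
    return train, val, test
```

Because the rows stay in time order, every validation point is later than every training point, and every test point is later than both.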

Standardization

Since the first model to train is ridge regression, I use feature standardization. I experimented with both "MinMax" normalisation and standardisation, each benefiting a different model.
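The two scaling options can be sketched for a single feature column as below. One assumption made here is that the scaling statistics are computed on the training subset only and then reused for validation and test data; the helper names are hypothetical.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_column):
    """Return a function that maps x to (x - mean) / std, using training statistics."""
    mean = fmean(train_column)
    std = pstdev(train_column) or 1.0  # guard against constant columns
    return lambda x: (x - mean) / std


def fit_min_max(train_column):
    """Return a function that maps the training range onto [0, 1]."""
    lo, hi = min(train_column), max(train_column)
    span = (hi - lo) or 1.0  # guard against constant columns
    return lambda x: (x - lo) / span
```

Values outside the training range map outside [0, 1] under min-max scaling, which is expected when later data drift away from the training period.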

Model selection and training

I decided to use RMSE as the evaluation metric. The R2 score was also used because it is easier to interpret.
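For reference, both metrics can be written in a few lines of plain Python. This is a sketch of the standard definitions, not the project's evaluation code.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error: the typical error size, in the target's units."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 minus residual variance over total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

An R2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the true values.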

Ridge regression

Ridge regression was chosen as the baseline model, since a linear model is the simplest one. Ridge regression additionally applies a regularization parameter. The best validation RMSE and R2 obtained were 0.3257 and 0.7459 respectively. This result was obtained with a regularisation parameter alpha of 15.3878.
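To illustrate what alpha does, here is a deliberately simplified sketch: ridge regression for a single centred feature without an intercept, where the closed-form solution is w = sum(x*y) / (sum(x*x) + alpha). The real model uses many features and was presumably fitted with a library; this toy version only shows how a larger alpha shrinks the coefficient.

```python
def ridge_coefficient_1d(x, y, alpha):
    """Closed-form ridge solution for one centred feature and no intercept.

    Minimises sum((y_i - w * x_i) ** 2) + alpha * w ** 2.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)
```

With alpha set to 0 this reduces to ordinary least squares, and the coefficient moves towards zero as alpha grows.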

SVM

SVM was chosen as the second model for its ability to explore non-linear relations. The best validation RMSE obtained was 0.1261 and the validation R2 was 0.9617. This result was obtained with a regularisation parameter C of 9.7667.

This model was the better of the two, so it was evaluated on the test dataset. The obtained RMSE was 0.1613 and R2 was 0.9080.

Conclusion

The best model was the SVM regression model. An R2 score of 0.9080 is quite good, although it can be expected that the model performance will drop with newer data. Possible options to further improve model performance are dimensionality reduction and adding the distance between the starting and ending GPS coordinates in each time window. The dimensionality reduction could be done using PCA, as the features were not filtered and some of them are certainly correlated (e.g. trackDegreeMagnetic and trackDegreeTrue).
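The proposed distance feature could be computed with the haversine formula, sketched below. It assumes signed decimal-degree coordinates and a spherical Earth with a mean radius of 6371 km; the function name is hypothetical, and this feature was not part of the trained models.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in signed decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

Calling it with the first and last GPS fix of a one-hour window would give the straight-line distance covered in that window, which differs from the sailed distance whenever the ship changes course.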
