Driver Telematics Analysis

Exploration and analysis of data from the Kaggle Driver Telematics Analysis competition

Shannon Rush

Competition Summary

The goal of this competition is to identify which trips in a driver's set were actually driven by that driver. Each anonymized driver is associated with a set of trips, the majority of which were driven by that driver; the remainder were not.

The Data

The data is provided as CSV files (200 for each driver). Each file contains the X and Y positions of the car for one trip, with each record representing one second of time. All trips begin at position (0, 0).

Initial Exploration

First we'll look at one route made by one driver.

route1 <- read.csv("data/original/drivers/1/1.csv")
dim(route1)

## [1] 863   2

head(route1)

##      x     y
## 1  0.0   0.0
## 2 18.6 -11.1
## 3 36.1 -21.9
## 4 53.7 -32.6
## 5 70.1 -42.8
## 6 86.5 -52.6

This route lasted 863 seconds and consists of x and y positions as expected. Now we'll roughly plot this single route:

plot(route1)

[Figure: scatter plot of route 1, y against x]

Possible Features

This route travels out from the initial point, backtracks over the exact same route, then continues to another destination. Some possible features we could engineer are:

- Frequency of stops
- Duration of stops
- Number of repeated traveling points (backtracking)
- Total time traveling
- Total time stopped
- Total meters traveled
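
As a rough sketch of how several of these could be derived: per-second speed is the distance between consecutive positions, and a stop is a run of consecutive seconds below some speed threshold. The function name `stop_features`, the toy route, and the 0.5 units-per-second threshold are assumptions made for illustration, and the coordinates are assumed to be in meters.

```r
# Hypothetical helper: stop and distance features for one route.
# `route` is a data frame with columns x and y, one row per second.
# `stop_speed` is an assumed threshold below which the car counts as stopped.
stop_features <- function(route, stop_speed = 0.5) {
  # Distance covered in each one-second step
  step <- sqrt(diff(route$x)^2 + diff(route$y)^2)
  # Run-length encode the stopped/moving indicator to find stop episodes
  runs <- rle(step < stop_speed)
  stop_lengths <- runs$lengths[runs$values]
  data.frame(
    n_stops        = length(stop_lengths),
    mean_stop_secs = if (length(stop_lengths) > 0) mean(stop_lengths) else 0,
    secs_stopped   = sum(stop_lengths),
    secs_moving    = length(step) - sum(stop_lengths),
    total_distance = sum(step)
  )
}

# Toy route (made up): move for 3 s, sit still for 2 s, move for 1 s
toy <- data.frame(x = c(0, 10, 20, 30, 30, 30, 40), y = 0)
stop_features(toy)
```

The threshold would need tuning against real routes, since position noise can make a stationary car appear to drift slightly.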

Now let's pretty up this plot...

library(ggplot2)
# geom_path connects points in row (time) order; geom_line would sort them by x
ggplot(route1, aes(x=x, y=y)) + geom_path(colour=2)

[Figure: route 1 drawn as a path with ggplot2]

Next we add the rest of driver 1's routes to the same plot, one path per file:

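route_dir <- "data/original/drivers/1"
files <- list.files(path=route_dir, pattern="[.]csv$")
p <- ggplot()
for (file in files) {
    # Use the numeric file name (e.g. "12.csv" -> 12) as a palette index for the path color
    route_id <- as.integer(sub("[.]csv$", "", file))
    p <- p + geom_path(data=read.csv(file.path(route_dir, file)), aes(x, y), colour=route_id)
}
p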

[Figure: all routes for driver 1, one colored path per route]

The goal is to label each route with a probability between 0 (a planted false trip) and 1 (driven by the associated driver). Two ways to approach this problem are supervised and unsupervised learning.

With supervised learning, each driver's associated routes would be labeled 1 and a random sample of routes from other drivers would be labeled 0. A model would be trained iteratively over all the drivers, and the whole dataset would then act as the unlabeled data to be scored for the submission.
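
A minimal sketch of this labeling scheme for a single driver, assuming each route has already been reduced to a row of numeric features. The data frames, the feature names `mean_speed` and `total_distance`, and the choice of logistic regression via `glm` are all assumptions for illustration; the feature values are randomly generated placeholders, not competition data.

```r
set.seed(1)

# Placeholder features: 200 routes from the driver's folder and
# 200 routes sampled from other drivers (both randomly generated here).
driver_routes <- data.frame(mean_speed     = rnorm(200, 12, 3),
                            total_distance = rnorm(200, 8000, 2000))
other_routes  <- data.frame(mean_speed     = rnorm(200, 15, 3),
                            total_distance = rnorm(200, 11000, 2000))

# Label the driver's own folder 1 and the sampled routes 0
train <- rbind(cbind(driver_routes, label = 1),
               cbind(other_routes,  label = 0))

# Logistic regression as a stand-in classifier
fit <- glm(label ~ mean_speed + total_distance, data = train, family = binomial)

# Score the driver's own folder: one probability per route
prob <- predict(fit, newdata = driver_routes, type = "response")
head(round(prob, 3))
```

Routes in the driver's folder that receive a low probability would be the candidates for planted trips. Note that the 1 labels are themselves noisy, because some routes in the folder were not driven by that driver.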

Unsupervised learning would use the entire unlabeled dataset and cluster the routes into driver groups.
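
One simple way to try the clustering idea, shown here on a single driver's folder rather than the whole dataset, is k-means on scaled route features. This is only a sketch: the two-cluster setup, the rule that the larger cluster belongs to the folder's driver, the feature names, and the randomly generated feature values are all assumptions, not part of the competition data.

```r
set.seed(1)

# Placeholder features for one driver's folder (randomly generated):
# 180 routes in one driving style plus 20 in a clearly different style.
folder <- rbind(
  data.frame(mean_speed = rnorm(180, 12, 1), total_distance = rnorm(180, 8000, 500)),
  data.frame(mean_speed = rnorm(20, 20, 1),  total_distance = rnorm(20, 15000, 500))
)

# Scale the features so that no single unit dominates the distance metric
km <- kmeans(scale(folder), centers = 2, nstart = 10)

# Assumption: most routes in a folder are the driver's own,
# so the larger cluster is treated as the driver's cluster
driver_cluster <- which.max(km$size)
is_driver <- as.integer(km$cluster == driver_cluster)
table(is_driver)
```

Real routes are unlikely to separate this cleanly, so the hard 0/1 assignment would more plausibly serve as a starting guess than as a final probability.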

Possibly a mixture of the two approaches could be used, with an unsupervised pass over each driver's group giving a better starting probability than a blind 1 or 0 assumption before the supervised pass.

With any approach, feature engineering will be the key to this problem. What are the variables that tell us how to identify an individual's driving habits?

Approach 1: Basic Supervision

Non-sophisticated labeling
Features:
- Stop Frequencies
- Average Stop Duration
- Total Time Stopped
- Total Time Traveling
- Total Trip Time
- Average Travel Speed
- Total Distance Traveled
- Max Distance From Start Point
- Trip Similarity To Other Trips By Driver
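
Several of the features in this list follow directly from the coordinates. The sketch below assumes positions in meters and one row per second; the function name `trip_features`, the 0.5 m/s moving threshold, and the two toy routes are illustrative assumptions. Trip similarity is not covered here.

```r
# Hypothetical helper: whole-trip features for one route
# (data frame with columns x and y, one row per second).
trip_features <- function(route, stop_speed = 0.5) {
  step <- sqrt(diff(route$x)^2 + diff(route$y)^2)  # distance per second
  moving <- step >= stop_speed                      # assumed moving threshold
  c(
    total_trip_time     = nrow(route),
    total_distance      = sum(step),
    avg_travel_speed    = if (any(moving)) mean(step[moving]) else 0,
    # Every trip starts at (0, 0), so distance from start is the vector norm
    max_dist_from_start = max(sqrt(route$x^2 + route$y^2))
  )
}

# Stack one feature row per route into a matrix (toy routes, made up)
routes <- list(
  data.frame(x = c(0, 3, 6, 6), y = c(0, 4, 8, 8)),
  data.frame(x = c(0, 10, 0),   y = c(0, 0, 0))
)
features <- t(sapply(routes, trip_features))
features
```

Applied to all 200 files in a driver's folder, the same pattern would give one feature row per route, ready to be labeled and passed to a classifier.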