Introduction

For those of you who don’t know me personally, I managed to run collegiately for 6 years (thanks, Covid?). Over that time I logged a lot of miles and a lot of Garmin activities. I still run quite a bit, but my barn-burning days are behind me. As a sort of “last hurrah,” I’d like to build a dashboard to get some insights into my running trends from that time, but sadly, a lot of my running data is missing and messy. I think cleaning it up will make for a great project to test my skills! Follow along as I clean up and fill in my running data using techniques such as pulling in outside data sources and training ML models to predict missing values.

Data Read-in and Initial Exploration

Below is a first look at my running data from college. Over the span of 6 years, I went for a run at least 1,975 times!

import pandas as pd

#Garmin exports one CSV per year: "Activities 2017.csv" through "Activities 2022.csv"
data_path = "Activities 20"
df = pd.read_csv(data_path + "17.csv")
for i in range(18, 23):
    df = pd.concat([df, pd.read_csv(data_path + str(i) + ".csv")])
df = df.reset_index(drop=True)
df
|      | Activity Type | Date                | Favorite | Title                  | Distance | Calories | Time     | Avg HR | Max HR | Avg Run Cadence | ... | Min Temp | Surface Interval | Decompression | Best Lap Time | Number of Laps | Max Temp | Moving Time | Elapsed Time | Min Elevation | Max Elevation |
|------|---------------|---------------------|----------|------------------------|----------|----------|----------|--------|--------|-----------------|-----|----------|------------------|---------------|---------------|----------------|----------|-------------|--------------|---------------|---------------|
| 0    | Running | 2017-12-31 10:48:59 | False | Sevierville Running    | 13.62 | 1,462 | 01:29:31 | 0   | 0   | 175 | ... | 0.0  | 0:00 | No | 00:00.00    | 14 | 0.0  | 01:29:30 | 01:32:18 | 958   | 1,181 |
| 1    | Running | 2017-12-30 08:44:33 | False | Sevierville Running    | 5.83  | 631   | 00:41:23 | 0   | 0   | 176 | ... | 0.0  | 0:00 | No | 00:00.00    | 6  | 0.0  | 00:41:25 | 00:41:35 | 1,001 | 1,174 |
| 2    | Running | 2017-12-29 11:21:54 | False | Sevierville Running    | 8.22  | 881   | 00:50:45 | 0   | 0   | 176 | ... | 0.0  | 0:00 | No | 00:00.00    | 10 | 0.0  | 00:50:45 | 00:51:27 | 968   | 1,167 |
| 3    | Running | 2017-12-29 11:06:10 | False | Sevierville Running    | 1.97  | 209   | 00:13:42 | 0   | 0   | 174 | ... | 0.0  | 0:00 | No | 00:00.00    | 2  | 0.0  | 00:13:41 | 00:14:01 | 1,028 | 1,178 |
| 4    | Running | 2017-12-28 06:24:02 | False | Moss Point Running     | 7.37  | 797   | 00:52:04 | 0   | 0   | 173 | ... | 0.0  | 0:00 | No | 00:00.00    | 8  | 0.0  | 00:52:03 | 00:52:16 | 64    | 135   |
| ...  | ...     | ...                 | ...   | ...                    | ...   | ...   | ...      | ... | ... | ... | ... | ...  | ...  | ...| ...         | ...| ...  | ...      | ...      | ...   | ...   |
| 1970 | Running | 2022-01-05 08:38:41 | False | Starkville Running     | 11.50 | 1,014 | 01:19:58 | 147 | 163 | 178 | ... | 57.2 | 0:00 | No | 01:19:57.54 | 1  | 75.2 | 01:19:48 | 01:22:26 | 223   | 402   |
| 1971 | Running | 2022-01-04 10:30:42 | False | Noxubee County Running | 11.01 | 851   | 01:07:48 | 145 | 169 | 183 | ... | 51.8 | 0:00 | No | 02:01.19    | 14 | 69.8 | 01:07:39 | 01:23:25 | 130   | 201   |
| 1972 | Running | 2022-01-03 10:29:42 | False | Starkville Running     | 15.01 | 1,349 | 01:43:07 | 148 | 171 | 179 | ... | 37.4 | 0:00 | No | 01:43:06.81 | 1  | 77.0 | 01:43:04 | 01:51:40 | 89    | 243   |
| 1973 | Running | 2022-01-02 09:07:55 | False | Leon County Running    | 8.01  | 688   | 00:56:25 | 141 | 156 | 177 | ... | 77.0 | 0:00 | No | 56:24.51    | 1  | 86.0 | 00:56:22 | 00:58:52 | -94   | 167   |
| 1974 | Running | 2022-01-01 09:45:26 | False | Tallahassee Running    | 7.00  | 593   | 00:51:11 | 136 | 159 | 175 | ... | 75.2 | 0:00 | No | 51:10.60    | 1  | 84.2 | 00:51:06 | 00:56:12 | -59   | 205   |

1975 rows × 38 columns
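
As an aside, since the yearly exports share a naming pattern, the same read-in can be done with a glob (a sketch; it assumes no other files in the folder match "Activities 20*.csv"):

import glob

#Read every yearly export and stack them into one frame
df = pd.concat((pd.read_csv(f) for f in sorted(glob.glob("Activities 20*.csv"))), ignore_index=True)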

Let’s have a look at the columns and datatypes of our dataset.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1975 entries, 0 to 1974
Data columns (total 38 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Activity Type             1975 non-null   object 
 1   Date                      1975 non-null   object 
 2   Favorite                  1975 non-null   bool   
 3   Title                     1975 non-null   object 
 4   Distance                  1975 non-null   float64
 5   Calories                  1975 non-null   object 
 6   Time                      1975 non-null   object 
 7   Avg HR                    1975 non-null   int64  
 8   Max HR                    1975 non-null   int64  
 9   Avg Run Cadence           1975 non-null   object 
 10  Max Run Cadence           1975 non-null   object 
 11  Avg Pace                  1975 non-null   object 
 12  Best Pace                 1975 non-null   object 
 13  Total Ascent              1975 non-null   object 
 14  Total Descent             1975 non-null   object 
 15  Avg Stride Length         1975 non-null   float64
 16  Avg Vertical Ratio        1975 non-null   float64
 17  Avg Vertical Oscillation  1975 non-null   float64
 18  Avg Ground Contact Time   1975 non-null   int64  
 19  Training Stress Score®    1975 non-null   float64
 20  Avg Power                 1975 non-null   int64  
 21  Max Power                 1975 non-null   int64  
 22  Grit                      1975 non-null   float64
 23  Flow                      1975 non-null   float64
 24  Avg. Swolf                1975 non-null   int64  
 25  Avg Stroke Rate           1975 non-null   int64  
 26  Total Reps                1975 non-null   int64  
 27  Dive Time                 1975 non-null   object 
 28  Min Temp                  1975 non-null   float64
 29  Surface Interval          1975 non-null   object 
 30  Decompression             1975 non-null   object 
 31  Best Lap Time             1975 non-null   object 
 32  Number of Laps            1975 non-null   object 
 33  Max Temp                  1975 non-null   float64
 34  Moving Time               1975 non-null   object 
 35  Elapsed Time              1975 non-null   object 
 36  Min Elevation             1975 non-null   object 
 37  Max Elevation             1975 non-null   object 
dtypes: bool(1), float64(9), int64(8), object(20)
memory usage: 573.0+ KB

As you may have guessed, many of the data’s attributes are not useful to us; per Garmin’s documentation, they are metrics for other sports, such as Max Power (a cycling metric) and Avg Stroke Rate (a swimming metric). In the cleanup and feature engineering section, we’ll drop those and many others that aren’t helpful for understanding my running performances.

The first issue with this dataset is that the Avg HR and Max HR columns are populated with some zeros (see the table above), and I assure you that my heart was beating faster than that! The Max Temp and Min Temp columns contain some zeros as well. This is because the watch I had at the beginning of college wasn’t fancy enough to log those metrics, so we can safely assume the zeros populating those four columns are actually NULL / missing values. Many of our columns, such as Min Elevation, should contain numerical data but instead hold strings, so we are going to fix that in the next section too.
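
To size up the problem, here’s a quick count of the zero placeholders in those four columns (a minimal check, using the column names from the export above):

#Count the zeros that are standing in for missing values
zero_cols = ['Avg HR', 'Max HR', 'Min Temp', 'Max Temp']
print((df[zero_cols] == 0).sum())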

Data Cleaning and Feature Engineering

The first thing we need to do is drop those extraneous columns and correct the datatypes of the remaining ones.

import numpy as np

#Drop extraneous columns
cols_to_keep = ['Date', 'Title', 'Time', 'Avg Pace', 'Best Pace', 'Distance',
       'Calories', 'Avg HR', 'Max HR', 'Avg Run Cadence',
       'Max Run Cadence', 'Total Ascent',
       'Total Descent', 'Avg Stride Length', 'Min Temp', 'Max Temp', 'Min Elevation',
       'Max Elevation']

df = df[cols_to_keep]

#Replace missing values with NaN for easy pandas manipulation
df = df.replace('--', np.nan)  #String Garmin uses in place of NaN
df = df.replace(0.0, np.nan)   #Zeros in this export stand in for missing values
df = df.replace(0, np.nan)

#Remove commas (e.g. "1,462") so we can convert these columns to numerical data
cols_to_clean = ['Calories', 'Total Ascent', 'Total Descent', 'Min Elevation', 'Max Elevation']
df[cols_to_clean] = df[cols_to_clean].replace({',':''}, regex=True)

#Convert the numeric columns (everything from 'Calories' onward) to floats for use in models
def float_convert(cols):
    df[cols] = df[cols].astype(float)

float_convert(cols_to_keep[5:])
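
If you’d rather not assume every value parses cleanly, pd.to_numeric with errors='coerce' is a more forgiving alternative (a sketch using the same column list; unparseable strings become NaN instead of raising):

#Alternative conversion: coerce bad values to NaN rather than erroring out
for col in cols_to_keep[5:]:
    df[col] = pd.to_numeric(df[col], errors='coerce')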

A few important columns are written in a time format that is useful for humans but not machines. Let’s engineer some new features from them.

#Drop activities saved by accident
df = df[df['Avg Pace'].notna()]
df = df[df['Best Pace'].notna()]

#Convert values to float representing an equal amount of time in minutes
df['Total Run Time'] = [60 * float(x.split(':')[0]) + float(x.split(':')[1]) + (float(x.split(':')[2].split('.')[0])/60) for x in df['Time']]
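#Nearly equivalent, and arguably more robust: let pandas parse the timestamps
#(it also keeps the fractional seconds the comprehension above truncates)
#df['Total Run Time'] = pd.to_timedelta(df['Time']).dt.total_seconds() / 60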
df.drop(columns = 'Time', inplace = True)
df['Avg Pace'] = [float(x.split(':')[0]) + float(x.split(':')[1]) / 60 for x in df['Avg Pace']]
df['Best Pace'] = [float(x.split(':')[0]) + float(x.split(':')[1]) / 60 for x in df['Best Pace']]

#Strip times down to dates; my college running days ended on the date below
df['Date'] = pd.to_datetime(df['Date']).dt.normalize()
df = df[df['Date'] < np.datetime64("2022-05-15")]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1883 entries, 0 to 1974
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   Date               1883 non-null   datetime64[ns]
 1   Title              1883 non-null   object        
 2   Avg Pace           1883 non-null   float64       
 3   Best Pace          1883 non-null   float64       
 4   Distance           1883 non-null   float64       
 5   Calories           1883 non-null   float64       
 6   Avg HR             607 non-null    float64       
 7   Max HR             607 non-null    float64       
 8   Avg Run Cadence    1883 non-null   float64       
 9   Max Run Cadence    1883 non-null   float64       
 10  Total Ascent       1855 non-null   float64       
 11  Total Descent      1861 non-null   float64       
 12  Avg Stride Length  1883 non-null   float64       
 13  Min Temp           607 non-null    float64       
 14  Max Temp           607 non-null    float64       
 15  Min Elevation      1875 non-null   float64       
 16  Max Elevation      1880 non-null   float64       
 17  Total Run Time     1883 non-null   float64       
dtypes: datetime64[ns](1), float64(16), object(1)
memory usage: 279.5+ KB

Over half of the values in the Avg HR, Max HR, Min Temp, and Max Temp columns are NULL. Remember, I’m doing this so I can better understand the trends in my running data over my college years. I want to create some visualizations with this data in the future, and all of these values are important for getting a “big picture” look at my running trends. To deal with the missing data, we have four options:

  1. Drop the rows that are missing data.
  2. Fill NULL rows with some sort of common value (oftentimes the median of the column in question).
  3. Bring in an outside data source.
  4. Create a predictive model.

Option 1 is not going to work here, as it would eliminate nearly two thirds of my data. Option 2 works OK for the columns that are missing only a few values, but it would definitely take away from the richness of the data and make for some boring, unhelpful visualizations if we used it for every missing value in the dataset. Option 3, however, works great for filling in the temperature data, since weather data is easy to find, and option 4 is the way to go for fixing the HR data.

#Using Option 2 to infill the columns that are missing only a few values
cols_with_few_nan = ['Total Ascent', 'Total Descent','Min Elevation', 'Max Elevation']
df[cols_with_few_nan] = df[cols_with_few_nan].fillna(df[cols_with_few_nan].median())
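
After that fill, a quick look at what’s still missing confirms only the HR and temperature columns need further work (a minimal check):

#Only the HR and temperature columns should still contain NULLs
print(df.isna().sum())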

Bringing in Some Outside Help

Unfortunately, Garmin uses a somewhat cryptic system for logging the location of runs. It usually titles each activity with either the county or city name plus “Running,” and no other geolocation data to go along with it. To get us started, let’s look at where the runs that are missing temperature data occurred.

from collections import Counter

run_locations = Counter(df[df['Min Temp'].isna()]['Title'])
sorted(run_locations.items(), key=lambda x:x[1], reverse = True)[:10]
[('Oktibbeha County Running', 349),
 ('Starkville Running', 245),
 ('Flowood Running', 127),
 ('Jackson County Running', 120),
 ('Moss Point Running', 64),
 ('Mobile County Running', 50),
 ('Boulder County Running', 43),
 ('Lucedale Running', 31),
 ('Oktibbeha County - Running', 21),
 ('Boulder Running', 21)]

Despite the differing titles, the vast majority of these samples occur either in my old college town of Starkville, MS or very close to it, and nearly all the rest occur somewhere in Mississippi or elsewhere in the South. Because we don’t have a way to convert these titles to more specific locations without things getting really messy, I believe it will suffice to use weather data for Starkville, MS as a proxy for all the missing values.

df_weather = pd.read_csv("Weather Data.csv")
df_weather = df_weather[['NAME','DATE','TMAX','TMIN']].reset_index(drop=True)

df_weather['Date'] = pd.to_datetime(df_weather['DATE'])
df_weather['Min Temp'] = df_weather['TMIN']
df_weather['Max Temp'] = df_weather['TMAX']

#Dataset contains weather reports from several stations surrounding Starkville, so we can average them by day
df_weather = df_weather[['Date', 'Min Temp', 'Max Temp']].groupby(by = ['Date']).mean()

#Perform inner join, giving us a 1:1 ratio of dates to temperatures
df = df.drop(columns = ['Min Temp', 'Max Temp']).merge(df_weather, on = 'Date', how = 'inner')

#Infill any remaining missing temperature values with the median
cols_with_few_nan = ['Min Temp', 'Max Temp']
df[cols_with_few_nan] = df[cols_with_few_nan].fillna(df[cols_with_few_nan].median())

There you have it: our filled-in temperature data. Now we need to build models that can effectively populate the missing values in our Max HR and Avg HR columns.
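
Before moving on, it’s worth confirming that the temperature columns really are complete now (a minimal sanity check):

#Both counts should be zero after the merge and median infill
print(df[['Min Temp', 'Max Temp']].isna().sum())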

Fitting a Model to Fill in Our Missing Data

Let’s train and evaluate some regression models to fill in all that missing heart rate data. In the end we will have built two models: one to predict the Avg HR column and another to predict Max HR.

#Select subset of data with no missing values for training
df_train = df.dropna()

#Training features
X_train = df_train[['Avg Pace', 'Best Pace', 'Distance', 'Calories',
       'Avg Run Cadence', 'Max Run Cadence',
       'Total Ascent', 'Total Descent', 'Avg Stride Length', 'Min Elevation',
       'Max Elevation', 'Total Run Time', 'Min Temp', 'Max Temp']]


y_avg = df_train['Avg HR']
y_max = df_train['Max HR']

Because of my running domain knowledge, I have an idea of which features will be useful for predicting the Max HR and Avg HR columns of our data, but I’m a fan of letting scikit-learn decide which features are best for me. Let’s select the best 5 features.

from sklearn.feature_selection import SelectKBest, f_regression

#The best features to predict Avg HR are not necessarily the best to predict Max HR
kb_average = SelectKBest(f_regression, k=5).fit(X_train, y_avg)
kb_max = SelectKBest(f_regression, k=5).fit(X_train, y_max)

X_avg = kb_average.transform(X_train)
X_max = kb_max.transform(X_train)
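
If you’re curious which columns survive the cut, the fitted selectors can report them (this relies on get_feature_names_out, available in scikit-learn 1.0+; output omitted since it depends on the data):

#Inspect which 5 features each selector kept
print(kb_average.get_feature_names_out())
print(kb_max.get_feature_names_out())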

Now we can use those extracted features to train several regression models and evaluate using cross validation to pick the best one for our two prediction tasks. The evaluation metrics we’ll use are Mean Absolute Error (MAE) and Mean Squared Error (MSE).

from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.svm import SVR
from sklearn.metrics import make_scorer, mean_squared_error
import statistics

def cv(model, X, y, model_name):
    score = cross_validate(model, X, y, cv=5, scoring=('neg_mean_absolute_error', 'neg_mean_squared_error'))
    print("\nModel: ", model_name)
    print("Test Mean Absolute Error: ", statistics.mean(score['test_neg_mean_absolute_error']), 
          "\nTest Mean Sqaured Error: ", statistics.mean(score['test_neg_mean_squared_error']))

lasso = Lasso(alpha = 0.1)
reg = LinearRegression()
regr = SVR(C=1.0, epsilon=0.2)
rfr = RandomForestRegressor(max_depth=10)

data_list = [(X_avg, y_avg, "HR Avg"), (X_max, y_max, "HR Max")]
model_dict = {"Lasso Regression":lasso, "Linear Regression":reg, "SVR":regr, "RF Regressor":rfr}

for data in data_list:
    print('\n############################\n',data[2])
    for model in model_dict.keys():
        cv(model_dict[model], data[0], data[1], model)
############################
 HR Avg

Model:  Lasso Regression
Test Mean Absolute Error:  -3.4721680220384203 
Test Mean Squared Error:  -21.528693269323934

Model:  Linear Regression
Test Mean Absolute Error:  -3.4965374931940674 
Test Mean Squared Error:  -21.794449648893643

Model:  SVR
Test Mean Absolute Error:  -6.503686070502235 
Test Mean Squared Error:  -64.17778768337854

Model:  RF Regressor
Test Mean Absolute Error:  -3.6666079473371216 
Test Mean Squared Error:  -22.542642038967205

############################
 HR Max

Model:  Lasso Regression
Test Mean Absolute Error:  -6.310069863144789 
Test Mean Squared Error:  -67.01035553775584

Model:  Linear Regression
Test Mean Absolute Error:  -6.332889188711009 
Test Mean Squared Error:  -67.28914455377931

Model:  SVR
Test Mean Absolute Error:  -7.571264852476413 
Test Mean Squared Error:  -91.18257365781352

Model:  RF Regressor
Test Mean Absolute Error:  -5.748229756457544 
Test Mean Squared Error:  -57.09265923926437

Keep in mind that scikit-learn reports these scores as negated errors, so values closer to zero are better. Lasso regression is the best model for predicting the average heart rate of my runs, while the Random Forest Regressor is the best at predicting the max heart rate.

Fitting Final Models

Let’s fit the best-performing models on their entire respective training sets and predict on the samples that are missing HR data.

lasso_avg = Lasso(alpha = 0.1)
lasso_avg.fit(X_avg, y_avg)

rfr_max = RandomForestRegressor(max_depth=10)
rfr_max.fit(X_max, y_max)

#Select predictive features from entire dataset
X_full = df[['Avg Pace', 'Best Pace', 'Distance', 'Calories',
       'Avg Run Cadence', 'Max Run Cadence',
       'Total Ascent', 'Total Descent', 'Avg Stride Length', 'Min Elevation',
       'Max Elevation', 'Total Run Time', 'Min Temp', 'Max Temp']].to_numpy()

#Predict for all samples, then infill only the rows that are missing values
#(combine_first keeps the original value wherever one already exists)
df['Max HR'] = df['Max HR'].combine_first(pd.Series(rfr_max.predict(kb_max.transform(X_full))))
df['Avg HR'] = df['Avg HR'].combine_first(pd.Series(lasso_avg.predict(kb_average.transform(X_full))))

df.to_csv('Running_Data_Clean.csv')
df
|      | Date       | Title                  | Avg Pace | Best Pace | Distance | Calories | Avg HR     | Max HR     | Avg Run Cadence | Max Run Cadence | Total Ascent | Total Descent | Avg Stride Length | Min Elevation | Max Elevation | Total Run Time | Min Temp | Max Temp |
|------|------------|------------------------|----------|-----------|----------|----------|------------|------------|-----------------|-----------------|--------------|---------------|-------------------|---------------|---------------|----------------|----------|----------|
| 0    | 2017-12-31 | Sevierville Running    | 6.566667 | 5.983333  | 13.62 | 1462.0 | 179.348833 | 175.218101 | 175.0 | 187.0 | 381.0 | 425.0 | 1.40 | 958.0  | 1181.0 | 89.516667  | 22.0 | 46.0 |
| 1    | 2017-12-30 | Sevierville Running    | 7.100000 | 6.533333  | 5.83  | 631.0  | 151.453714 | 168.844806 | 176.0 | 191.0 | 169.0 | 9.0   | 1.29 | 1001.0 | 1174.0 | 41.383333  | 27.0 | 49.0 |
| 2    | 2017-12-29 | Sevierville Running    | 6.166667 | 5.350000  | 8.22  | 881.0  | 163.870123 | 176.137507 | 176.0 | 191.0 | 285.0 | 184.0 | 1.47 | 968.0  | 1167.0 | 50.750000  | 28.0 | 44.0 |
| 3    | 2017-12-29 | Sevierville Running    | 6.950000 | 6.316667  | 1.97  | 209.0  | 139.974450 | 164.446402 | 174.0 | 191.0 | 48.0  | 181.0 | 1.34 | 1028.0 | 1178.0 | 13.700000  | 28.0 | 44.0 |
| 4    | 2017-12-28 | Moss Point Running     | 7.066667 | 6.383333  | 7.37  | 797.0  | 156.137539 | 171.490540 | 173.0 | 185.0 | 182.0 | 195.0 | 1.32 | 64.0   | 135.0  | 52.066667  | 25.0 | 42.0 |
| ...  | ...        | ...                    | ...      | ...       | ...   | ...    | ...        | ...        | ...   | ...   | ...   | ...   | ...  | ...    | ...    | ...        | ...  | ...  |
| 1878 | 2022-01-05 | Starkville Running     | 6.950000 | 6.316667  | 11.50 | 1014.0 | 147.000000 | 163.000000 | 178.0 | 201.0 | 420.0 | 404.0 | 1.30 | 223.0  | 402.0  | 79.966667  | 33.0 | 55.0 |
| 1879 | 2022-01-04 | Noxubee County Running | 6.166667 | 4.483333  | 11.01 | 851.0  | 145.000000 | 169.000000 | 183.0 | 232.0 | 289.0 | 246.0 | 1.42 | 130.0  | 201.0  | 67.800000  | 30.0 | 40.5 |
| 1880 | 2022-01-03 | Starkville Running     | 6.866667 | 5.650000  | 15.01 | 1349.0 | 148.000000 | 171.000000 | 179.0 | 190.0 | 807.0 | 774.0 | 1.31 | 89.0   | 243.0  | 103.116667 | 29.5 | 51.0 |
| 1881 | 2022-01-02 | Leon County Running    | 7.050000 | 5.716667  | 8.01  | 688.0  | 141.000000 | 156.000000 | 177.0 | 188.0 | 810.0 | 978.0 | 1.29 | -94.0  | 167.0  | 56.416667  | 30.0 | 65.0 |
| 1882 | 2022-01-01 | Tallahassee Running    | 7.316667 | 5.300000  | 7.00  | 593.0  | 136.000000 | 159.000000 | 175.0 | 186.0 | 801.0 | 863.0 | 1.26 | -59.0  | 205.0  | 51.183333  | 61.5 | 79.0 |

1883 rows × 18 columns
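
One last sanity check before we call it done (a minimal check, assuming all the steps above ran cleanly):

#Verify that no NULLs remain anywhere in the dataset
assert df.isna().sum().sum() == 0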

There you have it: a NULL-free, clean dataset. We are dashboard-ready now. Check out my next post to see what I can make with this in Tableau.