Introduction
For those of you who don’t know me personally, I managed to run collegiately for 6 years (thanks, Covid?). Over that time I logged a lot of miles and a lot of Garmin activities. I still run quite a bit, but my barn-burning days are behind me. As a sort of “last hurrah,” I’d like to build a dashboard that gives me some insight into my running trends over those years, but sadly, a lot of my running data is missing or messy. I think cleaning it up will make for a great project to test my skills! Follow along here as I clean up and fill in my running data using techniques ranging from pulling in outside data sources to training ML models that predict the missing values.
Data Read-in and Initial Exploration
Below is a first look at my running data from college. Over the span of 6 years, I went for a run at least 1,975 times!
import pandas as pd

#Garmin exports one CSV per year: Activities 2017.csv through Activities 2022.csv
data_path = "Activities 20"
df = pd.read_csv(data_path + "17.csv")
for i in range(18, 23):
    df = pd.concat([df, pd.read_csv(data_path + str(i) + ".csv")])
df = df.reset_index(drop = True)
df
Activity Type | Date | Favorite | Title | Distance | Calories | Time | Avg HR | Max HR | Avg Run Cadence | ... | Min Temp | Surface Interval | Decompression | Best Lap Time | Number of Laps | Max Temp | Moving Time | Elapsed Time | Min Elevation | Max Elevation | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Running | 2017-12-31 10:48:59 | False | Sevierville Running | 13.62 | 1,462 | 01:29:31 | 0 | 0 | 175 | ... | 0.0 | 0:00 | No | 00:00.00 | 14 | 0.0 | 01:29:30 | 01:32:18 | 958 | 1,181 |
1 | Running | 2017-12-30 08:44:33 | False | Sevierville Running | 5.83 | 631 | 00:41:23 | 0 | 0 | 176 | ... | 0.0 | 0:00 | No | 00:00.00 | 6 | 0.0 | 00:41:25 | 00:41:35 | 1,001 | 1,174 |
2 | Running | 2017-12-29 11:21:54 | False | Sevierville Running | 8.22 | 881 | 00:50:45 | 0 | 0 | 176 | ... | 0.0 | 0:00 | No | 00:00.00 | 10 | 0.0 | 00:50:45 | 00:51:27 | 968 | 1,167 |
3 | Running | 2017-12-29 11:06:10 | False | Sevierville Running | 1.97 | 209 | 00:13:42 | 0 | 0 | 174 | ... | 0.0 | 0:00 | No | 00:00.00 | 2 | 0.0 | 00:13:41 | 00:14:01 | 1,028 | 1,178 |
4 | Running | 2017-12-28 06:24:02 | False | Moss Point Running | 7.37 | 797 | 00:52:04 | 0 | 0 | 173 | ... | 0.0 | 0:00 | No | 00:00.00 | 8 | 0.0 | 00:52:03 | 00:52:16 | 64 | 135 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1970 | Running | 2022-01-05 08:38:41 | False | Starkville Running | 11.50 | 1,014 | 01:19:58 | 147 | 163 | 178 | ... | 57.2 | 0:00 | No | 01:19:57.54 | 1 | 75.2 | 01:19:48 | 01:22:26 | 223 | 402 |
1971 | Running | 2022-01-04 10:30:42 | False | Noxubee County Running | 11.01 | 851 | 01:07:48 | 145 | 169 | 183 | ... | 51.8 | 0:00 | No | 02:01.19 | 14 | 69.8 | 01:07:39 | 01:23:25 | 130 | 201 |
1972 | Running | 2022-01-03 10:29:42 | False | Starkville Running | 15.01 | 1,349 | 01:43:07 | 148 | 171 | 179 | ... | 37.4 | 0:00 | No | 01:43:06.81 | 1 | 77.0 | 01:43:04 | 01:51:40 | 89 | 243 |
1973 | Running | 2022-01-02 09:07:55 | False | Leon County Running | 8.01 | 688 | 00:56:25 | 141 | 156 | 177 | ... | 77.0 | 0:00 | No | 56:24.51 | 1 | 86.0 | 00:56:22 | 00:58:52 | -94 | 167 |
1974 | Running | 2022-01-01 09:45:26 | False | Tallahassee Running | 7.00 | 593 | 00:51:11 | 136 | 159 | 175 | ... | 75.2 | 0:00 | No | 51:10.60 | 1 | 84.2 | 00:51:06 | 00:56:12 | -59 | 205 |
1975 rows × 38 columns
Let’s have a look at the columns and datatypes of our dataset.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1975 entries, 0 to 1974
Data columns (total 38 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Activity Type 1975 non-null object
1 Date 1975 non-null object
2 Favorite 1975 non-null bool
3 Title 1975 non-null object
4 Distance 1975 non-null float64
5 Calories 1975 non-null object
6 Time 1975 non-null object
7 Avg HR 1975 non-null int64
8 Max HR 1975 non-null int64
9 Avg Run Cadence 1975 non-null object
10 Max Run Cadence 1975 non-null object
11 Avg Pace 1975 non-null object
12 Best Pace 1975 non-null object
13 Total Ascent 1975 non-null object
14 Total Descent 1975 non-null object
15 Avg Stride Length 1975 non-null float64
16 Avg Vertical Ratio 1975 non-null float64
17 Avg Vertical Oscillation 1975 non-null float64
18 Avg Ground Contact Time 1975 non-null int64
19 Training Stress Score® 1975 non-null float64
20 Avg Power 1975 non-null int64
21 Max Power 1975 non-null int64
22 Grit 1975 non-null float64
23 Flow 1975 non-null float64
24 Avg. Swolf 1975 non-null int64
25 Avg Stroke Rate 1975 non-null int64
26 Total Reps 1975 non-null int64
27 Dive Time 1975 non-null object
28 Min Temp 1975 non-null float64
29 Surface Interval 1975 non-null object
30 Decompression 1975 non-null object
31 Best Lap Time 1975 non-null object
32 Number of Laps 1975 non-null object
33 Max Temp 1975 non-null float64
34 Moving Time 1975 non-null object
35 Elapsed Time 1975 non-null object
36 Min Elevation 1975 non-null object
37 Max Elevation 1975 non-null object
dtypes: bool(1), float64(9), int64(8), object(20)
memory usage: 573.0+ KB
As you may have guessed (and as Garmin’s documentation confirms), many of the dataset’s attributes are not useful to us because they aren’t metrics recorded for runs, such as Max Power (a cycling metric) and Avg Stroke Rate (a swimming metric). In the cleanup and feature engineering section, we’ll drop those along with many others that don’t help us understand my running performances.
The first issue with this dataset is that the Avg HR and Max HR columns are populated with some zeros (see the table above), and I assure you that my heart was beating faster than that! The Max Temp and Min Temp columns also contain some zeros. This is because the watch I had early in college didn’t log those metrics, so we can treat the 0s in those four columns as NULL / missing values. Additionally, many columns that should hold numerical data, such as Min Elevation, are stored as strings (note the commas in values like 1,181), so we’ll fix that in the next section too.
Data Cleaning and Feature Engineering
The first thing we need to do is drop those extraneous columns and correct the data types of the remaining ones.
import numpy as np

#Drop extraneous columns
cols_to_keep = ['Date', 'Title', 'Time', 'Avg Pace', 'Best Pace', 'Distance',
                'Calories', 'Avg HR', 'Max HR', 'Avg Run Cadence',
                'Max Run Cadence', 'Total Ascent',
                'Total Descent', 'Avg Stride Length', 'Min Temp', 'Max Temp',
                'Min Elevation', 'Max Elevation']
df = df[cols_to_keep]

#Replace missing values with NaN for easy pandas manipulation
df = df.replace('--', np.nan) #String Garmin uses in place of NaN
df = df.replace(0.0, np.nan)  #Zeros in this export also mean "not recorded"
df = df.replace(0, np.nan)

#Remove commas so we can convert these columns to numerical data
cols_to_clean = ['Calories', 'Total Ascent', 'Total Descent', 'Min Elevation', 'Max Elevation']
df[cols_to_clean] = df[cols_to_clean].replace({',':''}, regex=True)

#Convert the columns from 'Distance' onward to floats for use in models
def float_convert(cols):
    df[cols] = df[cols].astype(float)

float_convert(cols_to_keep[5:])
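As an aside, the comma stripping above matters because Python’s float parser rejects thousands separators; here’s a quick demonstration:

#float() can't handle a thousands separator, hence the comma removal above
try:
    float("1,462")
except ValueError as err:
    print(err)  #could not convert string to float: '1,462'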
A few important columns are written in a time format that’s useful for humans but not machines. Let’s engineer some new features from them.
#Drop activities saved by accident (they have no recorded pace)
df = df[df['Avg Pace'].notna()]
df = df[df['Best Pace'].notna()]

#Convert 'Time' (H:MM:SS) to a float number of minutes
df['Total Run Time'] = [60 * float(x.split(':')[0])
                        + float(x.split(':')[1])
                        + float(x.split(':')[2].split('.')[0]) / 60
                        for x in df['Time']]
df.drop(columns = 'Time', inplace = True)

#Convert paces (M:SS) to float minutes
df['Avg Pace'] = [float(x.split(':')[0]) + float(x.split(':')[1]) / 60
                  for x in df['Avg Pace']]
df['Best Pace'] = [float(x.split(':')[0]) + float(x.split(':')[1]) / 60
                   for x in df['Best Pace']]

#Truncate timestamps to dates; my college running days ended on the date below
df['Date'] = pd.to_datetime(df['Date']).dt.normalize()
df = df[df['Date'] < np.datetime64("2022-05-15")]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1883 entries, 0 to 1974
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 1883 non-null datetime64[ns]
1 Title 1883 non-null object
2 Avg Pace 1883 non-null float64
3 Best Pace 1883 non-null float64
4 Distance 1883 non-null float64
5 Calories 1883 non-null float64
6 Avg HR 607 non-null float64
7 Max HR 607 non-null float64
8 Avg Run Cadence 1883 non-null float64
9 Max Run Cadence 1883 non-null float64
10 Total Ascent 1855 non-null float64
11 Total Descent 1861 non-null float64
12 Avg Stride Length 1883 non-null float64
13 Min Temp 607 non-null float64
14 Max Temp 607 non-null float64
15 Min Elevation 1875 non-null float64
16 Max Elevation 1880 non-null float64
17 Total Run Time 1883 non-null float64
dtypes: datetime64[ns](1), float64(16), object(1)
memory usage: 279.5+ KB
Over half of the values in the Avg HR, Max HR, Min Temp, and Max Temp columns are NULL. Remember, I’m doing this so I can better understand the trends in my running data from my college years. I want to create some visualizations with this data in the future, and all of these values are important for getting a “big picture” look at my running trends. To deal with the missing data, we have four options:
- Drop the rows that are missing data.
- Fill NULL rows with some sort of common value (oftentimes the median of the column in question).
- Bring in an outside data source.
- Create a predictive model.
Option 1 is not going to work here, as it would eliminate nearly two-thirds of my data. Option 2 works OK for the columns that are missing only a few values, but it would take away from the richness of the data and make for some boring, unhelpful visualizations if we used it for every missing value in the dataset. Option 3, however, works great for filling in the temperature data, since weather data is easy to find, and option 4 is the way to go for fixing the HR data.
#Using Option 2 to infill missing data with only a few NULLs
cols_with_few_nan = ['Total Ascent', 'Total Descent','Min Elevation', 'Max Elevation']
df[cols_with_few_nan] = df[cols_with_few_nan].fillna(df[cols_with_few_nan].median())
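As a quick sanity check (this wasn’t part of my original run), we can confirm the median fill left no NaNs behind in those four columns:

#Each of the median-filled columns should now report zero NaNs
print(df[cols_with_few_nan].isna().sum())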
Bringing in Some Outside Help
Unfortunately, Garmin uses a somewhat cryptic system for logging the location of runs: it usually titles each activity with a county or city name plus “Running,” with no other geolocation data to go along with it. To get started, let’s look at where the runs that are missing temperature data occurred.
from collections import Counter

#Tally the titles of runs that are missing temperature data
run_locations = Counter(df[df['Min Temp'].isna()]['Title'])
run_locations.most_common(10)
[('Oktibbeha County Running', 349),
('Starkville Running', 245),
('Flowood Running', 127),
('Jackson County Running', 120),
('Moss Point Running', 64),
('Mobile County Running', 50),
('Boulder County Running', 43),
('Lucedale Running', 31),
('Oktibbeha County - Running', 21),
('Boulder Running', 21)]
Despite the differing titles, the vast majority of these runs happened in or very close to my old college town of Starkville, MS, and nearly all the rest happened elsewhere in Mississippi or in the South. Because we have no way to convert these titles into more precise locations without things getting really messy, I believe weather data for Starkville, MS will suffice as a proxy for all of our missing values.
#Daily weather summaries with TMAX / TMIN columns
df_weather = pd.read_csv("Weather Data.csv")
df_weather = df_weather[['NAME', 'DATE', 'TMAX', 'TMIN']].reset_index(drop = True)
df_weather['Date'] = pd.to_datetime(df_weather['DATE'])
df_weather = df_weather.rename(columns = {'TMIN': 'Min Temp', 'TMAX': 'Max Temp'})

#Dataset contains weather reports from several stations surrounding Starkville,
#so average them into a single reading per day
df_weather = df_weather[['Date', 'Min Temp', 'Max Temp']].groupby(by = 'Date').mean()

#Perform inner join, giving us a 1:1 match of run dates to temperatures
df = df.drop(columns = ['Min Temp', 'Max Temp']).merge(df_weather, on = 'Date', how = 'inner')

#Infill any remaining missing temperature values with the median
cols_with_few_nan = ['Min Temp', 'Max Temp']
df[cols_with_few_nan] = df[cols_with_few_nan].fillna(df[cols_with_few_nan].median())
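One caveat with the inner join: it silently drops any run whose date is absent from the weather file. That worked out here because the weather data covered every run date (we still have 1,883 rows at the end). If you weren’t sure of that coverage, a left join would be the safer sketch, since unmatched dates come through as NaN and get caught by the median fill:

#Alternative to the merge above (not what I ran): keep every run,
#leaving NaN temperatures wherever there is no matching weather date
df = df.drop(columns = ['Min Temp', 'Max Temp']).merge(df_weather, on = 'Date', how = 'left')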
There you have it: our temperature data is filled in. Now we need to build a model or two that can effectively populate the missing values in our Max HR and Avg HR columns.
Fitting a Model to Fill In Our Missing Data
Let’s train and evaluate some regression models to fill in all that missing heart rate data. By the end we will have built two models: one to predict Avg HR and another to predict Max HR.
#Select subset of data with no missing values for training
df_train = df.dropna()
#Training features
X_train = df_train[['Avg Pace', 'Best Pace', 'Distance', 'Calories',
'Avg Run Cadence', 'Max Run Cadence',
'Total Ascent', 'Total Descent', 'Avg Stride Length', 'Min Elevation',
'Max Elevation', 'Total Run Time', 'Min Temp', 'Max Temp']]
y_avg = df_train['Avg HR']
y_max = df_train['Max HR']
Thanks to my running domain knowledge, I have an idea of which features will be useful for predicting the Max HR and Avg HR columns, but I’m a fan of letting scikit-learn decide which features are best for me. Let’s select the five best features.
from sklearn.feature_selection import SelectKBest, f_regression
#The best features to predict Avg HR are not necessarily the best to predict Max HR
kb_average = SelectKBest(f_regression, k=5).fit(X_train, y_avg)
kb_max = SelectKBest(f_regression, k=5).fit(X_train, y_max)
X_avg = kb_average.transform(X_train)
X_max = kb_max.transform(X_train)
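If you’re curious which five features each selector kept, SelectKBest exposes the selection mask via get_support(); this inspection step is just for our benefit and isn’t part of the pipeline:

#Map the boolean support mask back to the original column names
print("Avg HR features:", list(X_train.columns[kb_average.get_support()]))
print("Max HR features:", list(X_train.columns[kb_max.get_support()]))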
Now we can use those selected features to train several regression models, evaluating with cross-validation to pick the best one for each of our two prediction tasks. Our evaluation metrics are Mean Absolute Error (MAE) and Mean Squared Error (MSE); note that scikit-learn reports both as negated scores, so values closer to zero are better.
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.svm import SVR
import statistics

#Cross-validate a model and report its mean scores across the 5 folds
def cv(model, X, y, model_name):
    score = cross_validate(model, X, y, cv=5,
                           scoring=('neg_mean_absolute_error', 'neg_mean_squared_error'))
    print("\nModel: ", model_name)
    print("Test Mean Absolute Error: ", statistics.mean(score['test_neg_mean_absolute_error']),
          "\nTest Mean Squared Error: ", statistics.mean(score['test_neg_mean_squared_error']))

lasso = Lasso(alpha = 0.1)
reg = LinearRegression()
regr = SVR(C=1.0, epsilon=0.2)
rfr = RandomForestRegressor(max_depth=10)

data_list = [(X_avg, y_avg, "HR Avg"), (X_max, y_max, "HR Max")]
model_dict = {"Lasso Regression":lasso, "Linear Regression":reg, "SVR":regr, "RF Regressor":rfr}

#Evaluate every model on both prediction tasks
for X, y, label in data_list:
    print('\n############################\n', label)
    for model_name, model in model_dict.items():
        cv(model, X, y, model_name)
############################
HR Avg
Model: Lasso Regression
Test Mean Absolute Error: -3.4721680220384203
Test Mean Squared Error: -21.528693269323934
Model: Linear Regression
Test Mean Absolute Error: -3.4965374931940674
Test Mean Squared Error: -21.794449648893643
Model: SVR
Test Mean Absolute Error: -6.503686070502235
Test Mean Squared Error: -64.17778768337854
Model: RF Regressor
Test Mean Absolute Error: -3.6666079473371216
Test Mean Squared Error: -22.542642038967205
############################
HR Max
Model: Lasso Regression
Test Mean Absolute Error: -6.310069863144789
Test Mean Squared Error: -67.01035553775584
Model: Linear Regression
Test Mean Absolute Error: -6.332889188711009
Test Mean Squared Error: -67.28914455377931
Model: SVR
Test Mean Absolute Error: -7.571264852476413
Test Mean Squared Error: -91.18257365781352
Model: RF Regressor
Test Mean Absolute Error: -5.748229756457544
Test Mean Squared Error: -57.09265923926437
Lasso regression is the best model for predicting the average heart rate of my runs (its scores are closest to zero), while the Random Forest Regressor is the best at predicting the max heart rate.
Fitting Final Models
Let’s fit the best-performing models on their entire respective training sets, then predict on the samples that are missing HR data.
#Refit the winning models on all labeled samples
lasso_avg = Lasso(alpha = 0.1)
lasso_avg.fit(X_avg, y_avg)
rfr_max = RandomForestRegressor(max_depth=10)
rfr_max.fit(X_max, y_max)

#Select predictive features from entire dataset
X_full = df[['Avg Pace', 'Best Pace', 'Distance', 'Calories',
             'Avg Run Cadence', 'Max Run Cadence',
             'Total Ascent', 'Total Descent', 'Avg Stride Length', 'Min Elevation',
             'Max Elevation', 'Total Run Time', 'Min Temp', 'Max Temp']].to_numpy()

#Predict for all samples, keeping observed values and infilling only the gaps
df['Max HR'] = df['Max HR'].combine_first(pd.Series(rfr_max.predict(kb_max.transform(X_full))))
df['Avg HR'] = df['Avg HR'].combine_first(pd.Series(lasso_avg.predict(kb_average.transform(X_full))))

df.to_csv('Running_Data_Clean.csv')
df
Date | Title | Avg Pace | Best Pace | Distance | Calories | Avg HR | Max HR | Avg Run Cadence | Max Run Cadence | Total Ascent | Total Descent | Avg Stride Length | Min Elevation | Max Elevation | Total Run Time | Min Temp | Max Temp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017-12-31 | Sevierville Running | 6.566667 | 5.983333 | 13.62 | 1462.0 | 179.348833 | 175.218101 | 175.0 | 187.0 | 381.0 | 425.0 | 1.40 | 958.0 | 1181.0 | 89.516667 | 22.0 | 46.0 |
1 | 2017-12-30 | Sevierville Running | 7.100000 | 6.533333 | 5.83 | 631.0 | 151.453714 | 168.844806 | 176.0 | 191.0 | 169.0 | 9.0 | 1.29 | 1001.0 | 1174.0 | 41.383333 | 27.0 | 49.0 |
2 | 2017-12-29 | Sevierville Running | 6.166667 | 5.350000 | 8.22 | 881.0 | 163.870123 | 176.137507 | 176.0 | 191.0 | 285.0 | 184.0 | 1.47 | 968.0 | 1167.0 | 50.750000 | 28.0 | 44.0 |
3 | 2017-12-29 | Sevierville Running | 6.950000 | 6.316667 | 1.97 | 209.0 | 139.974450 | 164.446402 | 174.0 | 191.0 | 48.0 | 181.0 | 1.34 | 1028.0 | 1178.0 | 13.700000 | 28.0 | 44.0 |
4 | 2017-12-28 | Moss Point Running | 7.066667 | 6.383333 | 7.37 | 797.0 | 156.137539 | 171.490540 | 173.0 | 185.0 | 182.0 | 195.0 | 1.32 | 64.0 | 135.0 | 52.066667 | 25.0 | 42.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1878 | 2022-01-05 | Starkville Running | 6.950000 | 6.316667 | 11.50 | 1014.0 | 147.000000 | 163.000000 | 178.0 | 201.0 | 420.0 | 404.0 | 1.30 | 223.0 | 402.0 | 79.966667 | 33.0 | 55.0 |
1879 | 2022-01-04 | Noxubee County Running | 6.166667 | 4.483333 | 11.01 | 851.0 | 145.000000 | 169.000000 | 183.0 | 232.0 | 289.0 | 246.0 | 1.42 | 130.0 | 201.0 | 67.800000 | 30.0 | 40.5 |
1880 | 2022-01-03 | Starkville Running | 6.866667 | 5.650000 | 15.01 | 1349.0 | 148.000000 | 171.000000 | 179.0 | 190.0 | 807.0 | 774.0 | 1.31 | 89.0 | 243.0 | 103.116667 | 29.5 | 51.0 |
1881 | 2022-01-02 | Leon County Running | 7.050000 | 5.716667 | 8.01 | 688.0 | 141.000000 | 156.000000 | 177.0 | 188.0 | 810.0 | 978.0 | 1.29 | -94.0 | 167.0 | 56.416667 | 30.0 | 65.0 |
1882 | 2022-01-01 | Tallahassee Running | 7.316667 | 5.300000 | 7.00 | 593.0 | 136.000000 | 159.000000 | 175.0 | 186.0 | 801.0 | 863.0 | 1.26 | -59.0 | 205.0 | 51.183333 | 61.5 | 79.0 |
1883 rows × 18 columns
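Before we wrap up, one last optional check confirms that no NaNs remain anywhere in the frame:

#Count every remaining NaN in the frame; this should print 0
print(df.isna().sum().sum())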
There you have it: a NULL-free, clean dataset. We’re dashboard-ready now. Check out my next post to see what I make with this in Tableau.