Using Machine Learning to Predict COVID-19 Case Numbers

The body of data on COVID-19 grows daily, and with it, the scope of possible data-driven conclusions that could be drawn from it. In this article I describe how I used the scikit-learn open source machine learning library and the COVID-19 dataset from the Royal Society DELVE Initiative to try to understand why some countries have been more affected than others, and how the implementation of certain measures may have helped. The Royal Society DELVE Initiative has put together a global dataset on COVID-19, consolidated from multiple sources, that includes 62 fields for each country, ranging from weather and Non-Pharmaceutical Interventions (NPIs) to case, test and mobility numbers.

This article follows on from COVID19 Dataset from the Royal Society DELVE Initiative, where we explored the dataset and looked at how various parameters and interventions affected the number of cases.

Machine learning is a subset of artificial intelligence that allows a computer to learn and make predictions without being explicitly programmed to do so. A machine learning model is trained by feeding it data that has the correct answers attached. This training data allows the model to connect patterns in the data to the correct answer. Once the model has been trained, it can be given test data that has no answers attached, and it predicts the answers based on the training it received.
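As a minimal sketch of this train-then-predict cycle (toy data, not the model or features used later in this article):

import numpy as np
from sklearn import svm

# Toy training data: each row is a set of input fields (features),
# and y_train holds the known correct answer for each row
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0], [4.0, 7.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0])

# Train (fit) the model on the data with answers attached
model = svm.SVR(kernel='rbf')
model.fit(X_train, y_train)

# Ask the trained model to predict the answer for unseen data
X_test = np.array([[2.5, 4.0]])
print(model.predict(X_test))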

I used a Support Vector Machine Regression (SVMR) model, svm.SVR in scikit-learn, to predict new cases per day (cases_new) from some of the fields in the dataset.

It is worth noting that I included cases_total in the training data. This was to tether the model to reality, so that when changes are made to the data, for example to the face mask policy, that one change should not have a large effect on the predictions unless it is genuinely significant. Put another way, because there is a direct correlation between new cases per day and total cases, including that field in the training data effectively makes it more difficult for any other field to change the predicted number of cases per day.
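This relationship is easy to check with pandas. A quick sketch, assuming the same CSV file that the full script below reads:

import pandas as pd

data_df = pd.read_csv("combined_dataset_latest-ukmod.csv")

# Pearson correlation between cumulative and daily case counts;
# a value close to 1 reflects the strong relationship described above
print(data_df[['cases_total', 'cases_new']].corr())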

The fields used in the training data were:

  • DATE
  • npi_school_closing
  • npi_workplace_closing
  • npi_cancel_public_events
  • npi_gatherings_restrictions
  • npi_close_public_transport
  • npi_stay_at_home
  • npi_internal_movement_restrictions
  • npi_international_travel_controls
  • npi_income_support
  • npi_debt_relief
  • npi_fiscal_measures
  • npi_international_support
  • npi_public_information
  • npi_testing_policy
  • npi_contact_tracing
  • npi_healthcare_investment
  • npi_vaccine_investment
  • npi_stringency_index
  • npi_masks
  • cases_total
  • tests_total
  • tests_new
  • tests_total_per_thousand
  • tests_new_per_thousand
  • tests_new_smoothed
  • tests_new_smoothed_per_thousand
  • stats_population
  • stats_population_density
  • stats_median_age
  • stats_gdp_per_capita
  • mobility_retail_recreation
  • mobility_grocery_pharmacy
  • mobility_parks
  • mobility_transit_stations
  • mobility_workplaces
  • mobility_residential
  • mobility_travel_driving
  • mobility_travel_transit
  • mobility_travel_walking
  • stats_hospital_beds_per_1000
  • stats_smoking
  • stats_population_urban
  • stats_population_school_age
  • weather_precipitation_mean
  • weather_humidity_mean
  • weather_sw_radiation_mean
  • weather_temperature_mean
  • weather_temperature_min
  • weather_temperature_max
  • weather_wind_speed_mean


These fields were not used:

  • deaths_total
  • deaths_new
  • cases_total_per_million
  • cases_new_per_million
  • deaths_total_per_million
  • deaths_new_per_million
  • deaths_excess_daily_avg
  • deaths_excess_weekly
  • cases_days_since_first
  • deaths_days_since_first

New cases per day were predicted for the UK and Thailand. When making predictions for the UK, all countries except the UK were used to train the model; likewise, when making predictions for Thailand, all countries except Thailand were used. Nothing was changed in the data: this is a look at how well the model can predict new cases per day based on the actual data for the UK and Thailand.
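A sketch of this leave-one-country-out split, assuming the dataset identifies countries by an ISO column as the DELVE data does (the full script below achieves the same split positionally, by placing the held-out country's rows at the end of the CSV and slicing off the last test_size rows):

import pandas as pd

data_df = pd.read_csv("combined_dataset_latest-ukmod.csv")

# Hold out one country entirely and train on all the others.
# "ISO" is assumed to be the country identifier column;
# adjust the name if your copy of the data differs.
target_iso = "GBR"  # United Kingdom
train_df = data_df[data_df["ISO"] != target_iso]
test_df = data_df[data_df["ISO"] == target_iso]

print(len(train_df), "training rows,", len(test_df), "test rows")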

The graphs below show the predictions from the SVMR based on the actual data for the UK and Thailand. For both countries the predictions are fairly close to what actually happened: they are a little low at the beginning of the outbreak but start to converge towards the end.
[Figure: svr-uk-actual, predicted vs actual new cases per day for the UK]
[Figure: svr-thai-actual, predicted vs actual new cases per day for Thailand]



Now that we have a baseline, I wanted to look at how face mask use and temperature can affect the predicted new cases per day. For these predictions I changed the UK face mask policy to mandatory everywhere and changed the humidity and temperatures to be comparable, but not identical, to Thailand's. For Thailand I did the opposite: I changed the face mask policy to none and reduced the temperatures and humidity to be similar, but not identical, to the UK's.
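A sketch of these counterfactual edits (the original DELVE file name, the ISO column, and the npi_masks coding, where a higher value means a stricter policy, are all assumptions; the exact temperature and humidity shifts shown are purely illustrative):

import pandas as pd

# Assumed name of the unmodified DELVE file; the script below reads
# the modified copy, combined_dataset_latest-ukmod.csv
data_df = pd.read_csv("combined_dataset_latest.csv")
uk = data_df["ISO"] == "GBR"  # assumed country identifier column

# Counterfactual 1: face masks mandatory everywhere for the UK
# (assumes higher npi_masks values encode stricter policies)
data_df.loc[uk, "npi_masks"] = data_df["npi_masks"].max()

# Counterfactual 2: warmer, more humid weather, closer to Thailand's
data_df.loc[uk, "weather_temperature_mean"] += 10.0  # illustrative shift
data_df.loc[uk, "weather_humidity_mean"] += 15.0     # illustrative shift

# Save the modified copy for the prediction script to read
data_df.to_csv("combined_dataset_latest-ukmod.csv", index=False)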


The graphs below show the effect of changing the face mask policies. For the UK the predicted number of new cases per day has decreased: compared to the predictions from the actual data above, the numbers have dropped from around 4000 at the peak to around 3500.
For Thailand the predicted number of new cases per day has increased, from around 125 to around 180.
[Figure: svr-uk-fm, UK predictions with face masks mandatory everywhere]
[Figure: svr-thai-fm, Thailand predictions with no face mask policy]



The graphs below show the effect of changing the weather as detailed above. For the UK the predicted number of new cases per day has decreased: compared to the predictions from the actual data, the numbers have dropped from around 4000 at the peak to around 3250.
For Thailand the predicted number of new cases per day has increased, from around 125 to around 225.
[Figure: svr-uk-weather, UK predictions with Thailand-like weather]
[Figure: svr-thai-weather, Thailand predictions with UK-like weather]



The graphs below show the effect of changing both the weather and the face mask policy as detailed above. For the UK the predicted number of new cases per day has decreased: compared to the predictions from the actual data, the numbers have dropped from around 4000 at the peak to around 3000.
For Thailand the predicted number of new cases per day has increased, from around 125 to around 300.
[Figure: svr-uk-weather-fm, UK predictions with both changes applied]
[Figure: svr-thai-weather-fm, Thailand predictions with both changes applied]



There may be other factors at play, but I believe this dataset shows that face mask use, temperature and humidity have an effect on the number of cases. As outlined above, I included the total cases in the training data to help suppress changes in the predictions as the face mask policy and weather were changed. Even so, when the temperature and humidity were increased for the UK the numbers went down, and when the face mask policy was changed to mandatory everywhere for the UK the numbers also went down. The opposite happened for Thailand: when the weather was changed to be similar to the UK's, cold and dry, the numbers went up, and when the face mask policy was changed to none the numbers went up.



Python 3.7 Support Vector Machine Regression

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, preprocessing
import pandas as pd
from matplotlib import style
import math
from sklearn.model_selection import TimeSeriesSplit,GridSearchCV,train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import make_scorer
import datetime as dt
from sklearn import neighbors
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
style.use("ggplot")

# Did not use these features,
#'deaths_total', 'deaths_new', 'cases_total_per_million', 'cases_new_per_million', 'deaths_total_per_million', 'deaths_new_per_million','deaths_excess_daily_avg', 'deaths_excess_weekly', 'cases_days_since_first', 'deaths_days_since_first', 

FEATURES =  [
             'DATE','npi_school_closing', 'npi_workplace_closing', 'npi_cancel_public_events',	'npi_gatherings_restrictions','npi_close_public_transport',
             'npi_stay_at_home', 'npi_internal_movement_restrictions', 'npi_international_travel_controls','npi_income_support','npi_debt_relief', 'npi_fiscal_measures',
             'npi_international_support', 'npi_public_information', 'npi_testing_policy', 'npi_contact_tracing', 'npi_healthcare_investment', 'npi_vaccine_investment',
             'npi_stringency_index', 'npi_masks', 'cases_total', 'tests_total', 'tests_new', 'tests_total_per_thousand', 'tests_new_per_thousand', 'tests_new_smoothed', 'tests_new_smoothed_per_thousand',
             'stats_population', 'stats_population_density','stats_median_age', 'stats_gdp_per_capita', 'mobility_retail_recreation', 'mobility_grocery_pharmacy',
             'mobility_parks', 'mobility_transit_stations', 'mobility_workplaces', 'mobility_residential', 'mobility_travel_driving', 'mobility_travel_transit', 'mobility_travel_walking',
             'stats_hospital_beds_per_1000','stats_smoking', 'stats_population_urban', 'stats_population_school_age', 'weather_precipitation_mean', 'weather_humidity_mean',
             'weather_sw_radiation_mean','weather_temperature_mean', 'weather_temperature_min','weather_temperature_max','weather_wind_speed_mean'
             ]

def Build_Data_Set():    
    data_df = pd.read_csv("combined_dataset_latest-ukmod.csv")
    # Keep the original date strings for plotting later
    dates = (data_df["DATE"].values.tolist())
    # Convert DATE to seconds since the Unix epoch so it can be used as a numeric feature
    data_df['DATE'] = pd.to_datetime(data_df['DATE'])
    data_df['DATE'] = (data_df['DATE'] - dt.datetime(1970,1,1)).dt.total_seconds()
    
    scaler = preprocessing.StandardScaler()
    # The columns to standardise are exactly the FEATURES list, so reuse it
    data_df[FEATURES] = scaler.fit_transform(data_df[FEATURES])
    

    #Shuffled data when tuning
    #data_df = data_df.reindex(np.random.permutation(data_df.index))
    #data_df.fillna(data_df.mean(), inplace=True)
    data_df.fillna(0, inplace=True)
    
    X = np.array(data_df[FEATURES].values)

    y = (data_df["cases_new"].values.tolist())

    #Split the data for training and test
##    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, shuffle=True, random_state=42)
##    X_train = preprocessing.scale(X_train)
##    X_test = preprocessing.scale(X_test)
  
    # Standardise again after the NaN fill so the zero-filled gaps do not skew the features
    X = preprocessing.scale(X)
    
    return X,y,dates


def Analysis():

    test_size = 169
    X, y, dates = Build_Data_Set()

##    #Tuning          
##    parameters = [{'kernel': ['rbf'], 'epsilon': [0.01, 0.1, 3, 5], 'gamma': [0.00001,0.005,0.01,0.02],'C': [100,1000,5000]}]
##    scorer = make_scorer(mean_squared_error, greater_is_better=False)
##    print("Tuning hyper-parameters")
##    svr = GridSearchCV(svm.SVR(), parameters, cv = 3, scoring=scorer, n_jobs=-1)
##    svr.fit(X, y)
##
##    # Checking the score for all parameters
##    print("Grid scores on training set:")
##    means = svr.cv_results_['mean_test_score']
##    stds = svr.cv_results_['std_test_score']
##    for mean, std, params in zip(means, stds, svr.cv_results_['params']):
##        print("%0.3f (+/-%0.03f) for %r"% (mean, std * 2, params))


    regr = svm.SVR(kernel='rbf', C=1000, gamma=0.01, epsilon=0.1)
    regr.fit(X[:-test_size],y[:-test_size])

    #KNN_ = neighbors.KNeighborsRegressor(algorithm='kd_tree', leaf_size=120, p=10, n_neighbors=500, weights='distance').fit(X[:-test_size],y[:-test_size])
    #dtree = DecisionTreeRegressor()
    #dtree.fit(X[:-test_size],y[:-test_size])
    #model_rf = RandomForestRegressor(n_estimators=1000)
    #model_rf.fit(X[:-test_size],y[:-test_size]) 
       
    correct_count = 0
    predictions_list = []
    actuals_list = []
    dates_list = []
    pred_total=0
     
    for x in range(1, test_size+1):
        predictions = regr.predict([X[-x]])
        #predictions = KNN_.predict([X[-x]])
        #predictions = dtree.predict([X[-x]])
        #predictions= model_rf.predict([X[-x]])
        #error = predictions-y[-x]
        #print(error)
        
        print("The Prediction ", predictions)
        print("The Actual ", y[-x])
        predictions_list.append(math.floor(predictions[0]))
        pred_total += math.floor(predictions[0])
        actuals_list.append(y[-x])
        dates_list.append(dates[-x])
        
        allowed_error = y[-x] * 0.2
        
        # Count a prediction as correct if both values are near zero,
        # or if it falls within 20% of the actual value
        if abs(predictions[0]) <= 5 and y[-x] <= 5:
            correct_count += 1
        elif abs(predictions[0] - y[-x]) <= allowed_error:
            correct_count += 1
    predictions_list.reverse()
    actuals_list.reverse()
    dates_list.reverse()
    print("Accuracy:", (correct_count/test_size) * 100.00)
    print("Total Predictions", pred_total)
    print("R2 score : %.2f" % r2_score(actuals_list,predictions_list))
    print("Mean squared error: %.2f" % mean_squared_error(actuals_list,predictions_list))

    er = []
    g = 0
    for i in range(len(actuals_list)):
        print( "actual=", actuals_list[i], " observed=", predictions_list[i])
        x = (actuals_list[i] - predictions_list[i]) **2
        er.append(x)
        g = g + x
        
    x = 0
    for i in range(len(er)):
       x = x + er[i]

    print ("MSE", x / len(er))

    v = np.var(er)
    print ("variance", v)

    print ("average of errors ", np.mean(er))

    m = np.mean(actuals_list)
    print ("average of observed values", m)

    y = 0
    for i in range(len(actuals_list)):
        y = y + ((actuals_list[i] - m) ** 2)

    print ("total sum of squares ", y)
    print ("ẗotal sum of residuals ", g)
    print ("r2 calculated", 1 - (g / y))
  
    plt.xticks(rotation=90)
    predg = plt.scatter(dates_list, predictions_list, c='red', alpha=0.5)
    actg = plt.scatter(dates_list, actuals_list, c='blue', alpha=0.5)
    plt.legend((predg, actg),
           ('Predicted', 'Actual'),
           scatterpoints=1,
           loc='upper left',
           ncol=1,
           fontsize=12)
    plt.title("UK Predicted per Day)
    plt.xlabel("Dates")
    plt.ylabel("New Cases per Day")
    plt.show()

    
Analysis()