The COVID-19 pandemic has produced vast quantities of publicly available data of prominent global interest. While the integrity and accuracy of the data have been scrutinized and questioned from nation to nation, the data still provides a great jumping-off point for exploring some fundamental data science tools.
In this post, we will explore how nations compared over their first 40 days of COVID-19 exposure for both infections and mortality. The data set for national infections and deaths is large enough that a brute-force, curve-by-curve comparison would be a challenge. One strategy is to group together nations that followed similar infection and mortality curves and compare the groups rather than the specific nations. One of the common algorithms for executing this type of analysis is k-means clustering.
K-means clustering is a type of unsupervised machine learning that analyzes a data set and assigns `k` cluster centers such that 1) each cluster center is the arithmetic mean of all points in its cluster and 2) every point is closer to its own cluster center than to any other cluster center. K-means then yields `k` sets of data that can be used to represent the entire data set under study. Further to this, new data can be analyzed and assigned to these previously determined clusters, and anything learned from the original clusters can be assumed to apply to the new data.
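To make the two defining properties concrete, here is a minimal sketch on made-up 2-D points (not the COVID-19 data). It assumes scikit-learn's `KMeans`; the point cloud, the seed, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points in three well-separated groups (made-up data for illustration)
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_
centers = kmeans.cluster_centers_

# Property 1: each center is the arithmetic mean of the points assigned to it
centers_are_means = all(
    np.allclose(centers[k], points[labels == k].mean(axis=0), atol=1e-4)
    for k in range(3)
)

# Property 2: every point is nearest to the center of its own cluster
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
points_nearest_own_center = np.array_equal(distances.argmin(axis=1), labels)

# New data is assigned to the nearest of the previously determined centers
new_points = np.array([[0.2, -0.1], [4.8, 5.3]])
new_labels = kmeans.predict(new_points)

print(centers_are_means, points_nearest_own_center, new_labels)
```

The same `predict` call is what would let a curve from outside the original data set be matched to an existing cluster.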
The graphs on this page are produced with plotly and are interactive. For example, clicking lines in the legend can turn data sets on and off; double-clicking turns the others off. Hovering over points on the lines will show the country's name and totals. The scale can be toggled between linear and logarithmic.
This plot represents how countries are assigned to 4 k-means clusters based on their first 40 days after reporting more than 50 COVID-19 infections. Interestingly, of the 121 countries in the data set, 105 were assigned to cluster 0. The remaining 16 countries were spread amongst the top 3 clusters, with the US claiming cluster 3 all to itself. In addition to finishing the 40 days with a much higher number of infections, the US followed a unique trajectory through most of the data set.
In the code below, `modelData` is a dataframe containing infection (or mortality) data as `y`. `x` is the day offset from the date the country recorded more than 50 infections (or 10 deaths). The `testSet` dataframe is simply a listing of the countries from `modelData` that have more than 40 days of recorded data. This is important because the `KMeans` function does not accept `NaN` values, and backfilling them with data would influence the outcome of the clustering algorithm.
```python
print(modelData.head(5).to_markdown())
```
| | x | y | index |
|---:|----:|----:|:------------|
| 0 | 0 | 11 | Afghanistan |
| 1 | 1 | 14 | Afghanistan |
| 2 | 2 | 14 | Afghanistan |
| 3 | 3 | 15 | Afghanistan |
| 4 | 4 | 15 | Afghanistan |
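The post does not show how the listing of countries with enough history (named `testSet` in the code further down) was built. The sketch below is one possible way to do it, under the assumption that it is a dataframe indexed by country name. The toy `modelData`, the `days` column, and `min_days` are hypothetical. It also shows the `NaN` gaps a short-history country would leave after reshaping.

```python
import pandas as pd

# Toy stand-in for modelData: one row per country per offset day (values are made up)
modelData = pd.DataFrame({
    "x": list(range(45)) + list(range(12)),
    "y": list(range(50, 95)) + list(range(50, 62)),
    "index": ["Country A"] * 45 + ["Country B"] * 12,
})

min_days = 40  # length of the comparison window used in this post

# Count recorded days per country and keep only countries with more than 40 of them
days_recorded = modelData.groupby("index")["x"].count()
testSet = days_recorded[days_recorded > min_days].to_frame(name="days")

# Reshape to one row per country and one column per day; short histories leave NaN gaps
wide = modelData[modelData["x"] < min_days].pivot(index="index", columns="x", values="y")
print(wide.isna().sum(axis=1))
```

Filtering on `testSet.index` before reshaping removes the rows with gaps, so nothing has to be backfilled.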
The following code covers the basics of executing the k-means clustering algorithm and plotting the results.
```python
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import plotly.express as px

# set our number of clusters
cluster_num = 4

# build a dataframe based on our model and test set and drop data beyond the first 40 days
clusterData = modelData.loc[modelData['index'].isin(testSet.index.values)]
clusterData = clusterData[clusterData['x'] < 40]

# set the index to the country name and the x value, leaving y as the only data column
# each country then has 40 y values in a single column, one per x value
# unstack spreads the y values across the columns, one column per x value
# reset_index restores the country name as the 'index' column, which is dropped before fitting
clusterData = clusterData.set_index(['index', 'x']).unstack().reset_index()
growthCluster = clusterData.drop(columns="index")

# execute the k-means algorithm and fit the data
kmeans = KMeans(n_clusters=cluster_num)
kmeans.fit(growthCluster)

# predict the cluster for each row of the data set
y_kmeans = kmeans.predict(growthCluster)

# add the predicted cluster back into the dataframe that carries the country name
clusterData['Cluster'] = y_kmeans

# set the index to identify Country and Cluster and then 'stack' the results
# resetting the index afterwards yields 'tidy' data with one row per country per day
plotData = clusterData.set_index(['index', 'Cluster']).stack().reset_index()

# plot the data using plotly express
fig = px.line(plotData, x="x", y="y", color="Cluster", line_group="index", hover_name="index",
              line_shape="spline", render_mode="svg")
fig.show()
```
If we were interested in doing a deeper study on the hardest hit nations, we could then drop the countries that fell into cluster 0, representing the flattest overall curve, and focus our attention on the countries in the other three.
Nation | Count @ Day 40 | Day Zero | Cluster |
---|---|---|---|
US | 275367 | 2020-02-24 | 3 |
Turkey | 110130 | 2020-03-18 | 2 |
Spain | 153222 | 2020-03-01 | 2 |
Germany | 113296 | 2020-02-29 | 2 |
Italy | 110574 | 2020-02-22 | 2 |
China | 79932 | 2020-01-22 | 2 |
Brazil | 40743 | 2020-03-12 | 1 |
Russia | 57999 | 2020-03-14 | 1 |
United Kingdom | 79874 | 2020-03-03 | 1 |
Canada | 28209 | 2020-03-07 | 1 |
Iran | 53183 | 2020-02-24 | 1 |
Portugal | 20206 | 2020-03-11 | 1 |
Belgium | 30589 | 2020-03-05 | 1 |
France | 79163 | 2020-02-28 | 1 |
Netherlands | 26710 | 2020-03-05 | 1 |
Switzerland | 25107 | 2020-03-03 | 1 |
Note: Going up to 5 clusters only moved 15 countries out of cluster 0 and into the upper clusters, leaving 90 in the lowest. With 8 clusters created, cluster 0 still had 79 countries.
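The comparison in the note can be reproduced with a short loop over candidate cluster counts. The sketch below uses made-up curves rather than the real data, so the sizes it prints will not match the numbers above. The curve shape, the seeds, and the `cluster_sizes` dictionary are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up cumulative curves: 40 days for 60 hypothetical countries, most of them small
rng = np.random.default_rng(1)
days = np.arange(40)
final_counts = np.concatenate([
    rng.uniform(100, 5_000, size=50),        # many small outbreaks
    rng.uniform(20_000, 80_000, size=8),     # a few large ones
    rng.uniform(150_000, 300_000, size=2),   # a couple of extreme ones
])
curves = final_counts[:, None] * (days / days[-1]) ** 3   # shape (60, 40)

cluster_sizes = {}
for k in (4, 5, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(curves)
    # Count members per cluster and list the largest cluster first
    sizes = np.sort(np.bincount(labels, minlength=k))[::-1]
    cluster_sizes[k] = sizes.tolist()
    print(f"k={k}: cluster sizes (largest first) = {cluster_sizes[k]}")
```

Sorting the sizes sidesteps the fact that scikit-learn assigns cluster numbers arbitrarily, so the largest group is not guaranteed to be labelled 0.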
Potentially more interesting than knowing which cluster each country falls into are the cluster centers themselves. The line representing each cluster's geometric center can be plotted by pulling the lists from the k-means data (`_y = kmeans.cluster_centers_[i]`), where `i` is 0 to 3. These lines represent the 4 models that k-means clustering could predict against for countries outside of the data set. They could also be used for similar future infections (COVID-26?) to help nations determine which model they are tracking closest to.
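A minimal sketch of both ideas, again on made-up curves: it reshapes `kmeans.cluster_centers_` into tidy rows that can be plotted as lines, and assigns a hypothetical new country to its closest center. Ranking the clusters by their final-day value is an assumption added here so that 0 is the flattest curve. The names `rank_of_label`, `center_lines`, and `new_curve` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Made-up 40-day cumulative curves for 30 hypothetical countries
rng = np.random.default_rng(2)
days = np.arange(40)
final_counts = np.concatenate([
    rng.uniform(500, 5_000, size=20),
    rng.uniform(40_000, 60_000, size=7),
    rng.uniform(200_000, 250_000, size=3),
])
curves = final_counts[:, None] * (days / days[-1]) ** 3

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(curves)

# scikit-learn numbers clusters arbitrarily; rank them by the final-day value of each
# center so that 0 is the flattest curve and the highest number is the steepest
final_values = kmeans.cluster_centers_[:, -1]
rank_of_label = np.argsort(np.argsort(final_values))

# One row per (cluster, day) so the centers can be drawn like any other line
centers = pd.DataFrame(kmeans.cluster_centers_, columns=days)
centers.insert(0, "Cluster", rank_of_label)
center_lines = centers.melt(id_vars="Cluster", var_name="x", value_name="y")

# A hypothetical country outside the data set is matched to its nearest center
new_curve = 55_000 * (days / days[-1]) ** 3
raw_label = kmeans.predict(new_curve.reshape(1, -1))[0]
new_cluster = rank_of_label[raw_label]
print(f"hypothetical new country tracks cluster {new_cluster}")
```

`center_lines` has the same tidy layout as the per-country data, so it could be passed to `px.line(center_lines, x="x", y="y", color="Cluster")`.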
Similarly, the same algorithm can be run against mortality data.
There were only 67 countries with more than 40 days of reporting deaths. Of these 67, 15 were assigned to clusters 1 - 3 and 52 were assigned to cluster 0. With mortality, the US ended with a higher number of deaths at the end of 40 days but because the curve was similar to both France and Italy, they were all assigned to the same cluster. Again, we can drop the countries in cluster 0 and focus attention on those in clusters 1 - 3.
Nation | Count @ Day 40 | Day Zero | Cluster |
---|---|---|---|
United Kingdom | 20264 | 2020-03-13 | 3 |
Spain | 18708 | 2020-03-07 | 3 |
US | 26086 | 2020-03-04 | 2 |
Italy | 15362 | 2020-02-25 | 2 |
France | 17169 | 2020-03-07 | 2 |
Mexico | 2507 | 2020-03-27 | 1 |
Brazil | 5083 | 2020-03-20 | 1 |
Canada | 2983 | 2020-03-20 | 1 |
Sweden | 2194 | 2020-03-18 | 1 |
Belgium | 6917 | 2020-03-17 | 1 |
Germany | 5575 | 2020-03-15 | 1 |
Turkey | 3174 | 2020-03-22 | 1 |
Iran | 3294 | 2020-02-24 | 1 |
Netherlands | 3929 | 2020-03-13 | 1 |
China | 2872 | 2020-01-22 | 1 |
Similarly, the mortality curve cluster centers can be extracted and plotted.
With both sets of data, we can simply sum the two cluster numbers together to get a quick metric of how each country fared across both infections and mortality. All things being equal, you would expect close alignment between the infection and mortality clusters. Discrepancies here would be another opportunity to pull together more data and try to determine possible causes. For the sake of trimming our data set again, we will continue to ignore countries that were assigned to cluster 0 for both infections and mortality.
Nation | Infection Cluster | Infections @ Day 40 | Mortality Cluster | Deaths @ Day 40 | Cluster Sum |
---|---|---|---|---|---|
Spain | 2 | 153222 | 3 | 18708 | 5 |
US | 3 | 275367 | 2 | 26086 | 5 |
Italy | 2 | 110574 | 2 | 15362 | 4 |
United Kingdom | 1 | 79874 | 3 | 20264 | 4 |
Turkey | 2 | 110130 | 1 | 3174 | 3 |
France | 1 | 79163 | 2 | 17169 | 3 |
Germany | 2 | 113296 | 1 | 5575 | 3 |
China | 2 | 79932 | 1 | 2872 | 3 |
Belgium | 1 | 30589 | 1 | 6917 | 2 |
Iran | 1 | 53183 | 1 | 3294 | 2 |
Canada | 1 | 28209 | 1 | 2983 | 2 |
Netherlands | 1 | 26710 | 1 | 3929 | 2 |
Brazil | 1 | 40743 | 1 | 5083 | 2 |
Portugal | 1 | 20206 | 0 | 973 | 1 |
Mexico | 0 | 11633 | 1 | 2507 | 1 |
Sweden | 0 | 10948 | 1 | 2194 | 1 |
Russia | 1 | 57999 | 0 | 1827 | 1 |
Switzerland | 1 | 25107 | 0 | 1478 | 1 |
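The cluster sum in the table above amounts to a merge on the country name followed by an addition. Here is a minimal sketch using a few rows copied from the tables in this post. The dataframe names are assumptions, and it presumes the cluster numbers in both runs are ordered from flattest (0) to steepest.

```python
import pandas as pd

# A few rows copied from the infection and mortality tables in this post
infections = pd.DataFrame({
    "Nation": ["Spain", "US", "Italy", "Portugal", "Mexico"],
    "Infection Cluster": [2, 3, 2, 1, 0],
})
mortality = pd.DataFrame({
    "Nation": ["Spain", "US", "Italy", "Portugal", "Mexico"],
    "Mortality Cluster": [3, 2, 2, 0, 1],
})

# An inner merge keeps only nations that appear in both clustering runs
combined = infections.merge(mortality, on="Nation", how="inner")
combined["Cluster Sum"] = combined["Infection Cluster"] + combined["Mortality Cluster"]

# Drop nations that landed in cluster 0 on both measures, then rank by the combined score
combined = combined[combined["Cluster Sum"] > 0]
combined = combined.sort_values("Cluster Sum", ascending=False, kind="stable")
combined = combined.reset_index(drop=True)
print(combined)
```

Because the sum treats cluster numbers as an ordinal score, it is only meaningful if both runs number their clusters by severity in the same direction.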
If our interest was in studying the countries with more favourable COVID-19 trajectories over the first 40 days, we could simply focus our attention on the countries in cluster 0 and drop clusters 1 - 3. We could then continue with more k-means clustering on this sub-set to look for patterns and valid models in these countries as well. The following table shows our original cluster 0 sub-clustered into 4.
Cluster Zero Nation | Count @ Day 40 | Day Zero | Sub Cluster |
---|---|---|---|
Peru | 21648 | 2020-03-16 | 3 |
Chile | 11296 | 2020-03-14 | 3 |
Israel | 12758 | 2020-03-08 | 3 |
Ireland | 16040 | 2020-03-13 | 3 |
Ecuador | 22719 | 2020-03-17 | 3 |
Korea, South | 9661 | 2020-02-20 | 3 |
Bangladesh | 13770 | 2020-03-31 | 3 |
Austria | 14226 | 2020-03-06 | 3 |
India | 15722 | 2020-03-10 | 3 |
Norway | 6525 | 2020-03-04 | 2 |
Poland | 9856 | 2020-03-13 | 2 |
Ukraine | 10406 | 2020-03-22 | 2 |
Serbia | 7483 | 2020-03-16 | 2 |
Dominican Republic | 6416 | 2020-03-20 | 2 |
Denmark | 7268 | 2020-03-09 | 2 |
Australia | 6315 | 2020-03-04 | 2 |
Indonesia | 7135 | 2020-03-13 | 2 |
Sweden | 10948 | 2020-03-05 | 2 |
Belarus | 10463 | 2020-03-18 | 2 |
Romania | 9242 | 2020-03-13 | 2 |
Saudi Arabia | 11631 | 2020-03-13 | 2 |
Philippines | 6459 | 2020-03-12 | 2 |
Pakistan | 11155 | 2020-03-15 | 2 |
Mexico | 11633 | 2020-03-15 | 2 |
Czechia | 6746 | 2020-03-11 | 2 |
United Arab Emirates | 6302 | 2020-03-10 | 1 |
Argentina | 3607 | 2020-03-16 | 1 |
Uzbekistan | 2118 | 2020-03-24 | 1 |
Kazakhstan | 3138 | 2020-03-21 | 1 |
Morocco | 4120 | 2020-03-19 | 1 |
Hungary | 2443 | 2020-03-17 | 1 |
South Africa | 3953 | 2020-03-15 | 1 |
Finland | 3783 | 2020-03-11 | 1 |
Moldova | 3638 | 2020-03-20 | 1 |
Algeria | 3127 | 2020-03-16 | 1 |
Qatar | 5448 | 2020-03-11 | 1 |
Luxembourg | 3654 | 2020-03-14 | 1 |
Colombia | 4881 | 2020-03-16 | 1 |
Croatia | 2009 | 2020-03-16 | 1 |
Greece | 2207 | 2020-03-08 | 1 |
Malaysia | 4683 | 2020-03-04 | 1 |
Egypt | 2844 | 2020-03-09 | 1 |
Thailand | 2643 | 2020-03-07 | 1 |
Panama | 5338 | 2020-03-16 | 1 |
Estonia | 1552 | 2020-03-13 | 0 |
Lithuania | 1375 | 2020-03-21 | 0 |
Japan | 1387 | 2020-02-16 | 0 |
Rwanda | 261 | 2020-03-26 | 0 |
Slovakia | 1325 | 2020-03-15 | 0 |
Kuwait | 993 | 2020-03-02 | 0 |
Cuba | 1649 | 2020-03-25 | 0 |
Taiwan* | 425 | 2020-03-13 | 0 |
Liechtenstein | 82 | 2020-03-23 | 0 |
Mauritius | 332 | 2020-03-26 | 0 |
Diamond Princess | 706 | 2020-02-07 | 0 |
Trinidad and Tobago | 116 | 2020-03-22 | 0 |
Montenegro | 322 | 2020-03-25 | 0 |
Iceland | 1727 | 2020-03-07 | 0 |
Tunisia | 975 | 2020-03-20 | 0 |
Brunei | 138 | 2020-03-15 | 0 |
West Bank and Gaza | 344 | 2020-03-22 | 0 |
New Zealand | 1476 | 2020-03-21 | 0 |
Monaco | 96 | 2020-03-31 | 0 |
Slovenia | 1330 | 2020-03-11 | 0 |
Andorra | 743 | 2020-03-19 | 0 |
Cyprus | 822 | 2020-03-19 | 0 |
Burkina Faso | 641 | 2020-03-21 | 0 |
Sri Lanka | 523 | 2020-03-18 | 0 |
Malta | 450 | 2020-03-19 | 0 |
Iraq | 1415 | 2020-03-07 | 0 |
Cote d'Ivoire | 1362 | 2020-03-24 | 0 |
Armenia | 1596 | 2020-03-16 | 0 |
Cameroon | 1832 | 2020-03-23 | 0 |
Honduras | 1178 | 2020-03-26 | 0 |
Azerbaijan | 1766 | 2020-03-21 | 0 |
Guinea | 2146 | 2020-04-02 | 0 |
Nigeria | 2558 | 2020-03-25 | 0 |
Paraguay | 431 | 2020-03-27 | 0 |
Oman | 2274 | 2020-03-21 | 0 |
Ghana | 2169 | 2020-03-24 | 0 |
Bahrain | 1136 | 2020-03-04 | 0 |
Congo (Kinshasa) | 682 | 2020-03-26 | 0 |
Senegal | 933 | 2020-03-22 | 0 |
Afghanistan | 2469 | 2020-03-24 | 0 |
Kyrgyzstan | 843 | 2020-03-27 | 0 |
Singapore | 455 | 2020-02-12 | 0 |
Latvia | 812 | 2020-03-18 | 0 |
Madagascar | 193 | 2020-03-31 | 0 |
Albania | 678 | 2020-03-16 | 0 |
Bosnia and Herzegovina | 1565 | 2020-03-19 | 0 |
Uruguay | 596 | 2020-03-17 | 0 |
San Marino | 455 | 2020-03-10 | 0 |
Vietnam | 268 | 2020-03-14 | 0 |
Kosovo | 855 | 2020-03-26 | 0 |
Georgia | 539 | 2020-03-22 | 0 |
Costa Rica | 695 | 2020-03-18 | 0 |
North Macedonia | 1421 | 2020-03-20 | 0 |
Bolivia | 1802 | 2020-03-27 | 0 |
Niger | 821 | 2020-04-01 | 0 |
Lebanon | 673 | 2020-03-11 | 0 |
Bulgaria | 1097 | 2020-03-15 | 0 |
Venezuela | 331 | 2020-03-21 | 0 |
Jordan | 447 | 2020-03-18 | 0 |
Kenya | 621 | 2020-03-30 | 0 |
Cambodia | 122 | 2020-03-20 | 0 |
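The sub-clustering step behind the table above can be sketched as a second k-means pass restricted to the members of the flattest cluster. The example below uses made-up curves, three first-pass clusters instead of four, and a hypothetical `-1` marker for countries outside the sub-clustered group.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 40-day cumulative curves for 48 hypothetical countries
rng = np.random.default_rng(3)
days = np.arange(40)
final_counts = np.concatenate([
    rng.uniform(100, 20_000, size=40),
    rng.uniform(80_000, 120_000, size=6),
    rng.uniform(250_000, 300_000, size=2),
])
curves = final_counts[:, None] * (days / days[-1]) ** 3

first_pass = KMeans(n_clusters=3, n_init=10, random_state=0).fit(curves)

# scikit-learn numbers clusters arbitrarily, so find the flattest cluster
# by the final-day value of its center rather than assuming it is label 0
flattest = first_pass.cluster_centers_[:, -1].argmin()
in_flattest = first_pass.labels_ == flattest
sub_curves = curves[in_flattest]

# Second pass: cluster only the countries from the flattest group into 4 sub-clusters
second_pass = KMeans(n_clusters=4, n_init=10, random_state=0).fit(sub_curves)

# Map sub-cluster labels back to the full list; -1 marks countries outside the group
sub_labels = np.full(len(curves), -1)
sub_labels[in_flattest] = second_pass.labels_
print(np.bincount(second_pass.labels_, minlength=4))
```

Re-fitting on the subset lets the centers adapt to the much smaller scale of these curves, which the first pass could not resolve because the extreme countries dominated the distances.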
Covid-19 Data Source: https://github.com/CSSEGISandData/COVID-19
This is the data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).
Thank you to them for making this data available.