by: Ilyas Ustun
Welcome to my analysis of COVID-19, where I use visualization to examine the current situation and how quickly the disease is spreading in selected countries.
The data for this analysis was obtained from https://www.kaggle.com/c/covid19-global-forecasting-week-3/data which is curated by Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE).
# Make plots wider
options(repr.plot.width=10, repr.plot.height=5, repr.plot.res=250)
## Importing packages
library(tidyverse) # metapackage with lots of helpful functions
# list.files(path = "covid19-global-forecasting-week-3")
df = read_csv("covid19-global-forecasting-week-3/train.csv");
# df_tst = read_csv("../input/covid19-global-forecasting-week-2/test.csv")
# df_sub = read_csv("../input/covid19-global-forecasting-week-2/submission.csv")
# df %>% str()
Sample data:
df %>% head()
df %>% summary()
# df$Country_Region %>% unique()
Let's get the daily total cases and fatalities for each country.
df_daily = df %>%
  group_by(Country_Region, Date) %>%
  summarize(ConfirmedCases = sum(ConfirmedCases), Fatalities = sum(Fatalities)) %>%
  ungroup()
library(scales)
library(RColorBrewer)
df_daily_nz = df_daily %>%
  filter(ConfirmedCases > 0) %>%
  # mutate(ConfirmedCases = ifelse(ConfirmedCases==0, 1,ConfirmedCases)) %>%
  group_by(Country_Region) %>%
  arrange(Date, .by_group = TRUE)
df_daily_nz = df_daily_nz %>%
  group_by(Country_Region) %>%
  mutate(NumberDays = rank(Date)) %>%
  ungroup()
df_daily_nz %>% filter(Country_Region=='US')
Above we can see the data belonging only to the US. The ConfirmedCases and Fatalities columns hold cumulative counts.
The US series looks odd: its first observation already has 892 cases, as if data from earlier days was lost. Alternatively, that many cases may simply have been reported in a single day; I can't tell from this data alone.
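One way to look into a jump like this is to pull out the first non-zero observation for every country and compare. The sketch below uses a small made-up table (`toy_daily`, a hypothetical stand-in for `df_daily`) so it runs on its own; the numbers are illustrative and are not taken from the dataset.

```r
library(dplyr)

# Toy stand-in for df_daily (made-up numbers, for illustration only)
toy_daily <- tibble::tibble(
  Country_Region = c("A", "A", "A", "B", "B", "B"),
  Date = as.Date(c("2020-03-01", "2020-03-02", "2020-03-03",
                   "2020-03-01", "2020-03-02", "2020-03-03")),
  ConfirmedCases = c(0, 2, 5, 0, 0, 40)
)

# Keep only the earliest day with at least one confirmed case, per country
first_obs <- toy_daily %>%
  filter(ConfirmedCases > 0) %>%
  group_by(Country_Region) %>%
  slice_min(Date, n = 1) %>%
  ungroup()

first_obs
```

A country whose first row already shows a large count (like "B" here) is a candidate for missing earlier data or a bulk report. Note that `slice_min()` assumes dplyr 1.0.0 or later.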
countries = c('US', 'Turkey', 'Germany', 'Italy', 'Spain', 'Netherlands', 'Belgium', 'Canada')
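# Rank countries by cumulative cases (1 = most cases).
# Note: the grouping is by NumberDays, so each country is ranked against
# other countries at the same number of days since their first case.
df_daily_nz = df_daily_nz %>%
  group_by(NumberDays) %>%
  arrange(desc(ConfirmedCases)) %>%
  mutate(CountryRank = rank(desc(ConfirmedCases)))
# df_daily_nz %>% head()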
library(RColorBrewer)
darkcols <- brewer.pal(n=8, "Dark2")
palette = c('#d6d6c2', darkcols)
# palette
# print(palette)
The countries selected for this analysis are listed below.
countries
# df_daily_nz %>%
# mutate(Highlight=if_else(Country_Region %in% countries, Country_Region, "1_other")) %>%
# # filter(Country_Region %in% countries) %>%
# ggplot(aes(x=NumberDays, y=ConfirmedCases, color=Highlight, group=Country_Region)) +
# geom_line(size=1, alpha=0.8) +
# scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
# labels = trans_format("log10", math_format(10^.x))) +
# scale_x_continuous(name='Number of Days', breaks = seq(0, 70, 4)) +
# scale_color_manual(values = palette) +
# theme(axis.text.x = element_text(angle=90))
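The following plot shows how quickly these countries reached the top 10 in terms of cumulative number of people infected. Note that CountryRank is computed within each NumberDays group, so a country is ranked against other countries at the same number of days since their first case, not against all countries on the same calendar date. Netherlands and Turkey seem to reach the top 10 quite quickly.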
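For comparison, a rank "in the world on a given day" would group by calendar date instead of by days since the first case. The following is a minimal sketch of that alternative on made-up data; `toy` and `DateRank` are hypothetical names, and this is not the ranking used for the plot in this notebook.

```r
library(dplyr)

# Made-up cumulative counts for three countries on two dates
toy <- tibble::tibble(
  Country_Region = rep(c("A", "B", "C"), each = 2),
  Date = rep(as.Date(c("2020-03-01", "2020-03-02")), times = 3),
  ConfirmedCases = c(10, 50, 30, 40, 20, 60)
)

# Rank countries against each other on the same calendar date (1 = most cases)
toy_ranked <- toy %>%
  group_by(Date) %>%
  mutate(DateRank = rank(desc(ConfirmedCases))) %>%
  ungroup()

toy_ranked
```

Here country "A" moves from rank 3 on the first date to rank 2 on the second, which is the kind of movement a date-based rank plot would show.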
df_daily_nz %>%
  # mutate(Highlight=if_else(Country_Region %in% countries, Country_Region, "zOther")) %>%
  filter(Country_Region %in% countries) %>%
  ggplot(aes(x=Date, y=CountryRank, color=Country_Region)) +
  geom_line(size=1.5, alpha=0.8) +
  # scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
  #               labels = trans_format("log10", math_format(10^.x))) +
  scale_y_reverse() +
  scale_x_date(name='', date_breaks = '4 days') +
  scale_color_brewer(palette = "Dark2") +
  theme(axis.text.x = element_text(angle=90))
The x axis shows the date
The y axis shows the cumulative number of cases in 10-fold increments (log10 formatted)
df_daily_nz %>%
  filter(Country_Region %in% countries) %>%
  ggplot(aes(x=Date, y=ConfirmedCases, color=Country_Region)) +
  geom_line(size=1.5, alpha=0.8) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_x_date(name='', date_breaks = '4 days') +
  scale_color_brewer(palette = "Dark2") +
  theme(axis.text.x = element_text(angle=90))
The above plot shows how the cumulative number of cases is increasing in each country. We see that the increase in the US is quite worrisome. There is another country where the epidemic is growing rapidly, and that is Turkey.
To understand this better, let's align each country's origin to the day its first case was observed. The x axis will then be the number of days since the first observed case.
df_daily_nz %>%
  filter(Country_Region %in% countries) %>%
  ggplot(aes(x=NumberDays, y=ConfirmedCases, color=Country_Region)) +
  geom_line(size=1.5, alpha=0.8) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_x_continuous(name='Number of Days', breaks = seq(0, 70, 4)) +
  scale_color_brewer(palette = "Dark2") +
  theme(axis.text.x = element_text(angle=0))
Lines that rise sharply earlier (further to the left) indicate a faster spread of the virus. The US, Turkey, and the Netherlands show a rapid increase compared to the other countries. Turkey, especially in the first week, shows a much steeper slope, meaning an even quicker increase in the cumulative number of people infected. This is very serious: it will lead to hospitals getting overwhelmed quickly, which is the main problem experienced in many countries.
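The slope on a log10 axis can also be read off numerically: a constant day-over-day growth factor shows up as a constant vertical step in log10 units. The sketch below illustrates this on made-up numbers; `toy`, `GrowthFactor`, and `Log10Step` are hypothetical names introduced only for this example.

```r
library(dplyr)

# Made-up cumulative counts: "A" doubles daily, "B" grows slowly
toy <- tibble::tibble(
  Country_Region = rep(c("A", "B"), each = 4),
  NumberDays = rep(1:4, times = 2),
  ConfirmedCases = c(10, 20, 40, 80, 10, 11, 12, 13)
)

toy_growth <- toy %>%
  group_by(Country_Region) %>%
  arrange(NumberDays, .by_group = TRUE) %>%
  mutate(
    # Ratio of today's cumulative count to yesterday's
    GrowthFactor = ConfirmedCases / lag(ConfirmedCases),
    # The same change expressed as a step on the log10 axis
    Log10Step = log10(ConfirmedCases) - lag(log10(ConfirmedCases))
  ) %>%
  ungroup()

toy_growth
```

The first row of each country is `NA` because there is no previous day to compare against. A steeper line in the plot corresponds to a larger `Log10Step`.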
The number of cases in Turkey has risen from 1,000 to 10,000 in the last 8 days. That is a 10-fold increase in just over a week!
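To put that in more familiar terms, here is a rough back-of-the-envelope calculation. It assumes growth was steadily exponential over those 8 days, which is a simplification; the symbols $g$ (daily growth factor) and $T_2$ (doubling time) are introduced here only for illustration.

```latex
g^{8} = \frac{10{,}000}{1{,}000} = 10
\quad\Longrightarrow\quad
g = 10^{1/8} \approx 1.33

T_2 = \frac{\ln 2}{\ln g} = \frac{8 \,\ln 2}{\ln 10} \approx 2.4 \text{ days}
```

Under that assumption, cases grew by roughly 33% per day and doubled about every 2.4 days.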
Let's now check the plots for fatalities.
df_daily_nz = df_daily %>%
  filter(Fatalities > 0) %>%
  # mutate(ConfirmedCases = ifelse(ConfirmedCases==0, 1,ConfirmedCases)) %>%
  group_by(Country_Region) %>%
  arrange(Date, .by_group = TRUE)
df_daily_nz = df_daily_nz %>%
  group_by(Country_Region) %>%
  mutate(NumberDays = rank(Date)) %>%
  ungroup()
The x axis shows the date
The y axis shows the cumulative number of fatalities in 10-fold increments (log10 formatted)
df_daily_nz %>%
  filter(Country_Region %in% countries) %>%
  ggplot(aes(x=Date, y=Fatalities, color=Country_Region)) +
  geom_line(size=1.5, alpha=0.8) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_x_date(name = '', date_breaks = '4 days') +
  scale_color_brewer(palette = "Dark2") +
  theme(axis.text.x = element_text(angle=90))
Unfortunately, the dire situation in Italy and Spain is obvious from the graph. Another alarming country is Turkey. Although Turkey does not have as many fatalities, the sharp increase in the first 2 weeks calls for attention.
df_daily_nz %>%
  filter(Country_Region %in% countries) %>%
  ggplot(aes(x=NumberDays, y=Fatalities, color=Country_Region)) +
  geom_line(size=1.5, alpha=0.8) +
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  scale_x_continuous(name='Number of Days', breaks = seq(0, 70, 4)) +
  scale_color_brewer(palette = "Dark2") +
  theme(axis.text.x = element_text(angle=0))
So, the last plot again confirms the severe situation in these countries. The US, Spain, and Turkey are the left-most countries, which means a more rapid spread. In this plot, the earlier (further left) a sharp increase happens, the worse the situation is. Turkey has an especially steep slope at the beginning, which becomes more aligned with the other countries after the first week.
Unfortunately, Italy and Spain have already passed 10,000 fatalities, followed closely by the US. Belgium, the Netherlands, and Germany have more than 1,000 fatalities, closely followed by Turkey. These numbers are worrisome. Canada seems to be doing better than these countries.
These figures show how rapidly COVID-19 is spreading. This is especially dangerous for people who live or work together in large numbers within confined spaces: prisons, care homes, hospitals, universities, schools, factories, malls, and other closed quarters, to name a few. In environments like these, it takes only one infected person to spread the disease to a large number of people. That's why many states have shelter-at-home orders in place. This has led to the closure of many facilities. Economies are taking a big hit because of this, but it is a necessary evil to combat this unseen enemy. Social distancing is one of the best weapons we have for now, until a better solution is found.
This brings me to another point: places where people are forced to live together and can't leave of their own will, namely prisons and jails. I hope that governments worldwide are taking the facts of this pandemic into consideration and are acting accordingly by putting human life first. The overcrowded conditions of prisons in some countries put many human lives at risk. The journalists, professors, researchers, teachers, students, mothers and their babies, and many thousands of innocent people who are imprisoned should be released, effective immediately. Tomorrow might be too late. This is not a time for vengeance; it is a time for mercy and compassion.
Take care, all! I hope we will beat this virus together.
Ilyas Ustun
April 6, 2020
Chicago, IL