Predicting Covid-19 Outcomes using Demographic Data

Final Report

Final Report Video Link

Introduction and Background

Health equity means “everyone has a fair and just opportunity to be as healthy as possible” despite socioeconomic barriers and discrimination (Braveman et al., 2018, p.3). However, research on the COVID-19 pandemic shows that it exacerbated health disparities in vulnerable communities (Andraska et al., 2021). In particular, African American, Hispanic, and Asian American communities show “considerably higher rates of COVID-19 positivity and ICU admission compared to white individuals” (Magesh et al., 2021, “Conclusions”). Additionally, as described in the chapter “The State of Health Disparities in the United States” from the book Communities in Action: Pathways to Health Equity, ethnic minorities experience higher rates of chronic disease and premature death than whites in the United States. Our project examines COVID-19 health outcomes as a function of demographic characteristics such as a person’s age, sex, race, and income. This topic receives wide attention in U.S. healthcare because the U.S. healthcare system spends more per patient than the rest of the world but has subpar health outcomes. Many studies have described this phenomenon across all levels of care, and research continues into the nature and causes of these issues. It is therefore important to use data from the most recent outbreak in this country to understand the current state of health disparities. Below we discuss the dataset and its relevant features. The links for the data are at the bottom.

Problem Definition

Here we introduce the problem at hand and the value of this kind of data for understanding it. The United States is well documented as having one of the most expensive healthcare systems in the world with below-average health outcomes. In addition, there are large disparities in health outcomes based on sex, race, age, and wealth. By analyzing demographic and income data along with patient outcomes, we hope both to compare health outcomes across different groups and to predict health outcomes from group membership. If we can easily and accurately determine the outcome of a patient from their income alone, for example, this would indicate clear disparities in how our health system treats patients with varying wealth levels. From this, we hope to determine which identifiers are most valuable to these models in predicting health outcomes, which would help narrow down where these gaps occur and why.

Data Explanation and Cleaning

Our data was obtained from the United States Government website containing all COVID-19 related data available for public use. The dataset was a join between two separate datasets from that site. The first contained demographic data along with COVID-19 outcomes such as hospitalization, ICU admittance, and death. The initial data, once joined, contained around 100 million rows, but closer inspection showed that the vast majority of rows had many empty values. This is not unexpected given that we had 16 different features and not every hospital system tracks all of them. To clean the data we used pandas to identify rows with gaps, remove them, and make the data types consistent within each feature.
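The cleaning step described above can be sketched roughly as follows. This is a minimal illustration on a toy table, not our exact pipeline: the column names, the placeholder strings ("Missing", "NA"), and the helper name `clean` are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the joined table. Column names and values are hypothetical.
raw = pd.DataFrame({
    "age_group": ["18-49", "65+", "Missing", "50-64", None],
    "sex": ["Female", "Male", "Male", "NA", "Female"],
    "hosp_yn": ["Yes", "No", "Yes", "No", "Yes"],
    "median_income": ["54000", "61000", "47000", "58000", "52000"],
})

def clean(df, placeholders=("Missing", "NA")):
    """Drop rows with gaps and make the income column a consistent integer type."""
    # Assumption: these placeholder strings mark values that were never recorded.
    out = df.mask(df.isin(list(placeholders)))
    # Remove every row that has at least one missing value.
    out = out.dropna()
    # Income arrives as text in this toy table; store it as integers.
    out = out.astype({"median_income": "int64"})
    return out.reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

Dropping every incomplete row is the simplest policy and matches the description above; it is also why a very large raw table can shrink to a small fraction of its original size.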

Data Preprocessing

With our data cleaned, we then needed to do a fair amount of preprocessing. Most of our features were categorical but stored in one column as text. An example is the symptoms feature, which contained two text values, “Symptomatic” or “Not Symptomatic”. This can be one-hot encoded into two columns of 1s and 0s, one for symptomatic and one for not symptomatic patients. We did this for all categorical columns except the categorical ZIP Code variable, which we dropped for now because encoding it would require thousands of columns that would only confuse the model. Instead, we hope to find a model type that does not require encoding for this type of variable and does not assign weight based on the numerical value itself (that is, one that does not treat ZIP code 22222 as worth more or less than ZIP code 11111). In addition, some columns had extra options for NaN values or other uninformative data, so we dropped those as well. From there, the data was fully cleaned, pruned, and transformed so that it could be interpreted by most model types; the only remaining preprocessing was to split the data into training and testing sets for the supervised models we implemented. Below is a table with all of the features we selected and their data types.
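As a rough sketch of the encoding and splitting described above, the snippet below one-hot encodes text columns with `pandas.get_dummies`, drops the ZIP code column, and creates a train/test split. The column names, the toy values, and the 75/25 split ratio are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data. Column names, values, and ZIP codes are hypothetical.
df = pd.DataFrame({
    "symptom_status": ["Symptomatic", "Asymptomatic", "Symptomatic", "Unknown",
                       "Symptomatic", "Asymptomatic", "Symptomatic", "Symptomatic"],
    "sex": ["Female", "Male", "Male", "Female", "Female", "Male", "Female", "Male"],
    "zip_code": ["30332", "30318", "30309", "30332", "30318", "30309", "30332", "30318"],
    "hospitalized": [1, 0, 1, 0, 1, 0, 0, 1],
})

# Drop the ZIP code (too many categories to encode) and separate the label.
features = df.drop(columns=["zip_code", "hospitalized"])
y = df["hospitalized"]

# One 0/1 column per category value of each text column.
X = pd.get_dummies(features, columns=["symptom_status", "sex"], dtype=int)

# Assumed 75/25 split; the report does not state the ratio that was used.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X.columns.tolist())
```

Because each category becomes its own column, the model never sees an ordering between categories, which is the property we wanted to preserve for identifiers such as ZIP codes.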

Class Balancing

It is also important to mention that we had to consider class balancing: as one would expect, the number of hospitalizations was much higher than the number of ICU admittances and deaths. This is a natural difficulty with health data because in the vast majority of cases (thankfully) people overcome COVID-19 and make a full recovery, so the dataset will always be very unbalanced. We therefore balanced it slightly without cutting out a significant portion of the data, since doing so could also harm our ability to understand demographic trends. In the end, we reduced the gap between the classes by 50%. Below is a graph that illustrates this disparity.
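One way to read "reducing the gap by 50%" is to randomly undersample each larger class until its distance to the smallest class is halved. The sketch below implements that reading; it is an assumption about the procedure, and the function name `halve_class_gap`, the single `outcome` label column, and the toy counts are all hypothetical.

```python
import pandas as pd

def halve_class_gap(df, label_col, random_state=0):
    """Undersample each class so its count gap to the smallest class is halved."""
    counts = df[label_col].value_counts()
    n_min = int(counts.min())
    parts = []
    for label, n in counts.items():
        # New size: the smallest class size plus half of the original gap.
        target = n_min + (int(n) - n_min) // 2
        rows = df[df[label_col] == label]
        parts.append(rows.sample(n=target, random_state=random_state))
    # Shuffle so the classes are not grouped together in the output.
    shuffled = pd.concat(parts).sample(frac=1, random_state=random_state)
    return shuffled.reset_index(drop=True)

# Toy label column with an imbalance similar in spirit to ours.
toy = pd.DataFrame({"outcome": ["hospitalized"] * 100 + ["icu"] * 40 + ["death"] * 20})
balanced = halve_class_gap(toy, "outcome")
print(balanced["outcome"].value_counts())
```

Halving the gap rather than fully equalizing the classes keeps more of the majority-class rows, which matters when the goal is also to study demographic trends within that class.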

Counts of Hospitalizations, ICU Admittances, and Deaths

Table of Data Features and Data Types

Feature	Data Type
Case Month	Date Time
Res State	Text (Abbreviated State Name)
County FIPS Code	Integer
0-17 Years Old	Integer
18-49 Years Old	Integer
50-64 Years Old	Integer
65+ Years Old	Integer
No Age Given	Binary (1 or 0)
Female	Binary (1 or 0)
Male	Binary (1 or 0)
Gender Unknown	Binary (1 or 0)
American Indian/Alaska Native	Binary (1 or 0)
Asian	Binary (1 or 0)
Black	Binary (1 or 0)
Multiple/Other Race	Binary (1 or 0)
White	Binary (1 or 0)
case_positive_specimen_interval	Float
case_onset_interval	Float
Clinical Evaluation	Binary (1 or 0)
Contact tracing of case patient	Binary (1 or 0)
Laboratory reported	Binary (1 or 0)
Provider reported	Binary (1 or 0)
Routine surveillance	Binary (1 or 0)
Laboratory-confirmed case	Binary (1 or 0)
Probable Case	Binary (1 or 0)
Asymptomatic	Binary (1 or 0)
Symptomatic	Binary (1 or 0)
Unknown Symptoms	Binary (1 or 0)
Hospitalized	Binary (1 or 0)
ICU	Binary (1 or 0)
Death	Binary (1 or 0)
Underlying Condition Confirmed	Binary (1 or 0)
No underlying condition listed	Binary (1 or 0)
Median_Household_Income_2021	Integer

To explain some of the feature types that may not be obvious: features such as “Clinical Evaluation” and “Laboratory-confirmed case” allow us to better understand how the diagnosis was made, as a laboratory confirmation is scientifically more reliable than a doctor’s analysis of symptoms alone. The “case_onset_interval” is the approximated time between infection and the diagnosis. The “case_positive_specimen_interval” is the approximated time between infection and when a positive identification of a COVID-19 specimen was obtained. The “County FIPS Code” is a numeric identifier for the county of residence; it identifies a county rather than a ZIP code.

Data Visualization

Our dataset offers many opportunities to explore health outcomes by demographic information such as age, sex, and race. Here we focus first on age versus health outcomes. Below we have three plots that show hospitalization, ICU admittance, and death for varying age groups. The groups are bucketed into what are generally considered children, younger adults, older adults, and the elderly. As we can see, the distribution becomes more and more skewed left as the severity of the health outcome increases from hospitalization to death. We notice that no children in this dataset died from COVID-19, and that the number of younger adults admitted to the ICU was almost a fifth of the number hospitalized. We also notice that the vast majority of reported cases are of white Americans: they make up around 60% of the general population but 80% of the cases seen here. Additionally, there is a large gap between the number of females and the number of males reported here. As next steps, we plan to perform similar visual explorations of the other features to uncover more trends, perhaps some more surprising than the results we see here.

The Distribution of Hospitalization Counts by Age Group

The Distribution of ICU Admittance Counts by Age Group

The Distribution of Death Counts by Age Group

The Distribution of Gender Counts

A Pie Chart of the Percentages of each Race Present

The Distribution of Cases across Months

The Distribution of Median Household Income in 2021

PCA Dimension Reduction

It is important to note that after cleaning our data we were left with around 23,000 rows. While this is still a good amount of data to make predictions with, the categorical nature of our data and the straightforward nature of our exploration make it relatively simple compared to, say, images for image classification with neural networks. Therefore, for the initial part of our explorations with Random Forests and Logistic Regression, we used all the data, as the models ran quickly and achieved good accuracies. To be thorough, we also wanted to compare the accuracy and speed achievable on the reduced data against the full dataset. To that end, we used PCA to understand the relative explained variance of our features, and found that we could retain 80% of the explained variance with half as many components as original features, and as much as 90% with 10 of 16. Below is a plot of the cumulative explained variance as we add components. Additionally, we list below that the five features that contributed most to the leading principal components.

Cumulative Explained Variance as we Add Features

Top 5 Principal Components
Age
Gender
Race/Ethnicity
Underlying Condition
Median Household Income

An important thing to note from our PCA results is that they support our initial hypothesis that demographic and income data will be most valuable in predicting COVID-19 health outcomes: age, gender, race/ethnicity, underlying condition, and median household income contributed more to the leading principal components than any of the other features available in our dataset. Thus, moving forward as we intended, we split the data according to age, gender, race, and income to see which demographic/income factors predict the health outcomes best. We also compare these results to giving the models all of the features together.
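The explained-variance analysis above can be reproduced in outline with scikit-learn. The sketch below runs on synthetic data with 16 features, so the numbers it prints are not our results; standardizing the features first and the helper name `components_needed` are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 16 correlated features driven by 4 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 4))
X = latent @ rng.normal(size=(4, 16)) + 0.5 * rng.normal(size=(500, 16))

# Assumption: features are standardized before PCA so no feature dominates by scale.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# Running total of the fraction of variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

def components_needed(cumulative, threshold):
    """Smallest number of components whose cumulative variance reaches threshold."""
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))

k80 = components_needed(cumulative, 0.80)
k90 = components_needed(cumulative, 0.90)
print(f"components for 80%: {k80}, for 90%: {k90}")

# Original features with the largest absolute loadings on the first component.
top_features = np.argsort(np.abs(pca.components_[0]))[::-1][:5]
print("feature indices with largest loadings on PC1:", top_features.tolist())
```

Note that each principal component is a weighted combination of all original features, so ranking features by their loadings, as in the last lines, is one common way to say which features matter most to a component.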

Methods for Prediction

Softmax Regression

The purpose of this project is to predict COVID-19 outcomes from demographic/income data. Since our data is categorical and the outcomes (Hospitalization, ICU admission, and Death) are discrete classes, we use Softmax Regression as one of the models for uncovering those relationships. We also use the sklearn, numpy, seaborn, and matplotlib libraries for training and visualization. Our hypothesis is that socioeconomic factors have an impact on health outcomes. Feeding each demographic factor along with the health outcomes into our model, we obtained the following results in each case:

Features	Accuracy	Precision	F1-score	Loss
Age	0.91	0.87	0.88	0.1336
Race	0.81	0.85	0.83	0.2709
Sex	0.82	0.90	0.97	0.3011
Median Household Income	0.90	0.91	0.85	0.1160

We can see from our results above that Softmax Regression performed quite well, reaching roughly 90% accuracy in the best cases. Here the best predictors for COVID-19 health outcomes are Age and Median Household Income. Age comes as no surprise, as we saw in our data visualization section that there were significant disparities in hospitalization, ICU admittance, and death when comparing children to seniors.
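For reference, a softmax (multinomial logistic) regression with the four metrics reported in the table can be sketched with scikit-learn as below. The data is synthetic, so the printed values are not our results. Treating "Loss" as cross-entropy (log loss) and using weighted averaging for precision and F1 are assumptions, since the report does not specify them.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, log_loss, precision_score
from sklearn.model_selection import train_test_split

# Synthetic data: one-hot age bucket (4 groups) and a 3-class outcome
# (0 = hospitalized, 1 = ICU, 2 = death) that gets more severe with age.
rng = np.random.default_rng(0)
n = 600
age_bucket = rng.integers(0, 4, size=n)
X = np.eye(4)[age_bucket]
probs = np.array([[0.90, 0.08, 0.02],
                  [0.80, 0.15, 0.05],
                  [0.60, 0.25, 0.15],
                  [0.30, 0.30, 0.40]])
y = np.array([rng.choice(3, p=probs[a]) for a in age_bucket])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# With the default lbfgs solver, scikit-learn fits a multinomial (softmax) model
# when there are more than two classes.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_test)
proba = model.predict_proba(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred, average="weighted", zero_division=0),
    "f1": f1_score(y_test, pred, average="weighted"),
    "loss": log_loss(y_test, proba, labels=model.classes_),
}
print(metrics)
```

Repeating this with a different feature group in `X` (race, sex, income) gives one table row per group, which is how a per-feature comparison like the one above can be assembled.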

Random Forest Classifier

Previous studies of demographic factors and health outcomes, as well as general experience, show that random forests perform very well on classification tasks such as this one. They are robust against overfitting and offer computational flexibility through a range of tunable hyperparameters. We tuned our model for accuracy alone: our data was not very complicated, so we were not concerned about computational speed. Below is an example of part of our grid search cross validation. As we increase the maximum depth of the trees, the testing accuracy peaks at a maximum depth of 5; beyond that the model begins to overfit, as the training accuracy continues to increase while the testing accuracy decreases.

Grid Search Accuracy in Training v.s. Testing for Max Depth

With the hyperparameters tuned, we fed the model the same data splits as we did for Softmax Regression and obtained the following results:

Features	Accuracy	Precision	F1-score	Loss
Age	0.93	0.90	0.90	0.1096
Race	0.91	0.85	0.83	0.1313
Sex	0.88	0.89	0.83	0.3582
Median Household Income	0.90	0.91	0.85	0.2937

The Random Forest Classifier turned out to be our best classifier, unsurprisingly. With hyperparameter tuning we achieved greater than 90% accuracy in some cases. Age was again the best predictor of the COVID-19 health outcomes we explored, with Race (0.91) and Median Household Income (0.90) close behind. Because its structure mirrors a sequence of decisions, and because it is flexible and robust against overfitting, we expected the Random Forest Classifier to perform best.
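The maximum-depth search described in this section can be sketched with `GridSearchCV` as below. The data is synthetic and the grid values, the number of trees, and the 3-fold setting are assumptions, so the best depth it reports need not match ours.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem with 16 features as a stand-in for our data.
X, y = make_classification(
    n_samples=400, n_features=16, n_informative=6, n_classes=3, random_state=0
)

depths = [2, 3, 5, 8, 12]  # hypothetical grid
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": depths},
    scoring="accuracy",
    cv=3,
    return_train_score=True,  # needed to compare training vs. validation accuracy
)
search.fit(X, y)

train_scores = search.cv_results_["mean_train_score"]
valid_scores = search.cv_results_["mean_test_score"]
for depth, tr, va in zip(depths, train_scores, valid_scores):
    print(f"max_depth={depth:2d}  train={tr:.3f}  validation={va:.3f}")
print("best max_depth:", search.best_params_["max_depth"])
```

A growing gap between the training and validation columns as depth increases is the overfitting signal we looked for when choosing the final depth.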

Neural Network Classifier

Neural Networks have powerful capabilities for learning non-linear behavior, depending on the activation functions used. In our case we used a fully connected Neural Network with 2 dense layers and non-linear ReLU activations in between. Finally, we use a softmax layer at the end to output our multiclass classification results. Below we have two plots showing that while we were able to get the neural net to train and learn, it did not fare as well in testing, as we can see from the results in the table.

Comparing Training and Validation Losses

Comparing Training and Validation Accuracies

Features	Accuracy	Precision	F1-score	Loss
Age	0.81	0.80	0.79	3.57
Race	0.77	0.77	0.76	4.21
Sex	0.78	0.76	0.78	4.62
Median Household Income	0.83	0.79	0.81	3.14

Generally, Neural Networks do quite well because they can capture both linear and non-linear relationships with the proper activation functions. In this case, however, we had relatively simple data (compared to images, for example), so we are not surprised that the network often overfit or simply did not learn as we would expect. Although the Neural Network was unable to outperform the other two methods, we still see the same pattern: Age and Median Household Income were the best predictors for COVID-19 health outcomes.
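A network of the shape described in this section can be approximated with scikit-learn's `MLPClassifier`, used here as a stand-in rather than the framework we trained with. The sketch assumes two hidden dense layers; the layer widths, the validation fraction, and the synthetic data are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic 3-class problem with 16 features as a stand-in for our data.
X, y = make_classification(
    n_samples=600, n_features=16, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

net = MLPClassifier(
    hidden_layer_sizes=(32, 32),  # two dense hidden layers (widths are assumed)
    activation="relu",            # ReLU non-linearity after each hidden layer
    early_stopping=True,          # hold out part of the training set for validation
    validation_fraction=0.2,
    max_iter=300,
    random_state=0,
)
net.fit(X_train, y_train)

# For multiclass targets the output layer uses softmax.
print("output activation:", net.out_activation_)
# Per-epoch training loss and validation accuracy, the curves one would plot.
print("epochs run:", len(net.loss_curve_))
print("final validation accuracy:", net.validation_scores_[-1])
test_accuracy = net.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

Plotting `loss_curve_` against `validation_scores_` gives the same kind of training-versus-validation comparison shown in the figures above and makes overfitting easy to spot.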

Conclusion

In conclusion, there is still a lot to be done with COVID-19 data to get a full and satisfactory picture of the disparities that may or may not exist in health outcomes. An important step forward is to incentivize hospitals and health organizations to report health data accurately. Even then we are not fully on a level playing field, as we already know that poorer people and people of color generally do not seek and/or get the same level of care as wealthier people. Nonetheless, even with our sparse data that required a substantial amount of cleaning, we were still able to corroborate the hypothesis that demographic and income features would be the best predictors for a person’s health outcome with COVID-19. More specifically, Age and Median Household Income were among the best predictors for all methods employed, with other factors such as underlying health conditions and gender still very relevant.

(Note that certain information in the introduction/data explanation was merged from the midterm report, and thus that section is omitted for cleanliness and to avoid redundancy.)

Project Proposal

Proposal Video Link

Proposal Video

Contribution Table

Team Member	Contributions
Emmanuel Lyngberg	Project Manager, Data Cleaning and Preprocessing, PCA Dimension Reduction Implementation
Hanmo Zhang	Random Forest Implementation, Logistic Regression Implementation, Data Visualization
Han Nguyen	Data Visualization, Logistic Regression Implementation, Contribution Management
Mohammed Abbas	Random Forest Implementation, Logistic Regression Implementation, Hyperparameter Tuning
Ronith Charungundla	PCA Dimension Reduction Implementation, Data Visualization, Logistic Regression Implementation

Timeline to Final Report

Date	Idea/Implementation/Analysis to be completed
11/17/2023	Complete comparisons of demographic groups and health outcomes
11/17/2023	Implement Logistic Regression and analyze results
11/17/2023	Finalize hyperparameter tuning for Random Forest Classifier and analyze results
12/1/2023	Finalize Neural Network Implementation, hyperparameter tuning, and analyze results
12/3/2023	Finalize final report discussions

Gantt Chart Link

https://gtvault-my.sharepoint.com/:x:/g/personal/elyngberg3_gatech_edu/ES88WMagU9FIsdjbzDlTRe4BeuISmBQs8HfB_XIha59-7g?e=zxcTTr

Datasets

References

National Academies of Sciences, Engineering, and Medicine. (2017). The State of Health Disparities in the United States. In A. Baciu, Y. Negussie, A. Geller, et al. (Eds.), Communities in Action: Pathways to Health Equity (Chapter 2). National Academies Press. https://www.ncbi.nlm.nih.gov/books/NBK425844/

Braveman, P., Arkin, E., Orleans, T., Proctor, D., Acker, J., & Plough, A. (2018). What is Health Equity? Behavioral Science & Policy, 4(1), 1-14. https://doi.org/10.1177/237946151800400102

Andraska, E. A., Alabi, O., Dorsey, C., Erben, Y., Velazquez, G., Franco-Mesa, C., & Sachdev, U. (2021). Health care disparities during the COVID-19 pandemic. Seminars in Vascular Surgery, 34(3). https://doi.org/10.1053/j.semvascsurg.2021.08.002

Magesh, S., John, D., Li, W. T., Li, Y., Mattingly-app, A., Jain, S., Chang, E. Y., & Ongkeko, W. M. (2021). Disparities in COVID-19 Outcomes by Race, Ethnicity, and Socioeconomic Status. JAMA Network Open, 4(11), e2134147. https://doi.org/10.1001/jamanetworkopen.2021.34147