- Alexander Morgan, Van Ha Lee, Sydney Nguyen
The COVID19 pandemic has changed our world in ways we could not have predicted -- what started out as only a few notable cases overseas has erupted into a pandemic unlike anything our generation has ever experienced. On January 9, 2020, the World Health Organization announced the unfamiliar Coronavirus-related Pneumonia case in Wuhan, China, and only a few weeks later on January 21, the first case of Coronavirus was discovered in the United States. When President Trump declared a National emergency on March 7th, the United States went on lockdown subsequently thrusting American citizens into a quarantine state. Our sense of normalcy has drastically changed, and during these unprecedented times people are looking for the best ways to cope.
Boredom during quarantine was the driving factor for trends like baking banana bread, watching Tiger King, and developing an unhealthy obsession with TikTok. Virtual get-togethers became the new normal; from Zoom happy hours to Zoom birthday parties -- people were looking for any excuse to drink with their friends without leaving their homes. Even though restaurants and bars were closed during quarantine, Americans seemed to be drinking more than ever.
We will attempt to analyze the correlation between covid cases rising and the drinking habits of Americans during this time. We also want to understand whether people in lower income areas or higher income areas are drinking/purchasing more alcohol during the covid pandemic.
!pip install folium
import warnings
import folium
import requests
import os
from folium import plugins
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import sys
import json
import pylab as pl
from google.colab import files
Alcohol Sales: First, we will collect and tidy our data -- starting with our alcohol sales data. We are only able to find data for alcohol sales up until August 30, 2020 because findings are still being recorded and collected. The dataset found only has sufficient data for 18 states, so we can only draw general assumptions based on the limited data we have. We will thus use these states as a jumping-off point for other variables we wish to test.
alc_sales = pd.read_csv('alcsales_August2020.csv')
alc_sales.head()
Per the dataset definitions, FIPS is the Federal Information Processing Standards geographic code for each state. Beverage is also a code where 1=spirits, 2=wine, and 3=beer. Gallons is the gallons of beverage sold, and Ethanol is the gallons of pure alcohol sold. PerCapita represents the gallons of ethanol sold per capita age 14 and older in 2020, and PerCapita3Yr represents the three-year average gallons of ethanol sold per capita age 14 and older, 2017–2019. We will replace the FIPS numbers with the names of their corresponding states and the Beverage numbers with the names of their corresponding beverage. We will get rid of columns we don't need like PctChange.
fips = pd.read_csv('FIPScodes.csv') #FIPS codes contains the codes for the states being recorded
fips["FIPS code"] = pd.to_numeric(fips["FIPS code"], downcast="float") #converts fips code to float to match alc sales dataframe format
fips= fips.rename(columns = {'FIPS code':'FIPS'})
alc_sales=alc_sales.merge(fips,on="FIPS") #merge FIPS and alc sales dataframe so we can see the states that are recorded
#Here, we will replace the Beverage codes.
#in the dataset, Beverage represents the code for what type of alcoholic drink was purchased. 1=spirits, 2=wine, 3=beer
alc_sales["Beverage"].replace({1.0: "spirits", 2.0: "wine", 3.0:"beer"}, inplace=True)
alc_sales["Month"].replace({1.0: "January", 2.0: "February", 3.0:"March", 4.0:"April", 5.0:"May", 6.0:"June",7.0:"July",8.0:"August", 9.0:"September",
10:"October", 11:"November", 12:"December"}, inplace=True)
#Drop columns we don't need
alc_sales.drop('PctChange', axis=1, inplace=True)
alc_sales.drop('Unnamed: 2', axis=1, inplace=True)
alc_sales.drop('FIPS', axis=1, inplace=True)
alc_sales
Covid-19 has quickly brought destruction to our entire world and way of life. While some places, such as Australia and New Zealand, have been finding their way back into "normal" times, the United States has reached an astonishing 17.9 million cases. Here, we analyze a dataset (which updates frequently) to put these numbers into perspective. Which states have it worst? Where are these numbers coming from? We specifically hone in on the 19 states which bear telling alchol sales data here, to provide room for clear comparison between the affect of covid (a natural, uncontrollable occurance) on existing drinking habits. We chose to also showcase July 2020's covid case rate, which represents "peak" quarantine.
# read in our file containing recent data on covid deaths/cases per state
covid_data = pd.read_csv('United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv')
# clean up our dataframe by only keeping relevant information
covid_data = covid_data[['submission_date','state', 'tot_cases', 'tot_death']]
# modify our submission dates to contain only the corresponding month per state, to be used in
# calculating TOTAL cases & deaths per month, per state
covid_data['submission_date'] = covid_data['submission_date'].astype(str).str[:2]
# aggregate cases per month, per state
covid_data = covid_data.groupby(['submission_date','state'],as_index=False).agg({'tot_cases': 'sum', 'tot_death': 'sum'})
# updating our covid chart to link this data to states with known alcohol consumption histories, as shown
# in our alc_gallons2020 dataframe
state_sel = ['AK', 'AR', 'CO', 'CT', 'DE', 'FL', 'IL', 'KS', 'KY', 'LA', 'MA', 'MO', 'ND', 'OR', 'TN', 'TX', 'UT', 'VA', 'WI']
covid_sel = covid_data.loc[covid_data['state'].isin(state_sel)]
covid_sel
# update our covid frame to only contain values for each state, for the month of July.
# We use July as we are able to derive a better segment of information from the alcohol-related
# set of data, and because July is essentially at 'peak' quarantine
covid_map_data = covid_sel.loc[covid_sel['submission_date'].isin(['07'])]
covid_map_data = covid_map_data.shift()[1:]
# we want an accurate picture of covid cases based on population, so we prepare population to
# calculate overall rates
covid_map_data['population'] = ['683879','3039000', '5842076', '3700000',
'995764', '21477737', '12630000', '2774044'
'4500000', '4600000', '6700000', '6150000',
'762062','4280000', '6890000', '28996000', '3280000', '8535519', '5860000']
# cast population column to floats in order to use in later calculation
covid_map_data['population'] = pd.to_numeric(covid_map_data['population'], downcast="float")
covid_map_data = covid_map_data.drop(['submission_date', 'tot_death'], axis=1)
# calculate covid case rates per state
covid_map_data['case_rate'] = covid_map_data['tot_cases'] / covid_map_data['population']
# only want to keep columns to be shown in the chloropleth map
covid_map_data = covid_map_data.drop(['tot_cases', 'population'], axis=1)
covid_map_data
Here, we are using a dataset that includes the Gross State Product in millions of dollars as well as population for each state in millions. Gross State Product is a measurement of a state's output; it is the sum of value added from all industries in the state. Current dollar GSP components are compensation of employees, taxes on production and imports, and gross-operating surplus.
We believe that this dataset could provide a sufficient explanation as to how each state is doing economically and how the covid pandemic could have affected them. If we are analyzing covid cases and drinking rates per state, factoring in each state's economic wellbeing can also be a good indicator as to why some states are drinking more/less.
# reading in the file which contains GDP information per state
gdp2020 = pd.read_csv('gdp2020clean.csv')
gdp2020.head() # display a sample of our file/the information we're deriving from it
Like we have done before, we need to prepare this dataset
pd.set_option('mode.chained_assignment', None)
# modify our dataframe to only have the values for states in which we have substantial alcohol-related data, for comparison purposes
state_select = ['Alaska', 'Arkansas', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Illinois', 'Kansas', 'Kentucky', 'Louisiana', 'Massachusetts', 'Missouri', 'North Dakota', 'Oregon', 'Tennessee', 'Texas', 'Utah', 'Virginia', 'Wisconsin']
gdp_sel = gdp2020.loc[gdp2020['State'].isin(state_select)]
# cast our population values to floats, so we can use in in calculation below
gdp_sel['Population (million)'] = pd.to_numeric(gdp_sel['Population (million)'], downcast="float")
# modify 'Gross State Product' values to be used as floats in our calculation of GDP Per Capita
gdp_sel['Gross State Product'] = gdp_sel['Gross State Product'].str.replace(',', '').astype(float)
# calculate GDP per capita
gdp_sel['GDP Per Capita'] = gdp_sel['Gross State Product'] / gdp_sel['Population (million)']
# update our dataframe to contain the 2 fields to be used later in our chloropleth map
gdp_sel = gdp_sel[['State','GDP Per Capita']]
gdp_sel # display resulting frame
Now that we have our data, we want to identify an overall trend between alcohol sales and years. Do we see an increase in alcohol sold in 2020 compared to that of 2017-2019?
To answer this question, we will aggregate the gallons of pure alcohol (ethanol) sold per month per year from 2017-2020 and plot on a line graph so we can see the trends.
As previously discussed, we only have alcohol sales data for January-August in 2020, so to ensure validity in our analysis, we should only look at January-August for the years 2017-2019 as well.
alc_gallons = alc_sales.groupby(['Year','Month'],as_index=False).agg({'Ethanol': 'sum'})
alc_gallons['Month'] = pd.Categorical(alc_gallons['Month'],categories=['January', 'February', 'March', 'April', 'May', 'June', 'July',
'August', 'September', 'October', 'November','Decemeber'],ordered=True)
alc_gallons = alc_gallons.sort_values(by='Year', ascending=True).sort_values('Month')
alc_gallons2 = alc_gallons.sort_values(['Month', 'Year'])
#Drop the last few
alc_gallons2.drop(alc_gallons2.tail(12).index,inplace=True)
fig, axis = plt.subplots()
for lab, grp in alc_gallons2.groupby('Year'):
grp.plot(x = 'Month', y = 'Ethanol', ax = axis, label = lab, figsize = (15,15), title = 'Gallons of Ethanol (Pure Alcohol) Sold per Month')
From the graph above we can see a pattern of higher drinking levels in 2020 than any other year for most months. Altough there is a drop off in August its reasonable to assume that as its the last month of data collected, there may be missing numbers or companies that havent reported their sales to the dataset yet and based on the numbers for 2020 being higher in every other month, we are able to assume that it would follow the pattern of previous years but with higher numbers. The graph shows the aggregated sale of pure alcohol over all collected states in 2020. At the data point for March and April, we are able to see that quarantine and lockdowns in the states collected caused a much higher level of drinking than any of the other three years before it.
We now want to confirm our hypothesis from the monthly graph that there has overall been more alcohol sold in 2020 than the other years we have data for. To check this we will take the mean alcohol sold over the 7 months and compare.
alc_gallons2.drop(alc_gallons2.tail(16).index,inplace=True)
dfgroup = alc_gallons2.groupby(['Year'])['Ethanol']
dfmean = dfgroup.mean()
print(dfmean)
dfmean.plot(kind='line',x='Year',y='Ethanol', title = 'Mean Ethanol sold per Year for January-July')
The graph above further confirms our findings from the previous monthly sales of ethanol showing that 2020 followed the increasing sale of alcohol per year and then some. Due to the likely issues with data collection we dropped the month of august from the dataset and took a mean of the total ethanol sales between January-July of each year. Shown above there is a dramatic increase in the sale of pure alcohol content per year over all of the total states showing that we are correct in assuming people are drinking more alcohol.
alc_pc = alc_sales.groupby(['Year','Month'],as_index=False).agg({'PerCapita': 'sum'})
alc_pc['Month'] = pd.Categorical(alc_pc['Month'],categories=['January', 'February', 'March', 'April', 'May', 'June', 'July',
'August', 'September', 'October', 'November','Decemeber'],ordered=True)
alc_pc = alc_pc.sort_values(by='Year', ascending=True).sort_values('Month')
alc_pc2 = alc_pc.sort_values(['Month', 'Year'])
#-------
alc_pc2.drop(alc_pc2.tail(16).index,inplace=True)
alc_pc2
dfgroup = alc_pc2.groupby(['Year'])['PerCapita']
dfmean = dfgroup.mean()
print(dfmean)
dfmean.plot(kind='line',x='Year',y='PerCapita', title = 'Mean Total Per Capita Consumption per Year for January-July')
To make sure we accounted for population increases and other factors like sales vs. actual consumption, we checked the per-capita consumption of pure ethanol for the population of individuals above the age of 14. The results of this test show even more that 2020 has been a year of heavier drinking for the US population in major part due to COVID-19.
We will now explore the drinking preferences of Americans by looking at the sales data for different the three distinct types of alcohol.
alc_sales
#plot changes in different types of beverages over the years
alc_bev = alc_sales.groupby(['Year','Month', 'Beverage'],as_index=False).agg({'Gallons': 'sum'})
#spirits_sales.plot(kind='line',x='Year',y='Gallons', title = 'Mean Gallons of Alcohol sold per Year for January-August')
alc_bev['Month'] = pd.Categorical(alc_bev['Month'],categories=['January', 'February', 'March', 'April', 'May', 'June', 'July',
'August', 'September', 'October', 'November','Decemeber'],ordered=True)
alc_bev2 = alc_bev.sort_values(['Month'])
alc_bev2.drop(alc_bev2.tail(36).index,inplace=True)
beer_sales = alc_bev2[alc_bev2['Beverage'] == 'beer']
wine_sales = alc_bev2[alc_bev2['Beverage'] == 'wine']
spirits_sales = alc_bev2[alc_bev2['Beverage'] == 'spirits']
spirits_sales = spirits_sales.groupby(['Year'],as_index=False).agg({'Gallons': 'sum'})
wine_sales = wine_sales.groupby(['Year'],as_index=False).agg({'Gallons': 'sum'})
beer_sales = beer_sales.groupby(['Year'],as_index=False).agg({'Gallons': 'sum'})
spirits_sales.plot(kind='line',x='Year',y='Gallons', title = 'Mean Gallons of Spirits sold per Year for January-August')
wine_sales.plot(kind='line',x='Year',y='Gallons', title = 'Mean Gallons of Wine sold per Year for January-August')
beer_sales.plot(kind='line',x='Year',y='Gallons', title = 'Mean Gallons of Beer sold per Year for January-August')
We can see from the graphs above that the sales of beer, wine, and spirits in 2020 all followed the sales trends from the previous years. There are many possible explanations for these trends but in relation to COVID-19 the increase in spirits makes sense as beer and wine are more portable and convenient for social drinking whereas spirits provide a cheaper but less portable option. Since everyone is staying at home, buying more liquor to make drinks in a kitchen with all of the supplies you may need makes sense.
Here, we display a chloropleth map of GSP Per Capita in our relevant states:
map_gdp = folium.Map(location=[38, -100], tiles='Stamen Toner', zoom_start=4) # creating a map to
# show GDP values for the states in which we have substantial alcohol-related data
# url = 'https://github.com/python-visualization/folium/blob/master/examples/data/us-states.json'
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
state_geo = f'{url}/us-states.json' # obtain json file of US states
import requests
#state_geo = requests.get(url).json()
folium.Choropleth(
geo_data = state_geo,
name = 'choropleth',
data = gdp_sel,
columns = ['State', 'GDP Per Capita'],
key_on = 'feature.properties.name',
fill_color = 'PiYG',
fill_opacity = 0.7,
line_opacity = 0.2,
legend_name = 'Projected GSP Per Capita 2020'
).add_to(map_gdp)
folium.LayerControl().add_to(map_gdp)
map_gdp
Here, we display a chloropleth map of Covid Case Rates in our relevant states:
map_covid = folium.Map(location=[38, -100], tiles='Stamen Toner', zoom_start=4) # creating a map to
# show Covid Case Rates for the states in which we have substantial alcohol-related data
folium.Choropleth( # creating our chloropleth map for the covid data
geo_data= state_geo,
name='choropleth',
data= covid_map_data,
columns=['state', 'case_rate'],
key_on='feature.id',
fill_color='BuPu',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Covid Case Rates in July 2020'
).add_to(map_covid)
folium.LayerControl().add_to(map_covid)
map_covid
Here, we display a chloropleth map of Drinking Rates in our relevant states:
#Extract data from 2020
alc_sales2020 = alc_sales[alc_sales['Year'] == 2020]
#create new dataframe calculating gallons per month per state in 2020 to use for visualization
alc_gallons2020 = alc_sales2020.groupby(['Month','State'],as_index=False).agg({'Gallons': 'sum'})
alc_gallons2020
alc_gallons2020['Month'] = pd.Categorical(alc_gallons2020['Month'],categories=['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August'],ordered=True)
alc_gallons2020 = alc_gallons2020.sort_values('Month')
#prepare dataframe for total gallons of alcohol sold per state for 2020
alc_gallons2020 = alc_sales2020.groupby(['State'],as_index=False).agg({'Gallons': 'sum'})
# add our population data per state, to our alc_gallons2020 dataframe
alc_gallons2020['population'] = ['683879','3039000', '5842076', '3700000',
'995764', '21477737', '12630000', '2774044'
'4500000', '4600000', '6700000', '6150000',
'762062','4280000', '6890000', '28996000', '3280000', '8535519', '5860000', '5822000']
# cast populations to floating point numbers, to be used in a calculation involving gallons of alc purchased per state
alc_gallons2020['population'] = pd.to_numeric(alc_gallons2020['population'], downcast="float")
# get the drinking rate per state, by dividing the total gallons of alc consumed by the overall population
alc_gallons2020['drinking_rate'] = alc_gallons2020['Gallons'] / alc_gallons2020['population']
map_alc = folium.Map(location=[38, -100], tiles='Stamen Toner', zoom_start=4) # creating a map to
# show drinking rates for the states in which we have substantial alcohol-related data
folium.Choropleth( # creating our chloropleth map
geo_data= state_geo,
name='choropleth',
data= alc_gallons2020,
columns=['State', 'drinking_rate'],
key_on='feature.properties.name',
fill_color='OrRd',
fill_opacity=0.7,
line_opacity=0.2,
legend_name='Drinking rate in 2020',
legend_color='White'
).add_to(map_alc)
folium.LayerControl().add_to(map_alc)
map_alc
Analysis of above 3 maps (displaying GSP Per Capita, Covid Case Rates, and Drinking Rates per state): Our previous notion was that GDP, Covid Case Rate, and Drinking rate would all show strong correlations. When we derived the chloropleth maps for the 3 topics, however, we saw almost identical results in just the Covid Case Rate and Drinking Rate. We suspected GSP to be a good factor to consider in that it is a huge indicator of economic growth in states. Covid-19 has hurt many businesses, and could greatly affect the economic output of all states, let alone a few. We assumed that states better equipped to handle the consequences of Covid-19 (with stronger GSPs), could see better handling of Covid Case Rates (a reduction) and have higher purchases quantities of alcohol. We saw that in the end, the GSP per Capita calculations left us with GSP's varying greatly across the states in which we chose to display. In contrast, the chloropleth map for Covid Case Rates and Drinking Rates are strikingly similar/produce clearer outputs. In both, Texas and Missouri are significantly darkened compared to the rest, which remain the same pale shades (cannot distinguish between them); in the case of Covid Case Rates, we see dark purple (Texas) and a distinct darker blue (Missouri) & a dark orange-red in Texas and Missouri for Drinking Rates. We can thus conclude that there seems to be a relationship between Covid-19's arrival & drinking habits in these areas.
We now want to look at the average monthly drinking rates for individuals in the US
pc_yearly = alc_sales.groupby(['Year'],as_index=False)['PerCapita'].mean()
yearly_difference = pc_yearly.iloc[3]['PerCapita'] - pc_yearly.iloc[2]['PerCapita']
print("Yearly difference in per capita ethanol consumption between 2020 and 2019:",yearly_difference)
From this number we can see that on average the PerCapita consumption of alcohol is .0025 higher than in 2019. We can use this number to adjust up the 2019 november mean per capita consumption and predict how much people are drinking right now in December of 2020.
pc_monthly = alc_sales.groupby(['Year','Month'],as_index=False)['PerCapita'].mean()
dec_2019_pc = pc_monthly.iloc[26]['PerCapita']
dec_2020_pc = dec_2019_pc + yearly_difference
dec_2020_pc
This number indicates the amount of ethanol consumed per month per member of the population above the age of 14 in gallons.
ounces_per_gallon = 128
ounces_per_drink = .6 #Ounces of ethanol in a standard drink
pc_ounces = dec_2020_pc * ounces_per_gallon
print("Monthly average ethanol consumed per person:", pc_ounces)
average_monthly_drinks = pc_ounces/ounces_per_drink
print("Monthly average standard drinks per month per person:", average_monthly_drinks)
This number is affected by many different variables like our data set using 14 as the start of drinking age vs. 21 and it does not account for the fact that only about 54.9% of Americans drink at all. These numbers both bring the average down.
realistic_average_monthly_drinks = average_monthly_drinks*1.65
print("Realistic average monthly standard drinks per person:",realistic_average_monthly_drinks)
This number accounts for the population that doesnt drink in the US and shows the average amoount of monthly drinks per person only for individuals who actually drink. I used 1.65 to account for the ~50% who doesnt drink and another 10% to account for the young teenagers in the population who should not be included in the consumption data as they are not old enough to drink.
Throughout this project we have shown the affects of COVID-19 on the drinking habits of the American people and how factors between the two have a strong correlation. From our analysis it appears that states who are doing worse with COVID rates are drinking at a much higher rate than states who are handing the pandemic well. We can also see that on average the average drinking rates of Americans has gone up by ___ standard drinks per month.
We also saw throughout our project that the Pandemic has not been the only stress inducing struggle of the year. Drinking rates in the US have gone up drastically even in months before the Pandemic lockdown. There has been a steady increase in America's drinking levels every year since 2017 and most likely before that as well. We have to look at these numbers as a country and wonder if we are doing well and if the quality of life in the US is changing as Americans drink more. This year has not been one of lots of celebration, so its easy to infer that this drinking has been to cope. We need to work harder as a country and for our people to find other ways to help individuals struggling with life issues and create accessible fun and healthy activities for people to look towards instead of cracking another bottle. Other countriess are already wrapping up the pandemic and most other countries on earth emphasis daily athletic activity and other options for people to get their daily dose of seratonin and enjoy life without being under the influence of drugs and we believe that the US can aim to do that too.
As it stands right now though, Americans are drinking more and more. Do our numbers for average drinking levels make sense for you, or are you doing better or worse than the average in your alcohol consumption habbits throughout this crazy year? And if you're under the average, what methods are you using to distract yourself and cope that you can share with your friends and family to help them get the most out of their life this year.