The World Happiness Report is an annual publication by the United Nations that scores and ranks the national happiness of reporting countries based on various quality of life factors, including perceived social support and life expectancy. We will analyze this data through exploratory data analysis to determine which characteristics have the greatest impact on happiness, and whether there are any global or regional trends in measured happiness or other features over time. Additionally, we will attempt to model scores from the relevant features through regression.
The dataset was obtained from Kaggle and contains the 2015-2019 World Happiness Reports. The 2020 dataset was independently scraped from Wikipedia. The reports include data on GDP, Social Support, Life Expectancy, Freedom to make Life Choices, Generosity, and Perceptions of Corruption, as well as a holistic Happiness Score. Certain aspects of the dataset had to be cleaned, which was more easily done by directly modifying the .csv files. For example, some country names were inconsistent across years and were standardized to "Hong Kong", "Trinidad and Tobago", and "Taiwan". We can read the data in to get an initial look at it.
import pandas as pd
from IPython.display import display, HTML
import warnings
warnings.filterwarnings('ignore')
#We'll read the data from 6 different csv files into a list for now
data = []
for year in range(2015,2021):
    #Use a name other than `input` to avoid shadowing the built-in function
    yearly_data = pd.read_csv("{}.csv".format(year),index_col=False)
    data.append(yearly_data)
for i in range(len(data)):
    print("{} Data:".format(2015+i))
    display(data[i])
It looks like there's a pretty wide range of formats for the data, although most of the important columns are retained, albeit under different names. The simplest way to unify this is to drop extraneous columns and effectively take the intersection of the different data sets.
Before analysis, the data needs to be cleaned. The countries are inconsistent across years, with some countries only reporting data for one year. In addition, the data across years are in varying formats, with some years missing region data or other columns. This makes it difficult to preserve region data, since we don't have an overarching set that can be used to assign regions to years lacking the column. The standard deviation and whisker columns have to be dropped, since we don't have equivalent values for each year.
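As a rough illustration of "taking the intersection", the sketch below finds the column names shared by every frame in a list. The two small frames and the helper name `shared_columns` are hypothetical stand-ins, not part of the original notebook. Note that this only helps once equivalent columns carry the same name; in the real files the same quantity appears under different labels, which is why the cells below rename columns by hand.

```python
import pandas as pd

# Hypothetical miniature frames standing in for two yearly reports.
frame_a = pd.DataFrame({"Country": ["A"], "Score": [7.1], "Standard Error": [0.03]})
frame_b = pd.DataFrame({"Country": ["A"], "Score": [7.2], "Whisker.high": [7.3]})

def shared_columns(frames):
    """Return the column names present in every frame, in first-frame order."""
    common = set(frames[0].columns)
    for frame in frames[1:]:
        common &= set(frame.columns)
    return [col for col in frames[0].columns if col in common]

print(shared_columns([frame_a, frame_b]))
```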
First, we'll set aside the 2015 and 2020 data to examine starting and final trends. As a side effect, we can also use the region data that exists in these sets to compare regions over time.
data2015 = data[0].copy()
data2020 = data[5].copy()
The labels and columns for each year vary and need to be consolidated before we can concatenate the data together. In addition, we need to add an additional column with the year for the entry.
We can drop the regions, standard error, and dystopia residual data. We also correct the labels of the columns to be consistent with what we'll be using later on. Finally, we add the year values.
data[0].drop(data[0].columns[[1,4, 11]], axis=1, inplace=True)
data[0].columns=['Country','Rank','Score','GDP','Social_Support','Life_Expectancy','Freedom','Generosity','Corruption']
data[0]['Year']=[2015]*len(data[0])
We can drop the regions, upper/lower confidence intervals, and dystopia residual data. Again, we correct the columns labels and add the years.
data[1].drop(data[1].columns[[1,4,5,12]],axis = 1, inplace=True)
data[1].columns = ['Country','Rank','Score','GDP','Social_Support','Life_Expectancy','Freedom','Corruption','Generosity']
data[1]['Year']=[2016]*len(data[1])
We can drop the high/low whiskers and dystopia residual data. We correct the columns labels and add the years.
data[2].drop(data[2].columns[[3,4,11]], axis=1, inplace=True)
data[2].columns = ['Country','Rank','Score','GDP','Social_Support','Life_Expectancy','Freedom','Generosity','Corruption']
data[2]['Year']=[2017]*len(data[2])
There aren't any extraneous columns here, but there is a missing value: United Arab Emirates doesn't have a value for Perceptions of Corruption. For now, we'll drop it.
data[3].columns = ['Rank','Country','Score','GDP','Social_Support','Life_Expectancy','Freedom','Generosity','Corruption']
data[3].dropna(inplace=True)
data[3]['Year']=[2018]*len(data[3])
This one is easy: we just standardize the column labels and add the year data.
data[4].columns = ['Rank','Country','Score','GDP','Social_Support','Life_Expectancy','Freedom','Generosity','Corruption']
data[4]['Year']=[2019]*len(data[4])
What a mess! We can drop all the columns with proportion explained by features, as well as the standard error and whiskers. We change the column labels and add years.
data[5].drop(data[5].columns[[1,3,4,5,12,13,14,15,16,17,18,19]], axis=1, inplace=True)
data[5].columns = ['Country','Score','GDP','Social_Support','Life_Expectancy','Freedom','Generosity','Corruption']
#Ranks start at 1 to match the other years (this assumes the rows are sorted by descending score)
data[5]['Rank']=list(range(1,len(data[5])+1))
data[5]['Year']=[2020]*len(data[5])
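The positional drops above depend on the exact column order of each file. A name-based alternative is sketched here under assumptions: the raw column names in the example frame are illustrative rather than the real file headers, and `standardize` is a hypothetical helper that does not appear in the original analysis.

```python
import pandas as pd

# Hypothetical raw frame; the column names are illustrative only.
raw = pd.DataFrame({
    "Country name": ["A", "B"],
    "Ladder score": [7.8, 7.6],
    "Standard error": [0.03, 0.04],
})

def standardize(frame, rename_map, year):
    """Keep only the columns named in rename_map, rename them, and tag the year."""
    tidy = frame[list(rename_map)].rename(columns=rename_map)
    return tidy.assign(Year=year)

tidy = standardize(raw, {"Country name": "Country", "Ladder score": "Score"}, 2020)
print(tidy.columns.tolist())
```

Selecting by name fails loudly with a `KeyError` if a column is missing, whereas a positional drop silently removes whatever happens to sit at that index.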
Now that all the columns are identical, we can concatenate all of the data into a single variable. With our data tidy and condensed, this will make it easier for us to perform analysis in the future.
#Reset the index so that row labels stay unique across years
merged_data = pd.concat(data, ignore_index=True)
display(merged_data)
Let's take a look at trends over time. We can group the total data by both country and year and use a violin plot in order to see the yearly distribution of country values. To do this, we can use the seaborn Python library. Ideally, we would hope to see score and all other features increase over time, with the sole exception of corruption, which should decrease over time.
import matplotlib.pyplot as plt
import seaborn as sns
#We want to pivot the data to get country values on a yearly basis
scores = merged_data.groupby(['Country','Year'],as_index=False)['Score'].sum()
scores = scores.pivot(index='Year',columns='Country',values='Score')
def axis():
    #Create a fresh figure for each plot and return its axes
    chart1, ax1 = plt.subplots()
    return ax1
display(scores)
That looks good, although ideally the change would be more dramatic. The score generally appears to increase over time, with the average rising very slightly. More interestingly, the distribution of scores shifts over time, from having a higher probability of values being below the mean for 2015, to being evenly distributed around the center and slightly favoring higher values in 2020. This is good news: even if news and media seem to be getting more negative recently, global happiness is trending upwards.
ax = sns.violinplot(data=merged_data,x='Year',y='Score',ax=axis()).set_title('Score')
Global GDP remains generally constant, although the values are getting more centralized as time progresses. A closer look at economic trends would be necessary for further analysis, but for now it is sufficient to say that GDP is more or less stable.
ax = sns.violinplot(data=merged_data,x='Year',y='GDP',ax=axis()).set_title('GDP')
There is a large amount of variance in social support, suggesting that regions or individual countries differ greatly. Based on the distribution, it would appear that there are a few countries with very low social support compared to the average. In general, social support has increased over time, although it has remained stable in the last 4 years.
ax = sns.violinplot(data=merged_data,x='Year',y='Social_Support',ax=axis()).set_title('Social Support')
Life expectancy has trended upwards over time. This matches expectations that technological development and globalization will lead to improved health care and increased life expectancy.
ax = sns.violinplot(data=merged_data,x='Year',y='Life_Expectancy',ax=axis()).set_title('Life Expectancy')
Freedom to make life choices seems to have slightly increased, although in general this can be considered to remain stable. These values seem to fluctuate the most of any feature on a year-to-year basis, which is likely explained by its inherently qualitative nature. Individual people in a country likely have different perceptions of freedom, and even different baselines for how much freedom is to be expected.
ax = sns.violinplot(data=merged_data,x='Year',y='Freedom',ax=axis()).set_title('Freedom')
Generosity has increased in general, although outliers have become less generous over time. This is the most consistent of the features globally, with the majority of the globe falling into a distribution range that is roughly 0.2 units wide. It is possible that countries look to each other for an expectation of generosity, resulting in this uniformity.
ax = sns.violinplot(data=merged_data,x='Year',y='Generosity',ax=axis()).set_title('Generosity')
Perceived corruption has remained generally stable in the past 5 years, with unusually high values for 2015. It seems that this is low for the majority of the globe, with the exception of a few outlier countries that experience high corruption.
ax = sns.violinplot(data=merged_data,x='Year',y='Corruption',ax=axis()).set_title('Corruption')
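The visual impressions from the violin plots can be cross-checked numerically. The following is a minimal sketch on synthetic numbers (the `sample` frame is made up and merely mimics the `Year` and `Score` columns of `merged_data`); it summarizes each year by mean, median, and standard deviation.

```python
import pandas as pd

# Synthetic stand-in for merged_data; the numbers are made up for illustration.
sample = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016],
    "Score": [5.0, 6.0, 5.5, 6.5],
})

# One row per year with the mean, median, and standard deviation of Score.
yearly = sample.groupby("Year")["Score"].agg(["mean", "median", "std"])
print(yearly)
```

Applied to the real data, the same call on each feature column would give a table to read alongside the plots.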
Next, we can look at the amount of correlation between features in order to see how related they are to one another. Specifically, we are most interested in which features correlate most strongly with the happiness score.
#Restrict the correlation to numeric columns, since the Country column is text
display(merged_data.corr(numeric_only=True))
We can visualize this correlation data through a correlation matrix, which will make it easier for us to compare values at a glance.
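#Compute the correlation on numeric columns only so the labels match the matrix
corr = merged_data.corr(numeric_only=True)
f = plt.figure(figsize=(12, 8))
ax=f.add_subplot()
f.colorbar(ax.matshow(corr))
#Place one tick per column so that every row and column gets its own label
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)
ax.tick_params(labelsize=14, rotation = 45)
plt.show()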
From this, we can see that happiness score correlates most strongly with GDP, life expectancy, social support, and freedom in that order. We know from our previous visualization that GDP, social support, and freedom have all remained generally stable, while life expectancy has increased. This supports the observation that global happiness has increased slightly over time.
Interestingly, these four features are also strongly correlated with one another, while freedom, generosity, and corruption form a second group of mutually correlated features. This seems to suggest that the two groups vary somewhat independently. In other words, while it is unlikely to have a country with high GDP but low life expectancy, it is possible to have a country with high GDP and low freedom, which makes sense.
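To read the ordering off the matrix without relying on colors, the correlations with the score can be sorted directly. This is a small sketch on made-up values; the `sample` frame is hypothetical and only borrows three column names from `merged_data`.

```python
import pandas as pd

# Made-up values for illustration; only the column names mirror the real data.
sample = pd.DataFrame({
    "Score": [4.0, 5.0, 6.0, 7.0],
    "GDP": [0.5, 0.9, 1.1, 1.4],
    "Generosity": [0.3, 0.1, 0.2, 0.2],
})

# Correlation of every other numeric column with Score, strongest first.
corr_with_score = (
    sample.corr(numeric_only=True)["Score"]
    .drop("Score")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_score)
```

Sorting by absolute value keeps strongly negative correlations near the top, which matters for features such as rank that move opposite to the score.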
These global trends are interesting, but it is likely that they will vary across different regions. We can look more closely to see if there are hidden regional trends that are otherwise obfuscated. The data distribution is too thin to effectively visualize with a violinplot, so let's use a boxplot instead.
Using the 2015 and 2020 data, we can compare how happiness scores have trended by region. There are some discrepancies in region labeling, so we can fix this by relabeling the regions, preferring the 2015 values where available.
#Adjust the column labels for consistency
data2020=data2020.rename(columns={'Country name':'Country','Regional indicator':'Region','Ladder score':'Happiness Score'})
#Create a union of region values from 2015 and 2020, with a preference for 2015 if available
regions= data2015[['Country','Region']].merge(data2020[['Country','Region']],on='Country',how='outer',suffixes=('_2015','_2020'))
regions['Region']=regions['Region_2015'].fillna(regions['Region_2020'])
#Build a country-to-region lookup and assign the new region values in another column
region_lookup=regions.set_index('Country')['Region']
data2015['Fixed Region']=data2015['Country'].map(region_lookup)
data2020['Fixed Region']=data2020['Country'].map(region_lookup)
data2015.boxplot(column=['Happiness Score'],by='Fixed Region',figsize=(20,5))
plt.xticks(rotation=45)
plt.show()
data2020.boxplot(column=['Happiness Score'],by='Fixed Region',figsize=(20,5))
plt.xticks(rotation=45)
plt.show()
Based on this, we can see that Central/Eastern Europe, Southern Asia, Sub-Saharan Africa, and Western Europe all increased in happiness over time. The Middle East and Northern Africa as well as North America both decreased. East Asia, Latin America, and Southeast Asia all decreased in spread.
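The comparison between the two boxplots is made by eye; a numeric version could look like the sketch below. The two small frames and the helper `median_change` are hypothetical, and the values are invented purely to show the calculation.

```python
import pandas as pd

# Invented values; the column names mirror those used for the boxplots.
early = pd.DataFrame({
    "Fixed Region": ["X", "X", "Y"],
    "Happiness Score": [5.0, 6.0, 4.0],
})
late = pd.DataFrame({
    "Fixed Region": ["X", "X", "Y"],
    "Happiness Score": [5.5, 6.5, 3.5],
})

def median_change(early, late, by="Fixed Region", value="Happiness Score"):
    """Per-region median in `late` minus per-region median in `early`, ascending."""
    diff = late.groupby(by)[value].median() - early.groupby(by)[value].median()
    return diff.sort_values()

change = median_change(early, late)
print(change)
```

Regions present in only one of the two frames would come out as NaN, which is a useful flag for labeling mismatches.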
Taking the 2020 data, we can compare different features by region for trends. For brevity, we will only examine the two features with the greatest correlation with score: GDP and life expectancy.
The trend is generally similar to the happiness data shown above, although there is less spread overall in each region. The Middle East and North Africa has a higher GDP per capita than its score would suggest.
data2020.boxplot(column=['GDP per capita'],by='Fixed Region',figsize=(20,5))
plt.xticks(rotation=45)
plt.show()
Again, the trend is similar to score but with less spread in each region. Additionally, South Asia and East Asia have higher life expectancies than their scores suggest.
data2020.boxplot(column=['Healthy life expectancy'],by='Fixed Region',figsize=(20,5))
plt.xticks(rotation=45)
plt.show()
Based on the correlation matrix, we hypothesize that there is a relationship between our happiness score and the other features. Let's see if we can use a model to fit and predict the score values over time. We will only use the four variables most strongly correlated with score, along with the year values. The rows are also shuffled so that the cross-validation folds are not ordered by year.
variables = ['GDP','Social_Support','Life_Expectancy','Freedom','Year']
data_random = merged_data.sample(frac=1).reset_index(drop=True)
We first try a linear model as a predictor and evaluate the score using 10-fold cross validation. It's not bad, with the R-squared value averaging around 0.75 with a standard deviation around 0.04.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
import statistics as stats
kf = KFold(n_splits=10)
def kfold(data_random,model,variables,predict):
    scores=[]
    #Generate indices for cross validation
    for train_index, test_index in kf.split(data_random):
        #Generate training and test data based on indices
        X_train, X_test = data_random.iloc[train_index][variables], data_random.iloc[test_index][variables]
        y_train, y_test = data_random.iloc[train_index][predict],data_random.iloc[test_index][predict]
        #Fit the model and record the R-squared score on the held-out fold
        model.fit(X_train,y_train)
        score = model.score(X_test,y_test)
        scores.append(score)
    #Return the list of per-fold scores
    return scores
model = LinearRegression()
results = kfold(data_random,model,variables,'Score')
print("Average Score:",stats.mean(results))
print("Standard Deviation:",stats.stdev(results))
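For comparison, scikit-learn ships a helper that performs the same loop. The sketch below uses `cross_val_score` with a shuffled `KFold` on synthetic data; the feature matrix, coefficients, and noise level are assumptions chosen only to make the example self-contained, so the numbers it prints say nothing about the happiness data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem: 4 features with known linear weights plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, 0.5, 0.3, 0.2]) + rng.normal(scale=0.1, size=200)

# shuffle=True lets KFold randomize the rows, so no separate shuffling step is needed.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
r2_scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("Average R-squared:", r2_scores.mean())
print("Standard deviation:", r2_scores.std(ddof=1))
```

`cross_val_score` fits a fresh clone of the estimator on each fold, so no fitted state carries over between folds.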
Can we do better than this? Next we'll try a random forest with the default 100 decision trees. This does better, with an average R-squared of 0.8 and standard deviation of around 0.03.
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor()
forest_scores = kfold(data_random,forest,variables,'Score')
print("Average Score:",stats.mean(forest_scores))
print("Standard Deviation:",stats.stdev(forest_scores))
This performance is alright, but it is likely that there are confounding factors across regions that muddle an overall trend. Let's select a single region to see if the prediction is any better. Arbitrarily, we'll pick the Middle East and Northern Africa.
region = data2015.groupby('Region').get_group('Middle East and Northern Africa')['Country']
region_data = pd.concat([i.loc[i['Country'].isin(region)] for i in data])
display(region_data)
Running the same random forest regression as before, the average R-squared remains around 0.8, although the standard deviation increases to around 0.12. Interestingly, this suggests that the model fits the global data more consistently than a specific region, possibly in part because each regional fold contains far fewer samples.
region_sample = region_data.sample(frac=1).reset_index(drop=True)
region_forest = kfold(region_sample,forest,variables,'Score')
print("Average Score:",stats.mean(region_forest))
print("Standard Deviation:",stats.stdev(region_forest))
Out of curiosity, how effective is happiness score as a predictor for other features? We know the correlation, but we can also use our best-performing random forest model on global data to examine the bidirectionality of the relationships.
for value in ['GDP','Social_Support','Life_Expectancy','Freedom','Generosity','Corruption']:
    forest_scores = kfold(data_random,forest,['Score'],value)
    print("Average Score for {}:".format(value),stats.mean(forest_scores))
It turns out, happiness is a pretty bad predictor for its own contributing features. This makes sense; it's inherently a loss of dimensionality due to integrating a bunch of different features into one value.
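The dimensionality argument above can be illustrated with a toy example. This sketch assumes a synthetic "score" that is simply the sum of two independent features, which is a deliberate simplification and not how the real Happiness Score is constructed; the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two independent synthetic features and a score that adds them together.
rng = np.random.default_rng(0)
feature_a = rng.normal(size=2000)
feature_b = rng.normal(size=2000)
score = feature_a + feature_b

# Forward direction: both features together determine the score exactly.
features = np.column_stack([feature_a, feature_b])
r2_forward = LinearRegression().fit(features, score).score(features, score)

# Backward direction: the score alone cannot separate the two contributions.
score_column = score.reshape(-1, 1)
r2_backward = LinearRegression().fit(score_column, feature_a).score(score_column, feature_a)

print("Features -> score R-squared:", r2_forward)
print("Score -> feature R-squared:", r2_backward)
```

With two equally weighted independent features, the backward fit recovers roughly half of the variance, since many different feature combinations produce the same score.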
We analysed the UN's World Happiness Report data from 2015 through 2020. We adjusted and reformatted columns to be consistent across years and visualized the relationships between happiness score and various factors. We found that score has a weakly positive correlation with time, suggesting that global happiness does tend to increase over time. It most strongly relates to wealth, health, social support, and freedom, which was confirmed by a correlation matrix. In general, happiness is highest in the Americas, Western Europe, and Australia, which are commonly perceived as western countries. Happiness is lowest in Africa and Asia. Finally, we used a linear regression and a random forest to fit the data with respect to GDP, life expectancy, social support, freedom, and time. The random forest performed the best, and was more accurate on global data than regional data.
In general, the relatively short timeframe of the report makes it difficult to determine any trends. Furthermore, it is likely that there is a great deal of variance in how values are reported across countries, due to differences in culture or perception. This degree of inconsistency makes it difficult to compare values between countries and in general causes reported scores to not deviate much from year to year. Ultimately, since these values are largely qualitative and subjective, it is difficult to use them to quantitatively predict the future. Instead, the study should be used to holistically assess the performance of different countries, ideally across a timespan of several centuries.
Of greater interest are the features that had the greatest correlation with a high level of happiness. While wealth and health were both expected to be influential, it was surprising that the correlation with social support was so high and the correlation with perceived corruption so low. This makes sense, however, since humans are innately social beings, and the average person is routinely influenced more by their immediate community than by corruption in government or bureaucracy.
One final thing of note is that the data for 2020 was collected in March, prior to the brunt of the impact from the COVID-19 pandemic. It is likely that the corresponding scores for 2021 would be much lower than the trend would predict as a result. While overall this would be expected to be a temporary deviation from the growth curve, it serves as an example that these models are only useful over the long term, both in terms of collecting useful data and in predicting future trends.
Kaggle data: https://www.kaggle.com/londeen/world-happiness-report-2020
World Happiness Report site: https://worldhappiness.report/
Another Study on World Happiness Data: https://rpubs.com/LeonaAnn/645318