Analyzing Yearly and Monthly Temperature Changes in India, Canada, and Brazil
Setup
We start by importing five necessary modules:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
from sklearn.linear_model import LinearRegression
We can now set up a database and then read in CSV tables for temperature, country, and station data using pd.read_csv()
.
conn = sqlite3.connect('climate.db')
# Create three tables using the CSV files
temps_df = pd.read_csv("./_includes/temps_stacked.csv")
countries_df = pd.read_csv("./_includes/countries.csv")
stations_df = pd.read_csv("./_includes/station-metadata.csv")
# For writing months in the plot titles
months = ['January','February','March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November',
'December']
Next, we add our three tables to the database.
# Uses the SQLite database connection to create tables
temps_df.to_sql("temperatures", conn, if_exists="replace",
index=False)
countries_df.to_sql("countries", conn, if_exists="replace",
index=False)
stations_df.to_sql("stations", conn, if_exists="replace",
index=False)
Let us first check out the information in each table that might be necessary for our visualization.
stations_df.head()
ID | LATITUDE | LONGITUDE | STNELEV | NAME | |
---|---|---|---|---|---|
0 | ACW00011604 | 57.7667 | 11.8667 | 18.0 | SAVE |
1 | AE000041196 | 25.3330 | 55.5170 | 34.0 | SHARJAH_INTER_AIRP |
2 | AEM00041184 | 25.6170 | 55.9330 | 31.0 | RAS_AL_KHAIMAH_INTE |
3 | AEM00041194 | 25.2550 | 55.3640 | 10.4 | DUBAI_INTL |
4 | AEM00041216 | 24.4300 | 54.4700 | 3.0 | ABU_DHABI_BATEEN_AIR |
countries_df.head()
FIPS 10-4 | ISO 3166 | Name | |
---|---|---|---|
0 | AF | AF | Afghanistan |
1 | AX | - | Akrotiri |
2 | AL | AL | Albania |
3 | AG | DZ | Algeria |
4 | AQ | AS | American Samoa |
temps_df.head()
ID | Year | Month | Temp | |
---|---|---|---|---|
0 | ACW00011604 | 1961 | 1 | -0.89 |
1 | ACW00011604 | 1961 | 2 | 2.36 |
2 | ACW00011604 | 1961 | 3 | 4.72 |
3 | ACW00011604 | 1961 | 4 | 7.73 |
4 | ACW00011604 | 1961 | 5 | 11.28 |
Accessing Country-Level Data
Now, let’s write a function called query_climate_database
that will return the subset of a country’s station observations that falls on the given month
and in between year_begin
and year_end
. We can achieve this with an SQL command which additionally finds the correct country records by matching the country’s FIPS 10-4 code to its name in the countries
dataframe. We can then inner join this filtered table with the stations table on the station ID.
def query_climate_database(country, year_begin, year_end, month):
'''Returns all temperature observations from a country
between year_begin and year_end for a given month'''
# Identify the given country's FIPS 10-4 code
code = countries_df.loc[countries_df["Name"] \
==country]["FIPS 10-4"].iloc[0]
# Filter the temperatures table by the FIPS 10-4 code and
# the given month and year bounds
cmd = "SELECT * FROM temperatures \
WHERE SUBSTRING(id,1,2) = '" + str(code) + \
"' AND year >= " + str(year_begin) + \
" AND year <= " + str(year_end) + " AND month = " + \
str(month)
df = pd.read_sql(cmd, conn)
# Inner join on stations to return only the relevant records
climate_df = pd.merge(df, stations_df, how="inner",
on=["ID", "ID"])
climate_df["Country"] = country
return climate_df[["NAME", "LATITUDE", "LONGITUDE",
"Country", "Year", "Month","Temp"]]
As an example, let us see the table returned by the function above:
query_climate_database('China', 1980, 2000, 1)
NAME | LATITUDE | LONGITUDE | Country | Year | Month | Temp | |
---|---|---|---|---|---|---|---|
0 | AN_XI | 40.50 | 96.0 | China | 1980 | 1 | -9.35 |
1 | AN_XI | 40.50 | 96.0 | China | 1981 | 1 | -8.90 |
2 | AN_XI | 40.50 | 96.0 | China | 1983 | 1 | -10.10 |
3 | AN_XI | 40.50 | 96.0 | China | 1984 | 1 | -11.70 |
4 | AN_XI | 40.50 | 96.0 | China | 1985 | 1 | -10.00 |
... | ... | ... | ... | ... | ... | ... | ... |
8232 | YUANLING | 28.47 | 110.4 | China | 1995 | 1 | 4.15 |
8233 | YUANLING | 28.47 | 110.4 | China | 1996 | 1 | 4.19 |
8234 | YUANLING | 28.47 | 110.4 | China | 1997 | 1 | 4.92 |
8235 | YUANLING | 28.47 | 110.4 | China | 1999 | 1 | 7.68 |
8236 | YUANLING | 28.47 | 110.4 | China | 2000 | 1 | 4.34 |
8237 rows × 7 columns
Let us define an additional query that can be used for additional data visualization. This function will return all temperature records in a year from a given station.
def year_observations(country, year):
'''Returns all observations for a given country and year'''
# Match the country name to its FIPS 10-4 code
code = countries_df.loc[countries_df["Name"] \
==country]["FIPS 10-4"].iloc[0]
# Filter by year and FIPS 10-4 and inner join with
# station data
cmd = "SELECT * FROM temperatures INNER JOIN \
stations ON temperatures.ID = stations.ID WHERE \
temperatures.year = " + str(year) + \
" and SUBSTRING(temperatures.ID,1,2) = '" + \
str(code) + "'"
df = pd.read_sql(cmd, conn)
# ID column is returned twice, so we use iloc
# to remove the duplicate column
return df[["ID", "Year", "Month", "Temp",
"NAME"]].iloc[:,1:6]
year_observations("China", 2005)
ID | Year | Month | Temp | NAME | |
---|---|---|---|---|---|
0 | CHM00050136 | 2005 | 1 | -24.77 | MOHE |
1 | CHM00050136 | 2005 | 2 | -24.03 | MOHE |
2 | CHM00050136 | 2005 | 3 | -11.62 | MOHE |
3 | CHM00050136 | 2005 | 4 | 0.57 | MOHE |
4 | CHM00050136 | 2005 | 5 | 7.44 | MOHE |
... | ... | ... | ... | ... | ... |
4341 | CHXLT999147 | 2005 | 8 | 26.77 | YUANLING |
4342 | CHXLT999147 | 2005 | 9 | 25.32 | YUANLING |
4343 | CHXLT999147 | 2005 | 10 | 17.93 | YUANLING |
4344 | CHXLT999147 | 2005 | 11 | 14.53 | YUANLING |
4345 | CHXLT999147 | 2005 | 12 | 7.32 | YUANLING |
4346 rows × 5 columns
Data Analysis
Our new function will help us better tailor the data visualization segment of this walkthrough. Our first visualization exercise will use plotly
to map out the locations of a country’s stations and provide an estimate of the year-on-year change in temperature at each station.
To do so, let’s first define a function that can be applied to each group of station observations to obtain the desired estimate. This function, model()
, will make use of the OLS linear regression estimate.
def model(df):
'''Returns the slope of a linear regression that estimates
the year-on-year change in temperature for a station'''
return round(LinearRegression().fit(df[['Year']].values,
df[['Temp']].values)\
.coef_[0][0], 4)
Next, we will create the plot using the following steps:
- Fetch the desired table using
query_climate_database
. - Use
groupby()
andapply()
to group the table by station. - Group by station and count number of observations for each station.
- Filter all stations with fewer than
min_obs
observations and inner join with the estimates table. - Plot the data using
scatter_mapbox
.
def temperature_coefficient_plot(country, year_begin,
year_end, month, min_obs,
**kwargs):
'''Return a plotly map that plots the estimate for year-
on-year change in temperature for stations in a country'''
df = query_climate_database(country, year_begin, year_end,
month)
# Group by station (which is uniquely identified by latitude
# and longitude) and fetch the OLS estimate for yearly
# temperature change
regmod = df.groupby(['LATITUDE', 'LONGITUDE'],
as_index=False).apply(model)
agg = pd.DataFrame(data=regmod.values,
columns=['LATITUDE','LONGITUDE',
'Estimated Yearly Increase (°C)'])
<<<<<<< HEAD
# Filter out stations with fewer than min_obs observations
# and inner join with the estimates table
g = df.groupby(['NAME','LATITUDE','LONGITUDE'],
as_index = False).count()
data = g[g.Year >= min_obs][['NAME','LATITUDE','LONGITUDE']]\
.merge(agg, on=['LATITUDE',
'LONGITUDE'],
how='inner')
fig = px.scatter_mapbox(data, lat="LATITUDE", lon="LONGITUDE",
hover_name = 'NAME',
title = "Estimates in Yearly \
Temperature Change in " + \
months[month - 1] + " for Stations \
in <br>" + country + " for Years \
" + str(year_begin) + " - \
" + str(year_end),
color='Estimated Yearly \
Increase (°C)', **kwargs)
return fig
We obtain the desired visual below:
color_map = px.colors.diverging.RdGy_r
fig = temperature_coefficient_plot("India", 1980, 2000, 1,
min_obs = 10,
zoom = 2,
mapbox_style="carto-positron",
color_continuous_scale=\
color_map,
color_continuous_midpoint=0)
pio.write_html(fig, "./_includes/india_estimate.html")
fig.show()
We can run the same function on December in Canada in the years 2000-2005:
color_map = px.colors.diverging.RdGy_r # choose a colormap
fig2 = temperature_coefficient_plot("Canada", 2000, 2005, 12,
min_obs = 5,
zoom = 2,
mapbox_style="carto-positron",
color_continuous_scale=\
color_map,
color_continuous_midpoint=0)
pio.write_html(fig2, "./_includes/canada_estimate.html")
fig2.show()
Other Visualizations
To cover some other types of visualizations, we look to histograms and time plots. Using year_observations()
, we will first write a function to plot a histogram of temperature observations for a given country in a given year:
def country_obs_plot(country, year, **kwargs):
'''Returns a distribution of temperature observations
(°C) for a country with a marginal rug distribution for
each month'''
temp_obs_df = year_observations(country, year)
# gets the histogram
fig = px.histogram(temp_obs_df,
x = "Temp",
color = "Month",
marginal='rug',
title = "Distribution of \
Temperature Observations (°C) by Month \
in " + str(country) + ", " + str(year),
**kwargs)
return fig
Let us view the distribution of temperature observations in Brazil in 2010:
fig3 = country_obs_plot("Canada", 2010)
pio.write_html(fig3, "./_includes/canada_distribution.html")
fig3.show()
As we would expect, the temperature is distributed around lower temperatures in the latest and earliest months of the year.
We can also examine the time-path of temperature over each year for a given country using px.line()
. We will plot only stations with a complete set of data for all years given in the argument.
def country_obs_by_year_plot(country, year_begin, year_end,
**kwargs):
'''Returns time plot for each station in a country for a
given sequence of years, plotting by station'''
# concatenate dataframes for each month using
# query_climate_database
obs_df = pd.concat([query_climate_database(country,
year_begin,
year_end,
i)
for i in range(1,13)],
ignore_index=True)
# Use this to filter out stations with too few observations
g = obs_df.groupby(['NAME','LATITUDE','LONGITUDE'],
as_index = False).count()
data = g[g.Year >= 12 * (year_end - year_begin + 1)]\
[['NAME','LATITUDE','LONGITUDE']]\
.merge(obs_df,
on=['NAME','LATITUDE','LONGITUDE'], how='inner')
return px.line(data,
x = "Month",
y = "Temp",
color = "NAME",
facet_col = "Year",
title = "Over-Year Temperature \
Change by Station for \
" + country + " in Years " +
str(year_begin) + " - " + \
str(year_end), **kwargs)
Let’s generate this plot on Brazil in the years 2000 to 2003. We might expect this one to show a stark difference from the temperature distributions in Canada.
fig4 = country_obs_by_year_plot("Brazil", 2000, 2003)
pio.write_html(fig4, "./_includes/brazil_timeplots.html")
fig4.show()
In contrast, Southern Hemisphere countries reacch a trough in the middle of the year, as evidenced by our time plot.