Analyzing Yearly and Monthly Temperature Changes in India, Canada, and Brazil

Setup

We start by importing five necessary modules:

import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
from sklearn.linear_model import LinearRegression

We can now set up a database and then read in CSV tables for temperature, country, and station data using pd.read_csv().

conn = sqlite3.connect('climate.db')

# Create three tables using the CSV files
temps_df = pd.read_csv("./_includes/temps_stacked.csv")
countries_df = pd.read_csv("./_includes/countries.csv")
stations_df = pd.read_csv("./_includes/station-metadata.csv")

# For writing months in the plot titles
months = ['January','February','March', 'April', 'May', 'June', 
          'July', 'August', 'September', 'October', 'November', 
          'December']

Next, we add our three tables to the database.

# Uses the SQLite database connection to create tables
temps_df.to_sql("temperatures", conn, if_exists="replace", 
                index=False)
countries_df.to_sql("countries", conn, if_exists="replace", 
                    index=False)
stations_df.to_sql("stations", conn, if_exists="replace", 
                   index=False)

Let us first check out the information in each table that might be necessary for our visualization.

stations_df.head()

	ID	LATITUDE	LONGITUDE	STNELEV	NAME
0	ACW00011604	57.7667	11.8667	18.0	SAVE
1	AE000041196	25.3330	55.5170	34.0	SHARJAH_INTER_AIRP
2	AEM00041184	25.6170	55.9330	31.0	RAS_AL_KHAIMAH_INTE
3	AEM00041194	25.2550	55.3640	10.4	DUBAI_INTL
4	AEM00041216	24.4300	54.4700	3.0	ABU_DHABI_BATEEN_AIR

countries_df.head()

	FIPS 10-4	ISO 3166	Name
0	AF	AF	Afghanistan
1	AX	-	Akrotiri
2	AL	AL	Albania
3	AG	DZ	Algeria
4	AQ	AS	American Samoa

temps_df.head()

	ID	Year	Month	Temp
0	ACW00011604	1961	1	-0.89
1	ACW00011604	1961	2	2.36
2	ACW00011604	1961	3	4.72
3	ACW00011604	1961	4	7.73
4	ACW00011604	1961	5	11.28

Accessing Country-Level Data

Now, let’s write a function called query_climate_database that will return the subset of a country’s station observations that falls on the given month and in between year_begin and year_end. We can achieve this with an SQL command which additionally finds the correct country records by matching the country’s FIPS 10-4 code to its name in the countries dataframe. We can then inner join this filtered table with the stations table on the station ID.

def query_climate_database(country, year_begin, year_end, month):
    '''Returns all temperature observations from a country 
    between year_begin and year_end for a given month'''
    
    # Identify the given country's FIPS 10-4 code
    code = countries_df.loc[countries_df["Name"] \
                            ==country]["FIPS 10-4"].iloc[0]
    
    # Filter the temperatures table by the FIPS 10-4 code and 
    # the given month and year bounds
    cmd = "SELECT * FROM temperatures \
        WHERE SUBSTRING(id,1,2) = '" + str(code) + \
        "' AND year >= " + str(year_begin) + \
        " AND year <= " + str(year_end) + " AND month = " + \
        str(month)
    df = pd.read_sql(cmd, conn)
    
    # Inner join on stations to return only the relevant records
    climate_df = pd.merge(df, stations_df, how="inner", 
                          on=["ID", "ID"])
    climate_df["Country"] = country
    
    return climate_df[["NAME", "LATITUDE", "LONGITUDE", 
                       "Country", "Year", "Month","Temp"]]

As an example, let us see the table returned by the function above:

query_climate_database('China', 1980, 2000, 1)

	NAME	LATITUDE	LONGITUDE	Country	Year	Month	Temp
0	AN_XI	40.50	96.0	China	1980	1	-9.35
1	AN_XI	40.50	96.0	China	1981	1	-8.90
2	AN_XI	40.50	96.0	China	1983	1	-10.10
3	AN_XI	40.50	96.0	China	1984	1	-11.70
4	AN_XI	40.50	96.0	China	1985	1	-10.00
...	...	...	...	...	...	...	...
8232	YUANLING	28.47	110.4	China	1995	1	4.15
8233	YUANLING	28.47	110.4	China	1996	1	4.19
8234	YUANLING	28.47	110.4	China	1997	1	4.92
8235	YUANLING	28.47	110.4	China	1999	1	7.68
8236	YUANLING	28.47	110.4	China	2000	1	4.34

8237 rows × 7 columns

Let us define an additional query that can be used for additional data visualization. This function will return all temperature records in a year from a given station.

def year_observations(country, year):
    '''Returns all observations for a given country and year'''
    
    # Match the country name to its FIPS 10-4 code
    code = countries_df.loc[countries_df["Name"] \
                            ==country]["FIPS 10-4"].iloc[0]
    
    # Filter by year and FIPS 10-4 and inner join with 
    # station data
    cmd = "SELECT * FROM temperatures INNER JOIN \
        stations ON temperatures.ID = stations.ID WHERE \
        temperatures.year = " + str(year) + \
        " and SUBSTRING(temperatures.ID,1,2) = '" + \
        str(code) + "'"
    df = pd.read_sql(cmd, conn)
    
    # ID column is returned twice, so we use iloc 
    # to remove the duplicate column
    return df[["ID", "Year", "Month", "Temp", 
               "NAME"]].iloc[:,1:6]

year_observations("China", 2005)

	ID	Year	Month	Temp	NAME
0	CHM00050136	2005	1	-24.77	MOHE
1	CHM00050136	2005	2	-24.03	MOHE
2	CHM00050136	2005	3	-11.62	MOHE
3	CHM00050136	2005	4	0.57	MOHE
4	CHM00050136	2005	5	7.44	MOHE
...	...	...	...	...	...
4341	CHXLT999147	2005	8	26.77	YUANLING
4342	CHXLT999147	2005	9	25.32	YUANLING
4343	CHXLT999147	2005	10	17.93	YUANLING
4344	CHXLT999147	2005	11	14.53	YUANLING
4345	CHXLT999147	2005	12	7.32	YUANLING

4346 rows × 5 columns

Data Analysis

Our new function will help us better tailor the data visualization segment of this walkthrough. Our first visualization exercise will use plotly to map out the locations of a country’s stations and provide an estimate of the year-on-year change in temperature at each station.

To do so, let’s first define a function that can be applied to each group of station observations to obtain the desired estimate. This function, model(), will make use of the OLS linear regression estimate.

def model(df):
    '''Returns the slope of a linear regression that estimates 
    the year-on-year change in temperature for a station'''
    
    return round(LinearRegression().fit(df[['Year']].values, 
                                        df[['Temp']].values)\
                                         .coef_[0][0], 4)

Next, we will create the plot using the following steps:

Fetch the desired table using query_climate_database.
Use groupby() and apply() to group the table by station.
Group by station and count number of observations for each station.
Filter all stations with fewer than min_obs observations and inner join with the estimates table.
Plot the data using scatter_mapbox.

def temperature_coefficient_plot(country, year_begin, 
                                 year_end, month, min_obs, 
                                 **kwargs):
    '''Return a plotly map that plots the estimate for year-
    on-year change in temperature for stations in a country'''
    
    df = query_climate_database(country, year_begin, year_end, 
                                month)
    
    # Group by station (which is uniquely identified by latitude 
    # and longitude) and fetch the OLS estimate for yearly 
    # temperature change
    regmod = df.groupby(['LATITUDE', 'LONGITUDE'], 
                        as_index=False).apply(model)
    agg = pd.DataFrame(data=regmod.values, 
                       columns=['LATITUDE','LONGITUDE',
                                'Estimated Yearly Increase (°C)'])
    
<<<<<<< HEAD
    # Filter out stations with fewer than min_obs observations 
    # and inner join with the estimates table
    g = df.groupby(['NAME','LATITUDE','LONGITUDE'], 
                   as_index = False).count()
    data = g[g.Year >= min_obs][['NAME','LATITUDE','LONGITUDE']]\
                                .merge(agg, on=['LATITUDE',
                                                'LONGITUDE'], 
                                       how='inner')
    fig = px.scatter_mapbox(data, lat="LATITUDE", lon="LONGITUDE", 
                             hover_name = 'NAME', 
                             title = "Estimates in Yearly \
                            Temperature Change in " + \
                            months[month - 1] + " for Stations \
                            in <br>" + country + " for Years \
                            " + str(year_begin) + " - \
                            " + str(year_end),
                             color='Estimated Yearly \
                             Increase (°C)', **kwargs)
    return fig

We obtain the desired visual below:

color_map = px.colors.diverging.RdGy_r

fig = temperature_coefficient_plot("India", 1980, 2000, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=\
                                   color_map,
                                   color_continuous_midpoint=0)

pio.write_html(fig, "./_includes/india_estimate.html")
fig.show()

We can run the same function on December in Canada in the years 2000-2005:

color_map = px.colors.diverging.RdGy_r # choose a colormap

fig2 = temperature_coefficient_plot("Canada", 2000, 2005, 12, 
                                   min_obs = 5,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=\
                                    color_map,
                                   color_continuous_midpoint=0)

pio.write_html(fig2, "./_includes/canada_estimate.html")
fig2.show()

Other Visualizations

To cover some other types of visualizations, we look to histograms and time plots. Using year_observations(), we will first write a function to plot a histogram of temperature observations for a given country in a given year:

def country_obs_plot(country, year, **kwargs):
    '''Returns a distribution of temperature observations 
    (°C) for a country with a marginal rug distribution for 
    each month'''
    
    temp_obs_df = year_observations(country, year)
    
    # gets the histogram
    fig = px.histogram(temp_obs_df, 
                       x = "Temp", 
                       color = "Month", 
                       marginal='rug', 
                       title = "Distribution of \
                       Temperature Observations (°C) by Month \
                       in " + str(country) + ", " + str(year),
                       **kwargs)
    return fig

Let us view the distribution of temperature observations in Brazil in 2010:

fig3 = country_obs_plot("Canada", 2010)

pio.write_html(fig3, "./_includes/canada_distribution.html")
fig3.show()

As we would expect, the temperature is distributed around lower temperatures in the latest and earliest months of the year.

We can also examine the time-path of temperature over each year for a given country using px.line(). We will plot only stations with a complete set of data for all years given in the argument.

def country_obs_by_year_plot(country, year_begin, year_end, 
                            **kwargs):
    '''Returns time plot for each station in a country for a 
    given sequence of years, plotting by station'''
    
    # concatenate dataframes for each month using 
    # query_climate_database
    obs_df = pd.concat([query_climate_database(country, 
                                               year_begin, 
                                               year_end, 
                                               i) 
                        for i in range(1,13)], 
                       ignore_index=True)
    
    # Use this to filter out stations with too few observations
    g = obs_df.groupby(['NAME','LATITUDE','LONGITUDE'], 
                       as_index = False).count()
    data = g[g.Year >= 12 * (year_end - year_begin + 1)]\
     [['NAME','LATITUDE','LONGITUDE']]\
     .merge(obs_df, 
           on=['NAME','LATITUDE','LONGITUDE'], how='inner')

    return px.line(data, 
                   x = "Month", 
                   y = "Temp", 
                   color = "NAME", 
                   facet_col = "Year", 
                   title = "Over-Year Temperature \
                   Change by Station for \
                   " + country + " in Years " + 
                   str(year_begin) + " - " + \
                   str(year_end), **kwargs)

Let’s generate this plot on Brazil in the years 2000 to 2003. We might expect this one to show a stark difference from the temperature distributions in Canada.

fig4 = country_obs_by_year_plot("Brazil", 2000, 2003) 

pio.write_html(fig4, "./_includes/brazil_timeplots.html")
fig4.show()

In contrast, Southern Hemisphere countries reacch a trough in the middle of the year, as evidenced by our time plot.

Written on April 3, 2022