Analyzing Yearly and Monthly Temperature Changes in India, Canada, and Brazil

Setup

We start by importing five necessary modules:

import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
from sklearn.linear_model import LinearRegression

We can now set up a database and then read in CSV tables for temperature, country, and station data using pd.read_csv().

conn = sqlite3.connect('climate.db')

# Create three tables using the CSV files
temps_df = pd.read_csv("./_includes/temps_stacked.csv")
countries_df = pd.read_csv("./_includes/countries.csv")
stations_df = pd.read_csv("./_includes/station-metadata.csv")

# For writing months in the plot titles
months = ['January','February','March', 'April', 'May', 'June', 
          'July', 'August', 'September', 'October', 'November', 
          'December']

Next, we add our three tables to the database.

# Uses the SQLite database connection to create tables
temps_df.to_sql("temperatures", conn, if_exists="replace", 
                index=False)
countries_df.to_sql("countries", conn, if_exists="replace", 
                    index=False)
stations_df.to_sql("stations", conn, if_exists="replace", 
                   index=False)

Let us first check out the information in each table that might be necessary for our visualization.

stations_df.head()
ID LATITUDE LONGITUDE STNELEV NAME
0 ACW00011604 57.7667 11.8667 18.0 SAVE
1 AE000041196 25.3330 55.5170 34.0 SHARJAH_INTER_AIRP
2 AEM00041184 25.6170 55.9330 31.0 RAS_AL_KHAIMAH_INTE
3 AEM00041194 25.2550 55.3640 10.4 DUBAI_INTL
4 AEM00041216 24.4300 54.4700 3.0 ABU_DHABI_BATEEN_AIR
countries_df.head()
FIPS 10-4 ISO 3166 Name
0 AF AF Afghanistan
1 AX - Akrotiri
2 AL AL Albania
3 AG DZ Algeria
4 AQ AS American Samoa
temps_df.head()
ID Year Month Temp
0 ACW00011604 1961 1 -0.89
1 ACW00011604 1961 2 2.36
2 ACW00011604 1961 3 4.72
3 ACW00011604 1961 4 7.73
4 ACW00011604 1961 5 11.28

Accessing Country-Level Data

Now, let’s write a function called query_climate_database that will return the subset of a country’s station observations that falls on the given month and in between year_begin and year_end. We can achieve this with an SQL command which additionally finds the correct country records by matching the country’s FIPS 10-4 code to its name in the countries dataframe. We can then inner join this filtered table with the stations table on the station ID.

def query_climate_database(country, year_begin, year_end, month):
    '''Returns all temperature observations from a country 
    between year_begin and year_end for a given month'''
    
    # Identify the given country's FIPS 10-4 code
    code = countries_df.loc[countries_df["Name"] \
                            ==country]["FIPS 10-4"].iloc[0]
    
    # Filter the temperatures table by the FIPS 10-4 code and 
    # the given month and year bounds
    cmd = "SELECT * FROM temperatures \
        WHERE SUBSTRING(id,1,2) = '" + str(code) + \
        "' AND year >= " + str(year_begin) + \
        " AND year <= " + str(year_end) + " AND month = " + \
        str(month)
    df = pd.read_sql(cmd, conn)
    
    # Inner join on stations to return only the relevant records
    climate_df = pd.merge(df, stations_df, how="inner", 
                          on=["ID", "ID"])
    climate_df["Country"] = country
    
    return climate_df[["NAME", "LATITUDE", "LONGITUDE", 
                       "Country", "Year", "Month","Temp"]]

As an example, let us see the table returned by the function above:

query_climate_database('China', 1980, 2000, 1)
NAME LATITUDE LONGITUDE Country Year Month Temp
0 AN_XI 40.50 96.0 China 1980 1 -9.35
1 AN_XI 40.50 96.0 China 1981 1 -8.90
2 AN_XI 40.50 96.0 China 1983 1 -10.10
3 AN_XI 40.50 96.0 China 1984 1 -11.70
4 AN_XI 40.50 96.0 China 1985 1 -10.00
... ... ... ... ... ... ... ...
8232 YUANLING 28.47 110.4 China 1995 1 4.15
8233 YUANLING 28.47 110.4 China 1996 1 4.19
8234 YUANLING 28.47 110.4 China 1997 1 4.92
8235 YUANLING 28.47 110.4 China 1999 1 7.68
8236 YUANLING 28.47 110.4 China 2000 1 4.34

8237 rows × 7 columns

Let us define an additional query that can be used for additional data visualization. This function will return all temperature records in a year from a given station.

def year_observations(country, year):
    '''Returns all observations for a given country and year'''
    
    # Match the country name to its FIPS 10-4 code
    code = countries_df.loc[countries_df["Name"] \
                            ==country]["FIPS 10-4"].iloc[0]
    
    # Filter by year and FIPS 10-4 and inner join with 
    # station data
    cmd = "SELECT * FROM temperatures INNER JOIN \
        stations ON temperatures.ID = stations.ID WHERE \
        temperatures.year = " + str(year) + \
        " and SUBSTRING(temperatures.ID,1,2) = '" + \
        str(code) + "'"
    df = pd.read_sql(cmd, conn)
    
    # ID column is returned twice, so we use iloc 
    # to remove the duplicate column
    return df[["ID", "Year", "Month", "Temp", 
               "NAME"]].iloc[:,1:6]
year_observations("China", 2005)
ID Year Month Temp NAME
0 CHM00050136 2005 1 -24.77 MOHE
1 CHM00050136 2005 2 -24.03 MOHE
2 CHM00050136 2005 3 -11.62 MOHE
3 CHM00050136 2005 4 0.57 MOHE
4 CHM00050136 2005 5 7.44 MOHE
... ... ... ... ... ...
4341 CHXLT999147 2005 8 26.77 YUANLING
4342 CHXLT999147 2005 9 25.32 YUANLING
4343 CHXLT999147 2005 10 17.93 YUANLING
4344 CHXLT999147 2005 11 14.53 YUANLING
4345 CHXLT999147 2005 12 7.32 YUANLING

4346 rows × 5 columns

Data Analysis

Our new function will help us better tailor the data visualization segment of this walkthrough. Our first visualization exercise will use plotly to map out the locations of a country’s stations and provide an estimate of the year-on-year change in temperature at each station.

To do so, let’s first define a function that can be applied to each group of station observations to obtain the desired estimate. This function, model(), will make use of the OLS linear regression estimate.

def model(df):
    '''Returns the slope of a linear regression that estimates 
    the year-on-year change in temperature for a station'''
    
    return round(LinearRegression().fit(df[['Year']].values, 
                                        df[['Temp']].values)\
                                         .coef_[0][0], 4)

Next, we will create the plot using the following steps:

  1. Fetch the desired table using query_climate_database.
  2. Use groupby() and apply() to group the table by station.
  3. Group by station and count number of observations for each station.
  4. Filter all stations with fewer than min_obs observations and inner join with the estimates table.
  5. Plot the data using scatter_mapbox.
def temperature_coefficient_plot(country, year_begin, 
                                 year_end, month, min_obs, 
                                 **kwargs):
    '''Return a plotly map that plots the estimate for year-
    on-year change in temperature for stations in a country'''
    
    df = query_climate_database(country, year_begin, year_end, 
                                month)
    
    # Group by station (which is uniquely identified by latitude 
    # and longitude) and fetch the OLS estimate for yearly 
    # temperature change
    regmod = df.groupby(['LATITUDE', 'LONGITUDE'], 
                        as_index=False).apply(model)
    agg = pd.DataFrame(data=regmod.values, 
                       columns=['LATITUDE','LONGITUDE',
                                'Estimated Yearly Increase (°C)'])
    
<<<<<<< HEAD
    # Filter out stations with fewer than min_obs observations 
    # and inner join with the estimates table
    g = df.groupby(['NAME','LATITUDE','LONGITUDE'], 
                   as_index = False).count()
    data = g[g.Year >= min_obs][['NAME','LATITUDE','LONGITUDE']]\
                                .merge(agg, on=['LATITUDE',
                                                'LONGITUDE'], 
                                       how='inner')
    fig = px.scatter_mapbox(data, lat="LATITUDE", lon="LONGITUDE", 
                             hover_name = 'NAME', 
                             title = "Estimates in Yearly \
                            Temperature Change in " + \
                            months[month - 1] + " for Stations \
                            in <br>" + country + " for Years \
                            " + str(year_begin) + " - \
                            " + str(year_end),
                             color='Estimated Yearly \
                             Increase (°C)', **kwargs)
    return fig

We obtain the desired visual below:

color_map = px.colors.diverging.RdGy_r

fig = temperature_coefficient_plot("India", 1980, 2000, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=\
                                   color_map,
                                   color_continuous_midpoint=0)

pio.write_html(fig, "./_includes/india_estimate.html")
fig.show()

We can run the same function on December in Canada in the years 2000-2005:

color_map = px.colors.diverging.RdGy_r # choose a colormap

fig2 = temperature_coefficient_plot("Canada", 2000, 2005, 12, 
                                   min_obs = 5,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=\
                                    color_map,
                                   color_continuous_midpoint=0)

pio.write_html(fig2, "./_includes/canada_estimate.html")
fig2.show()

Other Visualizations

To cover some other types of visualizations, we look to histograms and time plots. Using year_observations(), we will first write a function to plot a histogram of temperature observations for a given country in a given year:

def country_obs_plot(country, year, **kwargs):
    '''Returns a distribution of temperature observations 
    (°C) for a country with a marginal rug distribution for 
    each month'''
    
    temp_obs_df = year_observations(country, year)
    
    # gets the histogram
    fig = px.histogram(temp_obs_df, 
                       x = "Temp", 
                       color = "Month", 
                       marginal='rug', 
                       title = "Distribution of \
                       Temperature Observations (°C) by Month \
                       in " + str(country) + ", " + str(year),
                       **kwargs)
    return fig

Let us view the distribution of temperature observations in Brazil in 2010:

fig3 = country_obs_plot("Canada", 2010)

pio.write_html(fig3, "./_includes/canada_distribution.html")
fig3.show()

As we would expect, the temperature is distributed around lower temperatures in the latest and earliest months of the year.

We can also examine the time-path of temperature over each year for a given country using px.line(). We will plot only stations with a complete set of data for all years given in the argument.

def country_obs_by_year_plot(country, year_begin, year_end, 
                            **kwargs):
    '''Returns time plot for each station in a country for a 
    given sequence of years, plotting by station'''
    
    # concatenate dataframes for each month using 
    # query_climate_database
    obs_df = pd.concat([query_climate_database(country, 
                                               year_begin, 
                                               year_end, 
                                               i) 
                        for i in range(1,13)], 
                       ignore_index=True)
    
    # Use this to filter out stations with too few observations
    g = obs_df.groupby(['NAME','LATITUDE','LONGITUDE'], 
                       as_index = False).count()
    data = g[g.Year >= 12 * (year_end - year_begin + 1)]\
     [['NAME','LATITUDE','LONGITUDE']]\
     .merge(obs_df, 
           on=['NAME','LATITUDE','LONGITUDE'], how='inner')

    return px.line(data, 
                   x = "Month", 
                   y = "Temp", 
                   color = "NAME", 
                   facet_col = "Year", 
                   title = "Over-Year Temperature \
                   Change by Station for \
                   " + country + " in Years " + 
                   str(year_begin) + " - " + \
                   str(year_end), **kwargs)

Let’s generate this plot on Brazil in the years 2000 to 2003. We might expect this one to show a stark difference from the temperature distributions in Canada.

fig4 = country_obs_by_year_plot("Brazil", 2000, 2003) 

pio.write_html(fig4, "./_includes/brazil_timeplots.html")
fig4.show()

In contrast, Southern Hemisphere countries reacch a trough in the middle of the year, as evidenced by our time plot.

Written on April 3, 2022