Getting Daily Mean Temperature from Historical Weather Data
I’ve been working on a few home automation projects using historical weather data from the Big Query NOAA Public Weather Dataset to schedule automations, estimate energy usage, or predict when I’ll get my next home heating oil delivery. To take things a step further, I need to know the average mean temperature on a given date.
To do this, we’ll take the last 20 years of NOAA weather data for a given weather station and calculate the mean mean ;) (average temperature by day using the mean temperature values per day). We could just as easily calculate the mean high and low temperature for each day as well.
Let’s get started!
If you have not already done so, be sure to get familiar with the Big Query NOAA GSOD public dataset using my previously published tutorial Finding the Closest Weather Stations. Be sure that you are using a weather station with (complete) “good data”.
You will need the following information:
- stn — Station number for the location
- wban — Historical “Weather Bureau Air Force Navy” number
Let’s start with a simple query to get the mean temperature for each day of the year for the last full year:
SELECT
CAST(temp as Numeric) as meantemp,
CONCAT(year,mo,da) as thedate # Construct a date field
FROM
`bigquery-public-data.noaa_gsod.gsod2019`
WHERE
max != 9999.9 # code for missing data
AND stn = '725037'
AND wban = '94745'
ORDER BY thedate
We CAST the temp value from a FLOAT to NUMERIC to avoid issues when performing calculations (despite what IEEE says). As you can see, we output a simple list of dates (thedate) with the mean temperature per date (temp).
Now that we’ve gotten the mean temperature by date, let’s expand to a period that include several years of data so we can calculate a mean value per day from historical data. Using the BETWEEN operator, we can quickly and easily select a number of years at once.
SELECT
CAST(temp as Numeric) as meantemp,
CONCAT(mo,da) as thedate # Get rid of the Year
FROM
`bigquery-public-data.noaa_gsod.gsod20*` # Use wildcard to select many tables at once
WHERE
max != 9999.9 # code for missing data
AND _TABLE_SUFFIX BETWEEN '00' AND '19' # Limit wildcard tables to 2000-2019
AND stn = '725037'
AND wban = '94745'
We should now have 20 mean temperature values per date:
…
To make this data useful, we we are going to need a way to GROUP data by date (minus the year) in MMDD format. In order to get the mean temperature by date, we are going to need to SUM the meantemp and divide by the COUNT (or number of occurrences) of thedate. By using the COUNT, we cover ourselves in case of missing data.
SELECT
SUM(CAST(temp as Numeric))/COUNT(temp) as mean_mean_temp,
CONCAT(mo,da) as thedate # Get rid of the Year
FROM
`bigquery-public-data.noaa_gsod.gsod20*`
# Use wildcard to select many tables at once
WHERE
max != 9999.9 # code for missing data
AND _TABLE_SUFFIX BETWEEN '00' AND '19'
# Limit wildcard tables to 2000-2019
AND stn = '725037'
AND wban = '94745'GROUP BY thedate
ORDER BY thedate
The result is 366 rows (Don’t forget about the leap years ;) of the mean mean temperature by date.
We can easily do the same for high (max) and low (min) temperature by day as well.
SELECT
SUM(CAST(temp as Numeric))/COUNT(temp) as mean_mean_temp,
SUM(CAST(min as Numeric))/COUNT(min) as mean_min_temp,
SUM(CAST(max as Numeric))/COUNT(max) as mean_max_temp,
CONCAT(mo,da) as thedate # Get rid of the Year
FROM
`bigquery-public-data.noaa_gsod.gsod20*`
# Use wildcard to select many tables at once
WHERE
max != 9999.9 # code for missing data
AND _TABLE_SUFFIX BETWEEN '00' AND '19'
# Limit wildcard tables to 2000-2019
AND stn = '725037'
AND wban = '94745'
GROUP BY thedate
ORDER BY thedate
Since this query uses over 3GB to process, we might consider saving this as its own Big Query table for use is later queries. Let’s click the “Save Results” button and select “BigQuery table”.