Finding the Closest Weather Stations — BigQuery Public Datasets
Use BigQuery Public Datasets, Geography Functions, Wildcard Tables, and ARRAY_AGG & UNNEST to locate and query local historical weather data near any address.
While working on a smart home data project, I found myself searching for public weather data. It turns out that Google BigQuery has a ton of weather data in the NOAA GSOD (Global Surface Summary of the Day Weather Data) Public Datasets — comprised of daily weather data from approximately 30,000 distinct weather stations worldwide.
While it would have been easy enough to filter by country and state, then weed through the list until I found something that looked familiar; this was a great excuse to delve into BigQuery Geography Functions!
Get Latitude & Longitude of Your Location
Let’s start by getting our geo coordinates in Google Maps. You can start by searching for an address, then right-click that point on the map and choose “What’s here?”
The resulting dialog box will show your address and the geo coordinates at the bottom of the screen (Latitude, Longitude). Take note and set aside for now.
BigQuery Public Datasets
Let’s dig into the NOAA GSOD (Global Surface Summary of the Day Weather Data) Public Dataset and add to BigQuery in our Google Cloud project.
Click “VIEW DATASET” which should open a new BigQuery tab and add the noaa_gsod dataset to your project.
Be sure to scroll up and “pin” the bigquery-public-data project.
Let’s Find a Station (the easy way)
Start by expanding the noaa_gsod dataset and scroll to the bottom until you see the stations table.
Let’s run a simple query against the stations table.
SELECT *
FROM `bigquery-public-data.noaa_gsod.stations`
WHERE state = 'ny'
No results found….
Let’s get a list of states.
SELECT DISTINCT(state)
FROM `bigquery-public-data.noaa_gsod.stations`
ORDER BY state asc
So, it looks like state is all caps. Let’s try that search again.
SELECT *
FROM `bigquery-public-data.noaa_gsod.stations`
WHERE state = 'NY'
We can now get a sense for some of the weather stations in the area and confirm that the lat and lon coordinates (latitude and longitude) jive with what we got from Google Maps earlier.
Let’s Find a Station (the hard way)
Time to dig out the latitude and longitude coordinates that we obtained from Google Maps — 41.197, -73.764. Since we’ll be using the BigQuery Geography Function ST_GEOGPOINT we’ll need to reverse the values to longitude, latitude. Let’s start by setting up the ST_GEOGPOINT coordinates and defining the 2 pieces of data that we’ll be comparing — the geo coordinates of my location and weather station data table.
-- Get nearest weather station by geo coordinateswith my_location as (
SELECT ST_GEOGPOINT(-73.764, 41.197) as my_location,
'Chappaqua' as home
), stations as (
SELECT *, ST_GEOGPOINT(lon,lat) as latlon_geo
FROM `bigquery-public-data.noaa_gsod.stations`
)
In the temp table my_location we are setting 2 fields:
- my_location = ST_GEOGPOINT(-73.764, 41.197)
- home = ‘Chappaqua’
In the temp table stations we are grabbing all of the columns in the bigquery-public-data.noaa_gsod.stations table and adding a final column called latlon_geo which is constructed the same way as the my_location field in the my_location temp table — ST_GEOGPOINT(lon,lat).
Let’s try creating an array of all weather stations within a 20-mile radius (32,187 meters). We’ll first include fields from the my_location temp table then construct an array of fields from the stations table and adding in a calculation of distance between my_location and each station using the ST_DISTANCE function.
-- Get nearest weather station by geo coordinateswith my_location as (
SELECT ST_GEOGPOINT(-73.764, 41.197) as my_location,
'Chappaqua' as home
), stations as (
SELECT *, ST_GEOGPOINT(lon,lat) as latlon_geo
FROM `bigquery-public-data.noaa_gsod.stations`
)
--
SELECT ST_ASTEXT(my_location) as my_location,
home,
ARRAY_AGG( # get the closest station
STRUCT(usaf,wban,name,lon,lat,country,state,
ST_DISTANCE(my_location, b.latlon_geo)*0.00062137 as miles)
) as stations
FROM my_location a, stations b
WHERE ST_DWITHIN(my_location, b.latlon_geo, 32187) --meters = 20 miles
GROUP BY my_location, home
The result is an array with the fields we specified from the stations table plus our distance calculation miles.
The data looks good and shows 6 stations within 20 miles in the nested array stations. Surely the closest station would be the best, right? There’s something about the 999999 in the results that make me a bit suspicious. Are there any other data we need to look at to avoid pitfalls of bad data? Can we assume that complete data exists for all stations?? Let’s look further.
Let’s Get to Know Each Station
Now we’ll need to unnest the array so we can join to other tables and query the data more easily. Add the 3 bolded lines below to wrap the main section of our last query. Notice that we are using the UNNEST command to expand the array stations so we can query it like a table.
-- Get nearest weather station by geo coordinateswith my_location as (
SELECT ST_GEOGPOINT(-73.764, 41.197) as my_location,
'Chappaqua' as home
), stations as (
SELECT *, ST_GEOGPOINT(lon,lat) as latlon_geo
FROM `bigquery-public-data.noaa_gsod.stations`
)
--
SELECT home,my_location, st.*,
FROM (
SELECT ST_ASTEXT(my_location) as my_location,
home,
ARRAY_AGG( # get the closest station
STRUCT(usaf,wban,name,lon,lat,country,state,
ST_DISTANCE(my_location, b.latlon_geo)*0.00062137 as miles)
) as stations
FROM my_location a, stations b
WHERE ST_DWITHIN(my_location, b.latlon_geo, 32187) --meters = 20 miles
GROUP BY my_location, home
), UNNEST(stations) as st
Now the results are a bit more manageable and easier to join.
Going back to the other tables in the noaa_gsod dataset, let’s see if we can compare the actual weather data for each station. To do that, we’ll need to join in weather data from a given year. Let’s look at the data for 2020 in the gsod2020 table. Locate and click the gsod2020 table, then click the preview tab. Notice that the usaf and wban fields seem to match up to the stn and wban fields. Let’s join on that, then grab all of the fields from the gsod2020 table. To get a sense for the frequency of data collection, let’s also count the temp (daily mean temperature) field for each station since that’s the data point that I am interested in for my smart home project.
-- Get nearest weather station by geo coordinateswith my_location as (
SELECT ST_GEOGPOINT(-73.764, 41.197) as my_location,
'Chappaqua' as home
), stations as (
SELECT *, ST_GEOGPOINT(lon,lat) as latlon_geo
FROM `bigquery-public-data.noaa_gsod.stations`
), get_closest as (
SELECT home,my_location, st.*,
FROM (
SELECT ST_ASTEXT(my_location) as my_location,
home,
ARRAY_AGG( # get the closest station
STRUCT(usaf,wban,name,lon,lat,country,state,
ST_DISTANCE(my_location, b.latlon_geo)*0.00062137 as miles)
) as stations
FROM my_location a, stations b
WHERE ST_DWITHIN(my_location, b.latlon_geo, 32187) --meters = 20 miles
GROUP BY my_location, home
), UNNEST(stations) as st
)SELECT gc.*, COUNT(temp) as Data_Points
FROM get_closest gc, `bigquery-public-data.noaa_gsod.gsod2020` gs
WHERE max != 9999.9 # code for missing data
AND gc.usaf = gs.stn
AND gc.wban = gs.wbanGROUP BY home, my_location, usaf, wban, name, lon, lat, country, state, miles
ORDER BY miles ASC
It turns out that 4 of the 6 stations do not have data in 2020. We are left with 2 stations that have collected the same amount of data in 2020.
Between the 2 airport weather stations, one seems to be twice as far as the other. But, I need to look at weather data since 2011. Do I need to join in 10 datasets individually (2011–2020)? There has to be an easier way…..
Wildcard Tables
BigQuery Wildcard Tables provide a way to query multiple tables with one FROM statement using wildcards in the table name.
SELECT * FROM `bigquery-public-data.noaa_gsod.gsod20*`
This we be OK if we needed all data from 2000–present, but we only need 2011–present. I guess we could filter out the years that we don’t need, but these are giant tables and querying them all is going to be expensive!
Yikes! Turns out we can also filter the wildcard in the FROM statement with an additional WHERE clause.
AND _TABLE_SUFFIX BETWEEN '11' AND '20'
Well, that’s a bit better… about 12x better (smaller)!
Putting it all together, it should look something like this:
-- Get nearest weather station by geo coordinateswith my_location as (
SELECT ST_GEOGPOINT(-73.764, 41.197) as my_location,
'Chappaqua' as home
), stations as (
SELECT *, ST_GEOGPOINT(lon,lat) as latlon_geo
FROM `bigquery-public-data.noaa_gsod.stations`
), get_closest as (
SELECT home,my_location, st.*,
FROM (
SELECT ST_ASTEXT(my_location) as my_location,
home,
ARRAY_AGG( # get the closest station
STRUCT(usaf,wban,name,lon,lat,country,state,
ST_DISTANCE(my_location, b.latlon_geo)*0.00062137 as miles)
) as stations
FROM my_location a, stations b
WHERE ST_DWITHIN(my_location, b.latlon_geo, 32187) --meters = 20 miles
GROUP BY my_location, home
), UNNEST(stations) as st
) -- Thanks to Felipe Hoffa - https://stackoverflow.com/a/53678307/11748236-- get count of data points from closest stations for 2011-2020
SELECT gc.*, COUNT(temp) as Data_Points
FROM get_closest gc, `bigquery-public-data.noaa_gsod.gsod20*` gs
WHERE max != 9999.9 # code for missing data
AND _TABLE_SUFFIX BETWEEN '11' AND '20'
AND gc.usaf = gs.stn
AND gc.wban = gs.wbanGROUP BY home, my_location, usaf, wban, name, lon, lat, country, state, miles
ORDER BY miles ASC
It turns out that one of the 4 that dropped out in the last query re-appeared since that station collected data over 1532 days between 2011 and 2020. Looking at the 2 remaining stations from our least query, we see that the closer station WESTCHESTER COUNTY AIRPORT has 3 more data points than the further location — 3538 vs 3535.
Winner — WESTCHESTER COUNTY AIRPORT
Bonus Content
For fun, let’s make sure that we have a complete dataset. How many days since 1/1/2011? Sounds like a job for the DATE_DIFF function!
date_diff(Date1, Date2, date_units)
Let’s plug in the dates and see what we get.
SELECT DATE_DIFF('2011-01-01','2020-09-09',day)
Since today’s mean temperature has not yet been computed, 3539 days jives with the 3538 records we found earlier. I’m looking forward to using this data in a project where I’ll compute the K-value to be able to predict when I’ll get my next home heating oil delivery and the number of gallons to be delivered.
Special thanks to Felipe Hoffa who’s query got me started with the UNNEST approach.