Weather forecast is a complicated process. If you live in an area with lots of oscillation in weather like us in Melbourne, you should always give some chance for the weather to be different from what you see on websites.
The weather is typically forecasted by first gathering a lot of information about the atmosphere, humidity, wind, etc. and then relying on our atmospheric knowledge and a physical model to articulate changes in the near future. But due to our limited understanding of the physical model and the chaotic nature of the atmosphere, it might be unreliable.
Instead of the common approach for this, here we try to scrutinise the idea of entrusting a machine learning model for this purpose. We expect the model to look at the historical data and get a feeling of how the temperature will change in near future, let’s say tomorrow.
The NOAA weather data is one of the publicly-available datasets in Google’s BigQuery and is maintained and updated daily by Googlers. We will use this data along with the recent BigQuery ML to see if it is possible to come up with the forecast.
We will prepare and preprocess the data in BigQuery and rely on BigQuery ML for our machine learning model here. All these are carried out inside BigQuery and this is the beauty of it. So expect to see a lot of queries for training data preparation and pre-processing inside BigQuery, and the final model by BigQuery ML.
Weather data in BigQuery
BigQuery maintains a dataset with the updated National Oceanic and Atmospheric Administration (NOAA) weather data which encompasses the atmospheric data from different stations all over the world. This data contains the climatology data from 1929 to now and is updated daily.
The dataset is `bigquery-public-data.noaa_gsod` and there is a table for each year. So accessing the tables is easy, and by BigQuery’s syntax, it is even possible to access them with regex expressions. On a preliminary study, we observed that there is still sufficient data for the model to train if we limit our training dataset to 2010 onwards. So when creating the dataset we will query the corresponding tables by refering to them as `bigquery-public-data.noaa_gsod.gsod201*`.
Each record of the gsod tables corresponds to an observation from a station. The ‘stn’ field shows the station number which refers us to the stations table where the information about the stations including their geographic location and elevation and their country are kept.
Each observation refers to a specific date and consists of the temperature (min, max, avg), the atmospheric data including wind speed, precipitation, humidity, etc. We will make use of the station and the observation data as our feature inputs to our model.
Data preparation and preprocessing
The training dataset preparation consists of some steps. Firstly, we will limit our training data to the geographic extent of interest. Secondly, we will cast the values to float for the non-float fields. Thirdly, we will handle the missing values, and fourthly, we will prepare our final table with the past 7 days and a label field which is the temperature for tomorrow. By this, I am sure you noticed that our model looks at the temperature and atmospheric data for the past 7 days and predicts the temperature for the upcoming day.
I guess this is more appreciated if you notice that everything is in SQL and inside BigQuery. Just close your eyes and think of making a query that gives you the past 7 days in one record. Self-joins right? Sorry, my friend, that is an anti-pattern in BigQuery and your query takes forever to run. So any other idea how to deal with it? This should be enough boost for you, so bear with me!
Geographic and temporal extent
We will limit our experiment to the data from 2010 onwards and the geographic interest. This is easy, a join and a set of where clause
select lon, lat, elev, stn, temp, ... from `bigquery-public-data.noaa_gsod.gsod201*` gsod ,`bigquery-public-data.noaa_gsod.stations` stns where stns.usaf = gsod.stn and stns.country='AS' and lat > LAT_MIN and lat < LAT_MAX and lon > LON_MIN and lon < LON_MAX
where you can replace the uppercase variables with your extent. A fun point is that Australia is shown by code ‘AS’ in the dataset. But more importantly, we could easily limit our tables to 2010 onwards with the * sign.
Another point for you: if you want to limit your geographic extent to a more complicated polygon rather than the box we used here, you might want to have a look at the BigQuery GIS, which is in Beta at the time of writing this blog.
BigQuery ML treats string fields as categorical values, and that’s not what we want for some fields like elevation of the station. This is easy. Example below
select cast(elev as FLOAT64) as elev , ... from ...
Also we can create the datetime field from the year, month, and day field by
select datetime(year, mo , da , 00 , 00 , 00) as datetime , ... from ...
If you look at the description of the gsod tables, you notice that the missing values are represented by specific values like 99.9 or alike. We will write a query to replace these missing values by the average of the field. An example below
select if( temp = 99.9 , avg_temp, temp) as temp from ... , ( SELECT AVG(temp) AS avg_temp FROM `bigquery-public-data.noaa_gsod.gsod201*` WHERE temp 99.9 )
OLAP instead of self-join and final table
So far we have limited our data to the temporal and geographical extent of interest, cast the fields and replaced the missing values with the average. The most important part of the query is to keep 7 days of data in a single row. If you think about it, the self-join is the solution in conventional databases. But self-joins are among the anti-patterns in BigQuery, and the engine will encounter some difficulty running what you are asking for, especially if there are more than one of them. Instead, the solution is to use the lead online analytical processing (OLAP) function
select temp , LEAD(temp, 1 ) OVER (PARTITION BY stn ORDER BY datetime DESC) AS temp_1, LEAD(temp, 2 ) OVER (PARTITION BY stn ORDER BY datetime DESC) AS temp_2 , ... , LEAD(temp, 7 ) OVER (PARTITION BY stn ORDER BY datetime DESC) AS temp_7 , LEAD(temp, 1 ) OVER (PARTITION BY stn ORDER BY datetime ASC) as LABEL , ... from ... where ...
The lead function returns the value of the value expression (temp here) on a subsequent row. The second parameter is the offset. As an example offset value of 1 will return the previous day’s temperature, and offset value 2 denotes the temperature value for two days ago. Similarly, we can access tomorrow’s weather (LABEL) by sorting the partitions in ascending order.
The partition by stn will tell BigQuery to consider the elements with the same value as one cluster and iterate through them. Hence, the (partition by stn order by datetime desc) will tell BigQuery to consider the rows which denote to the same station as one cluster, and then sort the rows in the cluster by time in descending order and return the temperature values from consecutive rows. This is the right way to self-join in BigQuery!
Here, we talked about different code patterns that are used in the query. Apparently, the final query is a very long query which utilises the code snippets above repetitively to obtain the final features view. Each record of the features view consists of the atmospheric and temperature features for 7 consecutive days and the label which is the temperature of the 8th day. All the fields are clean and ready to be ingested into the linear regression model by BigQuery ML.
Machine learning model by BigQuery ML
BigQuery ML is the recent machine learning module inside BigQuery. At the moment it only consists of the logistic regression for classification and linear regression. Note that at this moment, the aim of the BigQuery ML can be seen to provide an easy and quick approach to implement your first model (your benchmark if you like). But this is a very valuable feature since training and validating a machine learning network is abstracted into a single query.
The training of the linear regression is easy as running the command below.
CREATE or REPLACE MODEL `weather-prediction-dataset.linear_regression` OPTIONS ( model_type="linear_reg" , input_label_cols=["LABEL"] , l2_reg=0.001 , max_iterations=20 , ls_init_learn_rate=0.01 , data_split_method="seq" , data_split_eval_fraction=0.2 , data_split_col="ID" ) select ... from ... where ...
This tells BigQuery to train a linear regression model. All the fields in the query will be considered the input features to the model except the field ID and LABEL. ID is used for sorting the training dataset records, and data_split_eval_fraction will tell BigQuery to take the last 20% of the dataset as the evaluation set. The label field is what the model will predict. This query trains a linear regression model on the first 80% of the data. The remaining 20% of the data is used for evaluation.
We can also report the performance metric on another subset by running
select * from ML.EVALUATE( MODEL `weather-prediction-dataset.linear_regression` , ( select ... from ... where ... ))
and getting a prediction from the model is as easy as
select * from ML.PREDICT ( MODEL `weather-prediction=dataset.linear_regression` , ( select ... from ... where ...))
In this blog, we showed how we can preprocess the weather forecast data and train a linear regression model on top of it. The weather model shows the RMSE of 6.9 Fahrenheit which can be inferred that the simplest linear ML model has been able to yield the temperature prediction by around 7 degrees error on average.
In the next blog in this series, we show how you can use BigQuery for feature selection, and how you can connect the dots and show the final outcome to the end user.