This is the third and last part of my article series on how to get started with Machine Learning on GCP. You will find the other articles here: Part 1 and Part 2.
So far we have taken data from a source system and uploaded it into Google BigQuery. In the second article, we used the interactive notebook environment on GCP to explore our data and identify some issues. Finally, we used Google Dataflow to address those issues in a streaming data pipeline and load the results into a datastore of clean, quality-controlled data.
In this last part we will focus on building a model and demonstrating how one can use the BigQuery ML TRANSFORM clause to reuse code for feature engineering, model training, and making predictions on new data.
Feature Engineering and Selection
Feature engineering is a process where a data scientist uses his or her domain knowledge to create new data from existing data. This new data can often be used to enhance the performance of machine learning models.
A different approach is embedded feature engineering, where this process is automated: either by the model itself during training, or by some other pre-processing step. This approach allows one to get started quickly, which is our focus for this article.
Google BigQuery has an automated ML offering called BigQuery ML with a TRANSFORM clause; this transforms the input data and creates additional features for the model during training. The transformations are then saved in the model, so the input data does not have to contain the new features when the model is later used on new data.
For feature selection we will also use an embedded method: we will train a tree model using XGBoost. XGBoost includes some embedded feature selection during training, using L1 and/or L2 regularization to penalize features which do not contribute to a better prediction. Regularization also helps reduce the model's variance, lowering the risk of overfitting.
The amount of regularization applied is controlled by the L1_REG and L2_REG parameters in the CREATE MODEL statement.
Using these two methods, we can automate much of the modeling process.
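As a preview of the CREATE MODEL statements below, these regularization parameters simply appear as lines in the OPTIONS clause. The excerpt here uses the values from the second model trained later in this article:

-- Excerpt of an OPTIONS clause for a boosted tree model:
OPTIONS(MODEL_TYPE = 'BOOSTED_TREE_REGRESSOR',
        L1_REG = 0.1,  -- strength of L1 (lasso) regularization
        L2_REG = 2,    -- strength of L2 (ridge) regularization
        ...)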
BigQuery ML
BigQuery ML is a service on GCP which allows you to create and use machine learning models using standard SQL. It automates many parts of the training process, such as hyperparameter tuning and in some cases even data pre-processing, saving tremendous amounts of time in the “getting started” phase.
BigQuery ML offers solutions for a range of problems such as time series predictions, regression, classification, clustering and even product recommendations. Whatever your case might be, chances are you can get started fairly quickly and easily with machine learning.
BigQuery is available in the console and ready to use straight away: just navigate to BigQuery in the console and we can build our first model.
To build a model, one simply uses the CREATE MODEL statement:
CREATE OR REPLACE MODEL btc_dataset.mymodel
OPTIONS(MODEL_TYPE = 'BOOSTED_TREE_REGRESSOR',
        BOOSTER_TYPE = 'GBTREE',
        NUM_PARALLEL_TREE = 1,
        MAX_ITERATIONS = 50,
        EARLY_STOP = TRUE,
        SUBSAMPLE = 0.85,
        DATA_SPLIT_COL = 'datetime',
        DATA_SPLIT_METHOD = 'SEQ',
        INPUT_LABEL_COLS = ['target']) AS
SELECT *
FROM (
  SELECT datetime, open, high, low, close, volume,
         LEAD(close, 5) OVER(ORDER BY datetime ASC) AS target
  FROM btc_dataset.btc_processed_data
)
WHERE target IS NOT NULL
ORDER BY datetime ASC
The OPTIONS clause gives us the opportunity to tune some of the parameters of the model. For now, let's focus on DATA_SPLIT_COL, DATA_SPLIT_METHOD and INPUT_LABEL_COLS.
The split options tell the pre-processing part of the modeling which column to use when splitting the data into a training and a test dataset. The training dataset is used for the actual training of the model, and the test set is then used to evaluate it.
Since our case has some “time” embedded in it, it is important not to “cheat” and let the model get a glimpse of the future. To avoid this we specify the split method, which in our case should be sequential. If it were random, which is often the preferred method of splitting data, the training data could contain data points which occurred after some of the data in the test set, giving the model an unfair advantage.
INPUT_LABEL_COLS specifies which column we want to predict. In this case we want to predict the close price 5 minutes into the future.
For more information on the options, refer to the official docs.
Now just hit “Run” and our model will be trained. Once it has finished, we must evaluate the performance of our model. This is done using the ML.EVALUATE function.
SELECT * FROM ML.EVALUATE(MODEL `btc_dataset.mymodel`)
This query returns a set of performance metrics by testing the model on the test fraction of the data we specified in the training query. You can, however, specify another set of data for evaluating the model. Read more here.
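To evaluate against data other than the automatic test split, you can pass a query as a second argument to ML.EVALUATE. A minimal sketch, reusing the same label construction as in training and keeping only rows after an illustrative cutoff date:

-- Evaluate the model on a custom dataset; the cutoff date is illustrative.
SELECT *
FROM ML.EVALUATE(MODEL `btc_dataset.mymodel`,
  (SELECT *
   FROM (
     SELECT datetime, open, high, low, close, volume,
            LEAD(close, 5) OVER(ORDER BY datetime ASC) AS target
     FROM `btc_dataset.btc_processed_data`
   )
   WHERE target IS NOT NULL
     AND datetime >= '2021-06-01'))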
From here, you can add features and tune the options and see if it improves the results of the model. Interpretation of these values will vary from case to case, and I will not cover that here.
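While tuning, it can also be useful to look at the training run itself. One option is BigQuery ML's ML.TRAINING_INFO function, which for a boosted tree model reports the loss per iteration, so you can see for example whether EARLY_STOP kicked in:

-- Inspect loss per boosting iteration for the trained model.
SELECT iteration, loss, eval_loss, duration_ms
FROM ML.TRAINING_INFO(MODEL `btc_dataset.mymodel`)
ORDER BY iteration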
Adding a TRANSFORM
Let's continue and add the TRANSFORM clause for some simple feature engineering when training our model. There are several functions available in the TRANSFORM clause; you can find them all here.
We are going to do a simple expansion of our feature set using polynomial expansion.
CREATE OR REPLACE MODEL btc_dataset.mymodel2
TRANSFORM(ML.POLYNOMIAL_EXPAND(STRUCT(open, high, low, close, volume)) AS output,
          datetime,
          target)
OPTIONS(MODEL_TYPE = 'BOOSTED_TREE_REGRESSOR',
        BOOSTER_TYPE = 'GBTREE',
        NUM_PARALLEL_TREE = 1,
        MAX_ITERATIONS = 50,
        EARLY_STOP = TRUE,
        L1_REG = 0.1,
        L2_REG = 2,
        SUBSAMPLE = 0.85,
        DATA_SPLIT_COL = 'datetime',
        DATA_SPLIT_METHOD = 'SEQ',
        INPUT_LABEL_COLS = ['target']) AS
SELECT *
FROM (
  SELECT datetime, open, high, low, close, volume,
         LEAD(close, 5) OVER(ORDER BY datetime ASC) AS target
  FROM btc_dataset.btc_processed_data
)
WHERE target IS NOT NULL
ORDER BY datetime ASC
Polynomial expansion takes every polynomial combination of the input features and outputs the set of all those combinations. This transformation is saved in the new model, saving us the trouble of having to rewrite it when predicting on new data. We can now use ML.EVALUATE again to see whether this model performs any better than the last one.
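Since both models are boosted tree regressors and return the same evaluation metrics, a simple way to compare them is to put the two evaluations side by side in one result set; a small sketch:

-- Compare the evaluation metrics of the two models in one result set.
SELECT 'mymodel' AS model, * FROM ML.EVALUATE(MODEL `btc_dataset.mymodel`)
UNION ALL
SELECT 'mymodel2' AS model, * FROM ML.EVALUATE(MODEL `btc_dataset.mymodel2`)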
Making predictions
To make predictions on new data, simply use the ML.PREDICT function. All you have to make sure of is that the query providing the data contains the same fields as the input used when training the model.
SELECT *
FROM ML.PREDICT(MODEL `btc_dataset.mymodel2`,
  (SELECT *
   FROM `btc_dataset.btc_processed_data`
   ORDER BY datetime DESC
   LIMIT 1))
And voilĂ , we have our first prediction of the “future”.
Since our model now contains all the necessary data transformations, the input dataset in the SELECT clause only needs to contain the same columns as the input data used for model training, making it very easy to make new predictions. As we populate the btc_processed_data table with new data in the future, it will be a breeze to make predictions of the future price based on that new data.
In a future article, I will show how you can handle change data capture to make sure your dataset is always up to date, allowing you to make real predictions of the future.
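If you want to keep predictions around rather than just reading them in the console, one option is to materialize them into a table. A minimal sketch, assuming one row per minute and a hypothetical btc_dataset.btc_predictions table:

-- Write predictions for the most recent rows into a (hypothetical) table.
-- ML.PREDICT adds a predicted_target column, named after the label column.
CREATE OR REPLACE TABLE btc_dataset.btc_predictions AS
SELECT datetime, predicted_target
FROM ML.PREDICT(MODEL `btc_dataset.mymodel2`,
  (SELECT *
   FROM `btc_dataset.btc_processed_data`
   ORDER BY datetime DESC
   LIMIT 60))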
Summary and disclaimer
In this series of articles, the focus has been on how you can use the services and tools offered on GCP to get started quickly with machine learning. We have covered data ingestion, data transformation and exploration, feature engineering and selection, and lastly modeling and prediction. The aim was never to create a good model for price prediction: XGBoost models are not very good at extrapolating, which gives them a hard time predicting values outside the range seen in the training set.
If we were to use the models we have created in this series, we would most likely have a bad time. Please refrain from doing so.
I hope that I have shown the value of the offerings on GCP when it comes to working with data and machine learning, and perhaps made it easier for you to get started with ML on GCP.