Introducing BigQuery ML for building predictive models with SQL

Google’s beta extension performs linear regression forecasting and binary logistic classification in the BigQuery data warehouse

One key to efficient data analysis of big data is to do the computations where the data lives. In some cases, that means running R, Python, Java, or Scala programs in a database such as SQL Server or in a big data environment such as Spark. But that takes some fairly technical programming and data science skills not often found among business analysts or SQL programmers. In addition, if you have to extract, transform, and load your datasets from your data warehouse to another data store for machine learning, you introduce delays in the process.

To that end, Google has announced a beta release of BigQuery ML, a SQL-based extension to its enterprise data warehouse service, for building and deploying machine learning models. This is important because it lowers the barriers to training certain kinds of models and deploying predictive analytic services, both reducing the time required and bringing it within reach of data analysts and statisticians.

One SQL-style statement and six functions have been added to BigQuery's SQL dialect to support machine learning: the CREATE MODEL statement and the functions ML.EVALUATE, ML.ROC_CURVEML.PREDICT, ML.TRAINING_INFO, ML.FEATURE_INFO, and ML.WEIGHTS. Google has produced animated GIFs that show how to create a model with the CREATE MODEL statement and generate predictions from the model using the ML.PREDICT function in a SELECT query.

You might wonder how all the formidable complexity of machine learning can be encapsulated in a SQL statement. The short answer is that it can't, at least not now. BigQuery ML is nota general-purpose neural network modeling tool on the order of TensorFlow.

There are only two supported types of models: linear regression for forecasting (such as "What will next month's sales be?") and binary logistic regression for classification (such as "Will the credit applicant default?"). BigQuery ML can't even handle multiclass logistic regression, although there's a hint in the data structures used that multiclass logistic classification might be supported in a future release.

The way you store data for normal SQL queries isn't appropriate for machine learning, so in preparing features (input variables used for predictions) for a CREATE MODEL statement, BigQuery ML automatically one-hot encodes categorical variables (strings, dates, times, and Booleans) and standardizes numerical variables (numeric, float, and integer). Numerical variables are also standardized for prediction. BigQuery ML can draw data from multiple BigQuery datasets for both training and prediction.

To continue reading this article register now