Snowflake is revolutionising data handling in today's dynamic tech world. Its platform now offers pre-compiled ML models that simplify forecasting and predictive analytics, supports time-series data slicing, and allows seamless model re-training. This makes advanced analytics accessible to analysts and business leaders alike.
Snowflake recently announced these new ML features as generally available, so all new accounts can access them. They offer a standardised way to train models, and the provided ML functions are well suited for base models and ideal for proof-of-concept projects.
This article introduces the concept and provides a demonstration of creating forecast predictions, with a focus on slice variations: three models are compared, trained on the same taxi data but with different time slices: daily, every four hours, and every 15 minutes.
The above graphic shows three forecast models trained on the same data but with different slicing: daily, every four hours, or every 15 minutes.
The idea of pre-compiled ML models as functions
Getting started with machine learning on Snowflake
The basic process for making use of an ML function is the following:
- Prepare time-series data, sliced/grouped into suitable intervals. Store the data object with only the data relevant for the model; this means:
- The object contains only the timestamp and the target column.
- Slicing the data means that every data point is assigned to a unique window slice; use the TIME_SLICE function with a GROUP BY clause in Snowflake SQL.
- Multiple views that slice the data differently can exist in parallel; this is done by creating views/tables with different TIME_SLICE configurations.
- Based on the prepared data, a FORECAST model is trained to predict:
- CREATE OR REPLACE FORECAST DB.SCHEMA.MODELNAME (INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'NAME'), TIMESTAMP_COLNAME => 'TIME_COLUMN', TARGET_COLNAME => 'COLUMN_TO_PREDICT')
- If multiple slicings of the data have been prepared, repeat the training step for each.
- Each forecast model can now be called by its Snowflake object name: CALL MODELNAME!FORECAST(FORECASTING_PERIODS => 100);
- This call creates time-series data containing the trend predicted by the forecast model.
- Store the predictions in a separate persistent or temporary table: CREATE TABLE "NAME" AS (SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID())));
- Use the predictions to visualise trends by overlaying the existing data: SELECT "TIMESTAMP" AS "TS", "SOURCEVALUE", NULL AS "FORECAST" FROM "SOURCE_TABLE" UNION SELECT TIME_SLICE("TS", 1, 'HOUR')::TIMESTAMP_NTZ, NULL AS "SOURCEVALUE", SUM(FORECAST) AS "FORECAST" FROM "FORECAST_TABLE" GROUP BY TIME_SLICE("TS", 1, 'HOUR')::TIMESTAMP_NTZ;
- If you have a dual-column data output from your overlay, click “Chart”, and add your forecast column to the visualisation.
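The steps above can be sketched end to end as a single script. This is a minimal sketch, not the article's exact code: the TAXI_RIDES source table, its PICKUP_TS column, and all object names are hypothetical placeholders, and the exact CREATE FORECAST syntax may differ between Snowflake versions (newer releases use the SNOWFLAKE.ML.FORECAST class name), so check the current documentation.

```sql
-- 1. Slice the raw data: one row per 4-hour window, with a ride count as target.
CREATE OR REPLACE VIEW RIDES_4H AS
SELECT TIME_SLICE(PICKUP_TS, 4, 'HOUR')::TIMESTAMP_NTZ AS TS,
       COUNT(*) AS RIDES
FROM TAXI_RIDES
GROUP BY TIME_SLICE(PICKUP_TS, 4, 'HOUR')::TIMESTAMP_NTZ;

-- 2. Train a forecast model on the sliced view.
CREATE OR REPLACE FORECAST RIDES_4H_MODEL(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'RIDES_4H'),
    TIMESTAMP_COLNAME => 'TS',
    TARGET_COLNAME => 'RIDES'
);

-- 3. Predict the next 100 four-hour windows and persist the output.
CALL RIDES_4H_MODEL!FORECAST(FORECASTING_PERIODS => 100);
CREATE OR REPLACE TABLE RIDES_4H_FORECAST AS
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
```

The same script, with a different slice length in the view, produces the daily and 15-minute variants.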
If you trust the released ML functionality, and do not need additional explainability from algorithm tweaking or weighting, then at this point you have a trained prediction model.
Depending on how many forecasting periods you decided to predict, and on how you choose to compare the historic data with the predictions, you will have something like this:
Once predictions are working, a training/test split can be conducted, and quality metrics can be derived for how well the trained model predicts. To perform this test, train the model on only part of the data, and calculate the deviation of each prediction from the held-out actual values.
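Such a split can be sketched as follows. All object names and the cutoff date are hypothetical placeholders, and the output column names of the !FORECAST call (such as "TS" and "FORECAST") are assumed from the current Snowflake documentation:

```sql
-- Train on data before a cutoff; the rest serves as a held-out test set.
CREATE OR REPLACE VIEW RIDES_4H_TRAIN AS
SELECT * FROM RIDES_4H WHERE TS < '2023-06-01'::TIMESTAMP_NTZ;

CREATE OR REPLACE FORECAST RIDES_4H_TEST_MODEL(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'RIDES_4H_TRAIN'),
    TIMESTAMP_COLNAME => 'TS',
    TARGET_COLNAME => 'RIDES'
);

-- Forecast across the held-out period and persist the predictions.
CALL RIDES_4H_TEST_MODEL!FORECAST(FORECASTING_PERIODS => 180);
CREATE OR REPLACE TABLE RIDES_4H_TEST_FORECAST AS
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

-- Mean absolute error: average deviation of each prediction from the actuals.
SELECT AVG(ABS(a.RIDES - f."FORECAST")) AS MAE
FROM RIDES_4H a
JOIN RIDES_4H_TEST_FORECAST f ON a.TS = f."TS";
```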
Unlock the full potential of forecasting with machine learning on Snowflake
Analysts play a crucial role: they understand the dataset, so they are responsible for slicing the timescale and setting up the input data parameters. To explore the possibilities, I recommend some experimentation: train a small set of models on different data slices, for example daily, hourly, and minute-by-minute.
This process offers valuable experience. By training multiple models, analysts learn the tool. They also discover how different time slices affect predictions.
Setting up the model for re-training on new data
When the ideal slicing approach has been selected, make sure the data pipeline is stable and will contain the updated records when they become available. Then select a regular interval at which you would like the model to be retrained, and schedule the training step in a Snowflake task during off-hours. The right retraining interval is application-specific, but a rule of thumb is that substantially more data should be available before retraining. Small tips:
- Construct your data source, so base views will reflect updated data.
- Expose an object that contains both historical and forecasted data.
- Use your favourite visualisation tool to overlay the two time series.
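Scheduling the retraining itself can be sketched with a Snowflake task. The warehouse name, task name, and cron schedule below are placeholders for your environment, and the model/view names are the hypothetical ones used earlier:

```sql
-- Retrain the model every Sunday at 03:00 UTC (off-hours). A task runs a
-- single SQL statement; here, the CREATE OR REPLACE FORECAST training step.
CREATE OR REPLACE TASK RETRAIN_RIDES_4H_MODEL
    WAREHOUSE = TRAINING_WH
    SCHEDULE = 'USING CRON 0 3 * * 0 UTC'
AS
    CREATE OR REPLACE FORECAST RIDES_4H_MODEL(
        INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'RIDES_4H'),
        TIMESTAMP_COLNAME => 'TS',
        TARGET_COLNAME => 'RIDES'
    );

-- Tasks are created in a suspended state; resume to activate the schedule.
ALTER TASK RETRAIN_RIDES_4H_MODEL RESUME;
```

Because the training reads from a view, each scheduled run automatically picks up whatever new records the pipeline has landed in the base tables.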
Snowsight can present data natively, though additional tweaks are limited. The visualisation below shows the historical data in blue, together with two trained forecast models: a green one with data sliced into 4-hour slots, and a yellow one where the slicing is done per day:
Short-term predictions are similar, but only the daily-sliced model seems to take the overall steady growth into consideration. Both models are trained on the same data points, just grouped differently for training. Training was performed on an X-Small Snowflake warehouse:
- Green/four hour took around 6 minutes to train.
- Yellow/daily took around 2 minutes to train.
The total of 8 minutes of training time amounts to less than one euro/dollar, depending on your Snowflake plan. So the credits initially provided by Snowflake should be more than sufficient for finishing a PoC.
Pitfalls when using the current version of models
A few important topics to mention, to avoid pitfalls when designing a prediction model using the native Snowflake ML functions:
- All namespaces seem to currently be required in UPPERCASE.
- When creating a view for slicing/aggregations, timestamps must be unique.
- Timestamps must not have large gaps, and adhere to “cleaned data”.
- Regular-sized warehouses can perform the training for smaller datasets, but Snowpark-optimised warehouses are needed to avoid running out of memory when training on larger datasets. (For detailed minute-sliced training, I ended up with ~1 M rows, which required around 50 minutes on a medium Snowpark-optimised warehouse, ~6 Snowflake credits in total.)
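Two of these pitfalls, duplicate timestamps and gaps, can be checked up front with simple queries. The RIDES_4H view name is a hypothetical placeholder for your own sliced view, and 4 hours is assumed as the slice length:

```sql
-- Duplicate check: the model expects exactly one row per time slice.
SELECT TS, COUNT(*) AS N
FROM RIDES_4H
GROUP BY TS
HAVING COUNT(*) > 1;

-- Gap check: flag consecutive rows more than one 4-hour slice apart,
-- i.e. slices with no data at all.
SELECT TS, LAG(TS) OVER (ORDER BY TS) AS PREV_TS
FROM RIDES_4H
QUALIFY DATEDIFF('HOUR', LAG(TS) OVER (ORDER BY TS), TS) > 4;
```

If either query returns rows, clean the view (for example by aggregating duplicates and filling empty slices) before training.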
That’s it, go have fun with your new toys!
Simplify Forecasting with Snowflake ML
Leverage Snowflake’s ML tools to enhance data forecasting. Contact our experts today.