Snowflake is revolutionising data handling in today's dynamic tech world. Its platform now offers pre-compiled ML models that simplify forecasting and predictive analytics, supports time-series data slicing, and allows seamless model re-training. This makes advanced analytics accessible to analysts and business leaders alike.
Snowflake recently announced these new ML features as generally available, so all new accounts can access them. They offer a standardised way to train models, and the provided ML functions are well suited for base models and ideal for proof-of-concept projects.
This article introduces the concept and provides a demonstration of creating forecast predictions, with a focus on slice variations: three models are compared, trained on the same taxi data but with different time slices: daily, every four hours, and every 15 minutes.
The above graphic shows three forecast models trained on the same data but with different slicing: daily, every four hours, or every 15 minutes.
The idea of pre-compiled ML models as functions
Getting started with machine learning on Snowflake
The basic process for making use of an ML function is the following:
- Prepare time-series data, sliced/grouped into suitable intervals. Store the data object with only the data relevant for the model; this means:
- The object contains only the timestamp and the target column.
- Slicing the data means that every data point is assigned to a unique window slice; use the TIME_SLICE function with a GROUP BY clause in Snowflake SQL.
- Multiple views that slice the data differently can exist in parallel; this is done by creating views/tables with different TIME_SLICE configurations.
- Based on the prepared data, a FORECAST model is trained to predict:
- CREATE OR REPLACE FORECAST DB.SCHEMA.MODELNAME (INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'NAME'), TIMESTAMP_COLNAME => 'TIME_COLUMN', TARGET_COLNAME => 'COLUMN_TO_PREDICT')
- If multiple slicings of the data have been prepared, repeat the training step for each.
- Each forecast model can now be called by its Snowflake object name: CALL MODELNAME!FORECAST(FORECASTING_PERIODS => 100);
- This call creates time-series data containing the trend predicted by the forecast model.
- Store the predictions in a separate persistent or temporary table: CREATE TABLE "NAME" AS (SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID())));
- Use the predictions to visualise trends by overlaying the existing data: SELECT "TIMESTAMP" AS "TS", "SOURCEVALUE", NULL AS "FORECAST" FROM "SOURCE_TABLE" UNION SELECT TIME_SLICE("TS", 1, 'HOUR')::TIMESTAMP_NTZ, NULL AS "SOURCEVALUE", SUM(FORECAST) AS "FORECAST" FROM "FORECAST_TABLE" GROUP BY TIME_SLICE("TS", 1, 'HOUR')::TIMESTAMP_NTZ;
- If you have a dual-column data output from your overlay, click “Chart”, and add your forecast column to the visualisation.
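The steps above can be sketched end to end as a single script. This is a minimal sketch, not the article's exact code: the TAXI_RIDES source table, its PICKUP_TS column, and all object names are hypothetical placeholders, and the exact CREATE FORECAST syntax may differ between Snowflake versions (newer releases use the SNOWFLAKE.ML.FORECAST class name), so check the current documentation.

```sql
-- 1. Slice the raw data: one row per 4-hour window, with a ride count as target.
CREATE OR REPLACE VIEW RIDES_4H AS
SELECT TIME_SLICE(PICKUP_TS, 4, 'HOUR')::TIMESTAMP_NTZ AS TS,
       COUNT(*) AS RIDES
FROM TAXI_RIDES
GROUP BY TIME_SLICE(PICKUP_TS, 4, 'HOUR')::TIMESTAMP_NTZ;

-- 2. Train a forecast model on the sliced view.
CREATE OR REPLACE FORECAST RIDES_4H_MODEL(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'RIDES_4H'),
    TIMESTAMP_COLNAME => 'TS',
    TARGET_COLNAME => 'RIDES'
);

-- 3. Predict the next 100 four-hour windows and persist the output.
CALL RIDES_4H_MODEL!FORECAST(FORECASTING_PERIODS => 100);
CREATE OR REPLACE TABLE RIDES_4H_FORECAST AS
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));
```

The same script, with a different slice length in the view, produces the daily and 15-minute variants.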
If you trust the released ML functionality, and do not need additional explainability from algorithm tweaking or weighting, then at this point you have a trained prediction model.
Depending on how many forecasting periods you decided to predict, and on how you choose to compare the historic data with the predictions, you will have something like this:
Once predictions are working, a training/test split can be conducted, and quality metrics can be derived for how well the trained model predicts. To perform this test, train the model on only part of the data, and calculate the deviation of each prediction from the held-out actual values.
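Such a split can be sketched as follows. All object names and the cutoff date are hypothetical placeholders, and the output column names of the !FORECAST call (such as "TS" and "FORECAST") are assumed from the current Snowflake documentation:

```sql
-- Train on data before a cutoff; the rest serves as a held-out test set.
CREATE OR REPLACE VIEW RIDES_4H_TRAIN AS
SELECT * FROM RIDES_4H WHERE TS < '2023-06-01'::TIMESTAMP_NTZ;

CREATE OR REPLACE FORECAST RIDES_4H_TEST_MODEL(
    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'RIDES_4H_TRAIN'),
    TIMESTAMP_COLNAME => 'TS',
    TARGET_COLNAME => 'RIDES'
);

-- Forecast across the held-out period and persist the predictions.
CALL RIDES_4H_TEST_MODEL!FORECAST(FORECASTING_PERIODS => 180);
CREATE OR REPLACE TABLE RIDES_4H_TEST_FORECAST AS
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

-- Mean absolute error: average deviation of each prediction from the actuals.
SELECT AVG(ABS(a.RIDES - f."FORECAST")) AS MAE
FROM RIDES_4H a
JOIN RIDES_4H_TEST_FORECAST f ON a.TS = f."TS";
```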
Unlock the full potential of forecasting with machine learning on Snowflake
Analysts play a crucial role: they understand the dataset, so they are responsible for slicing the timescale and setting up the input data parameters. To explore the possibilities, I recommend some experimentation: train a small set of models on different data slices, for example daily, hourly, and minute-by-minute.
This process offers valuable experience. By training multiple models, analysts learn the tool. They also discover how different time slices affect predictions.
Setting up the model for re-training on new data
When the ideal slicing approach has been selected, make sure the data pipeline is stable and will contain the updated records when they become available. Then select a regular interval at which you would like the model to be retrained, and schedule the training step in a Snowflake task during off-hours. The right retraining interval is application-specific, but a rule of thumb is that substantially more data should be available before retraining. Small tips:
- Construct your data source, so base views will reflect updated data.
- Expose an object that contains both historical and forecasted data.
- Use your favourite visualisation tool to overlay the two time series.
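Scheduling the retraining itself can be sketched with a Snowflake task. The warehouse name, task name, and cron schedule below are placeholders for your environment, and the model/view names are the hypothetical ones used earlier:

```sql
-- Retrain the model every Sunday at 03:00 UTC (off-hours). A task runs a
-- single SQL statement; here, the CREATE OR REPLACE FORECAST training step.
CREATE OR REPLACE TASK RETRAIN_RIDES_4H_MODEL
    WAREHOUSE = TRAINING_WH
    SCHEDULE = 'USING CRON 0 3 * * 0 UTC'
AS
    CREATE OR REPLACE FORECAST RIDES_4H_MODEL(
        INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'RIDES_4H'),
        TIMESTAMP_COLNAME => 'TS',
        TARGET_COLNAME => 'RIDES'
    );

-- Tasks are created in a suspended state; resume to activate the schedule.
ALTER TASK RETRAIN_RIDES_4H_MODEL RESUME;
```

Because the training reads from a view, each scheduled run automatically picks up whatever new records the pipeline has landed in the base tables.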
Snowsight can present data natively, though additional tweaks are limited. The visualisation below shows the historical data in blue, together with two trained forecast models: a green one with data sliced into 4-hour slots, and a yellow one where the slicing is done per day:
Short-term predictions are similar, but only the daily-sliced model seems to take the overall steady growth into consideration. Both models are trained on the same data points, just grouped differently for training. Training was performed on an X-Small Snowflake warehouse:
- Green/four hour took around 6 minutes to train.
- Yellow/daily took around 2 minutes to train.
The total of 8 minutes of training time amounts to less than one euro/dollar, depending on your Snowflake plan. So the credits initially provided by Snowflake should be more than sufficient for finishing a PoC.
Pitfalls when using the current version of models
A few important topics to mention, to avoid pitfalls when designing a prediction model using the native Snowflake ML functions:
- All namespaces seem to currently be required in UPPERCASE.
- When creating a view for slicing/aggregations, timestamps must be unique.
- Timestamps must not have large gaps, and adhere to “cleaned data”.
- Regular-sized warehouses can perform the training for smaller datasets, but Snowpark-optimised warehouses are needed to avoid running out of memory when training on larger datasets. (For detailed minute-sliced training, I ended up with ~1 M rows, which required around 50 minutes on a medium Snowpark-optimised warehouse, ~6 Snowflake credits in total.)
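Two of these pitfalls, duplicate timestamps and gaps, can be checked up front with simple queries. The RIDES_4H view name is a hypothetical placeholder for your own sliced view, and 4 hours is assumed as the slice length:

```sql
-- Duplicate check: the model expects exactly one row per time slice.
SELECT TS, COUNT(*) AS N
FROM RIDES_4H
GROUP BY TS
HAVING COUNT(*) > 1;

-- Gap check: flag consecutive rows more than one 4-hour slice apart,
-- i.e. slices with no data at all.
SELECT TS, LAG(TS) OVER (ORDER BY TS) AS PREV_TS
FROM RIDES_4H
QUALIFY DATEDIFF('HOUR', LAG(TS) OVER (ORDER BY TS), TS) > 4;
```

If either query returns rows, clean the view (for example by aggregating duplicates and filling empty slices) before training.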
That’s it, go have fun with your new toys!
Simplify Forecasting with Snowflake ML
Leverage Snowflake’s ML tools to enhance data forecasting. Contact our experts today.