Incremental Restate Materialization #813

rhoward-rr · 2024-09-30T19:16:53Z

rhoward-rr
Sep 30, 2024

Hello, all. I've developed a new materialization for dbt on Databricks that we have been using at Rec Room for over a year now and I have been highly encouraged to share with the dbt community. This materialization provides: 1) new options for incremental strategies, the 2) ability to restate history in a rolling window to handle late arriving data, 3) the ability to run the materialization in batches so that you can (re)load large amounts of data in manageable chunks, and 4) the ability to restate a specific period by passing start and end dates in a var, making reloads/backfills much simpler.

I'd love to contribute this to the databricks adapter for everyone to use.

Below is the README documentation I have for the materialization. Let me know your thoughts.

incremental_restate

The incremental restate materialization is intended to support models where history needs
to be restated, and:

There is no usable unique key.
Or, records in the target table need to be deleted if they are no longer in source.
Or, all or most of the records in the range being updated will be updated, so a standard merge would be inefficient.
Or, the model could benefit by some of the additional features of this materialization, such as changing the load range at runtime or breaking up a load into batches.

A typical config using this materialization might look like this:

{{
  config(
    materialized = 'incremental_restate'
    , incremental_strategy = 'replace'
    , file_format = 'delta'
    , restate_period = 5
    , restate_unit = 'DAY'
    , restate_column = 'created_date_utc'
    , batch_size = 1
    , batch_unity = 'WEEK'
  )
}}
select
    *
from
    source_table as s
where
    s.some_date_column between __start__ and __end__

Note the placeholders __start__ and __end__. These should always be included in any model using this
materialization and will be dynamically replaced just before execution based on your settings.

Model Configurations:

In addition to other common configurations, this materialization supports the following options:

incremental_strategy can be set to "replace", "merge", "append", or "safe_append" .

When set to replace, all records that are in the update range in the target table will be deleted using the
replace where insert parameter. This option should be chosen when there is no unique key.
When using merge, the source records will be merged into the target table. By default, the merge will
be set to delete records no longer in source, but this functionality can be disabled by setting merge_delete = false
When using append, the source records will be naively inserted into the target. It's important therefore
when using this strategy that the model itself prevents duplication.
When using safe_append, the source records will be inserted only if they do not match a record in the target
that is in the restate range. A unique key will be required when using this strategy.

restate_column is the name of the column in the target table that should be used as the date to be restated.

restate_unit is the SQL interval (day, week, month, year) that your restate_period will be given in. (Defaults to DAY)

restate_period determines how long the model's lookback period is. Set this to the number of days, weeks, months or years desired.
This can also be set to 0 if data should NOT be restated and only records after the max date from the table
should be loaded.

merge_delete can be set explicitly to "false" when using the "merge" strategy to prevent deleting records
no longer in source. Using this option SUBSTANTIALLY improve performances. (Defaults to "false" -- it will NOT
delete records no longer in source.)

initial_start_date is the earliest date that should be automatically loaded into the table during
the initial load. This can be especially useful when you would like to have a subset of more recent data
immediately available for use and intend to backfill older data in a separate load using vars.

disable_incremental_predicate controls whether to add a filter on the merge operation into the target table.
This config can be useful for state tables or tables with long term late arriving facts. If it is set to true it will remove the incremental predicate and merge new data into the entirety of the target table instead of a range of specific dates.

Large Table Batching

If your table is too large to be run in a single run, you can configure it to be run in batches of
days, weeks, months or years. The batch size will be used for first time loads / full refreshes and
incremental updates. To set up your model to run in batches add the following configurations:

batch_unit should be set to the SQL Interval (day, week, month, or year) desired.

batch_size will determine now many days, weeks, months, or years should be batched together.

Runtime Options (Backfill)

When a start_date and end_date are passed through the runtime vars, they will be used in preference
to the automatically generated dates. This can enable backfilling or reloading portions of a table.

Below is an example of passing the parameters via the command line. Note that cli is very finicky about
spacing, so if you do not put a space between the colon (:) and the value, it will not be read correctly.

dbt run --select your_model_name --vars '{start_date: "2023-07-01", end_date: "2023-08-01"}'

You may also pass the backfill flag to have the model loaded only up to the first date that has data
within your specified range. For instance, if your table currently has data from 1/1/2023 to 7/1/2023
and you set the start_date to 1/1/2020 and the end date to 12/31/2023, data will be loaded from 1/1/2020
to 1/1/2023 only.

dbt run --select your_model_name --vars '{start_date: "2020-01-01", end_date: "2023-12-31", backfill: true}'

rhoward-rr · 2024-10-07T15:24:10Z

rhoward-rr
Oct 7, 2024
Author

Is it easier if I just start the process of doing a PR for this?

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incremental Restate Materialization #813

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

Incremental Restate Materialization #813

rhoward-rr Sep 30, 2024

incremental_restate

Model Configurations:

Large Table Batching

Runtime Options (Backfill)

Replies: 1 comment

rhoward-rr Oct 7, 2024 Author

rhoward-rr
Sep 30, 2024

rhoward-rr
Oct 7, 2024
Author