Incremental Restate Materialization #813
rhoward-rr
started this conversation in
Ideas
Replies: 1 comment
-
Is it easier if I just start the process of doing a PR for this? |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Hello, all. I've developed a new materialization for dbt on Databricks that we have been using at Rec Room for over a year now and I have been highly encouraged to share with the dbt community. This materialization provides: 1) new options for incremental strategies, the 2) ability to restate history in a rolling window to handle late arriving data, 3) the ability to run the materialization in batches so that you can (re)load large amounts of data in manageable chunks, and 4) the ability to restate a specific period by passing start and end dates in a var, making reloads/backfills much simpler.
I'd love to contribute this to the databricks adapter for everyone to use.
Below is the README documentation I have for the materialization. Let me know your thoughts.
incremental_restate
The incremental restate materialization is intended to support models where history needs
to be restated, and:
A typical config using this materialization might look like this:
Note the placeholders
__start__
and__end__
. These should always be included in any model using thismaterialization and will be dynamically replaced just before execution based on your settings.
Model Configurations:
In addition to other common configurations, this materialization supports the following options:
incremental_strategy
can be set to "replace", "merge", "append", or "safe_append" .replace
, all records that are in the update range in the target table will be deleted using thereplace where
insert parameter. This option should be chosen when there is no unique key.merge
, the source records will be merged into the target table. By default, the merge willbe set to delete records no longer in source, but this functionality can be disabled by setting
merge_delete = false
append
, the source records will be naively inserted into the target. It's important thereforewhen using this strategy that the model itself prevents duplication.
safe_append
, the source records will be inserted only if they do not match a record in the targetthat is in the restate range. A unique key will be required when using this strategy.
restate_column
is the name of the column in the target table that should be used as the date to be restated.restate_unit
is the SQL interval (day, week, month, year) that your restate_period will be given in. (Defaults to DAY)restate_period
determines how long the model's lookback period is. Set this to the number of days, weeks, months or years desired.This can also be set to
0
if data should NOT be restated and only records after the max date from the tableshould be loaded.
merge_delete
can be set explicitly to "false" when using the "merge" strategy to prevent deleting recordsno longer in source. Using this option SUBSTANTIALLY improve performances. (Defaults to "false" -- it will NOT
delete records no longer in source.)
initial_start_date
is the earliest date that should be automatically loaded into the table duringthe initial load. This can be especially useful when you would like to have a subset of more recent data
immediately available for use and intend to backfill older data in a separate load using vars.
disable_incremental_predicate
controls whether to add a filter on the merge operation into the target table.This config can be useful for state tables or tables with long term late arriving facts. If it is set to true it will remove the incremental predicate and merge new data into the entirety of the target table instead of a range of specific dates.
Large Table Batching
If your table is too large to be run in a single run, you can configure it to be run in batches of
days, weeks, months or years. The batch size will be used for first time loads / full refreshes and
incremental updates. To set up your model to run in batches add the following configurations:
batch_unit
should be set to the SQL Interval (day, week, month, or year) desired.batch_size
will determine now many days, weeks, months, or years should be batched together.Runtime Options (Backfill)
When a start_date and end_date are passed through the runtime vars, they will be used in preference
to the automatically generated dates. This can enable backfilling or reloading portions of a table.
Below is an example of passing the parameters via the command line. Note that cli is very finicky about
spacing, so if you do not put a space between the colon (:) and the value, it will not be read correctly.
You may also pass the
backfill
flag to have the model loaded only up to the first date that has datawithin your specified range. For instance, if your table currently has data from 1/1/2023 to 7/1/2023
and you set the start_date to 1/1/2020 and the end date to 12/31/2023, data will be loaded from 1/1/2020
to 1/1/2023 only.
Beta Was this translation helpful? Give feedback.
All reactions