Liquid cluster columns are updated on every run, even when there is no change #802

Open
krifra1234 opened this issue Sep 20, 2024 · 8 comments
Labels
bug Something isn't working

Comments

@krifra1234

krifra1234 commented Sep 20, 2024

Describe the bug

When using liquid clustering, the cluster columns in the Delta Lake table are updated every time the dbt model is run, even if the cluster columns have not changed in the config.

Steps To Reproduce

Create a dbt model with the following config:
materialized= 'incremental',
incremental_strategy= 'append',
liquid_clustered_by= ['columnname']

Run the dbt model multiple times and look for the operation "CLUSTER BY" in the Delta Lake table history. Find the column “Operation Parameters” and you will see something similar to this for every run:

{
"oldClusteringColumns": "columnname",
"newClusteringColumns": "columnname"
}
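
To count how many of these no-op entries a table has accumulated, the history rows can be filtered for CLUSTER BY operations whose old and new columns match. The following is a minimal sketch, assuming the history has already been loaded as a list of dicts with `operation` and `operationParameters` keys; the helper name and the sample rows are made up for illustration and are not part of dbt-databricks.

```python
from typing import Any, Iterable, List, Mapping


def noop_cluster_by_entries(history: Iterable[Mapping[str, Any]]) -> List[Mapping[str, Any]]:
    """Return the history rows where CLUSTER BY left the clustering columns unchanged.

    Assumes each row has an "operation" string and an "operationParameters"
    mapping shaped like the JSON shown in the issue.
    """
    noops = []
    for row in history:
        if row.get("operation") != "CLUSTER BY":
            continue
        params = row.get("operationParameters") or {}
        # Skip rows that do not carry both keys, so missing data is not
        # mistaken for "unchanged".
        if "oldClusteringColumns" not in params or "newClusteringColumns" not in params:
            continue
        if params["oldClusteringColumns"] == params["newClusteringColumns"]:
            noops.append(row)
    return noops


# Hypothetical sample rows, for illustration only.
sample_history = [
    {"operation": "WRITE", "operationParameters": {"mode": "Append"}},
    {
        "operation": "CLUSTER BY",
        "operationParameters": {
            "oldClusteringColumns": "columnname",
            "newClusteringColumns": "columnname",
        },
    },
    {
        "operation": "CLUSTER BY",
        "operationParameters": {
            "oldClusteringColumns": "columnname",
            "newClusteringColumns": "othercolumn",
        },
    },
]

redundant = noop_cluster_by_entries(sample_history)
```

With the sample rows above, only the second entry is flagged, because it is the only CLUSTER BY whose old and new columns are identical.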

Expected behavior

I would expect the CLUSTER BY operation not to run when the cluster columns are not changed.

System information

The output of dbt --version:

(dbt_1.8.5) PS C:\repo\dbt-bitechno> dbt --version
Core:
  - installed: 1.8.5
  - latest:    1.8.6 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.8.5 - Update available!
  - spark:      1.8.0 - Up to date!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using:
Microsoft Windows 11 Enterprise

The output of python --version:
Python 3.10.14

@krifra1234 krifra1234 added the bug Something isn't working label Sep 20, 2024
@krifra1234 krifra1234 changed the title Liquid cluster columns are updated on every run, even when there is no change. Liquid cluster columns are updated on every run, even when there is no change Sep 20, 2024
@benc-db
Collaborator

benc-db commented Sep 24, 2024

Thanks for reporting. Need to rethink this.

@sundeep1687

sundeep1687 commented Oct 3, 2024

Hi, I am also facing the same issue. Another effect is that after the ALTER TABLE ... CLUSTER BY, it runs "optimize table <table_name>" every day.
It would be great if you could prioritize this.

@benc-db
Collaborator

benc-db commented Oct 3, 2024

@sundeep1687 you can skip optimize by setting DATABRICKS_SKIP_OPTIMIZE=true

@benc-db
Collaborator

benc-db commented Oct 3, 2024

@krifra1234 is the alter operation slow? The alternative is querying for metadata to decide whether to do it or not, and that is not particularly fast.
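
The metadata-based alternative mentioned here comes down to comparing the table's current clustering columns with the configured ones and issuing the ALTER only when they differ. The sketch below shows just that comparison and is not the adapter's implementation. The function name is hypothetical. Treating column names as case-insensitive and the column order as significant are assumptions. How the current columns are fetched from the table is left out, since that lookup is the slow part being discussed.

```python
from typing import List, Sequence, Union

Columns = Union[str, Sequence[str]]


def _normalize(columns: Columns) -> List[str]:
    """Turn a single name or a list of names into a comparable list."""
    if isinstance(columns, str):
        columns = [columns]
    # Strip whitespace and backticks and lowercase the names.
    # Case-insensitive matching is an assumption of this sketch.
    return [name.strip().strip("`").lower() for name in columns]


def needs_cluster_by(current: Columns, desired: Columns) -> bool:
    """Return True only when the configured columns differ from the current ones."""
    return _normalize(current) != _normalize(desired)


# liquid_clustered_by appears both as a list and as a plain string in this
# thread, so the comparison accepts either form.
unchanged = needs_cluster_by(["columnname"], "columnname")
changed = needs_cluster_by(["columnname"], ["columnname", "othercolumn"])
```

Here `unchanged` is False, so the ALTER would be skipped, while `changed` is True. The comparison itself is cheap. The cost raised in the comment above is the extra metadata query needed to obtain `current` on every run.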

@sundeep1687

The alter operation is almost instant, but the optimize afterwards takes a long time. Do you mean something like this in the config? If you have an example, please share.
{{
    config(
        materialized="incremental",
        incremental_strategy='replace_where',
        incremental_predicates=[lookback_predicate],
        liquid_clustered_by='request_date_local',
        DATABRICKS_SKIP_OPTIMIZE=true,
        tags=["cvs"]
    )
}}

@benc-db
Collaborator

benc-db commented Oct 3, 2024

I mean set it as an environment variable. It might also work as a dbt variable, but not as config.
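
One way to apply the environment-variable approach is to set the variable in the environment of the process that launches dbt. The sketch below does this from a small Python wrapper. The helper names and the `my_model` selector are hypothetical, and the only thing taken from this thread is the variable name and its value.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional


def env_with_skip_optimize(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with DATABRICKS_SKIP_OPTIMIZE set to "true"."""
    env = dict(os.environ if base is None else base)
    env["DATABRICKS_SKIP_OPTIMIZE"] = "true"
    return env


def run_dbt(select: str) -> int:
    """Run `dbt run --select <select>` with the variable set; returns the exit code.

    Assumes the dbt executable is on PATH.
    """
    result = subprocess.run(
        ["dbt", "run", "--select", select],
        env=env_with_skip_optimize(),
        check=False,
    )
    return result.returncode


# Example usage (hypothetical model name):
# exit_code = run_dbt("my_model")
```

Setting the variable in the shell before calling dbt has the same effect. The wrapper only makes the setting explicit for scheduled runs.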

@saadmansoorhbo

saadmansoorhbo commented Oct 4, 2024

Hi @benc-db - Can you please confirm that setting DATABRICKS_SKIP_OPTIMIZE=true as a dbt var will not interfere with the table property autoOptimize.optimizeWrite=true? I understand that autoOptimize kicks in at write time and will probably work as-is, but I wanted to be 100% sure. We would like to avoid an explicit optimize on the table if autoOptimize is working.

@benc-db
Collaborator

benc-db commented Oct 4, 2024

That variable only affects whether we call optimize explicitly.
