
Retraining #20

Open
anacrolix opened this issue Sep 28, 2022 · 1 comment

Comments

@anacrolix
Contributor

It's not clear from the docs or your blog post how to retrain dictionaries. If my reading of the code is correct, with transparent compression the dictionary is created once, during the first maintenance run. For manual compression, one could train and manage one's own dictionaries; presumably old ones could be cleaned up by ensuring no references to them remain?

I assume it's not possible to use a new dictionary for data that was compressed with another dictionary?

Thanks!

@phiresky
Owner

If you use the "base" set of functions (zstd_train_dict, zstd_compress, zstd_decompress), then you can handle dictionaries however you want (store them as blobs in a separate table, identified however you want). You'll also have to do the "reference counting" yourself.
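To make the manual approach concrete, here is a sketch of dictionary bookkeeping with the base functions. The table and column names (`articles`, `body`, `dict_id`, `dictionaries`) are hypothetical, and the function signatures reflect my understanding of the extension's API; check the sqlite-zstd README for the authoritative argument lists.

```sql
-- Hypothetical schema: a data table plus a table holding trained dictionaries.
CREATE TABLE dictionaries (id INTEGER PRIMARY KEY, dict BLOB NOT NULL);

-- Train a dictionary from existing rows
-- (zstd_train_dict is an aggregate: data, target dict size, sample count).
INSERT INTO dictionaries (dict)
SELECT zstd_train_dict(body, 100000, 1000) FROM articles;

-- Compress rows with that dictionary, recording which one was used
-- so the matching dictionary can be found again at read time.
UPDATE articles
SET body = zstd_compress(body, 19, (SELECT dict FROM dictionaries WHERE id = 1)),
    dict_id = 1;

-- Decompress on read with the recorded dictionary.
SELECT zstd_decompress(body, 1, (SELECT dict FROM dictionaries WHERE id = 1))
FROM articles
WHERE dict_id = 1;
```

The `dict_id` column is the "reference counting" mentioned above in its simplest form: a dictionary row may be deleted once no data row references its id.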

If you use transparent compression, dictionaries are created based on a few factors:

  1. The dict chooser expression has to return a non-null value. If it returns null, the corresponding rows are not compressed. This means you can delay compression for specific rows based on the values of other table columns.
  2. If you return '[nodict]' from the dict chooser, the row is compressed without a dictionary.
  3. If the amount of data that would be used to train a dictionary is too small, the data stays uncompressed. This is computed separately for each dictionary (each unique value returned from dict_chooser). The heuristic to determine the target dictionary size is total_bytes_in_dict_group * config.dict_size_ratio (default 1%). If that value is less than config.min_dict_size_bytes_for_training (default 5000 bytes), a dictionary will not be trained. So by default your data will only be compressed once a group holds at least 500 kB (500 kB × 1% = 5000 bytes).

Right now, there's no integrated functionality to "retrain" dictionaries or to decompress and recompress data. That would be future functionality, though if you choose your dict_chooser well, retraining shouldn't gain you much. You can work around the lack of this feature by decompressing everything and then enabling compression again.
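With the base functions, migrating rows from one dictionary to another (the question above) can also be done by hand: decompress with the old dictionary and recompress with the new one in a single statement. This sketch reuses the hypothetical `articles`/`dictionaries` schema and assumed function signatures from before; it is an illustration, not the extension's documented migration path.

```sql
-- Hypothetical migration: re-encode rows compressed with the old dictionary
-- using a newly trained one, updating the reference so reads stay consistent.
UPDATE articles
SET body = zstd_compress(
        zstd_decompress(body, 1,
            (SELECT dict FROM dictionaries WHERE id = :old_id)),
        19,
        (SELECT dict FROM dictionaries WHERE id = :new_id)),
    dict_id = :new_id
WHERE dict_id = :old_id;
```

Because each row records its `dict_id`, the migration can run in batches; rows always decompress with whichever dictionary they currently reference.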
