
Bug in TokenBlockDataset causing potential CUDA OOM due to incorrect size casting #5527

YuvalRingel opened this issue Jul 21, 2024 · 0 comments

Description:

When initializing a TokenBlockDataset object, the sizes of the data blocks (_sizes) are cast to np.uint16 or np.uint32 depending on block_size. This casting causes problems later in the code:

Note:
Although this is an edge case that requires a very long sentence, it is entirely possible, and the training process does not detect that anything is wrong, leading to a CUDA OOM exception or faulty training.

  1. Incorrect size handling: The cast to np.uint16 truncates the sizes, so self.sizes stores values that do not correspond to the actual sizes of the data blocks (a runnable sketch of the truncation follows this list). See these lines:

     ```python
     size_dtype = np.uint16 if block_size < 65535 else np.uint32
     num_tokens = slice_indices[-1].max()
     slice_indices_dtype = best_fitting_int_dtype(num_tokens)
     slice_indices = slice_indices.astype(slice_indices_dtype)
     _sizes = _sizes.astype(size_dtype)
     ```


  2. Filtering issue: During filtering in filter_indices_by_size, the incorrect values in self.sizes can cause sentences with more tokens than max_tokens to be retained, bypassing the intended filtering logic (also demonstrated in the sketch after this list). See this line:

     ```python
     indices = indices[self.sizes[indices] <= max_sizes]
     ```
  3. OOM in CUDA: The incorrect sizes propagate through data iteration and model training, potentially causing CUDA Out-of-Memory (OOM) errors when the model processes samples with an extreme number of tokens that were never filtered out.
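
To make the failure mode concrete, here is a minimal numpy-only sketch that mirrors the casting and filtering logic quoted above. The variable names (_sizes, block_size, max_sizes) follow the fairseq source; the concrete numbers (a 70,000-token sentence, max_sizes of 8192) are illustrative assumptions:

```python
import numpy as np

# True per-block sizes: one very long sentence (70,000 tokens) and two normal ones.
_sizes = np.array([70000, 31, 512], dtype=np.int64)

# Same dtype selection as TokenBlockDataset: block_size below 65535 picks uint16.
block_size = 512
size_dtype = np.uint16 if block_size < 65535 else np.uint32
sizes = _sizes.astype(size_dtype)
print(sizes)  # [4464   31  512] -- 70000 wraps around to 70000 - 65536 = 4464

# Same check as filter_indices_by_size: the oversized sample slips through.
max_sizes = 8192
indices = np.arange(len(sizes))
kept = indices[sizes[indices] <= max_sizes]
print(kept)  # [0 1 2] -- index 0 survives even though its true size is 70000
```

Because the stored size wraps modulo 65536, any sample whose true size modulo 65536 happens to fall below max_sizes masquerades as a legal batch element and reaches the GPU at its full length.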

Steps to Reproduce:

  1. Train a model with --max-tokens below 65535 on data that contains a sample with more than 65535 tokens.
  2. Watch _sizes at the index of that sample get truncated.
  3. Watch the index of that sample pass through the filtering without being excluded.
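
The same truncation can be observed at the dataset level without a full training run. This is a hedged sketch, assuming the TokenBlockDataset constructor signature used throughout fairseq's tasks (dataset, sizes, block_size, pad, eos, break_mode) and the "eos" break mode, under which every sentence becomes its own block regardless of block_size; details may differ between fairseq versions:

```python
import torch
from fairseq.data import TokenBlockDataset

# One pathologically long "sentence" (70,000 tokens) and one short one.
tokens = [torch.ones(70000, dtype=torch.long), torch.ones(10, dtype=torch.long)]
sizes = [len(t) for t in tokens]

# block_size < 65535, so the dataset casts its sizes to np.uint16 internally.
dataset = TokenBlockDataset(
    tokens, sizes, block_size=512, pad=1, eos=2, break_mode="eos"
)

# Expected: [4464 10] -- the first entry wrapped around instead of being 70000.
print(dataset.sizes[:2])
```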

Expected Behavior:
self.sizes should accurately reflect the actual sizes of the data blocks, regardless of whether they are stored as np.uint16 or np.uint32.

Proposed Solution:

  1. Adjust the handling of _sizes and self.sizes so that casting to np.uint16 or np.uint32 cannot compromise the integrity of the size information (a sketch follows this list).
  2. Assert that there is no mismatch between _sizes and the actual sizes.
  3. Add a verification mechanism, or adjust the filtering logic in filter_indices_by_size, so that sizes are handled correctly.
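
A minimal sketch of solution 1, reusing the best_fitting_int_dtype helper that the same file already applies to slice_indices, so the dtype is chosen from the largest size that actually occurs rather than from block_size. This is an illustrative patch, not a tested fix, and the import path may vary by fairseq version:

```python
from fairseq.data.indexed_dataset import best_fitting_int_dtype  # path may vary by version

# Instead of:
#     size_dtype = np.uint16 if block_size < 65535 else np.uint32
#     _sizes = _sizes.astype(size_dtype)
# pick the narrowest dtype that can hold the largest observed size, then
# assert that the cast is lossless before committing to it:
size_dtype = best_fitting_int_dtype(_sizes.max())
casted_sizes = _sizes.astype(size_dtype)
assert (casted_sizes == _sizes).all(), "size dtype too narrow; sizes were truncated"
_sizes = casted_sizes
```

This also covers solution 2: the assertion fails loudly at dataset construction time instead of silently corrupting the filtering later.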