
Bug in TokenBlockDataset causing potential CUDA OOM due to incorrect size casting #5527

YuvalRingel opened this issue Jul 21, 2024 · 0 comments

Description:

When initializing a TokenBlockDataset object, the sizes of the data blocks (_sizes) are cast to np.uint16 or np.uint32 depending on block_size. This casting causes problems later in the code:

Note:
Although this is an edge case that requires a very long sentence, it is entirely possible, and the training process does not detect that anything is wrong, leading to a CUDA OOM exception or faulty training.

  1. Incorrect size handling: The cast to np.uint16 truncates the sizes, so self.sizes stores values that do not correspond to the actual sizes of the data blocks (a runnable sketch of the truncation follows this list). See these lines:

     ```python
     size_dtype = np.uint16 if block_size < 65535 else np.uint32
     num_tokens = slice_indices[-1].max()
     slice_indices_dtype = best_fitting_int_dtype(num_tokens)
     slice_indices = slice_indices.astype(slice_indices_dtype)
     _sizes = _sizes.astype(size_dtype)
     ```


  2. Filtering issue: During filtering in filter_indices_by_size, the incorrect values in self.sizes can cause sentences with more tokens than max_tokens to be retained, bypassing the intended filtering logic (also demonstrated in the sketch after this list). See this line:

     ```python
     indices = indices[self.sizes[indices] <= max_sizes]
     ```
  3. OOM in CUDA: The incorrect sizes propagate through data iteration and model training, potentially causing CUDA Out-of-Memory (OOM) errors when the model processes samples with an extreme number of tokens that were never filtered out.
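
To make the failure mode concrete, here is a minimal numpy-only sketch that mirrors the casting and filtering logic quoted above. The variable names (_sizes, block_size, max_sizes) follow the fairseq source; the concrete numbers (a 70,000-token sentence, max_sizes of 8192) are illustrative assumptions:

```python
import numpy as np

# True per-block sizes: one very long sentence (70,000 tokens) and two normal ones.
_sizes = np.array([70000, 31, 512], dtype=np.int64)

# Same dtype selection as TokenBlockDataset: block_size below 65535 picks uint16.
block_size = 512
size_dtype = np.uint16 if block_size < 65535 else np.uint32
sizes = _sizes.astype(size_dtype)
print(sizes)  # [4464   31  512] -- 70000 wraps around to 70000 - 65536 = 4464

# Same check as filter_indices_by_size: the oversized sample slips through.
max_sizes = 8192
indices = np.arange(len(sizes))
kept = indices[sizes[indices] <= max_sizes]
print(kept)  # [0 1 2] -- index 0 survives even though its true size is 70000
```

Because the stored size wraps modulo 65536, any sample whose true size modulo 65536 happens to fall below max_sizes masquerades as a legal batch element and reaches the GPU at its full length.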

Steps to Reproduce:

  1. Train a model with --max-tokens below 65535 on data that contains a sample with more than 65535 tokens.
  2. Watch _sizes at the index of that sample get truncated.
  3. Watch the index of that sample pass through the filtering without being excluded.
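
The same truncation can be observed at the dataset level without a full training run. This is a hedged sketch, assuming the TokenBlockDataset constructor signature used throughout fairseq's tasks (dataset, sizes, block_size, pad, eos, break_mode) and the "eos" break mode, under which every sentence becomes its own block regardless of block_size; details may differ between fairseq versions:

```python
import torch
from fairseq.data import TokenBlockDataset

# One pathologically long "sentence" (70,000 tokens) and one short one.
tokens = [torch.ones(70000, dtype=torch.long), torch.ones(10, dtype=torch.long)]
sizes = [len(t) for t in tokens]

# block_size < 65535, so the dataset casts its sizes to np.uint16 internally.
dataset = TokenBlockDataset(
    tokens, sizes, block_size=512, pad=1, eos=2, break_mode="eos"
)

# Expected: [4464 10] -- the first entry wrapped around instead of being 70000.
print(dataset.sizes[:2])
```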

Expected Behavior:
self.sizes should accurately reflect the actual sizes of the data blocks, regardless of whether they are stored as np.uint16 or np.uint32.

Proposed Solution:

  1. Adjust the handling of _sizes and self.sizes so that casting to np.uint16 or np.uint32 cannot compromise the integrity of the size information (a sketch follows this list).
  2. Assert that there is no mismatch between _sizes and the actual sizes.
  3. Add a verification mechanism, or adjust the filtering logic in filter_indices_by_size, so that sizes are handled correctly.
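
A minimal sketch of solution 1, reusing the best_fitting_int_dtype helper that the same file already applies to slice_indices, so the dtype is chosen from the largest size that actually occurs rather than from block_size. This is an illustrative patch, not a tested fix, and the import path may vary by fairseq version:

```python
from fairseq.data.indexed_dataset import best_fitting_int_dtype  # path may vary by version

# Instead of:
#     size_dtype = np.uint16 if block_size < 65535 else np.uint32
#     _sizes = _sizes.astype(size_dtype)
# pick the narrowest dtype that can hold the largest observed size, then
# assert that the cast is lossless before committing to it:
size_dtype = best_fitting_int_dtype(_sizes.max())
casted_sizes = _sizes.astype(size_dtype)
assert (casted_sizes == _sizes).all(), "size dtype too narrow; sizes were truncated"
_sizes = casted_sizes
```

This also covers solution 2: the assertion fails loudly at dataset construction time instead of silently corrupting the filtering later.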