Description:
When initializing a `TokenBlockDataset` object, the sizes of the data blocks (`_sizes`) are cast to `np.uint16` or `np.uint32` depending on `block_size`. This casting causes problems later in the code:
Note: Although this edge case requires a very long sentence, it is possible in practice, and the training process does not detect that anything is wrong, leading to a CUDA out-of-memory exception or faulty training.
Incorrect Size Handling: Casting to `np.uint16` truncates the sizes, so `self.sizes` stores values that do not correspond to the actual sizes of the data blocks.
See these lines: `fairseq/fairseq/data/token_block_dataset.py`, lines 136 to 140 at commit d9a6270.
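A minimal numpy-only sketch of the wrap-around (not fairseq code; the 66,000-token size is just an illustrative oversized block):

```python
import numpy as np

actual_size = 66_000  # a block longer than the uint16 range (0..65535)

# astype silently wraps modulo 2**16 instead of raising an error
stored = np.array([actual_size], dtype=np.int64).astype(np.uint16)
print(stored[0])  # 464, i.e. 66000 - 65536
assert stored[0] == actual_size % 2**16
```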
Filtering Issue: During filtering in `filter_indices_by_size`, the incorrect values in `self.sizes` can cause sentences with more tokens than `max_tokens` to be retained, bypassing the intended filtering logic.
See this line: `fairseq/fairseq/data/fairseq_dataset.py`, line 174 at commit d9a6270.
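A hedged sketch of how a truncated entry then slips past the size check (the comparison mimics the `self.sizes`-vs-`max_sizes` test in `filter_indices_by_size`; the arrays and `max_tokens` value are illustrative):

```python
import numpy as np

max_tokens = 4096
true_sizes = np.array([1500, 66_000, 3000], dtype=np.int64)
stored_sizes = true_sizes.astype(np.uint16)  # -> [1500, 464, 3000]

indices = np.arange(len(stored_sizes))
# Keep samples whose *stored* size fits within max_tokens, as the
# filtering logic does with self.sizes.
kept = indices[stored_sizes[indices] <= max_tokens]
print(kept)              # [0 1 2] -- the 66,000-token sample survives
print(true_sizes[kept])  # its true size later blows up a CUDA batch
```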
OOM in CUDA: The issue propagates through data iteration and model training, potentially causing CUDA out-of-memory (OOM) errors when a sample with an extreme number of tokens is processed because it was never filtered out.
Steps to Reproduce:
Train a model with `--max-tokens` below 65535 and include a data sample with more than 65535 tokens.
Watch `_sizes` at the index of that data sample get truncated.
Watch the index of that data sample pass through filtering without being ignored.
Expected Behavior: `self.sizes` should accurately reflect the sizes of the data blocks, irrespective of the cast to `np.uint16` or `np.uint32`.
Proposed Solution:
Adjust the handling of `_sizes` and `self.sizes` to ensure that casting to `np.uint16` or `np.uint32` does not compromise the integrity of the size information.
Assert that there is no mismatch between `_sizes` and the actual sizes.
Implement a verification mechanism or adjust the filtering logic in `filter_indices_by_size` to correctly handle sizes.
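A possible sketch of the first two points, choosing the dtype from the actual maximum size instead of `block_size` (the helper name is illustrative, not the actual fairseq patch):

```python
import numpy as np

def choose_size_dtype(sizes):
    """Pick the narrowest unsigned dtype that can hold every actual
    block size, so no value can wrap around."""
    max_size = int(sizes.max()) if len(sizes) else 0
    if max_size < 2**16:
        return np.uint16
    if max_size < 2**32:
        return np.uint32
    return np.uint64

_sizes = np.array([1500, 66_000, 3000], dtype=np.int64)
sizes = _sizes.astype(choose_size_dtype(_sizes))

# Guard against any remaining mismatch between stored and actual sizes.
assert np.array_equal(sizes.astype(np.int64), _sizes), \
    "size dtype too narrow: stored sizes no longer match actual sizes"
```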