Fix overflow in nanosets with big datasets #182

jquesnelle · 2024-05-23T15:06:53Z

When a nanoset is particularly big (>4 GB), the calculation of offset (the actual location within the memmap) can overflow. The issue is with the line

offset = dataset_sample * self.sequence_length * (np.iinfo(self.token_dtype).bits / 8)

Here, dataset_sample is a numpy "uint", and the calculation of offset can overflow since numpy's "uint" type is 32 bits. The solution is to promote everything to native Python int first, which has arbitrary precision and therefore cannot overflow.
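To make the failure mode concrete, here is a minimal sketch that reproduces the wraparound with an explicit 32-bit unsigned type and compares it with the Python int calculation. The function names offset_wrapping and offset_exact and the sample values are made up for illustration and are not part of nanotron. The sketch also uses floor division (// 8) so the byte width stays an integer, which is an assumption about the desired form rather than a quote of the final patch.

```python
import numpy as np


def offset_wrapping(sample_index, sequence_length, token_dtype):
    """Byte offset computed entirely in 32-bit unsigned arithmetic (the bug)."""
    bytes_per_token = np.iinfo(token_dtype).bits // 8
    # Silence NumPy's scalar overflow warning; the wraparound is the point here.
    with np.errstate(over="ignore"):
        wrapped = (
            np.uint32(sample_index)
            * np.uint32(sequence_length)
            * np.uint32(bytes_per_token)
        )
    return int(wrapped)


def offset_exact(sample_index, sequence_length, token_dtype):
    """Byte offset computed with Python ints, which never overflow."""
    bytes_per_token = np.iinfo(token_dtype).bits // 8
    return int(sample_index) * int(sequence_length) * bytes_per_token


# Illustrative values: sample 300,000 of 8192 uint16 tokens lies past 4 GiB.
exact = offset_exact(np.uint32(300_000), 8192, np.uint16)
wrapped = offset_wrapping(300_000, 8192, np.uint16)
print(exact, wrapped, exact - wrapped == 2**32)
```

With these values the exact offset lies beyond 2**32 bytes, so the 32-bit result is short by exactly 2**32 and would silently point at the wrong sample in the memmap.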

NouamaneTazi · 2024-05-27T10:07:11Z

src/nanotron/data/nanoset.py

        input_ids_tokens = np.frombuffer(
-            self.dataset_buffer, dtype=self.token_dtype, count=(self.sequence_length + 1), offset=int(offset)
+            self.dataset_buffer, dtype=self.token_dtype, count=(self.sequence_length + 1), offset=offset


Nice find! Can you add a small unittest for this in tests/nanoset/test_build_nanoset_dataloader.py? thanks!
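A rough sketch of what such a test could look like, written against a stand-alone helper rather than the real Nanoset class. compute_offset is a hypothetical function that mirrors the fixed calculation; the actual test in tests/nanoset/test_build_nanoset_dataloader.py would presumably exercise the dataset itself. Testing the arithmetic in isolation avoids having to create a file larger than 4 GB.

```python
import numpy as np


def compute_offset(dataset_sample, sequence_length, token_dtype):
    # Hypothetical helper mirroring the fixed calculation; not nanotron's API.
    bytes_per_token = np.iinfo(token_dtype).bits // 8
    return int(dataset_sample) * int(sequence_length) * bytes_per_token


def test_offset_does_not_wrap_past_4_gib():
    # A 32-bit sample index whose byte offset exceeds the uint32 range.
    offset = compute_offset(np.uint32(300_000), 8192, np.uint16)
    assert isinstance(offset, int)
    assert offset == 300_000 * 8192 * 2
    assert offset > np.iinfo(np.uint32).max


def test_offset_reads_expected_tokens():
    # Small in-memory buffer standing in for the memmapped token file.
    sequence_length = 4
    tokens = np.arange(20, dtype=np.uint16)
    buffer = tokens.tobytes()
    offset = compute_offset(np.uint32(2), sequence_length, np.uint16)
    sample = np.frombuffer(
        buffer, dtype=np.uint16, count=sequence_length + 1, offset=offset
    )
    # Sample 2 starts at token index 2 * 4 = 8 and spans sequence_length + 1 tokens.
    assert sample.tolist() == [8, 9, 10, 11, 12]
```

Both functions follow the pytest naming convention, so they would be collected automatically if dropped into a test module.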

calculate offset as int

a3e4c17

NouamaneTazi requested changes May 27, 2024
