Hi,
I am trying to train tiktoken on a custom dataset (about 15 GB) with a 30k vocab size. It looks like it will take a very long time to finish: one vocab update took almost 8 hours. Any suggestions to make it faster?
Thanks in advance.
Hi.
I did a little research. I have a 41 MB dataset of texts, the "large dataset". It was collected from various sources - fiction, wiki, blogs - small pieces on different topics.
From it I extracted a smaller 854 KB dataset, the "small dataset". It also covers different topics.
I trained a tokenizer with a 6,000-token vocabulary on each of these two datasets. Here are the comparison results:
"Large dataset" - trained for 6-8 hours on an old CPU. Let's take its tokenization result as the "standard".
"Small dataset" - trained in about 12 minutes. Its result coincides with the "standard" by 61.8%, measured as set1.intersection(set2) over the learned tokens (tokens only, ignoring their indices).
More than half of the tokens coincide. Not a lot, but not a little either. The conclusion seems obvious: you can train a tokenizer on a small sample of the data. I think the best result will be achieved if the sample is chosen so that it covers all the topics present in the large dataset.
I am not an expert in this field; my conclusions are based on personal attempts to understand how tokenizers work.
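For reference, the comparison described above can be reproduced with a few lines, assuming the educational trainer shipped with tiktoken (`tiktoken._educational`). The file names and vocab size are placeholders, and the `mergeable_ranks` attribute is how that module exposes the learned merges as far as I know; treat this as a rough sketch, not the exact script used in the experiment.

```python
from tiktoken._educational import SimpleBytePairEncoding

# GPT-2-style pre-tokenization pattern (the \p classes need the `regex` module,
# which tiktoken already depends on).
PAT_STR = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

def train_vocab(path: str, vocab_size: int = 6000) -> set:
    """Train a BPE tokenizer on one text file and return its learned tokens (as bytes)."""
    with open(path, encoding="utf-8") as f:
        data = f.read()
    enc = SimpleBytePairEncoding.train(data, vocab_size=vocab_size, pat_str=PAT_STR)
    return set(enc.mergeable_ranks)  # tokens only, indices ignored

large = train_vocab("large_dataset.txt")  # hypothetical 41 MB corpus
small = train_vocab("small_dataset.txt")  # hypothetical 854 KB subset

# Fraction of the "standard" vocabulary that the small-data tokenizer also learned.
print(f"vocabulary overlap: {len(large & small) / len(large):.1%}")
```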
Hi @vladimirzenin ,
Thanks for your input. You're right, a small subset may be enough to learn the tokens. But in our case, Bengali is a diverse language. We have set aside ~20 GB of data for training the tokenizer so that it captures the actual sub-word structure, but it seems hard to train on that much data with this module.
Regarding your conclusion: for Bengali, if we take only a small portion, it won't come close to the original word distribution. But there might be an efficient way to sample one.
Thanks again.
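One way to get a small training set that still reflects the overall distribution is to sample a fixed fraction of lines from every source or topic file instead of taking one contiguous chunk. A rough sketch; the directory layout, file names, and sampling rate below are made up for illustration:

```python
import random
from pathlib import Path

random.seed(0)
SAMPLE_RATE = 0.02  # keep ~2% of each source; tune so the subset stays manageable

# Hypothetical layout: one text file per source/topic (wiki, blogs, fiction, ...).
with open("tokenizer_sample.txt", "w", encoding="utf-8") as out:
    for src in Path("corpus_by_source").glob("*.txt"):
        with open(src, encoding="utf-8") as f:
            for line in f:
                # Per-line Bernoulli sampling keeps every source represented
                # roughly in proportion to its original size.
                if random.random() < SAMPLE_RATE:
                    out.write(line)
```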
tiktoken/tiktoken/_educational.py, line 117 (commit c0ba74c)
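That permalink points into the educational BPE implementation. As far as I can tell, that trainer rescans the whole corpus once per merge (it is written for clarity, not speed), so runtime grows roughly with corpus size times the number of merges. A quick way to gauge whether a full run is feasible is to time a small-vocab run on a slice of the data first. This sketch assumes the `SimpleBytePairEncoding.train` API from `tiktoken._educational` and uses placeholder file names and sizes:

```python
import time
from tiktoken._educational import SimpleBytePairEncoding

PAT_STR = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

# Read only the first ~50 million characters as a feasibility check (placeholder size).
with open("bengali_corpus.txt", encoding="utf-8") as f:
    data = f.read(50_000_000)

t0 = time.time()
SimpleBytePairEncoding.train(data, vocab_size=500, pat_str=PAT_STR)
elapsed = time.time() - t0

# In the educational trainer, vocab_size appears to include the 256 base byte
# tokens, so this run does roughly 244 merges. A 30k vocab on the full corpus
# needs far more merges over far more text, and each merge rescans everything,
# so scale this timing accordingly before committing to the full 15 GB run.
print(f"small trial run took {elapsed:.0f}s")
```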