Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sync with tiktoken #87

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

l0rinc
Copy link
Contributor

@l0rinc l0rinc commented Feb 12, 2024

After porting the optimizations ending in #77 back to tiktoken, this is a synchronization PR based on the reviews there.

Especially openai/tiktoken#245, which revealed that the legacy encodings have a severe backtracking problem - though it seems that Java's engine can handle it properly, since adding the possessives made it a tiny bit slower - but at least the regex is in sync with tiktoken (note that cl100k is unaffected by the regex changes, only the tokenCount > 2 change is related):

Before:

Benchmark                                                      (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                                data    ss   10  2.268 ± 0.050   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount                      data    ss   10  2.075 ± 0.025   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCountOrdinary              data    ss   10  2.072 ± 0.028   s/op
SingleThreadedBenchmark.benchmarkP50kBase                                  data    ss   10  4.087 ± 0.023   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                                  data    ss   10  4.131 ± 0.093   s/op
SingleThreadedBenchmark.benchmarkR50kBase                                  data    ss   10  3.802 ± 0.025   s/op

After - a bit slower for some reason, but at least it's synchronized with tiktoken:

Benchmark                                                      (dataFolderPath)  Mode  Cnt  Score   Error  Units
SingleThreadedBenchmark.benchmarkCl100kBase                                data    ss   10  2.231 ± 0.015   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCount                      data    ss   10  2.101 ± 0.029   s/op
SingleThreadedBenchmark.benchmarkCl100kBaseTokenCountOrdinary              data    ss   10  2.055 ± 0.041   s/op
SingleThreadedBenchmark.benchmarkP50kBase                                  data    ss   10  4.440 ± 0.082   s/op
SingleThreadedBenchmark.benchmarkP50kEdit                                  data    ss   10  4.451 ± 0.085   s/op
SingleThreadedBenchmark.benchmarkR50kBase                                  data    ss   10  4.086 ± 0.019   s/op

Made this a draft since the tiktoken PRs aren't fully finished yet, so I expect to have a few more changes here as well.

@@ -170,11 +170,7 @@ int mergeBytesAndGetTokenCount(ByteArrayWrapper piece, int length, IntArrayList
ranks.set(nextIndex, DUMMY_RANK);

length--;
if (length < 3) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

micro-optimization, let's simplify

@@ -122,7 +122,7 @@ private static Encoding from50kParameters(
String fileName,
Map<String, Integer> specialTokens
) {
Pattern regex = compileRegex("'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", false);
Pattern regex = compileRegex("'(?:[sdmt]|ll|ve|re)| ?\\p{L}++| ?\\p{N}++| ?[^\\s\\p{L}\\p{N}]++|\\s+(?!\\S)|\\s++", false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except for \\s+(?!\\S) all groups should be possessive, no need to step back on match failure

@l0rinc l0rinc marked this pull request as ready for review February 13, 2024 13:33
@l0rinc l0rinc requested a review from tox-p as a code owner February 13, 2024 13:33
@l0rinc l0rinc marked this pull request as draft February 13, 2024 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant