Add possessive quantifiers to avoid catastrophic backtracking #258

Merged 5 commits on Oct 3, 2024

Contributor

@l0rinc l0rinc commented Feb 11, 2024

Fixes the crash in #245 by using possessive quantifiers, which keep the regex engine from backtracking catastrophically.
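
To illustrate what a possessive quantifier changes, here is a minimal sketch that uses Python's standard `re` module as a stand-in for fancy-regex. The toy patterns and the helper `fully_matches` are assumptions made for illustration, not the tiktoken patterns, and the stdlib only accepts `*+` / `++` from Python 3.11 on, hence the version guard.

```python
import re
import sys

# Possessive quantifiers (`*+`, `++`, `?+`) were added to the stdlib `re` module in Python 3.11.
HAS_POSSESSIVE = sys.version_info >= (3, 11)


def fully_matches(pattern, text):
    """Return True if `pattern` matches all of `text` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None


# Greedy: `a*` first takes all three characters, then gives one back so the final `a` can match.
greedy = fully_matches(r"a*a", "aaa")

# Possessive: `a*+` keeps everything it took, so the final `a` has nothing left to match.
possessive = fully_matches(r"a*+a", "aaa") if HAS_POSSESSIVE else None

# When the next token cannot match what the quantifier consumed, going possessive
# does not change the result; it only removes the retry attempts.
disjoint = fully_matches(r"[a-z]++\d", "abc1") if HAS_POSSESSIVE else None

print(greedy, possessive, disjoint)
```

The third case is the one that matters for a tokenizer pattern: a run such as `\p{L}++` is followed by alternatives that cannot start inside a letter run, so giving characters back could never help.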

Interestingly, these possessives also make the encoding a lot faster again in fancy-regex.

Before this change (but with large byte pair merge PR cherry-picked):

num_threads: 1, num_bytes: 98379553
tiktoken 	11,946,036 bytes / s
tiktoken 	11,961,343 bytes / s
tiktoken 	11,995,846 bytes / s
tiktoken 	11,951,263 bytes / s
tiktoken 	11,983,405 bytes / s

Same, with these changes applied:

num_threads: 1, num_bytes: 98379553
tiktoken 	14,511,827 bytes / s
tiktoken 	14,638,134 bytes / s
tiktoken 	14,644,029 bytes / s
tiktoken 	14,729,030 bytes / s
tiktoken 	14,666,903 bytes / s

Updating the regex libs makes it a tiny bit faster still:

num_threads: 1, num_bytes: 98379553
tiktoken 	14,485,590 bytes / s
tiktoken 	14,854,049 bytes / s
tiktoken 	14,891,086 bytes / s
tiktoken 	14,843,007 bytes / s
tiktoken 	14,874,520 bytes / s

This is almost 2x faster than before any of the optimizations.
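
For a rough sense of scale, the three runs listed above can be averaged and compared. This is only a sketch over the numbers quoted in this description; the variable names are made up here, and the baseline "before any of the optimizations" is not listed, so the 2x figure cannot be recomputed from these runs.

```python
from statistics import mean

# Throughput in bytes / s, copied from the three benchmark runs above.
before = [11_946_036, 11_961_343, 11_995_846, 11_951_263, 11_983_405]
with_possessives = [14_511_827, 14_638_134, 14_644_029, 14_729_030, 14_666_903]
with_updated_libs = [14_485_590, 14_854_049, 14_891_086, 14_843_007, 14_874_520]

# Ratio of mean throughputs relative to the run without possessive quantifiers.
speedup_possessives = mean(with_possessives) / mean(before)
speedup_updated_libs = mean(with_updated_libs) / mean(before)

print(f"possessives: {speedup_possessives:.3f}x, plus updated libs: {speedup_updated_libs:.3f}x")
```

On these numbers the possessive quantifiers account for roughly a 22% gain, and the library update brings it to roughly 24%.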


Opened an issue for increasing the default backtrack limit, see: fancy-regex/fancy-regex#134, but it shouldn't be necessary here anymore.

big_value = "^" * 1000000
assert big_value == enc.decode(enc.encode(big_value))

big_value = " " + big_value
Contributor Author

@l0rinc l0rinc Feb 12, 2024

The leading space is often optional in the pattern, so backtracking can reach it. Let's test that as well.

@l0rinc l0rinc marked this pull request as draft February 12, 2024 11:07
big_value = " " + big_value
assert big_value == enc.decode(enc.encode(big_value))

big_value = big_value + "\n"
Contributor Author

some groups require a newline at the end, stress those paths as well

src/lib.rs Outdated
@@ -417,7 +418,7 @@ impl CoreBPE {
         special_tokens_encoder: HashMap<String, Rank>,
         pattern: &str,
     ) -> PyResult<Self> {
-        let regex = Regex::new(pattern)
+        let regex = RegexBuilder::new(pattern).backtrack_limit(100_000).build()
Contributor Author

doesn't work for values bigger than a million - see fancy-regex/fancy-regex#134

I've set it lower for now, hoping we'll be able to fix the whitespace problem

@l0rinc l0rinc marked this pull request as ready for review February 12, 2024 13:33
@@ -82,7 +84,7 @@ def cl100k_base():
     }
     return {
         "name": "cl100k_base",
-        "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""",
+        "pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}++|\p{N}{1,3}+| ?[^\s\p{L}\p{N}]++[\r\n]*+|\s++$|\s*[\r\n]|\s+(?!\S)|\s""",
Contributor Author

seems the cl100k also had some backtracking problems, these possessives improve the situation considerably (e.g. in Java these aren't necessary, see knuddelsgmbh/jtokkit#87)

@pytest.mark.parametrize("make_enc", ENCODING_FACTORIES)
def test_extremely_big_encoding(make_enc: Callable[[], tiktoken.Encoding]):
    enc = make_enc()
    for c in ["^", "0", "a", "'s", " ", "\n"]:
Contributor Author

stressing different parts of the regex, making sure none of them have catastrophic backtracking
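
The input shapes this test walks through (a long run of one character, the same run with a leading space, then with a trailing newline) can be sketched on their own. The helpers `stress_inputs` and `roundtrips` and the UTF-8 stand-in codec are hypothetical names introduced here so the sketch runs without tiktoken installed; they are not part of the PR.

```python
def stress_inputs(chars, repeat):
    """Yield, for each character, the repeated run and its two padded variants."""
    for c in chars:
        big_value = c * repeat
        yield big_value
        big_value = " " + big_value   # a leading space is often optional in the pattern
        yield big_value
        big_value = big_value + "\n"  # some groups require a newline at the end
        yield big_value


def roundtrips(encode, decode, chars, repeat):
    """Return True if decode(encode(x)) == x for every generated input."""
    return all(decode(encode(x)) == x for x in stress_inputs(chars, repeat))


# Stand-in codec (plain UTF-8 bytes), used only so the sketch is self-contained.
def utf8_encode(text):
    return list(text.encode("utf-8"))


def utf8_decode(ids):
    return bytes(ids).decode("utf-8")


ok = roundtrips(utf8_encode, utf8_decode, ["^", "0", "a", "'s", " ", "\n"], 1000)
print(ok)
```

With tiktoken available, the same check would presumably be run as `roundtrips(enc.encode, enc.decode, ...)` for an `enc` obtained from `tiktoken.get_encoding("cl100k_base")`, with a much larger `repeat`.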

# The pattern in the original GPT-2 release is:
# r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
# This is equivalent, but executes faster:
_legacy_splitter_regex = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}++| ?\p{N}++| ?[^\s\p{L}\p{N}]++|\s++$|\s+(?!\S)|\s"""
Contributor Author

The whitespace quantifiers can't be possessive (they need to step back when they encounter a non-whitespace character), but we can rule out the offending backtracking case by adding a possessive trailing-whitespace check.
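
A small sketch of that reasoning, using Python's standard `re` module as a stand-in for fancy-regex. The patterns are ASCII-only simplifications of the legacy splitter (`[a-z]` replaces `\p{L}`, which the stdlib does not support), the names `OLD`, `NEW`, `TOO_POSSESSIVE` and `split` are made up for this example, and possessive quantifiers need Python 3.11 or newer, hence the guard.

```python
import re
import sys

HAS_POSSESSIVE = sys.version_info >= (3, 11)

# Simplified stand-ins for the legacy splitter before and after the change.
OLD = r" ?[a-z]+|\s+(?!\S)|\s+"
NEW = r" ?[a-z]++|\s++$|\s+(?!\S)|\s"
# What happens if the lookahead branch itself is made possessive.
TOO_POSSESSIVE = r" ?[a-z]++|\s++$|\s++(?!\S)|\s"

SAMPLES = ["a   b", "a b", "a   ", "a \n", "\tb  c\n\n", "   "]


def split(pattern, text):
    """Split `text` into the consecutive matches of `pattern`."""
    return re.findall(pattern, text)


old_pieces = [split(OLD, s) for s in SAMPLES]
new_pieces = [split(NEW, s) for s in SAMPLES] if HAS_POSSESSIVE else None

# `\s+(?!\S)` has to give back one space so that " b" stays a single piece;
# the fully possessive variant cannot, and splits the spaces one by one.
kept = split(OLD, "a   b")
broken = split(TOO_POSSESSIVE, "a   b") if HAS_POSSESSIVE else None

print(new_pieces == old_pieces, kept, broken)
```

On these samples the new pattern produces the same pieces as the old one, while the fully possessive variant changes the split for whitespace followed by a letter. The final `\s` can drop its `+` because that branch is only reached for a single whitespace character directly followed by a non-whitespace one.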

Comment on lines +15 to +16
fancy-regex = "0.13.0"
regex = "1.10.3"
Contributor Author

not absolutely necessary, but adds a tiny speed increase

@@ -417,7 +418,7 @@ impl CoreBPE {
         special_tokens_encoder: HashMap<String, Rank>,
         pattern: &str,
     ) -> PyResult<Self> {
-        let regex = Regex::new(pattern)
+        let regex = RegexBuilder::new(pattern).backtrack_limit(10_000).build()
Contributor Author

after this change we should never backtrack catastrophically, and if we do, this will warn us early
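
The idea of a limit as an early warning can be sketched with a toy matcher. Everything below is hypothetical and written for illustration: the function handles only the single branch `\s*[\r\n]`, the error type and names are invented, and the way it counts steps is an assumption, not a description of how fancy-regex counts backtracks internally.

```python
class BacktrackLimitExceeded(RuntimeError):
    """Hypothetical error type for this sketch; not a fancy-regex or tiktoken name."""


def match_ws_then_newline(text, start, limit):
    r"""Toy matcher for `\s*[\r\n]` at `start`: returns the match end, or None.

    Greedy `\s*` takes the whole whitespace run, then gives characters back one
    at a time until `[\r\n]` fits. Each give-back counts as one backtrack step.
    """
    end = start
    while end < len(text) and text[end].isspace():
        end += 1
    backtracks = 0
    while True:
        if end < len(text) and text[end] in "\r\n":
            return end + 1
        if end == start:
            return None  # nothing left to give back
        end -= 1
        backtracks += 1
        if backtracks > limit:
            raise BacktrackLimitExceeded(f"gave back more than {limit} characters")


def exceeds_limit(text, limit):
    """Return True if the toy matcher aborts on `text` under the given limit."""
    try:
        match_ws_then_newline(text, 0, limit)
    except BacktrackLimitExceeded:
        return True
    return False


print(exceeds_limit(" " * 50_000 + "\n", 10_000))  # newline found after one step back
print(exceeds_limit(" " * 50_000 + "x", 10_000))   # must walk back through every space
```

In this toy model a long whitespace run that ends in a newline stays far below the limit, while the same run followed by a non-whitespace character trips it, which is the kind of input a low limit would surface early.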

@l0rinc l0rinc changed the title Add possessive quantifiers to legacy encodings as well Add possessive quantifiers to avoid catastrophic backtracking Feb 13, 2024

big_value = big_value + "\n"
assert big_value == enc.decode(enc.encode(big_value))

Contributor Author

@l0rinc l0rinc Feb 14, 2024

big_value = big_value + "x" would still fail for whitespace, i.e. "        x".
This seems less typical than the other cases fixed here. I'm not yet sure how to fix it, though; fancy-regex seems pretty basic in this regard...

Collaborator

@hauntsaninja hauntsaninja left a comment

Thanks!

@hauntsaninja hauntsaninja merged commit 9f7f69d into openai:main Oct 3, 2024
21 checks passed
@l0rinc l0rinc deleted the paplorinc/regex-possessives branch October 3, 2024 06:34
tmm1 commented Oct 16, 2024

Was the backtrack limit reverted intentionally in 05e66e8? 05e66e8#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759L421-R438

Was there a regression?

@hauntsaninja
Collaborator

Yes, it was reverted intentionally. There are OpenAI internal encodings where setting the limit caused issues.

@l0rinc
Contributor Author

l0rinc commented Oct 16, 2024

Thanks for checking @tmm1, @hauntsaninja.
