-
Notifications
You must be signed in to change notification settings - Fork 831
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add possessive quantifiers to avoid catastrophic backtracking #258
Conversation
30c92dc
to
ccd8702
Compare
big_value = "^" * 1000000 | ||
assert big_value == enc.decode(enc.encode(big_value)) | ||
|
||
big_value = " " + big_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space is often optional at the beginning, this way the backtracking can reach the space - let's test that as well
2d616bd
to
21c5688
Compare
big_value = " " + big_value | ||
assert big_value == enc.decode(enc.encode(big_value)) | ||
|
||
big_value = big_value + "\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some groups require a newline at the end, stress those paths as well
src/lib.rs
Outdated
@@ -417,7 +418,7 @@ impl CoreBPE { | |||
special_tokens_encoder: HashMap<String, Rank>, | |||
pattern: &str, | |||
) -> PyResult<Self> { | |||
let regex = Regex::new(pattern) | |||
let regex = RegexBuilder::new(pattern).backtrack_limit(100_000).build() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't work for values bigger than a million - see fancy-regex/fancy-regex#134
I've set it lower for now, hoping we'll be able to fix the whitespace problem
019de85
to
51c8a8a
Compare
@@ -82,7 +84,7 @@ def cl100k_base(): | |||
} | |||
return { | |||
"name": "cl100k_base", | |||
"pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+""", | |||
"pat_str": r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}++|\p{N}{1,3}+| ?[^\s\p{L}\p{N}]++[\r\n]*+|\s++$|\s*[\r\n]|\s+(?!\S)|\s""", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems the cl100k also had some backtracking problems, these possessives improve the situation considerably (e.g. in Java these aren't necessary, see knuddelsgmbh/jtokkit#87)
@pytest.mark.parametrize("make_enc", ENCODING_FACTORIES) | ||
def test_extremely_big_encoding(make_enc: Callable[[], tiktoken.Encoding]): | ||
enc = make_enc() | ||
for c in ["^", "0", "a", "'s", " ", "\n"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stressing different parts of the regex, makin sure none have catastrophic backtracking
# The pattern in the original GPT-2 release is: | ||
# r"""'s|'t|'re|'ve|'m|'ll|'d| ?[\p{L}]+| ?[\p{N}]+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""" | ||
# This is equivalent, but executes faster: | ||
_legacy_splitter_regex = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}++| ?\p{N}++| ?[^\s\p{L}\p{N}]++|\s++$|\s+(?!\S)|\s""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The whitespaces can't be possessive (it needs to step back when encountering a non-whitespace), but we can rule out the offending bactracking case by adding a possessive trailing whitespace check.
fancy-regex = "0.13.0" | ||
regex = "1.10.3" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not absolutely necessary, but adds a tiny speed increase
@@ -417,7 +418,7 @@ impl CoreBPE { | |||
special_tokens_encoder: HashMap<String, Rank>, | |||
pattern: &str, | |||
) -> PyResult<Self> { | |||
let regex = Regex::new(pattern) | |||
let regex = RegexBuilder::new(pattern).backtrack_limit(10_000).build() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after this change we should never backtract catastrophically - and if we do, this will warn us early
|
||
big_value = big_value + "\n" | ||
assert big_value == enc.decode(enc.encode(big_value)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
big_value = big_value + "x"
would still fail for whitespaces, i.e " x".
Seems less typical than the other cases which are fixed here, not yet sure how to fix this one, though, the fancy-regex
seems pretty basic in this regard...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Was the backtrack limit reverted intentionally in 05e66e8? 05e66e8#diff-b1a35a68f14e696205874893c07fd24fdb88882b47c23cc0e0c80a30c7d53759L421-R438 Was there a regression? |
Yes, it was reverted intentionally. There are OpenAI internal encodings where setting the limit caused issues. |
Thanks for checking @tmm1, @hauntsaninja. |
Fixes the crash in #245 by prohibiting the regex engine from backtracking catastrophically via possessive quantifiers.
Interestingly these possesives make the encoding a lot faster again in
fancy-regex
.Before this change (but with large byte pair merge PR cherry-picked):
Same, with these changes applied:
Updating the regex libs makes it a tiny bit faster still:
This is almost 2x faster than before any of the optimizations.
Opened an issue for increasing the default backtrack limit, see: fancy-regex/fancy-regex#134, but it shouldn't be necessary here anymore.