I think we could get some speedups in encode_entities by caching some common operations. The examples below would most benefit users encoding lots of data that is heavy on non-named, numeric entities. We may have similar benefits elsewhere.
The first optimization is to cache the results of the sprintf call in num_entity. This could be done with no effect on behavior, in exchange for some hash entries. Here are my experiments:
use strict;
use warnings;
use Benchmark ':all';
use HTML::Entities;

my %cache;
sub cached_num_entity {
    return $cache{$_[0]} ||= sprintf("&#x%X;", ord($_[0]));
}

timethese( 300_000,
    {
        original => sub {
            map { HTML::Entities::num_entity($_) } 1..100
        },
        cached => sub {
            map { cached_num_entity($_) } 1..100
        },
    }
);
Bottom line: a hash lookup is faster than the sprintf, so let's cache it. The other tweak would be to cache the call to num_entity inside the main conversion regex in encode_entities, storing the results in %char2entity. This would have the side effect of modifying the %char2entity hash, which is visible to the outside world. If that wasn't OK, we could have a private copy of the hash specifically so it would be modifiable. The potential downside (or upside?) of that would be that if someone outside the module modified %char2entity, it would have no effect on encode_entities.
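The private-copy variant could look like this minimal sketch. Hedged: the seed entries and the names %public_char2entity and %private_char2entity are illustrative stand-ins for the real %HTML::Entities::char2entity, so the module itself isn't needed to try it.

```perl
use strict;
use warnings;

# Illustrative stand-in for %HTML::Entities::char2entity.
my %public_char2entity = ( '<' => '&lt;', '>' => '&gt;' );

# Private, mutable copy: cached numeric entities land here, so the
# publicly visible hash is never modified.
my %private_char2entity = %public_char2entity;

sub cached_num_entity {
    my ($char) = @_;
    return $private_char2entity{$char} //= sprintf("&#x%X;", ord($char));
}

print cached_num_entity(chr(8855)), "\n";      # &#x2297;
print cached_num_entity('<'), "\n";            # existing entry wins: &lt;
print scalar(keys %public_char2entity), "\n";  # still 2; public hash untouched
```

As noted above, the trade-off is that later external changes to the public hash would no longer be seen by the cached path.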
For benchmarking encode_entities, I used this:
my $ent = chr(8855);
my $num = chr(8854);
my $unencoded = "$ent$num" x 10;
my $text = <<"HTML";
text in "$unencoded"
HTML

timethese( 1_000_000,
    {
        encode => sub { my $x = encode_entities($text) },
    }
);
Results:
42,281/s for the original, unmodified encode_entities.
52,746/s if encode_entities uses the caching num_entity described above, with the main regex unchanged.
64,769/s if the main conversion regex caches the results of num_entity calls in %char2entity. Changing this to also call the caching num_entity gave no noticeable further improvement.
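The fastest variant can be sketched roughly as follows. Hedged: the character class and %char2entity below are simplified local stand-ins for the HTML::Entities internals, not the module's actual code; the point is the ||= in the /e replacement, which memoizes num_entity into the map the first time each character misses.

```perl
use strict;
use warnings;

# Local stand-in for %HTML::Entities::char2entity.
my %char2entity = ( '<' => '&lt;', '>' => '&gt;', '&' => '&amp;', '"' => '&quot;' );

sub num_entity { sprintf("&#x%X;", ord($_[0])) }

sub encode_entities_cached {
    my ($text) = @_;
    # ||= memoizes: the first miss runs num_entity and stores the result,
    # so every later match of the same character is a plain hash lookup.
    $text =~ s/([^\n\r\t !\#\$%\x28-;=?-~])/$char2entity{$1} ||= num_entity($1)/ge;
    return $text;
}

print encode_entities_cached("a \x{2297} b \x{2297}"), "\n";  # a &#x2297; b &#x2297;
```

The second \x{2297} above never reaches sprintf; it is served from the hash populated by the first match.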
I hope these give some ideas. encode_entities is an absolute workhorse at my job (we generate everything with Template Toolkit), and I'm sure for many many others. Any speedup would have wide-ranging benefits.
How do you envision the cache? An in-memory LRU or is this a hash that can grow quite large? Have you tried seeing what kind of speed gains can be had via Memoize?
Just a hash that could theoretically, though not likely, grow quite large, as shown in the code above.
I didn't bother with Memoize because I figured we wouldn't want to add another dependency, but now I see that Memoize has been core since 5.7.3, so that's not an issue.
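For comparison, a Memoize version would be nearly a one-liner. Sketch only: num_entity here is a local stand-in rather than the real HTML::Entities::num_entity.

```perl
use strict;
use warnings;
use Memoize;  # core module since Perl 5.7.3

# Local stand-in for HTML::Entities::num_entity.
sub num_entity { sprintf("&#x%X;", ord($_[0])) }

# Memoize wraps num_entity with a hash-backed cache, equivalent in
# spirit to the hand-rolled %cache shown earlier.
memoize('num_entity');

print num_entity(chr(8855)), "\n";  # &#x2297;, computed once, cached after
```

Whether the extra normalization overhead in Memoize eats into the gains compared with the hand-rolled hash would need benchmarking.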