
[WIP] Prefix needle optimization #148

Open
wants to merge 3 commits into base: main
Conversation

Andersama
Contributor

@Andersama Andersama commented Nov 15, 2020

Adds a bit of machinery to try to extract a leading string<> when we're about to search.
Things to improve on:

  • better string extraction: currently very simple, it only looks for string<> and char<> (which may be buried in a sequence<>)
  • generalization to string-like things
  • UTF-8 strings: Boyer-Moore is bidirectional, while UTF-8 is best handled forwards only

Combines both #143 and #146.
From testing there's anywhere upwards of a 50% performance improvement (when dealing with non-Unicode inputs).
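The idea above can be sketched roughly as follows. This is a minimal stand-alone illustration, not the PR's actual machinery: `find_candidate`, `search_with_prefix`, and the `full_match` callable are all hypothetical names standing in for "scan for the extracted literal prefix, then run the real regex engine only at candidate positions".

```cpp
#include <string_view>
#include <cstddef>

// Hypothetical sketch: once pattern analysis yields a fixed literal prefix
// (e.g. from a string<> at the head of the pattern), we can scan for that
// literal first and only run the full matcher at candidate positions.
inline std::size_t find_candidate(std::string_view haystack,
                                  std::string_view prefix,
                                  std::size_t from = 0) {
    return haystack.find(prefix, from);
}

// full_match stands in for "run the real regex engine at this position";
// here it is any callable validating a match starting at the given suffix.
template <typename Matcher>
std::size_t search_with_prefix(std::string_view haystack,
                               std::string_view prefix,
                               Matcher full_match) {
    for (std::size_t pos = find_candidate(haystack, prefix);
         pos != std::string_view::npos;
         pos = find_candidate(haystack, prefix, pos + 1)) {
        if (full_match(haystack.substr(pos)))
            return pos;            // first verified match
    }
    return std::string_view::npos; // no candidate validated
}
```

The win comes from the literal scan being much cheaper per byte than the full matcher, so the engine only runs at positions where the prefix already lines up.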

@Andersama Andersama force-pushed the prefix_needle_optimization branch 7 times, most recently from 2bfe7f8 to a0b6505 Compare November 16, 2020 02:16
@Andersama
Contributor Author

Playing around with a DFA-based implementation, taking some inspiration from Hyperscan:
https://branchfree.org/2018/05/25/say-hello-to-my-little-friend-sheng-a-small-but-fast-deterministic-finite-automaton/

I'm going to be doing a bit more benchmarking because the results are all over the place, but it also seems worth pursuing.

@Andersama
Contributor Author

Andersama commented Nov 18, 2020

So long as the pattern fits properly into the sheng table (16 or fewer chars) it looks like a 50x improvement over the current implementation. The DFA approach also appears to play nicely in debug mode.

Scratch that, I have a bug in the DFA; it's not matching properly, so the fixed version will likely be significantly slower.

Ok, worked it out: currently the DFA's running about 33% faster than the current approach.
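For readers unfamiliar with sheng: the trick (per the linked blog post) is that with at most 16 DFA states, each input byte maps to a 16-entry vector of next states, so the whole transition function for one byte becomes a single PSHUFB shuffle with no branching. Below is a scalar simulation of that table layout, not the SIMD version and not this PR's code; the example machine (matching the literal "abc") and the `ShengLikeDfa` name are made up for illustration.

```cpp
#include <array>
#include <cstdint>
#include <string_view>

// Scalar sketch of the sheng layout: table[byte] is a 16-entry row of next
// states (each fits in 4 bits), so in the SIMD version one byte's transition
// is a single shuffle. Example machine: match the literal "abc".
//   state 0: start, 1: saw 'a', 2: saw "ab", 3: saw "abc" (accepting, sticky)
struct ShengLikeDfa {
    // table[byte][state] -> next state
    std::array<std::array<std::uint8_t, 16>, 256> table{};

    ShengLikeDfa() {
        for (auto& row : table) row.fill(0);  // default: back to start
        table['a'].fill(1);                   // 'a' always (re)starts a match
        table['b'][1] = 2;
        table['c'][2] = 3;
        for (auto& row : table) row[3] = 3;   // accepting state is sticky
    }

    bool matches(std::string_view input) const {
        std::uint8_t state = 0;
        for (unsigned char c : input)         // one table lookup per byte,
            state = table[c][state];          // no branches in the loop
        return state == 3;
    }
};
```

The branch-free inner loop is why this style of DFA tends to hold up well even in debug builds, as noted above.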

@Andersama
Contributor Author

Picked up the O'Reilly book to see if there were more optimizations mentioned. There's another one similar to this, which should've been intuitive I guess: if you know where a string appears, e.g. after some fixed-length prefix, then you can search for the "embedded string". It also looks like a good candidate for some pattern analysis, e.g. if you have a string midway through a regex and the preceding part's minimum and maximum character consumption counts are equal, then you're guaranteed that string appears exactly that many characters into the match, so it's safe to search for the string and then shift over to validate a match. Currently trying to automate building DFAs, but I'll cycle back to this because it was already a big performance boost.
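The fixed-offset trick described above can be sketched like this. Again a hypothetical stand-alone illustration, not the PR's implementation: `search_embedded` and its `full_match` callable are assumed names, and `offset` is the exact character count consumed by the sub-pattern before the embedded literal (valid only when its min and max lengths are equal).

```cpp
#include <string_view>
#include <cstddef>

// Hypothetical sketch: if the sub-pattern before an embedded literal always
// consumes exactly `offset` characters (min length == max length), the
// literal can only appear `offset` chars into any match. So: search for the
// literal, shift back by `offset`, and validate from there.
template <typename Matcher>
std::size_t search_embedded(std::string_view haystack,
                            std::string_view literal,
                            std::size_t offset,  // exact chars before literal
                            Matcher full_match) {
    std::size_t pos = haystack.find(literal, offset);
    while (pos != std::string_view::npos) {
        std::size_t start = pos - offset;        // candidate match start
        if (full_match(haystack.substr(start)))
            return start;
        pos = haystack.find(literal, pos + 1);
    }
    return std::string_view::npos;
}
```

For example, a pattern like `..foo` (two arbitrary characters followed by "foo") has a fixed prefix length of 2, so "foo" can only ever sit two characters into a match.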
