
[WIP] Prefix needle optimization #148

Open
wants to merge 3 commits into base: main
Conversation

Andersama
Contributor

@Andersama Andersama commented Nov 15, 2020

Adds a bit of machinery to try to extract a leading string<> when we're about to search.
Things to improve on:

  • better string extraction: currently very simple, it only looks for string<> and char<> (which may be buried in a sequence<>)
  • generalization to string-like things
  • UTF-8 strings: Boyer-Moore is bidirectional, while UTF-8 is best handled forwards only

Combines both #143 and #146.
From testing there's anywhere upwards of a 50% performance improvement (when dealing with non-Unicode inputs).
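The idea above can be sketched roughly as follows. This is a minimal stand-alone illustration, not the PR's actual machinery: `find_candidate`, `search_with_prefix`, and the `full_match` callable are all hypothetical names standing in for "scan for the extracted literal prefix, then run the real regex engine only at candidate positions".

```cpp
#include <string_view>
#include <cstddef>

// Hypothetical sketch: once pattern analysis yields a fixed literal prefix
// (e.g. from a string<> at the head of the pattern), we can scan for that
// literal first and only run the full matcher at candidate positions.
inline std::size_t find_candidate(std::string_view haystack,
                                  std::string_view prefix,
                                  std::size_t from = 0) {
    return haystack.find(prefix, from);
}

// full_match stands in for "run the real regex engine at this position";
// here it is any callable validating a match starting at the given suffix.
template <typename Matcher>
std::size_t search_with_prefix(std::string_view haystack,
                               std::string_view prefix,
                               Matcher full_match) {
    for (std::size_t pos = find_candidate(haystack, prefix);
         pos != std::string_view::npos;
         pos = find_candidate(haystack, prefix, pos + 1)) {
        if (full_match(haystack.substr(pos)))
            return pos;            // first verified match
    }
    return std::string_view::npos; // no candidate validated
}
```

The win comes from the literal scan being much cheaper per byte than the full matcher, so the engine only runs at positions where the prefix already lines up.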

@Andersama Andersama force-pushed the prefix_needle_optimization branch 7 times, most recently from 2bfe7f8 to a0b6505 Compare November 16, 2020 02:16
@Andersama
Contributor Author

Playing around with a DFA-based implementation, taking some inspiration from Hyperscan:
https://branchfree.org/2018/05/25/say-hello-to-my-little-friend-sheng-a-small-but-fast-deterministic-finite-automaton/

I'm going to be doing a bit more benchmarking because the results are all over the place, but it also seems worth pursuing.

@Andersama
Contributor Author

Andersama commented Nov 18, 2020

So long as the pattern fits properly into the sheng table (16 or fewer chars) it looks like a 50x improvement over the current implementation. The DFA approach also appears to play nicely in debug mode.

Scratch that, I have a bug in the DFA; it's not matching properly, so the fixed version will likely be significantly slower.

Ok, worked it out: currently the DFA's running about 33% faster than the current approach.
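For readers unfamiliar with sheng: the trick (per the linked blog post) is that with at most 16 DFA states, each input byte maps to a 16-entry vector of next states, so the whole transition function for one byte becomes a single PSHUFB shuffle with no branching. Below is a scalar simulation of that table layout, not the SIMD version and not this PR's code; the example machine (matching the literal "abc") and the `ShengLikeDfa` name are made up for illustration.

```cpp
#include <array>
#include <cstdint>
#include <string_view>

// Scalar sketch of the sheng layout: table[byte] is a 16-entry row of next
// states (each fits in 4 bits), so in the SIMD version one byte's transition
// is a single shuffle. Example machine: match the literal "abc".
//   state 0: start, 1: saw 'a', 2: saw "ab", 3: saw "abc" (accepting, sticky)
struct ShengLikeDfa {
    // table[byte][state] -> next state
    std::array<std::array<std::uint8_t, 16>, 256> table{};

    ShengLikeDfa() {
        for (auto& row : table) row.fill(0);  // default: back to start
        table['a'].fill(1);                   // 'a' always (re)starts a match
        table['b'][1] = 2;
        table['c'][2] = 3;
        for (auto& row : table) row[3] = 3;   // accepting state is sticky
    }

    bool matches(std::string_view input) const {
        std::uint8_t state = 0;
        for (unsigned char c : input)         // one table lookup per byte,
            state = table[c][state];          // no branches in the loop
        return state == 3;
    }
};
```

The branch-free inner loop is why this style of DFA tends to hold up well even in debug builds, as noted above.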

@Andersama
Contributor Author

Picked up the O'Reilly book to see if there were more optimizations mentioned. There's another one similar to this, which should've been intuitive I guess: if you know where a string appears, e.g. after some fixed-length prefix, then you can search for the "embedded string". It also looks like a good candidate for some pattern analysis, e.g. if you have a string midway through a regex and the preceding part's minimum and maximum character consumption counts are equal, then you're guaranteed that string appears exactly that many characters into the match, so it's safe to search for the string and then shift over to validate a match. Currently trying to automate building DFAs, but I'll cycle back to this because it was already a big performance boost.
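The fixed-offset trick described above can be sketched like this. Again a hypothetical stand-alone illustration, not the PR's implementation: `search_embedded` and its `full_match` callable are assumed names, and `offset` is the exact character count consumed by the sub-pattern before the embedded literal (valid only when its min and max lengths are equal).

```cpp
#include <string_view>
#include <cstddef>

// Hypothetical sketch: if the sub-pattern before an embedded literal always
// consumes exactly `offset` characters (min length == max length), the
// literal can only appear `offset` chars into any match. So: search for the
// literal, shift back by `offset`, and validate from there.
template <typename Matcher>
std::size_t search_embedded(std::string_view haystack,
                            std::string_view literal,
                            std::size_t offset,  // exact chars before literal
                            Matcher full_match) {
    std::size_t pos = haystack.find(literal, offset);
    while (pos != std::string_view::npos) {
        std::size_t start = pos - offset;        // candidate match start
        if (full_match(haystack.substr(start)))
            return start;
        pos = haystack.find(literal, pos + 1);
    }
    return std::string_view::npos;
}
```

For example, a pattern like `..foo` (two arbitrary characters followed by "foo") has a fixed prefix length of 2, so "foo" can only ever sit two characters into a match.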
