Another approach to speed up tokenizer #24
Open
nene wants to merge 5 commits into
Instead of looping through the array of all lexemes on every step, create a map of characters to lexemes that records which lexemes can begin with a given character. When tokenizing, we peek at the next character, look up the few lexemes that can begin with it, and try to match only those. My benchmarks show a 2x tokenizing performance increase with this optimization.
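A minimal sketch of the character lookup idea, not the code from this PR. The lexeme names, patterns, and explicit start-character lists below are hypothetical, and `peek(1)` returns a single byte, so the sketch assumes ASCII input.

```ruby
require "strscan"

# Hypothetical lexemes: name, pattern, and the characters the pattern can start with.
LEXEMES = [
  [:NUMBER, /\d+/,       ("0".."9").to_a],
  [:IDENT,  /[a-z_]\w*/, ("a".."z").to_a + ["_"]],
  [:S,      /\s+/,       [" ", "\t", "\n"]],
  [:PLUS,   /\+/,        ["+"]],
].freeze

# Built once: character => list of [name, pattern] pairs that can begin with it.
LOOKUP = LEXEMES.each_with_object({}) do |(name, pattern, chars), table|
  chars.each { |ch| (table[ch] ||= []) << [name, pattern] }
end.freeze

def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    # Peek at the next character and try only the lexemes registered for it.
    candidates = LOOKUP.fetch(scanner.peek(1), [])
    name, _pattern = candidates.find { |_, pattern| scanner.scan(pattern) }
    raise ArgumentError, "unexpected input at offset #{scanner.pos}" unless name
    tokens << [name, scanner.matched]
  end
  tokens
end
```

For example, `tokenize("foo + 42")` tries one candidate pattern per step instead of all four.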
Because we're now using StringScanner, all the regexes will only match at the beginning anyway, so \A is redundant.
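To illustrate the anchoring behaviour, here is a small sketch. The method name `anchored_scan_demo` and the input string are made up for this example. `StringScanner#scan` only matches at the current scan pointer, so an unanchored pattern cannot skip ahead.

```ruby
require "strscan"

# Hypothetical demo: returns what each scan call yields on "abc 123".
def anchored_scan_demo
  scanner = StringScanner.new("abc 123")
  too_early = scanner.scan(/\d+/)    # nil: digits exist later, but not at the pointer
  word      = scanner.scan(/[a-z]+/) # matches at the pointer and advances it
  scanner.scan(/\s+/)                # consume the space
  number    = scanner.scan(/\d+/)    # now the digits are at the pointer
  [too_early, word, number]
end
```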
By just moving the :LITERALS lexeme between :REGEXP and :SINGLE_CHAR, the lexemes are now ordered so that we can blindly return the first one that matches. Also moved the :S lexeme alongside the other one-line definitions. Because of the previous char-lookup-table optimization, this one improves the speed only slightly. But IMHO the code is a bit cleaner this way.
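A rough sketch of why the ordering matters when the first match wins. The lexeme names echo the ones above, but the patterns and the `first_match` helper are hypothetical and not taken from the project.

```ruby
require "strscan"

# Hypothetical patterns: multi-character literals come before the single-char fallback.
ORDERED_LEXEMES = [
  [:LITERALS,    /===|==|<=|>=/],
  [:SINGLE_CHAR, /./],
].freeze

# Return [name, text] for the first lexeme that matches at the scan pointer.
def first_match(scanner, lexemes)
  lexemes.each do |name, pattern|
    text = scanner.scan(pattern)
    return [name, text] if text
  end
  nil
end
```

With this order, `first_match(StringScanner.new("<= 1"), ORDERED_LEXEMES)` picks `:LITERALS`; with the list reversed, the single-character fallback would win and split `<=` into two tokens.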
This mainly speeds up the last tokenization step: converting tokens to racc tokens. That's only about a 1.2x speedup of the tokenizer, though.
My previous PR for this was somewhat hacky, so I took a somewhat different approach: building a lookup table of all the characters, so that at each tokenization step we can quickly determine the possible tokens that can follow by looking at the next character.
Combining this with some other optimizations, I get about a 2.5x speedup of the tokenizer. With this, tokenization now takes about 30% of the total parsing time (instead of about 50%), so the main bottleneck is now the parser itself.