Hi, thanks for the effort put into Microsoft.ML.Tokenizers!
I'm the author of the latest performance improvements in the SharpToken library.
Since Microsoft.ML.Tokenizers is now faster than SharpToken, I looked into the sources to understand where this performance comes from.
Now I have a question, out of curiosity:
Why is it required to copy a ReadOnlySpan<char> to a buffer, when the rest of the code just uses a ReadOnlySpan<char> again?
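To make the question concrete, here is a minimal sketch of the pattern I mean. The names and logic are hypothetical, not the actual Microsoft.ML.Tokenizers code; it just shows an incoming span being copied into a rented buffer even though the subsequent processing is span-based anyway:

```csharp
using System;
using System.Buffers;

public static class SpanCopyExample
{
    public static int CountLetters(ReadOnlySpan<char> text)
    {
        // Copy the incoming span into a pooled buffer...
        char[] buffer = ArrayPool<char>.Shared.Rent(text.Length);
        try
        {
            text.CopyTo(buffer);

            // ...and then process a span over that buffer anyway,
            // which is what prompts the question: why not process
            // the original span directly?
            ReadOnlySpan<char> copy = buffer.AsSpan(0, text.Length);
            int count = 0;
            foreach (char c in copy)
            {
                if (char.IsLetter(c))
                {
                    count++;
                }
            }
            return count;
        }
        finally
        {
            ArrayPool<char>.Shared.Return(buffer);
        }
    }

    public static void Main()
    {
        Console.WriteLine(CountLetters("hello, world".AsSpan()));
    }
}
```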
TiktokenPreTokenizer.cs line: 104
machinelearning/src/Microsoft.ML.Tokenizers/PreTokenizer/TiktokenPreTokenizer.cs, lines 95 to 107 at commit 72cfdf6
PreTokenizer.cs line: 74
machinelearning/src/Microsoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs, lines 43 to 54 at commit 72cfdf6