The fastest tokenizer for GPT-3.5 and GPT-4 inspired by Tiktoken.
$ dotnet add package TiktokenThis implementation aims for maximum performance, especially in the token count operation.
There's also a benchmark console app here for easy tracking of this.
We will be happy to accept any PR.
cl100k_baser50k_basep50k_basep50k_editvar encoding = Tiktoken.Encoding.ForModel("gpt-4");
var tokens = encoding.Encode("hello world"); // [15339, 1917]
var text = encoding.Decode(tokens); // hello world
var numberOfTokens = encoding.CountTokens(text); // 2
var stringTokens = encoding.Explore(text); // ["hello", " world"]
var encoding = Tiktoken.Encoding.Get(Encodings.P50KBase);
var tokens = encoding.Encode("hello world"); // [31373, 995]
var text = encoding.Decode(tokens); // hello world
You can view the reports for each version here
<!--BENCHMARKS_START-->
BenchmarkDotNet v0.13.12, macOS Sonoma 14.2.1 (23C71) [Darwin 23.2.0]
Apple M1 Pro, 1 CPU, 10 logical and 10 physical cores
.NET SDK 8.0.100
[Host] : .NET 8.0.0 (8.0.23.53103), Arm64 RyuJIT AdvSIMD
DefaultJob : .NET 8.0.0 (8.0.23.53103), Arm64 RyuJIT AdvSIMD
| Method | Categories | Data | Mean | Median | Ratio | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| SharpTokenV1_2_16_ | CountTokens | 1. (...)57. [19866] | 1,554,552.0 ns | 1,552,769.4 ns | 1.00 | 292.9688 | 146.4844 | - | 1846147 B | 1.00 |
| TiktokenSharpV1_0_9_ | CountTokens | 1. (...)57. [19866] | 1,242,157.7 ns | 1,241,657.7 ns | 0.80 | 253.9063 | 117.1875 | 3.9063 | 1570786 B | 0.85 |
| TokenizerLibV1_3_3_ | CountTokens | 1. (...)57. [19866] | 815,490.5 ns | 806,761.4 ns | 0.52 | 247.0703 | 98.6328 | 0.9766 | 1547678 B | 0.84 |
| Tiktoken_ | CountTokens | 1. (...)57. [19866] | 311,744.2 ns | 311,591.0 ns | 0.20 | 49.3164 | - | - | 309449 B | 0.17 |
| SharpTokenV1_2_16_ | CountTokens |
Priority place for bugs: https://github.com/tryAGI/LangChain/issues
Priority place for ideas and general questions: https://github.com/tryAGI/LangChain/discussions
Discord: https://discord.gg/Ca2xhfBf3v
| 1,585.8 ns |
| 1,586.5 ns |
| 1.00 |
| 0.5188 |
| 0.0019 |
| - |
| 3264 B |
| 1.00 |
| TiktokenSharpV1_0_9_ | CountTokens | Hello, World! | 5,806.8 ns | 5,805.7 ns | 3.66 | 2.1286 | 0.0381 | 0.0076 | 13344 B | 4.09 |
| TokenizerLibV1_3_3_ | CountTokens | Hello, World! | 766.2 ns | 766.7 ns | 0.48 | 0.2356 | - | - | 1480 B | 0.45 |
| Tiktoken_ | CountTokens | Hello, World! | 210.9 ns | 210.2 ns | 0.13 | 0.0420 | - | - | 264 B | 0.08 |
| SharpTokenV1_2_16_ | CountTokens | King(...)edy. [275] | 13,851.9 ns | 13,808.5 ns | 1.00 | 4.1351 | 0.0153 | - | 25968 B | 1.00 |
| TiktokenSharpV1_0_9_ | CountTokens | King(...)edy. [275] | 13,387.6 ns | 13,395.3 ns | 0.97 | 5.0659 | 0.1984 | 0.0153 | 31712 B | 1.22 |
| TokenizerLibV1_3_3_ | CountTokens | King(...)edy. [275] | 10,861.4 ns | 10,865.2 ns | 0.78 | 3.0975 | 0.1526 | 0.0153 | 19344 B | 0.74 |
| Tiktoken_ | CountTokens | King(...)edy. [275] | 3,162.3 ns | 3,162.0 ns | 0.23 | 0.6447 | - | - | 4064 B | 0.16 |
| SharpTokenV1_2_16_Encode | Encode | 1. (...)57. [19866] | 1,327,775.1 ns | 1,330,166.1 ns | 1.00 | 294.9219 | 142.5781 | 1.9531 | 1846151 B | 1.00 |
| TiktokenSharpV1_0_9_Encode | Encode | 1. (...)57. [19866] | 1,016,985.4 ns | 994,095.3 ns | 0.80 | 250.0000 | 125.0000 | - | 1570772 B | 0.85 |
| TokenizerLibV1_3_3_Encode | Encode | 1. (...)57. [19866] | 804,657.4 ns | 803,549.7 ns | 0.61 | 247.0703 | 108.3984 | 0.9766 | 1547678 B | 0.84 |
| Tiktoken_Encode | Encode | 1. (...)57. [19866] | 331,107.8 ns | 331,142.1 ns | 0.25 | 59.5703 | 2.4414 | - | 375601 B | 0.20 |
| SharpTokenV1_2_16_Encode | Encode | Hello, World! | 1,891.1 ns | 1,894.6 ns | 1.00 | 0.5188 | 0.0019 | - | 3264 B | 1.00 |
| TiktokenSharpV1_0_9_Encode | Encode | Hello, World! | 5,816.9 ns | 5,824.0 ns | 3.08 | 2.1210 | 0.0381 | - | 13344 B | 4.09 |
| TokenizerLibV1_3_3_Encode | Encode | Hello, World! | 496.7 ns | 496.8 ns | 0.26 | 0.2356 | - | - | 1480 B | 0.45 |
| Tiktoken_Encode | Encode | Hello, World! | 265.3 ns | 264.7 ns | 0.14 | 0.1030 | - | - | 648 B | 0.20 |
| SharpTokenV1_2_16_Encode | Encode | King(...)edy. [275] | 17,497.7 ns | 17,480.3 ns | 1.00 | 4.1199 | 0.0305 | - | 25968 B | 1.00 |
| TiktokenSharpV1_0_9_Encode | Encode | King(...)edy. [275] | 13,374.0 ns | 13,348.4 ns | 0.76 | 5.0659 | 0.1984 | 0.0153 | 31712 B | 1.22 |
| TokenizerLibV1_3_3_Encode | Encode | King(...)edy. [275] | 7,333.9 ns | 7,338.7 ns | 0.42 | 3.0899 | 0.1450 | 0.0076 | 19344 B | 0.74 |
| Tiktoken_Encode | Encode | King(...)edy. [275] | 3,450.2 ns | 3,452.9 ns | 0.20 | 0.7973 | - | - | 5024 B | 0.19 |