The fastest tokenizer for GPT-3.5 and GPT-4 inspired by Tiktoken.
$ dotnet add package Tiktoken

This implementation aims for maximum performance, especially in the token count operation.
There's also a benchmark console app here for easy tracking of performance.
We will be happy to accept any PR.
Encodings: cl100k_base, r50k_base, p50k_base, p50k_edit

var encoding = Tiktoken.Encoding.ForModel("gpt-4");
var tokens = encoding.Encode("hello world"); // [15339, 1917]
var text = encoding.Decode(tokens); // hello world
var numberOfTokens = encoding.CountTokens(text); // 2
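
// Alternatively, get an encoding by name. The variables are renamed so both examples compile in the same scope.
var p50kEncoding = Tiktoken.Encoding.Get("p50k_base");
var p50kTokens = p50kEncoding.Encode("hello world"); // [31373, 995]
var p50kText = p50kEncoding.Decode(p50kTokens); // hello world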
<!--BENCHMARKS_START-->
<!--BENCHMARKS_END-->
You can view the reports for each version here
BenchmarkDotNet=v0.13.5, OS=macOS Ventura 13.3.1 (a) (22E772610a) [Darwin 22.4.0]
Apple M1 Pro, 1 CPU, 10 logical and 10 physical cores
.NET SDK=7.0.203
[Host] : .NET 7.0.5 (7.0.523.17405), Arm64 RyuJIT AdvSIMD
Job-MEZVRD : .NET 7.0.5 (7.0.523.17405), Arm64 RyuJIT AdvSIMD DEBUG
BuildConfiguration=Debug
| Method | Categories | Data | Mean | Ratio | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| SharpTokenV1_0_28_ | CountTokens | 1. (...)57. [19866] | 5,247,614.0 ns | 1.00 | 601.5625 | 289.0625 | - | 3805771 B | 1.00 |
| TiktokenSharpV1_0_5_ | CountTokens | 1. (...)57. [19866] | 1,509,944.3 ns | 0.29 | 250.0000 | 125.0000 | - | 1571154 B | 0.41 |
| Tiktoken_ | CountTokens | 1. (...)57. [19866] | 411,159.5 ns | 0.08 | 49.3164 | - | - | 309449 B | 0.08 |
| SharpTokenV1_0_28_ | CountTokens | Hello, World! | 3,217.5 ns | 1.00 | 0.6752 | - | - | 4240 B | 1.00 |
| TiktokenSharpV1_0_5_ | CountTokens | Hello, World! | 6,631.3 ns | 2.06 | 2.1820 | 0.0381 | - | 13728 B | 3.24 |
| Tiktoken_ | CountTokens | Hello, World! | 329.0 ns | 0.10 | 0.0420 | - | - | 264 B | 0.06 |
| SharpTokenV1_0_28_ | CountTokens | King(...)edy. [275] | 62,301.4 ns | 1.00 | 8.5449 | 0.3662 | - | 54160 B | 1.00 |
| TiktokenSharpV1_0_5_ | CountTokens | King(...)edy. [275] | 21,541.3 ns | 0.35 | 5.0964 | 0.2136 | - | 32096 B | 0.59 |
| Tiktoken_ | CountTokens | King(...)edy. [275] | 4,575.6 ns | 0.07 | 0.6409 | - | - | 4032 B | 0.07 |
| SharpTokenV1_0_28_Encode | Encode | 1. (...)57. [19866] | 4,962,747.6 ns | 1.00 | 601.5625 | 296.8750 | - | 3805769 B | 1.00 |
| TiktokenSharpV1_0_5_Encode | Encode | 1. (...)57. [19866] | 1,643,795.4 ns | 0.33 | 251.9531 | 126.9531 | 1.9531 | 1571156 B | 0.41 |
| Tiktoken_Encode | Encode | 1. (...)57. [19866] | 427,241.4 ns | 0.09 | 59.5703 | 29.7852 | - | 375665 B | 0.10 |
| SharpTokenV1_0_28_Encode | Encode | Hello, World! | 3,281.3 ns | 1.00 | 0.6752 | 0.0038 | - | 4240 B | 1.00 |
| TiktokenSharpV1_0_5_Encode | Encode | Hello, World! | 6,636.0 ns | 2.02 | 2.1820 | 0.0458 | - | 13728 B | 3.24 |
| Tiktoken_Encode | Encode | Hello, World! | 433.3 ns | 0.13 | 0.1135 | 0.0005 | - | 712 B | 0.17 |
| SharpTokenV1_0_28_Encode | Encode | King(...)edy. [275] | 62,339.9 ns | 1.00 | 8.5449 | 0.4883 | - | 54160 B | 1.00 |
| TiktokenSharpV1_0_5_Encode | Encode | King(...)edy. [275] | 21,609.6 ns | 0.35 | 5.0964 | 0.3052 | - | 32096 B | 0.59 |
| Tiktoken_Encode | Encode | King(...)edy. [275] | 5,224.8 ns | 0.08 | 0.8011 | 0.0153 | - | 5056 B | 0.09 |
Priority place for bugs: https://github.com/tryAGI/LangChain/issues
Priority place for ideas and general questions: https://github.com/tryAGI/LangChain/discussions
Discord: https://discord.gg/Ca2xhfBf3v