Tiktoken

This implementation aims for maximum performance, especially when counting tokens.
There is also a benchmark console app for tracking performance between versions.
Pull requests are welcome.
Implemented encodings
- cl100k_base
- r50k_base
- p50k_base
- p50k_edit
Usage
// Look up the encoding by model name
var encoding = Tiktoken.Encoding.ForModel("gpt-4");
var tokens = encoding.Encode("hello world"); // [15339, 1917]
var text = encoding.Decode(tokens); // hello world
var numberOfTokens = encoding.CountTokens(text); // 2

// Or look up the encoding by its own name
var p50kEncoding = Tiktoken.Encoding.Get("p50k_base");
var p50kTokens = p50kEncoding.Encode("hello world"); // [31373, 995]
var p50kText = p50kEncoding.Decode(p50kTokens); // hello world
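
A common use of `CountTokens` is checking that a prompt fits a token budget before it is sent. The sketch below is only an illustration. `FitsBudget` and `CountWords` are hypothetical helpers, not part of this library. The word-based counter is a stand-in so the snippet runs without the package, and it does not reflect real tokenization. Passing `encoding.CountTokens` as the delegate assumes the method takes a `string` and returns an `int`, as the usage above suggests.

```csharp
using System;

// Hypothetical helper (not part of the library): true when the prompt's
// token count does not exceed maxTokens. Any string -> int counter works.
bool FitsBudget(Func<string, int> countTokens, string prompt, int maxTokens)
    => countTokens(prompt) <= maxTokens;

// Stand-in counter so the sketch runs on its own: it counts
// whitespace-separated words, which is NOT how real tokenization works.
int CountWords(string text)
    => text.Split(new[] { ' ', '\t', '\n', '\r' },
                  StringSplitOptions.RemoveEmptyEntries).Length;

// With the package referenced, the counter would presumably be:
//   var encoding = Tiktoken.Encoding.ForModel("gpt-4");
//   Func<string, int> counter = encoding.CountTokens;
Func<string, int> counter = CountWords;

Console.WriteLine(FitsBudget(counter, "hello world", 8)); // True
Console.WriteLine(FitsBudget(counter, "hello world", 1)); // False
```

Taking the counter as a delegate keeps the budget check independent of any particular encoding, so the same helper can be reused with `cl100k_base`, `p50k_base`, or a test double.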
Benchmarks
You can view the reports for each version here
<!--BENCHMARKS_START-->
BenchmarkDotNet=v0.13.5, OS=macOS Ventura 13.4 (22F66) [Darwin 22.5.0]
Apple M1 Pro, 1 CPU, 10 logical and 10 physical cores
.NET SDK=7.0.304
[Host] : .NET 7.0.7 (7.0.723.27404), Arm64 RyuJIT AdvSIMD
Job-EVPJQP : .NET 7.0.7 (7.0.723.27404), Arm64 RyuJIT AdvSIMD DEBUG
BuildConfiguration=Debug
| Method | Categories | Data | Mean | Ratio | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
|---|---|---|---|---|---|---|---|---|---|
| SharpTokenV1_0_28_ | CountTokens | 1. (...)57. [19866] | 5,416,309.6 ns | 1.00 | 601.5625 | 289.0625 | - | 3805771 B | 1.00 |
| TiktokenSharpV1_0_5_ | CountTokens | 1. (...)57. [19866] | 1,532,135.7 ns | 0.28 | 250.0000 | 125.0000 | - | 1571155 B | 0.41 |
| TokenizerLibV1_3_2_ | CountTokens | 1. (...)57. [19866] | 856,737.2 ns | 0.16 | 246.0938 | 87.8906 | - | 1547674 B | 0.41 |
| Tiktoken_ | CountTokens | 1. (...)57. [19866] | 413,599.2 ns | 0.08 | 49.3164 | - | - | 309449 B | 0.08 |
| | | | | | | | | | |
| SharpTokenV1_0_28_ | CountTokens | Hello, World! | 3,322.0 ns | 1.00 | 0.6752 | - |
<!--BENCHMARKS_END-->
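
To read the Ratio and Alloc Ratio columns: each value is, to a close approximation, the row's Mean (or Allocated) divided by the baseline row's value, where the baseline is the row showing 1.00. The sketch below redoes that arithmetic for the `[19866]` rows using numbers copied from the table. The `Ratio` helper is hypothetical and written for this illustration. BenchmarkDotNet derives its Ratio column from the distribution of measurements, so its figures can differ slightly from a plain division of means.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: value relative to the baseline, formatted to two
// decimals like the table's Ratio columns.
string Ratio(double value, double baseline)
    => (value / baseline).ToString("F2", CultureInfo.InvariantCulture);

// Values copied from the [19866] CountTokens rows of the table.
const double baselineMeanNs = 5_416_309.6; // SharpTokenV1_0_28_ (baseline)
const double tiktokenMeanNs = 413_599.2;   // Tiktoken_
const double baselineBytes = 3_805_771;    // SharpTokenV1_0_28_ Allocated
const double tiktokenBytes = 309_449;      // Tiktoken_ Allocated

Console.WriteLine(Ratio(tiktokenMeanNs, baselineMeanNs)); // 0.08
Console.WriteLine(Ratio(tiktokenBytes, baselineBytes));   // 0.08

// The inverse gives the speedup factor over the baseline.
Console.WriteLine((baselineMeanNs / tiktokenMeanNs)
    .ToString("F1", CultureInfo.InvariantCulture));       // 13.1
```

A ratio below 1.00 means the row is faster, or allocates less, than the baseline.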
Possible optimizations
- Modes: Fast (without special token regex) / Strict
- SIMD?
- Parallelism?
- string as dictionary key?
Support
Priority place for bugs: https://github.com/tryAGI/LangChain/issues
Priority place for ideas and general questions: https://github.com/tryAGI/LangChain/discussions
Discord: https://discord.gg/Ca2xhfBf3v