Package Description
BpeTokenizer is a C# implementation of tiktoken, the tokenizer library written by OpenAI. It is a byte pair encoding (BPE) tokenizer that can be used to tokenize text into subword units.
This library is built for x64 architectures.
Because BpeTokenizer is derived from tiktoken, it can also be used as a token counter. This is useful when streaming tokens from the OpenAI API for GPT Chat Completions: counting tokens lets you keep track of the cost incurred by the software calling the API.
To install BpeTokenizer, run the following command in the Package Manager Console:
Install-Package BpeTokenizer
If you'd prefer to use the .NET CLI, run this command instead:
dotnet add package BpeTokenizer
To use BpeTokenizer, import the namespace:
using BpeTokenizer;
Then create an encoder by its model or encoding name:
// By its encoding name:
var encoder = await BytePairEncodingRegistry.GetEncodingAsync("cl100k_base");
// By its model:
var encoder = await BytePairEncodingModels.EncodingForModelAsync("gpt-4");
Both variants are async, so you can await them: each call either downloads the model from a remote server or loads it from the local cache.
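The snippets in this README use `await` directly, which works with top-level statements. In a project with an explicit entry point, the same call can sit inside an `async Task Main`. The following is a minimal sketch, assuming the package is installed and using only calls shown in this README (`GetEncodingAsync` and `CountTokens`). The sample strings are illustrative.

```csharp
using System;
using System.Threading.Tasks;
using BpeTokenizer;

public static class Program
{
    public static async Task Main()
    {
        // Load the encoding once at startup; the first call may need
        // network access, later calls can be served from the local cache.
        var encoder = await BytePairEncodingRegistry.GetEncodingAsync("cl100k_base");

        // Reuse the same encoder instance for every piece of text.
        foreach (var text in new[] { "Hello BPE world!", "Hello again!" })
        {
            Console.WriteLine($"{text} -> {encoder.CountTokens(text)} tokens");
        }
    }
}
```

Loading the encoder once and passing it around avoids repeating the download or cache lookup for every call.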
Once you have an encoding, you can encode your text:
var tokens = encoder.Encode("Hello BPE world!"); // Results in: [9906, 426, 1777, 1917, 0]
To decode a stream of tokens, you can use the following:
var text = encoder.Decode(tokens); // Results in: "Hello BPE world!"
BpeTokenizer supports several encodings. You can use these encoding names when creating an encoder:
var cl100kBaseEncoder = await BytePairEncodingRegistry.GetEncodingAsync("cl100k_base");
var p50kEditEncoder = await BytePairEncodingRegistry.GetEncodingAsync("p50k_edit");
var p50kBaseEncoder = await BytePairEncodingRegistry.GetEncodingAsync("p50k_base");
var r50kBaseEncoder = await BytePairEncodingRegistry.GetEncodingAsync("r50k_base");
var gpt2Encoder = await BytePairEncodingRegistry.GetEncodingAsync("gpt2");
The supported models are taken from the tiktoken source. You can use these model names when creating an encoder (list not exhaustive):
var gpt4Encoder = await BytePairEncodingModels.EncodingForModelAsync("gpt-4");
var textDavinci003Encoder = await BytePairEncodingModels.EncodingForModelAsync("text-davinci-003");
var textDavinci001Encoder = await BytePairEncodingModels.EncodingForModelAsync("text-davinci-001");
var codeDavinci002Encoder = await BytePairEncodingModels.EncodingForModelAsync("code-davinci-002");
var textDavinciEdit001Encoder = await BytePairEncodingModels.EncodingForModelAsync("text-davinci-edit-001");
var textEmbeddingAda002Encoder = await BytePairEncodingModels.EncodingForModelAsync("text-embedding-ada-002");
var textSimilarityDavinci001Encoder = await BytePairEncodingModels.EncodingForModelAsync("text-similarity-davinci-001");
var gpt2Encoder = await BytePairEncodingModels.EncodingForModelAsync("gpt2");
Several of the older models are being deprecated at the start of 2024.
To count tokens in a given string, you can use the following:
var tokenCount = encoder.CountTokens("Hello BPE world!"); //Results in: 5
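To connect token counting to cost tracking, the counts can be accumulated and multiplied by a price. The sketch below is one possible approach, not part of BpeTokenizer: `TokenCostTracker` is a hypothetical helper, the price per 1,000 tokens is a placeholder you would replace with your model's actual rate, and the counting function is passed in as a delegate so the class itself has no dependency on the library.

```csharp
using System;

// Hypothetical helper, not part of BpeTokenizer: accumulates token counts
// and converts the running total into an estimated cost.
public sealed class TokenCostTracker
{
    private readonly Func<string, int> _countTokens;
    private readonly decimal _pricePerThousandTokens;

    public TokenCostTracker(Func<string, int> countTokens, decimal pricePerThousandTokens)
    {
        _countTokens = countTokens ?? throw new ArgumentNullException(nameof(countTokens));
        _pricePerThousandTokens = pricePerThousandTokens;
    }

    // Total number of tokens counted so far.
    public long TotalTokens { get; private set; }

    // Estimated cost of all text added so far, in the currency of the price.
    public decimal EstimatedCost => TotalTokens * _pricePerThousandTokens / 1000m;

    // Call once per message (or per streamed chunk) to add its token count.
    public void Add(string text) => TotalTokens += _countTokens(text);
}

// Usage with BpeTokenizer (assumes CountTokens returns an int; 0.03m is a placeholder price):
// var tracker = new TokenCostTracker(text => encoder.CountTokens(text), 0.03m);
// tracker.Add("Hello BPE world!");
// Console.WriteLine(tracker.EstimatedCost);
```

Note that tokenizing streamed chunks separately and summing the counts can differ slightly from tokenizing the full text at once, because byte pair merges do not cross chunk boundaries. For an exact figure, count the complete message once the stream has finished.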