The Microsoft.ML.Tokenizers.Data.Cl100kBase package includes the Tiktoken tokenizer data file cl100k_base.tiktoken, which is used by models such as GPT-4.
$ dotnet add package Microsoft.ML.Tokenizers.Data.Cl100kBase
Reference this package in your project to use the Tiktoken tokenizer with the specified models.
using Microsoft.ML.Tokenizers;

// Option 1: create a tokenizer from a model name ("gpt-4" or any other supported model name)
Tokenizer modelTokenizer = TiktokenTokenizer.CreateForModel("gpt-4");
// Option 2: create a tokenizer from the encoding name
Tokenizer encodingTokenizer = TiktokenTokenizer.CreateForEncoding("cl100k_base");
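Once a tokenizer has been created, it can be used to count tokens, encode text to token IDs, and decode the IDs back to text. The following is a minimal sketch, not an official sample: it assumes a console project that references both Microsoft.ML.Tokenizers and this data package, and the input string is an arbitrary example chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Assumes the Microsoft.ML.Tokenizers and Microsoft.ML.Tokenizers.Data.Cl100kBase
// packages are both referenced; the data package supplies cl100k_base.tiktoken.
Tokenizer tokenizer = TiktokenTokenizer.CreateForEncoding("cl100k_base");

// Arbitrary sample input, chosen only for illustration.
string text = "Hello, world!";

// Count tokens without keeping the IDs (useful for checking prompt length limits).
int count = tokenizer.CountTokens(text);

// Encode the text to token IDs, then decode the IDs back to a string.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
var roundTrip = tokenizer.Decode(ids);

Console.WriteLine($"Token count: {count}");
Console.WriteLine($"Token IDs: {string.Join(", ", ids)}");
Console.WriteLine($"Decoded: {roundTrip}");
```

Note that the sketch only touches types from Microsoft.ML.Tokenizers; nothing from the data package is referenced directly.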
Users shouldn't use any types exposed by this package directly. This package is intended to provide tokenizer data files.
Related package: Microsoft.ML.Tokenizers
Microsoft.ML.Tokenizers.Data.Cl100kBase is released as open source under the MIT license. Bug reports and contributions are welcome at the GitHub repository.