The Microsoft.ML.Tokenizers.Data.P50kBase package includes the Tiktoken tokenizer data file p50k_base.tiktoken, which is used by models such as text-davinci-002.
$ dotnet add package Microsoft.ML.Tokenizers.Data.P50kBase
This package contains the p50k_base.tiktoken file, which is used by the Tiktoken tokenizer. This data file is used by the following models:
1. text-davinci-002
2. text-davinci-003
3. code-davinci-001
4. code-davinci-002
5. code-cushman-001
6. code-cushman-002
7. davinci-codex
8. cushman-codex
Reference this package in your project to use the Tiktoken tokenizer with the specified models.
using Microsoft.ML.Tokenizers;

// Create a tokenizer for the specified model or any other listed model name
Tokenizer modelTokenizer = TiktokenTokenizer.CreateForModel("text-davinci-002");
// Create a tokenizer for the specified encoding
Tokenizer encodingTokenizer = TiktokenTokenizer.CreateForEncoding("p50k_base");
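Once a tokenizer has been created, it can be used to encode, count, and decode tokens. The following is a minimal sketch rather than an official sample. It assumes the project references both Microsoft.ML.Tokenizers and this data package, and that the installed Microsoft.ML.Tokenizers version exposes EncodeToIds, CountTokens, and Decode on the Tokenizer type. The sample text is arbitrary.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Assumption: Microsoft.ML.Tokenizers and Microsoft.ML.Tokenizers.Data.P50kBase
// are both referenced by the project, so the p50k_base data can be loaded.
Tokenizer tokenizer = TiktokenTokenizer.CreateForEncoding("p50k_base");

// Arbitrary sample input chosen for illustration.
string text = "Hello, tokenizer!";

// Encode the text into p50k_base token IDs.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);

// Count the tokens directly, for example to check a prompt against a model limit.
int count = tokenizer.CountTokens(text);

// Decode the IDs back into text.
string roundTrip = tokenizer.Decode(ids);

Console.WriteLine($"Token count: {count}");
Console.WriteLine($"Token IDs: {string.Join(", ", ids)}");
Console.WriteLine($"Round trip: {roundTrip}");
```

CountTokens is the more convenient call when only the length matters, since the caller does not need to hold the list of IDs.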
Users shouldn't use any types exposed by this package directly. This package is intended to provide tokenizer data files.
Related package: Microsoft.ML.Tokenizers
Microsoft.ML.Tokenizers.Data.P50kBase is released as open source under the MIT license. Bug reports and contributions are welcome at the GitHub repository.