The Microsoft.ML.Tokenizers.Data.R50kBase includes the Tiktoken tokenizer data file r50k_base.tiktoken, which is utilized by models such as text-davinci-001
$ dotnet add package Microsoft.ML.Tokenizers.Data.R50kBaseThe Microsoft.ML.Tokenizers.Data.R50kBase includes the Tiktoken tokenizer data file r50k_base.tiktoken, which is utilized by models such as text-davinci-001.
r50k_base.tiktoken file, which is used by the Tiktoken tokenizer. This data file is used by the following models:
1. text-davinci-001
2. text-curie-001
3. text-babbage-001
4. text-ada-001
5. davinci
6. curie
7. babbage
8. ada
9. text-similarity-davinci-001
10. text-similarity-curie-001
11. text-similarity-babbage-001
12. text-similarity-ada-001
13. text-search-davinci-doc-001
14. text-search-curie-doc-001
15. text-search-babbage-doc-001
16. text-search-ada-doc-001
17. code-search-babbage-code-001
18. code-search-ada-code-001Reference this package in your project to use the Tiktoken tokenizer with the specified models.
// Create a tokenizer for the specified model or any other listed model name
Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("text-davinci-001");
// Create a tokenizer for the specified encoding
Tokenizer tokenizer = TiktokenTokenizer.CreateForEncoding("r50k_base");
Users shouldn't use any types exposed by this package directly. This package is intended to provide tokenizer data files.
Microsoft.ML.Tokenizers
Microsoft.ML.Tokenizers.Data.R50kBase is released as open source under the MIT license. Bug reports and contributions are welcome at the GitHub repository.