Found 144 packages
Tokenizer extracts structured information from blocks of text and reflects them onto .NET objects
Tokenizer for OpenAI large language models.
Fast and memory-efficient WordPiece tokenizer as it is used by BERT and others. Tokenizes text for further processing using NLP/language models.
Anthropic Claude BPE Tokenizer unofficial implementation
MimeKit is an Open Source library for creating and parsing MIME, S/MIME and PGP messages on desktop and mobile platforms. It also supports parsing of Unix mbox files. Unlike any other .NET MIME parser, MimeKit's parser does not need to parse string input nor does it use a TextReader. Instead, it parses raw byte streams, thus allowing it to better support undeclared 8bit text in headers as well as message bodies. It also means that MimeKit's parser is significantly faster than other .NET MIME parsers. MimeKit's parser also uses a real tokenizer when parsing the headers rather than regex or string.Split() like most other .NET MIME parsers. This means that MimeKit is much more RFC-compliant than any other .NET MIME parser out there, including the commercial implementations. In addition to having a far superior parser implementation, MimeKit's object tree is not a derivative of System.Net.Mail objects and thus does not suffer from System.Net.Mail's limitations. API documentation can be found on the web at http://www.mimekit.net/docs. For those that need SMTP, POP3 or IMAP support, check out https://github.com/jstedfast/MailKit
ChatGPT Tokenizer and Token Estimator for .NET
Thai string tokenizer is a .NET library for tokenizing and substringing Thai-language text
OpenAI Chat Completion Models (GPT 3.5/GPT 4) BPE Tokenizer unofficial implementation
OpenAI GPT utils, e.g. GPT3 Tokenizer
Package Description
Provide tokenizers to allow counting content tokens for text and embeddings
C# Expression parser and evaluator, inspired by the jokenizer project.
Portable content pipeline library for game development. Part of TOE: Tiny Open Engine.
This package contains tokenizers for the following models: · BERT Base · BERT Large · BERT German · BERT Multilingual · BERT Base Uncased · BERT Large Uncased
The Microsoft.ML.Tokenizers.Data.O200kBase includes the Tiktoken tokenizer data file o200k_base.tiktoken, which is utilized by models such as gpt-4o.
Model tokenizer SDK, requires the modeltokenizer docker image
SentencePieceTokenizer is a wrapper around the Google SentencePiece tokenizer. Used to tokenize text for language models and other NLP tasks.
A strong tokenizer for your needs in C#
The fastest tokenizer for GPT-3.5 and GPT-4 inspired by Tiktoken.
LumTokenizer for BPE tokenizer