84 packages tagged with “tokenizer”
Tokenizer extracts structured information from blocks of text and reflects it onto .NET objects
Fast and memory-efficient WordPiece tokenizer as used by BERT and other models. Tokenizes text for further processing by NLP/language models.
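WordPiece segmentation, as used by BERT-style models, is greedy longest-match-first: repeatedly take the longest vocabulary entry that prefixes the remaining word, marking non-initial pieces with `##`. The sketch below illustrates the idea with a toy vocabulary; it is not this package's API, and the vocabulary and class name are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

class WordPieceDemo
{
    // Greedy longest-match-first WordPiece segmentation (illustrative sketch;
    // the toy vocabulary below is NOT BERT's real vocabulary).
    public static List<string> Tokenize(string word, HashSet<string> vocab)
    {
        var tokens = new List<string>();
        int start = 0;
        while (start < word.Length)
        {
            string match = null;
            // Try the longest remaining substring first, then shrink.
            for (int end = word.Length; end > start; end--)
            {
                string piece = word.Substring(start, end - start);
                if (start > 0) piece = "##" + piece;   // continuation marker
                if (vocab.Contains(piece))
                {
                    match = piece;
                    start = end;
                    break;
                }
            }
            if (match == null) return new List<string> { "[UNK]" };
            tokens.Add(match);
        }
        return tokens;
    }

    static void Main()
    {
        var vocab = new HashSet<string> { "token", "##izer", "##s" };
        Console.WriteLine(string.Join(" ", Tokenize("tokenizers", vocab)));
        // token ##izer ##s
    }
}
```

Real implementations also pre-split on whitespace/punctuation and cap the word length before applying this loop.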
C# expression parser and evaluator, inspired by the jokenizer project.
Provides tokenizers for counting content tokens in text and embeddings
OpenAI GPT utilities, e.g. a GPT-3 tokenizer
This package contains tokenizers for the following models: · BERT Base · BERT Large · BERT German · BERT Multilingual · BERT Base Uncased · BERT Large Uncased
Unofficial BPE tokenizer implementation for Anthropic Claude
The Apache OpenNLP library is a machine learning-based toolkit for processing natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks are usually required to build more advanced text processing services. OpenNLP also includes maximum entropy and perceptron-based machine learning.
A shared package used by Loretta. Do not install this package manually; it will be added as a prerequisite by other packages that require it.
A GLua/Lua lexer, parser, code analysis, transformation and generation library.
Lexi: A regular expression based lexer for dotnet.
.NET wrapper of HuggingFace Tokenizers library
A .NET class library that makes it easier to parse text. The library tracks the current position within the text, ensures your code never attempts to access a character at an invalid index, and includes many methods that make parsing easier. The library makes your text-parsing code more concise and more robust. Includes support for regular expressions.
Native (Rust) wrapper of the HuggingFace Tokenizers library.
Bindings for the Rust huggingface/tokenizers library.
Unofficial BPE tokenizer implementation for OpenAI chat completion models (GPT-3.5/GPT-4)
Experimental code that might become part of Loretta.CodeAnalysis.Lua.
Tokenizer (Xamarin) for Conekta. You need a server-side library to use the token.
.NET Standard 2.1 library that produces embeddings using a C# BERT tokenizer and the ONNX all-MiniLM-L6-v2 model.
.NET wrapper for the NLTK Python library
VBF.Compilers.Scanners is a scanner builder. It contains a regular-expression-to-DFA engine and can generate high-performance scanners for Unicode source text.
Trl.PegParser contains a tokenizer and a parser. The tokenizer uses regular expressions to define tokens and exposes both matched and unmatched character ranges. The PEG parser uses parsing expression grammars with tokens produced by the tokenizer. Trl.PegParser is built on .NET Standard 2.1 for cross-platform compatibility.
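A regex-driven tokenizer that reports both matched tokens and the unmatched gaps between them can be sketched in plain .NET using named capture groups. This is a generic illustration of the technique, not Trl.PegParser's actual API; the token names and patterns are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class RegexTokenizerDemo
{
    // Combines token definitions into one alternation of named groups, then
    // walks the matches, yielding UNMATCHED ranges for the text in between.
    public static IEnumerable<(string Kind, string Text)> Tokenize(
        string input, IReadOnlyDictionary<string, string> tokenDefs)
    {
        var parts = new List<string>();
        foreach (var def in tokenDefs)
            parts.Add($"(?<{def.Key}>{def.Value})");
        var regex = new Regex(string.Join("|", parts));

        int pos = 0;
        foreach (Match m in regex.Matches(input))
        {
            if (m.Index > pos)   // gap before this match is unmatched text
                yield return ("UNMATCHED", input.Substring(pos, m.Index - pos));
            foreach (var def in tokenDefs)
                if (m.Groups[def.Key].Success)
                {
                    yield return (def.Key, m.Value);
                    break;
                }
            pos = m.Index + m.Length;
        }
        if (pos < input.Length)  // trailing unmatched text
            yield return ("UNMATCHED", input.Substring(pos));
    }

    static void Main()
    {
        var defs = new Dictionary<string, string>
        {
            ["NUMBER"] = @"\d+",
            ["IDENT"]  = @"[A-Za-z_]\w*",
        };
        foreach (var (kind, text) in Tokenize("x1 = 42 + y", defs))
            Console.WriteLine($"{kind}: '{text}'");
    }
}
```

Exposing the unmatched ranges (rather than silently skipping them) lets a caller treat them as errors or pass them through verbatim, which is the behavior the package description highlights.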
The Stringe is a wrapper for the .NET String object that tracks line, column, offset, and other metadata for substrings.
NLQuery: a natural language query parser that recognizes entities in the context of structured sources (such as a tabular dataset). Can be used for building a natural language interface to a SQL database or OLAP cube, or for implementing custom app-specific search. Usage examples: https://www.nrecosite.com/nlp_ner_net.aspx Online demo: http://nlquery.nrecosite.com/
SentencePieceTokenizer is a wrapper around the Google SentencePiece tokenizer. It is used to tokenize text for language models and other NLP tasks.
An HTML preprocessor that allows you to write HTML code using a beautiful syntax
Model tokenizer SDK; requires the modeltokenizer Docker image