Text processing infrastructure for LMSupply - tokenization, vocabulary management, and encoding utilities
$ dotnet add package LMSupply.Text.Core
Core text processing infrastructure for LMSupply packages.
This package provides centralized tokenization and text processing utilities used by LMSupply packages that work with text data (Embedder, Reranker, Translator, etc.).
| Interface | Purpose | Use Case |
|---|---|---|
| ITextTokenizer | Basic encode/decode | General tokenization |
| ISequenceTokenizer | Single sequence with special tokens | Embeddings |
| IPairTokenizer | Sentence pair encoding | Rerankers, cross-encoders |
// Auto-detect and create appropriate tokenizer
var tokenizer = await TokenizerFactory.CreateAutoAsync(modelDir, maxLength: 512);
// Specific tokenizer types
var wordpiece = await TokenizerFactory.CreateWordPieceAsync(modelDir, maxLength);
var sentencepiece = await TokenizerFactory.CreateSentencePieceAsync(modelDir, maxLength);
// Auto-detect tokenizer type and create pair tokenizer (recommended)
var pairTokenizer = await TokenizerFactory.CreateAutoPairAsync(modelDir, maxLength: 512);
// Specific pair tokenizer types
var wordpiecePair = await TokenizerFactory.CreateWordPiecePairAsync(modelDir, maxLength);
var sentencepiecePair = await TokenizerFactory.CreateSentencePiecePairAsync(modelDir, maxLength);
| Type | Detection | Example Models |
|---|---|---|
| WordPiece | vocab.txt or tokenizer.json with type: WordPiece | BERT, MiniLM, BGE-v1 |
| Unigram | tokenizer.json with type: Unigram | bge-reranker-base, XLM-RoBERTa |
| BPE | vocab.json + merges.txt or tokenizer.json with type: BPE | GPT-2, RoBERTa |
| SentencePiece | .spm or .model files | mBART, translation models |
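The detection rules in the table above can be sketched as a small file-probing routine. This is only an illustration under stated assumptions: `TokenizerKind` and `TokenizerDetectionSketch` are hypothetical names, not part of LMSupply, and the precedence (tokenizer.json first, then vocab.txt, then vocab.json + merges.txt, then .spm/.model files) is a guess, since the table does not specify an order.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text.Json;

// Hypothetical types for illustration; LMSupply's internal detection may differ.
public enum TokenizerKind { WordPiece, Unigram, Bpe, SentencePiece, Unknown }

public static class TokenizerDetectionSketch
{
    public static TokenizerKind Detect(string modelDir)
    {
        // 1. Prefer tokenizer.json and read its model.type field, if present.
        var tokenizerJson = Path.Combine(modelDir, "tokenizer.json");
        if (File.Exists(tokenizerJson))
        {
            using var doc = JsonDocument.Parse(File.ReadAllText(tokenizerJson));
            var root = doc.RootElement;
            if (root.ValueKind == JsonValueKind.Object &&
                root.TryGetProperty("model", out var model) &&
                model.ValueKind == JsonValueKind.Object &&
                model.TryGetProperty("type", out var type) &&
                type.ValueKind == JsonValueKind.String)
            {
                switch (type.GetString())
                {
                    case "WordPiece": return TokenizerKind.WordPiece;
                    case "Unigram": return TokenizerKind.Unigram;
                    case "BPE": return TokenizerKind.Bpe;
                }
            }
        }

        // 2. Fall back to the vocabulary files listed in the table (assumed order).
        if (File.Exists(Path.Combine(modelDir, "vocab.txt")))
            return TokenizerKind.WordPiece;

        if (File.Exists(Path.Combine(modelDir, "vocab.json")) &&
            File.Exists(Path.Combine(modelDir, "merges.txt")))
            return TokenizerKind.Bpe;

        // 3. Raw SentencePiece model files.
        bool hasSpm = Directory.EnumerateFiles(modelDir).Any(f =>
            f.EndsWith(".spm", StringComparison.OrdinalIgnoreCase) ||
            f.EndsWith(".model", StringComparison.OrdinalIgnoreCase));

        return hasSpm ? TokenizerKind.SentencePiece : TokenizerKind.Unknown;
    }
}
```

A routine like this only reports which family a model directory belongs to. Constructing the tokenizer itself would still go through the `TokenizerFactory` methods shown earlier.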
The CreateAutoPairAsync method automatically detects the tokenizer type from the model.type field of tokenizer.json:
- WordPiece → WordPiece tokenizer
- Unigram or BPE → SentencePiece-compatible tokenizer

using LMSupply.Text;
// Create tokenizer
var tokenizer = await TokenizerFactory.CreateAutoAsync(modelPath);
// Encode text
var encoded = tokenizer.EncodeSequence("Hello, world!");
Console.WriteLine($"Tokens: {encoded.InputIds.Length}");
// Decode tokens
var decoded = tokenizer.Decode(encoded.InputIds, skipSpecialTokens: true);
using LMSupply.Text;
// Create pair tokenizer (auto-detects WordPiece/Unigram/BPE)
var pairTokenizer = await TokenizerFactory.CreateAutoPairAsync(modelPath, maxLength: 512);
// Encode query-document pair
var encoded = pairTokenizer.EncodePair(
    "What is machine learning?",
    "Machine learning is a branch of AI..."
);
// Batch encode for multiple documents
var batch = pairTokenizer.EncodePairBatch(
    "What is machine learning?",
    new[] { "Doc 1...", "Doc 2...", "Doc 3..." }
);
var tokenizer = await TokenizerFactory.CreateAutoAsync(modelPath);
var texts = new[] { "First text", "Second text", "Third text" };
var batch = tokenizer.EncodeBatch(texts, maxLength: 256);
// Access batch tensors
long[,] inputIds = batch.InputIds;
long[,] attentionMask = batch.AttentionMask;
Single encoded sequence with special tokens:
public record EncodedSequence(
    long[] InputIds,       // Token IDs with [CLS], [SEP]
    long[] AttentionMask,  // 1 for real tokens, 0 for padding
    int ActualLength       // Length before padding
);
Encoded sentence pair for cross-encoders:
public record EncodedPair(
    long[] InputIds,       // [CLS] text1 [SEP] text2 [SEP]
    long[] AttentionMask,  // 1 for real tokens, 0 for padding
    long[] TokenTypeIds,   // 0 for text1, 1 for text2
    int ActualLength       // Length before padding
);
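To make the field comments above concrete, here is a minimal sketch of how the `[CLS] text1 [SEP] text2 [SEP]` layout could be assembled from two already-tokenized segments. It is not LMSupply's implementation: `PairLayoutSketch` is a hypothetical helper, the IDs 101/102/0 are BERT-style placeholders, truncation is left out (the sketch throws instead), and assigning type 0 to `[CLS]` and the first `[SEP]` and type 1 to the final `[SEP]` follows the usual BERT convention, which is an assumption here.

```csharp
using System;

// Hypothetical helper for illustration; not part of LMSupply.
public static class PairLayoutSketch
{
    // Placeholder BERT-style special token IDs (assumed, model-dependent).
    private const long Cls = 101, Sep = 102, Pad = 0;

    public static (long[] InputIds, long[] AttentionMask, long[] TokenTypeIds, int ActualLength)
        Build(long[] text1Ids, long[] text2Ids, int maxLength)
    {
        // Three special tokens: [CLS], [SEP] after text1, [SEP] after text2.
        int actualLength = text1Ids.Length + text2Ids.Length + 3;
        if (actualLength > maxLength)
            throw new ArgumentException("Pair does not fit; truncation is not sketched here.");

        var inputIds = new long[maxLength];
        var attentionMask = new long[maxLength];
        var tokenTypeIds = new long[maxLength];
        Array.Fill(inputIds, Pad);

        int pos = 0;
        inputIds[pos++] = Cls;
        foreach (var id in text1Ids) inputIds[pos++] = id;
        inputIds[pos++] = Sep;

        int secondStart = pos; // first position belonging to text2
        foreach (var id in text2Ids) inputIds[pos++] = id;
        inputIds[pos++] = Sep;

        for (int i = 0; i < actualLength; i++)
        {
            attentionMask[i] = 1;                        // real token
            tokenTypeIds[i] = i >= secondStart ? 1 : 0;  // segment index
        }

        // Positions from actualLength onward stay Pad / 0 / 0.
        return (inputIds, attentionMask, tokenTypeIds, actualLength);
    }
}
```

For example, `PairLayoutSketch.Build(new long[] { 7, 8 }, new long[] { 9 }, maxLength: 8)` yields input IDs `101, 7, 8, 102, 9, 102, 0, 0` with an actual length of 6.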
Batched versions for efficient inference:
public class EncodedBatch
{
    public long[,] InputIds { get; }
    public long[,] AttentionMask { get; }
    public int BatchSize { get; }
    public int SequenceLength { get; }
}
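The rectangular `long[,]` shape implies that shorter sequences are padded to a common length. A rough sketch of that stacking step is shown below. `BatchPaddingSketch` is a hypothetical helper, and padding to the longest sequence in the batch with a pad ID of 0 is an assumption; the library may instead pad to `maxLength` or use a model-specific pad ID.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of LMSupply.
public static class BatchPaddingSketch
{
    public static (long[,] InputIds, long[,] AttentionMask) Stack(
        IReadOnlyList<long[]> sequences, long padId = 0)
    {
        int batchSize = sequences.Count;

        // Assumed policy: pad every row to the longest sequence in the batch.
        int sequenceLength = 0;
        foreach (var s in sequences)
            sequenceLength = Math.Max(sequenceLength, s.Length);

        var inputIds = new long[batchSize, sequenceLength];
        var attentionMask = new long[batchSize, sequenceLength];

        for (int row = 0; row < batchSize; row++)
        {
            var current = sequences[row];
            for (int col = 0; col < sequenceLength; col++)
            {
                bool isReal = col < current.Length;
                inputIds[row, col] = isReal ? current[col] : padId;
                attentionMask[row, col] = isReal ? 1 : 0;
            }
        }

        return (inputIds, attentionMask);
    }
}
```

With this layout, `InputIds.GetLength(0)` corresponds to `BatchSize` and `InputIds.GetLength(1)` to `SequenceLength`.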
| Token | WordPiece | SentencePiece |
|---|---|---|
| Start | [CLS] | <s> / <bos> |
| Separator | [SEP] | </s> / <eos> |
| Padding | [PAD] | <pad> |
| Unknown | [UNK] | <unk> |
Each tokenizer inserts the special tokens appropriate to its type, so calling code does not need to handle these differences.
- CreateAutoPairAsync for automatic tokenizer type detection
- SentencePiecePairTokenizer for Unigram/BPE pair encoding
- CreateSentencePiecePairAsync for explicit SentencePiece pair tokenizers