High-performance WordPiece/BERT tokenizer in C# (port of FlashTokenizer)
$ dotnet add package FlashTokenizer
FlashTokenizer is a high-performance, production-ready tokenization library for .NET 8 applications. It provides blazing-fast implementations of popular tokenization algorithms including BERT WordPiece and GPT-2 style BPE (Byte Pair Encoding).
Tokenization is the process of breaking text down into smaller units (tokens) that machine learning models can understand. Depending on the algorithm, these tokens can be:
- Whole words
- Subwords (e.g., WordPiece pieces such as `token` + `##izer`)
- Individual characters or bytes
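To make the subword case concrete, here is a minimal, self-contained sketch of greedy longest-match WordPiece tokenization over a toy vocabulary (an illustration of the algorithm, not this library's implementation):

```csharp
using System;
using System.Collections.Generic;

static class WordPieceSketch
{
    // Greedy longest-match WordPiece: at each position, take the longest
    // vocabulary entry; word-internal pieces carry a "##" prefix.
    public static List<string> Tokenize(string word, HashSet<string> vocab)
    {
        var pieces = new List<string>();
        int start = 0;
        while (start < word.Length)
        {
            string match = null;
            int end;
            for (end = word.Length; end > start; end--)
            {
                string piece = word.Substring(start, end - start);
                if (start > 0) piece = "##" + piece; // mark word-internal pieces
                if (vocab.Contains(piece)) { match = piece; break; }
            }
            // Standard WordPiece maps the whole word to [UNK] when any part fails.
            if (match == null) return new List<string> { "[UNK]" };
            pieces.Add(match);
            start = end;
        }
        return pieces;
    }
}
```

With a vocabulary containing `token` and `##izer`, `Tokenize("tokenizer", vocab)` yields `["token", "##izer"]`.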
- **Performance**: Up to 12.7M tokens/sec throughput
- **Flexible**: 8 different tokenizer classes for various use cases
- **Optimized**: SIMD acceleration, parallel processing, async streaming
- **Production-Ready**: Memory-efficient, well-tested, comprehensive documentation
- **Multi-Language**: Supports Chinese and multilingual text processing
- **Easy Integration**: Simple NuGet package, clean APIs
dotnet add package FlashTokenizer
Or via Package Manager Console in Visual Studio:
Install-Package FlashTokenizer
using FlashTokenizer;
// Simple string tokenization
var tokenizer = new Tokenizer();
List<string> tokens = tokenizer.Tokenize("Hello, world!");
// BERT WordPiece tokenization
var bertTokenizer = new FlashBertTokenizerOptimized("vocab.txt");
List<int> ids = bertTokenizer.Encode("Hello, world!");
### Tokenizer - Simple String Tokenization

Basic text preprocessing that returns string tokens.
var tokenizer = new Tokenizer(
doLowerCase: true, // Convert to lowercase
tokenizeChineseChars: true // Add spaces around CJK characters
);
List<string> tokens = tokenizer.Tokenize("Hello, 世界!");
// Output: ["hello", ",", "世", "界", "!"]
Use cases:
- Lightweight preprocessing when you only need string tokens
- Whitespace/punctuation splitting without a vocabulary file
### FlashBertTokenizer - Standard BERT WordPiece

Basic BERT tokenizer with the WordPiece algorithm.
var tokenizer = new FlashBertTokenizer(
vocabFile: "path/to/vocab.txt",
doLowerCase: true,
modelMaxLength: 512, // Standard BERT length
tokenizeChineseChars: true
);
// Encode text to token IDs
List<int> ids = tokenizer.Encode("Hello, world!");
// Decode back to text
string text = tokenizer.Decode(ids);
// With explicit parameters
List<int> ids2 = tokenizer.Encode(
text: "Hello, world!",
padding: "max_length", // "max_length" or "longest"
maxLength: 512
);
Use cases:
- Standard BERT pipelines with the usual 512-token limit
- Small inputs where simplicity matters more than raw speed
### FlashBertTokenizerOptimized - High-Performance BERT

Optimized version with better performance for production use.
var tokenizer = new FlashBertTokenizerOptimized(
vocabFile: "vocab.txt",
doLowerCase: true,
modelMaxLength: -1, // -1 = unlimited length
tokenizeChineseChars: true
);
// For large documents, use unlimited length
List<int> ids = tokenizer.Encode(
text: largeDocument,
padding: "longest", // No padding for large docs
maxLength: -1 // Unlimited
);
Performance tips:
// Warmup for consistent benchmarking
GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect();
var warmupIds = tokenizer.Encode(text.Substring(0, Math.Min(1000, text.Length)));
// Actual tokenization
GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect();
var stopwatch = Stopwatch.StartNew();
var ids = tokenizer.Encode(text, "longest", -1);
stopwatch.Stop();
Use cases:
- Production single-threaded workloads
- Medium-sized documents (roughly 1KB-100KB, per the selection table below)
### FlashBertTokenizerParallel - Multi-threaded BERT

Parallel processing for very large documents.
var tokenizer = new FlashBertTokenizerParallel(
vocabFile: "vocab.txt",
doLowerCase: true,
modelMaxLength: -1,
tokenizeChineseChars: true,
maxDegreeOfParallelism: Environment.ProcessorCount, // Use all CPU cores
chunkSize: 256 * 1024 // 256KB chunks
);
List<int> ids = tokenizer.Encode(veryLargeDocument);
// Don't forget to dispose
tokenizer.Dispose();
Configuration:
- `maxDegreeOfParallelism`: Number of threads (default: CPU cores)
- `chunkSize`: Size of text chunks in bytes (default: 256KB)

Use cases:
- Very large documents (roughly 100KB-10MB) where multi-threading pays off
- Batch jobs that can saturate multiple CPU cores
### AsyncTokenizerPipeline - Async File Processing

Asynchronous streaming tokenization for files.
using var pipeline = new AsyncTokenizerPipeline(
vocabFile: "vocab.txt",
doLowerCase: true,
modelMaxLength: -1,
tokenizeChineseChars: true,
maxDegreeOfParallelism: Environment.ProcessorCount,
chunkSize: 128 * 1024, // 128KB chunks
bufferSize: 1024 * 1024 // 1MB buffer
);
// Process file directly
List<int> ids = await pipeline.ProcessFileAsync("large_file.txt");
// Process text asynchronously
List<int> ids2 = await pipeline.ProcessTextAsync(largeText);
Use cases:
- Streaming very large files (>10MB) without loading them fully into memory
- I/O-bound pipelines that benefit from async processing
### FlashBertTokenizerBidirectional - Robust Fallback

Uses a bidirectional heuristic for improved quality.
var tokenizer = new FlashBertTokenizerBidirectional(
vocabFile: "vocab.txt",
doLowerCase: true,
modelMaxLength: -1,
tokenizeChineseChars: true
);
List<int> ids = tokenizer.Encode(
text: complexText,
padding: "longest",
maxLength: -1
);
How it works:
It runs the WordPiece match in both directions and keeps the better-scoring segmentation (for example, the one with fewer unknown tokens), trading some speed for quality.
Use cases:
- Noisy or unusual text where forward-only greedy matching splits poorly
- Quality-first workloads of any size
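The bidirectional idea can be sketched as follows. This is an illustrative assumption about the heuristic (toy vocabulary, not this library's actual implementation): run greedy matching forward and backward, then keep the segmentation with fewer `[UNK]` tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class BidirectionalSketch
{
    // Forward greedy longest-match; emits "[UNK]" per unmatched character.
    static List<string> Forward(string w, HashSet<string> vocab)
    {
        var outp = new List<string>();
        for (int start = 0; start < w.Length; )
        {
            int end;
            for (end = w.Length; end > start; end--)
            {
                string p = (start > 0 ? "##" : "") + w.Substring(start, end - start);
                if (vocab.Contains(p)) { outp.Add(p); break; }
            }
            if (end == start) { outp.Add("[UNK]"); start++; } else start = end;
        }
        return outp;
    }

    // Backward greedy: match the longest piece ending at the current position.
    static List<string> Backward(string w, HashSet<string> vocab)
    {
        var outp = new List<string>();
        for (int end = w.Length; end > 0; )
        {
            int start;
            for (start = 0; start < end; start++)
            {
                string p = (start > 0 ? "##" : "") + w.Substring(start, end - start);
                if (vocab.Contains(p)) { outp.Add(p); break; }
            }
            if (start == end) { outp.Add("[UNK]"); end--; } else end = start;
        }
        outp.Reverse();
        return outp;
    }

    // Keep whichever pass produced fewer [UNK]s (forward wins ties).
    public static List<string> Tokenize(string w, HashSet<string> vocab)
    {
        var f = Forward(w, vocab);
        var b = Backward(w, vocab);
        return f.Count(t => t == "[UNK]") <= b.Count(t => t == "[UNK]") ? f : b;
    }
}
```

For example, with the vocabulary `{ "una", "un", "##able" }`, the forward pass on `"unable"` greedily takes `una` and then cannot match the rest, while the backward pass recovers `["un", "##able"]` cleanly.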
### BpeTokenizer - GPT-2 Style BPE

Byte Pair Encoding for GPT-2 style models.
var tokenizer = new BpeTokenizer(
vocabJsonPath: "vocab.json",
mergesPath: "merges.txt"
);
List<int> ids = tokenizer.Encode("The quick brown fox jumps over the lazy dog");
string text = tokenizer.Decode(ids);
Use cases:
- GPT-2 style models and other byte-level BPE vocabularies
### FlashTokenizer - Unified Facade

High-level facade that auto-selects the appropriate tokenizer.
// BERT WordPiece
var bertTokenizer = new FlashTokenizer(new TokenizerOptions
{
VocabPath = "vocab.txt",
DoLowerCase = true,
ModelMaxLength = -1, // Unlimited
EnableBidirectional = false,
Type = TokenizerType.Bert
});
// BPE
var bpeTokenizer = new FlashTokenizer(new TokenizerOptions
{
Type = TokenizerType.BPE,
BpeVocabJsonPath = "vocab.json",
BpeMergesPath = "merges.txt"
});
// Enable bidirectional fallback
var robustTokenizer = new FlashTokenizer(new TokenizerOptions
{
VocabPath = "vocab.txt",
DoLowerCase = true,
ModelMaxLength = -1,
EnableBidirectional = true, // More robust
Type = TokenizerType.Bert
});
| Text Size | Recommended Class | Reason |
|---|---|---|
| < 1KB | Tokenizer, FlashBertTokenizer | Simple, low overhead |
| 1KB - 100KB | FlashBertTokenizerOptimized | Best single-thread performance |
| 100KB - 10MB | FlashBertTokenizerParallel | Multi-threading helps |
| > 10MB | AsyncTokenizerPipeline | Memory-efficient streaming |
| Any size + quality | FlashBertTokenizerBidirectional | Most robust |
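The table above can be expressed as a simple size-based selector. A sketch (it returns the recommended class name from the table; the thresholds are in bytes):

```csharp
using System;

static class TokenizerSelector
{
    // Recommend a tokenizer class name by input size, per the selection table.
    public static string Recommend(long textSizeBytes, bool qualityFirst = false)
    {
        if (qualityFirst) return "FlashBertTokenizerBidirectional";
        if (textSizeBytes < 1024) return "FlashBertTokenizer";
        if (textSizeBytes < 100 * 1024) return "FlashBertTokenizerOptimized";
        if (textSizeBytes < 10 * 1024 * 1024) return "FlashBertTokenizerParallel";
        return "AsyncTokenizerPipeline";
    }
}
```

For instance, `TokenizerSelector.Recommend(20L * 1024 * 1024)` returns `"AsyncTokenizerPipeline"`, matching the >10MB row.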
// ✅ Good - unlimited length
var tokenizer = new FlashBertTokenizerOptimized("vocab.txt", true, -1);
var ids = tokenizer.Encode(text, "longest", -1);
// ❌ Bad - causes early stopping
var tokenizer = new FlashBertTokenizerOptimized("vocab.txt", true, 512);
var ids = tokenizer.Encode(text); // Stops at 512 tokens
// For large documents (no padding needed)
var ids = tokenizer.Encode(text, "longest", -1);
// For fixed-size batches
var ids = tokenizer.Encode(text, "max_length", 512);
// Warmup JIT and GC
GC.Collect(); GC.WaitForPendingFinalizers(); GC.Collect();
var warmup = tokenizer.Encode("warmup text");
// Actual measurement
var sw = Stopwatch.StartNew();
var ids = tokenizer.Encode(actualText);
sw.Stop();
var tokenizer = new FlashBertTokenizerParallel(
"vocab.txt", true, -1, true,
Environment.ProcessorCount, // Match CPU cores
256 * 1024 // Tune chunk size for your data
);
// Dispose parallel tokenizers
using var tokenizer = new FlashBertTokenizerParallel(...);
// Or manually
var tokenizer = new FlashBertTokenizerParallel(...);
try
{
var ids = tokenizer.Encode(text);
}
finally
{
tokenizer.Dispose();
}
using FlashTokenizer;
class SimpleApp
{
private static readonly FlashBertTokenizerOptimized _tokenizer =
new("vocab.txt", true, -1);
public List<int> TokenizeText(string text)
{
return _tokenizer.Encode(text, "longest", -1);
}
}
public async Task<List<List<int>>> ProcessFiles(string[] filePaths)
{
using var pipeline = new AsyncTokenizerPipeline(
"vocab.txt", true, -1, true,
Environment.ProcessorCount, 128 * 1024, 1024 * 1024);
var results = new List<List<int>>();
foreach (var filePath in filePaths)
{
var ids = await pipeline.ProcessFileAsync(filePath);
results.Add(ids);
}
return results;
}
public class TokenizerFactory
{
public static ITokenizer Create(string configType, string vocabPath)
{
return configType.ToLowerInvariant() switch
{
"fast" => new FlashBertTokenizerOptimized(vocabPath, true, -1),
"parallel" => new FlashBertTokenizerParallel(vocabPath, true, -1, true,
Environment.ProcessorCount, 256 * 1024),
"robust" => new FlashBertTokenizerBidirectional(vocabPath, true, -1),
_ => new FlashBertTokenizer(vocabPath, true, -1)
};
}
}
public List<int> TokenizeWithFallback(string text)
{
// Try fast tokenizer first
var fastTokenizer = new FlashBertTokenizerOptimized("vocab.txt", true, -1);
var ids = fastTokenizer.Encode(text, "longest", -1);
// If result seems poor, use bidirectional
if (ShouldUseBidirectional(text, ids))
{
var robustTokenizer = new FlashBertTokenizerBidirectional("vocab.txt", true, -1);
ids = robustTokenizer.Encode(text, "longest", -1);
}
return ids;
}
// Problem: Early stopping due to max length
var ids = tokenizer.Encode(text); // Uses default max length
// Solution: Explicit unlimited
var ids = tokenizer.Encode(text, "longest", -1);
// Problem: Not disposing parallel tokenizers
var tokenizer = new FlashBertTokenizerParallel(...);
// Memory leak!
// Solution: Use using statement
using var tokenizer = new FlashBertTokenizerParallel(...);
Error NU1108: Cycle detected
FlashTokenizer -> FlashTokenizer (>= 1.0.1)
Solution: Rename your project to something other than "FlashTokenizer".
Expected performance on a 4MB file (~759K tokens):
| Tokenizer | Time | Throughput | Memory | Use Case |
|---|---|---|---|---|
| FlashBertTokenizer | ~200ms | ~3.8M tokens/sec | ~500MB | Standard |
| FlashBertTokenizerOptimized | ~110ms | ~6.9M tokens/sec | ~740MB | Recommended |
| FlashBertTokenizerParallel | ~60ms | ~12.7M tokens/sec | ~800MB | Large files |
| AsyncTokenizerPipeline | ~80ms | ~9.5M tokens/sec | ~600MB | File processing |
| FlashBertTokenizerBidirectional | ~150ms | ~5.1M tokens/sec | ~750MB | Quality-first |
Results may vary based on hardware and text complexity.
// For memory-constrained environments
var tokenizer = new FlashBertTokenizerParallel(
"vocab.txt", true, -1, true,
maxDegreeOfParallelism: 2, // Fewer threads
chunkSize: 64 * 1024 // Smaller chunks
);
// For high-memory systems
var tokenizer = new FlashBertTokenizerParallel(
"vocab.txt", true, -1, true,
maxDegreeOfParallelism: Environment.ProcessorCount * 2,
chunkSize: 1024 * 1024 // 1MB chunks
);
using var pipeline = new AsyncTokenizerPipeline(
"vocab.txt", true, -1, true,
Environment.ProcessorCount,
chunkSize: 256 * 1024,
bufferSize: 4 * 1024 * 1024 // 4MB buffer for large files
);
public void ConfigureServices(IServiceCollection services)
{
services.AddSingleton<ITokenizer>(provider =>
new FlashBertTokenizerOptimized("vocab.txt", true, -1));
}
[ApiController]
public class TokenizerController : ControllerBase
{
private readonly ITokenizer _tokenizer;
public TokenizerController(ITokenizer tokenizer)
{
_tokenizer = tokenizer;
}
[HttpPost("tokenize")]
public ActionResult<List<int>> Tokenize([FromBody] string text)
{
var ids = _tokenizer.Encode(text);
return Ok(ids);
}
}
using System;
using System.Diagnostics;
using FlashTokenizer;

class Program
{
static async Task Main(string[] args)
{
if (args.Length < 2)
{
Console.WriteLine("Usage: app <vocab_path> <input_file>");
return;
}
string vocabPath = args[0];
string inputFile = args[1];
using var pipeline = new AsyncTokenizerPipeline(
vocabPath, true, -1, true,
Environment.ProcessorCount, 128 * 1024, 1024 * 1024);
var stopwatch = Stopwatch.StartNew();
var ids = await pipeline.ProcessFileAsync(inputFile);
stopwatch.Stop();
Console.WriteLine($"Tokenized {ids.Count:N0} tokens in {stopwatch.Elapsed.TotalMilliseconds:F2}ms");
Console.WriteLine($"Throughput: {ids.Count / stopwatch.Elapsed.TotalSeconds:F0} tokens/sec");
}
}
We welcome contributions! Please see our contributing guidelines in the repository.
FlashTokenizer is released under the MIT License. See the LICENSE file in the repository for details.