# EasyReasy.KnowledgeBase.BertTokenization

A BERT tokenization extension for EasyReasy.KnowledgeBase with FastBertTokenizer integration.
BERT-based tokenizer implementation for EasyReasy.KnowledgeBase. Provides accurate token counting and text processing using the FastBertTokenizer library with the BERT base uncased vocabulary.

## Installation
dotnet add package EasyReasy.KnowledgeBase.BertTokenization
using EasyReasy.KnowledgeBase.BertTokenization;
// Create tokenizer
BertTokenizer tokenizer = await BertTokenizer.CreateAsync();
// Count tokens
int tokenCount = tokenizer.CountTokens("Hello, world!");
// Encode text to tokens
int[] tokens = tokenizer.Encode("This is a test sentence.");
// Decode tokens back to text
string decoded = tokenizer.Decode(tokens);
Console.WriteLine($"Token count: {tokenCount}");
Console.WriteLine($"Tokens: [{string.Join(", ", tokens)}]");
Console.WriteLine($"Decoded: {decoded}");
using EasyReasy.KnowledgeBase.BertTokenization;
using EasyReasy.KnowledgeBase.Chunking;
// Create tokenizer for use with document processing
BertTokenizer tokenizer = await BertTokenizer.CreateAsync();
// Use with section reader factory
SectionReaderFactory factory = new SectionReaderFactory(embeddingService, tokenizer);
using Stream stream = File.OpenRead("document.md");
SectionReader reader = factory.CreateForMarkdown(stream, maxTokensPerChunk: 100, maxTokensPerSection: 1000);
await foreach (List<KnowledgeFileChunk> chunks in reader.ReadSectionsAsync())
{
// Process each batch of chunks with accurate token counts
}
// Configure maximum encoding tokens to prevent truncation
BertTokenizer tokenizer = await BertTokenizer.CreateAsync();
tokenizer.MaxEncodingTokens = 4096; // Default is 2048
// Count tokens for longer texts
int tokenCount = tokenizer.CountTokens("Very long document text...");

## API Reference

### Creation
static Task<BertTokenizer> CreateAsync()
static Task<BertTokenizer> CreateAsync(FastBertTokenizer.BertTokenizer tokenizer)

### Properties
- `MaxEncodingTokens`: Maximum tokens allowed during encoding (default: 2048)

### Methods
- `CountTokens(string text)`: Count tokens in text
- `Encode(string text)`: Encode text to a token array (subject to the `MaxEncodingTokens` limit)
- `Decode(int[] tokens)`: Decode tokens back to text

`BertTokenizer` implements the `ITokenizer` interface. <!-- NOTE(review): the garbled original mentions "ITokenizer)" — confirm the exact interface wording against the package source. -->

## License

MIT