A smart C# text splitting library that intelligently chunks text while preserving semantic boundaries. Uses a hierarchical approach with configurable overlap and detailed metadata.
RecursiveTextSplitter is a C# library that provides intelligent text splitting with semantic awareness. Unlike simple character-based splitting, it tries to preserve meaningful boundaries by working through a hierarchy of separators, from paragraph breaks down to character-level splitting as a last resort.
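To make the hierarchical idea concrete, here is a minimal, self-contained sketch. It is not the library's implementation: the class name HierarchicalSplitSketch, the four-entry separator list, and the greedy merge step are all assumptions chosen for illustration, and separators that fall between two chunks are simply dropped.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; this is NOT the RecursiveTextSplitter source code.
public static class HierarchicalSplitSketch
{
    // Hypothetical separator order, from coarse to fine.
    private static readonly string[] Separators = { "\n\n", "\n", ". ", " " };

    public static List<string> Split(string text, int chunkSize, int level = 0)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));

        var result = new List<string>();
        if (text.Length == 0) return result;

        // Small enough already: keep the text as a single chunk.
        if (text.Length <= chunkSize)
        {
            result.Add(text);
            return result;
        }

        // Last resort: no separators left, so cut at fixed character offsets.
        if (level >= Separators.Length)
        {
            for (int i = 0; i < text.Length; i += chunkSize)
                result.Add(text.Substring(i, Math.Min(chunkSize, text.Length - i)));
            return result;
        }

        string separator = Separators[level];
        string[] pieces = text.Split(new[] { separator }, StringSplitOptions.None);

        // Greedily merge neighbouring pieces while the merged text still fits.
        string current = "";
        foreach (string piece in pieces)
        {
            string candidate = current.Length == 0 ? piece : current + separator + piece;
            if (candidate.Length <= chunkSize)
            {
                current = candidate;
                continue;
            }

            if (current.Length > 0) result.Add(current);

            if (piece.Length <= chunkSize)
            {
                current = piece;
            }
            else
            {
                // This piece alone is too large: retry it with the next, finer separator.
                result.AddRange(Split(piece, chunkSize, level + 1));
                current = "";
            }
        }

        if (current.Length > 0) result.Add(current);
        return result;
    }
}
```

With this sketch, HierarchicalSplitSketch.Split("one two three four", 9) yields "one two", "three", and "four", while a string with no separators at all, such as "abcdefghij" with a chunk size of 4, falls through to the character-level cut and yields "abcd", "efgh", and "ij".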
Install the RecursiveTextSplitter package from NuGet:
dotnet add package RecursiveTextSplitter
Or via Package Manager Console in Visual Studio:
Install-Package RecursiveTextSplitter
Or search for "RecursiveTextSplitter" in the Visual Studio NuGet Package Manager UI.
NuGet Package: https://www.nuget.org/packages/RecursiveTextSplitter/
Add the namespace to your C# project:
using RecursiveTextSplitting;
The most straightforward way to split text is using the RecursiveSplit extension method:
string document = "Artificial intelligence is transforming every industry. From healthcare to finance, automation is becoming smarter and more adaptive. However, challenges like bias, interpretability, and safety remain important areas of research.";
var chunks = document.RecursiveSplit(chunkSize: 80, chunkOverlap: 0);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk: {chunk}");
    Console.WriteLine("---");
}
For more detailed information about each chunk, use the AdvancedRecursiveSplit method:
string document = "Artificial intelligence is transforming every industry. From healthcare to finance, automation is becoming smarter and more adaptive. However, challenges like bias, interpretability, and safety remain important areas of research.";
var chunks = document.AdvancedRecursiveSplit(chunkSize: 80, chunkOverlap: 0);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.ChunkIndex}: {chunk.Text}");
    Console.WriteLine($"Start Position: {chunk.StartPosition}");
    Console.WriteLine($"End Position: {chunk.EndPosition}");
    Console.WriteLine($"Separator Used: {chunk.SeparatorUsed}");
    Console.WriteLine("---");
}
Overlap allows consecutive chunks to share some content, which helps maintain context across chunk boundaries in applications such as search indexing or machine learning.
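As a rough illustration of what overlap means, the following self-contained sketch cuts a string into fixed-size windows in which each window repeats the tail of the previous one. The class OverlapSketch and the method FixedSizeChunks are hypothetical names, and the sketch ignores semantic boundaries entirely. It also counts the overlap toward chunkSize, which is an assumption and may differ from how the library measures chunk size.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; this is NOT how RecursiveTextSplitter is implemented.
public static class OverlapSketch
{
    public static List<string> FixedSizeChunks(string text, int chunkSize, int chunkOverlap)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (chunkOverlap < 0 || chunkOverlap >= chunkSize)
            throw new ArgumentOutOfRangeException(nameof(chunkOverlap));

        var chunks = new List<string>();

        // Each new window starts chunkOverlap characters before the previous one ended.
        int step = chunkSize - chunkOverlap;

        for (int start = 0; start < text.Length; start += step)
        {
            int length = Math.Min(chunkSize, text.Length - start);
            chunks.Add(text.Substring(start, length));

            // Stop once a window has reached the end of the text.
            if (start + length >= text.Length) break;
        }

        return chunks;
    }
}
```

For instance, OverlapSketch.FixedSizeChunks("abcdefghij", 4, 2) yields "abcd", "cdef", "efgh", and "ghij": every chunk after the first begins with the last two characters of its predecessor.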
string document = "Artificial intelligence is transforming every industry. From healthcare to finance, automation is becoming smarter and more adaptive. However, challenges like bias, interpretability, and safety remain important areas of research.";
// Split with 25 characters of overlap
var chunks = document.RecursiveSplit(chunkSize: 80, chunkOverlap: 25);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk: {chunk}");
    Console.WriteLine("---");
}
AdvancedRecursiveSplit accepts the same chunkOverlap argument and reports the overlap portion separately through OverlapText and ChunkText:
string document = "Artificial intelligence is transforming every industry. From healthcare to finance, automation is becoming smarter and more adaptive. However, challenges like bias, interpretability, and safety remain important areas of research.";
var chunks = document.AdvancedRecursiveSplit(chunkSize: 80, chunkOverlap: 25);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.ChunkIndex}:");
    Console.WriteLine($" Full Text: {chunk.Text}");
    Console.WriteLine($" Overlap: '{chunk.OverlapText}'");
    Console.WriteLine($" Original Content: '{chunk.ChunkText}'");
    Console.WriteLine($" Position: {chunk.StartPosition}-{chunk.EndPosition}");
    Console.WriteLine("---");
}
The TextChunk class provides metadata about each split segment:
public class TextChunk
{
    public string Text { get; set; }          // Complete text including overlap
    public string OverlapText { get; set; }   // Only the overlap portion
    public string ChunkText { get; set; }     // Original chunk without overlap
    public int StartPosition { get; set; }    // Start position in original text
    public int EndPosition { get; set; }      // End position in original text
    public string SeparatorUsed { get; set; } // Separator that created this chunk
    public int ChunkIndex { get; set; }       // Sequential chunk number
}
The library uses a hierarchical approach to splitting, trying larger semantic units first:
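1. "\n\n" - Paragraph breaks, the largest semantic units
2. ".\n", "!\n", "?\n" - Sentence-ending punctuation followed by a line break
3. ":\n", ";\n" - Colon or semicolon followed by a line break
4. "\n" - Line breaks
5. ". ", "! ", "? " - Sentence-ending punctuation followed by a space
6. "; ", ", " - Semicolon or comma followed by a space
7. " " - Single spaces
Character-level splitting is the last resort.
We welcome contributions to make RecursiveTextSplitter even better! Here are some ways you can help: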
Your star helps others discover this library and motivates continued development.
We're open to pull requests! Whether you want to:
Please feel free to fork the repository and submit a pull request. For larger changes, consider opening an issue first to discuss your approach.
Found a bug or have a suggestion? Please open an issue with: