A C# text-splitting library that chunks text while preserving semantic boundaries. It uses a hierarchical separator approach with configurable overlap and detailed per-chunk metadata.
RecursiveTextSplitter is a C# library that provides text splitting with semantic awareness. Unlike simple character-based splitting, it attempts to preserve meaningful boundaries by segmenting text hierarchically, from paragraph breaks down to character-level splitting as a last resort.
Install the RecursiveTextSplitter package from NuGet:
dotnet add package RecursiveTextSplitter
Or via Package Manager Console in Visual Studio:
Install-Package RecursiveTextSplitter
Or search for "RecursiveTextSplitter" in the Visual Studio NuGet Package Manager UI.
NuGet Package: https://www.nuget.org/packages/RecursiveTextSplitter/
Add the namespace to your C# project:
using RecursiveTextSplitting;
The most straightforward way to split text is using the RecursiveSplit extension method:
string document = "Artificial intelligence is transforming every industry.\nFrom healthcare to finance, automation is becoming smarter and more adaptive.\n\nHowever, challenges like bias, interpretability, and safety remain important areas of research.";
var chunks = document.RecursiveSplit(chunkSize: 80, chunkOverlap: 0);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk: {chunk}");
    Console.WriteLine("---");
}
For more detailed information about each chunk, including line and column positions, use the AdvancedRecursiveSplit method:
string document = "Artificial intelligence is transforming every industry.\nFrom healthcare to finance, automation is becoming smarter and more adaptive.\n\nHowever, challenges like bias, interpretability, and safety remain important areas of research.";
var chunks = document.AdvancedRecursiveSplit(chunkSize: 80, chunkOverlap: 0);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.ChunkIndex}: {chunk.Text}");
    Console.WriteLine($"Start Position: {chunk.StartPosition} (Line {chunk.StartLine}, Column {chunk.StartColumn})");
    Console.WriteLine($"End Position: {chunk.EndPosition} (Line {chunk.EndLine}, Column {chunk.EndColumn})");
    Console.WriteLine($"Separator Used: {chunk.SeparatorUsed}");
    Console.WriteLine("---");
}
Overlap allows consecutive chunks to share some content, which is particularly useful for maintaining context in applications like search indexing or machine learning.
string document = "Artificial intelligence is transforming every industry.\nFrom healthcare to finance, automation is becoming smarter and more adaptive.\n\nHowever, challenges like bias, interpretability, and safety remain important areas of research.";
// Split with 25 characters of overlap
var chunks = document.RecursiveSplit(chunkSize: 80, chunkOverlap: 25);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk: {chunk}");
    Console.WriteLine("---");
}
The same split with AdvancedRecursiveSplit exposes the overlap portion of each chunk separately from its original content:
string document = "Artificial intelligence is transforming every industry.\nFrom healthcare to finance, automation is becoming smarter and more adaptive.\n\nHowever, challenges like bias, interpretability, and safety remain important areas of research.";
var chunks = document.AdvancedRecursiveSplit(chunkSize: 80, chunkOverlap: 25);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk {chunk.ChunkIndex}:");
    Console.WriteLine($" Full Text: {chunk.Text}");
    Console.WriteLine($" Overlap: '{chunk.OverlapText}'");
    Console.WriteLine($" Original Content: '{chunk.ChunkText}'");
    Console.WriteLine($" Position: {chunk.StartPosition}-{chunk.EndPosition}");
    Console.WriteLine($" Location: Lines {chunk.StartLine}-{chunk.EndLine}");
    Console.WriteLine("---");
}
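To make the relationship between the full text, the overlap, and the original content more concrete, here is a minimal sketch of one way overlap can be produced: each chunk after the first is prefixed with the trailing characters of the previous chunk. This is an illustration only, not the library's implementation. The `AddOverlap` helper is hypothetical, and so is the assumption that the full text is simply the overlap followed by the original chunk.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, NOT part of RecursiveTextSplitter.
// Assumption: the overlap is taken from the end of the previous chunk
// and prepended to the current one.
static List<(string Overlap, string Chunk, string Text)> AddOverlap(
    IReadOnlyList<string> chunks, int chunkOverlap)
{
    var result = new List<(string Overlap, string Chunk, string Text)>();
    for (int i = 0; i < chunks.Count; i++)
    {
        string overlap = string.Empty;
        if (i > 0 && chunkOverlap > 0)
        {
            string previous = chunks[i - 1];
            // Never take more characters than the previous chunk has.
            int take = Math.Min(chunkOverlap, previous.Length);
            overlap = previous.Substring(previous.Length - take);
        }
        result.Add((overlap, chunks[i], overlap + chunks[i]));
    }
    return result;
}

var pieces = new[] { "alpha beta ", "gamma delta ", "epsilon" };
var withOverlap = AddOverlap(pieces, chunkOverlap: 5);
foreach (var item in withOverlap)
{
    Console.WriteLine($"Overlap: '{item.Overlap}' | Chunk: '{item.Chunk}' | Text: '{item.Text}'");
}
```

In this sketch the first chunk has an empty overlap because nothing precedes it. Whether the library measures overlap in exactly this way is not confirmed here.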
The TextChunk class provides comprehensive metadata about each split segment:
public class TextChunk
{
    public string Text { get; set; }          // Complete text including overlap
    public string OverlapText { get; set; }   // Only the overlap portion
    public string ChunkText { get; set; }     // Original chunk without overlap
    public int StartPosition { get; set; }    // 1-based start position in original text
    public int EndPosition { get; set; }      // 1-based end position in original text
    public string SeparatorUsed { get; set; } // Separator that created this chunk
    public int ChunkIndex { get; set; }       // Sequential chunk number (1-based)
    public int StartColumn { get; set; }      // 1-based column where chunk starts
    public int StartLine { get; set; }        // 1-based line where chunk starts
    public int EndColumn { get; set; }        // 1-based column where chunk ends
    public int EndLine { get; set; }          // 1-based line where chunk ends
}
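As a sketch of how a 1-based character position can relate to 1-based line and column values like those above, the helper below converts a position into a line and column by counting `\n` characters. `ToLineColumn` is a hypothetical name, and treating `\n` as the only line terminator (so a `\r` counts as an ordinary column) is an assumption; the library may count differently.

```csharp
using System;

// Hypothetical helper, NOT part of RecursiveTextSplitter.
// Converts a 1-based character position into a 1-based (line, column) pair.
static (int Line, int Column) ToLineColumn(string text, int position)
{
    if (position < 1 || position > text.Length)
        throw new ArgumentOutOfRangeException(nameof(position));

    int line = 1;
    int column = 1;
    // Walk every character that comes before the target position.
    for (int i = 0; i < position - 1; i++)
    {
        if (text[i] == '\n')
        {
            line++;      // a newline starts the next line...
            column = 1;  // ...and resets the column
        }
        else
        {
            column++;
        }
    }
    return (line, column);
}

string sample = "first line\nsecond line";
var (foundLine, foundColumn) = ToLineColumn(sample, 12);
Console.WriteLine($"Line {foundLine}, Column {foundColumn}");
```

Position 12 in the sample is the first character after the newline, so the sketch reports it at the start of the second line.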
The library now provides detailed position tracking with both character-level and line/column coordinates:
- StartPosition and EndPosition provide 1-based character indices in the original text
- StartLine, StartColumn, EndLine, EndColumn provide 1-based line and column coordinates

You can provide your own separator hierarchy for specialized splitting needs:
string document = "Section 1|Subsection A;Item 1,Item 2|Section 2;Item 3";
// Custom separators prioritizing sections, then subsections, then items
string[] customSeparators = { "|", ";", "," };
var chunks = document.AdvancedRecursiveSplit(
    chunkSize: 20,
    chunkOverlap: 0,
    separators: customSeparators
);
foreach (var chunk in chunks)
{
    Console.WriteLine($"Chunk: {chunk.Text}");
    Console.WriteLine($"Split using: {chunk.SeparatorUsed}");
    Console.WriteLine($"At line {chunk.StartLine}, column {chunk.StartColumn}");
    Console.WriteLine("---");
}
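To illustrate what a separator hierarchy means in practice, the following is a minimal, self-contained sketch of recursive splitting: split on the first separator, merge neighbouring pieces while they still fit in `chunkSize`, and recurse with the next separator only for pieces that are still too long. `SplitSketch` is a hypothetical function written for this example and is not the library's implementation. It drops separators at chunk boundaries and tracks no metadata, so its output may differ from what AdvancedRecursiveSplit returns.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch, NOT the library's implementation.
static List<string> SplitSketch(string text, int chunkSize, string[] separators, int level = 0)
{
    var chunks = new List<string>();
    if (text.Length <= chunkSize)
    {
        if (text.Length > 0) chunks.Add(text);
        return chunks;
    }
    if (level >= separators.Length)
    {
        // Last resort: no separators left, so cut at fixed character offsets.
        for (int i = 0; i < text.Length; i += chunkSize)
            chunks.Add(text.Substring(i, Math.Min(chunkSize, text.Length - i)));
        return chunks;
    }

    string sep = separators[level];
    string current = string.Empty;
    foreach (string piece in text.Split(new[] { sep }, StringSplitOptions.None))
    {
        string candidate = current.Length == 0 ? piece : current + sep + piece;
        if (candidate.Length <= chunkSize)
        {
            current = candidate; // still fits, keep merging
            continue;
        }
        if (current.Length > 0) chunks.Add(current);
        if (piece.Length <= chunkSize)
        {
            current = piece;
        }
        else
        {
            // Piece is too long on its own: try the next, finer separator.
            chunks.AddRange(SplitSketch(piece, chunkSize, separators, level + 1));
            current = string.Empty;
        }
    }
    if (current.Length > 0) chunks.Add(current);
    return chunks;
}

string input = "Section 1|Subsection A;Item 1,Item 2|Section 2;Item 3";
string[] hierarchy = { "|", ";", "," };
List<string> result = SplitSketch(input, chunkSize: 20, separators: hierarchy);
foreach (string part in result)
{
    Console.WriteLine(part);
}
```

The order of the separator array is what encodes priority: coarser separators come first, and finer ones are only reached when a piece cannot fit otherwise.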
The library uses a hierarchical approach to splitting, trying larger semantic units first:
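1. Paragraph breaks (\r\n\r\n, \n\n) - Largest semantic units
2. Punctuation followed by a Windows line break (.\r\n, !\r\n, ?\r\n, :\r\n, ;\r\n)
3. Windows line breaks (\r\n)
4. Punctuation followed by a Unix line break (.\n, !\n, ?\n, :\n, ;\n)
5. Unix line breaks (\n)
6. Sentence endings followed by a space (". ", "! ", "? ")
7. Clause separators followed by a space ("; ", ", ")
8. Word boundaries (" ") - Single spaces
9. Individual characters ("") - Last resort

We welcome contributions to make RecursiveTextSplitter even better! Here are some ways you can help: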
Your star helps others discover this library and motivates continued development.
We're open to pull requests! Whether you want to:
Please feel free to fork the repository and submit a pull request. For larger changes, consider opening an issue first to discuss your approach.
Found a bug or have a suggestion? Please open an issue with: