A powerful .NET library for intelligent document structure analysis and chunking. Automatically identifies and parses various document patterns including Markdown headings, numeric outlines, legal sections, and appendices. Features hierarchical content organization, advanced keyword extraction with ML.NET, and ONNX vectorization support for semantic embeddings.
dotnet add package MarkdownStructureChunker
# Clone the repository
git clone https://github.com/DevelApp-ai/MarkdownStructureChunker.git
cd MarkdownStructureChunker
# Build the solution
dotnet build
# Run tests
dotnet test
using MarkdownStructureChunker.Core;
using MarkdownStructureChunker.Core.Extractors;
using MarkdownStructureChunker.Core.Strategies;
// Create chunking strategy and keyword extractor
var strategy = new PatternBasedStrategy(PatternBasedStrategy.CreateDefaultRules());
var extractor = new SimpleKeywordExtractor();
// Initialize the chunker
var chunker = new StructureChunker(strategy, extractor);
// Process a document
var document = @"
# Introduction
This document introduces machine learning concepts.
## Background
Machine learning is a subset of artificial intelligence.
### Applications
ML has numerous applications in various industries.
";
var result = await chunker.ProcessAsync(document, "ml-guide");
// Access the structured chunks
foreach (var chunk in result.Chunks)
{
    Console.WriteLine($"Level {chunk.Level}: {chunk.CleanTitle}");
    Console.WriteLine($"Keywords: {string.Join(", ", chunk.Keywords)}");
    Console.WriteLine($"Content: {chunk.Content.Substring(0, Math.Min(100, chunk.Content.Length))}...");
    Console.WriteLine();
}
# Level 1 Heading
## Level 2 Heading
### Level 3 Heading
#### Level 4 Heading
##### Level 5 Heading
###### Level 6 Heading
1. First Level
1.1 Second Level
1.1.1 Third Level
1.2 Another Second Level
2. Another First Level
§ 42 Compliance Requirements
§ 43 Data Protection Standards
Appendix A: Technical Specifications
Appendix B: Reference Materials
A. First Section
B. Second Section
C. Third Section
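The heading styles shown above can each be recognized with an anchored regular expression. The following is a minimal, self-contained sketch of that idea using only `System.Text.RegularExpressions`. The rule names, the patterns, and the `Classify` helper are hypothetical and written for illustration; they are not the library's actual default rules, which may differ.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical patterns for illustration; not the library's real default rules.
var rules = new (string Name, Regex Pattern)[]
{
    ("MarkdownHeading", new Regex(@"^(#{1,6})\s+(.+)$")),
    ("NumericOutline",  new Regex(@"^(\d+(?:\.\d+)*)\.?\s+(.+)$")),
    ("LegalSection",    new Regex(@"^§\s*(\d+)\s+(.+)$")),
    ("Appendix",        new Regex(@"^Appendix\s+([A-Z]):\s+(.+)$")),
    ("LetterSection",   new Regex(@"^([A-Z])\.\s+(.+)$")),
};

// Returns the first matching rule name, an inferred level, and the title text.
// Lines that match no rule come back as ("None", 0, line).
(string Rule, int Level, string Title) Classify(string line)
{
    var text = line.Trim();
    foreach (var (name, pattern) in rules)
    {
        var match = pattern.Match(text);
        if (!match.Success) continue;

        var marker = match.Groups[1].Value;
        int level = name switch
        {
            "MarkdownHeading" => marker.Length,            // number of '#' characters
            "NumericOutline"  => marker.Split('.').Length, // "1.1.1" has three parts
            _ => 1,                                        // other styles treated as top level
        };
        return (name, level, match.Groups[2].Value);
    }
    return ("None", 0, text);
}

foreach (var line in new[]
{
    "## Level 2 Heading",
    "1.1.1 Third Level",
    "§ 42 Compliance Requirements",
    "Appendix A: Technical Specifications",
    "B. Second Section",
})
{
    var hit = Classify(line);
    Console.WriteLine($"{hit.Rule} (level {hit.Level}): {hit.Title}");
}
```

Rule order matters in this sketch: the first pattern that matches wins, so more specific patterns should be listed before more general ones.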
The library follows a modular architecture with clear separation of concerns:
MarkdownStructureChunker.Core/
├── Models/
│   ├── ChunkNode.cs               # Individual chunk data structure
│   ├── DocumentGraph.cs           # Complete document structure
│   └── ChunkingRule.cs            # Pattern matching rules
├── Interfaces/
│   ├── IChunkingStrategy.cs       # Strategy pattern interface
│   ├── IKeywordExtractor.cs       # Keyword extraction interface
│   └── ILocalVectorizer.cs        # Vectorization interface
├── Strategies/
│   └── PatternBasedStrategy.cs    # Default pattern-based implementation
├── Extractors/
│   ├── SimpleKeywordExtractor.cs  # Frequency-based extraction
│   └── MLNetKeywordExtractor.cs   # ML.NET-powered extraction
├── Vectorizers/
│   └── OnnxVectorizer.cs          # ONNX model integration
└── StructureChunker.cs            # Main orchestrator class
// Create custom rules for specific document patterns
var customRules = new List<ChunkingRule>
{
    new ChunkingRule("CustomHeader", @"^SECTION\s+(\d+):\s+(.*)", level: 1, priority: 0),
    new ChunkingRule("Subsection", @"^(\d+\.\d+)\s+(.*)", priority: 10),
    // Add more custom patterns as needed
};
var strategy = new PatternBasedStrategy(customRules);
// Use ML.NET for more sophisticated keyword extraction
using var mlExtractor = new MLNetKeywordExtractor();
var chunker = new StructureChunker(strategy, mlExtractor);
var result = await chunker.ProcessAsync(document, "doc-id");
// Initialize with ONNX model for semantic embeddings
using var vectorizer = OnnxVectorizerFactory.CreateDefault();
// Vectorize chunk content with context
var enrichedContent = OnnxVectorizer.EnrichContentWithContext(
    chunk.Content,
    GetAncestralTitles(chunk)
);
var embedding = await vectorizer.VectorizeAsync(enrichedContent, isQuery: false);
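Once chunks and queries have embeddings, a common way to compare them is cosine similarity. The sketch below is self-contained and assumes the embeddings are plain `float[]` arrays of equal length; the return type of `VectorizeAsync` is not shown in this document, so treat that as an assumption. `CosineSimilarity` is a hypothetical helper, not part of the library.

```csharp
using System;

// Cosine similarity between two embedding vectors of equal length.
// Returns 0 when either vector has zero magnitude.
double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot   += (double)a[i] * b[i];
        normA += (double)a[i] * a[i];
        normB += (double)b[i] * b[i];
    }

    if (normA == 0 || normB == 0) return 0; // avoid division by zero
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Stand-in vectors; real embeddings would come from the vectorizer.
float[] queryVector  = { 1f, 0f, 1f };
float[] similarChunk = { 2f, 0f, 2f };
float[] otherChunk   = { 0f, 1f, 0f };

Console.WriteLine($"similar: {CosineSimilarity(queryVector, similarChunk):F3}");
Console.WriteLine($"other:   {CosineSimilarity(queryVector, otherChunk):F3}");
```

Because cosine similarity ignores vector magnitude, the scaled `similarChunk` scores the same as an exact copy of the query would.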
The library comes with pre-configured rules that handle common document patterns:
Markdown headings: # ## ### #### ##### ######
Numeric outlines: 1. 1.1 1.1.1 2.3.4.5
Legal sections: § 42 Section Title
Appendices: Appendix A: Title
Letter sections: A. B. C.
// Simple extractor with custom parameters
var simpleExtractor = new SimpleKeywordExtractor();
var keywords = await simpleExtractor.ExtractKeywordsAsync(text, maxKeywords: 10);
// ML.NET extractor with advanced processing
using var mlExtractor = new MLNetKeywordExtractor();
var advancedKeywords = await mlExtractor.ExtractKeywordsAsync(text, maxKeywords: 15);
[ApiController]
[Route("api/[controller]")]
public class DocumentController : ControllerBase
{
    private readonly StructureChunker _chunker;
    public DocumentController(StructureChunker chunker)
    {
        _chunker = chunker;
    }
    [HttpPost("analyze")]
    public async Task<IActionResult> AnalyzeDocument([FromBody] DocumentRequest request)
    {
        try
        {
            var result = await _chunker.ProcessAsync(request.Content, request.DocumentId);
            return Ok(result);
        }
        catch (Exception ex)
        {
            return BadRequest($"Error processing document: {ex.Message}");
        }
    }
}
// Program.cs or Startup.cs
services.AddSingleton<IChunkingStrategy>(provider =>
    new PatternBasedStrategy(PatternBasedStrategy.CreateDefaultRules()));
services.AddSingleton<IKeywordExtractor, MLNetKeywordExtractor>();
services.AddSingleton<StructureChunker>();
public async Task ProcessDocumentBatch(IEnumerable<string> documents)
{
    var tasks = documents.Select(async (doc, index) =>
    {
        var result = await chunker.ProcessAsync(doc, $"doc-{index}");
        return result;
    });
    var results = await Task.WhenAll(tasks);
    // Process results...
}
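The batch example starts every document at once. For large batches it can help to cap how many run concurrently. The sketch below shows one way to do that with `SemaphoreSlim`; `ProcessBoundedAsync` and its `process` delegate are hypothetical names, and the delegate is a stand-in for a call such as `chunker.ProcessAsync`, so the example runs without the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Runs `process` over every document, allowing at most `maxParallel` at a time.
// Results are returned in the same order as the input documents.
async Task<TResult[]> ProcessBoundedAsync<TResult>(
    IEnumerable<string> documents,
    Func<string, string, Task<TResult>> process,
    int maxParallel)
{
    using var gate = new SemaphoreSlim(maxParallel);
    var tasks = documents.Select(async (doc, index) =>
    {
        await gate.WaitAsync();          // wait for a free slot
        try
        {
            return await process(doc, $"doc-{index}");
        }
        finally
        {
            gate.Release();              // free the slot even if processing fails
        }
    }).ToList();                         // materialize so each task starts exactly once
    return await Task.WhenAll(tasks);
}

// Stand-in for the real processing call: returns the document length after a short delay.
var lengths = await ProcessBoundedAsync(
    new[] { "# A", "# B\ntext", "# C\nmore text" },
    async (doc, id) =>
    {
        await Task.Delay(10);
        return doc.Length;
    },
    maxParallel: 2);

Console.WriteLine(string.Join(", ", lengths));
```

Whether a shared chunker instance is safe to call concurrently is not stated in this document, so that is worth confirming before applying this pattern to it.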
Calls to ProcessAsync can fail on invalid input or during processing. Catch the specific exception types first, then fall back to a general handler:
try
{
    var result = await chunker.ProcessAsync(document, documentId);
}
catch (ArgumentException ex)
{
    // Handle invalid input parameters
    Console.WriteLine($"Invalid input: {ex.Message}");
}
catch (InvalidOperationException ex)
{
    // Handle processing errors
    Console.WriteLine($"Processing error: {ex.Message}");
}
catch (Exception ex)
{
    // Handle unexpected errors
    Console.WriteLine($"Unexpected error: {ex.Message}");
}
The library includes comprehensive test coverage:
# Run all tests
dotnet test
# Run with coverage
dotnet test --collect:"XPlat Code Coverage"
# Run specific test category
dotnet test --filter Category=Integration
Test categories:
git checkout -b feature/your-feature
dotnet test
git commit -m "Add your feature"
git push origin feature/your-feature
This project is licensed under the MIT License; see the LICENSE file for details.
For questions, issues, or contributions, please:
MarkdownStructureChunker - Intelligent document structure analysis for modern applications.