Unified, read-only document extraction facade for OfficeIMO (Word/Excel/PowerPoint/Markdown/PDF) intended for AI ingestion.
$ dotnet add package OfficeIMO.Reader

OfficeIMO.Reader is an optional, read-only facade that normalizes extraction across:

- Word (.docx, .docm) -> Markdown chunks
- Excel (.xlsx, .xlsm) -> table chunks + optional Markdown table previews
- PowerPoint (.pptx, .pptm) -> slide-aligned Markdown chunks (optionally including notes)
- Markdown (.md, .markdown) -> heading-aware text chunks
- PDF (.pdf) -> page-aware text chunks

The goal is to make it easy for tools like chatbots to ingest content deterministically.
using OfficeIMO.Reader;
foreach (var chunk in DocumentReader.Read(@"C:\Docs\Policy.docx")) {
    Console.WriteLine(chunk.Id);
    Console.WriteLine(chunk.Location.HeadingPath);
    Console.WriteLine(chunk.Markdown ?? chunk.Text);
}
using OfficeIMO.Reader;
// Stream (does not close the stream)
using var fs = File.OpenRead(@"C:\Docs\Policy.docx");
var chunksFromStream = DocumentReader.Read(fs, "Policy.docx").ToList();
// Bytes
var bytes = File.ReadAllBytes(@"C:\Docs\Policy.docx");
var chunksFromBytes = DocumentReader.Read(bytes, "Policy.docx").ToList();
using OfficeIMO.Reader;
var chunks = DocumentReader.ReadFolder(
    folderPath: @"C:\Docs",
    folderOptions: new ReaderFolderOptions {
        Recurse = true,
        MaxFiles = 500,
        MaxTotalBytes = 500L * 1024 * 1024,
        SkipReparsePoints = true,
        DeterministicOrder = true
    },
    options: new ReaderOptions {
        MaxChars = 8_000
    }).ToList();
using OfficeIMO.Reader;
var result = DocumentReader.ReadFolderDetailed(
    folderPath: @"C:\KnowledgeBase",
    folderOptions: new ReaderFolderOptions { Recurse = true, MaxFiles = 10_000 },
    options: new ReaderOptions { ComputeHashes = true },
    includeChunks: true,
    onProgress: p => Console.WriteLine($"{p.Kind}: scanned={p.FilesScanned}, parsed={p.FilesParsed}, skipped={p.FilesSkipped}, chunks={p.ChunksProduced}"));
Console.WriteLine($"Files parsed: {result.FilesParsed}");
Console.WriteLine($"Files skipped: {result.FilesSkipped}");
Console.WriteLine($"Chunks: {result.ChunksProduced}");
using OfficeIMO.Reader;
foreach (var doc in DocumentReader.ReadFolderDocuments(
    folderPath: @"C:\KnowledgeBase",
    folderOptions: new ReaderFolderOptions { Recurse = true, MaxFiles = 10_000, DeterministicOrder = true },
    options: new ReaderOptions { ComputeHashes = true, MaxChars = 4_000 },
    onProgress: p => Console.WriteLine($"{p.Kind}: parsed={p.FilesParsed}, skipped={p.FilesSkipped}, chunks={p.ChunksProduced}"))) {
    if (!doc.Parsed) {
        Console.WriteLine($"SKIP {doc.Path}: {string.Join("; ", doc.Warnings ?? Array.Empty<string>())}");
        continue;
    }
    // Upsert your "sources" table keyed by doc.SourceId/doc.SourceHash,
    // then upsert chunk rows from doc.Chunks keyed by chunk.ChunkHash.
    Console.WriteLine($"{doc.Path} => {doc.ChunksProduced} chunks, ~{doc.TokenEstimateTotal} tokens");
}
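The upsert comment in the loop above can be sketched as a change-detection step: keep the last seen SourceHash per SourceId and only re-index documents whose hash differs. This is a minimal sketch, not part of OfficeIMO.Reader. `SourceInfo`, `IncrementalIndex`, and the in-memory dictionary are hypothetical stand-ins, and the string types for SourceId/SourceHash are an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the fields read from each source document.
// The string types are an assumption, not the library's definition.
public sealed record SourceInfo(string Path, string SourceId, string SourceHash);

public static class IncrementalIndex {
    // Returns the documents that are new or whose hash changed,
    // and records their hashes so the next run skips them.
    public static List<SourceInfo> SelectChanged(
        IEnumerable<SourceInfo> scanned,
        IDictionary<string, string> storedHashes) {
        var changed = new List<SourceInfo>();
        foreach (var doc in scanned) {
            if (storedHashes.TryGetValue(doc.SourceId, out var known) && known == doc.SourceHash) {
                continue; // unchanged since the last run
            }
            storedHashes[doc.SourceId] = doc.SourceHash;
            changed.Add(doc);
        }
        return changed;
    }
}
```

In a real pipeline the dictionary would be replaced by the "sources" table, and only the chunks of the returned documents would need to be re-upserted.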
using OfficeIMO.Reader;
using System.Text;
var chunks = DocumentReader.ReadFolder(
    folderPath: @"C:\KnowledgeBase",
    folderOptions: new ReaderFolderOptions { Recurse = true, DeterministicOrder = true },
    options: new ReaderOptions { MaxChars = 4000 }).ToList();
var context = new StringBuilder();
foreach (var chunk in chunks) {
    var source = chunk.Location.Path ?? "unknown";
    var pointer = chunk.Location.Page.HasValue
        ? $"page {chunk.Location.Page.Value}"
        : chunk.Location.HeadingPath ?? $"block {chunk.Location.BlockIndex ?? 0}";
    context.AppendLine($"[source: {source} | {pointer}]");
    context.AppendLine(chunk.Markdown ?? chunk.Text);
    context.AppendLine();
}
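The context builder above appends every chunk. When the prompt has a fixed size limit, the same loop can stop once a token budget is reached. The sketch below illustrates that idea under stated assumptions: `BudgetChunk` and `PromptBudget` are hypothetical names, and a per-chunk `TokenEstimate` field is assumed rather than taken from the library's API.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a chunk. A per-chunk TokenEstimate is an assumption.
public sealed record BudgetChunk(string Source, string Text, int TokenEstimate);

public static class PromptBudget {
    // Appends chunks in their original order and stops at the first chunk
    // that would push the running total above maxTokens.
    public static string BuildContext(IEnumerable<BudgetChunk> chunks, int maxTokens) {
        var context = new StringBuilder();
        var used = 0;
        foreach (var chunk in chunks) {
            if (used + chunk.TokenEstimate > maxTokens) {
                break; // preserve document order instead of skipping ahead
            }
            context.AppendLine($"[source: {chunk.Source}]");
            context.AppendLine(chunk.Text);
            context.AppendLine();
            used += chunk.TokenEstimate;
        }
        return context.ToString();
    }
}
```

Stopping at the first chunk that does not fit keeps the context contiguous. Skipping oversized chunks and continuing would pack more text but could leave gaps in the source order.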
using OfficeIMO.Reader;
var options = new ReaderOptions {
    MaxChars = 8_000,
    MaxTableRows = 200,
    IncludeWordFootnotes = true,
    IncludePowerPointNotes = true,
    ExcelHeadersInFirstRow = true,
    ExcelChunkRows = 200,
    ExcelSheetName = "Data",
    ExcelA1Range = "A1:Z500",
    MarkdownChunkByHeadings = true,
    ComputeHashes = true
};
var chunks = DocumentReader.Read(@"C:\Docs\Workbook.xlsx", options).ToList();
- DocumentReader.Read(...) is synchronous and streaming (returns IEnumerable<T>).
- DocumentReader.ReadFolder(...) is best-effort: unreadable/corrupt/oversized files emit warning chunks and ingestion continues.
- DocumentReader.ReadFolderDocuments(...) yields one source payload at a time (ReaderSourceDocument) for easy DB upserts.
- DocumentReader.ReadFolderDetailed(...) returns ingestion counts/file statuses and can surface progress callback events.
- Output includes SourceId/SourceHash/ChunkHash + token estimate for incremental indexing and prompt budgeting.
- Legacy binary formats (.doc, .xls, .ppt) are not supported.