Complete document processing library for AI, ML, and analytics. DocumentAtom provides a light, fast library for breaking input documents into constituent parts (atoms), useful for text processing, analysis, machine learning, and artificial intelligence.

$ dotnet add package DocumentAtom
DocumentAtom requires Tesseract v5.0 on the host: certain document types can contain embedded images, which are parsed using OCR via Tesseract.
SDKs are available for multiple languages in the sdk/ directory:
| SDK | Location | Description |
|---|---|---|
| TypeScript/JavaScript | sdk/typescript/ | Full-featured SDK for Node.js and browser |
| Python | sdk/python/ | Python SDK for data science workflows |
| C# | sdk/csharp/ | .NET SDK client library |
v3.0 replaces the raw-binary POST API with a structured JSON envelope. Every extraction request now carries an optional Settings object alongside the document data, giving callers per-request control over parsing, processing, and chunking — without any server-side configuration changes.
In v2, callers uploaded raw bytes and got back atoms produced with whatever defaults the server happened to have. If you needed OCR, you appended ?ocr=true. If you needed different CSV delimiters, paragraph grouping, or chunk sizes, you had to change server configuration and redeploy.
v3 puts the caller in control:
- Per-request overrides: declare that one CSV uses `|` delimiters and has no header row, in the same request that uploads the file. Process one HTML page with script extraction enabled and another without.
- Backward-compatible defaults: send `"Settings": null` and the server behaves exactly like v2 defaults. Override only what you need.
- Structured envelope: `/atom/*` endpoints now accept `application/json` with base64-encoded document data instead of raw binary upload.
- `?ocr` query parameter removed: use `Settings.ExtractAtomsFromImages` in the JSON body instead.
- C# SDK: the `bool extractOcr` parameter is replaced by an optional settings object.

All atom extraction requests use this format:
{
"Settings": {
"TrimText": true,
"ExtractAtomsFromImages": true,
"Chunking": {
"Enable": true,
"Strategy": "SentenceBased",
"FixedTokenCount": 256,
"OverlapCount": 2,
"OverlapStrategy": "SentenceBoundaryAware"
}
},
"Data": "<base64-encoded-document>"
}
When Settings is null, the server uses default processor settings for the target document type. Any field you omit from Settings retains its server default — you only specify the values you want to override.
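For illustration, the envelope can be assembled with nothing more than a base64 step. This is a minimal sketch using Python's standard library; it builds the request body only and assumes you post it to an `/atom/*` endpoint yourself:

```python
import base64
import json

def build_envelope(document_bytes, settings=None):
    """Build the v3 JSON envelope: optional Settings plus base64 Data."""
    return json.dumps({
        "Settings": settings,  # None -> server behaves like v2 defaults
        "Data": base64.b64encode(document_bytes).decode("ascii"),
    })

# Envelope with server defaults
body = build_envelope(b"%PDF-1.7 ...")

# Envelope overriding only OCR extraction
body_ocr = build_envelope(b"%PDF-1.7 ...", {"ExtractAtomsFromImages": True})
```

Because omitted fields keep server defaults, the second envelope changes only OCR behavior and nothing else.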
Every setting in the Settings object is optional. Omitted fields keep server defaults.
Common settings (available on all processor types except HTML):
| Setting | Type | Description |
|---|---|---|
| TrimText | bool | Trim whitespace from extracted text |
| RemoveBinaryFromText | bool | Strip binary data from text output |
| ExtractAtomsFromImages | bool | Enable OCR on embedded images (requires Tesseract on host) |
| Chunking | object | Chunking configuration (see below) |
Type-specific settings:
| Setting | Type | Applies To | Description |
|---|---|---|---|
| RowDelimiter | string | CSV | Row delimiter string |
| ColumnDelimiter | char | CSV | Column delimiter character |
| HasHeaderRow | bool | CSV | Whether first row is a header |
| RowsPerAtom | int | CSV | Number of rows per atom |
| BuildHierarchy | bool | Excel, HTML, JSON, Markdown, Word, PowerPoint, XML | Build hierarchical atom structure from headings/sections |
| Delimiters | string[] | Markdown, Text | Custom content delimiters |
| MaxDepth | int | JSON, XML | Maximum nesting depth to process |
| IncludeAttributes | bool | XML | Include XML attributes in atoms |
| PreserveWhitespace | bool | HTML, XML | Preserve whitespace in output |
| HeaderRowScoreThreshold | int | Excel | Threshold score for header row detection |
| ProcessInlineStyles | bool | HTML | Process inline CSS styles |
| ProcessMetaTags | bool | HTML | Include meta tag content |
| ProcessScripts | bool | HTML | Include script content |
| ProcessComments | bool | HTML | Include HTML comments |
| MaxTextLength | int | HTML | Maximum text length to process |
| ProcessSvg | bool | HTML | Process SVG elements |
| ExtractDataAttributes | bool | HTML | Extract data-* attributes |
| LineThreshold | int | OCR, PNG | Line detection threshold |
| ParagraphThreshold | int | OCR, PNG | Paragraph detection threshold |
| HorizontalLineLength | int | OCR, PNG | Minimum horizontal line length |
| VerticalLineLength | int | OCR, PNG | Minimum vertical line length |
| TableMinArea | int | OCR, PNG | Minimum area for table detection |
| ColumnAlignmentTolerance | int | OCR, PNG | Column alignment tolerance |
| ProximityThreshold | int | OCR, PNG | Element proximity threshold |
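Putting the CSV-specific fields together, a per-request settings object for an unusual CSV might look like the sketch below. The delimiter values and row grouping here are hypothetical choices for illustration, not defaults:

```python
import json

# Hypothetical per-request settings for a semicolon-delimited CSV
# with no header row, grouped five rows per atom.
csv_settings = {
    "TrimText": True,
    "ColumnDelimiter": ";",
    "RowDelimiter": "\r\n",
    "HasHeaderRow": False,
    "RowsPerAtom": 5,
}

# Wrapped in the v3 envelope (Data shown as a placeholder)
request_body = {"Settings": csv_settings, "Data": "<base64-encoded-csv>"}
print(json.dumps(request_body, indent=2))
```

Any field left out of `csv_settings` keeps its server default, so a request that only needs to flip `HasHeaderRow` can send just that one key.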
v3.0 introduces server-side chunking with 11 strategies. When enabled, each Atom in the response includes a Chunks array of content fragments — ready for embedding, vector storage, or retrieval without any client-side post-processing.
Chunking Strategies:
| Strategy | Description | Use Case |
|---|---|---|
| FixedTokenCount | Splits text into fixed token-count windows using cl100k_base | Embedding models with fixed context windows |
| SentenceBased | Groups sentences up to a token budget | RAG pipelines, semantic search |
| ParagraphBased | Groups paragraphs up to a token budget | Summarization, longer context |
| RegexBased | Splits on a user-supplied regex pattern | Domain-specific delimiters |
| WholeList | Serializes an entire list atom as one chunk | Short lists that shouldn't be split |
| ListEntry | Each list item becomes its own chunk | FAQ lists, bullet-point extraction |
| Row | Each table row becomes a chunk (no headers) | Simple tabular data |
| RowWithHeaders | Each row becomes a chunk prefixed with column headers | Tabular data needing context |
| RowGroupWithHeaders | Groups N rows together with headers | Large tables, batch processing |
| KeyValuePairs | Each row becomes Header: Value pairs | Structured data extraction |
| WholeTable | Entire table serialized as one chunk | Small tables, preserving structure |
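As a rough illustration of the row-oriented strategies above (this is not the library's implementation, and the exact serialization format may differ), RowWithHeaders and KeyValuePairs could render a table row like this:

```python
def row_with_headers(headers, row):
    """RowWithHeaders-style chunk: row values prefixed with the header line."""
    return " | ".join(headers) + "\n" + " | ".join(row)

def key_value_pairs(headers, row):
    """KeyValuePairs-style chunk: one 'Header: Value' pair per column."""
    return "\n".join(f"{h}: {v}" for h, v in zip(headers, row))

headers = ["Name", "Role"]
row = ["Ada", "Engineer"]
print(row_with_headers(headers, row))
print(key_value_pairs(headers, row))
```

The point of both strategies is the same: each row-sized chunk carries its column context, so a retrieved chunk is meaningful on its own.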
Overlap Strategies (for text-based chunking):
| Strategy | Description |
|---|---|
| SlidingWindow | Overlaps by raw token/sentence/paragraph count |
| SentenceBoundaryAware | Overlap snaps to sentence boundaries |
| SemanticBoundaryAware | Overlap snaps to paragraph boundaries |
ChunkingConfiguration Fields:
| Field | Type | Default | Description |
|---|---|---|---|
| Enable | bool | false | Enable/disable chunking |
| Strategy | string | FixedTokenCount | One of the 11 strategies above |
| FixedTokenCount | int | 256 | Token budget per chunk (min: 1) |
| OverlapCount | int | 0 | Number of overlap units (min: 0) |
| OverlapPercentage | double | null | Overlap as percentage (0.0-1.0); when set, takes precedence over OverlapCount |
| OverlapStrategy | string | SlidingWindow | One of the 3 overlap strategies |
| RowGroupSize | int | 5 | Rows per group for RowGroupWithHeaders (min: 1) |
| ContextPrefix | string | null | Text prepended to each chunk |
| RegexPattern | string | null | Split pattern for RegexBased strategy |
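The precedence between OverlapPercentage and OverlapCount can be sketched as follows. This is one plausible reading of the field table above, not the server's actual code:

```python
def effective_overlap(token_budget, overlap_count=0, overlap_percentage=None):
    """Resolve the overlap size for a chunk.

    OverlapPercentage (0.0-1.0), when set, takes precedence over
    OverlapCount, per the ChunkingConfiguration field table.
    """
    if overlap_percentage is not None:
        return int(token_budget * overlap_percentage)
    return overlap_count

# 256-token chunks with 10% overlap resolve to 25 tokens,
# and OverlapCount=2 is ignored.
print(effective_overlap(256, overlap_count=2, overlap_percentage=0.1))
```

Percentage-based overlap keeps the overlap proportional when you tune FixedTokenCount, which is why it wins over the absolute count when both are present.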
An atom can have both quarks (structure, hierarchy, parent-child relationships) and chunks (fragments of atom data based on chunking strategy).
C#:
using DocumentAtom.Sdk;
using DocumentAtom.Core.Api;
var sdk = new DocumentAtomSdk("http://localhost:8000");
byte[] data = File.ReadAllBytes("document.pdf");
// Without settings (server defaults)
List<Atom>? atoms = await sdk.Atom.ProcessPdf(data);
// With settings: enable OCR and sentence-based chunking
var settings = new ApiProcessorSettings
{
ExtractAtomsFromImages = true,
Chunking = new ChunkingConfiguration
{
Enable = true,
Strategy = ChunkStrategyEnum.SentenceBased,
FixedTokenCount = 256,
OverlapCount = 2,
OverlapStrategy = OverlapStrategyEnum.SentenceBoundaryAware
}
};
atoms = await sdk.Atom.ProcessPdf(data, settings);
TypeScript:
import * as fs from 'fs';
import DocumentAtomSdk from 'document-atom-sdk';
const sdk = new DocumentAtomSdk({ endpoint: 'http://localhost:8000' });
const fileBuffer = fs.readFileSync('document.pdf');
// Without settings
const atoms = await sdk.extractAtom.pdf(fileBuffer);
// With settings: enable OCR and sentence-based chunking
const atomsWithSettings = await sdk.extractAtom.pdf(fileBuffer, {
ExtractAtomsFromImages: true,
Chunking: {
Enable: true,
Strategy: 'SentenceBased',
FixedTokenCount: 256,
OverlapCount: 2,
OverlapStrategy: 'SentenceBoundaryAware',
},
});
Python:
from document_atom_sdk import DocumentAtomSdk, ApiProcessorSettingsModel, ChunkingConfigurationModel
sdk = DocumentAtomSdk(endpoint="http://localhost:8000")
with open("document.pdf", "rb") as f:
data = f.read()
# Without settings
atoms = sdk.atom.extract_atoms_pdf(data)
# With settings: enable OCR and sentence-based chunking
settings = ApiProcessorSettingsModel(
extract_atoms_from_images=True,
chunking=ChunkingConfigurationModel(
enable=True,
strategy="SentenceBased",
fixed_token_count=256,
overlap_count=2,
overlap_strategy="SentenceBoundaryAware",
),
)
atoms = sdk.atom.extract_atoms_pdf(data, settings=settings)
API calls: Replace raw binary POST with JSON envelope:
# v2 (no longer supported)
POST /atom/pdf
Content-Type: application/octet-stream
Body: <raw bytes>
# v3
POST /atom/pdf
Content-Type: application/json
Body: { "Settings": null, "Data": "<base64>" }
OCR extraction: Replace ?ocr=true query parameter:
# v2 (no longer supported)
POST /atom/pdf?ocr=true
# v3
POST /atom/pdf
Body: { "Settings": { "ExtractAtomsFromImages": true }, "Data": "<base64>" }
C# SDK: Replace bool extractOcr parameter with ApiProcessorSettings?:
// v2
var atoms = await sdk.Atom.ProcessPdf(data, extractOcr: true);
// v3
var settings = new ApiProcessorSettings { ExtractAtomsFromImages = true };
var atoms = await sdk.Atom.ProcessPdf(data, settings);
TypeScript SDK: Replace individual parameters with settings object:
// v2
const atoms = await sdk.extractAtom.pdf(fileBuffer);
// v3 (same for no settings, but now accepts optional settings)
const atoms = await sdk.extractAtom.pdf(fileBuffer, { ExtractAtomsFromImages: true });
Python SDK: Replace ocr parameter with settings:
# v2
atoms = sdk.atom.extract_atoms_pdf(data, ocr=True)
# v3
settings = ApiProcessorSettingsModel(extract_atoms_from_images=True)
atoms = sdk.atom.extract_atoms_pdf(data, settings=settings)
- Centralized build configuration via Directory.Build.props
- Data ingestion package (DocumentAtom.DataIngestion) for RAG/AI pipeline integration
- Dependency injection support via Microsoft.Extensions.DependencyInjection
- Hierarchy building (BuildHierarchy in settings) - heading-based for markdown/HTML/Word, page-based for PowerPoint
- MCP server (DocumentAtom.McpServer) for exposing DocumentAtom operations via Model Context Protocol to AI assistants

Parsing documents and extracting constituent parts is one part science and one part black magic. If you find ways to improve processing and extraction in any way that is horizontally useful, I would love your feedback on ways to make this library more accurate, more useful, faster, and overall better. My goal in building this library is to make it easier to analyze input data assets and make them more consumable by other systems, including analytics and artificial intelligence.
Please feel free to file issues, enhancement requests, or start discussions about use of the library, improvements, or fixes.
DocumentAtom supports the following input file types:
Refer to the various Test projects for working examples.
The following example shows processing a markdown (.md) file.
using DocumentAtom.Core.Atoms;
using DocumentAtom.Text.Markdown;
MarkdownProcessorSettings settings = new MarkdownProcessorSettings();
MarkdownProcessor processor = new MarkdownProcessor(settings);
foreach (Atom atom in processor.Extract(filename))
Console.WriteLine(atom.ToString());
DocumentAtom parses input data assets into a variety of Atom objects. Each Atom includes top-level metadata including:
- GUID - globally-unique identifier
- ParentGUID - globally-unique identifier of the parent atom, or null
- Type - including Text, Image, Binary, Table, and List
- PageNumber - where available; some document types do not explicitly indicate page numbers, and page numbers are inferred when rendered
- Position - the ordinal position of the Atom, relative to others
- Length - the length of the Atom's content
- MD5Hash - the MD5 hash of the Atom content
- SHA1Hash - the SHA1 hash of the Atom content
- SHA256Hash - the SHA256 hash of the Atom content
- Quarks - structural sub-atoms from the document (e.g., cells in a table row, items in a list)
- Chunks - content fragments produced by the chunking engine when chunking is enabled via Settings

The AtomBase class provides the aforementioned metadata, and several type-specific Atoms are returned from the various processors, including:
- BinaryAtom - includes a Bytes property
- DocxAtom - includes Text, HeaderLevel, UnorderedList, OrderedList, Table, and Binary properties
- ImageAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
- MarkdownAtom - includes Formatting, Text, UnorderedList, OrderedList, and Table properties
- PdfAtom - includes BoundingBox, Text, UnorderedList, OrderedList, Table, and Binary properties
- PptxAtom - includes Title, Subtitle, Text, UnorderedList, OrderedList, Table, and Binary properties
- TableAtom - includes Rows, Columns, Irregular, and Table properties
- TextAtom - includes Text
- XlsxAtom - includes SheetName, CellIdentifier, Text, Table, and Binary properties

Table objects inside of Atom objects are always presented as SerializableDataTable objects (see SerializableDataTable for more information) to provide simple serialization and conversion to native System.Data.DataTable objects.
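The hash metadata can be reproduced client-side to verify atom content. A minimal sketch using Python's hashlib (the key names mirror the Atom fields; the verification step itself is your own code, not an SDK call):

```python
import hashlib

def content_hashes(content: bytes) -> dict:
    """Compute the digests an Atom carries (MD5Hash, SHA1Hash, SHA256Hash)."""
    return {
        "MD5Hash": hashlib.md5(content).hexdigest(),
        "SHA1Hash": hashlib.sha1(content).hexdigest(),
        "SHA256Hash": hashlib.sha256(content).hexdigest(),
    }

print(content_hashes(b"hello"))
```

Comparing these digests against an Atom's metadata is a cheap way to detect corruption or deduplicate atoms across documents.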
DocumentAtom is built on the shoulders of several libraries, without which this work would not be possible.
Each of these libraries was integrated as a NuGet package; no source was included or modified from these packages.
My libraries used within DocumentAtom:
The DocumentAtom.DataIngestion package provides a high-level API for processing documents and producing chunks ready for embedding and vector storage. It's designed to integrate seamlessly with RAG (Retrieval-Augmented Generation) applications and AI pipelines.
using DocumentAtom.DataIngestion;
using DocumentAtom.DataIngestion.Processors;
// Create processor with RAG-optimized settings
AtomDocumentProcessorOptions options = AtomDocumentProcessorOptions.ForRag();
using AtomDocumentProcessor processor = new AtomDocumentProcessor(options);
// Process a document and get chunks
await foreach (IngestionChunk chunk in processor.ProcessAsync("document.pdf"))
{
Console.WriteLine($"Chunk {chunk.ChunkIndex}: {chunk.Content.Substring(0, 100)}...");
// Access metadata for filtering
if (chunk.Metadata.TryGetValue("atom:page_number", out object? page))
Console.WriteLine($" Page: {page}");
}
using DocumentAtom.DataIngestion.Extensions;
// In your service configuration
services.AddDocumentAtomIngestionForRag();
// Or with custom options
services.AddDocumentAtomIngestion(
reader => {
reader.EnableOcr = true;
reader.BuildHierarchy = true;
},
chunker => {
chunker.Chunking = new ChunkingConfiguration
{
Enable = true,
Strategy = ChunkStrategyEnum.SentenceBased,
FixedTokenCount = 500,
OverlapCount = 2,
OverlapStrategy = OverlapStrategyEnum.SentenceBoundaryAware
};
});
| Method | Strategy | Token Budget | Best For |
|---|---|---|---|
| AtomDocumentProcessorOptions.ForRag() | SentenceBased | 256 | Vector database ingestion, semantic search |
| AtomDocumentProcessorOptions.ForSummarization() | ParagraphBased | 1024 | Document summarization, analysis |
| AtomDocumentProcessorOptions.ForLargeContext() | ParagraphBased | 2048 | Large context window models |
Run the DocumentAtom.Server project to start a RESTful server listening on localhost:8000. Modify the documentatom.json file to change the webserver, logging, or Tesseract settings. Alternatively, you can pull jchristn77/documentatom from Docker Hub. Refer to the Docker directory in the project for assets for running in Docker.
Refer to the Postman collection for examples exercising the APIs.
cd src/DocumentAtom.Server
dotnet run
docker pull jchristn77/documentatom:v1.1.0
Create a documentatom.json configuration file (see Docker/documentatom.json for an example)
Run the container:
# Windows
docker run -p 8000:8000 -v .\documentatom.json:/app/documentatom.json -v .\logs\:/app/logs/ jchristn77/documentatom:v1.1.0
# Linux/macOS
docker run -p 8000:8000 -v ./documentatom.json:/app/documentatom.json -v ./logs/:/app/logs/ jchristn77/documentatom:v1.1.0
Alternatively, use the provided scripts in the Docker directory:
# Windows
Dockerrun.bat v1.1.0
# Linux/macOS
IMG_TAG=v1.1.0 ./Dockerrun.sh
The DocumentAtom.McpServer project provides a Model Context Protocol (MCP) server that exposes DocumentAtom operations to AI assistants and LLM-based tools. The MCP server acts as a front-end to the DocumentAtom.Server RESTful API, enabling AI agents to process documents via standardized MCP tool calls.
The MCP server supports three transport protocols:
- HTTP JSON-RPC at /rpc (default port 8200)
- Raw TCP (default port 8201)
- WebSocket at /mcp (default port 8202)

The MCP server requires a running DocumentAtom.Server instance. Configure the endpoint in documentatom.json:
{
"DocumentAtom": {
"Endpoint": "http://localhost:8000",
"AccessKey": null
}
}
cd src/DocumentAtom.McpServer
dotnet run
Command-line options:
- --config=<file> - Specify settings file path (default: ./documentatom.json)
- --showconfig - Display configuration and exit
- --help, -h - Show help message

docker pull jchristn77/documentatom-mcp:v1.1.0
Create a documentatom.json configuration file with MCP server settings:
{
"Logging": {
"LogDirectory": "./logs/",
"LogFilename": "documentatom-mcp.log",
"ConsoleLogging": true,
"EnableColors": true,
"MinimumSeverity": 0
},
"DocumentAtom": {
"Endpoint": "http://host.docker.internal:8000",
"AccessKey": null
},
"Http": {
"Hostname": "0.0.0.0",
"Port": 8200
},
"Tcp": {
"Address": "0.0.0.0",
"Port": 8201
},
"WebSocket": {
"Hostname": "0.0.0.0",
"Port": 8202
},
"Storage": {
"BackupsDirectory": "./backups/",
"TempDirectory": "./temp/"
}
}
# Windows
docker run -p 8200:8200 -p 8201:8201 -p 8202:8202 ^
-v .\documentatom.json:/app/documentatom.json ^
-v .\logs\:/app/logs/ ^
-v .\temp\:/app/temp/ ^
-v .\backups\:/app/backups/ ^
jchristn77/documentatom-mcp:v1.1.0
# Linux/macOS
docker run -p 8200:8200 -p 8201:8201 -p 8202:8202 \
-v ./documentatom.json:/app/documentatom.json \
-v ./logs/:/app/logs/ \
-v ./temp/:/app/temp/ \
-v ./backups/:/app/backups/ \
jchristn77/documentatom-mcp:v1.1.0
Alternatively, use the provided scripts in src/DocumentAtom.McpServer:
# Windows
Dockerrun.bat v1.0.0
# Linux/macOS
IMG_TAG=v1.0.0 ./Dockerrun.sh
The MCP server supports the following environment variables to override configuration:
| Variable | Description |
|---|---|
| DOCUMENTATOM_ENDPOINT | DocumentAtom server endpoint URL |
| DOCUMENTATOM_ACCESS_KEY | Access key for authentication |
| MCP_HTTP_HOSTNAME | HTTP server hostname |
| MCP_HTTP_PORT | HTTP server port |
| MCP_TCP_ADDRESS | TCP server address |
| MCP_TCP_PORT | TCP server port |
| MCP_WEBSOCKET_HOSTNAME | WebSocket server hostname |
| MCP_WEBSOCKET_PORT | WebSocket server port |
| CONSOLE_LOGGING | Enable console logging (1 or 0) |
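The override pattern is the usual one: an environment variable, when set, wins over the value in documentatom.json. A sketch of that resolution for the endpoint setting (illustrative pattern, not the server's code):

```python
import os

def resolve_endpoint(config: dict) -> str:
    """DOCUMENTATOM_ENDPOINT, when set, overrides documentatom.json."""
    return os.environ.get(
        "DOCUMENTATOM_ENDPOINT",
        config.get("DocumentAtom", {}).get("Endpoint", "http://localhost:8000"),
    )

config = {"DocumentAtom": {"Endpoint": "http://host.docker.internal:8000"}}
print(resolve_endpoint(config))
```

This is convenient in Docker, where the same image can point at different DocumentAtom.Server instances without editing the mounted config file.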
To build the Docker images locally:
# Build DocumentAtom.Server image
cd Docker
Dockerbuild.bat v1.1.0 0 # 0 = don't push, 1 = push to Docker Hub
# Build DocumentAtom.McpServer image (from src directory)
cd src
docker buildx build -f DocumentAtom.McpServer/Dockerfile --platform linux/amd64,linux/arm64/v8 --tag jchristn77/documentatom-mcp:v1.1.0 --push .
Please refer to CHANGELOG.md for version history.
Special thanks to iconduck.com and the content authors for producing this icon.