SemanticDocIngestor.Core is a powerful .NET 9 SDK for document ingestion, semantic search, and retrieval-augmented generation (RAG) with hybrid search capabilities. Build intelligent document processing pipelines with vector and keyword search powered by Qdrant and Elasticsearch. Supports multi-source ingestion from local files, OneDrive, and Google Drive with real-time progress tracking and AI-powered answers using Ollama LLM models.
Install the NuGet package:
dotnet add package SemanticDocIngestor.Core
The SDK requires:
.NET Aspire automatically manages service discovery and connection strings. The SDK integrates seamlessly with Aspire's orchestration.
In your Program.cs:
using SemanticDocIngestor.Core;
var builder = WebApplication.CreateBuilder(args);
// Add .NET Aspire service defaults (includes service discovery, telemetry, health checks)
builder.AddServiceDefaults();
// Add SemanticDocIngestor services
// Connection strings are automatically discovered via Aspire service discovery
builder.Services.AddSemanticDocIngestorCore(builder.Configuration);
var app = builder.Build();
// Map default Aspire endpoints (health checks, metrics)
app.MapDefaultEndpoints();
// Use SemanticDocIngestor middleware
var loggerFactory = app.Services.GetRequiredService<ILoggerFactory>();
app.UseSemanticDocIngestorCore(app.Configuration, loggerFactory);
app.Run();
In your Aspire AppHost project (e.g., Program.cs):
var builder = DistributedApplication.CreateBuilder(args);
// Add infrastructure services
var elasticsearch = builder.AddElasticsearch("elasticsearch")
.WithDataVolume();
var qdrant = builder.AddQdrant("qdrant")
.WithDataVolume();
var ollama = builder.AddOllama("ollama")
.WithDataVolume()
.WithOpenWebUI(); // Optional: Add Open WebUI for model management
// Add your API service with references
var apiService = builder.AddProject<Projects.SemanticDocIngestor_AppHost_ApiService>("apiservice")
.WithReference(elasticsearch)
.WithReference(qdrant)
.WithReference(ollama);
builder.Build().Run();
With Aspire, your appsettings.json only needs SDK-specific settings:
{
"AppSettings": {
"Ollama": {
"ChatModel": "llama3.2",
"EmbeddingModel": "nomic-embed-text",
"Temperature": 0.7,
"MaxTokens": 2048
},
"Qdrant": {
"CollectionName": "documents",
"VectorSize": 768,
"Distance": "Cosine"
},
"Elastic": {
"SemanticDocIndexName": "semantic_docs",
"DocRepoIndexName": "docs_repo"
}
},
"ResiliencyMiddlewareOptions": {
"RetryCount": 3,
"TimeoutSeconds": 30,
"ExceptionsAllowedBeforeCircuitBreaking": 5,
"CircuitBreakingDurationSeconds": 60
}
}
Note: Connection strings for elasticsearch, qdrant, and ollama are automatically injected by Aspire service discovery.
If you're not using .NET Aspire, you need to manually configure connection strings.
using SemanticDocIngestor.Core;
var builder = WebApplication.CreateBuilder(args);
// Add SemanticDocIngestor services
builder.Services.AddSemanticDocIngestorCore(builder.Configuration);
var app = builder.Build();
// Use SemanticDocIngestor middleware
var loggerFactory = app.Services.GetRequiredService<ILoggerFactory>();
app.UseSemanticDocIngestorCore(app.Configuration, loggerFactory);
app.Run();
Add all connection strings and settings to your appsettings.json:
{
"ConnectionStrings": {
"elasticsearch": "http://localhost:9200",
"qdrant": "http://localhost:6334",
"ollama": "http://localhost:11434"
},
"AppSettings": {
"Ollama": {
"ChatModel": "llama3.2",
"EmbeddingModel": "nomic-embed-text",
"Temperature": 0.7,
"MaxTokens": 2048
},
"Qdrant": {
"CollectionName": "documents",
"VectorSize": 768,
"Distance": "Cosine"
},
"Elastic": {
"SemanticDocIndexName": "semantic_docs",
"DocRepoIndexName": "docs_repo"
}
},
"ResiliencyMiddlewareOptions": {
"RetryCount": 3,
"TimeoutSeconds": 30,
"ExceptionsAllowedBeforeCircuitBreaking": 5,
"CircuitBreakingDurationSeconds": 60
}
}
The SDK expects the following connection string names in appsettings.json or via .NET Aspire:
{
"ConnectionStrings": {
"elasticsearch": "http://localhost:9200"
}
}
With Authentication:
{
"ConnectionStrings": {
"elasticsearch": "http://username:password@localhost:9200"
}
}
Cloud/Elastic Cloud:
{
"ConnectionStrings": {
"elasticsearch": "https://my-deployment.es.us-central1.gcp.cloud.es.io:9243"
}
}
{
"ConnectionStrings": {
"qdrant": "http://localhost:6334"
}
}
With API Key:
{
"ConnectionStrings": {
"qdrant": "Endpoint=http://localhost:6334;Key=your-api-key-here"
}
}
Qdrant Cloud:
{
"ConnectionStrings": {
"qdrant": "Endpoint=https://xyz-example.qdrant.io:6334;Key=your-cloud-api-key"
}
}
Note: Qdrant uses gRPC by default on port 6334. If you're using the HTTP REST API, change the port to 6333.
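The Qdrant connection strings above come in two shapes: a bare URL, or `Endpoint=...;Key=...` pairs. The sketch below shows one way to tell them apart. `ParseQdrant` is a hypothetical helper written for illustration, not part of the SDK, and the SDK's own parsing may differ.

```csharp
using System;

// Hypothetical helper (not part of the SDK): splits a Qdrant connection string
// into its endpoint and optional API key.
static (string Endpoint, string? Key) ParseQdrant(string connectionString)
{
    // A bare URL carries no key.
    if (!connectionString.Contains("Endpoint=", StringComparison.OrdinalIgnoreCase))
        return (connectionString.Trim(), null);

    string? endpoint = null, key = null;
    foreach (var part in connectionString.Split(';', StringSplitOptions.RemoveEmptyEntries))
    {
        // Split on the first '=' only, so '=' inside a key value is preserved.
        var pair = part.Split('=', 2);
        if (pair.Length != 2) continue;
        var name = pair[0].Trim();
        var value = pair[1].Trim();
        if (name.Equals("Endpoint", StringComparison.OrdinalIgnoreCase)) endpoint = value;
        else if (name.Equals("Key", StringComparison.OrdinalIgnoreCase)) key = value;
    }
    if (endpoint is null)
        throw new FormatException("Connection string has no Endpoint entry.");
    return (endpoint, key);
}

var parsed = ParseQdrant("Endpoint=http://localhost:6334;Key=your-api-key-here");
Console.WriteLine($"{parsed.Endpoint} (key set: {parsed.Key is not null})");
```

Parsing at startup like this surfaces a malformed string before the first search request rather than during it.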
{
"ConnectionStrings": {
"ollama": "http://localhost:11434"
}
}
Remote Ollama:
{
"ConnectionStrings": {
"ollama": "http://your-ollama-server:11434"
}
}
Use appsettings.Development.json for local development:
{
"ConnectionStrings": {
"elasticsearch": "http://localhost:9200",
"qdrant": "http://localhost:6334",
"ollama": "http://localhost:11434"
}
}
Use appsettings.Production.json or environment variables:
{
"ConnectionStrings": {
"elasticsearch": "https://prod-es.company.com:9243",
"qdrant": "Endpoint=https://prod-qdrant.company.com:6334;Key=${QDRANT_API_KEY}",
"ollama": "http://ollama-service:11434"
}
}
Set connection strings via environment variables (useful for Docker/Kubernetes):
export ConnectionStrings__elasticsearch="http://elasticsearch:9200"
export ConnectionStrings__qdrant="http://qdrant:6334"
export ConnectionStrings__ollama="http://ollama:11434"
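The double underscore in these variable names matters: .NET's environment-variable configuration provider treats `__` as the `:` section separator, so `ConnectionStrings__qdrant` surfaces as the key `ConnectionStrings:qdrant`. The sketch below only illustrates that name mapping with plain string handling. `ToConfigKey` is a hypothetical name; the real work is done by the configuration provider.

```csharp
using System;

// Hypothetical illustration of the name mapping performed by the
// environment-variable configuration provider: "__" becomes ":".
static string ToConfigKey(string environmentVariableName) =>
    environmentVariableName.Replace("__", ":");

foreach (var name in new[]
{
    "ConnectionStrings__elasticsearch",
    "ConnectionStrings__qdrant",
    "ConnectionStrings__ollama",
})
{
    Console.WriteLine($"{name} -> {ToConfigKey(name)}");
}
```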
In Docker Compose:
services:
api:
image: your-api-image
environment:
- ConnectionStrings__elasticsearch=http://elasticsearch:9200
- ConnectionStrings__qdrant=http://qdrant:6334
- ConnectionStrings__ollama=http://ollama:11434
depends_on:
- elasticsearch
- qdrant
- ollama
When deploying to Azure, use Azure App Configuration or Key Vault:
builder.Configuration.AddAzureAppConfiguration(options =>
{
options.Connect(builder.Configuration["ConnectionStrings:AppConfig"])
.UseFeatureFlags();
});
// Or use Key Vault
builder.Configuration.AddAzureKeyVault(
new Uri($"https://{keyVaultName}.vault.azure.net/"),
new DefaultAzureCredential());
The SDK resolves connection strings in the following order:
1. Environment variables (e.g. ConnectionStrings__elasticsearch)
2. Environment-specific settings files (e.g. appsettings.Production.json)
When using .NET Aspire, the SDK automatically discovers services by their registered names:
| Service Name | SDK Uses For | Default Port |
|---|---|---|
| elasticsearch | Keyword search and metadata storage | 9200 |
| qdrant | Vector embeddings and semantic search | 6334 |
| ollama | LLM chat and embedding generation | 11434 |
The SDK works seamlessly with Aspire's container resources:
// In AppHost Program.cs
var elasticsearch = builder.AddElasticsearch("elasticsearch")
.WithDataVolume()
.WithLifetime(ContainerLifetime.Persistent);
var qdrant = builder.AddContainer("qdrant", "qdrant/qdrant")
.WithBindMount("./qdrant_data", "/qdrant/storage")
.WithHttpEndpoint(port: 6334, targetPort: 6334, name: "qdrant");
var ollama = builder.AddContainer("ollama", "ollama/ollama")
.WithBindMount("./ollama_data", "/root/.ollama")
.WithHttpEndpoint(port: 11434, targetPort: 11434, name: "ollama");
The SDK supports Aspire's health check infrastructure:
// In your API service
builder.Services.AddHealthChecks()
.AddElasticsearch(builder.Configuration.GetConnectionString("elasticsearch")!)
.AddQdrant(builder.Configuration.GetConnectionString("qdrant")!);
// In AppHost, health checks are automatically monitored
var apiService = builder.AddProject<Projects.YourApi>("api")
.WithReference(elasticsearch)
.WithReference(qdrant)
.WithHttpHealthCheck("/health"); // Polls the /health endpoint mapped by MapDefaultEndpoints
When running with Aspire, you can monitor the SDK's operations in the Aspire Dashboard:
Access the dashboard at: http://localhost:15888 (default)
using Microsoft.AspNetCore.Mvc;
using SemanticDocIngestor.Domain.Abstractions.Services;
public class DocumentController : ControllerBase
{
private readonly IDocumentIngestorService _documentIngestor;
public DocumentController(IDocumentIngestorService documentIngestor)
{
_documentIngestor = documentIngestor;
}
[HttpPost("ingest")]
public async Task<IActionResult> IngestDocuments(
[FromBody] List<string> filePaths,
CancellationToken cancellationToken)
{
// Ingest local files
await _documentIngestor.IngestDocumentsAsync(
filePaths,
maxChunkSize: 500,
cancellationToken: cancellationToken);
return Ok("Ingestion completed");
}
}
[HttpGet("search")]
public async Task<IActionResult> Search(
[FromQuery] string query,
[FromQuery] ulong limit = 10,
CancellationToken cancellationToken = default)
{
var results = await _documentIngestor.SearchDocumentsAsync(
query,
limit: limit,
cancellationToken: cancellationToken);
return Ok(results);
}
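Hybrid search has to merge a vector ranking (Qdrant) with a keyword ranking (Elasticsearch). This README does not say how the SDK fuses the two, so the sketch below shows one common technique, reciprocal rank fusion (RRF), purely as an illustration. The `Fuse` function, the chunk ids, and the constant `k = 60` are assumptions, not SDK behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Reciprocal rank fusion: each list contributes 1 / (k + rank) per document,
// where rank starts at 1. Documents ranked highly in both lists rise to the top.
static List<string> Fuse(IReadOnlyList<string> vectorHits, IReadOnlyList<string> keywordHits, int k = 60)
{
    var scores = new Dictionary<string, double>();
    foreach (var hits in new[] { vectorHits, keywordHits })
    {
        for (var i = 0; i < hits.Count; i++)
        {
            scores.TryGetValue(hits[i], out var current); // 0 when the id is new
            scores[hits[i]] = current + 1.0 / (k + i + 1);
        }
    }
    // Highest fused score first; ties broken by id for a stable order.
    return scores
        .OrderByDescending(pair => pair.Value)
        .ThenBy(pair => pair.Key, StringComparer.Ordinal)
        .Select(pair => pair.Key)
        .ToList();
}

var fused = Fuse(
    new[] { "chunk-a", "chunk-b", "chunk-c" },   // hypothetical vector-search order
    new[] { "chunk-b", "chunk-d", "chunk-a" });  // hypothetical keyword-search order
Console.WriteLine(string.Join(", ", fused));     // chunk-b, chunk-a, chunk-d, chunk-c
```

RRF only needs ranks, not raw scores, which is convenient because cosine similarity and BM25 scores are not on comparable scales.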
[HttpGet("ask")]
public async Task<IActionResult> AskQuestion(
[FromQuery] string question,
[FromQuery] ulong contextLimit = 5,
CancellationToken cancellationToken = default)
{
var response = await _documentIngestor.SearchAndGetRagResponseAsync(
question,
limit: contextLimit,
cancellationToken: cancellationToken);
return Ok(new
{
answer = response.Answer,
sources = response.ReferencesPath.Keys
});
}
The SDK supports ingesting from multiple sources:
var localFiles = new[]
{
@"C:\Documents\report.pdf",
@"C:\Documents\presentation.pptx"
};
await _documentIngestor.IngestDocumentsAsync(localFiles);
// Using OneDrive URIs
var oneDriveFiles = new[]
{
"onedrive://{driveId}/{itemId}",
"https://1drv.ms/u/s!xxxxxxxxxxxxxx",
"https://contoso.sharepoint.com/:b:/g/documents/report.pdf"
};
await _documentIngestor.IngestDocumentsAsync(oneDriveFiles);
OneDrive Configuration (add to appsettings.json):
{
"AzureAd": {
"Instance": "https://login.microsoftonline.com/",
"Domain": "contoso.onmicrosoft.com",
"TenantId": "your-tenant-id",
"ClientId": "your-client-id",
"ClientSecret": "your-client-secret",
"CallbackPath": "/signin-oidc"
}
}
Required Microsoft Graph permissions: Files.Read, Files.Read.All
// Using Google Drive URIs
var googleDriveFiles = new[]
{
"gdrive://{fileId}",
"https://drive.google.com/file/d/{fileId}/view"
};
await _documentIngestor.IngestDocumentsAsync(googleDriveFiles);
Google Drive Configuration (add to appsettings.json):
{
"Google": {
"ClientId": "your-client-id.apps.googleusercontent.com",
"ClientSecret": "your-client-secret",
"ApplicationName": "YourAppName"
}
}
Required Google API scope: https://www.googleapis.com/auth/drive.readonly
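Because IngestDocumentsAsync accepts local paths and cloud URIs in the same list, it can help to check up front which source each entry will be treated as. The sketch below classifies inputs using only the URI shapes shown above. `ClassifySource` is a hypothetical helper, and the SDK's resolvers may accept other forms.

```csharp
using System;

// Hypothetical classifier based only on the URI shapes shown in this README.
static string ClassifySource(string input)
{
    if (input.StartsWith("onedrive://", StringComparison.OrdinalIgnoreCase))
        return "OneDrive";
    if (input.StartsWith("gdrive://", StringComparison.OrdinalIgnoreCase))
        return "GoogleDrive";

    // Only absolute http(s) URLs are inspected by host; everything else is a local path.
    if (Uri.TryCreate(input, UriKind.Absolute, out var uri) &&
        (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
    {
        var host = uri.Host.ToLowerInvariant();
        if (host == "1drv.ms" || host.EndsWith(".sharepoint.com", StringComparison.Ordinal))
            return "OneDrive";
        if (host == "drive.google.com")
            return "GoogleDrive";
        return "UnknownUrl";
    }
    return "Local";
}

Console.WriteLine(ClassifySource(@"C:\Documents\report.pdf"));
Console.WriteLine(ClassifySource("https://contoso.sharepoint.com/:b:/g/documents/report.pdf"));
Console.WriteLine(ClassifySource("gdrive://abc123"));
```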
Monitor ingestion progress with events:
public class IngestionService
{
private readonly IDocumentIngestorService _documentIngestor;
public IngestionService(IDocumentIngestorService documentIngestor)
{
_documentIngestor = documentIngestor;
// Subscribe to progress events
_documentIngestor.OnProgress += OnIngestionProgress;
_documentIngestor.OnCompleted += OnIngestionCompleted;
}
// Handlers are synchronous: nothing is awaited, so async void is unnecessary
private void OnIngestionProgress(object? sender, IngestionProgress e)
{
Console.WriteLine($"Progress: {e.Completed}/{e.Total} - {e.FilePath}");
}
private void OnIngestionCompleted(object? sender, IngestionProgress e)
{
Console.WriteLine($"Ingestion completed: {e.Total} documents processed");
}
}
Or query progress directly:
var progress = await _documentIngestor.GetProgressAsync(cancellationToken);
Console.WriteLine($"{progress.Completed} of {progress.Total} files processed");
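When events are not convenient, for example in a separate request or process, the progress query can be polled. The sketch below is a minimal polling loop. The delegate stands in for a call such as GetProgressAsync and its tuple shape is a hypothetical stand-in, chosen so the example runs without the SDK.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Polls a progress source until all files are processed or the token is cancelled.
// A null result (no progress reported yet) is simply retried after the interval.
static async Task<int> PollUntilDoneAsync(
    Func<Task<(int Completed, int Total)?>> getProgress,
    TimeSpan interval,
    CancellationToken cancellationToken = default)
{
    var polls = 0;
    while (true)
    {
        cancellationToken.ThrowIfCancellationRequested();
        var progress = await getProgress();
        polls++;
        if (progress is { } p)
        {
            Console.WriteLine($"{p.Completed} of {p.Total} files processed");
            if (p.Total > 0 && p.Completed >= p.Total) return polls;
        }
        await Task.Delay(interval, cancellationToken);
    }
}

// Simulated source: one more file completes on every poll.
var done = 0;
var pollCount = await PollUntilDoneAsync(
    () => Task.FromResult<(int Completed, int Total)?>((++done, 3)),
    TimeSpan.FromMilliseconds(10));
Console.WriteLine($"Finished after {pollCount} polls");
```

Passing a cancellation token with a timeout keeps the loop from running indefinitely if no ingestion is in flight.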
var ingestedDocs = await _documentIngestor.ListIngestedDocumentsAsync(cancellationToken);
foreach (var doc in ingestedDocs)
{
Console.WriteLine($"File: {doc.Metadata.FileName}");
Console.WriteLine($"Source: {doc.Metadata.Source}");
Console.WriteLine($"Path: {doc.Metadata.FilePath}");
Console.WriteLine($"Ingested: {doc.CreatedAt}");
}
// Remove all ingested documents from both vector and keyword stores
await _documentIngestor.FlushAsync(cancellationToken);
SemanticDocIngestor.Core
├── Services
│   └── DocumentIngestorService        # Main orchestration service
├── Domain
│   ├── Abstractions
│   │   ├── Services
│   │   │   ├── IDocumentIngestorService
│   │   │   └── IRagService
│   │   ├── Factories
│   │   │   ├── IDocumentProcessor
│   │   │   └── ICloudFileResolver
│   │   └── Persistence
│   │       ├── IVectorStore
│   │       └── IElasticStore
│   ├── Entities
│   │   └── DocumentChunk
│   └── DTOs
└── Infrastructure
    ├── Factories
    │   ├── DocumentProcessor          # PDF, DOCX, XLSX processing
    │   ├── OneDriveFileResolver       # Microsoft Graph integration
    │   └── GoogleDriveFileResolver    # Google Drive API integration
    ├── Persistence
    │   ├── VectorStore                # Qdrant implementation
    │   └── ElasticStore               # Elasticsearch implementation
    └── Middlewares
        ├── ResiliencyMiddleware       # Polly-based resilience
        ├── RequestLoggingMiddleware
        └── ExceptionHandlingMiddleware
IDocumentIngestorService is the main service interface for document ingestion and search.
IngestDocumentsAsync
Task IngestDocumentsAsync(
IEnumerable<string> documentPaths,
int maxChunkSize = 500,
CancellationToken cancellationToken = default)
Ingest documents from local paths or cloud URIs.
IngestFolderAsync
Task IngestFolderAsync(
string folderPath,
CancellationToken cancellationToken = default)
Recursively ingest all supported documents from a folder.
SearchDocumentsAsync
Task<List<DocumentChunkDto>> SearchDocumentsAsync(
string query,
ulong limit = 10,
CancellationToken cancellationToken = default)
Perform hybrid search across ingested documents.
SearchAndGetRagResponseAsync
Task<SearchAndGetRagResponseDto> SearchAndGetRagResponseAsync(
string search,
ulong limit = 5,
CancellationToken cancellationToken = default)
Search documents and generate an AI-powered answer using RAG.
SearchAndGetRagStreamResponseAsync
Task<SearchAndGetRagStreamingResponseDto> SearchAndGetRagStreamResponseAsync(
string search,
ulong limit = 5,
CancellationToken cancellationToken = default)
Search documents and stream an AI-powered answer in real-time.
ListIngestedDocumentsAsync
Task<List<DocumentRepoItemDto>> ListIngestedDocumentsAsync(
CancellationToken cancellationToken = default)
Get a list of all ingested documents with metadata.
GetProgressAsync
Task<IngestionProgress?> GetProgressAsync(
CancellationToken cancellationToken = default)
Get current ingestion progress.
FlushAsync
Task FlushAsync(CancellationToken cancellationToken = default)
Delete all ingested documents from both stores.
OnProgress
event EventHandler<IngestionProgress>? OnProgress
Raised during ingestion with progress updates.
OnCompleted
event EventHandler<IngestionProgress>? OnCompleted
Raised when ingestion completes.
Ollama settings (AppSettings:Ollama):
| Property | Description | Default |
|---|---|---|
| ChatModel | Model for RAG chat completions | llama3.2 |
| EmbeddingModel | Model for vector embeddings | nomic-embed-text |
| Temperature | Response randomness (0.0-1.0) | 0.7 |
| MaxTokens | Maximum response tokens | 2048 |
Qdrant settings (AppSettings:Qdrant):
| Property | Description | Default |
|---|---|---|
| CollectionName | Vector collection name | documents |
| VectorSize | Embedding vector dimensions | 768 |
| Distance | Distance metric | Cosine |
Elasticsearch settings (AppSettings:Elastic):
| Property | Description | Default |
|---|---|---|
| SemanticDocIndexName | Index for document chunks | semantic_docs |
| DocRepoIndexName | Index for document metadata | docs_repo |
Resiliency settings (ResiliencyMiddlewareOptions):
| Property | Description | Default |
|---|---|---|
| RetryCount | Number of retry attempts | 3 |
| TimeoutSeconds | Request timeout in seconds | 30 |
| ExceptionsAllowedBeforeCircuitBreaking | Exceptions allowed before the circuit opens | 5 |
| CircuitBreakingDurationSeconds | Seconds the circuit stays open | 60 |
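To make RetryCount concrete, the sketch below shows the usual retry convention, which is also the one Polly follows: the count is the number of retries after the first attempt. Whether the SDK's middleware applies it exactly this way is an assumption. `ExecuteWithRetry` is a hypothetical helper and covers only the retry setting, not timeouts or circuit breaking.

```csharp
using System;

// Runs an action, retrying up to retryCount times after the first failure.
// With retryCount = 3 the action is attempted at most 4 times in total.
static T ExecuteWithRetry<T>(Func<T> action, int retryCount)
{
    for (var attempt = 0; ; attempt++)
    {
        try
        {
            return action();
        }
        catch (Exception) when (attempt < retryCount)
        {
            // Swallow and loop; the final failure propagates to the caller.
        }
    }
}

var calls = 0;
var result = ExecuteWithRetry(() =>
{
    calls++;
    if (calls < 3) throw new InvalidOperationException("transient failure");
    return "ok";
}, retryCount: 3);
Console.WriteLine($"{result} after {calls} attempts"); // ok after 3 attempts
```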
Complete example from the reference implementation:
using SemanticDocIngestor.Core;
using SemanticDocIngestor.Domain.Abstractions.Services;
using Microsoft.AspNetCore.Mvc;
var builder = WebApplication.CreateBuilder(args);
// Add SemanticDocIngestor
builder.Services.AddSemanticDocIngestorCore(builder.Configuration);
builder.Services.AddControllers();
var app = builder.Build();
var loggerFactory = app.Services.GetRequiredService<ILoggerFactory>();
app.UseSemanticDocIngestorCore(app.Configuration, loggerFactory);
app.MapControllers();
app.Run();
// Controller
[ApiController]
[Route("[controller]")]
public class IngestionController : ControllerBase
{
private readonly IDocumentIngestorService _documentIngestor;
private readonly IWebHostEnvironment _env;
public IngestionController(
IDocumentIngestorService documentIngestor,
IWebHostEnvironment env)
{
_documentIngestor = documentIngestor;
_env = env;
}
[HttpPost("ingest-files")]
public async Task<IActionResult> IngestFiles(
[FromBody] List<string> filesPath,
CancellationToken cancellationToken)
{
var fullPaths = filesPath.Select(f =>
Path.Combine(_env.WebRootPath, f));
await _documentIngestor.IngestDocumentsAsync(
fullPaths,
cancellationToken: cancellationToken);
return Created();
}
[HttpGet("search")]
public async Task<IActionResult> Search(
[FromQuery] string search,
[FromQuery] ulong limit = 5,
CancellationToken cancellationToken = default)
{
var response = await _documentIngestor
.SearchAndGetRagResponseAsync(
search,
limit,
cancellationToken);
return Ok(response);
}
[HttpGet("ingested-files")]
public async Task<IActionResult> GetIngestedFiles(
CancellationToken cancellationToken)
{
var files = await _documentIngestor
.ListIngestedDocumentsAsync(cancellationToken);
return Ok(files);
}
[HttpGet("progress")]
public async Task<IActionResult> GetProgress(
CancellationToken cancellationToken)
{
var progress = await _documentIngestor
.GetProgressAsync(cancellationToken);
return Ok(progress);
}
[HttpDelete("flush-db")]
public async Task<IActionResult> Flush(
CancellationToken cancellationToken)
{
await _documentIngestor.FlushAsync(cancellationToken);
return Ok();
}
}
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Logging;
using SemanticDocIngestor.Core;
using SemanticDocIngestor.Domain.Abstractions.Services;
var configuration = new ConfigurationBuilder()
.AddJsonFile("appsettings.json")
.Build();
var services = new ServiceCollection();
services.AddLogging(builder => builder.AddConsole());
services.AddSemanticDocIngestorCore(configuration);
var serviceProvider = services.BuildServiceProvider();
var documentIngestor = serviceProvider
.GetRequiredService<IDocumentIngestorService>();
// Subscribe to progress
documentIngestor.OnProgress += (sender, progress) =>
{
Console.WriteLine($"Progress: {progress.Completed}/{progress.Total}");
};
// Ingest documents
var files = new[]
{
@"C:\docs\report.pdf",
@"C:\docs\presentation.pptx"
};
await documentIngestor.IngestDocumentsAsync(files);
// Search with RAG
var question = "What are the key findings in the report?";
var response = await documentIngestor
.SearchAndGetRagResponseAsync(question, limit: 5);
Console.WriteLine($"Answer: {response.Answer}");
Console.WriteLine($"Sources: {string.Join(", ", response.ReferencesPath.Keys)}");
Implement IDocumentProcessor for custom file types:
public class CustomDocumentProcessor : IDocumentProcessor
{
public List<string> SupportedFileExtensions => new() { ".custom" };
public async Task<List<DocumentChunk>> ProcessDocument(
string filePath,
int maxChunkSize = 500,
CancellationToken cancellationToken = default)
{
// Your custom processing logic
var chunks = new List<DocumentChunk>();
// ... parse and chunk the document
return chunks;
}
}
// Register in DI
builder.Services.AddSingleton<IDocumentProcessor, CustomDocumentProcessor>();
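The "parse and chunk" placeholder in the processor above is where text gets split to fit maxChunkSize. The sketch below shows one simple approach: greedy packing on word boundaries. It assumes maxChunkSize counts characters, which this README does not state, and `ChunkText` is a hypothetical helper that returns plain strings rather than the SDK's DocumentChunk type.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: greedily packs whitespace-separated words into chunks
// of at most maxChunkSize characters. A single word longer than the limit
// becomes its own oversized chunk rather than being cut mid-word.
static List<string> ChunkText(string text, int maxChunkSize = 500)
{
    if (maxChunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxChunkSize));

    var chunks = new List<string>();
    var current = new StringBuilder();
    // A null separator array splits on any whitespace.
    foreach (var word in text.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries))
    {
        // +1 accounts for the space that would join the word to the chunk.
        if (current.Length > 0 && current.Length + 1 + word.Length > maxChunkSize)
        {
            chunks.Add(current.ToString());
            current.Clear();
        }
        if (current.Length > 0) current.Append(' ');
        current.Append(word);
    }
    if (current.Length > 0) chunks.Add(current.ToString());
    return chunks;
}

foreach (var chunk in ChunkText("one two three four five", maxChunkSize: 9))
{
    Console.WriteLine($"[{chunk}]");
}
```

Each resulting string would then be wrapped in whatever chunk entity the processor returns, together with the source file metadata.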
Implement ICloudFileResolver for custom cloud storage:
public class S3FileResolver : ICloudFileResolver
{
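public bool CanResolve(string input)
{
// Ordinal comparison keeps the scheme check culture-independent
return input.StartsWith("s3://", StringComparison.OrdinalIgnoreCase);
}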
public async Task<ResolvedCloudFile> ResolveAsync(
string input,
CancellationToken ct = default)
{
// Download from S3 to a temp location (DownloadFromS3 is a helper you supply)
var localPath = await DownloadFromS3(input, ct);
return new ResolvedCloudFile(
localPath,
input, // identity
IngestionSource.Cloud);
}
}
// Register in DI
builder.Services.AddSingleton<ICloudFileResolver, S3FileResolver>();
Service not discovered:
# Check Aspire dashboard (http://localhost:15888)
# Ensure services are running and healthy
# Verify service names match in AppHost configuration
Container startup issues:
# View container logs in Aspire dashboard
# Check Docker Desktop or container runtime
# Verify port availability (9200, 6334, 11434)
Elasticsearch not reachable:
# Test connection
curl http://localhost:9200
# Check connection string in appsettings.json
# Verify Elasticsearch is running: docker ps
Qdrant not reachable:
# Test via the HTTP REST API (port 6333); curl cannot talk to the gRPC port 6334
curl http://localhost:6333/collections
# Check connection string in appsettings.json
# Verify Qdrant is running: docker ps
# Ensure you're using the correct port:
# - gRPC (default): port 6334
# - HTTP REST API: port 6333
Ollama not reachable:
# Test connection
curl http://localhost:11434/api/tags
# Check if Ollama is running
# Verify models are installed: ollama list
Add connection string validation at startup:
var elasticConnection = builder.Configuration.GetConnectionString("elasticsearch");
if (string.IsNullOrEmpty(elasticConnection))
{
throw new InvalidOperationException("Elasticsearch connection string is not configured");
}
var qdrantConnection = builder.Configuration.GetConnectionString("qdrant");
if (string.IsNullOrEmpty(qdrantConnection))
{
throw new InvalidOperationException("Qdrant connection string is not configured");
}
Model not found:
# Pull required models
ollama pull llama3.2
ollama pull nomic-embed-text
# Verify models are available
ollama list
Embedding dimension mismatch:
Ensure VectorSize in Qdrant settings matches your embedding model output:
- nomic-embed-text: 768 dimensions
- all-minilm: 384 dimensions
- text-embedding-ada-002: 1536 dimensions
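A mismatch is easier to catch at startup than at ingestion time. The sketch below checks a configured VectorSize against the dimensions listed above. `VectorSizeMatches` is a hypothetical helper, and stripping an Ollama-style `:tag` suffix from the model name is an assumption about how the model is configured.

```csharp
using System;
using System.Collections.Generic;

// Returns false only when the model is known and its dimension differs from
// vectorSize. Unknown models cannot be checked here, so they pass.
static bool VectorSizeMatches(string embeddingModel, int vectorSize)
{
    // Dimensions listed in this README; extend the table for other models.
    var known = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
    {
        ["nomic-embed-text"] = 768,
        ["all-minilm"] = 384,
        ["text-embedding-ada-002"] = 1536,
    };

    // Drop a tag such as ":latest" before the lookup.
    var baseName = embeddingModel.Split(':')[0];
    return !known.TryGetValue(baseName, out var expected) || expected == vectorSize;
}

Console.WriteLine(VectorSizeMatches("nomic-embed-text", 768)); // True
Console.WriteLine(VectorSizeMatches("all-minilm", 768));       // False
```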
For manual deployment without Aspire:
version: '3.8'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
environment:
- discovery.type=single-node
- xpack.security.enabled=false
ports:
- "9200:9200"
volumes:
- elasticsearch-data:/usr/share/elasticsearch/data
qdrant:
image: qdrant/qdrant:latest
ports:
- "6333:6333"
- "6334:6334"
volumes:
- qdrant-data:/qdrant/storage
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
volumes:
- ollama-data:/root/.ollama
api:
build: .
ports:
- "8080:8080"
environment:
- ConnectionStrings__elasticsearch=http://elasticsearch:9200
- ConnectionStrings__qdrant=http://qdrant:6334
- ConnectionStrings__ollama=http://ollama:11434
depends_on:
- elasticsearch
- qdrant
- ollama
volumes:
elasticsearch-data:
qdrant-data:
ollama-data:
- dotnet run with AppHost project
- dotnet publish
- azd up (Azure Developer CLI)
Licensed under the MIT License.
Built with:
Made with ❤️ by Ramin Esfahani