OllamaFlow is a lightweight, intelligent orchestration layer that unifies multiple AI backend instances into a high-availability inference cluster. Supporting both Ollama and OpenAI API formats on the frontend with native transformation capabilities, OllamaFlow delivers scalability, high availability, and security control - enabling you to scale AI workloads across multiple backends while ensuring zero-downtime model serving and fine-grained control over inference and embeddings deployments.

Install from NuGet:

```bash
$ dotnet add package OllamaFlow.Core
```
Run with Docker:

```bash
# Pull the image
docker pull jchristn/ollamaflow:v1.1.0

# Run with default configuration
docker run -d \
  -p 43411:43411 \
  -v $(pwd)/ollamaflow.json:/app/ollamaflow.json \
  -v $(pwd)/ollamaflow.db:/app/ollamaflow.db \
  jchristn/ollamaflow:v1.1.0
```
Or build and run from source:

```bash
# Clone the repository
git clone https://github.com/jchristn/ollamaflow.git
cd ollamaflow/src

# Build and run
dotnet build
cd OllamaFlow.Server/bin/Debug/net8.0
dotnet OllamaFlow.Server.dll
```
OllamaFlow uses a simple JSON configuration file named `ollamaflow.json`. Here's a minimal example:
```json
{
  "Webserver": {
    "Hostname": "*",
    "Port": 43411
  },
  "Logging": {
    "MinimumSeverity": 6,
    "ConsoleLogging": true
  },
  "Frontends": ["..."],
  "Backends": ["..."]
}
```
Frontends define your virtual Ollama endpoints:
```json
{
  "Identifier": "main-frontend",
  "Name": "Production Ollama Frontend",
  "Hostname": "*",
  "LoadBalancing": "RoundRobin",
  "Backends": ["gpu-1", "gpu-2", "gpu-3"],
  "RequiredModels": ["llama3", "all-minilm"],
  "AllowEmbeddings": true,
  "AllowCompletions": true,
  "PinnedEmbeddingsProperties": {
    "model": "all-minilm"
  },
  "PinnedCompletionsProperties": {
    "model": "llama3",
    "options": {
      "num_ctx": 4096,
      "temperature": 0.3
    }
  }
}
```
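The `RoundRobin` strategy cycles requests through the listed backends in order. A minimal sketch of the idea (illustrative Python, not OllamaFlow's actual implementation):

```python
from itertools import cycle

# Backends listed in the frontend definition above
backends = ["gpu-1", "gpu-2", "gpu-3"]

# Round-robin: each request goes to the next backend in order, wrapping around
rr = cycle(backends)
picks = [next(rr) for _ in range(5)]
print(picks)  # ['gpu-1', 'gpu-2', 'gpu-3', 'gpu-1', 'gpu-2']
```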
Backends represent your actual AI inference instances (Ollama, OpenAI, vLLM, SharpAI, etc.):
```json
{
  "Identifier": "gpu-1",
  "Name": "GPU Server 1",
  "Hostname": "192.168.1.100",
  "Port": 11434,
  "MaxParallelRequests": 4,
  "HealthCheckMethod": "HEAD",
  "HealthCheckUrl": "/",
  "UnhealthyThreshold": 2,
  "ApiFormat": "Ollama",
  "AllowEmbeddings": true,
  "AllowCompletions": true,
  "PinnedEmbeddingsProperties": {
    "model": "all-minilm"
  },
  "PinnedCompletionsProperties": {
    "model": "llama3",
    "options": {
      "num_ctx": 4096,
      "temperature": 0.3
    }
  }
}
```
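The health-check fields drive backend eligibility. One plausible reading of `UnhealthyThreshold` (an assumption, not confirmed here) is that a backend leaves rotation after that many consecutive failed checks:

```python
def is_healthy(check_results, unhealthy_threshold=2):
    """Assumed semantics: return False once the backend has failed
    `unhealthy_threshold` consecutive health checks (e.g. HEAD /)."""
    consecutive_failures = 0
    for ok in check_results:
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= unhealthy_threshold:
            return False
    return True

print(is_healthy([True, False, True, False]))   # True: failures never consecutive
print(is_healthy([True, False, False, True]))   # False: two failures in a row
```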
OllamaFlow provides universal API compatibility with native transformation between formats:
Ollama API:
- `/api/generate` - Text generation
- `/api/chat` - Chat completions
- `/api/pull` - Model pulling
- `/api/push` - Model pushing
- `/api/show` - Model information
- `/api/tags` - List models
- `/api/ps` - Running models
- `/api/embed` - Embeddings
- `/api/delete` - Model deletion

OpenAI API:

- `/v1/chat/completions` - Chat completions
- `/v1/completions` - Text completions
- `/v1/embeddings` - Text embeddings

OllamaFlow provides fine-grained control over request types and parameters at both the frontend and backend levels.
Control which request types are allowed using the `AllowEmbeddings` and `AllowCompletions` boolean properties. Both the frontend and the backend must set a property to `true` for a request of that type to succeed; if either side disallows it, the request fails.
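The both-levels rule can be sketched as follows (illustrative Python; defaulting a missing flag to `True` is an assumption):

```python
def request_allowed(frontend, backend, kind):
    """A request type succeeds only if BOTH the frontend and the selected
    backend allow it; either side can veto."""
    key = {"embeddings": "AllowEmbeddings", "completions": "AllowCompletions"}[kind]
    return frontend.get(key, True) and backend.get(key, True)

frontend = {"AllowEmbeddings": True, "AllowCompletions": True}
backend = {"AllowEmbeddings": True, "AllowCompletions": False}

print(request_allowed(frontend, backend, "embeddings"))   # True
print(request_allowed(frontend, backend, "completions"))  # False: backend vetoes
```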
Force specific properties into requests using the `PinnedEmbeddingsProperties` and `PinnedCompletionsProperties` dictionaries. Pinned properties are merged into each request, including nested values such as the `options` object, and can be set at both the frontend and backend levels. For example:
```json
{
  "Identifier": "secured-frontend",
  "PinnedCompletionsProperties": {
    "model": "llama3",
    "options": {
      "temperature": 0.3,
      "num_ctx": 4096,
      "stop": ["[DONE]", "\n\n"]
    }
  }
}
```
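One way to picture how pinned properties are applied is a recursive merge where pinned values win and nested dictionaries like `options` are merged key by key. This is an illustrative sketch of that idea, not OllamaFlow's actual code:

```python
def apply_pinned(request, pinned):
    """Merge pinned properties into a client request: pinned values
    override client values; nested dicts are merged recursively."""
    merged = dict(request)
    for key, value in pinned.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_pinned(merged[key], value)
        else:
            merged[key] = value
    return merged

client_request = {"model": "mistral", "prompt": "hi", "options": {"temperature": 0.9}}
pinned = {"model": "llama3", "options": {"temperature": 0.3, "num_ctx": 4096}}

merged = apply_pinned(client_request, pinned)
print(merged["model"])                   # llama3 (pinned wins)
print(merged["options"]["temperature"])  # 0.3 (pinned wins inside "options")
print(merged["options"]["num_ctx"])      # 4096 (added by the pin)
print(merged["prompt"])                  # hi (client value preserved)
```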
Test with multiple AI backend instances using Docker Compose:
```bash
cd Docker
docker compose -f compose-ollama.yaml up -d
```
This spins up 4 Ollama instances on ports 11435-11438 for testing load balancing and transformation capabilities.
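To point OllamaFlow at those four instances, you need one backend entry per port. A small generator sketch (the identifiers and hostname are assumptions; adjust to your setup):

```python
import json

# One backend definition per compose instance on ports 11435-11438
backends = [
    {
        "Identifier": f"ollama-{i}",  # hypothetical identifiers
        "Hostname": "localhost",
        "Port": port,
        "ApiFormat": "Ollama",
    }
    for i, port in enumerate(range(11435, 11439), start=1)
]

print(json.dumps(backends, indent=2))
```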
Manage your cluster programmatically:
```bash
# List all backends
curl -H "Authorization: Bearer your-token" \
     http://localhost:43411/v1.0/backends

# Add a new backend
curl -X PUT \
     -H "Authorization: Bearer your-token" \
     -H "Content-Type: application/json" \
     -d '{"Identifier": "gpu-4", "Hostname": "192.168.1.104", "Port": 11434}' \
     http://localhost:43411/v1.0/backends
```
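The same calls can be scripted. A sketch using only the Python standard library (the base URL and bearer token are placeholders carried over from the examples above):

```python
import json
import urllib.request

def admin_request(method, path, token, body=None):
    """Build an authenticated request for the OllamaFlow admin API.
    Pass the result to urllib.request.urlopen() against a live server."""
    data = json.dumps(body).encode() if body is not None else None
    return urllib.request.Request(
        url="http://localhost:43411" + path,
        method=method,
        data=data,
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
    )

req = admin_request(
    "PUT", "/v1.0/backends", "your-token",
    {"Identifier": "gpu-4", "Hostname": "192.168.1.104", "Port": 11434},
)
print(req.get_method(), req.full_url)  # PUT http://localhost:43411/v1.0/backends
```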
A complete Postman collection (`OllamaFlow.postman_collection.json`) is included in the repository root with examples for all API endpoints, including the Ollama API, the OpenAI API, and the administrative APIs, along with native transformation examples.
For interactive API testing and experimentation, the OllamaFlow API Explorer provides a web-based dashboard for exploring and testing all OllamaFlow endpoints.
For a visual interface, check out the OllamaFlow Web UI which provides a dashboard for cluster management and monitoring.
We welcome contributions! Please check out our Contributing Guidelines and feel free to:

1. Create your feature branch (`git checkout -b feature/AmazingFeature`)
2. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
3. Push to the branch (`git push origin feature/AmazingFeature`)
4. Open a pull request

This project is licensed under the MIT License - see the LICENSE file for details.