Azure OpenAI and Azure AI Foundry integration for TokenRateGate. Eliminates HTTP 429 rate limit errors with intelligent TPM/RPM management, managed identity support, and multi-deployment capabilities.
$ dotnet add package TokenRateGate.Azure

Stop getting HTTP 429 "Rate limit exceeded" errors from Azure OpenAI, OpenAI, and Anthropic Claude APIs.
TokenRateGate is a .NET library that prevents rate limit errors by intelligently managing your token and request budgets. It tracks both TPM (Tokens-Per-Minute) and RPM (Requests-Per-Minute) limits, automatically queues requests when capacity is full, and ensures you never hit the dreaded 429 error again.
Getting this error when calling LLM APIs?
HTTP 429: Too Many Requests
Rate limit is exceeded. Try again in X seconds.
This happens when you exceed your API provider's TPM (tokens-per-minute) or RPM (requests-per-minute) limits.
TokenRateGate prevents these errors by managing your token budget and queueing requests before they hit the API.
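The core flow is reserve, call, record. A minimal sketch of that pattern (`ReserveTokensAsync` and `RecordActualUsage` are the library API used throughout this README; `CallLlmApiAsync` is a hypothetical stand-in for your provider call):

```csharp
// Sketch: reserve capacity, call the API, then reconcile with actual usage.
public async Task<string> AskAsync(ITokenRateGate rateGate, string prompt)
{
    // Reserve an estimated budget before the call. If the window is full,
    // this awaits until capacity frees up instead of letting the API return 429.
    await using var reservation = await rateGate.ReserveTokensAsync(
        estimatedInputTokens: 500,
        estimatedOutputTokens: 1000);

    var response = await CallLlmApiAsync(prompt); // your provider call

    // Reconcile the reservation with what the API actually consumed.
    reservation.RecordActualUsage(response.Usage.TotalTokens);
    return response.Content;
}
```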
For OpenAI or Azure OpenAI users:
dotnet add package TokenRateGate
Includes everything: base engine, DI support, OpenAI integration, and Azure integration.
For Anthropic Claude, Google Gemini, or custom APIs:
dotnet add package TokenRateGate.Base
Includes: Core engine, DI support, character-based token estimation. Use when: Building custom integrations without OpenAI/Azure SDKs.
# Base + OpenAI only
dotnet add package TokenRateGate.Base
dotnet add package TokenRateGate.OpenAI
# Base + Azure only
dotnet add package TokenRateGate.Base
dotnet add package TokenRateGate.Azure
The easiest way to use TokenRateGate is with dependency injection:
using TokenRateGate.Extensions.DependencyInjection;
// In your Program.cs or Startup.cs
var builder = WebApplication.CreateBuilder(args);
// Register TokenRateGate with configuration
builder.Services.AddTokenRateGate(options =>
{
options.TokenLimit = 150000; // 150K tokens per minute (Azure Standard tier)
options.WindowSeconds = 60; // 60-second sliding window
options.SafetyBufferPercentage = 0.05; // 5% safety buffer (avoids hitting exact limit)
options.MaxConcurrentRequests = 10; // Limit concurrent API calls
options.MaxRequestsPerMinute = 100; // Optional: Also enforce RPM limit
});
var app = builder.Build();
Or bind from configuration:
// appsettings.json
{
"TokenRateGate": {
"TokenLimit": 500000,
"WindowSeconds": 60,
"MaxConcurrentRequests": 10
}
}
builder.Services.AddTokenRateGate(
builder.Configuration.GetSection("TokenRateGate"));
For APIs without built-in token estimation (Anthropic Claude, Google Gemini, custom LLMs):
using TokenRateGate.Core;
using TokenRateGate.Core.TokenEstimation;
using TokenRateGate.Abstractions;
public class CustomLlmService
{
private readonly ITokenRateGate _rateGate;
private readonly ITokenEstimator _tokenEstimator;
private readonly ILogger<CustomLlmService> _logger;
public CustomLlmService(ITokenRateGate rateGate, ILogger<CustomLlmService> logger)
{
_rateGate = rateGate;
_logger = logger;
// Use character-based estimation (4 chars ≈ 1 token for most LLMs)
_tokenEstimator = new CharacterBasedTokenEstimator();
}
public async Task<string> CallCustomLlmAsync(string prompt)
{
// Estimate tokens using character-based estimator
int estimatedInputTokens = _tokenEstimator.EstimateTokens(prompt);
int estimatedOutputTokens = 1000; // Your estimated response size
// Reserve capacity before calling the LLM
await using var reservation = await _rateGate.ReserveTokensAsync(
estimatedInputTokens,
estimatedOutputTokens);
_logger.LogInformation("Reserved {Tokens} tokens", reservation.ReservedTokens);
// Make your custom LLM API call
var response = await CallYourCustomApiAsync(prompt);
// Record actual usage from the response (IMPORTANT for accurate tracking)
var actualTotalTokens = response.Usage.TotalTokens;
reservation.RecordActualUsage(actualTotalTokens);
_logger.LogInformation("Actual usage: {Tokens} tokens", actualTotalTokens);
return response.Content;
}
}
Using CharacterBasedTokenEstimator:
// Default: 4 characters per token
var estimator = new CharacterBasedTokenEstimator();
int tokens = estimator.EstimateTokens("Hello world!"); // ≈ 3 tokens
// Custom ratio for different languages
var chineseEstimator = new CharacterBasedTokenEstimator(charactersPerToken: 2.0);
int chineseTokens = chineseEstimator.EstimateTokens("你好世界"); // Better for non-Latin scripts
// Estimate multiple texts
var messages = new[] { "System prompt", "User message", "Assistant response" };
int totalTokens = estimator.EstimateTokens(messages);
TokenRateGate integrates seamlessly with the OpenAI SDK.
Note: Logging is optional. The WithRateLimit() extension method accepts an optional ILoggerFactory parameter for diagnostics. If not provided, logging is disabled.
using System.Text;
using Microsoft.Extensions.Configuration;
using OpenAI.Chat;
using TokenRateGate.Abstractions;
using TokenRateGate.OpenAI;
public class ChatService
{
private readonly ITokenRateGate _rateGate;
private readonly string _apiKey;
public ChatService(
ITokenRateGate rateGate,
IConfiguration configuration)
{
_rateGate = rateGate;
_apiKey = configuration["OpenAI:ApiKey"];
}
public async Task<string> AskQuestionAsync(string question)
{
// Create OpenAI client and wrap with rate limiting
var client = new ChatClient("gpt-4", _apiKey);
var rateLimitedClient = client.WithRateLimit(_rateGate, "gpt-4");
// Make rate-limited API call - automatic token tracking!
var messages = new[] { new UserChatMessage(question) };
var response = await rateLimitedClient.CompleteChatAsync(messages);
return response.Content[0].Text;
}
public async Task<string> AskQuestionStreamingAsync(string question)
{
var client = new ChatClient("gpt-4", _apiKey);
var rateLimitedClient = client.WithRateLimit(_rateGate, "gpt-4");
var messages = new[] { new UserChatMessage(question) };
var result = new StringBuilder();
// Streaming support with automatic token tracking
await foreach (var chunk in rateLimitedClient.CompleteChatStreamingAsync(messages))
{
if (chunk.ContentUpdate.Count > 0)
{
var text = chunk.ContentUpdate[0].Text;
result.Append(text);
Console.Write(text);
}
}
return result.ToString();
}
}
Note: Logging is optional. The WithRateLimit() extension method accepts an optional ILoggerFactory parameter for diagnostics. If not provided, logging is disabled.
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;
using TokenRateGate.Abstractions;
using TokenRateGate.Azure;
public class AzureChatService
{
private readonly ITokenRateGate _rateGate;
public AzureChatService(ITokenRateGate rateGate)
{
_rateGate = rateGate;
}
public async Task<string> AskQuestionAsync(string question)
{
var azureClient = new AzureOpenAIClient(
new Uri("https://your-resource.openai.azure.com/"),
new AzureKeyCredential("your-api-key"));
// Wrap with rate limiting (deployment name + model name for token counting)
var rateLimitedClient = azureClient.WithRateLimit(
_rateGate,
deploymentName: "my-gpt4-deployment",
modelName: "gpt-4");
var messages = new[] { new UserChatMessage(question) };
var response = await rateLimitedClient.CompleteChatAsync(messages);
return response.Content[0].Text;
}
}
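Since the Azure package supports managed identity, the client can also be constructed with `Azure.Identity`'s `DefaultAzureCredential` instead of an API key. A sketch (the `WithRateLimit` call is unchanged; `Azure.Identity` is an assumed extra package reference):

```csharp
using Azure.AI.OpenAI;
using Azure.Identity; // requires the Azure.Identity NuGet package

// Authenticate with managed identity (or developer credentials locally)
// instead of an API key. The rate-limiting wrapper is applied the same way.
var azureClient = new AzureOpenAIClient(
    new Uri("https://your-resource.openai.azure.com/"),
    new DefaultAzureCredential());

var rateLimitedClient = azureClient.WithRateLimit(
    _rateGate,
    deploymentName: "my-gpt4-deployment",
    modelName: "gpt-4");
```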
Support different rate limits for different users, models, or tenants:
// Registration in Program.cs
builder.Services.AddTokenRateGateFactory();
builder.Services.AddNamedTokenRateGate("basic-tier", options =>
{
options.TokenLimit = 100000; // 100K tokens/min for basic users
options.WindowSeconds = 60;
});
builder.Services.AddNamedTokenRateGate("premium-tier", options =>
{
options.TokenLimit = 1000000; // 1M tokens/min for premium users
options.WindowSeconds = 60;
});
// Usage in your service
public class MultiTenantChatService
{
private readonly ITokenRateGateFactory _factory;
public MultiTenantChatService(ITokenRateGateFactory factory)
{
_factory = factory;
}
public async Task<string> AskQuestionAsync(string question, string tier)
{
// Get rate gate for the tenant's tier
var rateGate = _factory.GetOrCreate(tier);
var client = new ChatClient("gpt-4", "your-api-key");
var rateLimitedClient = client.WithRateLimit(rateGate, "gpt-4");
var messages = new[] { new UserChatMessage(question) };
var response = await rateLimitedClient.CompleteChatAsync(messages);
return response.Content[0].Text;
}
}
You can also use TokenRateGate without dependency injection:
using Microsoft.Extensions.Logging;
using OpenAI.Chat;
using TokenRateGate.Core;
using TokenRateGate.Core.Options;
using TokenRateGate.OpenAI;
// Create rate gate manually
var options = new TokenRateGateOptions
{
TokenLimit = 500000,
WindowSeconds = 60,
MaxConcurrentRequests = 10
};
using var loggerFactory = LoggerFactory.Create(builder =>
{
builder.AddConsole();
});
var rateGate = new TokenRateGate(options, loggerFactory);
// Use with OpenAI
var client = new ChatClient("gpt-4", "your-api-key");
var rateLimitedClient = client.WithRateLimit(rateGate, "gpt-4", loggerFactory);
var messages = new[] { new UserChatMessage("Hello!") };
var response = await rateLimitedClient.CompleteChatAsync(messages);
Console.WriteLine(response.Content[0].Text);
public class MonitoringService
{
private readonly ITokenRateGate _rateGate;
public MonitoringService(ITokenRateGate rateGate)
{
_rateGate = rateGate;
}
public void LogCurrentUsage()
{
var stats = _rateGate.GetUsageStats();
Console.WriteLine($"Current Usage: {stats.CurrentUsage}/{stats.EffectiveCapacity} tokens");
Console.WriteLine($"Reserved: {stats.ReservedTokens} tokens");
Console.WriteLine($"Available: {stats.AvailableTokens} tokens");
Console.WriteLine($"Usage: {stats.UsagePercentage:F1}%");
Console.WriteLine($"Near Capacity: {stats.IsNearCapacity}");
}
}
| Option | Default | Description |
|---|---|---|
| TokenLimit | 500000 | Maximum tokens per window (TPM limit) |
| WindowSeconds | 60 | Time window in seconds for token tracking |
| SafetyBufferPercentage | 0.05 (5%) | Percentage of TokenLimit reserved as a safety buffer<br>Effective limit = TokenLimit * (1 - SafetyBufferPercentage) |
| MaxConcurrentRequests | 1000 | Maximum concurrent active reservations |
| MaxRequestsPerMinute | null | Optional RPM limit (enforced in addition to the token limit)<br>If both are configured, whichever is more restrictive applies |
| RequestWindowSeconds | max(120, 2×WindowSeconds) | Time window in seconds for RPM tracking |
| MaxWaitTime | null (unlimited) | Maximum time to wait in the queue for capacity before timing out<br>Note: applies only to capacity-queue waiting, NOT semaphore waiting<br>Set to null for unlimited waiting (recommended for most use cases) |
| OutputEstimationStrategy | FixedMultiplier | How to estimate output tokens when not provided |
| OutputMultiplier | 0.5 | Multiplier for the FixedMultiplier strategy |
| DefaultOutputTokens | 1000 | Fixed output-token count for the FixedAmount strategy |
The FixedMultiplier strategy multiplies the input tokens by OutputMultiplier (default 0.5); the FixedAmount strategy reserves a flat DefaultOutputTokens (default 1000).

TokenRateGate uses a dual-component capacity system:
Current Capacity = Historical Usage + Active Reservations
A new request is admitted only when:

- (Historical Usage + Active Reservations + Requested Tokens) <= (TokenLimit - SafetyBuffer)
- Current Request Count < MaxRequestsPerMinute (when an RPM limit is configured)

After the API call, report actual tokens from the response via RecordActualUsage(). When the using block ends, the reservation is released and queued requests are processed.

Health checks can be registered alongside the gate:

builder.Services.AddHealthChecks()
    .AddTokenRateGate(name: "tokenrategate", tags: ["rate-limiting"]);
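A quick worked example of the capacity math using the defaults from the table above (plain arithmetic, not library API):

```csharp
// With TokenLimit = 500_000 and SafetyBufferPercentage = 0.05:
const int tokenLimit = 500_000;
const double safetyBuffer = 0.05;
int effectiveCapacity = (int)(tokenLimit * (1 - safetyBuffer)); // 475_000

// Suppose 400_000 tokens were consumed in the current window and 50_000
// are held by active reservations. A request asking for 30_000 more tokens:
int historicalUsage = 400_000;
int activeReservations = 50_000;
int requested = 30_000;

bool admitted =
    historicalUsage + activeReservations + requested <= effectiveCapacity;
// 480_000 > 475_000, so the request is queued until capacity frees up.
```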
// Configure estimation strategy
builder.Services.AddTokenRateGate(options =>
{
options.OutputEstimationStrategy = OutputEstimationStrategy.Conservative;
// Now reserves 2x input tokens (assumes output = input)
});
TokenRateGate provides detailed structured logging:
builder.Services.AddLogging(logging =>
{
logging.AddConsole();
logging.SetMinimumLevel(LogLevel.Debug); // See detailed token tracking
});
Check the samples/ directory for complete examples.
See tests/TokenRateGate.PerformanceTests for benchmarks.
# Run all tests
dotnet test
# Run specific test categories
dotnet test --filter "Category=Integration"
dotnet test --filter "Category=Performance"
The integration packages depend on:

- OpenAI NuGet package
- Azure.AI.OpenAI NuGet package

You don't need to install these individually; they're included as dependencies of the corresponding TokenRateGate packages.
Contributions are welcome! Please:
MIT License - Copyright © 2025 Marko Mrdja
See LICENSE for details.