Stop HTTP 429 "Rate limit exceeded" errors from Azure OpenAI, OpenAI, and Anthropic Claude APIs. A complete solution with rate limiting, dependency injection, and SDK integrations for OpenAI and Azure. One package covers everything you need:

$ dotnet add package TokenRateGate
TokenRateGate is a .NET library that prevents rate limit errors by intelligently managing your token and request budgets. It tracks both TPM (Tokens-Per-Minute) and RPM (Requests-Per-Minute) limits and automatically queues requests when capacity is exhausted, so you stop hitting the dreaded 429 error.
Getting this error when calling LLM APIs?
HTTP 429: Too Many Requests
Rate limit is exceeded. Try again in X seconds.
This happens when you exceed your API provider's TPM (tokens-per-minute) or RPM (requests-per-minute) limits.
TokenRateGate prevents these errors by managing your token budget and queueing requests before they hit the API.
For OpenAI or Azure OpenAI users:
dotnet add package TokenRateGate
Includes: Everything - base engine, DI, OpenAI integration, Azure integration.
For Anthropic Claude, Google Gemini, or custom APIs:
dotnet add package TokenRateGate.Base
Includes: Core engine, DI support, character-based token estimation. Use when: Building custom integrations without OpenAI/Azure SDKs.
# Base + OpenAI only
dotnet add package TokenRateGate.Base
dotnet add package TokenRateGate.OpenAI
# Base + Azure only
dotnet add package TokenRateGate.Base
dotnet add package TokenRateGate.Azure
The easiest way to use TokenRateGate is with dependency injection:
using TokenRateGate.Extensions.DependencyInjection;
// In your Program.cs or Startup.cs
var builder = WebApplication.CreateBuilder(args);
// Register TokenRateGate with configuration
builder.Services.AddTokenRateGate(options =>
{
options.TokenLimit = 150000; // 150K tokens per minute (Azure Standard tier)
options.WindowSeconds = 60; // 60-second sliding window
options.SafetyBufferPercentage = 0.05; // 5% safety buffer (avoids hitting exact limit)
options.MaxConcurrentRequests = 10; // Limit concurrent API calls
options.MaxRequestsPerMinute = 100; // Optional: Also enforce RPM limit
});
var app = builder.Build();
Or bind from configuration:
// appsettings.json
{
"TokenRateGate": {
"TokenLimit": 500000,
"WindowSeconds": 60,
"MaxConcurrentRequests": 10
}
}
builder.Services.AddTokenRateGate(
builder.Configuration.GetSection("TokenRateGate"));
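Because this uses standard .NET configuration binding, any of these values can also be supplied through environment variables, with a double underscore as the section separator:

```shell
# Override TokenRateGate settings without touching appsettings.json
export TokenRateGate__TokenLimit=300000
export TokenRateGate__MaxConcurrentRequests=20
```

This is handy for per-environment tuning (for example, a lower limit in staging than in production).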
For APIs without built-in token estimation (Anthropic Claude, Google Gemini, custom LLMs):
using Microsoft.Extensions.Logging;
using TokenRateGate.Abstractions;
using TokenRateGate.Core;
using TokenRateGate.Core.TokenEstimation;
public class CustomLlmService
{
private readonly ITokenRateGate _rateGate;
private readonly ITokenEstimator _tokenEstimator;
private readonly ILogger<CustomLlmService> _logger;
public CustomLlmService(ITokenRateGate rateGate, ILogger<CustomLlmService> logger)
{
_rateGate = rateGate;
_logger = logger;
// Use character-based estimation (4 chars ≈ 1 token for most LLMs)
_tokenEstimator = new CharacterBasedTokenEstimator();
}
public async Task<string> CallCustomLlmAsync(string prompt)
{
// Estimate tokens using character-based estimator
int estimatedInputTokens = _tokenEstimator.EstimateTokens(prompt);
int estimatedOutputTokens = 1000; // Your estimated response size
// Reserve capacity before calling the LLM
await using var reservation = await _rateGate.ReserveTokensAsync(
estimatedInputTokens,
estimatedOutputTokens);
_logger.LogInformation("Reserved {Tokens} tokens", reservation.ReservedTokens);
// Make your custom LLM API call
var response = await CallYourCustomApiAsync(prompt);
// Record actual usage from the response (IMPORTANT for accurate tracking)
var actualTotalTokens = response.Usage.TotalTokens;
reservation.RecordActualUsage(actualTotalTokens);
_logger.LogInformation("Actual usage: {Tokens} tokens", actualTotalTokens);
return response.Content;
}
}
Using CharacterBasedTokenEstimator:
// Default: 4 characters per token
var estimator = new CharacterBasedTokenEstimator();
int tokens = estimator.EstimateTokens("Hello world!"); // ≈ 3 tokens
// Custom ratio for different languages
var chineseEstimator = new CharacterBasedTokenEstimator(charactersPerToken: 2.0);
int chineseTokens = chineseEstimator.EstimateTokens("你好世界"); // Better for non-Latin scripts
// Estimate multiple texts
var messages = new[] { "System prompt", "User message", "Assistant response" };
int totalTokens = estimator.EstimateTokens(messages);
TokenRateGate integrates seamlessly with the OpenAI SDK.
Note: Logging is optional. The WithRateLimit() extension method accepts an optional ILoggerFactory parameter for diagnostics. If not provided, logging is disabled.
using System.Text;
using OpenAI.Chat;
using TokenRateGate.Abstractions;
using TokenRateGate.OpenAI;
public class ChatService
{
private readonly ITokenRateGate _rateGate;
private readonly string _apiKey;
public ChatService(
ITokenRateGate rateGate,
IConfiguration configuration)
{
_rateGate = rateGate;
_apiKey = configuration["OpenAI:ApiKey"];
}
public async Task<string> AskQuestionAsync(string question)
{
// Create OpenAI client and wrap with rate limiting
var client = new ChatClient("gpt-4", _apiKey);
var rateLimitedClient = client.WithRateLimit(_rateGate, "gpt-4");
// Make rate-limited API call - automatic token tracking!
var messages = new[] { new UserChatMessage(question) };
var response = await rateLimitedClient.CompleteChatAsync(messages);
return response.Content[0].Text;
}
public async Task<string> AskQuestionStreamingAsync(string question)
{
var client = new ChatClient("gpt-4", _apiKey);
var rateLimitedClient = client.WithRateLimit(_rateGate, "gpt-4");
var messages = new[] { new UserChatMessage(question) };
var result = new StringBuilder();
// Streaming support with automatic token tracking
await foreach (var chunk in rateLimitedClient.CompleteChatStreamingAsync(messages))
{
if (chunk.ContentUpdate.Count > 0)
{
var text = chunk.ContentUpdate[0].Text;
result.Append(text);
Console.Write(text);
}
}
return result.ToString();
}
}
Note: Logging is optional. The WithRateLimit() extension method accepts an optional ILoggerFactory parameter for diagnostics. If not provided, logging is disabled.
using Azure;
using Azure.AI.OpenAI;
using OpenAI.Chat;
using TokenRateGate.Abstractions;
using TokenRateGate.Azure;
public class AzureChatService
{
private readonly ITokenRateGate _rateGate;
public AzureChatService(ITokenRateGate rateGate)
{
_rateGate = rateGate;
}
public async Task<string> AskQuestionAsync(string question)
{
var azureClient = new AzureOpenAIClient(
new Uri("https://your-resource.openai.azure.com/"),
new AzureKeyCredential("your-api-key"));
// Wrap with rate limiting (deployment name + model name for token counting)
var rateLimitedClient = azureClient.WithRateLimit(
_rateGate,
deploymentName: "my-gpt4-deployment",
modelName: "gpt-4");
var messages = new[] { new UserChatMessage(question) };
var response = await rateLimitedClient.CompleteChatAsync(messages);
return response.Content[0].Text;
}
}
Support different rate limits for different users, models, or tenants:
// Registration in Program.cs
builder.Services.AddTokenRateGateFactory();
builder.Services.AddNamedTokenRateGate("basic-tier", options =>
{
options.TokenLimit = 100000; // 100K tokens/min for basic users
options.WindowSeconds = 60;
});
builder.Services.AddNamedTokenRateGate("premium-tier", options =>
{
options.TokenLimit = 1000000; // 1M tokens/min for premium users
options.WindowSeconds = 60;
});
// Usage in your service
public class MultiTenantChatService
{
private readonly ITokenRateGateFactory _factory;
public MultiTenantChatService(ITokenRateGateFactory factory)
{
_factory = factory;
}
public async Task<string> AskQuestionAsync(string question, string tier)
{
// Get rate gate for the tenant's tier
var rateGate = _factory.GetOrCreate(tier);
var client = new ChatClient("gpt-4", "your-api-key");
var rateLimitedClient = client.WithRateLimit(rateGate, "gpt-4");
var messages = new[] { new UserChatMessage(question) };
var response = await rateLimitedClient.CompleteChatAsync(messages);
return response.Content[0].Text;
}
}
You can also use TokenRateGate without dependency injection:
using Microsoft.Extensions.Logging;
using OpenAI.Chat;
using TokenRateGate.Core;
using TokenRateGate.Core.Options;
using TokenRateGate.OpenAI;
// Create rate gate manually
var options = new TokenRateGateOptions
{
TokenLimit = 500000,
WindowSeconds = 60,
MaxConcurrentRequests = 10
};
using var loggerFactory = LoggerFactory.Create(builder =>
{
builder.AddConsole();
});
var rateGate = new TokenRateGate(options, loggerFactory);
// Use with OpenAI
var client = new ChatClient("gpt-4", "your-api-key");
var rateLimitedClient = client.WithRateLimit(rateGate, "gpt-4", loggerFactory);
var messages = new[] { new UserChatMessage("Hello!") };
var response = await rateLimitedClient.CompleteChatAsync(messages);
Console.WriteLine(response.Content[0].Text);
public class MonitoringService
{
private readonly ITokenRateGate _rateGate;
public MonitoringService(ITokenRateGate rateGate)
{
_rateGate = rateGate;
}
public void LogCurrentUsage()
{
var stats = _rateGate.GetUsageStats();
Console.WriteLine($"Current Usage: {stats.CurrentUsage}/{stats.EffectiveCapacity} tokens");
Console.WriteLine($"Reserved: {stats.ReservedTokens} tokens");
Console.WriteLine($"Available: {stats.AvailableTokens} tokens");
Console.WriteLine($"Usage: {stats.UsagePercentage:F1}%");
Console.WriteLine($"Near Capacity: {stats.IsNearCapacity}");
}
}
| Option | Default | Description |
|---|---|---|
| TokenLimit | 500000 | Maximum tokens per window (TPM limit) |
| WindowSeconds | 60 | Time window in seconds for token tracking |
| SafetyBufferPercentage | 0.05 (5%) | Fraction of TokenLimit reserved as a safety buffer. Effective limit = TokenLimit * (1 - SafetyBufferPercentage) |
| MaxConcurrentRequests | 1000 | Maximum concurrent active reservations |
| MaxRequestsPerMinute | null | Optional RPM limit, enforced in addition to the token limit. If both are configured, whichever is more restrictive applies |
| RequestWindowSeconds | 120 | Time window for RPM tracking (default: max(120s, 2×WindowSeconds)) |
| MaxWaitTime | null (unlimited) | Maximum time to wait for capacity in the queue before timing out. Applies only to capacity queue waiting, not semaphore waiting. Set to null for unlimited waiting (recommended for most use cases) |
| OutputEstimationStrategy | FixedMultiplier | How to estimate output tokens when not provided |
| OutputMultiplier | 0.5 | Multiplier for the FixedMultiplier strategy |
| DefaultOutputTokens | 1000 | Fixed output size for the FixedAmount strategy |
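As a quick sanity check on the safety buffer formula: with the quick-start settings above (TokenLimit = 150000, SafetyBufferPercentage = 0.05), the effective window is 142,500 tokens:

```shell
# Effective limit = TokenLimit * (1 - SafetyBufferPercentage)
# 150000 * (1 - 0.05) = 142500 tokens actually usable per window
echo $((150000 * 95 / 100))
```

The remaining 7,500 tokens act as headroom against estimation error, which is why requests rarely hit the provider's exact limit.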
The FixedMultiplier strategy uses OutputMultiplier (default 0.5); the FixedAmount strategy uses DefaultOutputTokens (default 1000).

TokenRateGate uses a dual-component capacity system:

Current Capacity = Historical Usage + Active Reservations

A request is admitted only when both conditions hold:
- (Historical Usage + Active Reservations + Requested Tokens) <= (TokenLimit - SafetyBuffer)
- Current Request Count < MaxRequestsPerMinute (if an RPM limit is configured)

After the API call, call RecordActualUsage() with the actual tokens from the response so tracking stays accurate.
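The admission rule can be sketched as follows. This is illustrative pseudocode in C# syntax, not the library's actual internals; the method and parameter names here are hypothetical.

```csharp
// Illustrative sketch of the admission check, not TokenRateGate's internals.
static bool CanAdmit(
    long historicalUsage, long activeReservations, long requestedTokens,
    int currentRequestCount, TokenRateGateOptions options)
{
    // Safety buffer carved out of the configured token limit
    long safetyBuffer = (long)(options.TokenLimit * options.SafetyBufferPercentage);

    bool withinTokenBudget =
        historicalUsage + activeReservations + requestedTokens
            <= options.TokenLimit - safetyBuffer;

    // RPM check only applies when MaxRequestsPerMinute is configured
    bool withinRpm = options.MaxRequestsPerMinute is null
        || currentRequestCount < options.MaxRequestsPerMinute;

    return withinTokenBudget && withinRpm;
}
```

A request failing either condition is queued rather than rejected, which is what keeps the 429s from reaching your code.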
When the using block ends, the reservation is released and queued requests are processed.

To monitor the gate in production, register the built-in health check:

builder.Services.AddHealthChecks()
.AddTokenRateGate(name: "tokenrategate", tags: ["rate-limiting"]);
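To expose the check over HTTP, map the standard ASP.NET Core health-check endpoint (this is framework functionality, not part of TokenRateGate; the "/health" path is just a conventional choice):

```csharp
// Standard ASP.NET Core health-check endpoint; reports the gate's status
// alongside any other registered checks.
app.MapHealthChecks("/health");
```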
// Configure estimation strategy
builder.Services.AddTokenRateGate(options =>
{
options.OutputEstimationStrategy = OutputEstimationStrategy.Conservative;
// Now reserves 2x input tokens (assumes output = input)
});
TokenRateGate provides detailed structured logging:
builder.Services.AddLogging(logging =>
{
logging.AddConsole();
logging.SetMinimumLevel(LogLevel.Debug); // See detailed token tracking
});
Check the samples/ directory for complete examples.
See tests/TokenRateGate.PerformanceTests for benchmarks.
# Run all tests
dotnet test
# Run specific test categories
dotnet test --filter "Category=Integration"
dotnet test --filter "Category=Performance"
Dependencies: the OpenAI NuGet package (used by TokenRateGate.OpenAI) and the Azure.AI.OpenAI NuGet package (used by TokenRateGate.Azure). You don't need to install these individually - they're pulled in automatically by the corresponding TokenRateGate packages.
Contributions are welcome! Please:
MIT License - Copyright © 2025 Marko Mrdja
See LICENSE for details.