# Production readiness
A short guide to rate limiting, fallback, and cost control when using Intentum in production with real embedding APIs. Read it before going live.
## Rate limiting
- `Intentum.Runtime`: `MemoryRateLimiter` (an in-memory fixed window) limits how often a key (e.g. a user or session) can trigger a policy decision of type `RateLimit`. Use it with `intent.DecideWithRateLimit(policy, rateLimiter, options)`.
- Embedding API: to avoid exceeding the provider's request rate (and getting 429s), limit how often you call the embedding provider. Options: (1) wrap the provider in a rate-limiting layer (e.g. a token bucket) before passing it to `LlmIntentModel`, as in the sketch after this list; (2) queue requests and throttle inference; (3) cache embeddings (see the AI providers how-to) so repeated behavior keys do not call the API again.
- See Embedding API error handling for retry and 429 handling.
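For option (1), here is a minimal sketch of a token-bucket wrapper built on .NET's `System.Threading.RateLimiting`. The `IEmbeddingProvider` shape (`EmbedAsync(string)` returning `Task<float[]>`) is an assumption for illustration — adapt it to Intentum's actual provider contract:

```csharp
using System;
using System.Threading.RateLimiting;
using System.Threading.Tasks;

// Assumed provider shape; substitute Intentum's real embedding interface.
public interface IEmbeddingProvider
{
    Task<float[]> EmbedAsync(string text);
}

public sealed class RateLimitedEmbeddingProvider : IEmbeddingProvider
{
    private readonly IEmbeddingProvider _inner;
    private readonly TokenBucketRateLimiter _limiter = new(new TokenBucketRateLimiterOptions
    {
        TokenLimit = 10,                               // burst capacity
        TokensPerPeriod = 10,                          // tokens restored each period
        ReplenishmentPeriod = TimeSpan.FromSeconds(1), // ~10 calls/second sustained
        QueueLimit = 100,                              // callers allowed to wait
        QueueProcessingOrder = QueueProcessingOrder.OldestFirst
    });

    public RateLimitedEmbeddingProvider(IEmbeddingProvider inner) => _inner = inner;

    public async Task<float[]> EmbedAsync(string text)
    {
        // Wait (asynchronously) for a token; fail fast if the queue is full.
        using RateLimitLease lease = await _limiter.AcquireAsync(permitCount: 1);
        if (!lease.IsAcquired)
            throw new InvalidOperationException("Embedding rate-limit queue is full.");
        return await _inner.EmbedAsync(text);
    }
}
```

Pass the wrapper to `LlmIntentModel` in place of the raw provider; retries for any remaining 429s are covered in Embedding API error handling.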
## Fallback
When the embedding API fails (timeout, 429, 5xx):
- Catch at the app layer: wrap `model.Infer(space)` in a try/catch; on `HttpRequestException`, log and either return a fallback intent (e.g. low confidence, single signal) or rethrow. See the sketch after this list.
- Rule-based fallback: use `ChainedIntentModel` to try the LLM first and, if confidence is below a threshold or inference fails, fall back to a `RuleBasedIntentModel`. See `examples/chained-intent` and `examples/ai-fallback-intent`.
- Cache fallback: if you use a cached embedding provider, on API failure you can return the cached result for the same behavior key (if available) or a default low-confidence intent.
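A minimal sketch of the app-layer catch described in the first bullet. The `Intent` and `BehaviorSpace` type names and the `BuildFallbackIntent` helper are stand-ins, not Intentum API; wire in your own low-confidence default:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

public Intent InferWithFallback(LlmIntentModel model, BehaviorSpace space, ILogger logger)
{
    try
    {
        return model.Infer(space);
    }
    catch (HttpRequestException ex) // 429 / 5xx / network failure
    {
        logger.LogWarning(ex, "Embedding API failed; returning fallback intent.");
        return BuildFallbackIntent(); // stand-in: low confidence, single signal
    }
    catch (TaskCanceledException ex) // HttpClient timeouts surface as cancellations
    {
        logger.LogWarning(ex, "Embedding API timed out; returning fallback intent.");
        return BuildFallbackIntent();
    }
}
```

If you would rather rethrow, drop the fallback branches; `ChainedIntentModel` gives you the same fallback behavior declaratively inside the model chain.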
## Cost control
- Cap embedding calls: for large behavior spaces, the number of dimensions (unique `actor:action` keys) equals the number of embedding calls. Use `ToVectorOptions` (e.g. `CapPerDimension`, normalization) to limit the dimension count, or sample dimensions (e.g. top N by count) before calling the model; see the first sketch after this list.
- Cache: use `CachedEmbeddingProvider` (or the Redis adapter) so repeated behavior keys do not call the API; this reduces both cost and latency. The second sketch after this list shows the underlying pattern.
- Benchmark: run the benchmarks to measure latency and throughput, then use the results to size timeouts and rate limits.
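For the dimension cap, a minimal sampling sketch. The `dimensionCounts` dictionary stands in for however your behavior space exposes per-dimension counts — an assumption, not Intentum API:

```csharp
using System.Collections.Generic;
using System.Linq;

// Keep only the N most frequent actor:action dimensions; each kept
// dimension costs exactly one embedding call downstream.
static IReadOnlyList<string> TopDimensions(
    IReadOnlyDictionary<string, int> dimensionCounts, int n) =>
    dimensionCounts
        .OrderByDescending(kv => kv.Value) // most frequent first
        .Take(n)
        .Select(kv => kv.Key)
        .ToList();
```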
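And a minimal sketch of the pattern behind `CachedEmbeddingProvider` (prefer the shipped class or the Redis adapter in production). It reuses the assumed `IEmbeddingProvider` shape from the rate-limiting sketch above:

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class InMemoryCachedEmbeddingProvider : IEmbeddingProvider
{
    private readonly IEmbeddingProvider _inner;
    private readonly ConcurrentDictionary<string, float[]> _cache = new();

    public InMemoryCachedEmbeddingProvider(IEmbeddingProvider inner) => _inner = inner;

    public async Task<float[]> EmbedAsync(string text)
    {
        // Repeated behavior keys are served from memory instead of the API.
        if (_cache.TryGetValue(text, out var cached))
            return cached;

        var vector = await _inner.EmbedAsync(text);
        _cache.TryAdd(text, vector);
        return vector;
    }
}
```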
## Summary
| Topic | Where to look |
|---|---|
| Rate limiting | api.md (`MemoryRateLimiter`, `DecideWithRateLimit`), embedding-api-errors.md |
| Fallback | `ChainedIntentModel`, `examples/ai-fallback-intent`, embedding-api-errors.md |
| Cost | `ToVectorOptions` (cap/sampling), `CachedEmbeddingProvider`, benchmarks |
Next step → Embedding API error handling or Benchmarks.