Production readiness

A short guide to rate limiting, fallback, and cost control when using Intentum in production with real embedding APIs. Read it before going live.

Rate limiting

  • Intentum.Runtime: MemoryRateLimiter (in-memory fixed window) limits how often a key (e.g. user or session) can trigger a policy decision of type RateLimit. Use it with intent.DecideWithRateLimit(policy, rateLimiter, options).
  • Embedding API: To avoid exceeding the provider's request rate (and the 429s that follow), throttle calls to the embedding provider. Options: (1) Wrap the provider in a rate-limiting layer (e.g. a token bucket) before passing it to LlmIntentModel, as in the sketch after this list; (2) Queue requests and throttle inference; (3) Cache embeddings (see the AI providers how-to) so repeated behavior keys do not hit the API again.
  • See Embedding API error handling for retry strategies and handling 429 responses.
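Option (1) can be built on .NET's System.Threading.RateLimiting package. The sketch below is a minimal illustration under assumptions, not Intentum's API: the IEmbeddingProvider interface and GetEmbeddingAsync signature are hypothetical stand-ins for whatever provider shape you pass to LlmIntentModel.

```csharp
using System;
using System.Threading;
using System.Threading.RateLimiting;
using System.Threading.Tasks;

// Hypothetical provider shape for illustration; substitute the actual
// embedding-provider interface you pass to LlmIntentModel.
public interface IEmbeddingProvider
{
    Task<float[]> GetEmbeddingAsync(string text, CancellationToken ct = default);
}

// Token-bucket decorator: each call waits for a permit, so the wrapped
// provider never exceeds the configured request rate.
public sealed class RateLimitedEmbeddingProvider : IEmbeddingProvider, IAsyncDisposable
{
    private readonly IEmbeddingProvider _inner;
    private readonly TokenBucketRateLimiter _limiter;

    public RateLimitedEmbeddingProvider(IEmbeddingProvider inner, int requestsPerSecond)
    {
        _inner = inner;
        _limiter = new TokenBucketRateLimiter(new TokenBucketRateLimiterOptions
        {
            TokenLimit = requestsPerSecond,          // bucket capacity (max burst)
            TokensPerPeriod = requestsPerSecond,     // refill rate
            ReplenishmentPeriod = TimeSpan.FromSeconds(1),
            QueueLimit = int.MaxValue,               // queue callers instead of rejecting
            QueueProcessingOrder = QueueProcessingOrder.OldestFirst,
            AutoReplenishment = true
        });
    }

    public async Task<float[]> GetEmbeddingAsync(string text, CancellationToken ct = default)
    {
        using RateLimitLease lease = await _limiter.AcquireAsync(1, ct);
        if (!lease.IsAcquired)
            throw new InvalidOperationException("Rate limiter rejected the request.");
        return await _inner.GetEmbeddingAsync(text, ct);
    }

    public ValueTask DisposeAsync() => _limiter.DisposeAsync();
}
```

Because the limiter queues callers rather than rejecting them, bursts are smoothed out instead of turning into 429s; tune QueueLimit down if you would rather fail fast under sustained overload.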

Fallback

When the embedding API fails (timeout, 429, 5xx):

  • Catch at app layer: Wrap model.Infer(space) in try/catch; on HttpRequestException, log and either return a fallback intent (e.g. low confidence, single signal) or rethrow. A sketch follows this list.
  • Rule-based fallback: Use ChainedIntentModel: try LLM first; if confidence below threshold or inference fails, fall back to a RuleBasedIntentModel. See examples/chained-intent and examples/ai-fallback-intent.
  • Cache fallback: If you use a cached embedding provider, on API failure you can return a cached result for the same behavior key (if available) or a default low-confidence intent.
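The app-layer catch from the first bullet reduces to a small generic helper. This is a sketch under assumptions, not an Intentum API: the infer delegate stands in for () => model.Infer(space), buildFallback for however you construct a low-confidence default intent, and TIntent for Intentum's intent result type.

```csharp
using System;
using System.Net.Http;

public static class InferenceFallback
{
    // Runs inference; on embedding-API transport failure, logs and returns
    // a caller-supplied fallback instead of propagating the exception.
    public static TIntent InferWithFallback<TIntent>(
        Func<TIntent> infer,           // e.g. () => model.Infer(space)
        Func<TIntent> buildFallback,   // e.g. a single-signal, low-confidence intent
        Action<Exception>? log = null)
    {
        try
        {
            return infer();
        }
        catch (HttpRequestException ex) // 429, 5xx surfaced by the provider; add
        {                               // TaskCanceledException to absorb client timeouts
            log?.Invoke(ex);
            return buildFallback();
        }
        // Any other exception propagates: only API failures are absorbed here.
    }
}
```

Call it as InferWithFallback(() => model.Infer(space), () => myFallbackIntent); if you prefer to rethrow, skip the helper and let the exception propagate after logging.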

Cost control

  • Cap embedding calls: For large behavior spaces, each unique actor:action pair becomes one dimension, and each dimension costs one embedding call. Use ToVectorOptions (e.g. CapPerDimension, normalization) to limit the dimension count, or sample dimensions (e.g. top N by count) before calling the model.
  • Cache: Use CachedEmbeddingProvider (or the Redis adapter) so repeated behavior keys do not call the API again; this reduces both cost and latency. A minimal caching sketch follows this list.
  • Benchmark: Run the benchmarks to measure latency and throughput, and use the results to size timeouts and rate limits.
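CachedEmbeddingProvider and the Redis adapter are the production route; for illustration, a memoizing decorator can be this small. It reuses the hypothetical IEmbeddingProvider from the rate-limiting sketch above.

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Minimal in-memory memoizing decorator (illustration only; prefer
// CachedEmbeddingProvider or the Redis adapter in production).
public sealed class MemoizingEmbeddingProvider : IEmbeddingProvider
{
    private readonly IEmbeddingProvider _inner;

    // Caching the Task (not the result) also deduplicates concurrent
    // requests for the same behavior key.
    private readonly ConcurrentDictionary<string, Task<float[]>> _cache = new();

    public MemoizingEmbeddingProvider(IEmbeddingProvider inner) => _inner = inner;

    public Task<float[]> GetEmbeddingAsync(string text, CancellationToken ct = default)
        // The caller's token is deliberately not passed through, so one
        // caller cancelling does not poison the shared cached task.
        => _cache.GetOrAdd(text, key => _inner.GetEmbeddingAsync(key, CancellationToken.None));
}
```

One caveat: a faulted Task stays cached; evict failed entries (or cache only completed results) if transient API errors are common.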

Summary

Topic          Where to look
Rate limiting  api.md (MemoryRateLimiter, DecideWithRateLimit), embedding-api-errors.md
Fallback       ChainedIntentModel, examples/ai-fallback-intent, embedding-api-errors.md
Cost           ToVectorOptions (cap/sampling), CachedEmbeddingProvider, benchmarks

Next step: continue to Embedding API error handling or Benchmarks.