# Cross-LLM consistency (optional)
For the same BehaviorSpace, different embedding providers (OpenAI, Azure, Gemini, Mistral, etc.) can produce different confidence scores and, when using domain-specific models, different intent names. This page describes how to measure and document that variance.
## Why measure
- Production choice: When selecting a single provider for production, comparing confidence and intent output across providers on a fixed set of spaces helps you understand variance and pick a stable option.
- Regression: After changing a model or provider version, re-running the same spaces and comparing scores helps detect regressions.
## How to run (with real API keys)
- Pick a small set of BehaviorSpaces (e.g. 10–20) that represent typical and edge cases (login flow, retries, abuse-like patterns).
- Run each space through each provider you care about (e.g. OpenAI, Azure OpenAI, Gemini), using the same `LlmIntentModel` + `SimpleAverageSimilarityEngine` and only swapping `IIntentEmbeddingProvider`.
- Record per space: intent name (if domain-specific), confidence score, and optionally level (High/Medium/Low/Certain).
- Summarize: e.g. mean absolute difference in confidence score between provider A and B, or fraction of spaces where the intent name agrees (see the sketch after this list).
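The summarize step is small enough to script. Below is a minimal sketch (not part of the repo) that computes the two suggested metrics from recorded per-space results; the `ProviderResult` record and its field names are illustrative assumptions, not existing types in the codebase.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of one recorded result (space id, winning intent, confidence).
public record ProviderResult(string SpaceId, string IntentName, double Confidence);

public static class CrossProviderSummary
{
    public static void Summarize(
        IReadOnlyList<ProviderResult> providerA,
        IReadOnlyList<ProviderResult> providerB)
    {
        // Pair up results from the two providers by space id.
        var paired = providerA
            .Join(providerB, a => a.SpaceId, b => b.SpaceId, (a, b) => (a, b))
            .ToList();

        // Mean absolute difference in confidence score between provider A and B.
        double meanAbsDiff = paired.Average(p => Math.Abs(p.a.Confidence - p.b.Confidence));

        // Fraction of spaces where the intent name agrees (case-insensitive).
        double agreementRate = paired.Count(p =>
            string.Equals(p.a.IntentName, p.b.IntentName, StringComparison.OrdinalIgnoreCase))
            / (double)paired.Count;

        Console.WriteLine($"Spaces compared: {paired.Count}");
        Console.WriteLine($"Mean |score_A - score_B|: {meanAbsDiff:F3}");
        Console.WriteLine($"Intent name agreement: {agreementRate:P0}");
    }
}
```

With 10–20 spaces this is enough to see, for example, whether two providers track each other within a few hundredths of confidence or diverge on specific edge-case spaces.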
## Test in repo
The test `CrossLLMConsistencyTests.CrossProvider_SameSpace_TwoProviders_ProduceValidIntents` runs the same space through two mock providers and asserts that both produce valid intents; it does not call real APIs. To run a real cross-LLM comparison, use the optional integration workflow (see Embedding API error handling) with multiple provider secrets, plus a small script or test that builds the spaces, calls each provider, and writes a short report (e.g. to `docs/case-studies/`).
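A minimal sketch of such a report writer is shown below. The per-provider scoring delegate stands in for whatever code wires a space through `LlmIntentModel` + `SimpleAverageSimilarityEngine` with a concrete `IIntentEmbeddingProvider`; its signature here is an assumption for illustration, not the repo's API.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

public static class CrossProviderReport
{
    // Writes a markdown table with one row per space and one column per provider.
    // providers: map from provider name to a hypothetical delegate that runs a space id
    // through that provider's pipeline and returns the winning intent name and confidence.
    public static void Write(
        IReadOnlyList<string> spaceIds,
        IReadOnlyDictionary<string, Func<string, (string IntentName, double Confidence)>> providers,
        string outputPath) // e.g. "docs/case-studies/cross-llm-report.md"
    {
        var sb = new StringBuilder();
        sb.AppendLine("| Space | " + string.Join(" | ", providers.Keys) + " |");
        sb.AppendLine("|---|" + string.Concat(Enumerable.Repeat("---|", providers.Count)));

        foreach (var spaceId in spaceIds)
        {
            var cells = providers.Values
                .Select(scoreSpace => scoreSpace(spaceId))
                .Select(r => $"{r.IntentName} ({r.Confidence:F2})");
            sb.AppendLine($"| {spaceId} | " + string.Join(" | ", cells) + " |");
        }

        File.WriteAllText(outputPath, sb.ToString());
    }
}
```

One delegate would be registered per provider under comparison (e.g. keyed "OpenAI", "Azure OpenAI", "Gemini"), and the generated markdown file committed alongside the other case studies.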
## Summary
| Goal | Action |
|---|---|
| Compare providers | Same N spaces → run with provider A and B → record scores and intent |
| Document variance | Mean \|score_A − score_B\| or agreement rate on intent name |
| Reproduce | Fix the set of spaces and re-run when changing model or provider |