This page lists what the beliefs engine promises to its callers — not what features it has, but what mathematical invariants you can rely on. Each promise is pinned by an integration test in the engine repo. A regression that breaks any of these is a P0 in the engine, not something your SDK code has to defend against.
The math itself (entropy, KL, log-linear pool, isotonic calibration) stays internal; this page documents the guarantees those internals provide.
1. Conjugate updates preserve the family
Beta beliefs (true/false) stay Beta. Dirichlet beliefs (categorical) stay Dirichlet. Gaussian beliefs (continuous) stay Gaussian. Noisy evidence is handled by moment-matching within the family — never by dropping conjugacy.
Why it matters: confidence numbers have a stable meaning across turns. The same belief at the same evidence count yields the same posterior.
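As a sketch of what family preservation looks like for a Beta belief (illustrative names, not the engine's internals; the mixture-then-moment-match step for noisy evidence is an assumption about how such updates are commonly implemented):

```typescript
type Beta = { alpha: number; beta: number };

// Exact conjugate update for a hard true/false observation.
function update(b: Beta, outcome: boolean): Beta {
  return outcome
    ? { alpha: b.alpha + 1, beta: b.beta }
    : { alpha: b.alpha, beta: b.beta + 1 };
}

// Noisy evidence: mix the two hard posteriors by the evidence probability,
// then moment-match the mixture's mean/variance back to a single Beta.
function noisyUpdate(b: Beta, pTrue: number): Beta {
  const post1 = update(b, true), post0 = update(b, false);
  const mean = (x: Beta) => x.alpha / (x.alpha + x.beta);
  const variance = (x: Beta) => {
    const n = x.alpha + x.beta;
    return (x.alpha * x.beta) / (n * n * (n + 1));
  };
  // Mixture moments
  const m = pTrue * mean(post1) + (1 - pTrue) * mean(post0);
  const v =
    pTrue * (variance(post1) + mean(post1) ** 2) +
    (1 - pTrue) * (variance(post0) + mean(post0) ** 2) -
    m * m;
  // Solve for the Beta(α, β) with matching mean m and variance v
  const n = (m * (1 - m)) / v - 1;
  return { alpha: m * n, beta: (1 - m) * n };
}
```

Note that pTrue = 1 recovers the hard update exactly, and fully ambiguous evidence (pTrue = 0.5 on a uniform prior) leaves the prior unchanged.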
2. Fusion is associative and order-independent
fuse([A, B, C]) == fuse([C, B, A]) == fuse([B, A, C]) — to 12 decimal places.
Why it matters: retrying a failed after() after a peer's write doesn't change the outcome. Multi-agent pipelines have no hidden ordering bug class.
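Order independence falls out naturally when fusion reduces each delta to additive sufficient statistics, as in this sketch (hypothetical shapes, not the SDK types):

```typescript
type Beta = { alpha: number; beta: number };
type BetaDelta = { dAlpha: number; dBeta: number };

function fuse(prior: Beta, deltas: BetaDelta[]): Beta {
  // Summing evidence counts is commutative and associative,
  // so the order of deltas cannot affect the result.
  return deltas.reduce(
    (acc, d) => ({ alpha: acc.alpha + d.dAlpha, beta: acc.beta + d.dBeta }),
    prior,
  );
}
```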
3. Multi-agent fusion is trust-weighted in closed form
When multiple agents submit conflicting evidence, the engine combines them as a log-linear pool weighted by each agent's reliability. The result is byte-exact to poolBeta([{α₁,β₁}, …], [w₁, …]). KL divergence between successive rounds shrinks monotonically — repeated fusion converges, never oscillates.
Why it matters: trust knobs translate directly into evidence weight. Downweighting an unreliable agent has the predictable effect on final confidence.
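For Beta components, a log-linear (geometric) pool of densities is itself Beta, with each component's evidence above the uniform prior scaled by its weight. A sketch of a poolBeta-style combiner (signature assumed from the notation above, not the engine's actual export):

```typescript
type Beta = { alpha: number; beta: number };

function poolBeta(components: Beta[], weights: number[]): Beta {
  let a = 1, b = 1; // uniform Beta(1,1) carries zero pooled evidence
  components.forEach((c, i) => {
    a += weights[i] * (c.alpha - 1);
    b += weights[i] * (c.beta - 1);
  });
  return { alpha: a, beta: b };
}
```

Downweighting a component by w scales its evidence contribution by exactly w, which is what makes trust knobs predictable.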
4. Temporal decay scales evidence linearly
Older evidence is downweighted by a freshness factor f ∈ (0, 1]. Decayed evidence at weight f contributes exactly f × (α-1, β-1) above prior — to within 1e-12.
Why it matters: "older = less weight" is honored exactly; no silent clamps that would let a stale claim still dominate a fresh one.
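The stated decay law is a one-liner over the evidence above a uniform Beta(1,1) prior (a sketch; the engine's choice of prior may differ):

```typescript
type Beta = { alpha: number; beta: number };

function decay(b: Beta, f: number): Beta {
  // The prior Beta(1,1) is the fixed point; only accumulated evidence shrinks.
  return { alpha: 1 + f * (b.alpha - 1), beta: 1 + f * (b.beta - 1) };
}
```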
5. Entropy decreases as evidence accumulates
For a belief receiving consistent supporting evidence, entropy is strictly monotonically decreasing. Inconsistent evidence raises entropy.
Why it matters: "more evidence ⇒ tighter posterior" is a mechanical guarantee. Confidence-trend UIs can trust the direction.
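The entropy claim can be checked numerically for integer-count Beta beliefs. This sketch computes Beta differential entropy by Simpson integration, using a factorial-based B(α, β) (so it is valid for integer counts only; it is a verification aid, not the engine's computation):

```typescript
function betaEntropy(alpha: number, beta: number): number {
  const fact = (n: number): number => (n <= 1 ? 1 : n * fact(n - 1));
  const B = (fact(alpha - 1) * fact(beta - 1)) / fact(alpha + beta - 1);
  const pdf = (x: number) => (x ** (alpha - 1) * (1 - x) ** (beta - 1)) / B;
  // Composite Simpson's rule for H = -∫ p(x) ln p(x) dx over [0, 1]
  const n = 10000, h = 1 / n;
  let s = 0;
  for (let i = 0; i <= n; i++) {
    const p = pdf(i * h);
    const g = p > 0 ? -p * Math.log(p) : 0; // p ln p -> 0 as p -> 0
    const w = i === 0 || i === n ? 1 : i % 2 ? 4 : 2;
    s += w * g;
  }
  return (s * h) / 3;
}
```

Consistent supporting evidence (all confirmations) drives the entropy strictly down from the uniform prior's zero.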
6. Deltas are idempotent
Replaying the same (idempotencyKey, scope) on add / after / observe yields the same state. Network retries are safe.
Why it matters: at-least-once delivery from queues, webhooks, or flaky networks doesn't double-count evidence.
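A minimal sketch of idempotency-key deduplication, assuming (idempotencyKey, scope) uniquely identifies one logical write (class and method names are illustrative, not the SDK API):

```typescript
type Beta = { alpha: number; beta: number };

class BeliefStore {
  private state: Beta = { alpha: 1, beta: 1 };
  private applied = new Set<string>();

  add(idempotencyKey: string, scope: string, dAlpha: number, dBeta: number): Beta {
    const key = `${scope}\u0000${idempotencyKey}`; // NUL-joined composite key
    if (!this.applied.has(key)) {
      this.applied.add(key);
      this.state = { alpha: this.state.alpha + dAlpha, beta: this.state.beta + dBeta };
    }
    // A replay falls through to here and returns the already-applied state.
    return this.state;
  }
}
```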
7. Supersession drops the superseded belief from active candidates
When belief B explicitly supersedes belief A, A is removed from the active candidate set. read() and list() no longer return A; trace() still surfaces it.
Why it matters: an agent updating its position on a claim doesn't leave the prior position competing for attention.
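A sketch of the supersession filtering described above (illustrative names only):

```typescript
type Belief = { id: string; supersededBy?: string };

class Candidates {
  constructor(private beliefs: Belief[]) {}

  supersede(oldId: string, newId: string): void {
    const old = this.beliefs.find((b) => b.id === oldId);
    if (old) old.supersededBy = newId;
  }

  list(): Belief[] {
    // Active candidates only: superseded beliefs drop out.
    return this.beliefs.filter((b) => b.supersededBy === undefined);
  }

  trace(): Belief[] {
    return this.beliefs; // full history, superseded entries included
  }
}
```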
8. Family isolation: writing one type doesn't contaminate others
Mixed Beta / Dirichlet / Gaussian states are safe to compose. Applying a delta to one family leaves the others byte-identical.
Why it matters: multi-modal world models are safe; a new categorical claim won't perturb an existing propositional one.
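One way to picture this guarantee: per-family states keyed by belief id, where a delta touches exactly one entry (a sketch with assumed shapes, not the engine's storage model):

```typescript
type BeliefState =
  | { kind: "beta"; alpha: number; beta: number }
  | { kind: "dirichlet"; counts: number[] }
  | { kind: "gaussian"; mean: number; variance: number };

function applyBetaDelta(
  world: Map<string, BeliefState>,
  id: string,
  dAlpha: number,
  dBeta: number,
): Map<string, BeliefState> {
  const next = new Map(world);
  const cur = next.get(id);
  if (cur?.kind !== "beta") throw new Error("family mismatch");
  next.set(id, { kind: "beta", alpha: cur.alpha + dAlpha, beta: cur.beta + dBeta });
  // Every other entry is the same untouched object, trivially byte-identical.
  return next;
}
```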
9. Empty deltas are the identity element
after('') or applying an empty BeliefDelta returns state unchanged, byte-identical. No-op writes are free.
Why it matters: defensive patterns ("call after every turn even if nothing happened") are zero-cost and zero-risk.
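In sufficient-statistics form the identity property is structural, as this sketch (hypothetical fuse shape) shows: folding an empty delta list returns the initial state itself, which is even stronger than byte-identical.

```typescript
type Beta = { alpha: number; beta: number };
type BetaDelta = { dAlpha: number; dBeta: number };

// Folding over zero deltas returns the prior by reference: a true no-op.
const fuse = (prior: Beta, deltas: BetaDelta[]): Beta =>
  deltas.reduce(
    (s, d) => ({ alpha: s.alpha + d.dAlpha, beta: s.beta + d.dBeta }),
    prior,
  );
```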
10. Calibration is gated in CI
The engine enforces an Expected Calibration Error gate: ECE ≤ 0.05. A change that pushes ECE above the gate fails CI in the engine repo and never ships.
Why it matters: when the SDK reports confidence: 'high', that label is calibrated against held-out outcomes — not heuristic thresholding of an underlying probability.
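For reference, this sketch implements the standard equal-width-bin definition of Expected Calibration Error; the engine's exact binning scheme is not public, so treat the bin count as an assumption:

```typescript
function ece(predictions: { p: number; outcome: boolean }[], bins = 10): number {
  const buckets = Array.from({ length: bins }, () => ({ n: 0, pSum: 0, hits: 0 }));
  for (const { p, outcome } of predictions) {
    const i = Math.min(bins - 1, Math.floor(p * bins));
    buckets[i].n += 1;
    buckets[i].pSum += p;
    buckets[i].hits += outcome ? 1 : 0;
  }
  // Weighted mean over bins of |average confidence - empirical accuracy|
  const total = predictions.length;
  let e = 0;
  for (const b of buckets) {
    if (b.n === 0) continue;
    e += (b.n / total) * Math.abs(b.pSum / b.n - b.hits / b.n);
  }
  return e;
}
```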
11. Real-Postgres parity with the in-memory test harness
The engine's pglite backend and a real Postgres backend produce byte-identical state for the same write sequence.
Why it matters: local test cycles are fast; production cycles are correct; the engine guarantees they agree.
12. Fragility reflects evidence thickness, not just point estimate
Two beliefs with the same posterior mean can have different fragility — derived as E_e[KL(post_e ∥ current)], which is higher when supporting evidence is thin.
Why it matters: "70% backed by 100 observations" is not the same signal as "70% backed by 2 observations," and the SDK surfaces the difference.
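The fragility definition can be sketched numerically for integer-count Beta beliefs: average the KL from each one-observation posterior back to the current belief, weighted by the predictive probability of that observation (a verification aid with factorial-based densities and Simpson integration, not the engine's computation):

```typescript
function betaPdf(a: number, b: number): (x: number) => number {
  const fact = (n: number): number => (n <= 1 ? 1 : n * fact(n - 1));
  const B = (fact(a - 1) * fact(b - 1)) / fact(a + b - 1);
  return (x) => (x ** (a - 1) * (1 - x) ** (b - 1)) / B;
}

// KL(Beta(p) || Beta(q)) by Simpson integration on the open interval;
// both integrands vanish at 0 and 1 for the α, β >= 2 cases below.
function klBeta(p: [number, number], q: [number, number]): number {
  const pPdf = betaPdf(p[0], p[1]), qPdf = betaPdf(q[0], q[1]);
  const n = 10000, h = 1 / n;
  let s = 0;
  for (let i = 1; i < n; i++) {
    const x = i * h;
    const px = pPdf(x);
    if (px <= 0) continue;
    s += (i % 2 ? 4 : 2) * px * Math.log(px / qPdf(x));
  }
  return (s * h) / 3;
}

// Fragility = E_e[KL(post_e || current)] over the next observation e.
function fragility(alpha: number, beta: number): number {
  const pNext = alpha / (alpha + beta); // predictive prob of a supporting observation
  return (
    pNext * klBeta([alpha + 1, beta], [alpha, beta]) +
    (1 - pNext) * klBeta([alpha, beta + 1], [alpha, beta])
  );
}
```

Beta(7,3) and Beta(70,30) share the posterior mean 0.7, but the thin-evidence belief moves much further per observation.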
13. Trust overrides combine multiplicatively at fusion time
A user's beliefs.trust override on an agent and/or source contributes a scalar factor to the log-linear pool's weight:
- Agent factor: mean of the user-provided Beta sensitivity, replacing the engine's baseline for that agent.
- Source factor: mean of the user-provided Beta sensitivity for the source kind. When no source override is set, the factor is 1.0 (not the Jeffreys mean 0.5) — collapsing unset sources to Jeffreys would silently halve every source-tagged delta.
A locked override (lock: true) is treated as a delta function; the learned-posterior write-back never drifts it. Unlocked overrides serve as a starting prior; learned data continues to refine them via sequential Beta-Binomial updates.
Both the fusion (multi-delta) and single-delta apply paths honor the same override math: beliefs.add produces a posterior byte-equal to the one obtained by fusing the same delta in isolation.
Why it matters: trust knobs are predictable scalar products, not opaque heuristics. Setting confidence: 0.1, strength: 50 on an agent attenuates that agent's contributions by ~10× without affecting any other source.
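A sketch of the multiplicative factor described above, assuming the confidence/strength knobs parameterize Beta(c·s, (1−c)·s) and that the pool weight uses that Beta's mean (names are illustrative, not the SDK API):

```typescript
type TrustOverride = { confidence: number; strength: number; lock?: boolean };

// Mean of Beta(c*s, (1-c)*s): algebraically just c. Strength matters for how
// fast unlocked overrides drift under learned write-backs, not for the mean.
function overrideMean(o: TrustOverride): number {
  const alpha = o.confidence * o.strength;
  const beta = (1 - o.confidence) * o.strength;
  return alpha / (alpha + beta);
}

function poolWeight(
  baseline: number,
  agent?: TrustOverride,
  source?: TrustOverride,
): number {
  const agentFactor = agent ? overrideMean(agent) : baseline; // override replaces the baseline
  const sourceFactor = source ? overrideMean(source) : 1.0;   // unset source => 1.0, not Jeffreys 0.5
  return agentFactor * sourceFactor;
}
```

Under these assumptions, confidence: 0.1, strength: 50 on an agent yields an agent factor of 0.1, the ~10x attenuation described above.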
What's deliberately NOT promised
- Extraction LLM choice — the model behind after() and observe() may change between releases. Only the shape of the resulting BeliefDelta is contracted.
- Absolute confidence numbers across version bumps — calibration shifts on model swaps; the ECE gate keeps the quality bound, not the exact numbers.
- Cost or token usage — telemetry is intentionally not part of the public SDK contract until pricing is finalized.
- Internal IP exposure — calibration curves, NLI judge probabilities, log-linear pool internals, and reliability posteriors are not part of the SDK surface.
How these contracts are kept
The engine repo runs the L1 suite (evals/scenarios/L1*.test.ts) on every PR. Any breakage to a contract above is a release blocker. SDK releases that change behavior in a way that affects these promises require an explicit changelog note and a major-version bump.
If you find a case where SDK behavior appears to violate one of these contracts, file it as a P0.