This page lists what the beliefs engine promises to its callers — not what features it has, but what mathematical invariants you can rely on. Each promise is pinned by an integration test in the engine repo. A regression that breaks any of these is a P0 in the engine, not something your SDK code has to defend against.
The math itself (entropy, KL, log-linear pool, isotonic calibration) stays internal; this page documents the guarantees those internals provide.
1. Conjugate updates preserve the family
Beta beliefs (true/false) stay Beta. Dirichlet beliefs (categorical) stay Dirichlet. Gaussian beliefs (continuous) stay Gaussian. Noisy evidence is handled by moment-matching within the family — never by dropping conjugacy.
Why it matters: confidence numbers have a stable meaning across turns. The same belief at the same evidence count yields the same posterior.
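As a sketch of what family preservation looks like for a Beta belief (illustrative names, not the engine's internals; the mixture-then-moment-match step for noisy evidence is an assumption about how such updates are commonly implemented):

```typescript
type Beta = { alpha: number; beta: number };

// Exact conjugate update for a hard true/false observation.
function update(b: Beta, outcome: boolean): Beta {
  return outcome
    ? { alpha: b.alpha + 1, beta: b.beta }
    : { alpha: b.alpha, beta: b.beta + 1 };
}

// Noisy evidence: mix the two hard posteriors by the evidence probability,
// then moment-match the mixture's mean/variance back to a single Beta.
function noisyUpdate(b: Beta, pTrue: number): Beta {
  const post1 = update(b, true), post0 = update(b, false);
  const mean = (x: Beta) => x.alpha / (x.alpha + x.beta);
  const variance = (x: Beta) => {
    const n = x.alpha + x.beta;
    return (x.alpha * x.beta) / (n * n * (n + 1));
  };
  // Mixture moments
  const m = pTrue * mean(post1) + (1 - pTrue) * mean(post0);
  const v =
    pTrue * (variance(post1) + mean(post1) ** 2) +
    (1 - pTrue) * (variance(post0) + mean(post0) ** 2) -
    m * m;
  // Solve for the Beta(α, β) with matching mean m and variance v
  const n = (m * (1 - m)) / v - 1;
  return { alpha: m * n, beta: (1 - m) * n };
}
```

Note that pTrue = 1 recovers the hard update exactly, and fully ambiguous evidence (pTrue = 0.5 on a uniform prior) leaves the prior unchanged.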
2. Fusion is associative and order-independent
fuse([A, B, C]) == fuse([C, B, A]) == fuse([B, A, C]) — to 12 decimal places.
Why it matters: retrying a failed after() after a peer's write doesn't change the outcome. Multi-agent pipelines have no hidden ordering bug class.
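Order independence falls out naturally when fusion reduces each delta to additive sufficient statistics, as in this sketch (hypothetical shapes, not the SDK types):

```typescript
type Beta = { alpha: number; beta: number };
type BetaDelta = { dAlpha: number; dBeta: number };

function fuse(prior: Beta, deltas: BetaDelta[]): Beta {
  // Summing evidence counts is commutative and associative,
  // so the order of deltas cannot affect the result.
  return deltas.reduce(
    (acc, d) => ({ alpha: acc.alpha + d.dAlpha, beta: acc.beta + d.dBeta }),
    prior,
  );
}
```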
3. Multi-agent fusion is trust-weighted in closed form
When multiple agents submit conflicting evidence, the engine combines them as a log-linear pool weighted by each agent's reliability. The result is byte-exact to poolBeta([{α₁,β₁}, …], [w₁, …]). KL divergence between successive rounds shrinks monotonically — repeated fusion converges, never oscillates.
Why it matters: trust knobs translate directly into evidence weight. Downweighting an unreliable agent has the predictable effect on final confidence.
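For Beta components, a log-linear (geometric) pool of densities is itself Beta, with each component's evidence above the uniform prior scaled by its weight. A sketch of a poolBeta-style combiner (signature assumed from the notation above, not the engine's actual export):

```typescript
type Beta = { alpha: number; beta: number };

function poolBeta(components: Beta[], weights: number[]): Beta {
  let a = 1, b = 1; // uniform Beta(1,1) carries zero pooled evidence
  components.forEach((c, i) => {
    a += weights[i] * (c.alpha - 1);
    b += weights[i] * (c.beta - 1);
  });
  return { alpha: a, beta: b };
}
```

Downweighting a component by w scales its evidence contribution by exactly w, which is what makes trust knobs predictable.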
4. Temporal decay scales evidence linearly
Older evidence is downweighted by a freshness factor f ∈ (0, 1]. Decayed evidence at weight f contributes exactly f × (α-1, β-1) above prior — to within 1e-12.
Why it matters: "older = less weight" is honored exactly; no silent clamps that would let a stale claim still dominate a fresh one.
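The stated decay law is a one-liner over the evidence above a uniform Beta(1,1) prior (a sketch; the engine's choice of prior may differ):

```typescript
type Beta = { alpha: number; beta: number };

function decay(b: Beta, f: number): Beta {
  // The prior Beta(1,1) is the fixed point; only accumulated evidence shrinks.
  return { alpha: 1 + f * (b.alpha - 1), beta: 1 + f * (b.beta - 1) };
}
```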
5. Entropy decreases as evidence accumulates
For a belief receiving consistent supporting evidence, entropy is strictly monotonically decreasing. Inconsistent evidence raises entropy.
Why it matters: "more evidence ⇒ tighter posterior" is a mechanical guarantee. Confidence-trend UIs can trust the direction.
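The entropy claim can be checked numerically for integer-count Beta beliefs. This sketch computes Beta differential entropy by Simpson integration, using a factorial-based B(α, β) (so it is valid for integer counts only; it is a verification aid, not the engine's computation):

```typescript
function betaEntropy(alpha: number, beta: number): number {
  const fact = (n: number): number => (n <= 1 ? 1 : n * fact(n - 1));
  const B = (fact(alpha - 1) * fact(beta - 1)) / fact(alpha + beta - 1);
  const pdf = (x: number) => (x ** (alpha - 1) * (1 - x) ** (beta - 1)) / B;
  // Composite Simpson's rule for H = -∫ p(x) ln p(x) dx over [0, 1]
  const n = 10000, h = 1 / n;
  let s = 0;
  for (let i = 0; i <= n; i++) {
    const p = pdf(i * h);
    const g = p > 0 ? -p * Math.log(p) : 0; // p ln p -> 0 as p -> 0
    const w = i === 0 || i === n ? 1 : i % 2 ? 4 : 2;
    s += w * g;
  }
  return (s * h) / 3;
}
```

Consistent supporting evidence (all confirmations) drives the entropy strictly down from the uniform prior's zero.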
6. Deltas are idempotent
Replaying the same (idempotencyKey, scope) on add / after / observe yields the same state. Network retries are safe.
Why it matters: at-least-once delivery from queues, webhooks, or flaky networks doesn't double-count evidence.
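A minimal sketch of idempotency-key deduplication, assuming (idempotencyKey, scope) uniquely identifies one logical write (class and method names are illustrative, not the SDK API):

```typescript
type Beta = { alpha: number; beta: number };

class BeliefStore {
  private state: Beta = { alpha: 1, beta: 1 };
  private applied = new Set<string>();

  add(idempotencyKey: string, scope: string, dAlpha: number, dBeta: number): Beta {
    const key = `${scope}\u0000${idempotencyKey}`; // NUL-joined composite key
    if (!this.applied.has(key)) {
      this.applied.add(key);
      this.state = { alpha: this.state.alpha + dAlpha, beta: this.state.beta + dBeta };
    }
    // A replay falls through to here and returns the already-applied state.
    return this.state;
  }
}
```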
7. Supersession drops the superseded belief from active candidates
When belief B explicitly supersedes belief A, A is removed from the active candidate set. read() and list() no longer return A; trace() still surfaces it.
Why it matters: an agent updating its position on a claim doesn't leave the prior position competing for attention.
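A sketch of the supersession filtering described above (illustrative names only):

```typescript
type Belief = { id: string; supersededBy?: string };

class Candidates {
  constructor(private beliefs: Belief[]) {}

  supersede(oldId: string, newId: string): void {
    const old = this.beliefs.find((b) => b.id === oldId);
    if (old) old.supersededBy = newId;
  }

  list(): Belief[] {
    // Active candidates only: superseded beliefs drop out.
    return this.beliefs.filter((b) => b.supersededBy === undefined);
  }

  trace(): Belief[] {
    return this.beliefs; // full history, superseded entries included
  }
}
```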
8. Family isolation: writing one type doesn't contaminate others
Mixed Beta / Dirichlet / Gaussian states are safe to compose. Applying a delta to one family leaves the others byte-identical.
Why it matters: multi-modal world models are safe; a new categorical claim won't perturb an existing propositional one.
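One way to picture this guarantee: per-family states keyed by belief id, where a delta touches exactly one entry (a sketch with assumed shapes, not the engine's storage model):

```typescript
type BeliefState =
  | { kind: "beta"; alpha: number; beta: number }
  | { kind: "dirichlet"; counts: number[] }
  | { kind: "gaussian"; mean: number; variance: number };

function applyBetaDelta(
  world: Map<string, BeliefState>,
  id: string,
  dAlpha: number,
  dBeta: number,
): Map<string, BeliefState> {
  const next = new Map(world);
  const cur = next.get(id);
  if (cur?.kind !== "beta") throw new Error("family mismatch");
  next.set(id, { kind: "beta", alpha: cur.alpha + dAlpha, beta: cur.beta + dBeta });
  // Every other entry is the same untouched object, trivially byte-identical.
  return next;
}
```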
9. Empty deltas are the identity element
after('') or applying an empty BeliefDelta returns state unchanged, byte-identical. No-op writes are free.
Why it matters: defensive patterns ("call after every turn even if nothing happened") are zero-cost and zero-risk.
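In sufficient-statistics form the identity property is structural, as this sketch (hypothetical fuse shape) shows: folding an empty delta list returns the initial state itself, which is even stronger than byte-identical.

```typescript
type Beta = { alpha: number; beta: number };
type BetaDelta = { dAlpha: number; dBeta: number };

// Folding over zero deltas returns the prior by reference: a true no-op.
const fuse = (prior: Beta, deltas: BetaDelta[]): Beta =>
  deltas.reduce(
    (s, d) => ({ alpha: s.alpha + d.dAlpha, beta: s.beta + d.dBeta }),
    prior,
  );
```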
10. Calibration is gated in CI
The engine enforces an Expected Calibration Error gate: ECE ≤ 0.05. A change that pushes ECE above the gate fails CI in the engine repo and never ships.
Why it matters: when the SDK reports confidence: 'high', that label is calibrated against held-out outcomes — not heuristic thresholding of an underlying probability.
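For reference, this sketch implements the standard equal-width-bin definition of Expected Calibration Error; the engine's exact binning scheme is not public, so treat the bin count as an assumption:

```typescript
function ece(predictions: { p: number; outcome: boolean }[], bins = 10): number {
  const buckets = Array.from({ length: bins }, () => ({ n: 0, pSum: 0, hits: 0 }));
  for (const { p, outcome } of predictions) {
    const i = Math.min(bins - 1, Math.floor(p * bins));
    buckets[i].n += 1;
    buckets[i].pSum += p;
    buckets[i].hits += outcome ? 1 : 0;
  }
  // Weighted mean over bins of |average confidence - empirical accuracy|
  const total = predictions.length;
  let e = 0;
  for (const b of buckets) {
    if (b.n === 0) continue;
    e += (b.n / total) * Math.abs(b.pSum / b.n - b.hits / b.n);
  }
  return e;
}
```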
11. Real-Postgres parity with the in-memory test harness
The engine's pglite backend and a real Postgres backend produce byte-identical state for the same write sequence.
Why it matters: local test cycles are fast; production cycles are correct; the engine guarantees they agree.
12. Fragility reflects evidence thickness, not just point estimate
Two beliefs with the same posterior mean can have different fragility — derived as E_e[KL(post_e ∥ current)], which is higher when supporting evidence is thin.
Why it matters: "70% backed by 100 observations" is not the same signal as "70% backed by 2 observations," and the SDK surfaces the difference.
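The fragility definition can be sketched numerically for integer-count Beta beliefs: average the KL from each one-observation posterior back to the current belief, weighted by the predictive probability of that observation (a verification aid with factorial-based densities and Simpson integration, not the engine's computation):

```typescript
function betaPdf(a: number, b: number): (x: number) => number {
  const fact = (n: number): number => (n <= 1 ? 1 : n * fact(n - 1));
  const B = (fact(a - 1) * fact(b - 1)) / fact(a + b - 1);
  return (x) => (x ** (a - 1) * (1 - x) ** (b - 1)) / B;
}

// KL(Beta(p) || Beta(q)) by Simpson integration on the open interval;
// both integrands vanish at 0 and 1 for the α, β >= 2 cases below.
function klBeta(p: [number, number], q: [number, number]): number {
  const pPdf = betaPdf(p[0], p[1]), qPdf = betaPdf(q[0], q[1]);
  const n = 10000, h = 1 / n;
  let s = 0;
  for (let i = 1; i < n; i++) {
    const x = i * h;
    const px = pPdf(x);
    if (px <= 0) continue;
    s += (i % 2 ? 4 : 2) * px * Math.log(px / qPdf(x));
  }
  return (s * h) / 3;
}

// Fragility = E_e[KL(post_e || current)] over the next observation e.
function fragility(alpha: number, beta: number): number {
  const pNext = alpha / (alpha + beta); // predictive prob of a supporting observation
  return (
    pNext * klBeta([alpha + 1, beta], [alpha, beta]) +
    (1 - pNext) * klBeta([alpha, beta + 1], [alpha, beta])
  );
}
```

Beta(7,3) and Beta(70,30) share the posterior mean 0.7, but the thin-evidence belief moves much further per observation.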
13. Trust overrides combine multiplicatively at fusion time
A user's beliefs.trust override on an agent and/or source contributes a scalar factor to the log-linear pool's weight:
- Agent factor: mean of the user-provided Beta sensitivity, replacing the engine's baseline for that agent.
- Source factor: mean of the user-provided Beta sensitivity for the source kind. When no source override is set, the factor is 1.0 (not the Jeffreys mean 0.5) — collapsing unset sources to Jeffreys would silently halve every source-tagged delta.
A locked override (lock: true) is treated as a delta function; the learned-posterior write-back never drifts it. Unlocked overrides serve as a starting prior; learned data continues to refine them via sequential Beta-Binomial updates.
Both the fusion (multi-delta) and single-delta apply paths honor the same override math: beliefs.add produces a posterior byte-equal to the one obtained by fusing the same delta in isolation.
Why it matters: trust knobs are predictable scalar products, not opaque heuristics. Setting confidence: 0.1, strength: 50 on an agent attenuates that agent's contributions by ~10× without affecting any other source.
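A sketch of the multiplicative factor described above, assuming the confidence/strength knobs parameterize Beta(c·s, (1−c)·s) and that the pool weight uses that Beta's mean (names are illustrative, not the SDK API):

```typescript
type TrustOverride = { confidence: number; strength: number; lock?: boolean };

// Mean of Beta(c*s, (1-c)*s): algebraically just c. Strength matters for how
// fast unlocked overrides drift under learned write-backs, not for the mean.
function overrideMean(o: TrustOverride): number {
  const alpha = o.confidence * o.strength;
  const beta = (1 - o.confidence) * o.strength;
  return alpha / (alpha + beta);
}

function poolWeight(
  baseline: number,
  agent?: TrustOverride,
  source?: TrustOverride,
): number {
  const agentFactor = agent ? overrideMean(agent) : baseline; // override replaces the baseline
  const sourceFactor = source ? overrideMean(source) : 1.0;   // unset source => 1.0, not Jeffreys 0.5
  return agentFactor * sourceFactor;
}
```

Under these assumptions, confidence: 0.1, strength: 50 on an agent yields an agent factor of 0.1, the ~10x attenuation described above.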
What's deliberately NOT promised
- Extraction LLM choice — the model behind after() and observe() may change between releases. Only the shape of the resulting BeliefDelta is contracted.
- Absolute confidence numbers across version bumps — calibration shifts on model swaps; the ECE gate keeps the quality bound, not the exact numbers.
- Cost or token usage — telemetry is intentionally not part of the public SDK contract until pricing is finalized.
- Internal IP exposure — calibration curves, NLI judge probabilities, log-linear pool internals, and reliability posteriors are not part of the SDK surface.
How these contracts are kept
The engine repo runs the L1 suite (evals/scenarios/L1*.test.ts) on every PR. Any breakage to a contract above is a release blocker. SDK releases that change behavior in a way that affects these promises require an explicit changelog note and a major-version bump.
If you find a case where SDK behavior appears to violate one of these contracts, file it as a P0.