Calibration loop
How wrong is our backtest? We measure it.
Every night, the adaptive engine compares its backtest predictions against actual forward-test results for every strategy with 8 or more live trades. When a gap shows up, we diagnose it (slippage, entry timing, exit execution, regime mismatch, stale data) and update the model. Every adjustment is logged here.
Most backtesters do not publish this. The published numbers tend to come from the backtest, not from how the backtest compares to live execution. We do both and publish the gap.
Live calibration state
4 gap measurements logged · tracking since 4/23/2026
Entry delay (bars)
Number of bars added to entry timing in backtests to simulate the cron-cycle delay between signal and real-money execution. 0 = no delay needed.
Slippage multiplier
Ratio of real-world slippage to the slippage the backtest assumed. 1.0 = backtest matches reality. >1.0 = real slippage is worse than the backtest assumed (the backtest was too optimistic).
Stop overshoot %
Average distance by which real stop fills overshoot the stop level, as a % of price. Used to calibrate stop-loss assumptions in backtests so reported P&L matches reality.
TP tightening factor
How much we tighten the take-profit threshold in backtests to match real-world execution. 1.0 = no adjustment needed. >1.0 = real fills happen later than backtest assumed.
Win-rate offset
Forward-test win rate minus backtest predicted win rate, averaged across calibrated genes. Positive = forward beats backtest. Negative = forward underperforms backtest predictions.
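Two of these parameters fall straight out of paired backtest/forward statistics. A minimal sketch of the arithmetic, with hypothetical type and field names (the engine's actual code is not shown on this page):

```typescript
// Hypothetical shapes; the real engine's types live elsewhere.
interface StratStats {
  avgFillSlippagePct: number; // average adverse fill deviation, % of price
  winRate: number;            // fraction of winning trades, 0..1
}

// Slippage multiplier: real slippage divided by what the backtest assumed.
// 1.0 means the backtest already matches reality; >1.0 means it was optimistic.
function slippageMultiplier(backtest: StratStats, forward: StratStats): number {
  return forward.avgFillSlippagePct / backtest.avgFillSlippagePct;
}

// Win-rate offset: forward minus backtest win rate, averaged across genes.
// Positive means forward trading beats the backtest's prediction.
function winRateOffset(pairs: Array<{ backtest: StratStats; forward: StratStats }>): number {
  const sum = pairs.reduce((acc, p) => acc + (p.forward.winRate - p.backtest.winRate), 0);
  return sum / pairs.length;
}
```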
Performance by context
Where the engine actually performs vs where it does not. Sliced from 35 live v2 trades (overall avg +0.65% per trade). The CI-low column is the pessimistic edge of a 90% confidence interval on the slice's average. Small slices with high variance show a negative CI low even when the average is positive, which is the correct read. As the dataset grows, per-slice gap-vs-backtest calibration becomes available.
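A CI-low column like this can be reproduced from per-trade returns with a normal-approximation interval. A sketch assuming a two-sided 90% interval (z ≈ 1.645) and the sample standard deviation; the site's exact method may differ:

```typescript
// Lower edge of a 90% confidence interval on the mean per-trade return.
// Normal approximation: mean - z * (sampleStdDev / sqrt(n)).
function ciLow(returnsPct: number[], z = 1.645): number {
  const n = returnsPct.length;
  const mean = returnsPct.reduce((a, b) => a + b, 0) / n;
  if (n < 2) return mean; // no variance estimate from a single trade
  const variance = returnsPct.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  return mean - z * Math.sqrt(variance / n);
}
```

With n = 1 there is no variance estimate, consistent with a single-trade slice showing a CI low equal to its average.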
By coin
| Slice | N | Avg | CI low |
|---|---|---|---|
| ETH | 16 | +0.19% | -0.33% |
| BTC | 7 | +1.23% | +0.35% |
| XRP | 7 | +0.32% | -0.11% |
| BNB | 3 | +1.70% | +0.45% |
| SOL | 2 | +1.85% | -0.79% |
By 1h regime
| Slice | N | Avg | CI low |
|---|---|---|---|
| ranging high vol | 12 | +1.25% | +0.63% |
| strong trend med vol | 6 | -0.12% | -0.28% |
| ranging low vol | 5 | +0.70% | +0.20% |
| unlabeled | 4 | +2.02% | +0.77% |
| weak trend med vol | 3 | +0.70% | +0.70% |
| trending high vol | 2 | -1.35% | -1.54% |
| weak trend low vol | 2 | -0.37% | -0.54% |
| trending low vol | 1 | -1.83% | -1.83% |
By timeframe
| Slice | N | Avg | CI low |
|---|---|---|---|
| 5m | 19 | +1.23% | +0.89% |
| 1h | 9 | +0.03% | -1.04% |
| 15m | 7 | -0.14% | -0.28% |
Predictions tracked across the product
The calibration loop above is one self-correction system. We are extending the same pattern (claim, observe, gap, diagnose, act) to every system that makes a prediction or claim: the /prove verdict, the AI verdict text, the reasoning narratives, the reliability score, and so on. Each row is a domain we are now instrumenting publicly. Resolution takes time: most loops have a 30+ day window between claim and outcome, so most rows below will show 0 resolved for the first month.
| Domain | Total | Resolved | Pending | Earliest pending | Next resolves |
|---|---|---|---|---|---|
| prove_verdict | 2 | 0 | 2 | 4/25/2026 | 5/25/2026 |
| ai_verdict_text | 1 | 0 | 1 | 4/25/2026 | — |
Schema, helper library, and full domain list at src/lib/predictions.ts in the codebase. Adding a new self-correction loop is a single recordPrediction() call.
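For illustration only, a minimal stand-in for what such a helper might look like. The actual signature and schema live in src/lib/predictions.ts and may differ; every name below is hypothetical:

```typescript
// Hypothetical stand-in for the real helper in src/lib/predictions.ts.
// Field names and the in-memory store are illustrative, not the real schema.
interface Prediction {
  domain: string;            // e.g. "prove_verdict"
  claim: string;             // what the system asserted
  madeAt: Date;
  resolvesAfterDays: number; // observation window before the claim can be scored
  outcome?: "correct" | "incorrect";
}

const ledger: Prediction[] = [];

// One call per claim; scoring happens later, once the window has passed.
function recordPrediction(p: Prediction): void {
  ledger.push(p);
}

// Rows still awaiting an outcome, as in the table's Pending column.
function pending(): Prediction[] {
  return ledger.filter((p) => p.outcome === undefined);
}
```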
Recent gap measurements
Across 9 forward trades, live P&L came in 0.704 percentage points below the backtest, with a 33% win rate (vs 89% predicted). Diagnosed as slippage drift: the backtest was too optimistic on fill prices. Action: adjust_backtest.
10 forward trades show a small gap (+0.964% P&L, +13.4pp win rate). Within noise; no calibration adjustment needed.
8 forward trades show a small gap (+0.966% P&L, +13.4pp win rate). Within noise; no calibration adjustment needed.
Methodology
Trigger. The calibrator runs at 02:00 UTC daily. For every active gene with 8 or more forward-test trades, it pulls the original backtest result and the live trade history and computes the gap on five dimensions: average P&L per trade, win rate, take-profit hit rate, time-exit rate, and average bars held.
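The five-dimension comparison could be sketched as follows (field names are assumptions; the real calibrator's schema is not published here):

```typescript
// Per-strategy summary on the five compared dimensions.
interface StrategyStats {
  avgPnlPct: number;    // average P&L per trade, %
  winRate: number;      // 0..1
  tpHitRate: number;    // fraction of trades exiting at take-profit
  timeExitRate: number; // fraction of trades closed by the time stop
  avgBarsHeld: number;
}

// Gap = forward minus backtest, dimension by dimension.
function computeGap(backtest: StrategyStats, forward: StrategyStats): StrategyStats {
  return {
    avgPnlPct: forward.avgPnlPct - backtest.avgPnlPct,
    winRate: forward.winRate - backtest.winRate,
    tpHitRate: forward.tpHitRate - backtest.tpHitRate,
    timeExitRate: forward.timeExitRate - backtest.timeExitRate,
    avgBarsHeld: forward.avgBarsHeld - backtest.avgBarsHeld,
  };
}
```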
Diagnosis. Gaps are categorized by signature. A P&L drop with a similar win rate suggests slippage drift. A win-rate drop suggests entry-timing problems (cron delay, gate misfire). A lower TP hit rate suggests exit-execution lag. A pattern mismatch with the current regime suggests the backtest tested a different market structure than the one now trading.
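As a sketch, signature-based triage might look like this. The thresholds and the exact precedence between signatures are invented for illustration; the real calibrator's cutoffs are not published:

```typescript
type Diagnosis =
  | "within_noise"     // gap too small to act on
  | "slippage_drift"   // P&L down, win rate roughly intact
  | "exit_execution"   // TP hit rate down
  | "entry_timing"     // win rate down
  | "regime_mismatch"; // none of the above patterns fit

interface GapSignature {
  pnlGapPct: number;     // forward minus backtest avg P&L, %
  winRateGapPp: number;  // forward minus backtest win rate, percentage points
  tpHitRateGapPp: number;
}

// Invented thresholds: 0.25% P&L noise band, 5pp for rate dimensions.
function diagnose(g: GapSignature): Diagnosis {
  const noise = 0.25;
  if (g.pnlGapPct > -noise) return "within_noise";
  if (Math.abs(g.winRateGapPp) < 5) return "slippage_drift"; // P&L down, win rate intact
  if (g.tpHitRateGapPp <= -5) return "exit_execution";       // TPs stopped hitting
  if (g.winRateGapPp <= -5) return "entry_timing";           // win rate collapsed
  return "regime_mismatch";
}
```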
Action. Slippage drift updates the global slippage multiplier, which feeds the next round of backtests. Entry-timing issues add entry-delay bars to backtest assumptions. Regime mismatches narrow the gene's activation window rather than adjusting the global model.
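The action step, sketched: each diagnosis maps to a calibration-state update. The state fields mirror the dashboard above, but the update magnitudes and shapes are invented:

```typescript
// Mirrors the dashboard's calibration state; update sizes are illustrative.
interface CalibrationState {
  entryDelayBars: number;
  slippageMultiplier: number;
  // Regime mismatches narrow a gene's activation window instead of touching
  // global state, so they are represented here as a per-gene flag.
  narrowedGenes: Set<string>;
}

function applyAction(
  state: CalibrationState,
  diagnosis: "slippage_drift" | "entry_timing" | "regime_mismatch" | "within_noise",
  geneId: string
): CalibrationState {
  switch (diagnosis) {
    case "slippage_drift": // feed worse fills into the next round of backtests
      return { ...state, slippageMultiplier: state.slippageMultiplier * 1.05 };
    case "entry_timing":   // simulate the cron delay between signal and fill
      return { ...state, entryDelayBars: state.entryDelayBars + 1 };
    case "regime_mismatch": // narrow activation window, leave globals alone
      return { ...state, narrowedGenes: new Set([...state.narrowedGenes, geneId]) };
    default:
      return state; // within noise: log it, take no action
  }
}
```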
Honesty principle. When the gap is positive (forward beats backtest, as in the recent RSI(7) entries), we log it as a minor gap and take no action. Being right by accident is still being miscalibrated. The same scrutiny applies in both directions.
See how the calibrated engine actually trades: