Calibration loop
How wrong is our backtest? We measure it.
Every night, the adaptive engine compares its backtest predictions against actual forward-test results for every strategy with 8 or more live trades. When a gap shows up, we diagnose it (slippage, entry timing, exit execution, regime mismatch, stale data) and update the model. Every adjustment is logged here.
Most backtesters do not publish this. The published numbers tend to come from the backtest, not from how the backtest compares to live execution. We do both and publish the gap.
Live calibration state
4 gap measurements logged · tracking since 4/23/2026
Entry delay (bars)
Number of bars added to entry timing in backtests to simulate the cron-cycle delay between signal and real-money execution. 0 = no delay needed.
Slippage multiplier
Ratio of real-world slippage to the slippage the backtest assumed. 1.0 = backtest matches reality. >1.0 = real slippage is worse than the backtest assumed (the backtest was too optimistic).
Stop overshoot %
Average distance by which real stop fills overshoot the stop level, as a % of price. Used to calibrate stop-loss assumptions in backtests so reported P&L matches reality.
TP tightening factor
How much we tighten the take-profit threshold in backtests to match real-world execution. 1.0 = no adjustment needed. >1.0 = real fills happen later than backtest assumed.
Win-rate offset
Forward-test win rate minus backtest predicted win rate, averaged across calibrated genes. Positive = forward beats backtest. Negative = forward underperforms backtest predictions.
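Two of these parameters fall straight out of paired backtest/forward statistics. A minimal sketch of the arithmetic, with hypothetical type and field names (the engine's actual code is not shown on this page):

```typescript
// Hypothetical shapes; the real engine's types live elsewhere.
interface StratStats {
  avgFillSlippagePct: number; // average adverse fill deviation, % of price
  winRate: number;            // fraction of winning trades, 0..1
}

// Slippage multiplier: real slippage divided by what the backtest assumed.
// 1.0 means the backtest already matches reality; >1.0 means it was optimistic.
function slippageMultiplier(backtest: StratStats, forward: StratStats): number {
  return forward.avgFillSlippagePct / backtest.avgFillSlippagePct;
}

// Win-rate offset: forward minus backtest win rate, averaged across genes.
// Positive means forward trading beats the backtest's prediction.
function winRateOffset(pairs: Array<{ backtest: StratStats; forward: StratStats }>): number {
  const sum = pairs.reduce((acc, p) => acc + (p.forward.winRate - p.backtest.winRate), 0);
  return sum / pairs.length;
}
```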
Performance by context
Where the engine actually performs vs where it does not. Sliced from 35 live v2 trades (overall avg +0.65% per trade). The CI-low column is the pessimistic edge of a 90% confidence interval on the slice's average. Small slices with high variance show a negative CI low even when the average is positive, which is the correct read. As the dataset grows, per-slice gap-vs-backtest calibration becomes available.
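A CI-low column like this can be reproduced from per-trade returns with a normal-approximation interval. A sketch assuming a two-sided 90% interval (z ≈ 1.645) and the sample standard deviation; the site's exact method may differ:

```typescript
// Lower edge of a 90% confidence interval on the mean per-trade return.
// Normal approximation: mean - z * (sampleStdDev / sqrt(n)).
function ciLow(returnsPct: number[], z = 1.645): number {
  const n = returnsPct.length;
  const mean = returnsPct.reduce((a, b) => a + b, 0) / n;
  if (n < 2) return mean; // no variance estimate from a single trade
  const variance = returnsPct.reduce((a, b) => a + (b - mean) ** 2, 0) / (n - 1);
  return mean - z * Math.sqrt(variance / n);
}
```

With n = 1 there is no variance estimate, consistent with a single-trade slice showing a CI low equal to its average.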
By coin
| Slice | N | Avg | CI low |
|---|---|---|---|
| ETH | 16 | +0.19% | -0.33% |
| BTC | 7 | +1.23% | +0.35% |
| XRP | 7 | +0.32% | -0.11% |
| BNB | 3 | +1.70% | +0.45% |
| SOL | 2 | +1.85% | -0.79% |
By 1h regime
| Slice | N | Avg | CI low |
|---|---|---|---|
| ranging high vol | 12 | +1.25% | +0.63% |
| strong trend med vol | 6 | -0.12% | -0.28% |
| ranging low vol | 5 | +0.70% | +0.20% |
| unlabeled | 4 | +2.02% | +0.77% |
| weak trend med vol | 3 | +0.70% | +0.70% |
| trending high vol | 2 | -1.35% | -1.54% |
| weak trend low vol | 2 | -0.37% | -0.54% |
| trending low vol | 1 | -1.83% | -1.83% |
By timeframe
| Slice | N | Avg | CI low |
|---|---|---|---|
| 5m | 19 | +1.23% | +0.89% |
| 1h | 9 | +0.03% | -1.04% |
| 15m | 7 | -0.14% | -0.28% |
Predictions tracked across the product
The calibration loop above is one self-correction system. We are extending the same pattern (claim, observe, gap, diagnose, act) to every system that makes a prediction or claim: the /prove verdict, the AI verdict text, the reasoning narratives, the reliability score, and so on. Each row is a domain we are now instrumenting publicly. Resolution takes time: most loops have a 30+ day window between claim and outcome, so most rows below will show 0 resolved for the first month.
| Domain | Total | Resolved | Pending | Earliest pending | Next resolves |
|---|---|---|---|---|---|
| prove_verdict | 2 | 0 | 2 | 4/25/2026 | 5/25/2026 |
| ai_verdict_text | 1 | 0 | 1 | 4/25/2026 | — |
Schema, helper library, and full domain list at src/lib/predictions.ts in the codebase. Adding a new self-correction loop is a single recordPrediction() call.
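For illustration only, a minimal stand-in for what such a helper might look like. The actual signature and schema live in src/lib/predictions.ts and may differ; every name below is hypothetical:

```typescript
// Hypothetical stand-in for the real helper in src/lib/predictions.ts.
// Field names and the in-memory store are illustrative, not the real schema.
interface Prediction {
  domain: string;            // e.g. "prove_verdict"
  claim: string;             // what the system asserted
  madeAt: Date;
  resolvesAfterDays: number; // observation window before the claim can be scored
  outcome?: "correct" | "incorrect";
}

const ledger: Prediction[] = [];

// One call per claim; scoring happens later, once the window has passed.
function recordPrediction(p: Prediction): void {
  ledger.push(p);
}

// Rows still awaiting an outcome, as in the table's Pending column.
function pending(): Prediction[] {
  return ledger.filter((p) => p.outcome === undefined);
}
```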
Recent gap measurements
Across 9 forward trades, live P&L came in 0.704 percentage points below the backtest, with a 33% win rate (vs 89% predicted). Diagnosed as slippage drift: the backtest was too optimistic on fill prices. Action: adjust_backtest.
10 forward trades show a small gap (+0.964% P&L, +13.4pp win rate). Within noise; no calibration adjustment needed.
8 forward trades show a small gap (+0.966% P&L, +13.4pp win rate). Within noise; no calibration adjustment needed.
Methodology
Trigger. The calibrator runs at 02:00 UTC daily. For every active gene with 8 or more forward-test trades, it pulls the original backtest result and the live trade history and computes the gap on five dimensions: average P&L per trade, win rate, take-profit hit rate, time-exit rate, and average bars held.
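The five-dimension comparison could be sketched as follows (field names are assumptions; the real calibrator's schema is not published here):

```typescript
// Per-strategy summary on the five compared dimensions.
interface StrategyStats {
  avgPnlPct: number;    // average P&L per trade, %
  winRate: number;      // 0..1
  tpHitRate: number;    // fraction of trades exiting at take-profit
  timeExitRate: number; // fraction of trades closed by the time stop
  avgBarsHeld: number;
}

// Gap = forward minus backtest, dimension by dimension.
function computeGap(backtest: StrategyStats, forward: StrategyStats): StrategyStats {
  return {
    avgPnlPct: forward.avgPnlPct - backtest.avgPnlPct,
    winRate: forward.winRate - backtest.winRate,
    tpHitRate: forward.tpHitRate - backtest.tpHitRate,
    timeExitRate: forward.timeExitRate - backtest.timeExitRate,
    avgBarsHeld: forward.avgBarsHeld - backtest.avgBarsHeld,
  };
}
```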
Diagnosis. Gaps are categorized by signature. A P&L drop with a similar win rate suggests slippage drift. A win-rate drop suggests entry-timing problems (cron delay, gate misfire). A lower TP hit rate suggests exit-execution lag. A pattern mismatch with the current regime suggests the backtest tested a different market structure than the one now trading.
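As a sketch, signature-based triage might look like this. The thresholds and the exact precedence between signatures are invented for illustration; the real calibrator's cutoffs are not published:

```typescript
type Diagnosis =
  | "within_noise"     // gap too small to act on
  | "slippage_drift"   // P&L down, win rate roughly intact
  | "exit_execution"   // TP hit rate down
  | "entry_timing"     // win rate down
  | "regime_mismatch"; // none of the above patterns fit

interface GapSignature {
  pnlGapPct: number;     // forward minus backtest avg P&L, %
  winRateGapPp: number;  // forward minus backtest win rate, percentage points
  tpHitRateGapPp: number;
}

// Invented thresholds: 0.25% P&L noise band, 5pp for rate dimensions.
function diagnose(g: GapSignature): Diagnosis {
  const noise = 0.25;
  if (g.pnlGapPct > -noise) return "within_noise";
  if (Math.abs(g.winRateGapPp) < 5) return "slippage_drift"; // P&L down, win rate intact
  if (g.tpHitRateGapPp <= -5) return "exit_execution";       // TPs stopped hitting
  if (g.winRateGapPp <= -5) return "entry_timing";           // win rate collapsed
  return "regime_mismatch";
}
```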
Action. Slippage drift updates the global slippage multiplier, which feeds the next round of backtests. Entry-timing issues add entry-delay bars to backtest assumptions. Regime mismatches narrow the gene's activation window rather than adjusting the global model.
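The action step, sketched: each diagnosis maps to a calibration-state update. The state fields mirror the dashboard above, but the update magnitudes and shapes are invented:

```typescript
// Mirrors the dashboard's calibration state; update sizes are illustrative.
interface CalibrationState {
  entryDelayBars: number;
  slippageMultiplier: number;
  // Regime mismatches narrow a gene's activation window instead of touching
  // global state, so they are represented here as a per-gene flag.
  narrowedGenes: Set<string>;
}

function applyAction(
  state: CalibrationState,
  diagnosis: "slippage_drift" | "entry_timing" | "regime_mismatch" | "within_noise",
  geneId: string
): CalibrationState {
  switch (diagnosis) {
    case "slippage_drift": // feed worse fills into the next round of backtests
      return { ...state, slippageMultiplier: state.slippageMultiplier * 1.05 };
    case "entry_timing":   // simulate the cron delay between signal and fill
      return { ...state, entryDelayBars: state.entryDelayBars + 1 };
    case "regime_mismatch": // narrow activation window, leave globals alone
      return { ...state, narrowedGenes: new Set([...state.narrowedGenes, geneId]) };
    default:
      return state; // within noise: log it, take no action
  }
}
```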
Honesty principle. When the gap is positive (forward beats backtest, as in the recent RSI(7) entries), we log it as a minor gap and take no action. Being right by accident is still being miscalibrated. The same scrutiny applies in both directions.
See how the calibrated engine actually trades: