Methodology

The six gates every strategy passes.

This is the canonical source of truth for how we validate. Other pages reference numbers from here. If anything elsewhere on the site contradicts this page, this page wins.

The six gates

Gate 1

Walk-forward train/test split

Train on the first 60% of the available bar history, validate on the held-out 40%, no peeking. Catches time-period overfit. Some engines use 50/50 when the strategy trades very rarely (we need more out-of-sample bars to reach statistical power) or 70/30 when it trades very often (we can afford a larger train window). The per-engine split is stated on each engine page; 60/40 is the default.
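A minimal sketch of the default split, assuming the bars are already sorted oldest to newest. The function name is illustrative, not the lab's actual code.

```python
from typing import Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(
    bars: Sequence[T], train_fraction: float = 0.60
) -> Tuple[Sequence[T], Sequence[T]]:
    """Split time-ordered bars into a train prefix and a held-out test suffix.

    No shuffling: every test bar is strictly later than every train bar,
    which is what keeps the test window free of look-ahead.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = round(len(bars) * train_fraction)
    return bars[:cut], bars[cut:]


# Example: 1,000 bars under the default 60/40 split.
bars = list(range(1000))  # stand-in for real bar data
train, test = chronological_split(bars)
```

Passing `train_fraction=0.70` or `0.50` gives the alternative splits.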

Gate 2

Deflated Sharpe Ratio (Bailey/López de Prado)

Adjusts the observed Sharpe for the number of candidates tested. The raw Sharpe of the best of 10,000 strategies tested is meaningless without this correction. A strategy passes when its DSR fluke probability is below 50%.
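A sketch of the correction, following the published Bailey/López de Prado formula. It makes several assumptions: the Sharpe is per-period (not annualized), `sharpe_variance` is the variance of Sharpe estimates across all candidates tested, and "fluke probability" means 1 minus the DSR. The function names are illustrative, and the lab's implementation may differ.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, sharpe_variance: float) -> float:
    """Expected best Sharpe among n_trials candidates that have no real edge."""
    if n_trials < 2:
        raise ValueError("need at least 2 trials for the correction to apply")
    a = _STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(
    sharpe: float,
    n_obs: int,
    n_trials: int,
    sharpe_variance: float,
    skew: float = 0.0,
    kurtosis: float = 3.0,
) -> float:
    """Probability that the true Sharpe exceeds the best-of-N noise benchmark."""
    sr0 = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe**2)
    z = (sharpe - sr0) * math.sqrt(n_obs - 1) / denom
    return _STD_NORMAL.cdf(z)


def fluke_probability(sharpe, n_obs, n_trials, sharpe_variance, **moments) -> float:
    """Assumed definition: 1 - DSR."""
    return 1.0 - deflated_sharpe(sharpe, n_obs, n_trials, sharpe_variance, **moments)


def passes_gate(sharpe, n_obs, n_trials, sharpe_variance, **moments) -> bool:
    """Gate 2 as stated above: fluke probability below 50%."""
    return fluke_probability(sharpe, n_obs, n_trials, sharpe_variance, **moments) < 0.5
```

Under this definition, a strategy clears the 50% threshold exactly when its observed Sharpe beats the expected maximum Sharpe of the candidates under the null. The same Sharpe therefore becomes harder to pass as the number of candidates grows.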

Gate 3

Per-fill L2 cost modeling

Real Binance maker/taker fees, real per-coin spreads sampled continuously from live L2 order books, and slippage calibrated against forward-test results. Not OHLCV-only. Most retail backtesters skip this gate entirely.
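As a rough illustration of what "per-fill" means here, the sketch below charges each fill a fee, half the quoted spread when it crosses the book, and a slippage term. The `Fill` type, the half-spread convention, and every number in the example are assumptions for illustration. They are not Binance's actual fee schedule or the lab's calibrated values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fill:
    notional: float      # fill size in quote currency
    is_taker: bool       # True if the order crossed the spread
    spread_bps: float    # quoted bid-ask spread at fill time, in basis points
    slippage_bps: float  # extra adverse price movement, in basis points


def fill_cost(fill: Fill, maker_fee_bps: float, taker_fee_bps: float) -> float:
    """Estimated cost of one fill in quote currency.

    Assumption: a taker pays half the spread relative to mid,
    while a maker rests on the book and pays no spread.
    """
    fee_bps = taker_fee_bps if fill.is_taker else maker_fee_bps
    spread_cost_bps = fill.spread_bps / 2.0 if fill.is_taker else 0.0
    total_bps = fee_bps + spread_cost_bps + fill.slippage_bps
    return fill.notional * total_bps / 10_000.0


# Placeholder numbers only: a 10,000 taker fill with a 4 bps spread and 1 bps slippage.
example = Fill(notional=10_000.0, is_taker=True, spread_bps=4.0, slippage_bps=1.0)
cost = fill_cost(example, maker_fee_bps=2.0, taker_fee_bps=5.0)
```

An OHLCV-only backtest has no `spread_bps` to feed in, which is why it tends to understate costs on thinner coins.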

Gate 4

Cross-coin transfer test

An edge that only works on one coin probably is not an edge. Strategies must show consistent behavior across BTC, ETH, SOL, and at least 7 other majors to graduate. Single-coin edges are flagged as candidates for further investigation, not graduated.

Gate 5

Frozen out-of-sample holdout window

A specific recent window the lab has never trained on. Genes that pass walk-forward then face this last test before graduation. If the holdout window result diverges from the walk-forward result, the gene is retired.

Gate 6

Regime-stratified consistency

No single market regime can blow the strategy up worse than 1% per trade. Catches the “wins one regime, dies in another” pattern that a global Sharpe number hides.
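One way to read this gate is sketched below. It assumes "worse than 1% per trade" means an average per-trade return below -1% within a regime, and that each trade already carries a regime label. The labels, the function name, and the sample trades are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def regime_check(
    trades: Iterable[Tuple[str, float]], floor: float = -0.01
) -> Tuple[Dict[str, float], List[str]]:
    """Group per-trade returns by regime and list regimes that breach the floor.

    trades: (regime_label, fractional_return) pairs, e.g. ("bear", -0.004).
    Returns the average return per regime and the sorted failing regimes.
    """
    by_regime: Dict[str, List[float]] = defaultdict(list)
    for regime, ret in trades:
        by_regime[regime].append(ret)
    averages = {regime: mean(rets) for regime, rets in by_regime.items()}
    failing = sorted(r for r, avg in averages.items() if avg < floor)
    return averages, failing


# Made-up trades: profitable overall, but the "chop" regime averages -2% per trade.
sample = [
    ("bull", 0.02), ("bull", 0.01), ("bull", 0.03),
    ("chop", -0.03), ("chop", -0.01),
    ("bear", -0.005),
]
averages, failing = regime_check(sample)
```

In this sample the pooled average return is positive, yet `failing` contains `"chop"`. That is the pattern a single global number hides.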

Why per-engine train/test splits vary

Different strategies need different splits for the same reason: statistical power. A strategy that trades 200 times a year needs less calendar time to reach a statistically meaningful sample than one that trades 5 times a year. Forcing every engine to use the same ratio would be a fiction that hid real per-engine differences.

60/40 — default for medium-frequency strategies (~30-100 trades/yr).
70/30 — used for high-frequency strategies (200+ trades/yr) where the train window can afford to be larger without losing OOS power.
50/50 — used for low-frequency strategies (<15 trades/yr) where we need maximum OOS calendar time.
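The mapping above can be sketched as a small lookup. The trade-frequency bands listed leave gaps (15 to 30 and 100 to 200 trades/yr); this sketch assumes those fall back to the 60/40 default, which is a guess and not a stated rule. The helper names are illustrative.

```python
def default_train_fraction(trades_per_year: float) -> float:
    """Pick a train fraction from expected trade frequency."""
    if trades_per_year >= 200:
        return 0.70  # high frequency: OOS sample fills up quickly
    if trades_per_year < 15:
        return 0.50  # low frequency: maximize OOS calendar time
    return 0.60      # default, also assumed for the unlisted bands


def expected_oos_trades(trades_per_year: float, history_years: float) -> float:
    """Rough count of trades landing in the out-of-sample window."""
    oos_fraction = 1.0 - default_train_fraction(trades_per_year)
    return trades_per_year * history_years * oos_fraction


# A 5 trades/yr strategy with 6 years of history: 50/50 leaves about 15 OOS trades.
low_freq_oos = expected_oos_trades(5, 6.0)
```

The same 6 years at 200 trades/yr leaves roughly 360 out-of-sample trades even at 70/30, so the high-frequency engine can afford the larger train window.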

Each engine page declares its specific split. If a split quoted elsewhere on the site conflicts with the engine page, the engine page is canonical.

Live calibration

The gates above are static rules. Live forward trades are how we tell whether the rules are actually catching what they're supposed to catch. Every night the calibrator compares backtest predictions against forward-test reality on five dimensions (avg P&L per trade, win rate, take-profit hit rate, time-exit rate, average bars held).
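A minimal sketch of the nightly comparison, assuming both sides have been summarized into the same five statistics. The `TradeStats` type, its field names, and the sample values are illustrative, and the real calibrator may weight or threshold these gaps differently.

```python
from dataclasses import dataclass, fields
from typing import Dict


@dataclass(frozen=True)
class TradeStats:
    avg_pnl_per_trade: float
    win_rate: float
    take_profit_hit_rate: float
    time_exit_rate: float
    avg_bars_held: float


def calibration_gaps(backtest: TradeStats, forward: TradeStats) -> Dict[str, float]:
    """Forward-test value minus backtest prediction, per dimension.

    A negative avg_pnl_per_trade gap means live results are worse than predicted.
    """
    return {
        f.name: getattr(forward, f.name) - getattr(backtest, f.name)
        for f in fields(TradeStats)
    }


# Made-up numbers for illustration.
predicted = TradeStats(0.004, 0.55, 0.40, 0.20, 12.0)
observed = TradeStats(0.003, 0.48, 0.35, 0.30, 15.0)
gaps = calibration_gaps(predicted, observed)
```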

When a gap shows up we diagnose it (slippage drift, entry timing, exit execution, regime mismatch, stale data) and either update the global model or narrow the specific gene's activation window. Every adjustment is logged on /calibration.

Correlation methodology

The numbers on /correlations come from Pearson correlation on log-returns of Binance daily candles, computed on a 90-day rolling window. Beta to BTC uses the standard CAPM ratio (Cov(coin, BTC) / Var(BTC)) on the same window. Clusters use single-linkage greedy clustering at a 0.70 correlation threshold.
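The three calculations can be sketched in plain Python as follows. Two assumptions: the inputs are equal-length lists already trimmed to the 90-day window, and "single-linkage at a threshold" is implemented as connected components over pairs whose correlation is at least 0.70. That is one common reading of the clustering rule, not necessarily the site's exact procedure. All function names are illustrative.

```python
import math
from typing import Dict, List, Sequence


def log_returns(closes: Sequence[float]) -> List[float]:
    return [math.log(b / a) for a, b in zip(closes, closes[1:])]


def _cov(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample covariance of two equal-length series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    # Undefined (division by zero) if either series is constant.
    return _cov(x, y) / math.sqrt(_cov(x, x) * _cov(y, y))


def beta(coin: Sequence[float], btc: Sequence[float]) -> float:
    """Cov(coin, BTC) / Var(BTC)."""
    return _cov(coin, btc) / _cov(btc, btc)


def correlation_clusters(
    returns: Dict[str, Sequence[float]], threshold: float = 0.70
) -> List[List[str]]:
    """Merge any two coins whose correlation meets the threshold (union-find)."""
    names = sorted(returns)
    parent = {n: n for n in names}

    def find(n: str) -> str:
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(returns[a], returns[b]) >= threshold:
                parent[find(a)] = find(b)

    groups: Dict[str, List[str]] = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Synthetic returns: ETH moves exactly 2x BTC, SOL is nearly unrelated.
btc = [0.01, -0.02, 0.03, -0.01, 0.02]
eth = [2 * r for r in btc]
sol = [0.01, 0.01, -0.02, -0.02, 0.02]
clusters = correlation_clusters({"BTC": btc, "ETH": eth, "SOL": sol})
```

Because single linkage chains, two coins can share a cluster without being correlated at 0.70 with each other, as long as a path of above-threshold pairs connects them.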

Limitation worth naming: correlations change with regime and spike toward 1 during stress events. The 30-day comparison number on hover gives a near-term view.

Methodology incidents

We name the times the methodology was wrong. The corpus is suspect until each incident's re-test passes; that's the deal.

2026-04-30: DSR gate audit. ~99% of the 595 previously approved graduated strategies had Deflated Sharpe fluke probabilities that flagged them as noise under the corrected implementation. 549 of 595 were retroactively retired. The bench shrank to 15 genes that clear the corrected gate.
2026-05-13: silent zero-trade bug. A code typo caused 1,560+ prior research-lab tests to execute zero trades but report “passed” on insufficient-data grounds. All affected tests are flagged in the database; re-runs land in the registry as fresh entries with updated provenance.

See the methodology in action:

Live paper-trading proof
Calibration log
Research lab