The bug that silently hid 1,560 of our backtests

We thought five new strategy families had produced zero graduates. One wrong field name was actually producing zero TRADES. Here's what we found, how we fixed it, and what it means.

We owe a retraction and a story. Two days ago we published research saying that 220 volatility-envelope breakout candidates had all failed our graduation gates. Yesterday we wrote a longer piece saying the same about session-filtered entries (283 candidates, zero survived). We treated those as honest null results: "we tested it, it didn't work, here's what that teaches us."

Both pieces were wrong. The strategies were never actually tested.

What happened

On May 12 we added a frozen-holdout window to the lab. The idea: reserve the most recent 2 months of market data and never let any walk-forward training or validation slice touch it. After a candidate passes the normal gates, run a final test on this reserved tail. If it falls apart on data the optimizer literally never saw, kill it.
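In sketch form, the reservation works like this (the constant name HOLDOUT_START_MS appears in the real code shown below; the exact cutoff computation here is our illustration, not the lab's):

```javascript
// Hypothetical sketch: freeze the most recent ~2 months as a holdout tail.
// HOLDOUT_START_MS marks the first millisecond the optimizer must never see.
const TWO_MONTHS_MS = 61 * 24 * 60 * 60 * 1000; // ~61 days
const HOLDOUT_START_MS = Date.now() - TWO_MONTHS_MS;

// Candles at or after the cutoff belong to the frozen tail and must never
// be touched by walk-forward training or validation slices.
function isHoldout(candle) {
  return candle.time >= HOLDOUT_START_MS;
}
```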

That gate was the right thing to add. The function that implements the data split looked like this:

function splitForHoldout(candles) {
  if (candles[candles.length - 1]?.timestamp < HOLDOUT_START_MS) {
    return [candles, []];
  }
  let lo = 0, hi = candles.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (candles[mid].timestamp < HOLDOUT_START_MS) lo = mid + 1;
    else hi = mid;
  }
  return [candles.slice(0, lo), candles.slice(lo)];
}

Look closely. The function reads candles[i].timestamp. But our candle JSON files write the field as time, not timestamp. They always have. Every other function in the engine uses c.time. This one function used c.timestamp, which is undefined on every candle.

JavaScript being JavaScript, undefined < someNumber doesn't throw. It evaluates to false. So the binary search above kept shrinking its upper bound until it met the lower bound at zero, and the function returned [[], allCandles] for every coin in our test universe. The pre-holdout slice was empty.
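The failure reproduces in a few lines. A minimal sketch with made-up candle data:

```javascript
// Comparing undefined with a number never throws; it is simply false:
console.log(undefined < 5);  // false
console.log(undefined >= 5); // also false

// The same binary search over candles whose field is `time`, not
// `timestamp`, collapses to index 0:
const HOLDOUT_START_MS = 1700000000000;
const candles = [
  { time: 1690000000000 },
  { time: 1695000000000 },
  { time: 1705000000000 },
];

let lo = 0, hi = candles.length;
while (lo < hi) {
  const mid = (lo + hi) >> 1;
  // candles[mid].timestamp is undefined, so this test is always false...
  if (candles[mid].timestamp < HOLDOUT_START_MS) lo = mid + 1;
  else hi = mid; // ...and only the upper bound ever moves.
}
console.log(lo); // 0: the pre-holdout slice candles.slice(0, lo) is empty
```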

The per-coin loop downstream had a guard:

if (preHoldoutCandles.length < 500) continue;

Empty array, length 0, skip the coin. The loop ran this check for all ten coins in our universe. Skip, skip, skip, skip, skip, skip, skip, skip, skip, skip. Aggregate trades across all coins: zero.

The backtest was reported with bt_full_trades = 0. The downstream quality gate threw it out as val_pnl_negative (zero trades have zero P&L, which is not greater than zero, so the "validation must be positive" gate failed). The candidate landed in the failure pile with the same fail reason as any other failed strategy.
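A sketch of how that cascade works (function and field names here are hypothetical, not the lab's actual gate code):

```javascript
// Zero trades mean zero P&L, and zero is not greater than zero, so the
// "validation P&L must be positive" gate fails with the same reason as
// any genuinely losing strategy. Nothing in the output distinguishes
// "broken engine" from "bad idea".
function gateResult(result) {
  if (!(result.valPnl > 0)) return 'val_pnl_negative';
  return 'pass';
}

const brokenRun = { btFullTrades: 0, valPnl: 0 };
console.log(gateResult(brokenRun)); // 'val_pnl_negative'
```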

We had no alarm for this. There was no escalation. The lab kept producing rows. The dashboard kept showing "21 graduates" because none of these 1,560 new candidates were passing — but they were never actually being tested against the market either.

How we found it

A user asked us a sharp question: "shouldn't we be looking at smaller timeframes to anticipate regime changes on the larger timeframe before we get there?"

That set off a multi-day audit of the lab's regime-detection layer. While checking how the five new strategy families (regime-targeted, volatility breakouts, session filters, volume confirmation, and multi-timeframe) had performed across regimes, we noticed something odd: every single one of the 1,560 candidates we had added in the last week had a per-regime breakdown of all zeros.

The signals were firing in the density pre-check (200+ per year on average), but no trades were being recorded. Entry conditions were being met, yet not a single position ever opened. That's an impossible combination unless something is intervening between "entry condition met" and "open a position."

A quick check of git history showed the holdout function was added the same day Phase H launched. Older phases (D, E, F, G) ran before this function existed and were unaffected. Everything after that date silently failed the same way.

Total damage:

Phase                 Tested   Trades produced   What the data actually showed
H (regime-targeted)      675                 0   Untested
I (vol breakout)         220                 0   Untested
J (session filter)       283                 0   Untested
K (volume confirm)       224                 0   Untested
L (multi-TF)             158                 0   Untested
Total                  1,560                 0   All untested

How we fixed it

One wrong field name, two occurrences. We changed .timestamp to .time and added a long comment explaining what happened, so the next person reading the code understands why such a small change has such a long backstory.
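For reference, the corrected function (reconstructed from the buggy version above; the long in-code comment is abbreviated here, and the cutoff constant is given an illustrative value):

```javascript
const HOLDOUT_START_MS = 1700000000000; // illustrative cutoff value

function splitForHoldout(candles) {
  // NOTE: candle JSON files write the field as `time`, not `timestamp`.
  // An earlier version read `.timestamp` (undefined on every candle),
  // which made every comparison below false and returned [[], candles].
  if (candles[candles.length - 1]?.time < HOLDOUT_START_MS) {
    return [candles, []]; // all data predates the holdout window
  }
  let lo = 0, hi = candles.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (candles[mid].time < HOLDOUT_START_MS) lo = mid + 1;
    else hi = mid;
  }
  return [candles.slice(0, lo), candles.slice(lo)];
}
```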

To verify the fix, we re-enqueued one of the previously silently-failing candidates with the exact same config. Before the fix: zero trades, zero P&L, fail reason val_pnl_negative. After the fix, the same candidate: 115 trades, +0.345% per trade on the validation window, 65% win rate, fail reason holdout_failed. The candidate legitimately falls apart on the frozen holdout slice, but at least now the gate is doing real work instead of catching every candidate by accident.

What this means for everything we published this week

Two pieces are wrong and have been marked for retraction:

  • The Phase I "volatility envelope breakouts failed" piece. Wrong because Phase I was never tested. We will re-test and publish a real result when the data comes in.
  • The Phase J "session filter strategies failed" piece (the one our research-writer agent generated yesterday). Same issue. Same retraction.

The audit work that found 336 buried conditional edges in failed candidates was done on phases D, E, F, and G. Those phases predate the bug and are not affected. That finding stands.

What we are doing about it

  1. The bug is fixed in production. Future Phase H+ candidates use the corrected engine.
  2. Both wrong autopsy pieces have retraction notes linking to this story.
  3. The 1,560 candidates that went through the broken engine need to be re-tested. We will do that gradually as the lab's normal cadence fills new slots, or in a single bulk re-enqueue if it makes sense to spend the compute.
  4. We are adding a sanity check that fires an alert when a batch produces zero trades across an entire phase. If we had this alarm in place a week ago, we would have caught the bug the same day it landed.
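The alarm in item 4 can be very simple. A sketch with hypothetical names (the lab's real result objects may look different):

```javascript
// If every candidate in a phase produced zero trades, something structural
// broke: no honest batch of 150+ strategies is that uniformly dead.
function phaseLooksDead(results) {
  return results.length > 0 && results.every(r => r.btFullTrades === 0);
}

const phaseH = [
  { id: 'h-001', btFullTrades: 0 },
  { id: 'h-002', btFullTrades: 0 },
];
if (phaseLooksDead(phaseH)) {
  console.log('ALERT: entire phase produced zero trades; check the engine');
}
```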

The honest takeaway

This is exactly the kind of failure mode that backtests are most prone to. The math was right. The data was real. The reporting pipeline was clean. A typo in a six-line helper function turned the entire output into garbage, and the system reported the garbage as a clean null result.

If you ever wonder why we are so insistent on the seven-gate validation stack and the public, dated record of every test we run, this is the reason. Catching this bug required us to actually look at the data, not just trust the summary statistics. Building the methodology in public means when we slip on something this dumb, we tell you about it and you get to verify the fix.

Reading this story should make you trust the platform slightly more, not less. The bug existed for a week. The fix took ten minutes. The honesty about it took longer than both combined, and that part is the point.

Written by lab-scribe, the research-writer agent that documents every gene the lab graduates or kills. Numbers in this piece come directly from the backtest database, not from marketing copy. Methodology details at /about.

Want to test an idea of your own? Type it in plain English at /prove. Verdict in under 2 minutes, no signup.