My backtests lied to me far too often

When a backtest looks too pretty,

nine times out of ten, it’s lying.

 

Trading crypto by rules instead of by gut is what people call quant trading.

Running those rules through a computer simulation to see whether they worked in the past is called a backtest.

 

The problem is, backtests lie far too often.

On the screen, the equity curve climbs up and to the right.

Put real money behind that curve, though, and it turns out to have been an illusion.

 

Over the past seven years, I’ve lost count of how many times a backtest has fooled me.

Every time I got fooled, I carved out one more rule in that exact spot.

Today is the day I pin those rules down.

 

Fifteen rules.

Every one of them came from a place where I broke, a place where I despaired.


1. Distrust the data

The starting point of any backtest is historical price data.

And often, the lying starts right there at the starting point.

This is where most of the accidents happen.

 

Don’t trust the data you receive at face value

Crypto exchanges hand over historical price data automatically, through a channel called an “API.”

The problem is, that channel sometimes quietly leaves pieces out.

It doesn’t throw an error. It just arrives short.

 

I ran the same strategy over the same period,

and the maximum drawdown swung by 4 percentage points from run to run.

Turns out, in a window where I should have received 1,000 candles, some days only 980 arrived.

Those 20 missing candles, silently clipped on arrival, were shaking the entire result.

 

It’s like testing a recipe when the grocery delivery occasionally leaves out an ingredient or two.

You can’t conclude “this recipe is good” from results like that.

Rule: For any data you receive — (1) verify the count, (2) save it to a file (cache it), (3) drop any candle still in progress before using it.
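Here is a minimal sketch of that rule in Python. The candle tuple layout, the function names, and the JSON cache file are my own assumptions for illustration, not any exchange’s API, and the actual download call is left out. I drop the in-progress candle first so it can never end up in the cache.

```python
import json
from pathlib import Path

# Assumed (hypothetical) candle layout: (open_time_ms, open, high, low, close, volume),
# sorted by open_time_ms. Fetching from the exchange API is deliberately left out.


def drop_in_progress(candles, now_ms, interval_ms):
    """Keep only candles that have fully closed (open time + interval <= now)."""
    return [c for c in candles if c[0] + interval_ms <= now_ms]


def validate_count(candles, start_ms, end_ms, interval_ms):
    """Raise unless every candle in [start_ms, end_ms) is present exactly once."""
    times = [c[0] for c in candles]
    if len(times) != len(set(times)):
        raise ValueError("duplicate candles in response")
    expected = set(range(start_ms, end_ms, interval_ms))
    missing = sorted(expected - set(times))
    if missing:
        raise ValueError(f"{len(missing)} of {len(expected)} candles missing, first at {missing[0]}")
    return candles


def save_cache(candles, path):
    Path(path).write_text(json.dumps(candles))


def load_cache(path):
    return [tuple(c) for c in json.loads(Path(path).read_text())]


def prepare(raw, start_ms, end_ms, interval_ms, now_ms, cache_path):
    """Drop the in-progress candle, verify the count, then cache the window."""
    closed = drop_in_progress(raw, now_ms, interval_ms)
    window = [c for c in closed if start_ms <= c[0] < end_ms]
    validate_count(window, start_ms, end_ms, interval_ms)  # fails loudly on 980 of 1,000
    save_cache(window, cache_path)
    return window
```

Once a window passes validation and lands in the cache, every later run reads the same file, so the result can no longer shift with whatever the API happened to return that day.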

 

Don’t leave the front end of your data blank

Every strategy needs what’s called a warm-up period.

If your strategy uses a “200-day moving average,” the indicator simply can’t be computed for the first 200 days of data.

You need 200 days to average, and you don’t have 200 days yet.

That stretch slides by with no trading signals at all.

 

I sliced the training window and the validation window separately, loaded each on its own,

and the annual return on the training window came in 5.5 percentage points lower.

The cause: the first 200 days of the training window were blank, so that much time went by with no trading at all.

 

I was looking at results from an uneven comparison.

Rule: Before slicing your window, load it with extra padding tacked onto the front — equal to the warm-up — then cut.
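A sketch of the padding trick, assuming a plain list of daily closes and a simple moving average whose length equals the warm-up (the helper names are hypothetical). The indicator is computed on the padded slice and only then is the padding cut away, so the first bar of the window already has a valid average.

```python
def sma(values, n):
    """Simple moving average; None until n values are available."""
    out, total = [], 0.0
    for i, v in enumerate(values):
        total += v
        if i >= n:
            total -= values[i - n]  # slide the window forward by one bar
        out.append(total / n if i >= n - 1 else None)
    return out


def slice_with_warmup(series, start, end, warmup):
    """Compute the indicator on `warmup` extra bars before `start`, then cut."""
    padded_start = max(0, start - warmup)
    padded = series[padded_start:end]
    indicator = sma(padded, warmup)
    offset = start - padded_start  # number of padding bars to discard
    return padded[offset:], indicator[offset:]


closes = [float(x) for x in range(1, 501)]  # made-up price series
prices, ma200 = slice_with_warmup(closes, 300, 400, 200)
```

Slicing `closes[300:400]` first and computing the 200-bar average on it would leave every value as None, because 100 bars never reach 200.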

 

Don’t paste daily-candle signals straight onto hourly candles

There are times when you mix data from different timeframes.

For example, a rule like “buy on the hourly only when the daily trend is up.”

 

I carelessly merged the two, and the backtest came back with an annual return of +109%.

Something felt off, so I dug in: my hourly trades were reading today’s daily close while today was still in progress.

That close wouldn’t be final until midnight, yet I was trading at 9 a.m. as if I already knew it.

This is what people call lookahead bias.

Of course the results look great when you trade with knowledge of the future.

 

I fixed the bug and re-ran it: +109% → +4.6%.

A single lying line of code conjured up more than a hundred percentage points of illusory return.

Rule: When mixing different timeframes, always use only up to the prior day’s close. Today’s information gets used starting tomorrow.
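A minimal sketch of the prior-day rule in plain Python. The daily filter here (close above the previous close) is a hypothetical stand-in for whatever trend rule you actually use, and the sketch assumes hourly timestamps and daily candles share the same day boundary (for example, both in UTC).

```python
from datetime import date, datetime, timedelta


def daily_uptrend(daily_closes):
    """Hypothetical daily filter: True if a day's close beats the prior day's close.
    The value for day D only becomes known once D's candle has closed."""
    days = sorted(daily_closes)
    return {d: daily_closes[d] > daily_closes[prev] for prev, d in zip(days, days[1:])}


def attach_daily_filter(hourly_times, daily_signal):
    """For each hourly bar, read the daily signal from the previous day only.
    Looking up t.date() itself would leak today's unfinished close (lookahead)."""
    return [daily_signal.get(t.date() - timedelta(days=1), False) for t in hourly_times]


closes = {date(2024, 1, 1): 100.0, date(2024, 1, 2): 90.0, date(2024, 1, 3): 120.0}
sig = daily_uptrend(closes)
filters = attach_daily_filter([datetime(2024, 1, 3, 9), datetime(2024, 1, 4, 9)], sig)
# filters == [False, True]
```

The leaky lookup `sig.get(date(2024, 1, 3))` would hand the 9 a.m. bar on January 3 a True that nobody could have known yet; the shifted lookup gives it January 2’s False, and January 3’s uptrend only shows up on January 4.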


2. When the numbers look too pretty, doubt them once more

Backtest results are, in the end, just numbers.

But we tend to forget what those numbers were made out of.

 

Don’t trust a file just because it’s named “fixed” or “clean”

As result files pile up, names like “fixed_xxx” and “clean_xxx” start appearing.

They carry a tidied-up vibe, which makes them easy to cite as-is.

 

In one of those clean files, I was pleased to see a maximum drawdown of -8%.

Turned out it was data stitched together from end-of-month snapshots only.

Whatever deep loss valleys had opened up mid-month were nowhere to be seen.

Recomputed against a day-by-day equity curve, it became -16%.

Nearly double.

 

If all you ever saw were the family photos taken at the end of each month, every month would look peaceful.

Whatever fights happened in between aren’t in the picture.

Rule: Trust the maximum drawdown only when it’s computed directly from a daily equity curve.
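Computing the drawdown itself takes only a few lines. This sketch (function name mine) walks the curve bar by bar while tracking the running peak; the made-up numbers show how month-end snapshots can hide a mid-month valley.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline as a negative fraction (e.g. -0.16)."""
    peak, mdd = equity[0], 0.0
    for v in equity:
        peak = max(peak, v)               # highest equity seen so far
        mdd = min(mdd, v / peak - 1.0)    # deepest fall below that peak
    return mdd


daily = [100, 120, 96, 110, 108, 130]        # made-up daily equity curve
month_end = [daily[i] for i in (0, 3, 4)]    # hypothetical snapshot days
print(max_drawdown(daily))      # about -0.20: the mid-month valley is visible
print(max_drawdown(month_end))  # about -0.018: the valley is gone
```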

 

Don’t call the cash-balance chart “maximum drawdown”

True drawdown depth has to be computed from total assets = cash + the mark-to-market value of open positions.

This total is called equity.

Looking at the cash alone tells you how much cash flowed in and out — not how deep the loss went.

 

Example:

Start with 100 won.

Buy 100 won worth of coin → cash now 0 won.

The coin price gets cut in half → position value now 50 won.

My true assets right now are 50 won.

But look only at the cash, and it sits unchanged at 0 won. Looks peaceful.

Even though I’m actually down 50%.

 

Rule: Maximum drawdown is always computed against total assets. The cash balance gets a different name — something like “cash turnover.”
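The same 100-won example in code. A minimal sketch: the per-bar cash, coin quantity, and price lists are assumed inputs, and the drawdown helper is the usual running-peak calculation.

```python
def equity_curve(cash, qty, prices):
    """Total assets per bar: cash + mark-to-market value of the open position."""
    return [c + q * p for c, q, p in zip(cash, qty, prices)]


def max_drawdown(curve):
    """Largest peak-to-trough decline as a negative fraction."""
    peak, mdd = curve[0], 0.0
    for v in curve:
        peak = max(peak, v)
        mdd = min(mdd, v / peak - 1.0)
    return mdd


# 100 won of cash -> buy 100 coins at 1 won each -> the price halves to 0.5 won
cash = [100.0, 0.0, 0.0]
qty = [0.0, 100.0, 100.0]
price = [1.0, 1.0, 0.5]
equity = equity_curve(cash, qty, price)  # [100.0, 100.0, 50.0]
```

After the purchase, the cash list sits flat at 0 while equity falls from 100 to 50, so `max_drawdown(equity)` reports the real -50%.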

 

Don’t mistake leverage for a tool that improves your ratios

Leverage is taking a larger position with borrowed money.

With 3x leverage, 100 won of capital runs a 300 won position.

Triple the gains, triple the losses.

 

A common misconception enters here.

“If I lower the leverage, won’t my risk-adjusted return improve?”

Drop from 3x to 2x, and the annual return shrinks, but so does the maximum drawdown.

The two shrink at roughly the same rate, so the ratio stays roughly the same.

 

The return-to-risk ratio only moves when the strategy itself gets smarter, or when diversification with other assets actually works.

Leverage just stretches or shrinks the vertical axis of the chart.

 

Rule: Leverage is a volatility-adjustment knob. To improve risk-adjusted returns, you have to touch the strategy itself or your diversification.
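A quick way to convince yourself: scale every per-period return by the leverage and the Sharpe ratio doesn’t move, because the mean and the standard deviation scale by the same factor. This sketch assumes a zero risk-free rate, no funding or borrowing cost, and 365 periods a year since crypto trades every day; the returns are made up.

```python
import statistics


def sharpe(returns, periods_per_year=365):
    """Annualized Sharpe ratio of per-period returns, zero risk-free rate assumed."""
    return statistics.mean(returns) / statistics.stdev(returns) * periods_per_year ** 0.5


daily = [0.010, -0.005, 0.003, 0.007, -0.002]  # made-up daily returns
by_leverage = {lev: sharpe([lev * r for r in daily]) for lev in (1, 2, 3)}
# by_leverage[1], by_leverage[2], by_leverage[3] agree up to float rounding
```

In practice, funding costs on borrowed money make higher leverage slightly worse on this ratio, never better.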


3. Distrust the simulation itself

Even when the strategy is sound, the simulator can be the one lying.

This kind is the hardest to catch.

You can fiddle with the strategy for a hundred days straight and never find the answer.

 

Don’t forget to return the margin when a position is closed

Leveraged trading locks up some of your money as margin (collateral).

A 100-won position at 3x leverage ties up roughly 33 won as margin.

When the position is closed out, those 33 won have to flow back into the capital balance.

 

But there was a time when my simulation code forgot to return it.

Capital looked like it was quietly bleeding away with time,

and the maximum drawdown on the validation window came in at -40%.

I assumed the strategy was at fault and wrestled with it for nearly a month.

 

Turned out to be a simple bug in the simulator.

One line added to the close-out code to return the margin, and it dropped to -16%.

The strategy was fine. The simulator was the one lying.

Rule: For every line in the close-out code, eyeball whether the locked margin actually flows back into capital.
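A minimal sketch of the bookkeeping, using a hypothetical `Account` class. The point is the one line in `close()` that hands the locked margin back; delete it and the balance shrinks by the margin on every round trip, the same kind of phantom bleed I chased for a month.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Hypothetical margin bookkeeping for a single-position simulator."""
    free_capital: float
    locked_margin: float = 0.0

    def open(self, notional, leverage):
        """Lock notional / leverage as margin, e.g. 100 at 3x locks about 33.3."""
        margin = notional / leverage
        if margin > self.free_capital:
            raise ValueError("not enough free capital for margin")
        self.free_capital -= margin
        self.locked_margin += margin
        return margin

    def close(self, margin, pnl):
        """Release the margin back into capital, then book the realized P&L."""
        self.locked_margin -= margin
        self.free_capital += margin  # the easy-to-forget line
        self.free_capital += pnl


acct = Account(100.0)
m = acct.open(notional=100.0, leverage=3)
acct.close(m, pnl=10.0)  # free_capital is back to about 110
```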

 

Don’t leave forced liquidations out of the simulation

When leverage is high and losses cross a certain threshold, the exchange automatically closes the position for you.

This is called liquidation.

The remaining capital becomes 0, or close to it.

Game over.

 

My simulation left this out, so even at -95% P&L the strategy kept running as if it were still alive.

It would even go on to draw a recovery curve afterward.

In reality, the account would have been liquidated long before -95%, and there is no comeback from that.

The simulator was drawing me a future that couldn’t exist.

 

Rule: Explicitly hard-code a guard that forces the simulation to terminate once losses cross a certain threshold (e.g. -90%).
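A minimal sketch of such a guard. It assumes a starting capital of 1.0 and per-bar returns on the underlying, and it measures loss from the starting capital. Real exchanges liquidate per position based on maintenance margin, so treat the -90% threshold as the simple stand-in the rule describes.

```python
def simulate_with_liquidation(returns, leverage, liquidation_loss=-0.90):
    """Compound leveraged per-bar returns from a starting capital of 1.0.
    Once the loss from the start reaches liquidation_loss, mark the account
    as wiped out (0.0) and stop: no recovery curve after liquidation."""
    equity = [1.0]
    for r in returns:
        nxt = equity[-1] * (1.0 + leverage * r)
        if nxt - 1.0 <= liquidation_loss:
            equity.append(0.0)  # liquidated: game over, the loop ends here
            break
        equity.append(nxt)
    return equity


curve = simulate_with_liquidation([-0.25, -0.3, 0.5], leverage=3)
# curve == [1.0, 0.25, 0.0]; the +50% bar never gets to draw a rebound
```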

 

Don’t systematize visual trading rules just because they look great on the chart

Trendlines, ranges, order blocks (OB), ICT (Inner Circle Trader) concepts, SMC (Smart Money Concepts).

These are the visual trading methods used by people who draw directly on the chart.

To the eye, they seem to fit almost too well.

 

Translate those rules into code and run an automated backtest, and you often end up with a negative expected value.

(Expected value = average profit/loss per trade. If it’s negative, you lose more the more you trade.)

 

The reason is simple.

On the chart, people only ever called the lines that turned out to be right “trendlines.”

Trends that only become visible in hindsight — drawn as if they had been known at the time.

This after-the-fact bias is called hindsight bias.

 

Rule: If a visual rule’s backtest looks too good on the training window, suspect hindsight bias. The answer is often an alert bot, not full systemization.


4. The validation window (OOS) is sacred — never touch it

The basic move in backtesting is splitting your data into two parts.

  • Training window (IS, In-Sample): the window where you build and refine the strategy.
  • Validation window (OOS, Out-of-Sample): the window where you check whether the strategy still works on data it has never seen.

The OOS plays the role of the exam.

 

Don’t look at the OOS result and then go back and re-tune parameters

There’s only one rule.

You look at the OOS once, at the very end.

The moment you see the OOS result and think “should I bump this number from 0.3 to 0.4?”,

that OOS is no longer an OOS.

Because the brain can’t unsee what it has already seen.

 

What does it mean for a student to score 100 on a test they peeked at beforehand?

This is the rule I’ve broken most often, and the most dangerous one.

 

Rule: Parameter selection uses only the training window plus a secondary validation window (OOS2). The final OOS is looked at exactly once, at the very end.
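One way to make this rule hard to break is to let the code enforce it. A sketch with a hypothetical `ThreeWaySplit` class: a chronological split into training, OOS2, and final OOS, where the final slice can be read only once. The 60/20/20 proportions are an assumption, not a recommendation.

```python
class ThreeWaySplit:
    """Chronological split: training (IS), OOS2 for parameter checks, final OOS once."""

    def __init__(self, data, is_frac=0.6, oos2_frac=0.2):
        n = len(data)
        a = int(n * is_frac)
        b = a + int(n * oos2_frac)
        self.train = data[:a]    # build and tune here
        self.oos2 = data[a:b]    # secondary validation for parameter selection
        self._final = data[b:]   # the exam: hidden until the very end
        self._final_used = False

    def final_oos(self):
        """Hand out the final OOS exactly once; a second look raises."""
        if self._final_used:
            raise RuntimeError("final OOS already used; it is no longer out-of-sample")
        self._final_used = True
        return self._final
```

A lock like this can’t stop you from remembering what you saw, but it does stop a script from quietly re-reading the exam on every tuning run.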

 

Don’t carve the bear markets out wholesale

A common temptation.

“If I just don’t trade in the downturns, I won’t take losses, right?”

You start wanting to add a rule like “stop trading below the 200-day moving average.”

 

Do this and the training window looks dazzlingly better.

But on the validation window, performance collapsed by 31 percentage points.

 

The reason: the bear market itself was part of the alpha.

Trades that went against the grain, riding the volatility of the downturn, were generating profits.

Block the whole thing, and that source of profit disappears with it.

It’s not just the losses that vanish — the profits go too.

 

Rule: Handle bear markets by shrinking position size, not by excluding them.
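In code, the difference comes down to one multiplier. A hypothetical sizing helper, with the 0.5 bear-market scale chosen purely for illustration: below the 200-day moving average it halves exposure, where the exclusion rule would have returned zero.

```python
def position_size(base_size, price, ma200, bear_scale=0.5):
    """Shrink exposure below the 200-day MA instead of going flat.
    bear_scale=0.5 is an illustrative value, not a tuned one; the exclusion
    rule warned against above is the special case bear_scale=0.0."""
    return base_size * (bear_scale if price < ma200 else 1.0)
```

Keeping a nonzero size lets the downturn trades that carried part of the alpha stay in the book, just at lower risk.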

 

Don’t make adoption decisions based on a sub-strategy’s standalone numbers

Sometimes a bot runs several sub-strategies (sleeves) in parallel.

Each sleeve only runs a portion of the capital (say, 25%).

The point is to diversify risk.

 

I added a new sleeve and on its own it improved Sharpe by +0.30.

(Sharpe = a measure of risk-adjusted return. A +0.30 improvement is a fairly large jump.)

I got excited about the massive improvement,

but when I rolled it up across the whole portfolio, it was +0.025.

Because the allocation was only 25%, the effect had been diluted by a factor of 12.

 

The overall bot performance was basically unchanged.

 

Rule: The adoption decision rests on the full portfolio’s combined numbers, not on the sleeve’s standalone numbers.
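A minimal sketch of judging a sleeve at the portfolio level. It assumes each sleeve’s per-period returns are aligned lists and that the portfolio return is the weighted sum of sleeve returns (that is, rebalanced to fixed weights every period); the function names are mine.

```python
import statistics


def sharpe(returns, periods_per_year=365):
    """Annualized Sharpe ratio, zero risk-free rate assumed."""
    return statistics.mean(returns) / statistics.stdev(returns) * periods_per_year ** 0.5


def portfolio_returns(sleeve_returns, weights):
    """Weighted per-period portfolio return (fixed weights, rebalanced each period)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return [sum(w * r for w, r in zip(weights, period)) for period in zip(*sleeve_returns)]


def adoption_delta(old_sleeves, old_weights, new_sleeves, new_weights):
    """Change in whole-portfolio Sharpe: the number the adoption decision rests on."""
    before = sharpe(portfolio_returns(old_sleeves, old_weights))
    after = sharpe(portfolio_returns(new_sleeves, new_weights))
    return after - before
```

The sleeve’s own `sharpe()` can look spectacular while `adoption_delta()` barely moves; only the second number goes into the decision.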

 

Don’t adopt a strategy that has already used up the OOS as your new baseline

When you first build a strategy, you only look at the training window — but during validation, you inevitably end up seeing and refining against the OOS too.

Up to this point, that’s normal.

The problem is when you treat that strategy as “the finalized base,” stack a new module on top of it, and start re-validating.

 

The skeleton itself already knows the OOS information, so whatever you build on top ends up being an in-sample test in disguise.

It looks like a fresh validation from the outside; inside, it’s in-sample.

 

Real validation has to start from a state where the OOS has never been seen at all.

 

Rule: For validating a new module, the baseline should be a version frozen before the OOS was ever used.


5. Even when the backtest is clean, moving to live will lie all over again

Even a strategy that has passed every test up to this point starts diverging the moment you move it onto an automated bot.

Two causes.

When you look at the price, and what price you enter at.

 

Same candle, different result depending on when you look

A backtest only ever looks at completed candles.

It evaluates signals once the candle has closed and the close is final,

and assumes entry at the open of the next candle.

A clean assumption.

 

A live bot is different.

It peeks at the price every minute and evaluates signals on whatever provisional price exists at that instant.

If a 4-hour candle started 7 minutes ago, the close the bot is looking at is the still-moving price that has been wobbling for those 7 minutes.

Same candle — but the signal result changes depending on when you peek in.

 

This is exactly what happened the first time my bot went live.

When I checked just before the candle closed, three coins satisfied the buy condition.

The bot actually processed it 7 minutes later, and in that gap, prices shifted just enough that two of them slipped out of the condition.

In the end, only one got in.

It was supposed to be a trade where all three got in.

 

Rule: The live bot also evaluates signals on the completed candle’s close and enters on the next candle’s open. Always drop the in-progress candle. Make the bot look at the same price the backtest does.
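A sketch of making the live bot see only completed candles. It assumes UTC timestamps and 4-hour candles aligned to midnight UTC (00:00, 04:00, ...); the candle dict shape is hypothetical. The entry then goes in right away, at roughly the open of the candle now forming, which mirrors the backtest’s next-open fill.

```python
from datetime import datetime, timedelta, timezone

FOUR_HOURS = timedelta(hours=4)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_closed_candle_open(now, interval=FOUR_HOURS):
    """Open time of the most recent candle that has fully closed by `now` (UTC)."""
    current_open = EPOCH + ((now - EPOCH) // interval) * interval  # in-progress candle
    return current_open - interval


def completed_only(candles, now, interval=FOUR_HOURS):
    """Drop the in-progress candle so signals use the same closes as the backtest."""
    cutoff = last_closed_candle_open(now, interval)
    return [c for c in candles if c["open_time"] <= cutoff]
```

At 4:07 the bot evaluates the 00:00 to 04:00 candle’s final close and ignores the 04:00 candle that has only seven minutes of wobble in it.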

 

When the bot’s clock is off, you enter late every single time

Say the bot is set to run once every 10 minutes.

If this bot powers on at 3:57, its next run is at 4:07.

That’s seven minutes after the 4 o’clock candle closes.

 

Prices don’t sit still during those seven minutes.

The backtest assumed entry at exactly the 4:00 price, but the bot enters at the 4:07 price.

Late every time.

 

The fix is simple.

Force the bot to run on the minute mark.

4:00:00, 4:01:00 — that kind of cadence.

Shrink the gap between candle close and bot execution down to 0 to 1 second.

 

Rule: Make the live bot run on a minute-aligned schedule. Keep the gap between candle close and entry within one second.
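A minimal sketch of the minute alignment using only the standard library. It assumes the machine’s clock is kept in sync (NTP or similar). `time.sleep` is not exact, but it typically wakes within milliseconds, well inside the one-second budget; a cron-style scheduler firing at second 0 works just as well.

```python
import time


def seconds_until_next_minute(now=None):
    """Seconds to sleep so the next wake-up lands on hh:mm:00."""
    now = time.time() if now is None else now
    return 60.0 - (now % 60.0)


def run_minute_aligned(task, cycles):
    """Run `task` at the top of each minute for `cycles` minutes."""
    for _ in range(cycles):
        time.sleep(seconds_until_next_minute())
        task()
```

A bot started at 3:57 now wakes at 3:58:00, 3:59:00, and 4:00:00, so the 4:00 run sees the freshly closed candle within a second.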

 

Build in a safety so the bot can’t be killed mid-task

What happens if you force-kill the bot while it’s in the middle of processing an order?

The signals get evaluated but no entry record is saved. Or only part of the order goes through before it dies.

Start it back up in that state, and the new trade begins on top of a half-processed trade.

The books are off.

 

To prevent this kind of accident, build a safe-shutdown mechanism into the bot.

When the shutdown signal arrives, let it finish whatever it’s currently doing and then terminate.

No new work gets started.

 

Rule: The live bot treats a shutdown signal as a request to wrap up, not an instant kill. Don’t cut it off mid-stream. Cut it off and the books break.
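A minimal sketch of a safe shutdown in Python. The signal handler only records the request, and the main loop checks it between cycles, so a cycle that has started always runs to the end. `ShutdownFlag` and `main_loop` are hypothetical names, and `process_one_cycle` stands in for your own evaluate, order, and record step.

```python
import signal
import time


class ShutdownFlag:
    """Set by the signal handler; checked by the main loop between cycles."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        # Only record the request; never exit from inside the handler.
        self.requested = True

    def install(self):
        signal.signal(signal.SIGTERM, self.handler)
        signal.signal(signal.SIGINT, self.handler)  # Ctrl+C no longer kills mid-cycle


def main_loop(flag, process_one_cycle, wait_seconds=60.0, poll=0.5):
    while not flag.requested:
        process_one_cycle()  # signals, orders, records: always runs to completion
        deadline = time.monotonic() + wait_seconds
        while not flag.requested and time.monotonic() < deadline:
            time.sleep(poll)  # idle time is the only place a stop takes effect
```

In a real bot you would call `flag.install()` once at startup, before `main_loop(flag, ...)`.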


Closing

Fifteen rules.

All of them carved from a place I once broke.

Every break cost me a steep tuition — time, money, or both.

 

The saving grace, though, is that backtests lie in predictable patterns.

You don’t have to fall for the same trick every time.

 

I’ll get fooled again, surely.

When I do, I’ll carve another line.

 
