[Block Bootstrap Validation] Is a Backtest Really Just 'Luck'?

Is a History That Happened Once the Truth?

In my last post, I said my portfolio’s five-year maximum drawdown (MDD) was −8.8%. But should I take that number at face value?

Here’s the problem. −8.8% comes from a single path, the one that “actually happened,” and it happened just once.

The order in which the good days and the bad days arrived is, to a large extent, luck.

If a few of the bad days had happened to scatter apart, the drawdown would have been shallow; if they had happened to bunch up in one place, it would have been far deeper.

In other words, −8.8% could be a lucky path. When it comes to important decisions—like whether to add leverage—I didn’t want to lean on “the one time I got lucky.”

How to Peer Into the Histories That Didn’t Happen

The tool for this is the bootstrap. The idea is surprisingly simple.

Using the real return data as raw material, you generate thousands of imaginary histories that “could plausibly have happened but didn’t,” and then look at the distribution of the results.

The most naive approach is to drop every day’s return into a hat and draw them back out at random. But this is exactly where the trap lies.

The Trap: Shuffle Day by Day, and the Risk Evaporates

Market crashes don’t dip for a single day and then end. Bad days arrive in clusters (so-called volatility clustering). But if you shuffle day by day, a “worst week” like that gets shattered to pieces and scattered among the up days. The crash gets diluted.

Let’s look at an imaginary set of 12 daily returns (in %). Right in the middle sits a textbook crash: −5%, −6%, −4%, back to back.

Day	1	2	3	4	5	6	7	8	9	10	11	12
Return (%)	+2	+1	+3	−1	+2	−5	−6	−4	+3	+2	+1	+2

This original sequence has a maximum drawdown of −14.3%: the equity curve peaks after day 5, and then −5/−6/−4 hit back to back (0.95 × 0.94 × 0.96 ≈ 0.857, a 14.3% fall from the peak).
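To make that number checkable, here is a minimal sketch of the drawdown calculation on the 12-day toy series. The function name `max_drawdown` and the convention of passing returns as percentages are my own choices for illustration.

```python
def max_drawdown(returns_pct):
    """Maximum drawdown of a path of daily % returns, as a negative fraction."""
    equity = 1.0   # equity curve, starting at 1
    peak = 1.0     # highest equity seen so far
    worst = 0.0    # deepest peak-to-trough fall seen so far
    for r in returns_pct:
        equity *= 1.0 + r / 100.0
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


toy_returns = [2, 1, 3, -1, 2, -5, -6, -4, 3, 2, 1, 2]
print(f"{max_drawdown(toy_returns):.1%}")  # -14.3%
```

The peak is tracked as a running maximum, so the drawdown is always measured from the highest point reached so far, not from the starting value.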

Now let’s change only the way we shuffle those same 12 numbers.

Method	MDD of one sample path	What happened
Original (the history that actually happened)	−14.3%	−5/−6/−4 stick together into a deep trough
Random day-by-day shuffle	about −7%	The crash scatters among up days and shrinks by half ❌
Shuffle in 3-day 'blocks'	about −10 to −14%	The crash days stay clustered ✅

Shuffle day by day, and a +3% can slip in right after the −6%, cushioning the shock. That’s why, in this example, the day-by-day shuffle understates the drawdown by about half. Move things in chunks (blocks), on the other hand, and the bad days stay glued to their neighbors, so the deep drawdown reappears at a realistic frequency.
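The two shuffles can be compared directly on the toy series. This is a rough sketch under my own assumptions: the blocks are fixed, non-overlapping 3-day chunks cut from day 1, each method is repeated 2,000 times, and the helper names (`shuffle_days`, `shuffle_blocks`) are hypothetical.

```python
import random
import statistics


def max_drawdown(returns_pct):
    """Maximum drawdown of a path of daily % returns, as a negative fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns_pct:
        equity *= 1.0 + r / 100.0
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


def shuffle_days(returns_pct, rng):
    """Reorder individual days at random."""
    days = list(returns_pct)
    rng.shuffle(days)
    return days


def shuffle_blocks(returns_pct, block_len, rng):
    """Reorder fixed, non-overlapping blocks; days inside a block keep their order."""
    blocks = [returns_pct[i:i + block_len]
              for i in range(0, len(returns_pct), block_len)]
    rng.shuffle(blocks)
    return [r for block in blocks for r in block]


toy_returns = [2, 1, 3, -1, 2, -5, -6, -4, 3, 2, 1, 2]
rng = random.Random(0)  # fixed seed so the run is repeatable

day_mdds = [max_drawdown(shuffle_days(toy_returns, rng)) for _ in range(2000)]
block_mdds = [max_drawdown(shuffle_blocks(toy_returns, 3, rng)) for _ in range(2000)]

for name, mdds in [("day-by-day", day_mdds), ("3-day blocks", block_mdds)]:
    print(f"{name:>12}: median {statistics.median(mdds):.1%}, "
          f"shallowest {max(mdds):.1%}, deepest {min(mdds):.1%}")
```

With blocks cut this way, the −6/−4 pair always travels together, so no block-shuffled path can come in shallower than about −9.8% (0.94 × 0.96 ≈ 0.902). The day-by-day shuffle has no such floor. Note that this is a pure reshuffle, with every day used exactly once, matching the table above; a bootstrap proper would draw with replacement.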

Block Bootstrap, in One Line

Cut the return time series into blocks of a fixed length (preserving the clustering of crashes), randomly stitch those blocks back together to create thousands of imaginary histories, and then look at the distribution of the metric you care about (here, the drawdown).

You set the block length to match “how long a bad stretch usually lasts.” I use about 20 days. Too short, and the clusters break apart (you regress to day-by-day shuffling); too long, and the paths don’t vary enough.
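As a rough illustration of the whole procedure, the sketch below cuts a return series into 20-day blocks, draws blocks with replacement until the path is as long as the original, and collects each path’s maximum drawdown. Several things here are my assumptions rather than a description of my actual setup: the non-overlapping blocks, the draw-with-replacement step, the function names, and above all the data, which is synthetic Gaussian noise standing in for a real return series.

```python
import random
import statistics


def max_drawdown(returns_pct):
    """Maximum drawdown of a path of daily % returns, as a negative fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns_pct:
        equity *= 1.0 + r / 100.0
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


def block_bootstrap_path(returns_pct, block_len, rng):
    """One imaginary history: fixed-length blocks drawn with replacement."""
    blocks = [returns_pct[i:i + block_len]
              for i in range(0, len(returns_pct), block_len)]
    path = []
    while len(path) < len(returns_pct):
        path.extend(rng.choice(blocks))
    return path[:len(returns_pct)]  # trim to the original length


def bootstrap_mdds(returns_pct, block_len=20, n_paths=5000, seed=0):
    """Maximum drawdown of each of n_paths bootstrapped histories."""
    rng = random.Random(seed)
    return [max_drawdown(block_bootstrap_path(returns_pct, block_len, rng))
            for _ in range(n_paths)]


# Synthetic stand-in data (NOT a real portfolio): 500 daily % returns.
data_rng = random.Random(42)
fake_returns = [data_rng.gauss(0.03, 1.0) for _ in range(500)]

mdds = bootstrap_mdds(fake_returns, block_len=20, n_paths=1000)
p5 = statistics.quantiles(mdds, n=20)[0]  # 5th percentile: the deep tail

print(f"MDD of the original path: {max_drawdown(fake_returns):.1%}")
print(f"median bootstrapped MDD:  {statistics.median(mdds):.1%}")
print(f"5th-percentile MDD:       {p5:.1%}")
```

Because drawdowns are negative, the 5th percentile is the deep end of the distribution: only about 5% of the simulated paths fall further. A common variant draws overlapping blocks from random start days (the moving-block bootstrap) instead of cutting fixed blocks from day 1.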

Running It on My Actual Portfolio

I regenerated my equity curve as 5,000 imaginary histories, then collected the maximum drawdown of each one and lined them up.

Measure	Value	Meaning
MDD that actually happened	−8.8%	A single path with luck mixed in
Block bootstrap 5th-percentile point (p5)	−11.7%	95% of the imaginary histories were shallower than this; only the unlucky 5% went deeper

Here’s how I read it: “−8.8% is a slightly lucky value, and the drawdown I should realistically brace for is around −11.7%.” So when I gauge risk, I use p5 (−11.7%), the drawdown that only the worst 5% of imaginary histories exceed, as my baseline, not the actual value (−8.8%). Push leverage up to 1.5×, and this brace line deepens by roughly 1.5× too (1.5 × 11.7% ≈ 17.6%, so somewhere around −18%). The real question becomes whether I can stomach that.
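The “roughly 1.5×” rule of thumb can be sanity-checked on the 12-day toy series from earlier. This sketch assumes leverage simply multiplies every daily return by a constant factor (that is, it is rebalanced daily) and ignores financing costs; the `lever` helper is hypothetical.

```python
def max_drawdown(returns_pct):
    """Maximum drawdown of a path of daily % returns, as a negative fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns_pct:
        equity *= 1.0 + r / 100.0
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


def lever(returns_pct, leverage):
    """Scale each daily return by a constant factor (financing costs ignored)."""
    return [leverage * r for r in returns_pct]


toy_returns = [2, 1, 3, -1, 2, -5, -6, -4, 3, 2, 1, 2]

base = max_drawdown(toy_returns)
levered = max_drawdown(lever(toy_returns, 1.5))

print(f"1.0x MDD: {base:.1%}")             # -14.3%
print(f"1.5x MDD: {levered:.1%}")          # -20.9%
print(f"ratio:    {levered / base:.2f}")   # 1.46
```

On this toy path the levered drawdown comes out slightly under 1.5× the unlevered one because of compounding, so multiplying p5 by the leverage factor is a reasonable first estimate under these assumptions.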

What I’m Taking Away

A single backtest curve is just one snapshot of “the histories that could have happened.” Trust that one snapshot’s drawdown as your limit, and you end up betting on the lucky path.
When you shuffle, do it in chunks, not single days. Break the clustering of crashes, and the risk vanishes as if by magic—right when it matters most.
Brace yourself with the tail of the distribution (p5), not the actual value. This is a surprisingly cheap, one-line validation that separates “luck” from “skill.”

※ This post is meant to share a validation methodology; specific signals, tickers, and sizing are not disclosed. It is not investment advice.