QUANTITATIVE TOOL

Probability Your Edge Is Overfit

Bailey, Borwein, López de Prado & Zhu (2017). Quantify the probability that your backtest result is an artifact of data mining, not genuine alpha. Combines selection bias and parameter overfitting into a single number.

THE PROBLEM

Backtests lie. Here's how much.

Every parameter you tune and every configuration you try inflates your in-sample Sharpe. Test 50 strategies with 10 parameters each, and the winner will look spectacular, even if none have real edge. This calculator quantifies exactly how much of your observed performance is likely noise.

CALCULATOR

Estimate overfitting probability

  • Observed Sharpe ratio: the annualized Sharpe of your best backtest. This is the number you're hoping is real.
  • Backtest length (T): number of trading days used.
  • Number of backtests (N): every time you changed settings and re-ran the backtest counts as one. Be honest; this is the whole point of the tool.
  • Free parameters (k): things you tweaked (lookback window, entry threshold, stop-loss, position size, etc.). Count each one.
  • Skewness: 0 = symmetric; negative = occasional large losses. Most strategies fall between −1 and 0.
  • Kurtosis: 3 = normal distribution; above 3 = more extreme days than expected. Most strategies fall between 3 and 6.

RESULTS

Probability of Overfitting

Probability that your strategy's edge is an artifact of data mining.

99.7%
Scale: 0% (genuine) · 5% threshold · 100% (overfit)

Verdict

High overfitting risk. The observed performance is likely explained by data mining. Do not allocate capital based on this backtest.

Haircut Sharpe

0.0000

Sharpe after removing selection bias and parameter inflation. What you should actually expect out-of-sample.

Selection bias

1.6176

Expected max SR from N backtests under null (zero skill).

Parameter inflation

1.5875

SR inflation from fitting k free parameters to T observations.

Total SR inflation

3.2051

Combined expected Sharpe inflation from testing multiple strategies and tuning parameters.

Minimum backtest length

6,711

Days needed for your SR to be significant at 95% given current N and k.

METHODOLOGY

Two sources of inflation

01

Selection bias

Testing N strategies and keeping only the best one inflates the expected maximum Sharpe under the null to ≈ √(2·ln N) / √T. With 100 backtests over 250 days, you'd expect SR ≈ 0.19 purely by chance, from strategies with zero true edge.
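The selection-bias term is a one-liner. A minimal sketch (the function name `expected_max_sr` is ours, not from the paper):

```python
from math import log, sqrt

def expected_max_sr(n_trials: int, n_days: int) -> float:
    """Approximate expected maximum per-period Sharpe from n_trials
    independent zero-skill backtests over n_days observations:
    sqrt(2 * ln(n_trials)) / sqrt(n_days)."""
    if n_trials <= 1:
        return 0.0
    return sqrt(2.0 * log(n_trials)) / sqrt(n_days)

# 100 zero-edge backtests over 250 trading days:
print(round(expected_max_sr(100, 250), 2))  # 0.19
```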

02

Parameter overfitting

Each free parameter your strategy uses (lookback window, threshold, MA length, etc.) fits to in-sample noise. The inflation is approximately √(k/T). A strategy with 10 parameters over 500 days inflates SR by ≈ 0.14.
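The parameter-inflation term as a sketch (again, a hypothetical helper name):

```python
from math import sqrt

def param_inflation(k: int, t: int) -> float:
    """Approximate per-period Sharpe inflation from fitting k free
    parameters to t in-sample observations: sqrt(k / t)."""
    return sqrt(k / t)

# 10 parameters fit over 500 days:
print(round(param_inflation(10, 500), 2))  # 0.14
```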

03

Combined test

The total Sharpe threshold SR₀ combines both biases. We test whether your observed Sharpe significantly exceeds this threshold using the Probabilistic Sharpe Ratio framework, accounting for non-normal returns. PBO = 1 − Φ(z_adjusted).
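A minimal sketch of the combined test, assuming the standard Probabilistic Sharpe Ratio statistic from Bailey & López de Prado; the calculator's exact implementation may differ, and `prob_overfit` is a name we made up. All quantities are per-period (daily), not annualized.

```python
from math import sqrt, erf

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def prob_overfit(sr_obs: float, sr0: float, t: int,
                 skew: float = 0.0, kurt: float = 3.0) -> float:
    """1 - PSR: probability that the observed per-period Sharpe does
    not significantly exceed the spurious threshold sr0, with the
    denominator adjusted for skewness and kurtosis of returns."""
    denom = sqrt(1.0 - skew * sr_obs + (kurt - 1.0) / 4.0 * sr_obs ** 2)
    z = (sr_obs - sr0) * sqrt(t - 1.0) / denom
    return 1.0 - norm_cdf(z)

# Observed daily SR of 0.05 vs. a spurious threshold of 0.15 over 250 days
# gives a high probability of overfitting:
print(prob_overfit(0.05, 0.15, 250) > 0.9)  # True
```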

REFERENCE TABLE

How fast overfitting grows

Expected spurious Sharpe Ratio for different combinations of backtests and parameters (T = 500 days, normal returns).

Backtests \ Params    k=0      k=2      k=5      k=10     k=20
N=1                   0.000    0.063    0.100    0.141    0.200
N=5                   0.053    0.117    0.153    0.195    0.253
N=10                  0.070    0.134    0.170    0.212    0.270
N=50                  0.102    0.165    0.202    0.243    0.302
N=100                 0.113    0.177    0.213    0.255    0.313
N=500                 0.137    0.200    0.237    0.278    0.337

Values = expected spurious Sharpe Ratio (SR₀). Your observed SR must exceed these thresholds to have genuine edge.
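The cells above can be reproduced under the assumption that the selection-bias term uses the expected-maximum formula from the Deflated Sharpe Ratio paper (Bailey & López de Prado 2014), which is sharper for small N than the √(2·ln N) bound, plus the √(k/T) parameter term. `spurious_sr` is a hypothetical name:

```python
from math import e, sqrt
from statistics import NormalDist

GAMMA = 0.5772156649  # Euler-Mascheroni constant
Z = NormalDist().inv_cdf  # standard normal quantile function

def spurious_sr(n: int, k: int, t: int = 500) -> float:
    """Expected spurious per-period Sharpe SR0: expected maximum of n
    null Sharpes (Bailey & Lopez de Prado 2014) plus sqrt(k/t)."""
    if n > 1:
        emax = (1 - GAMMA) * Z(1 - 1 / n) + GAMMA * Z(1 - 1 / (n * e))
        selection = emax / sqrt(t)
    else:
        selection = 0.0
    return selection + sqrt(k / t)

# Matches the N=100, k=10 cell of the table:
print(round(spurious_sr(100, 10), 3))  # 0.255
```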

REFERENCES

Source papers

  • Bailey, D.H., Borwein, J., López de Prado, M. & Zhu, Q.J. (2017). "The Probability of Backtest Overfitting." Journal of Computational Finance, 20(4), 39–69.
  • Bailey, D.H. & López de Prado, M. (2014). "The Deflated Sharpe Ratio." The Journal of Portfolio Management, 40(5), 94–107.
  • López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapter 11: "The Dangers of Backtesting."

Get started

Verified performance. No self-reporting.

AuditZK computes institutional-grade metrics from verified exchange data — Sharpe, drawdown, VaR, Monte Carlo — with cryptographic attestation. Not backtested. Real.

Probability of Backtest Overfitting Calculator — Bailey & López de Prado | AuditZK