QUANTITATIVE TOOL

Probability Your Edge Is Overfit

Bailey, Borwein, López de Prado & Zhu (2017). Quantify the probability that your backtest result is an artifact of data mining, not genuine alpha. Combines selection bias and parameter overfitting into a single number.

THE PROBLEM

Backtests lie. Here's how much.

Every parameter you tune and every configuration you try inflates your in-sample Sharpe. Test 50 strategies with 10 parameters each, and the winner will look spectacular, even if none have real edge. This calculator quantifies exactly how much of your observed performance is likely noise.

CALCULATOR

Estimate overfitting probability

  • Observed Sharpe ratio: the annualized Sharpe of your best backtest. This is the number you're hoping is real.
  • Backtest length (T): number of trading days used.
  • Number of backtests (N): every time you changed settings and re-ran the backtest counts as one. Be honest; this is the whole point of the tool.
  • Free parameters (k): things you tweaked (lookback window, entry threshold, stop-loss, position size, etc.). Count each one.
  • Skewness: 0 = symmetric; negative = occasional large losses. Most strategies fall between −1 and 0.
  • Kurtosis: 3 = normal distribution; above 3 = more extreme days than expected. Most strategies fall between 3 and 6.

RESULTS

Probability of Overfitting

Probability that your strategy's edge is an artifact of data mining.

99.7%
Scale: 0% (genuine) · 5% threshold · 100% (overfit)

Verdict

High overfitting risk. The observed performance is likely explained by data mining. Do not allocate capital based on this backtest.

Haircut Sharpe

0.0000

Sharpe after removing selection bias and parameter inflation. What you should actually expect out-of-sample.

Selection bias

1.6176

Expected max SR from N backtests under null (zero skill).

Parameter inflation

1.5875

SR inflation from fitting k free parameters to T observations.

Total SR inflation

3.2051

Combined expected Sharpe inflation from testing multiple strategies and tuning parameters.

Minimum backtest length

6,711

Days needed for your SR to be significant at 95% given current N and k.

METHODOLOGY

Two sources of inflation

01

Selection bias

Testing N strategies and keeping only the best one inflates the expected maximum Sharpe under the null to ≈ √(2·ln N) / √T. With 100 backtests over 250 days, you'd expect SR ≈ 0.19 purely by chance, from strategies with zero true edge.
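The selection-bias term is a one-liner. A minimal sketch (the function name `expected_max_sr` is ours, not from the paper):

```python
from math import log, sqrt

def expected_max_sr(n_trials: int, n_days: int) -> float:
    """Approximate expected maximum per-period Sharpe from n_trials
    independent zero-skill backtests over n_days observations:
    sqrt(2 * ln(n_trials)) / sqrt(n_days)."""
    if n_trials <= 1:
        return 0.0
    return sqrt(2.0 * log(n_trials)) / sqrt(n_days)

# 100 zero-edge backtests over 250 trading days:
print(round(expected_max_sr(100, 250), 2))  # 0.19
```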

02

Parameter overfitting

Each free parameter your strategy uses (lookback window, threshold, MA length, etc.) fits to in-sample noise. The inflation is approximately √(k/T). A strategy with 10 parameters over 500 days inflates SR by ≈ 0.14.
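The parameter-inflation term as a sketch (again, a hypothetical helper name):

```python
from math import sqrt

def param_inflation(k: int, t: int) -> float:
    """Approximate per-period Sharpe inflation from fitting k free
    parameters to t in-sample observations: sqrt(k / t)."""
    return sqrt(k / t)

# 10 parameters fit over 500 days:
print(round(param_inflation(10, 500), 2))  # 0.14
```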

03

Combined test

The total Sharpe threshold SR₀ combines both biases. We test whether your observed Sharpe significantly exceeds this threshold using the Probabilistic Sharpe Ratio framework, accounting for non-normal returns. PBO = 1 − Φ(z_adjusted).
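A minimal sketch of the combined test, assuming the standard Probabilistic Sharpe Ratio statistic from Bailey & López de Prado; the calculator's exact implementation may differ, and `prob_overfit` is a name we made up. All quantities are per-period (daily), not annualized.

```python
from math import sqrt, erf

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def prob_overfit(sr_obs: float, sr0: float, t: int,
                 skew: float = 0.0, kurt: float = 3.0) -> float:
    """1 - PSR: probability that the observed per-period Sharpe does
    not significantly exceed the spurious threshold sr0, with the
    denominator adjusted for skewness and kurtosis of returns."""
    denom = sqrt(1.0 - skew * sr_obs + (kurt - 1.0) / 4.0 * sr_obs ** 2)
    z = (sr_obs - sr0) * sqrt(t - 1.0) / denom
    return 1.0 - norm_cdf(z)

# Observed daily SR of 0.05 vs. a spurious threshold of 0.15 over 250 days
# gives a high probability of overfitting:
print(prob_overfit(0.05, 0.15, 250) > 0.9)  # True
```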

REFERENCE TABLE

How fast overfitting grows

Expected spurious Sharpe Ratio for different combinations of backtests and parameters (T = 500 days, normal returns).

Backtests \ Params    k=0      k=2      k=5      k=10     k=20
N=1                   0.000    0.063    0.100    0.141    0.200
N=5                   0.053    0.117    0.153    0.195    0.253
N=10                  0.070    0.134    0.170    0.212    0.270
N=50                  0.102    0.165    0.202    0.243    0.302
N=100                 0.113    0.177    0.213    0.255    0.313
N=500                 0.137    0.200    0.237    0.278    0.337

Values = expected spurious Sharpe Ratio (SR₀). Your observed SR must exceed these thresholds to have genuine edge.
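The cells above can be reproduced under the assumption that the selection-bias term uses the expected-maximum formula from the Deflated Sharpe Ratio paper (Bailey & López de Prado 2014), which is sharper for small N than the √(2·ln N) bound, plus the √(k/T) parameter term. `spurious_sr` is a hypothetical name:

```python
from math import e, sqrt
from statistics import NormalDist

GAMMA = 0.5772156649  # Euler-Mascheroni constant
Z = NormalDist().inv_cdf  # standard normal quantile function

def spurious_sr(n: int, k: int, t: int = 500) -> float:
    """Expected spurious per-period Sharpe SR0: expected maximum of n
    null Sharpes (Bailey & Lopez de Prado 2014) plus sqrt(k/t)."""
    if n > 1:
        emax = (1 - GAMMA) * Z(1 - 1 / n) + GAMMA * Z(1 - 1 / (n * e))
        selection = emax / sqrt(t)
    else:
        selection = 0.0
    return selection + sqrt(k / t)

# Matches the N=100, k=10 cell of the table:
print(round(spurious_sr(100, 10), 3))  # 0.255
```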

REFERENCES

Source papers

  • Bailey, D.H., Borwein, J., López de Prado, M. & Zhu, Q.J. (2017). "The Probability of Backtest Overfitting." Journal of Computational Finance, 20(4), 39–69.
  • Bailey, D.H. & López de Prado, M. (2014). "The Deflated Sharpe Ratio." The Journal of Portfolio Management, 40(5), 94–107.
  • López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapter 11: "The Dangers of Backtesting."

Get started

Verified performance. No self-reporting.

AuditZK computes institutional-grade metrics from verified exchange data — Sharpe, drawdown, VaR, Monte Carlo — with cryptographic attestation. Not backtested. Real.

Probability of Backtest Overfitting Calculator — Bailey & López de Prado | AuditZK