When Does a Track Record Prove Skill? The Math Most Traders Get Wrong
A trader shows a Sharpe ratio of 2.0 over six months. With thousands of traders active at any time, some will produce exactly that result by pure chance. The standard error of the Sharpe ratio is approximately 1/√T — and that's the optimistic version assuming IID returns.
The Data Is Corrupted Before You Even Test It
Five biases distort track record data: survivorship bias, favorable-start bias, cherry-picked windows, discontinuity (gaps during drawdowns), and self-reported data. If any of these are present, statistical testing is meaningless.
The formula assumes IID returns. Real markets exhibit autocorrelation (Lo 2002: 1.5–3x SE multiplier), fat tails, volatility clustering, and regime shifts. Realistic estimates: Sharpe 0.5 needs 25–40+ years (not 16). Sharpe 1.0 needs 5–10 years (not 4).
What a Clean Sharpe Can Hide
500 trades do not mean 500 independent observations — correlated instruments and overlapping holding periods shrink the effective sample. A strong Sharpe can rest on a handful of outsized wins. Some strategies produce smooth returns until a specific tail risk materializes, giving back years of gains in weeks.
Rejecting Sharpe = 0 Is the Wrong Question
Sharpe > 0 means better than cash — an extremely low bar. The real question is Sharpe > benchmark. Bailey & López de Prado (2012, 2014) formalized this with the Probabilistic Sharpe Ratio and Deflated Sharpe Ratio, which adjust for non-normality, multiple testing, track record length, and a non-zero benchmark.
March 12, 2026 · 14 min read · Statistical Analysis
When Does a Track Record Prove Skill? Statistical Significance Explained
A trader shows a Sharpe of 2.0 over six months. On such a short window, luck alone can explain that number. The real question isn't whether your Sharpe looks good, but whether six months of data is enough to prove it isn't noise.
Most traders dramatically underestimate how much data is needed. The standard error of the Sharpe ratio is approximately 1/√T, where T is the number of independent observations (Lo, 2002). Confidence grows painfully slowly — and this is the optimistic version, built on assumptions real markets routinely violate.
The Data Is Corrupted Before You Even Test It
Statistical significance requires clean data. Most performance data shared by traders is distorted before any analysis begins. Four well-documented biases are at work:
Favorable-start bias: a trader connects their account after a strong run. The track record starts from the peak. The losses before connection never appear in the data.
Cherry-picked windows: choosing when to start and stop showing performance. Any account can look impressive over a carefully selected 6-month window.
Discontinuity: disconnecting during drawdowns, reconnecting after recovery. Gaps in the equity curve hide the worst periods and inflate every metric calculated from it.
Self-reported data: screenshots and spreadsheets can be edited. Without independent data collection at the source, there is no way to verify what has been omitted.
If the underlying data suffers from any of these, statistical testing is meaningless. You're testing a fiction.
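To see how much a chosen window can flatter an account, the sketch below simulates a strategy with zero true edge and reports its best-looking 125-day window. The return model (IID Gaussian daily returns with 1% volatility), the seed, and the function names are illustrative assumptions, not data from any real account.

```python
import math
import random
import statistics

TRADING_DAYS = 250  # trading days per year, matching the article's convention


def annualized_sharpe(returns):
    """Annualized Sharpe of daily returns (risk-free rate assumed zero)."""
    return statistics.mean(returns) / statistics.stdev(returns) * math.sqrt(TRADING_DAYS)


def best_window_sharpe(returns, window):
    """Highest annualized Sharpe over any contiguous window of the given length."""
    return max(
        annualized_sharpe(returns[i:i + window])
        for i in range(len(returns) - window + 1)
    )


rng = random.Random(7)  # fixed seed so the run is repeatable
# Three years of daily returns with zero mean: no skill by construction.
daily = [rng.gauss(0.0, 0.01) for _ in range(3 * TRADING_DAYS)]

full = annualized_sharpe(daily)
best = best_window_sharpe(daily, window=125)
print(f"full-sample Sharpe: {full:.2f}, best 6-month window: {best:.2f}")
```

The best window is a maximum over hundreds of overlapping candidates, so it will usually look far better than the full record even though nothing here has an edge.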
Sample Size Determines What You Can Conclude
Assume for a moment that the data is clean. The number of independent observations determines how tight the confidence interval around any metric can be. The key word is independent. We return to what that actually means below. More observations means less uncertainty, but confidence grows slowly, as √T.
| Period | Trading days | Verdict | What it means |
| --- | --- | --- | --- |
| 6 months | ~125 | Noise | With a standard error near 1.4 in annualized terms, a measured Sharpe of 1.5 has a 95% interval of roughly -1.3 to 4.3. No meaningful conclusion is possible. |
| 1 year | ~250 | Directional only | Only a very high measured Sharpe (around 2.0) can be called likely positive, and you cannot tell whether the true value is 0.5 or 1.1. That distinction matters enormously for allocation. |
| 2–3 years | ~500–750 | Minimum viable | High Sharpe ratios (>1.5) become distinguishable from zero. Moderate ones remain ambiguous. |
| 5+ years | ~1,250+ | Institutional-grade | Sharpe ratios around 1.0 can be tested with reasonable confidence. Lower ones still need longer records. |
Standard Error of the Sharpe Ratio
SE(SR) ≈ 1 / √T
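With T = 250 trading days (1 year), the standard error of the daily Sharpe ratio is ~0.063. Annualizing multiplies both the Sharpe and its standard error by √250, so the standard error of the annualized Sharpe is ~1.0. A measured annualized Sharpe of 0.8 therefore has a 95% confidence interval of roughly [-1.2, 2.8]. The interval still includes zero, and the precision is far too low for allocation decisions.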
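A minimal sketch of the interval arithmetic, assuming an annualized Sharpe ratio, T measured in years (the convention the years-needed table further down uses), and IID returns. The function name `sharpe_confidence_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def sharpe_confidence_interval(sharpe, years, confidence=0.95):
    """Two-sided interval for an annualized Sharpe using SE ~ 1/sqrt(T), T in years."""
    se = 1.0 / math.sqrt(years)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # about 1.96 for 95%
    return sharpe - z * se, sharpe + z * se


for years in (0.5, 1, 3, 5):
    lo, hi = sharpe_confidence_interval(0.8, years)
    print(f"{years:>4} years: [{lo:+.2f}, {hi:+.2f}]")
```

Because the half-width shrinks only as 1/√T, quadrupling the record length merely halves the interval.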
Why the Real Numbers Are Worse
The formula above assumes IID returns — independent, identically distributed. Real markets violate every part of that assumption.
Autocorrelation: daily returns are serially correlated. When returns are correlated, each new day adds less than one independent observation to your sample. Lo (2002) showed this inflates the standard error by 1.5–3x for typical strategies.
Fat tails: extreme moves happen far more often than a normal distribution predicts. This makes the Sharpe a noisier estimator than the formula implies. The confidence interval you think you have is too narrow.
Volatility clustering: calm periods alternate with turbulent ones. A high Sharpe measured during low volatility may not survive the next regime. The underlying Sharpe is itself non-stationary.
Regime shifts: a strategy calibrated to low rates and compressed volatility may fail completely when conditions shift. Five years of data from one regime tells you little about the next.
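One way to make the autocorrelation point concrete is the annualization factor from Lo (2002), which replaces the usual √q scaling when returns are serially correlated. The sketch below assumes the autocorrelation estimates are already available; the function name and the sample values are illustrative.

```python
import math


def lo_annualization_factor(q, autocorrs):
    """Lo (2002) factor eta(q) for scaling a per-period Sharpe to q periods.

    autocorrs[k-1] is the lag-k autocorrelation of returns; lags beyond the
    supplied list are treated as zero. With no autocorrelation this is sqrt(q).
    """
    weighted = sum(
        (q - k) * rho
        for k, rho in enumerate(autocorrs, start=1)
        if k < q
    )
    return q / math.sqrt(q + 2.0 * weighted)


monthly_sharpe = 0.30  # hypothetical per-month Sharpe
naive = math.sqrt(12) * monthly_sharpe
adjusted = lo_annualization_factor(12, [0.25, 0.10]) * monthly_sharpe
print(f"naive annual Sharpe: {naive:.2f}, autocorrelation-adjusted: {adjusted:.2f}")
```

With positive serial correlation the factor falls below √q, so the naive annualized figure overstates the Sharpe.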
Years of Daily Data Needed to Reject Luck
Given an observed live Sharpe ratio, how many years of continuous daily data are needed to reject H₀: Sharpe = 0? The IID columns assume independent returns. The realistic column accounts for the violations above.
| Observed Sharpe | IID (95%) | IID (99%) | Realistic |
| --- | --- | --- | --- |
| 0.5 | ~16 years | ~28 years | ~25–40+ years |
| 1.0 | ~4 years | ~7 years | ~5–10 years |
| 1.5 | ~1.8 years | ~3.1 years | ~3–5 years |
| 2.0 | ~1 year | ~1.7 years | ~1.5–3 years |
| 3.0 | ~5 months | ~9 months | n/a |
A Sharpe of 0.5 is common among consistently profitable traders, but proving it requires more than 15 years of continuous data even under IID assumptions. For most strategies, multiply the IID estimate by 2–4x.
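The IID columns follow from inverting SE ≈ 1/√T: the observed Sharpe must sit at least z standard errors above zero, so T ≥ (z / SR)². A minimal sketch, assuming annualized Sharpe ratios, T in years, and two-sided normal critical values. Exact critical values give figures close to, and generally slightly below, the table's rounded ones. The function name is hypothetical.

```python
from statistics import NormalDist


def years_to_reject_zero(sharpe, confidence=0.95):
    """Years of data needed for an annualized Sharpe to sit z SEs above zero."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return (z / sharpe) ** 2


for sr in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(f"SR {sr}: {years_to_reject_zero(sr):5.1f}y at 95%, "
          f"{years_to_reject_zero(sr, 0.99):5.1f}y at 99%")
```

The requirement scales with 1/SR², which is why halving the Sharpe quadruples the record length needed.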
What a Clean Sharpe Can Hide
Even with enough data and honest collection, the Sharpe ratio can paint a misleading picture. Three structural issues sit outside what standard significance tests capture.
1. Real independence of observations
Five hundred trades do not automatically mean five hundred independent bets. A trader running the same directional thesis across correlated instruments, in the same market regime, with overlapping holding periods, may have far fewer independent observations than the trade count suggests. The effective sample size, and therefore the statistical power of the track record, can be a fraction of what it appears.
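As a rough illustration, if every pair of trades shared the same average correlation ρ, the effective number of independent bets would be N / (1 + (N − 1)ρ). The equal-correlation assumption is a simplification chosen for this sketch, not a claim about any real book, and the function name is hypothetical.

```python
def effective_bets(n_trades, avg_correlation):
    """Effective number of independent bets under equal pairwise correlation."""
    return n_trades / (1.0 + (n_trades - 1) * avg_correlation)


for rho in (0.0, 0.05, 0.2):
    print(f"avg correlation {rho:.2f}: {effective_bets(500, rho):.1f} effective bets")
```

Even a modest average correlation of 0.05 reduces 500 trades to fewer than 20 effective bets under this assumption.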
2. Concentration of gains
A strong Sharpe can rest on a handful of outsized wins. If removing the best five trades out of three hundred collapses the result, the performance is fragile: it depends on rare events repeating, not on a consistent edge. A more convincing track record shows gains distributed across trades and time periods, not concentrated in a few positions or a short window.
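The "remove the best five trades" check is easy to automate. The sketch below uses a made-up book of 300 trades and a per-trade Sharpe (mean P&L over its standard deviation, not annualized); both the data and the function names are illustrative.

```python
import statistics


def sharpe_per_trade(pnl):
    """Mean trade P&L divided by its standard deviation (not annualized)."""
    return statistics.mean(pnl) / statistics.stdev(pnl)


def sharpe_without_best(pnl, k):
    """Per-trade Sharpe after dropping the k most profitable trades."""
    trimmed = sorted(pnl)[:-k] if k > 0 else list(pnl)
    return sharpe_per_trade(trimmed)


# Hypothetical book: 295 small trades that net to zero plus 5 outsized winners.
trades = [1.0, -1.0] * 147 + [0.0] + [40.0] * 5
print(f"all trades:     {sharpe_per_trade(trades):.3f}")
print(f"without best 5: {sharpe_without_best(trades, 5):.3f}")
```

In this constructed example the entire edge disappears once the five winners are removed, which is the fragility the paragraph describes.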
3. Non-linear risk profiles
Some strategies produce smooth returns until a specific type of risk materializes. A short-volatility book, a carry trade, or a liquidity-provision strategy can show a clean equity curve for years — then give back several years of gains in weeks. The Sharpe captures the calm periods faithfully. It says nothing about the latent risk that hasn't triggered yet.
Rejecting Sharpe = 0 Is the Wrong Question
Even the table above tests the wrong hypothesis. Rejecting Sharpe = 0 means proving you're better than cash. That's an extremely low bar. The real question an allocator asks is: does this Sharpe exceed a meaningful threshold?
If the cost of capital is equivalent to a Sharpe of 0.3, or if the relevant benchmark delivers a Sharpe of 0.5, the null hypothesis should be H₀: SR ≤ 0.5, not H₀: SR = 0. Testing against a non-zero threshold requires far more data, because you're trying to detect a smaller effect size.
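Under the same 1/√T approximation (annualized Sharpe, T in years, IID returns), the data requirement for a non-zero threshold can be written out directly. Here SR* denotes the benchmark Sharpe and z the one-sided critical value; both symbols are introduced for this sketch.

```latex
\frac{\widehat{SR} - SR^{*}}{1/\sqrt{T}} \;\ge\; z_{1-\alpha}
\quad\Longrightarrow\quad
T \;\ge\; \left( \frac{z_{1-\alpha}}{\widehat{SR} - SR^{*}} \right)^{2}
```

For example, an observed Sharpe of 1.0 tested against SR* = 0.5 leaves an excess of 0.5, so it needs (1.0 / 0.5)² = 4 times as much data as the same Sharpe tested against zero.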
Selection bias
If a trader tested fifty parameter sets, fifty entry rules, or fifty market combinations, and shows only the best-performing variant, that result is far less impressive than it appears. The winner of a fifty-way trial has an enormous built-in advantage that has nothing to do with skill.
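The size of that built-in advantage can be approximated with the expected-maximum formula used in Bailey & López de Prado (2014): among N independent trials with no true edge, the best observed Sharpe is expected to land near the value computed below. The sketch assumes independent trials and a Sharpe standard error of 1.0 (roughly one year of IID data in annualized terms); the function name is hypothetical.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_best_sharpe(n_trials, sharpe_se):
    """Approximate expected maximum Sharpe across n_trials zero-skill trials."""
    inv = NormalDist().inv_cdf
    return sharpe_se * (
        (1.0 - EULER_GAMMA) * inv(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * inv(1.0 - 1.0 / (n_trials * math.e))
    )


for n in (2, 10, 50, 500):
    print(f"{n:>3} trials: best Sharpe expected near {expected_best_sharpe(n, 1.0):.2f}")
```

Under these assumptions, the winner of a fifty-way trial is expected to show a one-year Sharpe above 2 with no skill at all.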
Bailey & López de Prado (2012, 2014) formalized this with the Probabilistic Sharpe Ratio (PSR) and the Deflated Sharpe Ratio (DSR). The DSR adjusts for:
Non-normality of returns (skewness and kurtosis)
The number of strategies tried before selecting this one (multiple testing)
Track record length
A non-zero benchmark Sharpe ratio
Apply the DSR to most track records posted on social media (typically 6–18 months of data, with no accounting for how many strategies were tried) and the vast majority fail to achieve statistical significance. The apparent Sharpe of 1.5 or 2.0 deflates to something statistically indistinguishable from the benchmark.
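For reference, the Probabilistic Sharpe Ratio from Bailey & López de Prado (2012) can be sketched in a few lines. It takes the observed and benchmark Sharpe in per-observation (non-annualized) units, the number of observations, and the skewness and kurtosis of returns (kurtosis is 3 for a normal distribution). The function name and example inputs are illustrative. The DSR applies the same formula with the benchmark raised to the expected best Sharpe across the trials attempted.

```python
import math
from statistics import NormalDist


def probabilistic_sharpe_ratio(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """Estimated probability that the true Sharpe exceeds the benchmark.

    sr and sr_benchmark are per-observation (not annualized) Sharpe ratios.
    """
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr_benchmark) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)


daily_sr = 1.5 / math.sqrt(250)     # annualized 1.5 expressed per day
daily_bench = 0.5 / math.sqrt(250)  # annualized benchmark of 0.5
normal = probabilistic_sharpe_ratio(daily_sr, daily_bench, n_obs=250)
fat_tailed = probabilistic_sharpe_ratio(daily_sr, daily_bench, n_obs=250,
                                        skew=-1.0, kurt=8.0)
print(f"PSR, normal returns: {normal:.3f}; negative skew, fat tails: {fat_tailed:.3f}")
```

With one year of daily data, an annualized Sharpe of 1.5 clears a 0.5 benchmark with a probability well short of 95%, and negative skew or fat tails lower it further.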
Short track records are statistically meaningless. A 6-month record cannot distinguish skill from variance, period. Even after a year, you can only tell if the Sharpe is likely positive, not whether it's good enough to allocate to.
Daily observations are non-negotiable. Monthly data gives 12 points per year, daily gives 250. This is why institutional verification uses daily equity snapshots.
Continuity must be enforced, not self-reported. Any gap in the equity curve reduces the effective sample size and can hide the worst drawdowns. A track record with holes is not a track record.
Data must be collected independently at the source. Cherry-picked windows, exported spreadsheets, and screenshots invalidate any significance test before it begins.
A clean Sharpe is necessary but not sufficient. Gain concentration, hidden non-linear risks, and correlated trades can make a track record look more robust than it is. Significance tests only work if the inputs are structurally honest.
The Deflated Sharpe Ratio is the proper framework. It accounts for multiple testing, non-normality, and a real benchmark. These three adjustments collapse most publicly shared track records.
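Checking continuity is mechanical once snapshots are dated. A minimal sketch, assuming one equity snapshot per calendar day and a hypothetical `find_gaps` helper; a real check would also need the venue's trading calendar.

```python
from datetime import date, timedelta


def find_gaps(snapshot_dates, max_step_days=1):
    """Return (last_seen, next_seen) pairs where snapshots are further apart than allowed."""
    ordered = sorted(set(snapshot_dates))
    return [
        (prev, curr)
        for prev, curr in zip(ordered, ordered[1:])
        if (curr - prev) > timedelta(days=max_step_days)
    ]


# Hypothetical record: daily snapshots in January with 10-19 Jan missing.
days = [date(2026, 1, 1) + timedelta(days=i) for i in range(31)]
record = [d for d in days if not (10 <= d.day <= 19)]
print(find_gaps(record))
```

Any non-empty result marks a stretch of the equity curve that the significance tests above cannot see.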
References
Lo, A.W. (2002). "The Statistics of Sharpe Ratios." Financial Analysts Journal, 58(4), 36–52.
Bailey, D.H. & López de Prado, M. (2012). "The Sharpe Ratio Efficient Frontier." Journal of Risk, 15(2), 3–44.
Bailey, D.H. & López de Prado, M. (2014). "The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality." Journal of Portfolio Management, 40(5), 94–107.
Christie, S. (2005). "Is the Sharpe Ratio Useful in Asset Allocation?" MAFC Research Papers, Macquarie University.