Does the model self-improve?
We track how calibration (ECE) and hit rate evolve against the frequentist baseline, revision after revision. Improving calibration makes the probabilities more reliable — it does not increase wins: EV stays negative.
How to read these charts
Each chart compares two different things on two different axes. The left axis (in %) measures how often the model is right; the right axis measures how reliable its probabilities are.
Model hit rate: solid line, game colour, left axis (%). How often the model's generated plays land a useful result. The higher the line, the better the model selects.
Frequentist baseline: dashed red line, same left axis (%). The level you'd get picking numbers from historical frequencies alone, with no model. When the coloured line sits above the red one, the model beats the baseline.
ECE (calibration error): light line, right axis. How closely the model's stated probabilities match what actually happens. Here it's the opposite: lower is better. Near 0 = honest probabilities. Target ≤ 0.05.
The two scales are independent: compare the coloured line only with the red one. Read the light ECE line on its own against the right axis. If it appears to cross the baseline, that is an artefact of the dual axes, not a real comparison.
And either way: better calibration means more honest probabilities, not more wins. EV stays negative by definition.
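For readers who want to see what sits behind the ECE line, here is a minimal sketch. It assumes the common equal-width binning definition of ECE, which this page does not confirm as the exact formula used. The function name, the 10-bin default and the constant `ECE_TARGET` are illustrative; only the 0.05 target comes from the text above.

```python
from typing import Sequence

ECE_TARGET = 0.05  # target stated on this page; the constant name is illustrative


def expected_calibration_error(
    probs: Sequence[float],
    outcomes: Sequence[int],
    n_bins: int = 10,  # assumption: 10 equal-width bins
) -> float:
    """Binned ECE: weighted mean gap between stated probability and observed hit rate."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    if not probs:
        raise ValueError("at least one prediction is required")

    prob_sum = [0.0] * n_bins   # sum of stated probabilities per bin
    hit_sum = [0.0] * n_bins    # number of hits (outcome == 1) per bin
    count = [0] * n_bins        # number of predictions per bin

    for p, y in zip(probs, outcomes):
        # Map p in [0, 1] to a bin index; p == 1.0 goes into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        prob_sum[idx] += p
        hit_sum[idx] += y
        count[idx] += 1

    total = len(probs)
    ece = 0.0
    for ps, hs, c in zip(prob_sum, hit_sum, count):
        if c == 0:
            continue  # empty bins contribute nothing
        mean_prob = ps / c
        hit_rate = hs / c
        ece += (c / total) * abs(hit_rate - mean_prob)
    return ece


def meets_target(ece: float, target: float = ECE_TARGET) -> bool:
    """True when the calibration error is at or below the target."""
    return ece <= target
```

A model that says 20% and is right one time in five scores an ECE near 0. A model that says 90% and is never right scores about 0.9. Neither figure says anything about winnings.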
Room for improvement
How much better the model selects than the frequentist baseline, measured on the aggregated live sample and reported with a confidence interval. The value appears as soon as the evaluated signals reach the minimum threshold.
Better calibration means more honest, reliable probabilities — not more wins. Expected value (EV) stays negative by definition and no system guarantees a win.
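One way such a figure could be produced is sketched below. This is an assumption, not the method used on this page: the interval type (Wilson score), the 95% level, the treatment of the baseline as a fixed rate, and the `MIN_SIGNALS` value of 30 are all hypothetical choices made for illustration.

```python
import math
from typing import Optional, Tuple

MIN_SIGNALS = 30  # hypothetical minimum number of evaluated signals


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a hit rate (z = 1.96 gives roughly 95%)."""
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def improvement_over_baseline(
    hits: int,
    n: int,
    baseline_rate: float,
    min_signals: int = MIN_SIGNALS,
) -> Optional[Tuple[float, float, float]]:
    """Return (difference, lower, upper) versus the baseline, or None below the threshold.

    The baseline is treated as a fixed rate, so the interval on the model's
    hit rate is simply shifted by baseline_rate.
    """
    if n < min_signals:
        return None  # not enough evaluated signals yet: show no value
    low, high = wilson_interval(hits, n)
    return hits / n - baseline_rate, low - baseline_rate, high - baseline_rate
```

Calling `improvement_over_baseline(5, 10, 0.4)` returns `None` because the sample is below the threshold. With a larger sample the function returns the difference in hit rate together with its lower and upper bounds. An interval that includes 0 means the sample does not yet separate the model from the baseline.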
Past performance does not guarantee future results · No system guarantees a win · EV is negative by definition · 18+ · adm.gov.it