Forecast models in statistical seismology are commonly evaluated with log-likelihood scores of the full distribution P(n) of earthquake numbers, yet heavy tails and out-of-range observations can bias model ranking. We develop a tail-aware evaluation framework that estimates cell-wise P(n) using adaptive Gaussian kernel density estimation and tests three strategies for handling out-of-range counts. Using the AoyuX platform, we perform an approximately 25-year, month-by-month pseudo-prospective forecast experiment in the China Seismic Experimental Site (CSES), comparing an Epidemic-Type Aftershock Sequence (ETAS) model with a homogeneous background (ETASμ) to a spatially heterogeneous variant (ETASμ(x,y)) across six spatial resolutions and five magnitude thresholds. Empirical probability density functions (PDFs) of counts per cell are well described by power laws with exponents a = 1.40 ± 0.21 across all settings. Combined with previous theoretical results and a b-value of 0.8, this yields a robust estimate of the productivity exponent, α = 0.57 ± 0.08, a valuable quantification of this key parameter in aftershock modeling. Model ranking is sensitive to how the tail of P(n) is treated: power-law extrapolation is both theoretically justified and empirically the most robust of the strategies tested. Cumulative information gain (CIG) shows that ETASμ(x,y) outperforms ETASμ in data-rich configurations, whereas in data-poor settings stochastic fluctuations dominate. A coefficient-of-variation analysis of per-window log-likelihood differences distinguishes genuine upward trends in CIG from noise-dominated fluctuations. By aligning a fat-tail-aware scoring methodology with an open testing platform, our work advances fair and statistically grounded assessment of earthquake forecasting models for the CSES and beyond.
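
As a rough illustration of the CIG and coefficient-of-variation diagnostics mentioned in the abstract, the following Python sketch may help. It assumes, since the abstract does not define them, that CIG is the running sum of per-window log-likelihood differences between two models and that the coefficient of variation is the sample standard deviation of those differences divided by the absolute value of their mean. The function names and the input scores are hypothetical and are not taken from the AoyuX platform.

```python
from itertools import accumulate
from statistics import mean, stdev


def cumulative_information_gain(loglik_a, loglik_b):
    """Return per-window differences (A minus B) and their running sum.

    Assumed definition: CIG after window k is the sum of the first k
    log-likelihood differences.
    """
    if len(loglik_a) != len(loglik_b):
        raise ValueError("both models need one score per forecast window")
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    return diffs, list(accumulate(diffs))


def coefficient_of_variation(diffs):
    """Sample standard deviation divided by |mean| (assumed definition).

    Large values suggest that fluctuations dominate any systematic gain.
    """
    m = mean(diffs)
    if m == 0:
        return float("inf")
    return stdev(diffs) / abs(m)


# Hypothetical per-window log-likelihood scores, for illustration only.
ll_heterogeneous = [-10.0, -12.0, -9.0, -11.0]
ll_homogeneous = [-11.0, -12.5, -10.5, -11.0]

diffs, cig = cumulative_information_gain(ll_heterogeneous, ll_homogeneous)
cv = coefficient_of_variation(diffs)
print("CIG per window:", cig)
print("Coefficient of variation:", round(cv, 3))
```

Under these assumptions, a steadily rising CIG with a small coefficient of variation would point to a genuine advantage of one model, while a large coefficient of variation would indicate that the apparent trend is mostly noise.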