v2 Statistical Hardening
Post-processed outputs from pipeline.stage7_v2_hardening —
bootstrap / percentile confidence intervals on CV metrics, weighted
consensus feature ranking with stability filtering, Wilson-CI somatic
dN/dS with proper handling of zero-synonymous genes, and
percentage-point specificity deltas.
v1 outputs (under results/<cohort>/) are frozen.
v2 adds a read-only statistical layer (improvements #4, #5, #14, #15
from the hardening spec) without retraining. Heavy modules
(nested-CV, SHAP, LOCO, permutation tests, GSEA, MC-Dropout, GEO
external validation) ship as Python APIs but are not invoked here.
🔬 Stage 8 — Reliability Hardening (signature-only metrics)
Re-evaluates every cohort on its biologically meaningful candidate gene set using logistic regression, out-of-fold (OOF) predictions, Youden-threshold tuning, and 1000-iteration bootstrap 95 % CIs. PRAD specificity 73.5 % → 96.2 % (+22.7 pp); BRCA AUC 0.999 → 0.890. Source: /api/v2/reliability.
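As a rough illustration of the Youden-threshold step, the sketch below picks the score cut-off that maximises J = sensitivity + specificity - 1 on out-of-fold scores. The function name `youden_threshold` and the convention that a score at or above the cut-off counts as positive are assumptions made for this example, not the pipeline's actual API.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximising J = sensitivity + specificity - 1.

    Hypothetical helper: labels are 0/1 and a sample is called positive
    when its score is >= threshold (an assumed convention).
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required to compute Youden's J")

    best_t, best_j = None, -1.0
    # Every distinct observed score is a candidate cut-off.
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:  # strict '>' keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j


# Toy out-of-fold scores, for illustration only.
labels = [0, 0, 0, 1, 1, 1]
oof_scores = [0.1, 0.2, 0.6, 0.4, 0.7, 0.9]
threshold, j_stat = youden_threshold(labels, oof_scores)
```

Tuning the threshold on out-of-fold scores rather than on training scores is what keeps the resulting sensitivity and specificity estimates from being optimistic.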
Specificity Improvement (pp) — Improvement #14
Baseline (LR, no FocalLoss/SMOTE) vs. optimised MLP (FocalLoss α=0.25, γ=2.0). Absolute percentage-point deltas only; misleading percentage-change metrics have been removed.
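To make the distinction concrete, the snippet below contrasts an absolute percentage-point delta with a relative percentage change. It uses the PRAD specificity figures quoted in the Stage 8 summary purely as input numbers. The helper names are hypothetical.

```python
def pp_delta(baseline_pct, optimised_pct):
    """Absolute change in percentage points (the metric that is reported)."""
    return optimised_pct - baseline_pct


def relative_change_pct(baseline_pct, optimised_pct):
    """Relative change as a percentage of the baseline (the removed metric)."""
    return 100.0 * (optimised_pct - baseline_pct) / baseline_pct


baseline, optimised = 73.5, 96.2  # PRAD specificity, in percent
print(f"{pp_delta(baseline, optimised):+.1f} pp")            # +22.7 pp
print(f"{relative_change_pct(baseline, optimised):+.1f} %")  # +30.9 %
```

The relative figure depends on how low the baseline was, so the same absolute gain looks larger for a weak baseline. That is why only the percentage-point delta is kept.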
Metrics with 95 % CIs — Improvement #4
Stratified bootstrap (N = 1000 resamples) when raw predictions are available; otherwise percentile CIs over the 5 CV folds. The Source column indicates which method was used.
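The two CI methods can be sketched as follows. This is a minimal standard-library illustration, not the code in pipeline.stage7_v2_hardening: the function names, the fixed seed, and the linear-interpolation percentile rule are all assumptions. Resampling within each class keeps the class balance fixed, so per-class metrics such as specificity are always defined in every resample.

```python
import random


def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def specificity(y_true, y_pred):
    """True-negative rate for 0/1 labels."""
    neg = [p for y, p in zip(y_true, y_pred) if y == 0]
    return sum(1 for p in neg if p == 0) / len(neg)


def stratified_bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI from resampling with replacement inside each class."""
    rng = random.Random(seed)
    idx_by_class = {}
    for i, y in enumerate(y_true):
        idx_by_class.setdefault(y, []).append(i)

    stats = []
    for _ in range(n_boot):
        sample = []
        for idx in idx_by_class.values():
            sample.extend(rng.choices(idx, k=len(idx)))
        stats.append(metric([y_true[i] for i in sample],
                            [y_pred[i] for i in sample]))
    stats.sort()
    return percentile(stats, alpha / 2), percentile(stats, 1 - alpha / 2)


def fold_percentile_ci(fold_values, alpha=0.05):
    """Fallback: percentile CI over per-fold metric values."""
    vals = sorted(fold_values)
    return percentile(vals, alpha / 2), percentile(vals, 1 - alpha / 2)
```

With only 5 folds the fallback interval is bounded by the smallest and largest fold value, so it is coarser than the bootstrap interval and should be read accordingly.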
Somatic dN/dS with Wilson CIs — Improvement #15
Wilson score interval on the binomial proportion non-synonymous / total
mutations per gene (an approximate interval, not an exact one).
Genes with < 3 synonymous observations are filtered out (see
selected column). A gene is marked selected
only if dnds_ci_lo > 1, i.e. the lower CI bound excludes neutrality.
Infinite dN/dS values (zero synonymous mutations) are
replaced with a pseudocount-adjusted estimate or marked undefined.
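For reference, the Wilson score interval on a binomial proportion can be computed as in the sketch below. The function names and the `min_syn` parameter are hypothetical. How the pipeline maps the proportion interval onto the dN/dS scale (the dnds_ci_lo column) is not described here, so the example stops at the proportion and the minimum-synonymous filter.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95 % CI


def wilson_ci(k, n, z=Z_95):
    """Wilson score interval for k successes out of n trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at k = 0 or k = n.
    return max(0.0, centre - half), min(1.0, centre + half)


def nonsyn_proportion_ci(n_nonsyn, n_syn, min_syn=3):
    """Return the Wilson CI on non-synonymous / total, or None if filtered.

    Genes with fewer than `min_syn` synonymous observations are dropped,
    mirroring the filter described above (assumed default: 3).
    """
    if n_syn < min_syn:
        return None
    return wilson_ci(n_nonsyn, n_nonsyn + n_syn)
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for small counts, which matters for genes with only a handful of mutations.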
Weighted Consensus Ranking — Improvement #5
LR / RF / MLP feature importances min-max normalised, ranked, and combined with weights 0.3 / 0.3 / 0.4. Stability filter keeps genes that make the top-k in ≥ 4 of 5 folds (falls back to top-k when only single-fold importances are persisted).
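A minimal sketch of the consensus ranking and stability filter follows. It assumes the weights apply to the min-max-normalised importances and that genes are then ranked by the combined score. The text above could also be read as combining per-model ranks, so treat that choice, along with every function and key name here, as an assumption.

```python
WEIGHTS = {"lr": 0.3, "rf": 0.3, "mlp": 0.4}  # weights quoted above


def min_max(importances):
    """Scale one model's importances to [0, 1]; constant input maps to 0."""
    lo, hi = min(importances.values()), max(importances.values())
    span = hi - lo
    return {g: ((v - lo) / span if span else 0.0) for g, v in importances.items()}


def consensus_scores(per_model, weights=WEIGHTS):
    """Weighted sum of normalised importances over genes seen by all models."""
    normed = {m: min_max(imp) for m, imp in per_model.items()}
    genes = set.intersection(*(set(n) for n in normed.values()))
    return {g: sum(weights[m] * normed[m][g] for m in normed) for g in genes}


def top_k(scores, k):
    """Top-k genes by score, ties broken alphabetically for determinism."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [g for g, _ in ranked[:k]]


def stable_genes(folds, k, min_folds=4):
    """Genes in the top-k of at least `min_folds` folds.

    `folds` is a list of {model: {gene: importance}} dicts, one per fold.
    With a single fold the filter cannot be applied, so plain top-k is returned.
    """
    if len(folds) == 1:
        return top_k(consensus_scores(folds[0]), k)
    counts = {}
    for fold in folds:
        for g in top_k(consensus_scores(fold), k):
            counts[g] = counts.get(g, 0) + 1
    return sorted(g for g, c in counts.items() if c >= min_folds)
```

Normalising each model's importances before weighting keeps a model with large raw values (for example, unscaled LR coefficients) from dominating the combined score.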