1. Data Sources & Preprocessing

RNA-seq HTSeq counts were obtained from the TCGA GDC portal for five cancer types:

Cancer Type                  Total   Tumor   Normal   Ratio (tumor:normal)
BRCA (Breast)                1,218   1,104   114      9.7:1
BLCA (Bladder)               426     407     19       21.4:1
PRAD (Prostate)              550     498     52       9.6:1
LUAD (Lung Adenocarcinoma)   576     517     59       8.8:1
UCEC (Uterine)               201     177     24       7.4:1

DESeq2 pre-filtering: Genes were retained only if |log2FC| > 1.5 with Benjamini–Hochberg adjusted p < 0.05, leaving ~13,660 genes for downstream modelling.
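The pre-filter can be sketched in a few lines. This is a minimal illustration, not the pipeline's own code: the function name, the tuple layout, and the gene labels are hypothetical, and only the two cut-offs come from the text above.

```python
import math

def passes_prefilter(log2fc, padj, lfc_cutoff=1.5, padj_cutoff=0.05):
    """Return True when a gene meets both pre-filter criteria."""
    # DESeq2 can report NA for padj; treat a missing value as "not significant".
    if padj is None or math.isnan(padj):
        return False
    return abs(log2fc) > lfc_cutoff and padj < padj_cutoff

# Hypothetical rows: (gene, log2 fold change, BH-adjusted p-value)
results = [
    ("GENE_A", 2.3, 0.001),         # passes both criteria
    ("GENE_B", -1.8, 0.030),        # passes: the sign of the fold change is ignored
    ("GENE_C", 1.2, 0.0001),        # fails the effect-size cut-off
    ("GENE_D", 3.0, 0.200),         # fails the significance cut-off
    ("GENE_E", -2.5, float("nan")), # no adjusted p-value available
]

kept = [gene for gene, lfc, padj in results if passes_prefilter(lfc, padj)]
print(kept)
```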

Class balancing: SMOTE oversampling is applied within each CV fold for cancer types with severe class imbalance (PRAD: 9.6:1, BLCA: 21.4:1). For other cancers, class-weight balancing is used instead.
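As a rough illustration of class-weight balancing, the sketch below assumes the common inverse-frequency heuristic (the one scikit-learn applies for class_weight="balanced"); the text above does not say which weighting scheme the pipeline uses, so treat the formula and the function name as assumptions.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

# BRCA counts from the cohort table: 1,104 tumor and 114 normal samples.
brca_weights = balanced_class_weights({"tumor": 1104, "normal": 114})
print(brca_weights)
```

Under this heuristic the ratio of the two weights equals the class ratio, so each normal sample counts about 9.7 times as much as a tumor sample during training.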

Batch correction: ComBat batch correction was applied for PRAD to address adjacent-normal heterogeneity between sequencing batches.

2. ML Models

Three complementary model types are trained per cancer type using 5-fold stratified cross-validation:

Model                      Hyperparameters                    Feature Importance Method
Logistic Regression (L2)   L2 penalty, C=1.0                  |coefficients|
Random Forest              100–500 trees, max_depth=None      Gini importance
MLP Neural Network         Dynamic architecture (see below)   gradient × input saliency

Dynamic MLP architecture:

  • 512 → 256 → 128 neurons when n > 600 samples
  • 256 → 128 neurons when n ≤ 600 samples

BatchNorm1d is applied between each hidden layer.
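The size rule above can be expressed as a small helper. This sketch assumes that n refers to the total cohort size from the table in Section 1 (it could equally be the training-fold size); the function name is hypothetical.

```python
def hidden_layer_sizes(n_samples, threshold=600):
    """Pick the hidden-layer widths from the number of samples."""
    if n_samples > threshold:
        return (512, 256, 128)
    return (256, 128)

# Total samples per cohort, taken from the cohort table.
cohort_sizes = {"BRCA": 1218, "BLCA": 426, "PRAD": 550, "LUAD": 576, "UCEC": 201}
layers = {cancer: hidden_layer_sizes(n) for cancer, n in cohort_sizes.items()}
print(layers)
```

Under that assumption only BRCA exceeds the 600-sample threshold and receives the three-layer network.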

FocalLoss (α=0.25, γ=2.0) replaces BCEWithLogitsLoss to focus training on hard-to-classify normal samples, improving specificity for imbalanced cohorts.
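To show why the focal term emphasises hard examples, here is a scalar sketch of the standard focal-loss formula, FL = -α_t (1 - p_t)^γ log(p_t). It is written in plain Python for readability rather than as the pipeline's PyTorch module, and it assumes α weights the positive class (label 1); which class the pipeline treats as positive is not stated above.

```python
import math

def focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Focal loss for one sample, given a raw logit and a 0/1 target."""
    p = 1.0 / (1.0 + math.exp(-logit))                     # sigmoid probability of class 1
    p_t = p if target == 1 else 1.0 - p                    # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha        # class-dependent weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(4.0, 1)    # confidently correct prediction
hard = focal_loss(-2.0, 1)   # confidently wrong prediction
print(easy, hard)
```

With γ = 0 and α_t = 1 the expression reduces to ordinary binary cross-entropy; raising γ shrinks the contribution of well-classified samples far more than that of misclassified ones.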
3. Gene Signature Extraction

For each cancer type the gene signature is constructed as the union of top-N genes across all three models (LR, RF, MLP).

  • Pseudogene blacklist filter: Genes annotated as pseudogenes in Ensembl (biotype filtering) are removed before ranking.
  • Importance renormalisation: After filtering, importance scores are renormalised so they sum to 1.0 within each model.
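The filtering, renormalisation, and union steps described in this section could look roughly like the sketch below. The function names, the gene labels, and the per-model score dictionaries are hypothetical; the real pipeline takes the pseudogene list from Ensembl biotypes.

```python
def renormalise(scores):
    """Rescale importance scores so they sum to 1.0."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    return {gene: s / total for gene, s in scores.items()}

def build_signature(importances, pseudogenes, top_n):
    """Union of each model's top-N genes after pseudogene removal."""
    signature = set()
    cleaned = {}
    for model, scores in importances.items():
        # Drop blacklisted genes first, then renormalise within the model.
        kept = renormalise({g: s for g, s in scores.items() if g not in pseudogenes})
        cleaned[model] = kept
        ranked = sorted(kept, key=kept.get, reverse=True)
        signature.update(ranked[:top_n])
    return signature, cleaned

# Hypothetical importance scores for the three models.
importances = {
    "LR": {"G1": 0.5, "G2": 0.3, "P1": 0.2},
    "RF": {"G3": 0.5, "G2": 0.3, "P1": 0.2},
    "MLP": {"P1": 0.6, "G1": 0.3, "G4": 0.1},
}
signature, cleaned = build_signature(importances, pseudogenes={"P1"}, top_n=1)
print(sorted(signature))
```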
4. Germline dN/dS (Conservation)

Cross-species comparison spanning ~90–400 Myr of divergence is used to quantify purifying selection on protein-coding genes.

Species panel: mouse, rat, dog, cow, opossum, zebrafish.

A weighted mean dN/dS is computed across species, weighted by divergence time. Genes with dN/dS < 0.3 are classified as under purifying selection, indicating they are functionally constrained and likely essential.
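A divergence-weighted mean can be sketched as follows. The divergence times below are illustrative placeholders chosen within the ~90–400 Myr range quoted above, not the values the pipeline uses, and skipping species without an ortholog is an assumption.

```python
# Placeholder divergence times in Myr (illustrative only).
DIVERGENCE_MYR = {
    "mouse": 90, "rat": 90, "dog": 96, "cow": 96, "opossum": 160, "zebrafish": 400,
}
PURIFYING_THRESHOLD = 0.3

def weighted_mean_dnds(per_species, weights=DIVERGENCE_MYR):
    """Divergence-weighted mean dN/dS; species with no estimate (None) are skipped."""
    pairs = [(v, weights[s]) for s, v in per_species.items() if v is not None]
    if not pairs:
        return None
    total_weight = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight

# Hypothetical per-species estimates for one gene (no zebrafish ortholog).
gene = {"mouse": 0.10, "rat": 0.12, "dog": 0.15, "cow": 0.14,
        "opossum": 0.20, "zebrafish": None}
mean_dnds = weighted_mean_dnds(gene)
is_constrained = mean_dnds is not None and mean_dnds < PURIFYING_THRESHOLD
print(mean_dnds, is_constrained)
```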

5. Somatic dN/dS (Selection)

Somatic dN/dS is calculated using a binomial exact test comparing observed nonsynonymous mutations to expected counts under neutral evolution (expected nonsynonymous proportion = 2.85/(1+2.85) ≈ 0.74). Benjamini–Hochberg FDR correction is applied to genes with dN/dS > 1. This is a simplified approach compared with the dNdScv method (Martincorena et al., 2017), which accounts for gene-specific covariates.
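The test and the FDR step can be illustrated with the standard library alone. This sketch assumes a one-sided test for an excess of nonsynonymous mutations and a point estimate of dN/dS = (N/S)/2.85; neither detail is confirmed by the text, and the gene labels and counts are hypothetical.

```python
from math import comb, inf

NEUTRAL_ODDS = 2.85
P_NONSYN_NEUTRAL = NEUTRAL_ODDS / (1 + NEUTRAL_ODDS)   # about 0.74

def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p); suitable for small counts."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues

# Hypothetical (nonsynonymous, synonymous) mutation counts per gene.
counts = {"GENE_A": (30, 2), "GENE_B": (15, 5), "GENE_C": (6, 4)}

tested, pvals = [], []
for gene, (n_non, n_syn) in counts.items():
    dnds = (n_non / n_syn) / NEUTRAL_ODDS if n_syn else inf
    if dnds > 1:                                   # only genes with dN/dS > 1 are tested
        tested.append(gene)
        pvals.append(binom_upper_tail(n_non, n_non + n_syn, P_NONSYN_NEUTRAL))

qvals = dict(zip(tested, benjamini_hochberg(pvals)))
print(qvals)
```

For genes with thousands of mutations the exact sum becomes slow and can overflow; scipy.stats.binomtest with alternative="greater" would be the usual replacement.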

Genes under positive somatic selection must satisfy all three criteria:

  • dN/dS ≥ 1.5
  • 95% CI lower bound > 1.0
  • FDR q < 0.05 (TMB-adaptive: < 0.01 for hypermutated cancers)

Threshold        Old Value   New Value                  Rationale
dN/dS minimum    1.0         1.5                        Reduces false positives from near-neutral genes
CI lower bound               > 1.0                      Ensures statistical robustness
FDR threshold    0.05        0.05 (0.01 for high-TMB)   TMB-adaptive: stricter threshold for hypermutated cancers (e.g., UCEC)

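The three criteria combine into a single predicate. In this sketch the function name and the high_tmb flag are hypothetical; how the pipeline decides that a cohort is hypermutated is not described here.

```python
def is_positively_selected(dnds, ci_lower, q_value, high_tmb=False):
    """Apply all three somatic selection criteria with the TMB-adaptive FDR cut-off."""
    q_cutoff = 0.01 if high_tmb else 0.05
    return dnds >= 1.5 and ci_lower > 1.0 and q_value < q_cutoff

# The same hypothetical gene passes in a standard cohort but not in a high-TMB one.
standard = is_positively_selected(dnds=2.0, ci_lower=1.2, q_value=0.03)
hypermutated = is_positively_selected(dnds=2.0, ci_lower=1.2, q_value=0.03, high_tmb=True)
print(standard, hypermutated)
```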
6. Integration & Candidate Identification

Candidate cancer dependencies are identified at the intersection of three evidence layers:

  • ML-predictive: gene appears in the top-N signature
  • Germline conserved: weighted mean dN/dS < 0.3 across species
  • Somatic selected: dN/dS ≥ 1.5, 95% CI lower bound > 1.0, FDR q < 0.05 (q < 0.01 for high-TMB cancers)

Cross-cancer validation: Genes appearing in ≥ 2 cancer types receive higher confidence. Priority scoring is based on multi-criteria ranking across all three layers.
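The intersection and the cross-cancer count can be sketched with plain sets. The gene labels and per-cohort sets below are hypothetical, and the multi-criteria priority score is not reproduced because its formula is not given.

```python
from collections import Counter

def find_candidates(ml_signature, germline_conserved, somatic_selected):
    """Genes supported by all three evidence layers."""
    return set(ml_signature) & set(germline_conserved) & set(somatic_selected)

# Hypothetical evidence layers per cohort: (ML signature, conserved, selected).
layers = {
    "BRCA": ({"G1", "G2", "G3"}, {"G1", "G2"}, {"G1", "G2", "G5"}),
    "LUAD": ({"G2", "G4"}, {"G2", "G4"}, {"G2"}),
    "UCEC": ({"G6"}, {"G6"}, {"G6"}),
}
per_cancer = {cancer: find_candidates(*sets) for cancer, sets in layers.items()}

# Count how many cohorts each candidate appears in; >= 2 earns higher confidence.
occurrences = Counter(gene for genes in per_cancer.values() for gene in genes)
cross_cancer = {gene for gene, n in occurrences.items() if n >= 2}
print(per_cancer, cross_cancer)
```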

7. Statistical Framework
  • Balanced accuracy — primary classification metric (handles class imbalance by averaging per-class recall).
  • MCC (Matthews Correlation Coefficient) — single-number measure of binary classification quality that accounts for all four confusion-matrix cells.
  • Benjamini–Hochberg FDR correction applied to all multiple-testing scenarios (DESeq2, somatic dN/dS).
  • 95% confidence intervals for somatic dN/dS estimates, computed via profile likelihood.
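Both classification metrics follow directly from the confusion matrix. The sketch below uses a hypothetical BLCA-sized result (407 tumors, 19 normals, tumor as the positive class) to show how plain accuracy can look strong while the two imbalance-aware metrics stay more modest.

```python
from math import sqrt

def balanced_accuracy(tp, fp, tn, fn):
    """Mean of sensitivity (tumor recall) and specificity (normal recall)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0.0 when a margin is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical confusion matrix for a cohort with 407 tumors and 19 normals.
tp, fn, tn, fp = 400, 7, 15, 4
accuracy = (tp + tn) / (tp + fn + tn + fp)
bal_acc = balanced_accuracy(tp, fp, tn, fn)
matthews = mcc(tp, fp, tn, fn)
print(round(accuracy, 3), round(bal_acc, 3), round(matthews, 3))
```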
8. Reproducibility
  • Random seed = 42 for all stochastic operations (train/test splits, model initialisation, SMOTE).
  • All thresholds centralised in config.py — no magic numbers in pipeline code.
  • Results are namespaced by cancer type (e.g. results/TCGA-BRCA/), enabling independent re-runs per cohort.
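A centralised configuration might look like the sketch below. The class, field, and function names are hypothetical (the actual contents of config.py are not shown here); the values are the thresholds quoted elsewhere on this page.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Thresholds:
    """Illustrative container for pipeline thresholds; field names are assumptions."""
    random_seed: int = 42
    log2fc_min: float = 1.5             # DESeq2 pre-filter
    padj_max: float = 0.05              # DESeq2 pre-filter
    germline_dnds_max: float = 0.3      # purifying-selection cut-off
    somatic_dnds_min: float = 1.5       # positive-selection cut-off
    somatic_ci_lower_min: float = 1.0   # CI lower bound must exceed this
    fdr_max: float = 0.05
    fdr_max_high_tmb: float = 0.01      # stricter cut-off for hypermutated cohorts

CONFIG = Thresholds()

def results_dir(cancer_type, root="results"):
    """Per-cohort output directory, e.g. results/TCGA-BRCA."""
    return f"{root}/TCGA-{cancer_type}"

# Seed the standard-library RNG; NumPy, scikit-learn and PyTorch need their own seeding.
random.seed(CONFIG.random_seed)
print(results_dir("BRCA"))
```

Freezing the dataclass means an accidental reassignment of a threshold raises an error instead of silently changing downstream results.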
9. Limitations
  • Bulk RNA-seq only: does not capture single-cell heterogeneity within tumour or stromal compartments.
  • Limited normal samples for some cancer types (BLCA: 19 normals, UCEC: 24 normals), mitigated by SMOTE or class weighting but not eliminated.
  • Somatic dN/dS depends on mutation count: low-mutation genes produce wide confidence intervals and may be missed.
  • Cross-species dN/dS may miss lineage-specific functional constraints that arose after the last common ancestor.
  • PRAD under-powered: prostate cancer has the lowest TMB in our cohort (median 2 nonsyn/gene), yielding only 1 candidate. v2 signature-only retrain: AUC = , Youden specificity = .
  • Inflated v1 AUC: headline AUC ≥ 0.99 on several cohorts reflects the full ~5,000-gene DESeq2 feature set memorising tumour-vs-normal patterns. The v2 Reliability layer retrains on final candidate genes only; see Stage 8.
  • UCEC hypermutation: elevated TMB (median 37 nonsyn/gene) inflates the number of genes reaching statistical significance. TMB-adaptive FDR (q < 0.01) partially addresses this, but the 116 candidates should be interpreted cautiously.
  • Infinite dN/dS: genes with zero synonymous mutations yield dN/dS = ∞. These are retained when FDR is significant (e.g., TP53 in PRAD: 57 nonsyn, 0 syn), as the statistical test accounts for mutation counts.

v2 Statistical Hardening: Methods

Stage 7 (pipeline.stage7_v2_hardening) is a read-only post-processor over the frozen v1 artifacts. It applies four statistical corrections:

  1. #4 Bootstrap / percentile CIs — 5-fold held-out predictions, N = 1000 stratified bootstrap (or percentile over 5 fold metrics when raw predictions are unavailable).
  2. #5 Weighted consensus ranking + stability selection — LR / RF / MLP importances min-max normalised, weighted 0.3 / 0.3 / 0.4, retained in top-k across ≥ 4 of 5 folds.
  3. #14 Percentage-point notation — all specificity improvements reported as absolute pp, not % change.
  4. #15 Wilson-CI somatic dN/dS — binomial Wilson 95 % CI; minimum-synonymous filter (≥ 3); ∞ values rejected; selection criterion ci_lower > 1.
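Correction #15 can be illustrated with a short sketch. It assumes the Wilson interval is computed on the nonsynonymous proportion N/(N+S) and then mapped to the dN/dS scale by dividing the odds by the neutral ratio 2.85; that mapping and the function names are assumptions, not confirmed details of Stage 7.

```python
from math import sqrt

NEUTRAL_ODDS = 2.85   # neutral nonsynonymous:synonymous ratio used in Section 5
Z_95 = 1.959964       # standard normal quantile for a 95% interval

def wilson_interval(k, n, z=Z_95):
    """Wilson score interval for a binomial proportion k/n."""
    p_hat = k / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def proportion_to_dnds(p):
    """Convert a nonsynonymous proportion to dN/dS under the neutral odds."""
    return (p / (1 - p)) / NEUTRAL_ODDS

def somatic_dnds_wilson(n_nonsyn, n_syn, min_syn=3):
    """Return dN/dS with a Wilson-based CI, or None when too few synonymous sites."""
    if n_syn < min_syn:          # also rejects the dN/dS = infinity case (n_syn == 0)
        return None
    lo, hi = wilson_interval(n_nonsyn, n_nonsyn + n_syn)
    ci_lower = proportion_to_dnds(lo)
    return {
        "dnds": (n_nonsyn / n_syn) / NEUTRAL_ODDS,
        "ci_lower": ci_lower,
        "ci_upper": proportion_to_dnds(hi),
        "selected": ci_lower > 1.0,
    }

rejected = somatic_dnds_wilson(57, 0)   # TP53 in PRAD: no synonymous mutations
example = somatic_dnds_wilson(60, 5)    # hypothetical gene with enough synonymous sites
print(rejected, example)
```

Unlike the v1 behaviour described under Limitations, a gene with zero synonymous mutations is dropped here rather than reported with an infinite estimate.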

Heavy modules for improvements #1 (dNdScv), #2 (signature-only AUC), #3 (Youden / Platt), #6 (external GEO), #7 (UCEC MSS), #8 (SHAP), #9 (GSEA), #10 (nested CV), #11 (permutation tests), #12 (MC Dropout), #13 (LOCO) ship as Python APIs under pipeline.evaluation, pipeline.stats, pipeline.explain, and are invoked on demand rather than every run.