Cancer Transcriptomics ML
Machine learning classification of tumour vs. normal tissue across 5 TCGA cancer types, filtered through two-scale evolutionary analysis to identify candidate cancer-maintaining gene dependencies.
- 163 candidate genes identified
- 15 cross-validated across cancers
- Established drivers confirmed (TP53, PIK3CA, PTEN)
Central hypothesis: ML-predictive genes under both strong germline purifying selection (dN/dS < 0.3) AND somatic positive selection (dN/dS ≥ 1.5, FDR < 0.05) are candidate cancer-maintaining dependencies.
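The two-scale filter in the hypothesis can be written as a simple predicate. The following is a minimal sketch, not the project's actual code: the `GeneStats` record, its field names, and the `GENE_A`–`GENE_D` entries are hypothetical placeholders, and only the three thresholds come from the statement above.

```python
from dataclasses import dataclass

# Thresholds taken from the hypothesis statement.
GERMLINE_MAX = 0.3   # germline dN/dS must be below this (purifying selection)
SOMATIC_MIN = 1.5    # somatic dN/dS must be at least this (positive selection)
FDR_MAX = 0.05       # somatic test must pass this FDR cutoff


@dataclass(frozen=True)
class GeneStats:
    # Hypothetical record layout, for illustration only.
    symbol: str
    ml_predictive: bool
    germline_dnds: float
    somatic_dnds: float
    somatic_fdr: float


def is_candidate(g: GeneStats) -> bool:
    """True if the gene meets all criteria of the central hypothesis."""
    return (
        g.ml_predictive
        and g.germline_dnds < GERMLINE_MAX
        and g.somatic_dnds >= SOMATIC_MIN
        and g.somatic_fdr < FDR_MAX
    )


def select_candidates(genes):
    """Return the symbols of all genes that pass the filter, in input order."""
    return [g.symbol for g in genes if is_candidate(g)]


# Made-up example values, not results from the analysis.
genes = [
    GeneStats("GENE_A", True, 0.12, 2.4, 0.01),   # passes all four criteria
    GeneStats("GENE_B", True, 0.45, 3.0, 0.001),  # germline dN/dS too high
    GeneStats("GENE_C", True, 0.10, 1.5, 0.20),   # somatic FDR too high
    GeneStats("GENE_D", False, 0.05, 2.0, 0.01),  # not ML-predictive
]
print(select_candidates(genes))  # ['GENE_A']
```

Note that the germline cutoff is strict (`<`) while the somatic cutoff is inclusive (`>=`), mirroring the inequality signs in the hypothesis.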
🧬 Cancer Type Overview
Metrics reported for each cancer type: balanced accuracy, specificity, AUC, samples, genes tested.
- BRCA: Breast
- BLCA: Bladder
- PRAD: Prostate
- LUAD: Lung adenocarcinoma
- UCEC: Uterine
⚠️ Limitations & Caveats
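- Somatic dN/dS method: Uses a per-gene binomial exact test, not the dNdScv model, which additionally corrects for trinucleotide mutational context and gene-level variation in background mutation rate. Results should therefore be interpreted as exploratory.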
- UCEC candidate count: Elevated candidate numbers in uterine cancer likely reflect microsatellite-instability-driven hypermutation rather than a proportionally larger set of true dependencies.
- PRAD statistical power: Prostate cancer has the lowest specificity (73.5%), driven by smaller normal-tissue sample size and adjacent-normal heterogeneity.
- Near-perfect AUC values: High AUC scores reflect the intrinsic separability of tumour vs. normal transcriptomes (thousands of DE genes) rather than the specificity of the final gene signatures.
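For reference, the per-gene binomial exact test named in the first caveat could look roughly like the sketch below. It rests on assumptions that the text does not confirm: the null nonsynonymous fraction is taken to be the gene's share of nonsynonymous sites, the test is one-sided for an excess of nonsynonymous mutations, and FDR control uses the Benjamini-Hochberg procedure. All function and parameter names are hypothetical.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def somatic_dnds(nonsyn, syn, nonsyn_sites, syn_sites):
    """Per-site nonsynonymous rate divided by per-site synonymous rate."""
    if syn == 0:
        return float("inf")  # ratio undefined without synonymous mutations
    return (nonsyn / nonsyn_sites) / (syn / syn_sites)


def positive_selection_pvalue(nonsyn, syn, nonsyn_sites, syn_sites):
    """One-sided exact test for an excess of nonsynonymous mutations.

    Assumption: under neutrality each mutation is nonsynonymous with
    probability nonsyn_sites / (nonsyn_sites + syn_sites).
    """
    p_null = nonsyn_sites / (nonsyn_sites + syn_sites)
    return binom_upper_tail(nonsyn, nonsyn + syn, p_null)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up counts for three hypothetical genes:
# (nonsyn mutations, syn mutations, nonsyn sites, syn sites)
counts = [(30, 5, 750, 250), (12, 4, 600, 200), (9, 9, 900, 300)]
pvals = [positive_selection_pvalue(*c) for c in counts]
fdrs = benjamini_hochberg(pvals)
ratios = [somatic_dnds(*c) for c in counts]
```

Because this null ignores mutational context and regional mutation-rate variation, hypermutated tumours can inflate or distort the resulting dN/dS estimates, which is consistent with the exploratory framing in the caveats above.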