Pipeline Architecture
Interactive visualisation of the complete analysis pipeline, from data acquisition through ML classification to evolutionary candidate identification.
Pipeline Summary
Download TCGA RNA-seq HTSeq counts and MAF mutation files for 5 cancer types via GDC API. DESeq2 pre-filter to ~13,660 genes.
Train LR, RF, and MLP classifiers with 5-fold CV. Extract feature importance signatures. Union top genes across models.
Germline dN/dS < 0.3 (purifying selection) + Somatic dN/dS ≥ 1.5 with FDR < 0.05 (positive selection). Intersection = candidates.
Pipeline Improvements (April 2026)
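The last two summary steps can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline's actual code: the function names, the `top_k` parameter, the ranking by absolute importance, and the toy gene names and values are all assumptions. The source also does not say whether candidates are restricted to signature genes, so the two functions are kept independent here.

```python
def union_top_genes(importances_by_model, top_k):
    """Union of each model's top_k genes, ranked by absolute importance.

    importances_by_model: {model_name: {gene: importance}} (hypothetical layout).
    """
    selected = set()
    for importances in importances_by_model.values():
        ranked = sorted(importances, key=lambda g: abs(importances[g]), reverse=True)
        selected.update(ranked[:top_k])
    return selected


def select_candidates(germline_dnds, somatic_stats,
                      germline_max=0.3, somatic_min=1.5, fdr_max=0.05):
    """Genes under germline purifying selection AND somatic positive selection.

    germline_dnds: {gene: dN/dS}; somatic_stats: {gene: (dN/dS, FDR)}.
    """
    purifying = {g for g, dnds in germline_dnds.items() if dnds < germline_max}
    positive = {g for g, (dnds, fdr) in somatic_stats.items()
                if dnds >= somatic_min and fdr < fdr_max}
    return purifying & positive


# Hypothetical toy inputs, for illustration only.
importances = {
    "LR": {"GENE_A": 0.9, "GENE_B": -0.7, "GENE_C": 0.1},
    "RF": {"GENE_A": 0.5, "GENE_B": 0.05, "GENE_C": 0.4},
    "MLP": {"GENE_A": 0.2, "GENE_B": 0.3, "GENE_C": 0.8},
}
signature = union_top_genes(importances, top_k=2)

germline = {"GENE_A": 0.12, "GENE_B": 0.45, "GENE_C": 0.20}
somatic = {"GENE_A": (2.1, 0.01), "GENE_B": (3.0, 0.001), "GENE_C": (1.8, 0.20)}
candidates = select_candidates(germline, somatic)
print(sorted(signature), sorted(candidates))
```

In the toy data, GENE_B fails the germline threshold and GENE_C fails the FDR threshold, so only GENE_A survives the intersection.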
FocalLoss
Replaces BCEWithLogitsLoss. Forces model to focus on hard-to-classify normal samples (α=0.25, γ=2.0).
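To show how the two parameters act, here is a per-sample sketch of the standard binary focal loss in plain Python. It assumes the usual convention in which α weights the positive class (label 1) and 1 − α weights the negative class. Which class the pipeline treats as positive, and how its own FocalLoss module is written, is not stated in the source, so the function below is illustrative only.

```python
import math


def focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t).

    logit: raw model output; target: 0 or 1.
    Not numerically hardened: very large |logit| values can overflow math.exp.
    """
    p = 1.0 / (1.0 + math.exp(-logit))            # sigmoid probability of class 1
    p_t = p if target == 1 else 1.0 - p           # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct sample contributes far less than a confidently wrong one.
easy = focal_loss(4.0, 1)
hard = focal_loss(-4.0, 1)
print(easy < hard)
```

With γ = 0 the modulating factor disappears and the expression reduces to α-weighted binary cross-entropy, which is the sense in which focal loss generalises the loss it replaces.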
SMOTE Oversampling
Synthetic minority oversampling for PRAD and BLCA normal class to address class imbalance.
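The idea behind SMOTE is to create new minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below is a hand-rolled illustration of that idea in plain Python, not the implementation the pipeline uses (the source does not name one). The function name, `k`, the seed, and the toy points are assumptions.

```python
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    minority: list of equal-length numeric tuples (needs at least 2 points).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])        # one of the k nearest neighbours
        lam = rng.random()                        # interpolation weight in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Hypothetical 2-D minority class (real inputs would be expression vectors).
normals = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(normals, n_new=5)
print(len(new_points))
```

Because every synthetic point lies on a segment between two real minority samples, oversampling of this kind should be applied inside each training fold only, so that no interpolated sample leaks information from a validation fold.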
DESeq2 Pre-filter
Differential expression filter: |log2FC| > 1.5, BH-adjusted p < 0.05. Retains ~13,660 informative genes.
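Applied to a DESeq2 results table, the two thresholds amount to a simple row filter. The sketch below assumes results exported as rows with DESeq2's `log2FoldChange` and `padj` columns. The function name and the toy rows are hypothetical, and treating a missing `padj` as a failure is an assumption about how NA values are handled.

```python
import math


def passes_de_filter(log2fc, padj, lfc_min=1.5, padj_max=0.05):
    """True if |log2FC| > lfc_min and BH-adjusted p < padj_max."""
    if padj is None or math.isnan(padj):
        return False                  # DESeq2 can report padj as NA; drop those genes
    return abs(log2fc) > lfc_min and padj < padj_max


# Hypothetical result rows, for illustration only.
rows = [
    {"gene": "GENE_A", "log2FoldChange": 2.3, "padj": 0.001},
    {"gene": "GENE_B", "log2FoldChange": -1.9, "padj": 0.03},
    {"gene": "GENE_C", "log2FoldChange": 0.4, "padj": 0.0001},
    {"gene": "GENE_D", "log2FoldChange": 3.1, "padj": float("nan")},
]
kept = [r["gene"] for r in rows if passes_de_filter(r["log2FoldChange"], r["padj"])]
print(kept)
```

The absolute value keeps both up- and down-regulated genes, so GENE_B passes despite its negative fold change.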
Pseudogene Blacklist
Removes processed/unprocessed pseudogenes from all signatures using Ensembl biotype annotations.
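A blacklist of this kind can be expressed as a lookup against a gene-to-biotype table. The sketch below assumes the annotations are already loaded into a dictionary. The function name, the gene names, and the choice to keep genes with no annotation are hypothetical. Only the two biotypes named above are blacklisted here, although Ensembl defines further pseudogene biotypes that a stricter filter might include.

```python
# Ensembl biotype labels for the two categories named in the text.
PSEUDOGENE_BIOTYPES = {"processed_pseudogene", "unprocessed_pseudogene"}


def drop_pseudogenes(signature, biotype_by_gene):
    """Return signature genes whose biotype is not blacklisted, preserving order.

    Genes missing from biotype_by_gene are kept (an assumption of this sketch).
    """
    return [g for g in signature
            if biotype_by_gene.get(g) not in PSEUDOGENE_BIOTYPES]


# Hypothetical annotations, for illustration only.
biotypes = {
    "GENE_A": "protein_coding",
    "GENE_B": "processed_pseudogene",
    "GENE_C": "lncRNA",
    "GENE_D": "unprocessed_pseudogene",
}
cleaned = drop_pseudogenes(["GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"], biotypes)
print(cleaned)
```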
ComBat Correction
Batch correction for PRAD adjacent-normal tissue heterogeneity using ComBat-seq.
Dynamic Architecture
MLP auto-selects hidden layers 512→256→128 for n > 600 and 256→128 for smaller datasets, with BatchNorm in both cases.
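The selection rule can be written as a small helper. This sketch assumes n is the number of training samples and that the layer widths map to consecutive fully connected layers. The function names, and `n_genes` and `n_outputs` in the example, are hypothetical, and the actual model construction (including where BatchNorm is placed) is not shown in the source.

```python
def select_hidden_layers(n_samples, threshold=600):
    """Hidden-layer widths: deeper network only when n_samples exceeds threshold."""
    return (512, 256, 128) if n_samples > threshold else (256, 128)


def layer_shapes(n_inputs, hidden, n_outputs=1):
    """(in_features, out_features) pairs for each fully connected layer."""
    sizes = (n_inputs, *hidden, n_outputs)
    return list(zip(sizes[:-1], sizes[1:]))


# Hypothetical example: 13,660 input genes, 450 training samples.
n_genes = 13660
hidden = select_hidden_layers(450)
print(layer_shapes(n_genes, hidden))
```

A dataset with exactly 600 samples falls into the smaller configuration, since the rule is a strict n > 600.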