📋 Pipeline Summary

Stage 1: Data Acquisition

Download TCGA RNA-seq HTSeq count files and MAF mutation files for 5 cancer types via the GDC API. Pre-filter with DESeq2 to ~13,660 genes.
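As a rough illustration of the acquisition step, the sketch below builds the JSON body for a file search against the GDC `/files` endpoint. It only constructs the request and does not send it. The helper name `build_gdc_file_query`, the returned field list, and the project IDs are assumptions for illustration; the summary does not name the 5 cancer types, and the workflow label available in GDC depends on the data release.

```python
import json

# Public GDC endpoint for file searches (the request is built here, not sent).
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_gdc_file_query(project_ids, data_type, workflow_type=None, size=1000):
    """Build a POST body for the GDC /files endpoint (hypothetical helper)."""
    clauses = [
        # Restrict to the requested TCGA projects.
        {"op": "in",
         "content": {"field": "cases.project.project_id",
                     "value": list(project_ids)}},
        # Restrict to one data type, e.g. expression counts or masked MAFs.
        {"op": "in",
         "content": {"field": "files.data_type", "value": [data_type]}},
    ]
    if workflow_type is not None:
        # Optional: pin the analysis workflow that produced the file.
        clauses.append(
            {"op": "in",
             "content": {"field": "files.analysis.workflow_type",
                         "value": [workflow_type]}}
        )
    return {
        "filters": {"op": "and", "content": clauses},
        "fields": "file_id,file_name,cases.project.project_id",
        "format": "JSON",
        "size": str(size),
    }


# Placeholder project IDs: assumptions, not the pipeline's actual cancer types.
example_projects = ["TCGA-BRCA", "TCGA-LUAD"]

counts_query = build_gdc_file_query(
    example_projects, "Gene Expression Quantification", "HTSeq - Counts"
)
maf_query = build_gdc_file_query(example_projects, "Masked Somatic Mutation")

print(json.dumps(counts_query, indent=2))
```

Either body could then be sent as JSON in a POST to `GDC_FILES_ENDPOINT`, and the returned file IDs passed to the GDC data download endpoint.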

Stage 2: ML Classification

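Train logistic regression (LR), random forest (RF), and multilayer perceptron (MLP) classifiers with 5-fold cross-validation. Extract a feature-importance signature from each model, then take the union of the top genes across models.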
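The union step can be sketched in a few lines, assuming each trained model has already been reduced to a mapping from gene to importance score. The function names, the cutoff `k`, the toy scores, and the choice to rank by absolute value (so that negative LR coefficients still count) are all assumptions; the summary does not state how importances are derived or how many genes are kept per model.

```python
def top_k_genes(importance, k):
    """Return the k genes with the largest absolute importance score."""
    ranked = sorted(importance, key=lambda gene: abs(importance[gene]), reverse=True)
    return ranked[:k]


def union_top_genes(importance_by_model, k):
    """Union of each model's top-k genes (hypothetical helper)."""
    union = set()
    for scores in importance_by_model.values():
        union.update(top_k_genes(scores, k))
    return union


# Toy scores for illustration only; real values would come from the fitted models.
toy_importances = {
    "LR": {"GENE_A": -2.1, "GENE_B": 0.4, "GENE_C": 1.3},
    "RF": {"GENE_A": 0.10, "GENE_B": 0.55, "GENE_C": 0.35},
    "MLP": {"GENE_A": 0.9, "GENE_B": 0.1, "GENE_C": 0.2},
}

signature = union_top_genes(toy_importances, k=1)
print(sorted(signature))
```

Taking a union rather than an intersection keeps any gene that at least one model ranks highly, so the combined signature is never smaller than the largest single-model list.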

Stage 3: Evolutionary Filtering

Keep genes with germline dN/dS < 0.3 (purifying selection) and somatic dN/dS ≥ 1.5 at FDR < 0.05 (positive selection). The intersection of the two sets gives the candidates.
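The two thresholds and their intersection can be written out directly. This is a minimal sketch, assuming per-gene dN/dS estimates and FDR-adjusted values have already been computed upstream; the `GeneStats` record, the function name, and the toy values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneStats:
    """Per-gene inputs to the evolutionary filter (hypothetical record type)."""
    gene: str
    germline_dnds: float
    somatic_dnds: float
    somatic_fdr: float  # FDR-adjusted value for the somatic dN/dS test


def select_candidates(stats, germline_max=0.3, somatic_min=1.5, fdr_max=0.05):
    """Intersect the purifying-selection and positive-selection gene sets."""
    purifying = {s.gene for s in stats if s.germline_dnds < germline_max}
    positive = {
        s.gene
        for s in stats
        if s.somatic_dnds >= somatic_min and s.somatic_fdr < fdr_max
    }
    return purifying & positive


# Toy values for illustration only.
toy_stats = [
    GeneStats("GENE_A", 0.10, 2.0, 0.010),   # passes both filters
    GeneStats("GENE_B", 0.50, 2.5, 0.001),   # fails germline threshold
    GeneStats("GENE_C", 0.20, 1.5, 0.200),   # fails FDR threshold
    GeneStats("GENE_D", 0.29, 1.5, 0.049),   # passes; somatic bound is inclusive
]

candidates = select_candidates(toy_stats)
print(sorted(candidates))
```

Note the asymmetry in the comparisons: the germline and FDR cutoffs are strict (`<`), while the somatic cutoff is inclusive (`>=`), matching the thresholds as stated.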