🆕 Stage 7: v2 Statistical Hardening

After Stage 5 (Candidate Gene Selection), pipeline.stage7_v2_hardening runs as a read-only post-processor over the v1 artifacts. It emits results/v2/<cohort>/ directories with bootstrap CIs (#4), weighted consensus ranking (#5), Wilson-CI dN/dS (#15), and a cross-cohort pp-notation specificity table (#14).
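As a rough illustration of the Wilson-CI step, here is a minimal Python sketch of the Wilson score interval for a binomial proportion. The source does not say how the interval is applied to dN/dS, so the use shown here (bounding the fraction of nonsynonymous mutations observed in a gene) is an assumption, and `wilson_interval` is a hypothetical name, not a function from `pipeline.stage7_v2_hardening`.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    # Clamp to guard against tiny floating-point overshoot at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Assumed usage: 5 nonsynonymous out of 10 coding mutations (placeholder counts).
low, high = wilson_interval(5, 10)
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and remains usable when a gene has only a handful of mutations.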

📋 Pipeline Summary

Stage 1: Data Acquisition

Download TCGA RNA-seq HTSeq counts and MAF mutation files for 5 cancer types via the GDC API, then apply a DESeq2 pre-filter that retains ~13,660 genes.

Stage 2: ML Classification

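Train logistic regression (LR), random forest (RF), and multilayer perceptron (MLP) classifiers with 5-fold cross-validation. Extract a feature-importance signature from each model, then take the union of the top genes across models.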
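The union step can be sketched in a few lines of Python. This is an illustration under assumptions, not the pipeline's code: the function names, the gene labels, and the choice to rank by absolute importance (so that negative LR coefficients still count) are all hypothetical.

```python
from typing import Dict, Set


def top_genes(importances: Dict[str, float], k: int) -> Set[str]:
    """Return the k genes with the largest absolute importance score."""
    ranked = sorted(importances, key=lambda g: abs(importances[g]), reverse=True)
    return set(ranked[:k])


def union_signature(per_model: Dict[str, Dict[str, float]], k: int) -> Set[str]:
    """Union of each model's top-k genes."""
    signature: Set[str] = set()
    for scores in per_model.values():
        signature |= top_genes(scores, k)
    return signature


# Placeholder scores for three models; gene names are made up.
scores = {
    "LR": {"GENE_A": 0.9, "GENE_B": -0.8, "GENE_C": 0.1},
    "RF": {"GENE_A": 0.5, "GENE_C": 0.4, "GENE_B": 0.1},
    "MLP": {"GENE_D": 0.7, "GENE_A": 0.6, "GENE_B": 0.1},
}
signature = union_signature(scores, k=2)
```

A union keeps any gene that at least one model ranks highly, so the signature grows with model disagreement; an intersection would be stricter.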

Stage 3: Evolutionary Filtering

Keep genes with germline dN/dS < 0.3 (purifying selection) and somatic dN/dS ≥ 1.5 with FDR < 0.05 (positive selection). The intersection of the two sets gives the candidates.
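A minimal Python sketch of this two-sided filter, using the thresholds stated above. The `GeneStats` record, the function name, and the example values are hypothetical and only illustrate the set logic.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class GeneStats:
    gene: str
    germline_dnds: float
    somatic_dnds: float
    somatic_fdr: float


def select_candidates(
    stats: Iterable[GeneStats],
    germline_max: float = 0.3,
    somatic_min: float = 1.5,
    fdr_max: float = 0.05,
) -> List[str]:
    """Genes under germline purifying selection AND somatic positive selection."""
    rows = list(stats)
    conserved = {s.gene for s in rows if s.germline_dnds < germline_max}
    selected = {
        s.gene
        for s in rows
        if s.somatic_dnds >= somatic_min and s.somatic_fdr < fdr_max
    }
    return sorted(conserved & selected)


# Placeholder values, not pipeline results.
example = [
    GeneStats("GENE_A", 0.10, 2.0, 0.01),  # passes both filters
    GeneStats("GENE_B", 0.50, 2.5, 0.01),  # fails the germline filter
    GeneStats("GENE_C", 0.20, 1.5, 0.20),  # fails the FDR cut-off
    GeneStats("GENE_D", 0.25, 1.5, 0.04),  # passes: >= 1.5 is inclusive
]
candidates = select_candidates(example)
```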

⚙️ Pipeline Improvements (April 2026)

🎯

FocalLoss

Replaces BCEWithLogitsLoss. Forces the model to focus on hard-to-classify normal samples (α=0.25, γ=2.0).
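To show how the focusing term works, here is a scalar sketch of binary focal loss in plain Python, FL = -α_t (1 - p_t)^γ log(p_t), with the α and γ values quoted above. The pipeline's own loss is presumably a PyTorch module; this version, the function names, and the convention that α weights the target = 1 class are assumptions for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def binary_focal_loss(
    logit: float, target: int, alpha: float = 0.25, gamma: float = 2.0
) -> float:
    """Focal loss for one sample given a raw logit and a 0/1 target."""
    # p_t is the predicted probability of the true class.
    signed = logit if target == 1 else -logit
    log_p_t = log_sigmoid(signed)
    p_t = math.exp(log_p_t)
    # Assumption: alpha weights the target == 1 class, 1 - alpha the other.
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * log_p_t


easy = binary_focal_loss(4.0, 1)   # confident and correct
hard = binary_focal_loss(-4.0, 1)  # confident and wrong
```

With γ = 0 and α = 0.5 the expression reduces to half the usual binary cross-entropy, which is a convenient sanity check. Raising γ shrinks the loss on well-classified samples far more than on misclassified ones.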

⚖️

SMOTE Oversampling

Synthetic minority oversampling for PRAD and BLCA normal class to address class imbalance.

🧬

DESeq2 Pre-filter

Differential expression filter: |log2FC| > 1.5, BH-adjusted p < 0.05. Retains ~13,660 informative genes.
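The thresholds above can be illustrated with a small Python sketch: a Benjamini-Hochberg (BH) adjustment followed by the |log2FC| and adjusted-p cut-offs. DESeq2 itself runs in R and already reports BH-adjusted p-values, so this only mirrors the filtering logic; the function names and input values are hypothetical.

```python
from typing import Dict, List, Sequence


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def prefilter(
    log2fc: Dict[str, float],
    pvalue: Dict[str, float],
    lfc_min: float = 1.5,
    padj_max: float = 0.05,
) -> List[str]:
    """Genes with |log2FC| > lfc_min and BH-adjusted p < padj_max."""
    genes = list(log2fc)
    padj = bh_adjust([pvalue[g] for g in genes])
    return [g for g, q in zip(genes, padj) if abs(log2fc[g]) > lfc_min and q < padj_max]


# Placeholder inputs, not pipeline output.
lfc = {"GENE_A": 2.0, "GENE_B": -1.8, "GENE_C": 0.4, "GENE_D": 1.6}
pv = {"GENE_A": 0.01, "GENE_B": 0.04, "GENE_C": 0.03, "GENE_D": 0.005}
kept = prefilter(lfc, pv)
```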

🚫

Pseudogene Blacklist

Removes processed/unprocessed pseudogenes from all signatures using Ensembl biotype annotations.

🔧

ComBat Correction

Batch correction for PRAD adjacent-normal tissue heterogeneity using ComBat-seq.

📐

Dynamic Architecture

The MLP auto-selects 512→256→128 layers for n > 600 and 256→128 for smaller datasets, with BatchNorm.
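The size rule can be sketched as a small Python helper. Several details are assumptions not stated in the source: that n is the number of samples, that BatchNorm follows every hidden linear layer, and that the network ends in a single output logit. Activations are left out, and the function names are hypothetical.

```python
from typing import List, Tuple


def select_hidden_layers(n_samples: int) -> List[int]:
    """Hidden widths: three layers above 600 samples, two otherwise."""
    return [512, 256, 128] if n_samples > 600 else [256, 128]


def layer_plan(n_features: int, n_samples: int) -> List[Tuple[str, int, int]]:
    """Describe the network as (layer type, input width, output width) tuples."""
    plan: List[Tuple[str, int, int]] = []
    width_in = n_features
    for width_out in select_hidden_layers(n_samples):
        plan.append(("Linear", width_in, width_out))
        # Assumption: BatchNorm is applied after each hidden linear layer.
        plan.append(("BatchNorm1d", width_out, width_out))
        width_in = width_out
    # Assumption: one output logit for a binary tumour/normal classifier.
    plan.append(("Linear", width_in, 1))
    return plan


small = layer_plan(n_features=13660, n_samples=400)
large = layer_plan(n_features=13660, n_samples=900)
```

A plan of plain tuples keeps the sketch free of framework dependencies; each entry maps directly onto a layer constructor in whichever deep-learning library is used.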