Paper Title
PREDICTIVE MODELLING OF TUBERCULOSIS RISK AND DRUG RESISTANCE USING HOST–PATHOGEN GENETIC SIGNATURES AND MACHINE LEARNING REGRESSION
Abstract
Tuberculosis (TB), caused by Mycobacterium tuberculosis (MTB), remains the world's second deadliest infectious disease, and multidrug-resistant (MDR) and extensively drug-resistant (XDR) strains pose an accelerating public-health threat. Current molecular diagnostics yield binary (Resistant/Susceptible) calls that fail to capture the continuum of resistance severity. We present a regression-based machine learning framework that predicts a continuous Resistance Burden Score (RBS), defined as a weighted percentage of the drugs to which an isolate is resistant, directly from binary gene-mutation profiles. After rigorous deduplication, we consolidate a harmonised dataset of 4,558 unique MTB isolates from five independent cohorts (CRyPTIC, PATRIC, iValiD-TB, TB-Portal, Nikshay/Karnataka), spanning 22 resistance genes and 15 anti-TB drugs. We benchmark six regression models: Multiple Linear Regression (MLR), Ridge, LASSO, ElasticNet, Random Forest, and Gradient Boosting. Under 5-fold cross-validation, the Gradient Boosting Regressor performs best (R² = 0.9328, MAE = 2.744, RMSE = 3.892), while MLR delivers near-equivalent accuracy (R² = 0.9204) with full coefficient transparency. Gene coefficient analysis consistently ranks gyrA, rrl, rpoB, and katG as the highest-impact resistance determinants. The framework provides a clinically actionable risk score that stratifies patients beyond the binary dichotomy, supporting personalised treatment selection.
Keywords - Tuberculosis; Drug Resistance; Whole Genome Sequencing; Resistance Burden Score; Multiple Linear Regression; Random Forest; Gradient Boosting; CRyPTIC; PATRIC; iValiD-TB; rpoB; katG; gyrA
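The Resistance Burden Score described in the abstract can be illustrated with a minimal sketch. The abstract defines the RBS only as a weighted percentage of the drugs to which an isolate is resistant and does not state the weighting scheme, so the function name `resistance_burden_score`, the per-drug weights, and the example isolate below are all hypothetical and serve only to show how such a score might be computed from binary calls.

```python
from typing import Mapping, Optional


def resistance_burden_score(
    calls: Mapping[str, int],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Return a weighted percentage (0-100) of drugs with a resistant call.

    calls   : drug name -> 1 (resistant) or 0 (susceptible).
    weights : optional drug name -> positive weight; drugs not listed
              default to 1.0. ASSUMPTION: the paper's actual weights
              are not given in the abstract.
    """
    if not calls:
        raise ValueError("at least one drug call is required")

    total_weight = 0.0
    resistant_weight = 0.0
    for drug, call in calls.items():
        if call not in (0, 1):
            raise ValueError(f"call for {drug!r} must be 0 or 1")
        weight = 1.0 if weights is None else weights.get(drug, 1.0)
        if weight <= 0:
            raise ValueError(f"weight for {drug!r} must be positive")
        total_weight += weight
        resistant_weight += weight * call

    return 100.0 * resistant_weight / total_weight


# Hypothetical isolate: resistant to two of four tested drugs.
example_calls = {
    "rifampicin": 1,
    "isoniazid": 1,
    "ethambutol": 0,
    "moxifloxacin": 0,
}

# Uniform weights: plain percentage of resistant drugs.
uniform_rbs = resistance_burden_score(example_calls)

# Illustrative weights that count the two first-line drugs double.
example_weights = {"rifampicin": 2.0, "isoniazid": 2.0}
weighted_rbs = resistance_burden_score(example_calls, example_weights)

print(uniform_rbs, weighted_rbs)
```

With uniform weights the score reduces to the plain percentage of resistant drugs. Any non-uniform weighting shifts the score toward the drugs judged more important, which is why the choice of weights would need to be documented alongside reported results.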