Master’s Thesis Presentation • Machine Learning • Predicting Cardiovascular Events or Death in People with Dysglycemia Using Machine Learning Methods

Monday, August 11, 2025 10:00 am - 11:00 am EDT (GMT -04:00)

Please note: This master’s thesis presentation will take place online.

Yuanhong Zhang, Master’s candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Anita Layton

Cox regression is commonly used to analyze time-to-event outcomes for patients in tabular medical data. A hazard ratio can then be calculated to show how much higher the risk of an event is for a patient in one group than for a patient in another. However, both the hazard ratio and Cox regression rely on the proportional hazards assumption, which is not guaranteed to hold.
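
For readers unfamiliar with the assumption, the standard textbook form of the Cox model is sketched below. The symbols (baseline hazard h_0, coefficient vector beta, covariate vectors x_a and x_b) are conventional notation and are not taken from the thesis itself.

```latex
% Cox proportional hazards model for a patient with covariate vector x
h(t \mid x) = h_0(t)\, \exp\!\left(\beta^{\top} x\right)

% Hazard ratio between patients a and b: the baseline h_0(t) cancels
\mathrm{HR}_{a,b} = \frac{h(t \mid x_a)}{h(t \mid x_b)}
                  = \exp\!\left(\beta^{\top} (x_a - x_b)\right)
```

Because the ratio does not depend on t, the model assumes that the two patients' hazards stay proportional over the whole follow-up period. This is the proportional hazards assumption, and it may not hold in real data.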

In this thesis, we investigate the use of machine learning models to predict patients’ outcomes and identify key factors that may influence those outcomes. We focused on the ORIGIN Trial dataset because it has undergone extensive analysis using Cox regression, allowing us to compare the machine learning results with previous findings. We analyzed three outcomes: major adverse cardiovascular events (MACE), an expanded composite outcome (COPRIM2), and all-cause death (ALLDTH). The machine learning models we used are Neural Network (NN), Random Forest (RF), and Gradient Boosted Trees (GBT), which were trained with nested cross-validation to tune their hyperparameters.
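
As a rough illustration of nested cross-validation in general, the sketch below tunes a hyperparameter in an inner loop and estimates performance in an outer loop. It is not the thesis's pipeline: the data are synthetic, the toy k-nearest-neighbour scorer is a hypothetical stand-in for NN, RF, and GBT, and all function names are invented for this example.

```python
import random

def auc(y_true, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(rows, y, k, seed):
    """Split row indices into k folds, dealing each class out evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        members = [r for r in rows if y[r] == label]
        rng.shuffle(members)
        for j, r in enumerate(members):
            folds[j % k].append(r)
    return folds

def fit_knn(X, y, train_rows, k):
    """Toy stand-in model: score = share of positives among the k nearest training rows."""
    def predict(query_rows):
        scores = []
        for q in query_rows:
            nearest = sorted(
                train_rows,
                key=lambda r: sum((a - b) ** 2 for a, b in zip(X[r], X[q])),
            )[:k]
            scores.append(sum(y[r] for r in nearest) / len(nearest))
        return scores
    return predict

def fold_auc(X, y, train, test, k):
    """Fit on the training rows and return the AUC on the test rows."""
    model = fit_knn(X, y, train, k)
    return auc([y[r] for r in test], model(test))

def inner_cv_auc(X, y, rows, k, n_folds, seed):
    """Mean AUC of one hyperparameter value, using only the given rows."""
    scores = []
    for test in stratified_folds(rows, y, n_folds, seed):
        held = set(test)
        train = [r for r in rows if r not in held]
        scores.append(fold_auc(X, y, train, test, k))
    return sum(scores) / len(scores)

def nested_cv(X, y, grid, outer_k=4, inner_k=3, seed=0):
    """Outer loop estimates performance; inner loop picks the hyperparameter."""
    rows = list(range(len(y)))
    results = []
    for f, test in enumerate(stratified_folds(rows, y, outer_k, seed)):
        held = set(test)
        train = [r for r in rows if r not in held]
        # The outer test fold is never seen while choosing k.
        best_k = max(grid, key=lambda k: inner_cv_auc(X, y, train, k, inner_k, seed + 1 + f))
        results.append((best_k, fold_auc(X, y, train, test, best_k)))
    return results

# Synthetic two-feature data (an assumption): only feature 0 carries signal.
rng = random.Random(42)
y = [i % 2 for i in range(120)]
X = [[rng.gauss(1.5 * label, 1.0), rng.gauss(0.0, 1.0)] for label in y]
results = nested_cv(X, y, grid=[1, 5, 15])
```

Each entry of `results` pairs the hyperparameter chosen by the inner loop with the AUC on an outer fold that played no part in that choice, which is what keeps the performance estimate from being optimistically biased by tuning.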

When testing the trained models for all three outcomes, we found that the machine learning models had higher Area-Under-the-Curve scores (AUCs) than Cox regression (0.91-0.95 vs 0.63-0.65), and that Random Forest and Gradient Boosted Trees had excellent recall scores (0.80-0.88). We then used SHAP values, mean decrease in AUC, and partial dependence plots (PDPs) to further examine variable importance for RF and GBT. For MACE and COPRIM2, prior cardiovascular events (priorcv), cancer, and blood lipid measures are the most important variables, while for ALLDTH, cancer and kidney-function-related measures are the most important variables. PDPs are harder to analyze than hazard ratios because they make no assumptions and impose fewer restrictions, but they are useful for estimating the non-linear relation between an explanatory variable and the average probability of the outcome occurring for patients in the dataset.
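
To make two of these tools concrete, the sketch below computes a permutation-based mean decrease in AUC and a partial dependence curve for any scoring function. It is a minimal illustration under stated assumptions: the data are synthetic, `toy_predict` is a hypothetical fixed risk function rather than a trained RF or GBT, and the helper names are invented here.

```python
import math
import random

def auc(y_true, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def with_column(X, feature, values):
    """Copy of X with one feature column replaced, row by row."""
    return [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, values)]

def mean_decrease_in_auc(predict, X, y, feature, n_repeats=20, seed=0):
    """Average drop in AUC after shuffling one feature column."""
    rng = random.Random(seed)
    base = auc(y, predict(X))
    drops = []
    for _ in range(n_repeats):
        col = [row[feature] for row in X]
        rng.shuffle(col)
        drops.append(base - auc(y, predict(with_column(X, feature, col))))
    return sum(drops) / n_repeats

def partial_dependence(predict, X, feature, grid):
    """Average prediction over the dataset with `feature` forced to each grid value."""
    curve = []
    for v in grid:
        preds = predict(with_column(X, feature, [v] * len(X)))
        curve.append(sum(preds) / len(preds))
    return curve

def toy_predict(X):
    """Hypothetical risk model: U-shaped in feature 0, ignores feature 1."""
    return [1.0 / (1.0 + math.exp(-3.0 * (row[0] ** 2 - 1.0))) for row in X]

# Synthetic data (an assumption): feature 0 on an even grid, feature 1 pure noise.
rng = random.Random(1)
X = [[-2.0 + 4.0 * i / 199, rng.gauss(0.0, 1.0)] for i in range(200)]
y = [1 if row[0] ** 2 > 1.0 else 0 for row in X]

importance = [mean_decrease_in_auc(toy_predict, X, y, f) for f in (0, 1)]
curve = partial_dependence(toy_predict, X, 0, grid=[-2.0, 0.0, 2.0])
```

In this toy setup, shuffling the ignored feature leaves the AUC unchanged, while shuffling the informative one lowers it. The partial dependence curve is high at both ends and low in the middle, the kind of non-linear relation that a single hazard ratio cannot express.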


Attend this master’s thesis presentation virtually on MS Teams.