A recent publication describes the science and methodology SYNTHEMA uses to produce synthetic hematological data over federated computing frameworks. SYNTHEMA aims to set up a cross-border hub and is developing and validating Artificial Intelligence methods for anonymization and synthetic data generation, specifically for rare hematological diseases. Central to SYNTHEMA’s mission is the generation of reliable, high-quality synthetic data from which virtual patient profiles can be built. Such data is intended to strengthen diagnostic capabilities, support the evaluation of treatment alternatives, and offer predictive insights into outcomes for these rare blood disorders.
Real-world data sets often have classes of roughly equal size. In some applications, however, such as fraud detection or rare disease diagnosis, the classes are imbalanced, which leads to classification errors and high variability in the results. To address this, the paper introduces a new oversampling technique named WSSMOTE, which is based on the watershed transformation. The authors report that this method improves prediction scores on certain real-world datasets.
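To make the idea of oversampling concrete, here is a minimal sketch of classic SMOTE-style interpolation, written with the Python standard library only. It is not WSSMOTE: the watershed-based step described in the paper is not reproduced here, and the function name `smote_like_oversample`, its parameters, and the toy points are assumptions made purely for illustration.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.

    This is a plain SMOTE-style sketch, not the WSSMOTE method of the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation weight in [0, 1)
        # The new point lies on the segment between `base` and `neighbour`.
        synthetic.append(
            tuple(b + t * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D minority-class samples (illustrative values only).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6)
print(len(new_points), "synthetic minority samples generated")
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class. In practice a maintained library implementation would normally be preferred over a hand-written sketch like this one.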
The primary focus of the study is to improve prediction accuracy on an imbalanced dataset of sickle cell disease (SCD) biomarkers. SCD is a severe inherited condition, and vaso-occlusive crises are the main reason for hospitalization. During such hospitalizations, acute chest syndrome (ACS) is the leading cause of death, affecting roughly 20% of hospitalized SCD patients. Two studies (PRESEV1 with 247 patients and PRESEV2 with 393 patients) developed a predictive score for ACS using clinical and biological data. The score achieved a high negative predictive value (NPV) of 98.9% and 94% in the two studies, respectively. However, the positive predictive value (PPV) was lower and inconsistent, at 44.7% and 27.9%, respectively.
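For readers less familiar with these metrics, the short sketch below computes PPV and NPV from confusion-matrix counts using their standard definitions. The counts are hypothetical and are not taken from PRESEV1 or PRESEV2. They only illustrate how a rare positive class can yield a high NPV alongside a much lower PPV.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) from confusion-matrix counts.

    PPV = TP / (TP + FP): share of positive predictions that are correct.
    NPV = TN / (TN + FN): share of negative predictions that are correct.
    """
    if tp + fp == 0 or tn + fn == 0:
        raise ValueError("need at least one positive and one negative prediction")
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical counts for a rare outcome (illustrative, not from the studies):
# 20 true cases out of 250 patients, 50 patients flagged as high risk.
tp, fp, tn, fn = 15, 35, 195, 5
ppv, npv = predictive_values(tp, fp, tn, fn)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")  # PPV = 30.0%, NPV = 97.5%
```

With few true cases, even a modest number of false positives makes up a large share of the flagged patients, which pulls the PPV down while the NPV stays high.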
The paper’s objective is to improve both the accuracy and the consistency of the prediction. Traditional oversampling techniques did not improve the PPV, but WSSMOTE did. With WSSMOTE, the PPV rose from 24.6% to 28.9%, while the NPV remained high at 96.6%. In addition, overfitting on the PPV decreased from 13.3% to 1.2%.
Read the full paper here.