Publications - SYNTHEMA | AI-Driven Synthetic Health Data for Haematology

Publications

Synthetic Histopathological Images Generation with Artificial Intelligence to Accelerate Research and Improve Clinical Outcomes in Hematology

This study shows how artificial intelligence can generate realistic synthetic bone-marrow images to support research and personalised care in myeloid neoplasms. Using a fine-tuned Stable Diffusion model guided by a haematology-specific language model, high-quality synthetic histopathology images were created that closely matched real samples. As demonstrated in the validation results, these synthetic images improved disease-classification accuracy and strengthened survival-prediction models when used alongside real data. This approach offers a safe and effective way to expand datasets, facilitate data sharing, and accelerate precision-medicine tools in haematology.

Clinical Text Reports to Stratify Patients Affected with Myeloid Neoplasms Using Natural Language Processing

This study demonstrates how natural language processing can unlock clinically meaningful information from unstructured haematology reports. Using a domain-adapted BERT model (HematoBERT), the GenoMed4All and Synthema consortia analysed clinical text from patients with myelodysplastic syndromes, myeloproliferative neoplasms and acute myeloid leukaemia. Unsupervised clustering of text embeddings identified seven patient groups reflecting known diagnostic and genotypic–phenotypic associations, including distinctions between MDS subtypes, MPN entities and AML categories. Survival analyses showed that clusters derived solely from clinical reports achieved prognostic separation comparable to models based on structured clinical and genomic data. HematoBERT outperformed generic language models, confirming the value of domain-specific adaptation. These findings highlight clinical text as an early, information-rich data layer that can support more precise disease stratification within multimodal personalised medicine frameworks.

Generation of Multimodal Longitudinal Synthetic Data By Artificial Intelligence to Improve Personalized Medicine in Hematology

This study presents an artificial intelligence–driven framework for generating high-fidelity multimodal synthetic data to accelerate personalised medicine in myeloid neoplasms. Using generative models—including conditional GANs, Tabular-VAEs, Tabular-GPT, and Stable Diffusion—the GenoMed4All and Synthema consortia produced synthetic clinical, genomic, cytogenetic, transcriptomic, and bone-marrow image data that closely replicated real-world datasets from MDS and AML patients. Validation through a dedicated Synthetic Validation Framework demonstrated high statistical, biological, and clinical fidelity across all data layers, with strong preservation of longitudinal survival patterns and low privacy risk. Models trained on hybrid real-plus-synthetic datasets achieved performance comparable to those trained exclusively on real data, and in some cases improved classification and prognostic accuracy. The resulting JUNO platform enables clinicians to generate and explore synthetic patient cohorts, offering a privacy-compliant tool to support research, model development, and future clinical trial design.

Data-driven, harmonised classification system for myelodysplastic syndromes: a consensus paper from the International Consortium for Myelodysplastic Syndromes

This study applies a data-driven approach to harmonise the 2022 WHO and International Consensus Classification systems for myelodysplastic syndromes (MDS), addressing inconsistencies that hinder their clinical adoption. Using genomic clustering and expert consensus (via a modified Delphi process), nine biologically distinct MDS groups were identified, led by a cluster defined by biallelic TP53 inactivation. Subsequent clusters were characterised by isolated del(5q) and SF3B1 mutations, with additional rules established to refine label definitions. Morphologically defined MDS subtypes showed substantial genomic heterogeneity, indicating limited alignment between traditional criteria (eg, dysplasia patterns, blast percentages) and underlying biology. An exploration of the continuum between higher-blast MDS and acute myeloid leukaemia revealed only partial genetic overlap. The final consensus recognised MDS with low blasts (<5%) and MDS with increased blasts (≥5%) as discrete clinical entities. Overall, the harmonised framework enhances diagnostic precision and supports more consistent clinical decision-making in real-world practice.

Combining Gene Mutation with Transcriptomic Data Improves Outcome Prediction in Myelodysplastic Syndromes

This study investigates whether combining genomic and transcriptomic data can improve outcome prediction in myelodysplastic syndromes. Using diagnostic samples from 389 patients, the GenoMed4All and Synthema consortia integrated somatic mutations, cytogenetics, bulk RNA-sequencing of CD34⁺ cells, and clinical variables into a penalised Cox model. The combined approach achieved a concordance index of 0.83 for overall survival, substantially outperforming established prognostic systems such as IPSS-R and IPSS-M. Analysis of explained variance showed that transcriptomic features contributed the largest share (40%) to survival prediction. These findings demonstrate that gene expression data provide significant additional prognostic value and support the development of integrated molecular tools for personalised risk assessment in MDS.
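
For readers unfamiliar with the concordance index quoted above, the following is a minimal sketch of Harrell's C for right-censored survival data. It is an illustration only, not the study's code: the function name `concordance_index` and the toy inputs are assumptions, and the published analysis may handle ties and censoring differently.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable patient pairs that the model ranks correctly.

    times       -- observed follow-up time per patient
    events      -- 1 if the event (e.g. death) was observed, 0 if censored
    risk_scores -- model output; a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter follow-up.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the shorter
        # follow-up ended in an observed event (not censoring).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier event had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the third one censored.
c = concordance_index([5, 10, 15, 20], [1, 1, 0, 1], [0.9, 0.6, 0.7, 0.1])
print(round(c, 2))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported 0.83 should be read.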

Artificial Intelligence-Powered Digital Pathology to Improve Diagnosis and Personalized Prognostic Assessment in Patient with Myeloid Neoplasms

This study presents an artificial intelligence–driven digital pathology approach to enhance diagnosis and personalised prognostic assessment in myeloid neoplasms. Using whole-slide bone marrow images from 1,167 patients, models developed within the GenoMed4All and Synthema consortia extracted high-dimensional morphological features across multiple staining types. These features enabled highly accurate diagnostic classification and prediction of key genomic mutations, and—when integrated with clinical and molecular data—substantially improved risk stratification for overall and leukaemia-free survival. The work demonstrates that AI-powered digital pathology can capture biologically meaningful information and significantly strengthen precision medicine efforts in myeloid neoplasms.

An Artificial Intelligence-Based Federated Learning Platform to Boost Precision Medicine in Rare Hematological Diseases: An Initiative By GenoMed4all and Synthema Consortia

This study introduces a Federated Learning platform developed by the GenoMed4All and Synthema consortia to support precision medicine in rare haematological diseases without sharing sensitive patient data. Using myelodysplastic syndromes as a case study, the platform enables multiple centres to train shared AI models on local clinical and genomic datasets, achieving strong predictive performance even with incomplete data. Fully GDPR-compliant, it will be deployed across the EuroBloodNET network and extended to include medical imaging, offering a secure and scalable solution for advancing personalised care in haematology.
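
The core idea of training a shared model without moving patient data can be sketched with sample-size-weighted parameter averaging, commonly known as federated averaging. This is an assumption for illustration: the summary does not state which aggregation rule the platform uses, and `federated_average` and the numbers below are hypothetical.

```python
def federated_average(local_weights, sample_counts):
    """Combine per-centre model parameters into one shared parameter vector.

    local_weights -- one list of parameters per centre, all of equal length
    sample_counts -- number of local training samples at each centre
    Only parameters leave each centre; the patient records stay local.
    """
    total = sum(sample_counts)
    if total <= 0:
        raise ValueError("sample_counts must sum to a positive number")
    n_params = len(local_weights[0])
    # Each parameter is the mean of the local values, weighted by centre size.
    return [
        sum(w[k] * n for w, n in zip(local_weights, sample_counts)) / total
        for k in range(n_params)
    ]


# Hypothetical round with two centres of 100 and 300 patients.
shared = federated_average([[1.0, 2.0], [3.0, 6.0]], [100, 300])
print(shared)  # [2.5, 5.0]
```

In a full training loop, the shared parameters would be sent back to every centre and the local-train-then-average step repeated over several rounds.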

A Comprehensive, Artificial Intelligence, Digital Twin Platform Based on Multimodal Real-World Data Integration for Personalized Medicine in Hematology

This study introduces GEMINI, an advanced artificial intelligence–driven Digital Twin platform designed to support personalised medicine in haematology by integrating large-scale multimodal real-world data from more than 22,000 patients with myelodysplastic syndromes (MDS). Developed using privacy-preserving federated learning and synthetic data technologies, GEMINI consolidates clinical, genomic, imaging, and patient-reported information into a comprehensive decision-support tool. The platform provides individualised predictions on survival, risk of leukaemic evolution, and treatment response, and simulates disease trajectories and quality-of-life outcomes through an interactive interface. By enabling clinicians and researchers to explore high-fidelity patient simulations without the need for data sharing, GEMINI demonstrates the potential of Digital Twins to advance precision medicine in haematology.

Synthetic Data in Healthcare

Synthetic data is changing how healthcare data can be used, shared and protected. This white paper highlights how it helps address long-standing issues in digital health, from data scarcity and bias to privacy and regulatory complexity. It shows that synthetic data can make AI models more reliable by balancing underrepresented data, enable new approaches to clinical trials through virtual populations, and support GDPR-compliant collaboration across borders. At the same time, the report points out the need for common benchmarks, legal clarity and trustworthy infrastructure to scale synthetic data safely across health systems. These insights outline a clear direction for Europe’s digital health future: data that is secure, inclusive and ready for innovation.

Synthetic Tabular Data Generation Under Horizontal Federated Learning Environments in Acute Myeloid Leukemia: Case-Based Simulation Study

This study evaluates the combination of synthetic data generation and federated learning in the context of acute myeloid leukemia, a rare hematological disease. Using two state-of-the-art generative models across various data distribution scenarios, the research shows that horizontal federation leads to a loss in data fidelity while maintaining privacy. Despite this trade-off, increasing the number of nodes does not significantly worsen performance, making the approach promising for privacy-preserving data generation in biomedical research.

An improved tabular data generator with VAE-GMM integration

This paper introduces a new approach for generating synthetic tabular data by combining Variational Autoencoders with a Bayesian Gaussian Mixture model. The method improves representation of complex data distributions, handling both continuous and discrete features more effectively than existing models like CTGAN and TVAE. Validation on real-world datasets, including medical data, shows significant performance gains, highlighting its potential for applications in healthcare and beyond.

Synthetic tabular data validation: A divergence-based approach

This paper proposes a systematic framework to validate synthetic tabular data using divergence-based metrics. By distinguishing between data fidelity and data utility, the approach allows for more reliable evaluation of synthetic datasets. The study demonstrates how these measures can guide the selection of generative models and ensure synthetic data are both realistic and useful for downstream tasks.
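
As a rough illustration of what a divergence-based fidelity check measures, the sketch below computes the Jensen-Shannon divergence between the empirical distributions of one categorical column in a real and a synthetic sample. It is an assumption-laden toy: the paper's own divergence estimators may differ, and `js_divergence` and the example values are hypothetical.

```python
import math
from collections import Counter


def js_divergence(real, synthetic):
    """Jensen-Shannon divergence (log base 2, range 0 to 1) of two categorical samples."""
    p_counts, q_counts = Counter(real), Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    jsd = 0.0
    for value in set(p_counts) | set(q_counts):
        p = p_counts[value] / n_p      # empirical probability in the real sample
        q = q_counts[value] / n_q      # empirical probability in the synthetic sample
        m = 0.5 * (p + q)              # mixture distribution
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd


# Hypothetical column of diagnosis labels.
real_col = ["MDS", "MDS", "AML", "AML"]
print(js_divergence(real_col, ["MDS", "MDS", "AML", "AML"]))  # 0.0, identical distributions
print(js_divergence(real_col, ["MPN", "MPN", "MPN", "MPN"]))  # 1.0, no overlap
```

Values near 0 indicate that the synthetic column reproduces the real distribution, while values near 1 indicate almost no overlap. A per-column score like this speaks to fidelity only; utility would be assessed separately on a downstream task.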

Propensity Weighted federated learning for treatment effect estimation in distributed imbalanced environments

This paper investigates whether commonly used metrics for evaluating synthetic data actually reflect real-world model performance. By conducting a large-scale empirical study across various datasets, tasks, and generative models, the authors show that many metrics fail to predict downstream utility. The results highlight the need for better evaluation approaches and provide practical guidance for researchers and practitioners working with synthetic data in sensitive domains.

Membership Inference Attacks and Differential Privacy: a study within the context of Generative Models

This paper explores how membership inference attacks apply to generative models and their connection to differential privacy. It introduces a unified Bayesian framework that defines and evaluates membership inference risk, showing how existing approaches fit within this formulation. The study also proposes a new definition of differential privacy tailored to generative models and demonstrates through simulations how factors like overfitting, prior knowledge, and noise affect privacy risks. The findings highlight the importance of balancing fidelity, utility, and privacy when generating synthetic data.

Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios

This paper presents a new framework to improve synthetic tabular data generation when real data are scarce. By introducing artificial inductive biases into Deep Generative Models through transfer and meta-learning techniques, the method enhances both data quality and utility. Tests on benchmark datasets show substantial improvements, up to 60% in divergence metrics, making the approach especially valuable in fields like healthcare and finance, where reliable data are limited.

Advancing Cancer Research with Synthetic Data Generation in Low-Data Scenarios

Medical research often faces data scarcity, especially in cancer survival studies where patient records are limited. This paper proposes a novel Synthetic Tabular Data Generation (STDG) methodology that uses transfer learning and meta-learning to create high-quality synthetic datasets under constrained conditions. Tested on both large classification and scarce cancer survival datasets, the approach improves data similarity and clinical applicability, showing promise for advancing research while preserving patient privacy.

Improving Synthetic Data Generation through Federated Learning in Scarce and Heterogeneous Data Scenarios (Big Data and Cognitive Computing)

This study introduces Synthetic Data Sharing (SDS), a federated learning approach in which institutions exchange synthetic patient data instead of raw records. Tested on medical datasets, SDS outperforms traditional methods by generating more accurate and representative data, especially in scarce and heterogeneous scenarios. The results show its potential to reduce disparities between data-rich and data-poor institutions while preserving privacy.

Privacy Mechanisms and Evaluation Metrics for Synthetic Data Generation: A Systematic Review

This systematic review explores the role of synthetic data in enhancing privacy. Covering 105 studies, it highlights differential privacy and GAN models, especially in healthcare, and outlines key trends and future research directions.

Automated Knowledge-Based Cybersecurity Risk Assessment of Cyber-Physical Systems

Stephen Phillips from the University of Southampton presents a novel approach for automated cybersecurity risk assessment of cyber-physical systems. The method uses a comprehensive knowledge base to model and simulate threats, streamlining ISO 27005 implementation. Validated through real-world case studies, it offers enhanced transparency, reproducibility, and performance in risk management.

MOSAIC: An Artificial Intelligence–Based Framework for Multimodal Analysis, Classification, and Personalized Prognostic Assessment in Rare Cancers

The study introduces MOSAIC, an AI-based framework for analyzing and predicting outcomes in rare cancers, tested on 4,427 myelodysplastic syndrome (MDS) patients. Advanced clustering and AI methods improved patient stratification and survival prediction over traditional techniques. UMAP + HDBSCAN achieved better accuracy, and AI models outperformed conventional ones. SHAP analysis provided insights into key features, and federated implementation enhanced model accuracy and data protection, demonstrating MOSAIC’s potential for clinical use.

Clinical and Genomic-Based Decision Support System to Define the Optimal Timing of Allogeneic Hematopoietic Stem-Cell Transplantation in Patients With Myelodysplastic Syndromes

This study aims to optimize the timing of allogeneic hematopoietic stem-cell transplantation (HSCT) for patients with myelodysplastic syndromes (MDS) using the Molecular International Prognostic Scoring System (IPSS-M), which includes clinical and genomic information. Analyzing a retrospective cohort of 7,118 patients, the study finds that low to moderate-low risk patients benefit from delayed HSCT, while high-risk patients benefit from immediate HSCT. The IPSS-M based strategy significantly changes transplantation timing decisions compared to conventional methods, improving life expectancy. This supports the clinical relevance of incorporating genomic data into HSCT timing decisions for personalized treatment.

Personalized Timing for Allogeneic Stem-Cell Transplantation in Hematologic Neoplasms: A Target Trial Emulation Approach Using Multistate Modeling and Microsimulation

This study develops a framework to optimize the timing of allogeneic hematopoietic stem-cell transplantation (HSCT) for patients with hematologic neoplasms using real-world data. By leveraging multistate modeling and microsimulation on a cohort of 7,118 patients with myelodysplastic syndromes, the analysis identifies optimal timing for HSCT based on individual patient profiles. The methodology provides insights and evidence for clinical decision-making, addressing complex scenarios where randomized trials are not feasible.

Protecting Multiple Sensitive Attributes in Synthetic Micro-data

This paper explores the use of synthetic data as a privacy-preserving measure in data analysis, emphasizing the need to protect sensitive attributes while maintaining data utility. It investigates enhancements to the DataSynthesizer model, using Bayesian Networks to generate synthetic data that safeguards multiple sensitive attributes against inference attacks. The study analyzes the impact of these techniques on data utility and was presented at the 2023 IEEE International Conference on Big Data.

Federated learning for causal inference using deep generative disentangled models

In the context of decentralized and privacy-constrained healthcare data settings, we introduce an innovative approach to estimate individual treatment effects (ITE) via federated learning. Emphasizing the critical importance of data privacy in healthcare, especially when drawing on data from various global hospitals, we address challenges arising from data scarcity and specific treatment assignment criteria influenced by the availability of the medication of interest. Our methodology uses federated learning applied to neural network-based generative causal inference models to bridge the gap between decentralized and centralized ITE estimation on a benchmark dataset.

Sickle cell disease landscape and challenges in the EU: the ERN-EuroBloodNet perspective

Sickle cell disease is a hereditary multiorgan disease that is considered rare in the EU. In 2017, the Rare Diseases Plan was implemented within the EU and 24 European Reference Networks (ERNs) were created, including the ERN on Rare Haematological Diseases (ERN-EuroBloodNet). The role of the ERN-EuroBloodNet is to improve the overall approach to and the management of individuals with sickle cell disease in the EU through specific actions on the pooling of expertise, knowledge, and best practices; the development of training and education programmes; the strategy for systematic gathering and standardisation of clinical data; and its reuse in clinical research.

Synthetic Data Generation by Artificial Intelligence to Accelerate Research and Precision Medicine in Hematology

Synthetic data are artificial data generated by an algorithm trained to learn the characteristics of a real source data set, without including any real patient information, and have become widely used to accelerate research in life sciences. In this work, researchers apply generative artificial intelligence to build synthetic data in different hematologic neoplasms; develop a synthetic validation framework to assess data fidelity and privacy preservation; and test the capability of synthetic data to accelerate clinical and translational research in hematology.