Synthema
Publications
Synthetic Histopathological Images Generation with Artificial Intelligence to Accelerate Research and Improve Clinical Outcomes in Hematology
This study shows how artificial intelligence can generate realistic synthetic bone-marrow images to support research and personalised care in myeloid neoplasms. Using a fine-tuned Stable Diffusion model guided by a haematology-specific language model, high-quality synthetic histopathology images were created that closely matched real samples. As demonstrated in the validation results, these synthetic images improved disease-classification accuracy and strengthened survival-prediction models when used alongside real data. This approach offers a safe and effective way to expand datasets, facilitate data sharing, and accelerate precision-medicine tools in haematology.
Clinical Text Reports to Stratify Patients Affected with Myeloid Neoplasms Using Natural Language Processing
This study demonstrates how natural language processing can unlock clinically meaningful information from unstructured haematology reports. Using a domain-adapted BERT model (HematoBERT), the GenoMed4All and Synthema consortia analysed clinical text from patients with myelodysplastic syndromes, myeloproliferative neoplasms and acute myeloid leukaemia. Unsupervised clustering of text embeddings identified seven patient groups reflecting known diagnostic and genotypic–phenotypic associations, including distinctions between MDS subtypes, MPN entities and AML categories. Survival analyses showed that clusters derived solely from clinical reports achieved prognostic separation comparable to models based on structured clinical and genomic data. HematoBERT outperformed generic language models, confirming the value of domain-specific adaptation. These findings highlight clinical text as an early, information-rich data layer that can support more precise disease stratification within multimodal personalised medicine frameworks.
Generation of Multimodal Longitudinal Synthetic Data By Artificial Intelligence to Improve Personalized Medicine in Hematology
This study presents an artificial intelligence–driven framework for generating high-fidelity multimodal synthetic data to accelerate personalised medicine in myeloid neoplasms. Using generative models—including conditional GANs, Tabular-VAEs, Tabular-GPT, and Stable Diffusion—the GenoMed4All and Synthema consortia produced synthetic clinical, genomic, cytogenetic, transcriptomic, and bone-marrow image data that closely replicated real-world datasets from MDS and AML patients. Validation through a dedicated Synthetic Validation Framework demonstrated high statistical, biological, and clinical fidelity across all data layers, with strong preservation of longitudinal survival patterns and low privacy risk. Models trained on hybrid real-plus-synthetic datasets achieved performance comparable to those trained exclusively on real data, and in some cases improved classification and prognostic accuracy. The resulting JUNO platform enables clinicians to generate and explore synthetic patient cohorts, offering a privacy-compliant tool to support research, model development, and future clinical trial design.
Data-driven, harmonised classification system for myelodysplastic syndromes: a consensus paper from the International Consortium for Myelodysplastic Syndromes
This study applies a data-driven approach to harmonise the 2022 WHO and International Consensus Classification systems for myelodysplastic syndromes (MDS), addressing inconsistencies that hinder their clinical adoption. Using genomic clustering and expert consensus (via a modified Delphi process), nine biologically distinct MDS groups were identified, led by a cluster defined by biallelic TP53 inactivation. Subsequent clusters were characterised by isolated del(5q) and SF3B1 mutations, with additional rules established to refine label definitions. Morphologically defined MDS subtypes showed substantial genomic heterogeneity, indicating limited alignment between traditional criteria (eg, dysplasia patterns, blast percentages) and underlying biology. An exploration of the continuum between higher-blast MDS and acute myeloid leukaemia revealed only partial genetic overlap. The final consensus recognised MDS with low blasts (<5%) and MDS with increased blasts (≥5%) as discrete clinical entities. Overall, the harmonised framework enhances diagnostic precision and supports more consistent clinical decision-making in real-world practice.
Combining Gene Mutation with Transcriptomic Data Improves Outcome Prediction in Myelodysplastic Syndromes
This study investigates whether combining genomic and transcriptomic data can improve outcome prediction in myelodysplastic syndromes. Using diagnostic samples from 389 patients, the GenoMed4All and Synthema consortia integrated somatic mutations, cytogenetics, bulk RNA-sequencing of CD34⁺ cells, and clinical variables into a penalised Cox model. The combined approach achieved a concordance index of 0.83 for overall survival, substantially outperforming established prognostic systems such as IPSS-R and IPSS-M. Analysis of explained variance showed that transcriptomic features contributed the largest share (40%) to survival prediction. These findings demonstrate that gene expression data provide significant additional prognostic value and support the development of integrated molecular tools for personalised risk assessment in MDS.
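For readers unfamiliar with the concordance index reported above, the following is a minimal sketch of how Harrell's C-index can be computed for right-censored survival data. It is not the study's code: the function name, the toy inputs, and the choice to skip pairs with tied follow-up times are assumptions made for illustration only.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index (illustrative sketch, not the study's implementation).

    times       -- observed follow-up time per patient
    events      -- 1 if the event (e.g. death) was observed, 0 if censored
    risk_scores -- model output, where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the patient with the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Simplification (assumption): pairs with tied times are skipped.
        # A pair is also not comparable if the earlier patient was censored.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # earlier event had the higher predicted risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: four patients, the third one censored.
toy_c = concordance_index(
    times=[5, 10, 12, 20],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.4, 0.6, 0.1],
)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of patients by risk, which is the scale on which the 0.83 figure should be read.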
Artificial Intelligence-Powered Digital Pathology to Improve Diagnosis and Personalized Prognostic Assessment in Patient with Myeloid Neoplasms
This study presents an artificial intelligence–driven digital pathology approach to enhance diagnosis and personalised prognostic assessment in myeloid neoplasms. Using whole-slide bone marrow images from 1,167 patients, models developed within the GenoMed4All and Synthema consortia extracted high-dimensional morphological features across multiple staining types. These features enabled highly accurate diagnostic classification and prediction of key genomic mutations, and—when integrated with clinical and molecular data—substantially improved risk stratification for overall and leukaemia-free survival. The work demonstrates that AI-powered digital pathology can capture biologically meaningful information and significantly strengthen precision medicine efforts in myeloid neoplasms.
An Artificial Intelligence-Based Federated Learning Platform to Boost Precision Medicine in Rare Hematological Diseases: An Initiative By GenoMed4all and Synthema Consortia
This study introduces a Federated Learning platform developed by the GenoMed4All and Synthema consortia to support precision medicine in rare haematological diseases without sharing sensitive patient data. Using myelodysplastic syndromes as a case study, the platform enables multiple centres to train shared AI models on local clinical and genomic datasets, achieving strong predictive performance even with incomplete data. Fully GDPR-compliant, it will be deployed across the EuroBloodNET network and extended to include medical imaging, offering a secure and scalable solution for advancing personalised care in haematology.
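The idea of training a shared model without moving patient data can be sketched with federated averaging. The summary does not state which aggregation scheme the platform uses, so federated averaging, the linear model, and every name and number below are assumptions chosen for illustration only.

```python
from typing import List, Tuple

# One centre's local data: (features, target) pairs that never leave the centre.
Dataset = List[Tuple[List[float], float]]


def local_update(weights: List[float], data: Dataset,
                 lr: float = 0.1, epochs: int = 5) -> List[float]:
    """Run a few epochs of SGD for a linear model on one centre's own data."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def federated_average(updates: List[Tuple[List[float], int]]) -> List[float]:
    """Average model weights, weighting each centre by its sample count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[k] * n for w, n in updates) / total for k in range(dim)]


def train_federated(centres: List[Dataset], dim: int,
                    rounds: int = 20) -> List[float]:
    """Only model weights are exchanged; raw records stay at each centre."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        updates = [(local_update(global_w, data), len(data)) for data in centres]
        global_w = federated_average(updates)
    return global_w


# Hypothetical toy data following y = 2*x + 1 (second feature is a bias term).
centre_a: Dataset = [([1.0, 1.0], 3.0), ([2.0, 1.0], 5.0)]
centre_b: Dataset = [([0.0, 1.0], 1.0), ([3.0, 1.0], 7.0)]
shared_weights = train_federated([centre_a, centre_b], dim=2)
```

In this sketch the coordinating server sees only weight vectors and sample counts. A production system would typically add secure aggregation, handling of incomplete data, and governance controls that are outside the scope of this example.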
A Comprehensive, Artificial Intelligence, Digital Twin Platform Based on Multimodal Real-World Data Integration for Personalized Medicine in Hematology
This study introduces GEMINI, an advanced artificial intelligence–driven Digital Twin platform designed to support personalised medicine in haematology by integrating large-scale multimodal real-world data from more than 22,000 patients with myelodysplastic syndromes (MDS). Developed using privacy-preserving federated learning and synthetic data technologies, GEMINI consolidates clinical, genomic, imaging, and patient-reported information into a comprehensive decision-support tool. The platform provides individualised predictions on survival, risk of leukaemic evolution, and treatment response, and simulates disease trajectories and quality-of-life outcomes through an interactive interface. By enabling clinicians and researchers to explore high-fidelity patient simulations without the need for data sharing, GEMINI demonstrates the potential of Digital Twins to advance precision medicine in haematology.
Synthetic Data in Healthcare
Synthetic data is changing how healthcare data can be used, shared and protected. This white paper highlights how it helps address long-standing issues in digital health, from data scarcity and bias to privacy and regulatory complexity. It shows that synthetic data can make AI models more reliable by balancing underrepresented data, enable new approaches to clinical trials through virtual populations, and support GDPR-compliant collaboration across borders. At the same time, the report points out the need for common benchmarks, legal clarity and trustworthy infrastructure to scale synthetic data safely across health systems. These insights outline a clear direction for Europe’s digital health future: data that is secure, inclusive and ready for innovation.
Synthetic Tabular Data Generation Under Horizontal Federated Learning Environments in Acute Myeloid Leukemia: Case-Based Simulation Study
This study evaluates the combination of synthetic data generation and federated learning in the context of acute myeloid leukemia, a rare hematological disease. Using two state-of-the-art generative models across various data distribution scenarios, the research shows that horizontal federation leads to a loss in data fidelity while maintaining privacy. Despite this trade-off, increasing the number of nodes does not significantly worsen performance, making the approach promising for privacy-preserving data generation in biomedical research.
An improved tabular data generator with VAE-GMM integration
Synthetic tabular data validation: A divergence-based approach
Propensity Weighted federated learning for treatment effect estimation in distributed imbalanced environments
Membership Inference Attacks and Differential Privacy: a study within the context of Generative Models
Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios
Advancing Cancer Research with Synthetic Data Generation in Low-Data Scenarios
Improving Synthetic Data Generation through Federated Learning in Scarce and Heterogeneous Data Scenarios (Big Data and Cognitive Computing)
Privacy Mechanisms and Evaluation Metrics for Synthetic Data Generation: A Systematic Review
Automated Knowledge-Based Cybersecurity Risk Assessment of Cyber-Physical Systems
MOSAIC: An Artificial Intelligence–Based Framework for Multimodal Analysis, Classification, and Personalized Prognostic Assessment in Rare Cancers
Clinical and Genomic-Based Decision Support System to Define the Optimal Timing of Allogeneic Hematopoietic Stem-Cell Transplantation in Patients With Myelodysplastic Syndromes
This study aims to optimize the timing of allogeneic hematopoietic stem-cell transplantation (HSCT) for patients with myelodysplastic syndromes (MDS) using the Molecular International Prognostic Scoring System (IPSS-M), which includes clinical and genomic information. Analyzing a retrospective cohort of 7,118 patients, the study finds that low- to moderate-low-risk patients benefit from delayed HSCT, while high-risk patients benefit from immediate HSCT. The IPSS-M-based strategy significantly changes transplantation timing decisions compared to conventional methods, improving life expectancy. This supports the clinical relevance of incorporating genomic data into HSCT timing decisions for personalized treatment.
Personalized Timing for Allogeneic Stem-Cell Transplantation in Hematologic Neoplasms: A Target Trial Emulation Approach Using Multistate Modeling and Microsimulation
This study develops a framework to optimize the timing of allogeneic hematopoietic stem-cell transplantation (HSCT) for patients with hematologic neoplasms using real-world data. By leveraging multistate modeling and microsimulation on a cohort of 7,118 patients with myelodysplastic syndromes, the analysis identifies optimal timing for HSCT based on individual patient profiles. The methodology provides insights and evidence for clinical decision-making, addressing complex scenarios where randomized trials are not feasible.
Protecting Multiple Sensitive Attributes in Synthetic Micro-data
This paper explores the use of synthetic data as a privacy-preserving measure in data analysis, emphasizing the need to protect sensitive attributes while maintaining data utility. It investigates enhancements to the DataSynthesizer model, using Bayesian Networks to generate synthetic data that safeguards multiple sensitive attributes against inference attacks. The study analyzes the impact of these techniques on data utility and was presented at the 2023 IEEE International Conference on Big Data.
Federated learning for causal inference using deep generative disentangled models
Sickle cell disease landscape and challenges in the EU: the ERN-EuroBloodNet perspective