SYNTHEMA | Synthetic Haematological Data

Publications

Synthetic Tabular Data Generation Under Horizontal Federated Learning Environments in Acute Myeloid Leukemia: Case-Based Simulation Study

This study evaluates the combination of synthetic data generation and federated learning in the context of acute myeloid leukemia, a rare hematological disease. Using two state-of-the-art generative models across various data distribution scenarios, the research shows that horizontal federation leads to a loss in data fidelity while maintaining privacy. Despite this trade-off, increasing the number of nodes does not significantly worsen performance, making the approach promising for privacy-preserving data generation in biomedical research.
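As a rough illustration of horizontal federation, the sketch below averages model parameters across nodes, weighting each node by its number of records (the standard FedAvg rule). The summary does not state which aggregation scheme the study uses, so this choice is an assumption, and `federated_average`, `node_params`, and `node_sizes` are hypothetical names.

```python
def federated_average(node_params, node_sizes):
    """Average parameter vectors across nodes, weighted by local sample count.

    Assumption: FedAvg-style aggregation; the study's actual scheme is not
    described in the summary above.
    """
    total = sum(node_sizes)
    dim = len(node_params[0])
    averaged = [0.0] * dim
    for params, size in zip(node_params, node_sizes):
        weight = size / total  # nodes with more records count for more
        for i, value in enumerate(params):
            averaged[i] += weight * value
    return averaged


# Two hypothetical hospitals holding 100 and 300 patient records.
global_params = federated_average([[1.0, 1.0], [3.0, 3.0]], [100, 300])
```

Only parameters leave each node in this setup; the patient records themselves stay local.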

An improved tabular data generator with VAE-GMM integration.

This paper introduces a new approach for generating synthetic tabular data by combining Variational Autoencoders with a Bayesian Gaussian Mixture model. The method improves representation of complex data distributions, handling both continuous and discrete features more effectively than existing models like CTGAN and TVAE. Validation on real-world datasets, including medical data, shows significant performance gains, highlighting its potential for applications in healthcare and beyond.

Synthetic tabular data validation: A divergence-based approach.

This paper proposes a systematic framework to validate synthetic tabular data using divergence-based metrics. By distinguishing between data fidelity and data utility, the approach allows for more reliable evaluation of synthetic datasets. The study demonstrates how these measures can guide the selection of generative models and ensure synthetic data are both realistic and useful for downstream tasks.
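To give a sense of what a divergence-based fidelity check can look like, the sketch below computes a histogram-based Jensen-Shannon divergence between one real and one synthetic numeric column. This is a generic estimator chosen for illustration, not necessarily the one used in the paper, and the function names and the `bins` parameter are assumptions.

```python
import math


def histogram(values, lo, hi, bins):
    """Return normalized bin frequencies of values over the range [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Values on the top edge are clamped into the last bin.
        idx = min(int((v - lo) / width), bins - 1) if width > 0 else 0
        counts[idx] += 1
    return [c / len(values) for c in counts]


def js_divergence(real, synth, bins=20):
    """Jensen-Shannon divergence in bits: 0 for identical histograms, 1 for disjoint ones."""
    lo = min(min(real), min(synth))
    hi = max(max(real), max(synth))
    # Shared bin edges keep the two histograms comparable.
    p = histogram(real, lo, hi, bins)
    q = histogram(synth, lo, hi, bins)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing (0 * log 0 is taken as 0).
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

A lower value suggests the synthetic column follows the real distribution more closely. A per-column check like this says nothing about relationships between columns or about downstream utility.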

Propensity Weighted federated learning for treatment effect estimation in distributed imbalanced environments.

This paper investigates whether commonly used metrics for evaluating synthetic data actually reflect real-world model performance. By conducting a large-scale empirical study across various datasets, tasks, and generative models, the authors show that many metrics fail to predict downstream utility. The results highlight the need for better evaluation approaches and provide practical guidance for researchers and practitioners working with synthetic data in sensitive domains.

Membership Inference Attacks and Differential Privacy: a study within the context of Generative Models.

This paper explores how membership inference attacks apply to generative models and their connection to differential privacy. It introduces a unified Bayesian framework that defines and evaluates membership inference risk, showing how existing approaches fit within this formulation. The study also proposes a new definition of differential privacy tailored to generative models and demonstrates through simulations how factors like overfitting, prior knowledge, and noise affect privacy risks. The findings highlight the importance of balancing fidelity, utility, and privacy when generating synthetic data.

Artificial Inductive Bias for Synthetic Tabular Data Generation in Data-Scarce Scenarios.

This paper presents a new framework to improve synthetic tabular data generation when real data are scarce. By introducing artificial inductive biases into Deep Generative Models through transfer and meta-learning techniques, the method enhances both data quality and utility. Tests on benchmark datasets show substantial improvements, up to 60% in divergence metrics, making the approach especially valuable in fields like healthcare and finance, where reliable data are limited.

Advancing Cancer Research with Synthetic Data Generation in Low-Data Scenarios

Medical research often faces data scarcity, especially in cancer survival studies where patient records are limited. This paper proposes a novel Synthetic Tabular Data Generation (STDG) methodology that uses transfer learning and meta-learning to create high-quality synthetic datasets under constrained conditions. Tested on both large classification and scarce cancer survival datasets, the approach improves data similarity and clinical applicability, showing promise for advancing research while preserving patient privacy.

Improving Synthetic Data Generation through Federated Learning in Scarce and Heterogeneous Data Scenarios (Big Data and Cognitive Computing)

This study introduces Synthetic Data Sharing (SDS), a federated learning approach where institutions exchange synthetic patient data instead of raw records. Tested on medical datasets, SDS outperforms traditional methods by generating more accurate and representative data, especially in scarce and heterogeneous scenarios. The results show its potential to reduce disparities between data-rich and data-poor institutions while preserving privacy.

Privacy Mechanisms and Evaluation Metrics for Synthetic Data Generation: A Systematic Review

This systematic review explores synthetic data's role in enhancing privacy. Covering 105 studies, it highlights differential privacy and GAN models, especially in healthcare, and identifies key trends and future research directions.

Automated Knowledge-Based Cybersecurity Risk Assessment of Cyber-Physical Systems

Stephen Phillips from the University of Southampton presents a novel approach for automated cybersecurity risk assessment of cyber-physical systems. This method uses a comprehensive knowledge-base to model and simulate threats, streamlining ISO 27005 implementation. Validated through real-world case studies, it offers enhanced transparency, reproducibility, and performance in risk management.

MOSAIC: An Artificial Intelligence–Based Framework for Multimodal Analysis, Classification, and Personalized Prognostic Assessment in Rare Cancers

The study introduces MOSAIC, an AI-based framework for analyzing and predicting outcomes in rare cancers, tested on 4,427 myelodysplastic syndrome (MDS) patients. Advanced clustering and AI methods improved patient stratification and survival prediction over traditional techniques. UMAP + HDBSCAN achieved better accuracy, and AI models outperformed conventional ones. SHAP analysis provided insights into key features, and federated implementation enhanced model accuracy and data protection, demonstrating MOSAIC’s potential for clinical use.

Clinical and Genomic-Based Decision Support System to Define the Optimal Timing of Allogeneic Hematopoietic Stem-Cell Transplantation in Patients With Myelodysplastic Syndromes

This study aims to optimize the timing of allogeneic hematopoietic stem-cell transplantation (HSCT) for patients with myelodysplastic syndromes (MDS) using the Molecular International Prognostic Scoring System (IPSS-M), which includes clinical and genomic information. Analyzing a retrospective cohort of 7,118 patients, the study finds that low to moderate-low risk patients benefit from delayed HSCT, while high-risk patients benefit from immediate HSCT. The IPSS-M based strategy significantly changes transplantation timing decisions compared to conventional methods, improving life expectancy. This supports the clinical relevance of incorporating genomic data into HSCT timing decisions for personalized treatment.

Personalized Timing for Allogeneic Stem-Cell Transplantation in Hematologic Neoplasms: A Target Trial Emulation Approach Using Multistate Modeling and Microsimulation

This study develops a framework to optimize the timing of allogeneic hematopoietic stem-cell transplantation (HSCT) for patients with hematologic neoplasms using real-world data. By leveraging multistate modeling and microsimulation on a cohort of 7,118 patients with myelodysplastic syndromes, the analysis identifies optimal timing for HSCT based on individual patient profiles. The methodology provides insights and evidence for clinical decision-making, addressing complex scenarios where randomized trials are not feasible.

Protecting Multiple Sensitive Attributes in Synthetic Micro-data

This paper explores the use of synthetic data as a privacy-preserving measure in data analysis, emphasizing the need to protect sensitive attributes while maintaining data utility. It investigates enhancements to the DataSynthesizer model, using Bayesian Networks to generate synthetic data that safeguards multiple sensitive attributes against inference attacks. The study analyzes the impact of these techniques on data utility. It was presented at the 2023 IEEE International Conference on Big Data.

Federated learning for causal inference using deep generative disentangled models

In the context of decentralized and privacy-constrained healthcare data settings, we introduce an innovative approach to estimate individual treatment effects (ITE) via federated learning. Emphasizing the critical importance of data privacy in healthcare, especially when drawing on data from various global hospitals, we address challenges arising from data scarcity and specific treatment assignment criteria influenced by the availability of the medication of interest. Our methodology uses federated learning applied to neural network-based generative causal inference models to bridge the gap between decentralized and centralized ITE estimation on a benchmark dataset.

Sickle cell disease landscape and challenges in the EU: the ERN-EuroBloodNet perspective

Sickle cell disease is a hereditary multiorgan disease that is considered rare in the EU. In 2017, the Rare Diseases Plan was implemented within the EU and 24 European Reference Networks (ERNs) were created, including the ERN on Rare Haematological Diseases (ERN-EuroBloodNet). The role of the ERN-EuroBloodNet is to improve the overall approach to and the management of individuals with sickle cell disease in the EU through the pooling of expertise, knowledge, and best practices; the development of training and education programmes; a strategy for the systematic gathering and standardisation of clinical data; and the reuse of those data in clinical research.

Synthetic Data Generation by Artificial Intelligence to Accelerate Research and Precision Medicine in Hematology

Synthetic data are artificial data that contain no real patient information. They are generated by an algorithm trained to learn the characteristics of a real source data set, and they have become widely used to accelerate research in the life sciences. In this work, researchers apply generative artificial intelligence to build synthetic data in different hematologic neoplasms; develop a synthetic validation framework to assess data fidelity and privacy preservability; and test the capability of synthetic data to accelerate clinical/translational research in hematology.