WP3: Data Anonymisation and Synthetic Data Generation Pipelines

Purpose and Objectives

Work Package 3 (WP3) focuses on creating and implementing robust pipelines for generating GDPR-compliant health data assets tailored for clinical use cases. These pipelines are rigorously assessed for clinical utility, statistical relevance, privacy, and security. The objective is to develop distributable health data solutions that meet both ethical and operational standards.

The first 18 months of WP3 activity have been centered on building foundational tools, defining strategies, and addressing challenges in data anonymization and synthetic data generation.

Key Results

Pipeline Implementation: Proof-of-concept pipeline created, integrating anonymization and synthetic data generation technologies.
Anonymization Strategy: Collaborated with clinical partners to design strategies aligned with the SCD and AML datasets’ structures.
Federated Learning Integration: Developed federated implementations of CTGAN and Bayesian networks, accompanied by differential privacy tests.
Causal Inference Tools: Implemented algorithms for estimating Average Treatment Effects (ATE) and Individualized Treatment Effects (ITE) from clinical data.
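
To illustrate the two quantities named under Causal Inference Tools, here is a minimal sketch rather than the WP3 implementation. It assumes a randomized treatment and a single covariate, estimates the average effect as a difference in means, and estimates individualized effects with a simple two-model approach (often called a T-learner) built on one-variable least squares. The data generator and all function names (make_semi_synthetic, estimate_ate, fit_t_learner) are hypothetical and exist only for this example.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def make_semi_synthetic(n, seed=0):
    """Hypothetical generator: covariate x, randomized treatment t, outcome y.

    The true individual effect is 1 + 2*x, so with x uniform on [0, 1)
    the true average effect is 2.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.random()
        t = 1 if rng.random() < 0.5 else 0   # randomized assignment
        y = 3 * x + (1 + 2 * x) * t + rng.gauss(0, 0.1)
        rows.append((x, t, y))
    return rows


def estimate_ate(rows):
    """Average effect as a difference in means (valid under randomization)."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def fit_t_learner(rows):
    """Fit one outcome model per arm; the individualized effect is their difference."""
    a1, b1 = fit_line([x for x, t, _ in rows if t == 1],
                      [y for _, t, y in rows if t == 1])
    a0, b0 = fit_line([x for x, t, _ in rows if t == 0],
                      [y for _, t, y in rows if t == 0])
    return lambda x: (a1 + b1 * x) - (a0 + b0 * x)


rows = make_semi_synthetic(5000)
ate = estimate_ate(rows)
ite = fit_t_learner(rows)
print(f"ATE estimate: {ate:.2f}")
print(f"ITE at x=0: {ite(0.0):.2f}, at x=1: {ite(1.0):.2f}")
```

Because the generator fixes the true effect, the estimates can be checked against a known ground truth, which is the usual reason for using semi-synthetic data in causal inference tests. Observational clinical data would additionally require adjustment for confounding, which this sketch does not attempt.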

Development Highlights

Deployed software packages for fidelity metrics and the initial anonymization engine for the AML dataset.
Tested Bayesian networks for synthetic data generation (SDG) in federated and centralized settings.
Advanced the preprocessing and generation of semi-synthetic datasets for causal inference tests.
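
As an illustration of what a fidelity metric can look like, the sketch below scores how closely each categorical column of a synthetic dataset matches the real one, using one minus the total variation distance between the two empirical distributions. This is an assumed example metric, not necessarily one shipped in the WP3 packages; the function names and the toy columns ("sex", "stage") are hypothetical.

```python
from collections import Counter


def total_variation_distance(real, synthetic):
    """TVD between the empirical category distributions of two columns.

    0 means identical distributions; 1 means no shared categories.
    """
    p, q = Counter(real), Counter(synthetic)
    n_p, n_q = len(real), len(synthetic)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in categories)


def column_fidelity(real_rows, synth_rows, columns):
    """Per-column fidelity score in [0, 1]; higher means closer marginals."""
    return {
        c: 1.0 - total_variation_distance(
            [r[c] for r in real_rows], [s[c] for s in synth_rows]
        )
        for c in columns
    }


# Toy records standing in for real and synthetic patient tables.
real_rows = [
    {"sex": "F", "stage": "I"},
    {"sex": "F", "stage": "II"},
    {"sex": "M", "stage": "I"},
    {"sex": "M", "stage": "II"},
]
synth_rows = [
    {"sex": "F", "stage": "I"},
    {"sex": "M", "stage": "II"},
    {"sex": "M", "stage": "I"},
    {"sex": "M", "stage": "II"},
]

scores = column_fidelity(real_rows, synth_rows, ["sex", "stage"])
print(scores)
```

A marginal score of this kind only compares one column at a time; it says nothing about whether relationships between columns are preserved, so in practice it would be one of several metrics.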

Future Objectives

Year 3 aims to integrate the SDG and anonymization prototypes into the production environment of the WP2 infrastructure, expanding their functionality and scalability. This includes automating the evaluation of fidelity metrics, extending the causal inference models to survival analysis, and strengthening collaboration with other work packages.

WP3 continues to bridge technological innovation with clinical applicability, ensuring data privacy while maximizing utility for healthcare research.
