WP3 – Data Anonymization and Synthetic Data Generation Pipelines
Main Purpose and Objectives
Work Package 3 (WP3) of the SYNTHEMA project develops and deploys the core functionality for creating GDPR-compliant health data assets for specific clinical use cases such as Sickle Cell Disease (SCD) and Acute Myeloid Leukemia (AML). The work package focuses on building a pipeline for generating shareable data assets and on applying data anonymization and Synthetic Data Generation (SDG) techniques, including statistical association, causal modeling, and federated learning.
Detailed Activities and Progress
T3.1: Shareable Data Assets Pipeline
Work on the data assets pipeline has so far produced three results: flow, interconnection, and architecture schemas for pipeline management; an evaluation of candidate technologies for the pipeline’s development and deployment; and a proof of concept with basic functionality such as model training and SDG sample generation. This setup makes it possible to track SDG models, link each one to the data used to train it, and evaluate the fidelity of the generated synthetic samples against the original data.
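The linking of SDG models to their training data can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `LineageRegistry` and `ModelRecord` names, the in-memory store, and the use of a SHA-256 fingerprint are hypothetical and do not describe the technologies actually selected for the pipeline.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint_rows(rows):
    """Return a SHA-256 fingerprint of a list of dict records.

    Keys are sorted so that key order inside a record does not matter;
    the order of the records themselves still does.
    """
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class ModelRecord:
    """Hypothetical registry entry linking an SDG model to its training data."""
    model_name: str
    training_data_fingerprint: str
    fidelity_reports: list = field(default_factory=list)


class LineageRegistry:
    """In-memory stand-in for whatever tracking store the pipeline uses."""

    def __init__(self):
        self._records = {}

    def register_model(self, model_name, training_rows):
        """Record which dataset (by fingerprint) a model was trained on."""
        record = ModelRecord(model_name, fingerprint_rows(training_rows))
        self._records[model_name] = record
        return record

    def attach_fidelity_report(self, model_name, report):
        """Attach a fidelity evaluation result to a registered model."""
        self._records[model_name].fidelity_reports.append(report)

    def models_trained_on(self, rows):
        """List the models whose training data matches the given records."""
        target = fingerprint_rows(rows)
        return [r.model_name for r in self._records.values()
                if r.training_data_fingerprint == target]
```

Storing a fingerprint rather than the records keeps the registry free of patient data while still allowing a model to be traced back to the exact dataset version it was trained on.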
T3.2: Anonymization Engine for Target Data Modalities
Work in T3.2 has focused on designing the anonymization strategy for the SCD and AML datasets and starting to put it into practice. This includes:
- Anonymization Techniques: Exploring state-of-the-art tools like the internal INTRA project “Nomos,” the ARX open-source tool, and “Amnesia” developed by the Athena Research Center. Additionally, a custom Python-based solution is being tailored to meet the specific requirements of the project.
- Collaboration and Design: Engaging in discussions with clinical partners and Datawizard to refine the understanding of dataset variables and formats. A set of minimal variables for each dataset type has been identified, guiding the design components of the anonymization engine.
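As an illustration of what a custom Python-based anonymization step could look like, the sketch below generalizes quasi-identifiers and checks k-anonymity, a widely used privacy model. The source does not state which privacy model the project adopts, so the choice of k-anonymity is an assumption, and the field names (`age`, `postcode`, `diagnosis`) and generalization rules are hypothetical.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_record(record):
    """Apply illustrative generalization rules to one record."""
    return {
        "age_band": generalize_age(record["age"]),
        "region": record["postcode"][:2] + "***",  # keep only a coarse prefix
        "diagnosis": record["diagnosis"],          # sensitive attribute, kept as-is
    }


def equivalence_classes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest equivalence class (0 for an empty dataset)."""
    groups = equivalence_classes(records, quasi_identifiers)
    return min(groups.values()) if groups else 0


def suppress_small_groups(records, quasi_identifiers, k):
    """Drop records whose equivalence class has fewer than k members."""
    groups = equivalence_classes(records, quasi_identifiers)
    return [r for r in records
            if groups[tuple(r[q] for q in quasi_identifiers)] >= k]


# Hypothetical toy records, not project data.
raw = [
    {"age": 34, "postcode": "08001", "diagnosis": "SCD"},
    {"age": 37, "postcode": "08015", "diagnosis": "SCD"},
    {"age": 62, "postcode": "28004", "diagnosis": "AML"},
]
generalized = [generalize_record(r) for r in raw]
released = suppress_small_groups(generalized, ["age_band", "region"], k=2)
```

In this toy example the third record is the only one in its equivalence class, so it is suppressed and the two remaining records satisfy 2-anonymity over `age_band` and `region`.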
Open Issues and Questions
- SCD Dataset Biobank Data: Questions have arisen about the sensitivity of the biobank data and whether it discloses identifiable patient information.
- Variable Necessity: There is ongoing analysis to determine if any variables in the datasets are unnecessary for analysis and could potentially be suppressed to enhance privacy.
- Variable Addition: It is still being considered whether additional variables are needed for the initial version of the anonymization process.
Next Steps
- Documentation and Strategy: Update the anonymization strategy document to incorporate feedback from partners such as VHIR and ICH, and propose a tailored strategy for each selected variable.
- Development Initiatives: Kick-start the development of the Anonymization Engine and the Anonymized Data Catalogue.
T3.3: SDG Engine for Target Data Modalities
A proof of concept of the SDG engine has been created using the Flower federated learning framework and several open-source SDG libraries. This includes testing differential privacy (DP) adaptations and assessing models such as CTGAN for their efficacy in synthetic data generation under different privacy-preserving conditions. A minimum set of fidelity metrics, both quantitative and observational, has been agreed upon. A second proof of concept evaluates the fidelity of the synthetic data with respect to the real source samples and automatically produces the fidelity metrics as an interactive report.
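To make the idea of quantitative fidelity metrics concrete, the sketch below computes two common per-column scores: total variation distance for categorical columns and the two-sample Kolmogorov-Smirnov statistic for numeric columns. These two metrics are assumptions chosen for illustration; the source does not list the agreed metric set, and the plain dictionary returned by the hypothetical `fidelity_report` function only stands in for the interactive report. Inputs are assumed to be non-empty.

```python
from bisect import bisect_right
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """TVD between empirical category distributions (0 = identical, 1 = disjoint)."""
    real_freq = Counter(real_values)
    synth_freq = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(abs(real_freq[c] / n_real - synth_freq[c] / n_synth)
                     for c in categories)


def ks_statistic(real_values, synthetic_values):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    n_real, n_synth = len(real_sorted), len(synth_sorted)
    gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / n_real
        cdf_synth = bisect_right(synth_sorted, x) / n_synth
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def fidelity_report(real_rows, synthetic_rows, categorical, numeric):
    """Collect per-column scores into a plain dict keyed by column name."""
    report = {}
    for col in categorical:
        report[col] = {
            "metric": "tvd",
            "value": total_variation_distance(
                [r[col] for r in real_rows], [s[col] for s in synthetic_rows]),
        }
    for col in numeric:
        report[col] = {
            "metric": "ks",
            "value": ks_statistic(
                [r[col] for r in real_rows], [s[col] for s in synthetic_rows]),
        }
    return report
```

Both scores lie between 0 and 1, with lower values indicating that the synthetic column follows the real distribution more closely. They only compare marginal distributions, so a fuller evaluation would also need to look at relationships between columns.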
T3.4: Federated Training of SDG Models
This task has tested and improved SDG models based on Bayesian Networks, Variational Autoencoders (VAEs), and Generative Adversarial Network (GAN) architectures such as CTGAN, in both centralized and federated settings. Key challenges include adapting preprocessing techniques to federated settings and handling non-IID data distributions and imbalance in categorical attributes.
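One way to adapt preprocessing to a federated setting is for each site to share only aggregate statistics, from which a server derives global scaling parameters and a common category vocabulary. The sketch below illustrates that idea in plain Python. It is an assumption about how such preprocessing could work, not the project's method, and it does not use the Flower API; the function names and the `hb` and `genotype` columns are hypothetical.

```python
import math
from collections import Counter


def local_statistics(rows, numeric_col, categorical_col):
    """Computed on each client; only these aggregates leave the site."""
    values = [r[numeric_col] for r in rows]
    return {
        "count": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
        "categories": Counter(r[categorical_col] for r in rows),
    }


def aggregate_statistics(client_stats):
    """Computed on the server: pool client aggregates into global parameters."""
    n = sum(s["count"] for s in client_stats)
    total = sum(s["sum"] for s in client_stats)
    total_sq = sum(s["sum_sq"] for s in client_stats)
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)  # population variance
    categories = Counter()
    for s in client_stats:
        categories.update(s["categories"])
    return {
        "mean": mean,
        "std": math.sqrt(variance),
        "vocabulary": sorted(categories),  # shared one-hot ordering for all sites
        "category_counts": dict(categories),
    }


def transform(row, params, numeric_col, categorical_col):
    """Applied on each client using the shared global parameters."""
    scaled = (row[numeric_col] - params["mean"]) / (params["std"] or 1.0)
    one_hot = [1 if row[categorical_col] == c else 0
               for c in params["vocabulary"]]
    return [scaled] + one_hot


# Hypothetical toy data for two sites with different (non-IID) category mixes.
site_a = [{"hb": 8.0, "genotype": "SS"}, {"hb": 10.0, "genotype": "SS"}]
site_b = [{"hb": 12.0, "genotype": "SC"}]
params = aggregate_statistics([
    local_statistics(site_a, "hb", "genotype"),
    local_statistics(site_b, "hb", "genotype"),
])
encoded_b = [transform(r, params, "hb", "genotype") for r in site_b]
```

Because the vocabulary is built from all sites, a category that is absent at one hospital still gets a consistent position in the encoding, which matters when category frequencies differ across sites. Note that sharing exact counts of rare categories can itself reveal information, so a real deployment would likely need thresholds or noise on these aggregates.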
T3.5: In-silico Modelling of Optimal Treatments
This task implements centralized and distributed causal inference models that select the best treatment given observable confounders. Initial implementations have been tested on public datasets such as the Infant Health and Development Program (IHDP).
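The idea of selecting a treatment while accounting for observed confounders can be shown with a small backdoor-adjustment sketch. It assumes a single discrete confounder, that all confounders are observed, and that every treatment appears in every stratum. The source does not say which causal models the task uses, so this is only one simple instance; the column names `t`, `x`, and `y` and the toy data are hypothetical and unrelated to IHDP.

```python
from collections import defaultdict


def adjusted_mean_outcomes(rows, treatment="t", confounder="x", outcome="y"):
    """Backdoor adjustment: E[Y | do(T=t)] = sum over x of E[Y | T=t, X=x] * P(X=x)."""
    stratum_sizes = defaultdict(int)
    cell_sums = defaultdict(float)
    cell_counts = defaultdict(int)
    for r in rows:
        stratum_sizes[r[confounder]] += 1
        cell_sums[(r[treatment], r[confounder])] += r[outcome]
        cell_counts[(r[treatment], r[confounder])] += 1

    treatments = sorted({t for t, _ in cell_counts})
    n = len(rows)
    adjusted = {}
    for t in treatments:
        total = 0.0
        for x, size in stratum_sizes.items():
            count = cell_counts.get((t, x), 0)
            if count == 0:
                # Positivity is violated: this treatment is never seen in stratum x.
                raise ValueError(f"no units with treatment {t!r} in stratum {x!r}")
            total += (cell_sums[(t, x)] / count) * (size / n)
        adjusted[t] = total
    return adjusted


def best_treatment(adjusted):
    """Pick the treatment with the highest adjusted mean outcome."""
    return max(adjusted, key=adjusted.get)


# Toy data in which severity confounds treatment choice and outcome.
rows = (
    [{"t": "A", "x": "mild", "y": 8.0}] * 3
    + [{"t": "B", "x": "mild", "y": 9.0}]
    + [{"t": "A", "x": "severe", "y": 3.0}]
    + [{"t": "B", "x": "severe", "y": 4.0}] * 3
)
adjusted = adjusted_mean_outcomes(rows)
choice = best_treatment(adjusted)
```

In this toy data the unadjusted means favour treatment A (6.75 against 5.25) only because A is given mostly to mild cases. After adjusting for severity, B has the higher expected outcome (6.5 against 5.5), which shows why the confounder has to enter the estimate.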
Conclusion
WP3 is central to the SYNTHEMA project’s goal of improving personalized care in rare diseases by integrating data scattered across different hospitals in a secure and private way. By developing data anonymization and synthetic data generation pipelines, WP3 supports both the protection of sensitive health data and its utility for research. The progress across the tasks reflects sustained technical work and close collaboration with partners, and lays the groundwork for applying these methods in healthcare research and treatment.