WP1 – Data Collection, Harmonization, and Interoperability
Main Purpose and Objectives
Work Package 1 (WP1) is a cornerstone of the SYNTHEMA project, aiming to advance personalized healthcare through AI and big data analytics. WP1 focuses on formulating a robust strategy for data collection, harmonization, and interoperability, with special emphasis on Sickle Cell Disease (SCD) and Acute Myeloid Leukemia (AML). The key objectives of WP1 include designing a comprehensive strategy for data collection, analyzing user requirements, developing a harmonized data and metadata model to ensure GDPR compliance, identifying appropriate ICT standards, and coordinating ethics management, data collection, and processing at clinical sites.
Focus on the SCD Use Case
Sickle Cell Disease (SCD) presents significant clinical challenges, which WP1 aims to address through AI. The primary goals are to develop personalized predictive models by integrating genomic, metabolomic, functional, and clinical data; create AI models to predict risk scores and timing for complications such as recurrent vaso-occlusive crises, acute chest syndrome, cerebral silent infarcts, stroke, renal disease, and hepatic failure; and develop AI algorithms for imaging diagnosis, focusing on cerebral silent infarcts using MRI. To support these aims, WP1 plans to share synthetic or anonymized datasets, use a central data management node (UPM), employ synthetic data generation (SDG) to augment patient populations, and impute missing variables in clinical and genetic data.
The data collection strategy for SCD involves retrieving clinical and research data from clinical partners such as VHIR, UMCU, UNIPD, APHP, Hospital Da Luz, and Charité. The clinical dataset is based on the minimum dataset developed by RADeep, the Rare Anemia Disorders European Epidemiological Platform. It has been expanded to meet the objectives of SYNTHEMA, ensure high data quality, and serve as a European SCD registry for both retrospective and prospective data, in line with ENROL, the central registry of the ERN-EuroBloodNet.
Focus on the AML Use Case
For Acute Myeloid Leukemia (AML), WP1 aims to leverage SDG to enhance data availability and quality. The main objectives include sharing AML synthetic and anonymized datasets while ensuring GDPR compliance, using SDG to increase patient numbers in specific subtypes, balancing classes within datasets through controlled synthetic data generation, and performing data imputation based on statistical distributions. This approach aims to improve AI-based scores for predicting survival and treatment response. The target dataset comprises 2,500 AML patients, with clinical, omics, and imaging data being collected from both public and clinical partner sources.
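As a rough illustration of two of the ideas above, the following Python sketch imputes missing values by sampling from the empirical distribution of the observed values and balances classes by random oversampling. It is a minimal sketch under stated assumptions, not the project's method: random duplication stands in for the far more sophisticated generative SDG models, and the function names, the `subtype` field, and the toy values are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def impute_from_empirical(values, rng):
    """Replace None entries with draws from the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    return [v if v is not None else rng.choice(observed) for v in values]


def oversample_to_balance(records, label_key, rng):
    """Randomly duplicate minority-class records until all classes match the largest."""
    by_class = defaultdict(list)
    for rec in records:
        by_class[rec[label_key]].append(rec)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Top up smaller classes with random copies of their own records.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy column with missing entries (not project data).
ages = [54, None, 61, 47, None, 70]
imputed_ages = impute_from_empirical(ages, rng)

# Hypothetical toy cohort in which subtype "B" is under-represented.
cohort = [{"subtype": "A"}] * 5 + [{"subtype": "B"}] * 2
balanced = oversample_to_balance(cohort, "subtype", rng)
class_counts = Counter(rec["subtype"] for rec in balanced)
```

Sampling from observed values preserves the marginal distribution of a variable but ignores its correlations with other variables, which is one reason model-based imputation and generative approaches are usually preferred for clinical data.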
The data collection strategy for AML involves using a public dataset comprising clinical, genomic, and treatment data, as well as retrieving AML data from clinical partners such as ICH, VHIR, UMCU, UNIPD, Hospital Da Luz, and Charité. Initial steps have included defining a minimum dataset, ensuring image availability, and implementing the dataset as a specific disease registry under the ENROL platform to create a dedicated eCRF. This process aims to ensure high data quality and serve as a European AML registry for both retrospective and prospective data, in collaboration with the ERN-EuroBloodNet.
To ensure data harmonization across the SCD and AML use cases, WP1 is developing a unified data and metadata model. This model is based on data standards (FHIR, OMOP), taxonomies (HL7 v3, CEN 13606, CDISC, ICD), and electronic case report forms (eCRFs). Adhering to FAIR principles, the project aims to make data findable, accessible, interoperable, and reusable.
Data Transformation Plan
WP1 is also formulating a data transformation plan to convert collected data into formats suitable for AI algorithms, leveraging the GYDRA data preparation tool (MIDAS). The plan includes implementing ETL code for automatic mapping of new datasets, once the underlying logic is defined.
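One way to picture automatic mapping of new datasets is a declarative rule table that is applied to every incoming row. The Python sketch below is purely illustrative and assumes nothing about GYDRA or the project's actual ETL code: the source column names (`PatID`, `Sex`, `YoB`), the target field names, and the `FieldRule` and `transform` helpers are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class FieldRule:
    source: str                    # column name in the incoming dataset
    target: str                    # field name in the harmonized model
    convert: Callable[[str], Any]  # value transformation applied on the way


def normalize_sex(raw):
    """Map site-specific sex codes to one controlled vocabulary."""
    lookup = {"m": "male", "male": "male", "f": "female", "female": "female"}
    return lookup.get(raw.strip().lower(), "unknown")


# Hypothetical mapping for one clinical site; a new site would only need
# a new rule list, not new transformation code.
RULES = [
    FieldRule("PatID", "patient_id", str.strip),
    FieldRule("Sex", "sex", normalize_sex),
    FieldRule("YoB", "birth_year", int),
]


def transform(row, rules):
    """Apply every rule to one source row and return the harmonized record."""
    missing = [r.source for r in rules if r.source not in row]
    if missing:
        raise KeyError(f"source columns missing: {missing}")
    return {r.target: r.convert(row[r.source]) for r in rules}


site_row = {"PatID": " P-001 ", "Sex": "F", "YoB": "1987"}
harmonized = transform(site_row, RULES)
```

Keeping the mapping logic in data rather than in code is one possible design choice here: once the rules for a source are agreed, they can be reviewed by clinical partners and reused without touching the transformation routine.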
Planning and Action Items (Months 19-24)
SCD Use Case
- Finalizing data quality analysis of existing datasets.
- Continuing the mapping of the CRF.
- Developing and deploying the REDCap solution for data collection in Germany and Portugal.
AML Use Case
- Finalizing the dedicated eCRF to ensure high data quality.
- Establishing the European AML registry in collaboration with ENROL and the ERN-EuroBloodNet.
- Collaborating with DATAWIZARD for OMOP data mapping.
Conclusion
WP1 is making significant progress in collecting and harmonizing high-quality data for both SCD and AML use cases. By developing robust strategies and utilizing innovative AI applications, WP1 aims to revolutionize personalized healthcare. The project’s focus on creating and sharing synthetic and anonymized datasets, adhering to GDPR and ethical standards, and leveraging advanced data processing tools underscores its commitment to enhancing clinical outcomes through precise and personalized treatment approaches.