CEHRGPT

Description
CEHRGPT is a synthetic data generation model for structured electronic health records (EHR). It applies language-modeling techniques from natural language processing to patient records, producing realistic, anonymized patient data that can be used for research and development without compromising patient privacy.
Features
- Synthetic Patient Data Generation: Generates comprehensive patient profiles including demographics, medical history, treatment courses, and outcomes.
- Privacy-Preserving: Implements techniques to ensure the generated data does not reveal identifiable information.
- Compatibility with OMOP: Fully compatible with the OMOP common data model, allowing seamless integration with existing healthcare data systems.
- Extensible: Designed to be adaptable to new datasets and different EHR systems.
Installation
To install CEHRGPT, clone this repository and install the required dependencies.
git clone https://github.com/knatarajan-lab/cehrgpt.git
cd cehrgpt
pip install .
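After installing, it can be useful to confirm that the package is visible to the interpreter you plan to use. The snippet below is a minimal sketch using only the standard library. It assumes the import name is cehrgpt, which matches the module paths used in the commands in this README, and the helper name is_installed is introduced here for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found by the current interpreter."""
    return importlib.util.find_spec(package_name) is not None


# Example: report whether the package from this repository is importable.
print("cehrgpt installed:", is_installed("cehrgpt"))
```

If this prints False, check that pip and python point to the same environment.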
Pretrain
Pretrain CEHRGPT with the Hugging Face Trainer. The parameters are set in the sample configuration YAML file passed to the runner.
mkdir -p test_results
mkdir -p test_dataset_prepared
python -u -m cehrgpt.runners.hf_cehrgpt_pretrain_runner sample_configs/cehrgpt_pretrain_sample_config.yaml
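If you prefer to drive the pretraining step from Python (for example inside a larger pipeline), the same steps can be sketched as follows. This is an illustrative wrapper, not part of CEHRGPT: the function names prepare_output_dirs and build_pretrain_command are hypothetical, and the module path and directory names are copied from the commands above.

```python
import sys
from pathlib import Path


def prepare_output_dirs(base_dir="."):
    """Create the two working directories used in this README, if missing."""
    created = []
    for name in ("test_results", "test_dataset_prepared"):
        path = Path(base_dir) / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


def build_pretrain_command(config_path):
    """Build the argument list for the pretraining runner shown above."""
    return [
        sys.executable,
        "-u",
        "-m",
        "cehrgpt.runners.hf_cehrgpt_pretrain_runner",
        str(config_path),
    ]


# To launch the run, pass the command to subprocess, for example:
#   import subprocess
#   prepare_output_dirs()
#   subprocess.run(
#       build_pretrain_command("sample_configs/cehrgpt_pretrain_sample_config.yaml"),
#       check=True,
#   )
```

Using sys.executable keeps the run in the same environment in which CEHRGPT was installed.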
Generate synthetic sequences
Generate synthetic patient sequences with the pretrained model.
export TRANSFORMERS_VERBOSITY=info
export CUDA_VISIBLE_DEVICES="0"
python -u -m cehrgpt.generation.generate_batch_hf_gpt_sequence \
--model_folder test_results \
--tokenizer_folder test_results \
--output_folder test_results \
--num_of_patients 128 \
--batch_size 32 \
--buffer_size 128 \
--context_window 1024 \
--sampling_strategy TopPStrategy \
--top_p 1.0 --temperature 1.0 --repetition_penalty 1.0 \
--epsilon_cutoff 0.00 \
--demographic_data_path sample_data/pretrain
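To make the sampling flags more concrete, here is a small, generic sketch of temperature scaling followed by top-p (nucleus) filtering. It is not CEHRGPT's TopPStrategy implementation, only a standard-library illustration of what the --top_p and --temperature values control. The function names are introduced here for illustration, and --repetition_penalty and --epsilon_cutoff are not modeled.

```python
import math
import random


def top_p_filter(logits, top_p=1.0, temperature=1.0):
    """Return renormalized probabilities after temperature scaling and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Zero out the remaining tokens and renormalize the kept ones.
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample_token(logits, top_p=1.0, temperature=1.0, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = top_p_filter(logits, top_p, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With top_p set to 1.0 and temperature set to 1.0, as in the command above, no token is filtered out and the distribution is left unscaled, so sampling follows the model's own probabilities. Lowering top_p restricts sampling to the most likely tokens.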
Convert synthetic sequences to OMOP
# Set this to the path of your OMOP vocabulary directory (no spaces around "=")
export OMOP_VOCAB_DIR=""
export SPARK_WORKER_INSTANCES="1"
export SPARK_WORKER_CORES="8"
export SPARK_EXECUTOR_CORES="2"
export SPARK_DRIVER_MEMORY="2g"
export SPARK_EXECUTOR_MEMORY="2g"
sh scripts/omop_pipeline.sh \
test_results/top_p10000/generated_sequences/ \
test_results/top_p10000/restored_omop/ \
"$OMOP_VOCAB_DIR"
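Because the vocabulary path above is left empty as a placeholder, it can help to validate the environment before starting the Spark job. The check below is a hedged sketch: check_omop_settings is a hypothetical helper, and it only inspects the variable names exported in this section, on the assumption that the pipeline script reads them.

```python
import os


def check_omop_settings(env):
    """Return a list of human-readable problems found in the given environment mapping."""
    problems = []
    vocab_dir = env.get("OMOP_VOCAB_DIR", "")
    if not vocab_dir:
        problems.append("OMOP_VOCAB_DIR is empty or unset")
    elif not os.path.isdir(vocab_dir):
        problems.append(f"OMOP_VOCAB_DIR is not a directory: {vocab_dir}")
    # The Spark core and instance counts are expected to be positive integers.
    for name in ("SPARK_WORKER_INSTANCES", "SPARK_WORKER_CORES", "SPARK_EXECUTOR_CORES"):
        value = env.get(name, "")
        if not value.isdigit() or int(value) < 1:
            problems.append(f"{name} should be a positive integer, got {value!r}")
    return problems
```

Calling check_omop_settings(os.environ) before running the pipeline script returns an empty list when the settings look usable, and a list of messages otherwise.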
Citation
@article{cehrgpt2024,
title={CEHRGPT: Synthetic Data Generation for Electronic Health Records},
author={Natarajan, K and others},
journal={arXiv preprint arXiv:2402.04400},
year={2024}
}