Shenghuan (Harry) Sun

UCSF Bakar Computational Health Sciences Institute

Predicting the cancer therapy regimen from social work notes using natural language processing

Phenomenon: To date, most research on social determinants of health has focused on structured data. Social work notes provide a more open-ended view because they contain information typically not recorded in structured fields. To our knowledge, little research has used social work notes directly for clinical outcome prediction. Here we present an example of using a deep learning method, Bidirectional Encoder Representations from Transformers (BERT), to predict the therapy regimens administered to patients with breast cancer. In addition, we developed a novel hierarchical BERT model for prediction over long sequences of clinical notes, which improved model performance. Our cohort included all patients treated at UCSF for breast cancer, identified using ICD-9 code 174 and ICD-10 code C50. We retrieved patient clinical notes categorized under social work from the UCSF Clinical Data Warehouse. We then annotated whether each patient had received a targeted therapy, based on the National Cancer Institute Targeted Cancer Therapies Fact Sheet. The dataset was split into training, validation, and test sets. We then implemented an end-to-end BERT-based classification model to predict whether a breast cancer patient at UCSF received targeted therapy. To use long sequences of clinical notes for prediction, we built a multi-step BERT model (BERT-long). The first step divides a long sequence of notes into multiple independent instances and trains a single BERT classifier on the individual chunks in the training set. The second step concatenates the BERT representations of all notes from the same patient and feeds them into a multilayer perceptron for training. We obtained 14,921 social work clinical notes on 2,868 patients, of whom 70% received targeted therapy. We successfully implemented the BERT framework to make use of the rich social work notes at UCSF.
We were able to consistently predict whether or not targeted therapies were administered using only social work notes, which suggests systematic differences in therapy administration due to social determinants of health and clinical factors. The UCSF-BERT model, which is pretrained on clinical notes at UCSF, outperformed the public language models we compared, with an AUROC of 0.675. The UCSF-BERT-long model, which leverages multiple clinical notes per patient, outperformed the UCSF-BERT model with an AUROC of 0.718.
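
The two-step BERT-long data flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `encode_chunk` is a hypothetical stand-in for a fine-tuned BERT encoder, the constants `CHUNK_TOKENS`, `EMBED_DIM`, and `MAX_CHUNKS` are illustrative, and pooling all of a patient's notes before chunking, truncating to a fixed number of chunks, and zero-padding are assumptions made only so the concatenated vector has a fixed length. The perceptron weights are random and untrained.

```python
import hashlib
import math
import random

CHUNK_TOKENS = 8   # hypothetical chunk size (BERT inputs are commonly capped at 512 tokens)
EMBED_DIM = 4      # hypothetical; stands in for the encoder's hidden size
MAX_CHUNKS = 3     # hypothetical cap so every patient vector has the same length


def split_into_chunks(tokens, size=CHUNK_TOKENS):
    """Step 1: divide a long token sequence into independent fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def encode_chunk(chunk, dim=EMBED_DIM):
    """Stand-in for a BERT encoder: a deterministic pseudo-embedding in [0, 1]."""
    digest = hashlib.sha256(" ".join(chunk).encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def patient_vector(notes, max_chunks=MAX_CHUNKS, dim=EMBED_DIM):
    """Step 2a: concatenate chunk representations for one patient, zero-padded."""
    tokens = [tok for note in notes for tok in note.split()]
    chunks = split_into_chunks(tokens)[:max_chunks]
    vec = [x for chunk in chunks for x in encode_chunk(chunk, dim)]
    return vec + [0.0] * (max_chunks * dim - len(vec))


def mlp_forward(x, w1, b1, w2, b2):
    """Step 2b: one-hidden-layer perceptron returning a probability."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)  # ReLU activation
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


# Untrained, randomly initialised weights, for illustration only.
rng = random.Random(0)
HIDDEN_UNITS = 5
IN_DIM = MAX_CHUNKS * EMBED_DIM
w1 = [[rng.uniform(-1, 1) for _ in range(IN_DIM)] for _ in range(HIDDEN_UNITS)]
b1 = [0.0] * HIDDEN_UNITS
w2 = [rng.uniform(-1, 1) for _ in range(HIDDEN_UNITS)]
b2 = 0.0

# Made-up example notes; not taken from any real record.
notes = [
    "patient needs transportation help to appointments",
    "discussed housing and insurance coverage with patient and family today",
]
x = patient_vector(notes)
prob = mlp_forward(x, w1, b1, w2, b2)
```

In a real pipeline, `encode_chunk` would be replaced by the pooled output of the chunk-level BERT classifier from the first step, and the perceptron would be trained on the concatenated vectors with the targeted-therapy labels.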

Bio: Hello! I am Harry, a PhD student at UCSF in Biological and Medical Informatics. I am co-advised by Atul Butte at UC San Francisco and Iain Carmichael at UC Berkeley. I am actively exploring the interface of biomedical research and artificial intelligence to achieve precision medicine. I work on developing and applying machine learning and deep learning approaches to large-scale biological and clinical data, including biomedical images, clinical notes, and genomic data. Looking forward to talking with you.