Identifying and Extracting Rare Disease Phenotypes with Large Language Models. (arXiv:2306.12656v1 [cs.CL])
By: <a href="http://arxiv.org/find/cs/1/au:+Shyr_C/0/1/0/all/0/1">Cathy Shyr</a>, <a href="http://arxiv.org/find/cs/1/au:+Hu_Y/0/1/0/all/0/1">Yan Hu</a>, <a href="http://arxiv.org/find/cs/1/au:+Harris_P/0/1/0/all/0/1">Paul A. Harris</a>, <a href="http://arxiv.org/find/cs/1/au:+Xu_H/0/1/0/all/0/1">Hua Xu</a> Posted: June 23, 2023
Rare diseases (RDs) are collectively common and affect 300 million people
worldwide. Accurate phenotyping is critical for informing diagnosis and
treatment, but RD phenotypes are often embedded in unstructured text and
time-consuming to extract manually. While natural language processing (NLP)
models can perform named entity recognition (NER) to automate extraction, a
major bottleneck is the development of a large, annotated corpus for model
training. Recently, prompt learning emerged as an NLP paradigm that can lead to
more generalizable results without any (zero-shot) or few labeled samples
(few-shot). Despite growing interest in ChatGPT, a revolutionary large language
model capable of following complex human prompts and generating high-quality
responses, none have studied its NER performance for RDs in the zero- and
few-shot settings. To this end, we engineered novel prompts aimed at extracting
RD phenotypes and, to the best of our knowledge, are the first the establish a
benchmark for evaluating ChatGPT’s performance in these settings. We compared
its performance to the traditional fine-tuning approach and conducted an
in-depth error analysis. Overall, fine-tuning BioClinicalBERT resulted in
higher performance (F1 of 0.689) than ChatGPT (F1 of 0.472 and 0.591 in the
zero- and few-shot settings, respectively). Despite this, ChatGPT achieved
similar or higher accuracy for certain entities (i.e., rare diseases and signs)
in the one-shot setting (F1 of 0.776 and 0.725). This suggests that with
appropriate prompt engineering, ChatGPT has the potential to match or
outperform fine-tuned language models for certain entity types with just one
labeled sample. While the proliferation of large language models may provide
opportunities for supporting RD diagnosis and treatment, researchers and
clinicians should critically evaluate model outputs and be well-informed of
their limitations.
Provided by:
http://arxiv.org/icons/sfx.gif