Development of a Consumer Health Vocabulary by Mining Health Forum Texts Based on Word Embedding: Semiautomatic Approach.
Academic Article
Overview
abstract
BACKGROUND: The vocabulary gap between consumers and professionals in the medical domain hinders information seeking and communication. Consumer health vocabularies have been developed to aid such informatics applications. This purpose is best served if the vocabulary evolves with consumers' language. OBJECTIVE: Our objective is to develop a method for identifying and adding new terms to consumer health vocabularies, so that it can keep up with the constantly evolving medical knowledge and language use. METHODS: In this paper, we propose a consumer health term-finding framework based on a distributed word vector space model. We first learned word vectors from a large-scale text corpus and then adopted a supervised method with existing consumer health vocabularies for learning vector representation of words, which can provide additional supervised fine tuning after unsupervised word embedding learning. With a fine-tuned word vector space, we identified pairs of professional terms and their consumer variants by their semantic distance in the vector space. A subsequent manual review of the extracted and labeled pairs of entities was conducted to validate the results generated by the proposed approach. The results were evaluated using mean reciprocal rank (MRR). RESULTS: Manual evaluation showed that it is feasible to identify alternative medical concepts by using professional or consumer concepts as queries in the word vector space without fine tuning, but the results are more promising in the final fine-tuned word vector space. The MRR values indicated that on an average, a professional or consumer concept is about 14th closest to its counterpart in the word vector space without fine tuning, and the MRR in the final fine-tuned word vector space is 8. Furthermore, the results demonstrate that our method can collect abbreviations and common typos frequently used by consumers. CONCLUSIONS: By integrating a large amount of text information and existing consumer health vocabularies, our method outperformed several baseline ranking methods and is effective for generating a list of candidate terms for human review during consumer health vocabulary development.