WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Word Segmentation and Medical Concept Recognition for Chinese Medical Texts

Posted on 0 Comments
Tags: Chinese Word Segmentation, Electronic Health Records, Medical Concept Recognition, Nature Language Processing

This note is for Liu, Y., Tian, Y., Chang, T.-H., Wu, S., Wan, X., & Song, Y. (2021). Exploring Word Segmentation and Medical Concept Recognition for Chinese Medical Texts. Proceedings of the 20th Workshop on Biomedical Language Processing, 213–220.

Two fundamental tasks to process Chinese electronic medical records (EMRs):

  • Chinese word segmentation
  • medical concept recognition: it assigns an EMR-related tag (e.g., organism and group) to the segmented words.


  • lack of medical domain datasets with high-quality annotations
    • the models trained in the general domain always fail to have good performance

The paper:

  • collect a Chinese EMR corpus (ACEMR), where texts from 500 EMRs (7K sentences) with human annotations for Chinese word segmentation and EMR-related tags
  • run well-known models (i.e., BiLSTM, BERT and ZEN) and existing state-of-art systems (e.g., WMSeg and TwASP) for CWS and medical concept recognition
  • experiments demonstrate that the necessity of building a dedicated medical dataset and show that models that leverage extra resources achieve the best performance for both tasks

Due to the dramatic performance drop when applying the model trained from open source corpus on the medical field, previous studies always construct Chinese medical datasets themselves and test their models on the datasets.

However, most constructed datasets used for CWS are relatively small, only roughly 100 Chinese EMRs.

Besides, the medical concept types in most existing datasets are limited to named entities (e.g. “Disease” and “Symptoms and Signs”), which fails to consider other medical concept types (e.g. “Time”)

Annotated Chinese Electronic Medical Record (ACEMR) Corpus

Data Collection

collect 500 Chinese EMRs from five departments of a local hospital, where one EMR specifically means the First Course Record in the inpatient record.

CWS and Medical Concept Annotation

Four specialists participated in the development of the annotation guideline.

  • CWS: the segmentation guidelines of the Chinese Treebank (Xia, 2000) for the general domain as well as the annotation guideline proposed by He et al. (2017) for the medical domain.
  • medical concept: the medical taxonomy defined by unified medical language system (UMLS) semantic groups (Lindberg et al., 1993) and define 7 major medical concept classes with 20 sub-classes.

an example



Published in categories Note