WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Sequence Alignment in EHR

Tags: Sequence Alignment, EHR

This note is for Huang, M., Shah, N. D., & Yao, L. (2019). Evaluating global and local sequence alignment methods for comparing patient medical records. BMC Medical Informatics and Decision Making, 19(6), 263.

Background: Sequence alignment is a way of arranging two or more sequences to identify how related they are and where their regions of similarity lie. For Electronic Health Records (EHR) data, sequence alignment helps to identify patients with similar disease trajectories, which supports more relevant and precise prognosis, diagnosis, and treatment.

Methods: two state-of-the-art global sequence alignment methods, each paired with its local counterpart:

  • dynamic time warping (DTW), and DTW for local alignment (DTWL)
  • Needleman-Wunsch algorithm (NWA), and its local counterpart, the Smith-Waterman algorithm (SWA)
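To make the global versus local distinction concrete, here is a minimal sketch of the textbook NWA and SWA score recurrences. The scoring values (`match=1`, `mismatch=-1`, `gap=-1`) and the diagnosis codes in the usage example are assumptions chosen for illustration, not the scoring system used in the paper.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score: both sequences are aligned end to end."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    return score[n][m]


def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score: best-scoring pair of sub-regions of a and b."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The floor at 0 lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            best = max(best, score[i][j])
    return best


# Illustrative diagnosis-code sequences (assumed, not from the paper).
short = ["I10", "E11", "N18"]
longer = ["J45", "I10", "E11", "N18", "K21"]
global_score = needleman_wunsch(short, longer)
local_score = smith_waterman(short, longer)
```

In this example the shorter record is fully contained in the longer one. The global score is penalized for the two unmatched events at the ends, while the local score only reflects the shared region.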


The EHR of a patient can be viewed as a temporal sequence of medical events.

Question: which type of sequence alignment method works best for EHR data?

Goal: compare the strengths and limitations of both global and local sequence alignment methods and evaluate their impact on patient similarity calculation.

This is challenging for several reasons:

  • patient medical records are complex
    • thousands of diagnosis codes
    • semantic meaning
    • varied data quality
  • no gold standard data is available for evaluating sequence alignment algorithms
    • asking experts, such as physicians, to evaluate and rank the results from different sequence alignment methods can be very subjective and expensive

The Rochester Epidemiology Project (REP), established in the mid-1960s by Dr. Leonard T. Kurland, contains complete patient medical records.

The paper only considered diagnosis information.

Synthesis of patient medical records

Starting from the 4 seed patients, synthesize 20 new patient medical records by applying one or more deletion, update, and switch operations to the seed records.
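The three operations can be sketched roughly as follows. This is only an illustration under stated assumptions: a record is modeled as a list of events, each event being a set of diagnosis codes; "switching" is taken to mean swapping the positions of two events; and the function names, the code pool, and the "1 to 3 operations per record" rule are all hypothetical rather than taken from the paper.

```python
import random


def delete_event(record, idx):
    """Return a copy of the record with the event at position idx removed."""
    return record[:idx] + record[idx + 1:]


def update_event(record, idx, new_codes):
    """Return a copy of the record with the event at idx replaced by new_codes."""
    return record[:idx] + [set(new_codes)] + record[idx + 1:]


def switch_events(record, i, j):
    """Return a copy of the record with the events at positions i and j swapped."""
    out = list(record)
    out[i], out[j] = out[j], out[i]
    return out


def synthesize(seed, n_records, code_pool, rng):
    """Create n_records variants of seed by applying up to 3 random operations.

    An operation that does not apply to a very short record is skipped.
    """
    variants = []
    for _ in range(n_records):
        rec = [set(event) for event in seed]  # work on a copy of the seed
        for _ in range(rng.randint(1, 3)):
            op = rng.choice(["delete", "update", "switch"])
            if op == "delete" and len(rec) > 1:
                rec = delete_event(rec, rng.randrange(len(rec)))
            elif op == "update":
                rec = update_event(rec, rng.randrange(len(rec)),
                                   [rng.choice(code_pool)])
            elif op == "switch" and len(rec) > 1:
                i, j = rng.sample(range(len(rec)), 2)
                rec = switch_events(rec, i, j)
        variants.append(rec)
    return variants


# Hypothetical seed record and code pool, for illustration only.
seed = [{"I10"}, {"E11", "I10"}, {"N18"}, {"K21"}]
variants = synthesize(seed, 5, ["J45", "M54"], random.Random(0))
```

Because each variant is derived from a known seed by known operations, the expected relationship between the two records is known in advance, which is what makes synthetic records usable in the absence of a gold standard.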

Metrics for patient similarity

When an event contains multiple codes, use the Jaccard index $J(X, Y) = |X \cap Y| / |X \cup Y|$ to measure the similarity between the two code sets $X$ and $Y$.
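As a rough sketch of how a set-based similarity can feed into an alignment, the snippet below computes the Jaccard index for two code sets and uses $1 - J(X, Y)$ as the local cost inside the classic global DTW recurrence. Using $1 - J$ as the DTW cost is an assumption made here for illustration; the note does not record the exact cost function used in the paper, and the example codes are made up.

```python
def jaccard(x, y):
    """Jaccard index |X ∩ Y| / |X ∪ Y| of two collections of codes."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(x & y) / len(x | y)


def dtw_distance(seq_a, seq_b):
    """Global DTW distance between two sequences of code sets.

    The local cost of matching two events is 1 - Jaccard (an assumption).
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of warping seq_a[:i] onto seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 1.0 - jaccard(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # advance both sequences
                                 cost[i - 1][j],      # repeat the event of seq_b
                                 cost[i][j - 1])      # repeat the event of seq_a
    return cost[n][m]


# Illustrative records (assumed): one visit per set of diagnosis codes.
patient_a = [{"I10"}, {"E11"}]
patient_b = [{"I10"}, {"I10"}, {"E11"}]
distance = dtw_distance(patient_a, patient_b)
```

Unlike NWA, DTW has no gap penalty: a repeated event in one record is simply warped onto the same event in the other, so `patient_a` and `patient_b` above are at distance zero.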


Pairwise global sequence alignment results

Pairwise local sequence alignment results


Limitations

  • only diagnosis codes were used in the experiments
  • only a limited number of operations were used to create synthetic patient records, with only 4 seed patients and 20 synthesized patient medical records
  • a self-defined scoring system was used to quantitatively evaluate sequence alignment


DTW (or DTWL) seemed to align better and identify more similarities between patient medical records than NWA (or SWA).

Published in categories Note