WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Knowledge Graph and Electronic Medical Records

Tags: Knowledge Graph, Electronic Medical Records, Recommendation System, Embedding, Natural Language Processing

This note covers several papers on Knowledge Graph and Electronic Medical Records.

Survey on Application of Knowledge Graph

Zou, X. (2020). A Survey on Application of Knowledge Graph. Journal of Physics: Conference Series, 1487, 012016.

  • Semantic Web
    • can be traced back to Berners-Lee’s research in 2001
    • he suggested technical standards such as Uniform Resource Identifier (URI), Resource Description Framework (RDF) and Web Ontology Language (OWL) should be promoted and developed
  • Linked Data
    • came out in 2009
    • proposes linking different datasets in the Semantic Web to each other so that they can be treated as one large, global knowledge graph
  • Knowledge Graph
    • proposed by Google in 2012 to use semantic knowledge in web search
    • Google’s KG aims to identify and disambiguate entities in text, to enrich search results with semantically structured summaries, and to provide links to related entities in exploratory search
    • Other KGs: Microsoft Bing’s Satori, Wikidata, Freebase

Two categories of research on KGs:

  • construction techniques: extraction, representation, fusion, and reasoning of the knowledge in the graphs
    • such as linking entities and relations to the KG correctly after extracting them from unstructured text
    • and inferring new facts from such a KG
  • applications: applying KGs to practical systems and specific domains


Question Answering System

  • Watson: a QA system developed by IBM that uses several KGs, such as YAGO and DBpedia, as its data sources; it defeated human experts on the quiz show Jeopardy!
  • social chatbots and digital assistants: such as XiaoIce, Cortana and Siri

Three groups of traditional QA systems:

  • semantic parsing based: transform natural language questions into logic forms that express the semantics of the whole query; the parse results are then used to generate structured queries that search knowledge bases and obtain the answers
    • Berant et al.: construct a coarse mapping between phrases and predicates
    • Fader et al.: factor questions into a set of smaller, related problems
    • good performance on complex questions, but the semantic parsers depend on a large number of hand-crafted features
  • information retrieval based: try to automatically translate natural language questions into structured queries
    • pay little attention to the semantics of natural language questions and achieve good results only on simple queries
    • rely on rules and dependency parse results to extract hand-crafted features for questions
  • embedding based:
    • learn low-dimensional vector embeddings of the given question and of the entities and relation types of Freebase, then calculate the similarity score between the question and candidate answers
    • depends on careful optimization of a matrix parameterizing the similarity adopted in the embedding space
    • achieves competitive performance without any hand-crafted features or additional systems for part-of-speech tagging, syntactic parsing, or dependency parsing during training
    • ignores word order information and cannot process complicated questions
  • in recent years, deep learning methods have been combined with these approaches to improve performance:
    • Dong et al.: use multi-column convolutional neural networks (MCCNNs) for information retrieval without relying on hand-crafted features and rules
    • Hao et al.: an end-to-end neural network model with a cross-attention mechanism, which considers various candidate answer aspects to represent the questions and their corresponding scores
    • Yih et al.: deep CNN
    • Zhang et al.: an attention-based bidirectional long short-term memory (BiLSTM) network to learn the representations of the questions when using the embedding approach
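The embedding-based scoring described above can be sketched in a few lines. This is a minimal illustration, not the method of any cited paper: the word and entity vectors are made-up integer values, the question embedding is assumed to be a plain bag-of-words sum, and the similarity is assumed to be a dot product. The names `WORD_VECS`, `ENTITY_VECS`, `embed_question`, `score`, and `best_answer` are hypothetical.

```python
# Toy embedding-based QA scorer. All vectors are made-up illustration values.

# Hypothetical 2-d word embeddings for question words.
WORD_VECS = {
    "who": [1, 0],
    "directed": [9, 1],
    "wrote": [1, 9],
    "inception": [5, 5],
}

# Hypothetical 2-d embeddings for candidate answer entities.
ENTITY_VECS = {
    "christopher_nolan": [10, 2],
    "some_novelist": [1, 10],
}


def embed_question(question):
    """Bag-of-words embedding: sum the vectors of the known words."""
    total = [0, 0]
    for word in question.lower().split():
        vec = WORD_VECS.get(word)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


def score(question, entity):
    """Similarity score: dot product of question and entity embeddings."""
    q = embed_question(question)
    e = ENTITY_VECS[entity]
    return sum(qi * ei for qi, ei in zip(q, e))


def best_answer(question):
    """Return the candidate entity with the highest similarity score."""
    return max(ENTITY_VECS, key=lambda ent: score(question, ent))
```

With these toy vectors, `best_answer("who directed inception")` picks `christopher_nolan`. Because the question embedding is a sum, shuffling the words ("inception directed who") gives exactly the same score, which is the word-order limitation noted in the list above.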

More complex tasks:

  • instead of fact-finding extractive QA, focus on multi-hop generative tasks

Recommender System

collaborative filtering: a traditional recommendation method that makes recommendations based on users’ common preferences and historical interactions.

  • problems: the sparsity of users’ data, the cold start problem
  • solution: use side information
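To make the sparsity and cold start problems concrete, here is a minimal sketch of user-based collaborative filtering. The rating data, the use of cosine similarity, and the function names `cosine` and `recommend` are all illustrative assumptions, not taken from the survey.

```python
import math

# Hypothetical user -> {item: rating} interactions (toy data).
RATINGS = {
    "alice": {"item_a": 5, "item_b": 4},
    "bob": {"item_a": 5, "item_b": 5, "item_c": 4},
    "carol": {"item_d": 5},
    "dave": {},  # new user with no history: the cold start case
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings=RATINGS):
    """Rank unseen items by similarity-weighted ratings of other users."""
    seen = ratings[user]
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(seen, other_ratings)
        if sim <= 0.0:
            continue  # no overlap with this user, so nothing to learn from
        for item, rating in other_ratings.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)
```

Here `recommend("alice")` returns `["item_c"]` via her overlap with `bob`, while `recommend("dave")` returns an empty list because a user without interactions has no neighbours. That gap is what side information is meant to fill.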

consider KGs as the source of side information:

  • improve the accuracy
  • increase the diversity of items
  • bring interpretability

generally, two approaches to build KG based recommender systems:

  • embedding based: preprocess the KG with a knowledge graph embedding (KGE) algorithm, and apply the learned entity embeddings to a recommendation framework
  • path based: design a graph algorithm that directly explores a variety of patterns of connections between nodes in the KG. Meta-graph/meta-path based latent features are extracted from the KG to represent the link between items and users along different types of relation graphs/paths.
  • other work:
    • combine the above two methods
    • complete missing facts in the KG
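The path-based idea can be sketched by counting meta-path instances in a toy KG. Everything here is a hypothetical illustration: the triples, the single meta-path shape (user, likes, item, relation, shared node, relation, candidate item), and the helper names `build_index` and `meta_path_counts`. Real systems derive latent features from such counts rather than using them directly.

```python
from collections import defaultdict

# Hypothetical toy KG as (head, relation, tail) triples.
TRIPLES = [
    ("user_1", "likes", "movie_a"),
    ("movie_a", "directed_by", "director_x"),
    ("movie_b", "directed_by", "director_x"),
    ("movie_a", "has_genre", "sci_fi"),
    ("movie_b", "has_genre", "sci_fi"),
    ("movie_c", "has_genre", "sci_fi"),
]


def build_index(triples):
    """Index tails by (head, relation) and heads by (relation, tail)."""
    forward = defaultdict(set)
    backward = defaultdict(set)
    for head, rel, tail in triples:
        forward[(head, rel)].add(tail)
        backward[(rel, tail)].add(head)
    return forward, backward


def meta_path_counts(user, relation, triples=TRIPLES):
    """Count paths user -likes-> item -relation-> x <-relation- candidate."""
    forward, backward = build_index(triples)
    liked = forward[(user, "likes")]
    counts = defaultdict(int)
    for item in liked:
        for shared in forward[(item, relation)]:
            for candidate in backward[(relation, shared)]:
                if candidate not in liked:
                    counts[candidate] += 1
    return dict(counts)
```

For `user_1`, the `directed_by` meta-path reaches only `movie_b`, while `has_genre` reaches both `movie_b` and `movie_c`. Each meta-path therefore yields a different, human-readable reason for a recommendation, which is where the interpretability of path-based methods comes from.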

Information Retrieval

more and more commercial web-based search engines are incorporating entity data from KGs to improve their search results


  • Medical:
    • Ernst et al.: construct a large biomedical science KG automatically
    • Shi et al.: integrate health data into heterogeneous textual medical knowledge
    • Goodwin et al.: incorporate the belief state of the physician for assertions in the medical record
    • Rotmensch et al.: generate a graph mapping diseases to their symptoms
    • the approaches for constructing medical KGs depend on authoritative standard medical terminology, which is lacking in some languages such as Chinese.
  • Cyber Security:
    • detect and predict dynamic attacks and safeguard people’s cyber assets
  • Financial:
  • News:
  • Education:
    • learning resource recommendation and concept visualization
  • Other Applications:
    • social network de-anonymization
    • privacy inferring process
    • image classification
    • geoscientific research network
    • combat human trafficking
    • machine translation

SMR: safe medicine recommendation

Gong, F., Wang, M., Wang, H., Wang, S., & Liu, M. (2020). SMR: Medical Knowledge Graph Embedding for Safe Medicine Recommendation. ArXiv:1710.05980 [Cs].

most existing medicine recommendation systems are mainly based on electronic medical records (EMRs)

  • limitations of EMR content: e.g., missing drug-drug interactions

many medical knowledge graphs contain drug-related information, such as DrugBank

The DrugBank database is a comprehensive, freely accessible, online database containing information on drugs and drug targets. wiki: https://en.wikipedia.org/wiki/DrugBank

However, systems that use these knowledge graphs directly suffer from poor robustness, because the graphs are incomplete.

The paper uses graph embedding learning techniques and proposes a framework called Safe Medicine Recommendation (SMR):

  • first, construct a high-quality heterogeneous graph by bridging EMRs (MIMIC-III) and medical knowledge graphs (ICD-9 ontology and DrugBank)
  • then, jointly embed diseases, medicines, patients, and their corresponding relations into a shared lower-dimensional space
  • finally, use the embeddings to decompose medicine recommendation into a link prediction process while considering the patient’s diagnoses and adverse drug reactions
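The link prediction step can be pictured with a translation-style score. The sketch below is an assumption-laden illustration rather than the SMR model: it swaps in a simple TransE-style distance $\lVert h + r - t \rVert$, uses hand-picked 2-d vectors instead of learned embeddings, and treats adverse drug-drug interactions as a hard filter. The names `ENTITY`, `RELATION`, `ADVERSE`, `distance`, and `recommend` are hypothetical.

```python
import math

# Hypothetical 2-d embeddings; in practice they would be learned from the graph.
ENTITY = {
    "patient_1": [0.0, 0.0],
    "drug_a": [1.0, 1.0],
    "drug_b": [1.0, 0.9],
    "drug_c": [3.0, 3.0],
}
RELATION = {"takes": [1.0, 1.0]}

# Hypothetical adverse drug-drug interaction pairs (e.g. taken from a drug KG).
ADVERSE = {frozenset(("drug_a", "drug_x"))}


def distance(head, relation, tail):
    """Translation-based score ||h + r - t||; smaller means more plausible."""
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def recommend(patient, candidates, current_drugs, k=2):
    """Rank candidate drugs by link plausibility, skipping unsafe ones."""
    safe = [
        d for d in candidates
        if not any(frozenset((d, c)) in ADVERSE for c in current_drugs)
    ]
    return sorted(safe, key=lambda d: distance(patient, "takes", d))[:k]
```

With no current medication, `drug_a` ranks first for `patient_1`; if the patient already takes `drug_x`, `drug_a` is filtered out and `drug_b` moves to the top. This shows how a safety constraint can change the outcome of an otherwise purely embedding-driven ranking.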

medical recommendation systems:

  • rule-based protocols: defined by clinical guidelines and experienced doctors
    • constructing, curating, and maintaining these protocols is time-consuming and labor-intensive
    • rule-based protocols might be effective for general medicine recommendation for a specific diagnosis, but give little help with tailored recommendations for complicated patients. (???)
  • supervised learning algorithms and variations, such as Multi-Instance Multi-Label (MIML) learning, have been proposed to recommend medicines for patients.
    • a predictive model is trained on input features and ground-truth information extracted from massive EMRs, and outputs multiple labels for new test data as medicine recommendations
    • therapies and treatments in clinical practice are rapidly updated, but supervised learning methods cannot deal with medicines that are not included in the training phase. (incomplete training data)

drug-drug interactions:

  • health risks

emergence of knowledge graphs:

  • DrugBank: extensive entities and relationships
  • ICD-9 ontology: a knowledge base of human diseases that can be used to classify patients’ diagnoses

linking EMRs and medical knowledge graphs to generate a large and high-quality heterogeneous graph:

  • a promising pathway for medicine recommendations in a wider scope
  • challenges:
    • computational efficiency
    • data incompleteness: medical knowledge graphs follow a long-tail distribution, like other types of large-scale knowledge bases
    • cold start:



  • medicine recommendation
    • aim to tailor treatment to the individual characteristics of each patient
    • some existing medicine recommendation systems leverage patients’ genetic/genomic information, but such information is not yet widely available in everyday clinical practice and is insufficient on its own, since it addresses only one of many factors affecting response to medication.
  • medical knowledge graphs and embeddings
    • automatic knowledge base population and completion, such as
      • Bio2RDF
      • Chem2Bio2RDF
    • medical knowledge graphs contain an abundance of basic medical facts of medicines and diseases and provide a pathway for medical discovery and applications, such as effective medicine recommendations
    • medical knowledge graphs suffer from serious data incompleteness, which impedes their application in clinical medicine
    • Celebi et al.: a knowledge graph embedding approach for drug-drug interaction prediction in realistic scenario
    • MedGraph: a new EMR embedding framework that introduces a graph-based data structure to capture both structural visit-code co-location information and temporal visit sequencing information.

An Approach to Construct Chinese Medical Knowledge Graph

Wu, Y., Zhu, X., & Zhu, Y. (2021). An Improved Approach to the Construction of Chinese Medical Knowledge Graph Based on CTD-BLSTM Model. IEEE Access, 9, 74969–74976.

  • fundamental and important tasks in the process of constructing the knowledge graph:
    • entity recognition
    • relationship extraction
  • the paper
    • proposes a model that extends Bi-LSTM structural units with double word vectors and combines them with a semi-supervised co-training method
    • applies the model, named Co-Training Double Word embedding conditional BLSTM (CTD-BLSTM), to Chinese named entity recognition and entity-relationship extraction in the Chinese medical field.
  • In recent years,
    • Chinese medical information management systems and electronic medical record (EMR) files have been rapidly adopted in medical institutions.
    • Various Chinese medical knowledge quiz communities and knowledge encyclopedias have emerged, generating a large amount of network medical data.

Research and development started relatively late in China.

  • Wang et al.: automatic construction of traditional Chinese medical (TCM) knowledge graph.
    • used knowledge templates and knowledge reasoning methods to realize knowledge quiz and drug recommendation
  • Jia et al.:

little research on entity relationship extraction in the Chinese medical field:

  • Chen et al.: use co-occurrence statistics to compute the correlation between disease entities and drug entities
  • Tang et al.: multicore learning for medical literature
  • Liu et al.: k-nearest neighbors to extract relationships from Chinese drug instructions

with the rise of deep learning:

  • researchers have used joint neural network models to extract relationships, but most of this work currently focuses on the general domain


  • data:
    1. Downloaded the “Medical Vocabulary Encyclopedia” lexicon from “Sogou Cell Dictionary”;
    2. Crawled the entries in several sub-categories (“Disease”, “Traditional Medicine”, “Anatomy”, “Health Care”, “Pharmacology”) from “Baidu Baike”;
    3. Aligned, disambiguated, and merged terms in the “Medical Vocabulary Encyclopedia” lexicon with “Baidu Baike” entries, finally obtaining 98647 entries.
  • named entity recognition: the semantic feature and context window feature had the best description effect on named entity recognition
    1. define 7 types of entity: “disease”, “symptom”, “check”, “treatment”, “drug”, “part”, and “department”
    2. use the dictionary to label a small number of corpora, and review the annotation results manually
    3. train the model iteratively, letting it annotate further corpora automatically in each round
  • entity-relationship extraction
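The dictionary-labelling step in the named entity recognition list above can be sketched as follows. This is only an illustration under stated assumptions: the paper's actual labelling procedure is not described in these notes, so forward maximum matching with BIO tags is swapped in, English tokens stand in for Chinese text, and `LEXICON`, `MAX_LEN`, and `dictionary_label` are hypothetical names.

```python
# Hypothetical dictionary: term (as a tuple of tokens) -> entity type.
LEXICON = {
    ("type", "2", "diabetes"): "disease",
    ("diabetes",): "disease",
    ("metformin",): "drug",
    ("headache",): "symptom",
}
MAX_LEN = max(len(term) for term in LEXICON)


def dictionary_label(tokens):
    """Tag tokens with BIO labels using forward maximum matching."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest dictionary term that could start at position i.
        for length in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + length])
            if span in LEXICON:
                etype = LEXICON[span]
                tags[i] = "B-" + etype
                for j in range(i + 1, i + length):
                    tags[j] = "I-" + etype
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return tags
```

Labels produced this way are noisy (ambiguous or unseen terms are missed), which is why the notes mention a manual review before the labelled seed corpus is used for training.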

Learning a Health Knowledge Graph from EMRs

Rotmensch, M., Halpern, Y., Tlimat, A., Horng, S., & Sontag, D. (2017). Learning a Health Knowledge Graph from Electronic Medical Records. Scientific Reports, 7(1), 5994.

  • explore an automated process to learn high quality knowledge bases linking diseases and symptoms directly from EMRs.
    • medical concepts were extracted from 273,124 de-identified patient records
    • MLE of three probabilistic models was used to automatically construct knowledge graphs: logistic regression, naive Bayes classifier and a Bayesian network using noisy OR gates
    • a graph of disease-symptom relationships was elicited from the learned parameters and the constructed knowledge graphs were evaluated and validated against Google’s manually-constructed knowledge graph and against expert physician opinions
    • direct and automated construction of high quality health knowledge graphs from medical records using rudimentary concept extraction is feasible.
    • the noisy OR model produces a high quality knowledge graph, and it significantly outperforms the other tested models across evaluation frameworks
  • previous work considered the use of natural language processing to find relationships between diseases and symptoms from unstructured or semi-structured data.
    • e.g., IBM’s WatsonPaths and the symptom checker Isabel made use of medical textbooks, journals, and trusted web content
  • EMR data is difficult to interpret for four main reasons:
    1. the text of physician and nursing notes is less formal than that of traditional textbooks, making it difficult to consistently identify disease and symptom mentions.
    2. textbooks and journals often present simplified cases that relay only the most typical symptoms, to promote learning. EMR data presents real patients with all of the comorbidities, confounding factors, and nuances that make them individuals
    3. textbooks state the relationships between diseases and symptoms in a declarative manner, whereas the associations between diseases and symptoms in the EMR are statistical, making it easy to confuse correlation with causation
    4. the manner in which observations are recorded in the EMR is filtered through the decision-making process of the treating physician. Information deemed irrelevant may be omitted or not pursued, leading to information missing not at random.
  • use of Google health knowledge graph (GHKG):
    • the set of diseases and symptoms considered was chosen from the GHKG to establish a basis for later comparison
    • aliases and acronyms were obtained both from GHKG and UMLS
    • (in other words, no need to recognize if the concept is disease or symptom?)

For logistic regression and naive Bayes, learn a model separately for each disease.

  • logistic regression: $b_{ij}$ is the weight associated with symptom $i$ in the logistic regression model fit to predict disease $j$; the importance is then measured by $\max(b_{ij}, 0)$
  • naive Bayes: the importance measure for naive Bayes is taken to be $\log(p(x_i=1\mid y_j=1)) - \log(p(x_i=1\mid y_j=0))$, where $x_i$ is the binary variable denoting the presence of symptom $i$ and $y_j$ is the binary variable denoting the presence of disease $j$.
  • noisy OR: a conditional probability distribution that describes the causal mechanisms by which nodes affect the states of children nodes
    • in a deterministic noise free setting, the presence of an underlying disease would always cause its symptoms to be observed, and a symptom could be observed if any of its parent diseases are “on”, e.g., a patient would have a fever if he/she contracted the flu or if he/she has mononucleosis.
    • in practice, far less deterministic: a patient may not present with a fever even if he/she has the flu. Additionally, fever might occur as a result of neither flu nor mononucleosis.
    • noisy OR deals with the inherent noise in the process by introducing failure and leak probabilities. Specifically, a disease $y_j$ that is present might fail to turn on its child symptom $x_i$ with probability $f_{ij}$. The leak probability $l_i$ represents the probability of symptom $x_i$ being on even if all of its parent diseases are off. Then \(P(x_i=1\mid y_1,\ldots,y_n) = 1 - (1-l_i)\prod_j f_{ij}^{y_j}\,,\) and $1-f_{ij}$ is taken as the importance.
  • discussions:
    • the GHKG was designed to provide useful information to web users and is not exhaustive; some symptoms it omits were labeled as highly relevant by both clinical evaluators, which demonstrates that an EMR data-driven approach can uncover relevant symptoms
    • higher recall achieved by the model
    • differences also involve the precision of the language used
    • heightened severity

Robustly Extract Medical Knowledge from EHRs

Chen, I. Y., Agrawal, M., Horng, S., & Sontag, D. (2019). Robustly Extracting Medical Knowledge from EHRs: A Case Study of Learning a Health Knowledge Graph. In Biocomputing 2020 (Vol. 1–0, pp. 19–30). WORLD SCIENTIFIC.

  • EHRs can algorithmically learn medical knowledge:
    • e.g., a causal health knowledge graph could learn relationships between diseases and symptoms and then serve as a diagnostic tool to be refined with additional clinical input

prior research has demonstrated the ability to construct such a graph from over 270,000 emergency department patient visits. (MIMIC-III?)

  • the paper:
    • describe methods to evaluate a health knowledge graph for robustness
    • analyze for which diseases and for which patients the graph is most accurate
    • identify sample size and unmeasured confounders as major sources of error in the health knowledge graph
    • introduce a method to leverage non-linear functions in building the causal graph to better understand existing model assumptions.
    • to assess model generalizability, extend to a larger set of complete patient visits within a hospital system.
  • Because of the nature of data collection, models learned on EHRs like health knowledge graphs are subject to many sources of statistical bias.
    1. narrow sample sizes of subsets of the data can cause underfitting despite the larger scale of the entire dataset
    2. confounders not measured by the data may bias the reliability of resulting models
    3. algorithms or findings from algorithms may not generalize to entirely different populations

Published in categories Note