WeiYa's Work Yard

A dog, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

Self-organized Maps of Document Collections

Posted on 0 Comments
Tags: Self-organized Map

This note is for Kaski, S., Honkela, T., Lagus, K., & Kohonen, T. (1998). WEBSOM – Self-organizing maps of document collections

WEBSOM: a textual document collection may be organized onto a graphical map display that provides an overview of the collection and facilitates interactive browsing.

  • each document is encoded as a histogram of word categories which are formed by the self-organizing map (SOM) algorithm based on the similarities in the contexts of the words
  • the encoded documents are organized on another self-organizing map, a document map, on which nearby locations contain similar documents.

Introduction

basic approaches for information retrieval and data mining in texual document collections are

  • searching using keywords or key documents
  • exploration of the collection referring to some organization or categorization of the documents
  • filtering of interesting documents from the incoming document stream

one of the traditional methods of searching for texts that match a query is to index all the words (terms) that have appeared in the document collection.

  • the query itself, typically a list of appropriate keywords, is compared with the term list of each document to find documents that match the query
  • terms can be combined by Boolean logic in order to control the breadth of matching.
  • three fundamental problems:
    • recall and precision of retrieval are sensitive
    • no indication on how many valuable documents were not retrieved
    • difficult to formulate the query if the domain of the query is not known well

one basic problem of information retrieval: the same idea or concept can be presented in many different ways

  • enormous variation in expression
  • authors use their unique style

one suggested solution: thesaurus

The paper proposed WEBSOM:

  • a tool especially in exploring a document collection but also in searching and filtering tasks

Vector space model

the resulting vectors can be thought to represent word histograms of the documents

  • $\bfe_i$: unit vector with one in the $i$-th component
  • $n_{ji}$: the frequency of occurrence of the word indexed with $i$ in the $j$-th document
\[\bfn_j = \sum_k n_{jk}\bfe_k\]

two major problems:

  • the dimensionality
  • the orthogonality of the vectors $\bfe_k$ that correspond to the words, in which semantic relationships of the words are not taken into account.

Latent semantic indexing

\[\bfn_j' = \sum_k n_{jk}\bfx_k'\]

where $\bfx_k’$ is the code that LSI forms for the $k$-th word by investigating the co-occurences of the words within the documents.

Perform SVD on the term-by-document matrix, and then discard the factors that have the least influence on the matrix, and then form $\bfx_i’$ by using only the remaining factors.

MatchPlus

rather like LSI, but the dimensions of $x_k’$ are chosen manually to correspond to a set of important “basis terms”.

Random mapping method

replace the unit vectors with random vectors

\[\bfx_j = \sum_k n_{jk}\bfr_k\]

then

\[\bfx_j = R\bfn_j\]

WEBSOM

Using Contextual information: Word category map

word category maps are SOMs which have organized words according to similarities in their roles in different contexts.

  • Each unit on the SOM corresponds to a word category that consists of a set of similar words.

the use of SOM requires that the input is presented as numerical vectors and that a metric for comparing the vectors is available.

  • not the appearance
  • the sentential context

Document maps based on a two-level architecture

the document map is formed with the SOM algorithm using the histograms as fingerprints of the documents.


Published in categories Note