Self-organizing Maps of Document Collections
This note is on Kaski, S., Honkela, T., Lagus, K., & Kohonen, T. (1998). WEBSOM – Self-organizing maps of document collections.
WEBSOM: a textual document collection may be organized onto a graphical map display that provides an overview of the collection and facilitates interactive browsing.
- each document is encoded as a histogram of word categories, which are formed by the self-organizing map (SOM) algorithm based on similarities in the contexts of the words
- the encoded documents are organized on another self-organizing map, a document map, on which nearby locations contain similar documents.
Introduction
basic approaches to information retrieval and data mining in textual document collections are:
- searching using keywords or key documents
- exploration of the collection referring to some organization or categorization of the documents
- filtering of interesting documents from the incoming document stream
one of the traditional methods of searching for texts that match a query is to index all the words (terms) that appear in the document collection (a toy inverted index is sketched after the list below).
- the query itself, typically a list of appropriate keywords, is compared with the term list of each document to find documents that match the query
- terms can be combined by Boolean logic in order to control the breadth of matching.
- three fundamental problems:
- recall and precision of retrieval are sensitive to the exact choice of search terms in the query
- no indication of how many valuable documents were not retrieved
- difficult to formulate the query if the domain of the query is not known well
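To make the traditional term-indexing approach concrete, here is a minimal Python sketch of an inverted index queried with Boolean AND/OR; the toy documents and helper names are invented for illustration.

```python
from collections import defaultdict

# Toy collection: document ids mapped to their texts.
docs = {
    0: "self organizing maps of document collections",
    1: "information retrieval with boolean queries",
    2: "exploring document collections with maps",
}

# Build an inverted index: each term points to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Boolean matching: AND narrows the result set, OR broadens it.
def boolean_and(*terms):
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(*terms):
    return set.union(*(index.get(t, set()) for t in terms))

print(boolean_and("document", "maps"))   # {0, 2}
print(boolean_or("boolean", "maps"))     # {0, 1, 2}
```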
one basic problem of information retrieval: the same idea or concept can be presented in many different ways
- enormous variation in expression
- authors use their unique style
one suggested solution: a thesaurus that maps the different expressions of the same concept onto common terms
The paper proposed WEBSOM:
- a tool especially for exploring a document collection, but also useful in searching and filtering tasks
Related methods for document encoding
Vector space model
each document $j$ is encoded as a vector
\[\bfn_j = \sum_i n_{ji}\bfe_i\]
the resulting vectors can be thought of as word histograms of the documents (a toy construction is sketched after the list below)
- $\bfe_i$: unit vector with one in the $i$-th component
- $n_{ji}$: the frequency of occurrence of the word indexed with $i$ in the $j$-th document
two major problems:
- the high dimensionality (the vectors are as long as the vocabulary of the collection)
- the orthogonality of the vectors $\bfe_k$ that correspond to the words, which means that semantic relationships between words are not taken into account
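A minimal sketch of the vector space encoding described above, assuming a toy vocabulary; each row of the resulting matrix is a document's word histogram $\bfn_j$.

```python
import numpy as np

# Toy vocabulary and tokenized documents; in the real setting the
# vocabulary of a large collection has tens of thousands of entries.
vocab = ["map", "document", "word", "search", "query"]
word_index = {w: i for i, w in enumerate(vocab)}

docs = [
    ["document", "map", "map", "word"],
    ["search", "query", "document"],
]

# n[j, i] = frequency of word i in document j; each row is the
# histogram vector n_j = sum_i n_{ji} e_i from the formula above.
n = np.zeros((len(docs), len(vocab)))
for j, doc in enumerate(docs):
    for w in doc:
        n[j, word_index[w]] += 1

print(n)
# [[2. 1. 1. 0. 0.]
#  [0. 1. 0. 1. 1.]]
```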
Latent semantic indexing
\[\bfn_j' = \sum_k n_{jk}\bfx_k'\]
where $\bfx_k'$ is the code that LSI forms for the $k$-th word by investigating the co-occurrences of the words within the documents.
Perform SVD on the term-by-document matrix, discard the factors that have the least influence on the matrix, and form $\bfx_k'$ using only the remaining factors.
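A rough sketch of this idea using NumPy's SVD; the toy matrix, the number of retained factors, and the scaling convention for the word codes are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Term-by-document matrix A (rows = terms, columns = documents);
# here simply the transpose of the toy histogram matrix n from above.
A = np.array([[2., 0.],
              [1., 1.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])

# Full SVD, then keep only the d factors with the largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
d = 1  # toy number of latent factors retained

# Word code x_k': row k of the truncated, singular-value-weighted U.
word_codes = U[:, :d] * s[:d]

# Document code n_j' = sum_k n_{jk} x_k' (frequency-weighted sum of word codes).
doc_codes = A.T @ word_codes

print(word_codes.shape, doc_codes.shape)   # (5, 1) (2, 1)
```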
MatchPlus
rather like LSI, but the dimensions of $\bfx_k'$ are chosen manually to correspond to a set of important “basis terms”.
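As a loose illustration of that idea only (not the actual MatchPlus algorithm), one could give each word a vector whose $b$-th dimension records its co-occurrence with the $b$-th hand-picked basis term; the basis terms and sentences below are invented.

```python
import numpy as np

# Hypothetical, hand-picked basis terms; one vector dimension per term.
basis_terms = ["map", "search", "document"]

# Toy sentences used to count co-occurrences.
sentences = [
    ["the", "map", "shows", "the", "document"],
    ["search", "the", "document", "collection"],
]

vocab = sorted({w for s in sentences for w in s})
codes = {w: np.zeros(len(basis_terms)) for w in vocab}

# Dimension b of a word's code counts how often the word appears in the
# same sentence as basis term b.
for sent in sentences:
    for w in sent:
        for b, term in enumerate(basis_terms):
            if term in sent and term != w:
                codes[w][b] += 1

print(codes["document"])   # co-occurrence with map, search, document
```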
Random mapping method
replace the orthogonal unit vectors $\bfe_k$ with lower-dimensional random vectors $\bfr_k$
\[\bfx_j = \sum_k n_{jk}\bfr_k\]
or, in matrix form,
\[\bfx_j = R\bfn_j\]
where $R$ is the matrix whose $k$-th column is the random vector $\bfr_k$.
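A minimal sketch of random mapping, reusing the toy histogram from the vector space sketch; the dimensions are arbitrary toy values, and in practice the reduced dimension is much smaller than the vocabulary size.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 5     # toy vocabulary size (tens of thousands in practice)
reduced_dim = 3    # toy reduced dimension (much smaller than the vocabulary)

# One random column r_k per word replaces the orthogonal unit vector e_k.
R = rng.normal(size=(reduced_dim, vocab_size))

# Word histogram n_j taken from the vector space sketch above.
n_j = np.array([2., 1., 1., 0., 0.])

# Random mapping: x_j = sum_k n_{jk} r_k = R n_j.
x_j = R @ n_j
print(x_j)   # a 3-dimensional document code
```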
WEBSOM
Using contextual information: word category map
word category maps are SOMs which have organized words according to similarities in their roles in different contexts.
- Each unit on the SOM corresponds to a word category that consists of a set of similar words.
the use of the SOM requires that the input is presented as numerical vectors and that a metric for comparing the vectors is available; the question is how to encode the words:
- the words are not encoded by their appearance (their character strings)
- instead, each word is encoded by its averaged sentential context, i.e. the words that typically occur around it (a toy construction is sketched below)
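A toy construction of such context vectors, assuming each word is represented by the averaged random codes of its immediate predecessor and successor over all occurrences; the paper's exact context definition may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

corpus = "the map shows the document the document describes the map".split()
vocab = sorted(set(corpus))
dim = 4  # toy dimensionality of the random word codes

# A random code vector for every word (as in the random mapping sketch).
code = {w: rng.normal(size=dim) for w in vocab}

# Context vector of a word: the average code of its predecessors,
# concatenated with the average code of its successors, over all
# occurrences of the word in the corpus.
left = {w: np.zeros(dim) for w in vocab}
right = {w: np.zeros(dim) for w in vocab}
count = {w: 0 for w in vocab}
for i in range(1, len(corpus) - 1):
    w = corpus[i]
    left[w] += code[corpus[i - 1]]
    right[w] += code[corpus[i + 1]]
    count[w] += 1

context = {w: np.concatenate([left[w], right[w]]) / max(count[w], 1)
           for w in vocab}

# These context vectors are the inputs on which the word category map
# (itself a SOM) would be trained.
print(context["document"].shape)   # (8,)
```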
Document maps based on a two-level architecture
the document map is formed with the SOM algorithm using the word-category histograms as fingerprints of the documents: each word of a document is mapped onto the word category map, and the resulting histogram of hits is the input vector for the second SOM.
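A compact sketch of the two-level architecture with a minimal online SOM; the grid sizes, learning schedule, and random stand-in data are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(4, 4), epochs=50, lr0=0.5, radius0=2.0):
    """Minimal online SOM: winner search plus Gaussian-neighborhood
    updates with a shrinking learning rate and radius."""
    h, w = grid
    dim = data.shape[1]
    weights = rng.normal(size=(h * w, dim))
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            t = step / n_steps
            lr = lr0 * (1 - t)
            radius = radius0 * (1 - t) + 0.5
            winner = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
            d2 = ((coords - coords[winner]) ** 2).sum(axis=1)
            g = np.exp(-d2 / (2 * radius ** 2))                   # neighborhood function
            weights += lr * g[:, None] * (x - weights)
            step += 1
    return weights

def bmu(weights, x):
    return int(np.argmin(((weights - x) ** 2).sum(axis=1)))

# --- Level 1: word category map, trained on word context vectors ---
# (random stand-ins here; in WEBSOM these come from the sentential contexts)
n_words, ctx_dim = 30, 8
contexts = rng.normal(size=(n_words, ctx_dim))
word_map = train_som(contexts, grid=(3, 3))

# --- Level 2: encode each document as a histogram of hits on the word map ---
docs = [rng.integers(0, n_words, size=20) for _ in range(10)]  # toy docs: word ids
fingerprints = np.zeros((len(docs), word_map.shape[0]))
for j, doc in enumerate(docs):
    for word_id in doc:
        fingerprints[j, bmu(word_map, contexts[word_id])] += 1
fingerprints /= fingerprints.sum(axis=1, keepdims=True)

# --- The document map: a second SOM trained on the fingerprints ---
doc_map = train_som(fingerprints, grid=(4, 4))
print(doc_map.shape)   # (16, 9): one prototype fingerprint per map unit
```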