# Self-organized Maps of Document Collections

##### Posted on Jun 12, 20210 Comments
Tags: Self-organized Map

WEBSOM: a textual document collection may be organized onto a graphical map display that provides an overview of the collection and facilitates interactive browsing.

• each document is encoded as a histogram of word categories which are formed by the self-organizing map (SOM) algorithm based on the similarities in the contexts of the words
• the encoded documents are organized on another self-organizing map, a document map, on which nearby locations contain similar documents.

## Introduction

basic approaches for information retrieval and data mining in texual document collections are

• searching using keywords or key documents
• exploration of the collection referring to some organization or categorization of the documents
• filtering of interesting documents from the incoming document stream

one of the traditional methods of searching for texts that match a query is to index all the words (terms) that have appeared in the document collection.

• the query itself, typically a list of appropriate keywords, is compared with the term list of each document to find documents that match the query
• terms can be combined by Boolean logic in order to control the breadth of matching.
• three fundamental problems:
• recall and precision of retrieval are sensitive
• no indication on how many valuable documents were not retrieved
• difficult to formulate the query if the domain of the query is not known well

one basic problem of information retrieval: the same idea or concept can be presented in many different ways

• enormous variation in expression
• authors use their unique style

one suggested solution: thesaurus

The paper proposed WEBSOM:

• a tool especially in exploring a document collection but also in searching and filtering tasks

### Vector space model

the resulting vectors can be thought to represent word histograms of the documents

• $\bfe_i$: unit vector with one in the $i$-th component
• $n_{ji}$: the frequency of occurrence of the word indexed with $i$ in the $j$-th document
$\bfn_j = \sum_k n_{jk}\bfe_k$

two major problems:

• the dimensionality
• the orthogonality of the vectors $\bfe_k$ that correspond to the words, in which semantic relationships of the words are not taken into account.

### Latent semantic indexing

$\bfn_j' = \sum_k n_{jk}\bfx_k'$

where $\bfx_k’$ is the code that LSI forms for the $k$-th word by investigating the co-occurences of the words within the documents.

Perform SVD on the term-by-document matrix, and then discard the factors that have the least influence on the matrix, and then form $\bfx_i’$ by using only the remaining factors.

### MatchPlus

rather like LSI, but the dimensions of $x_k’$ are chosen manually to correspond to a set of important “basis terms”.

## Random mapping method

replace the unit vectors with random vectors

$\bfx_j = \sum_k n_{jk}\bfr_k$

then

$\bfx_j = R\bfn_j$

## WEBSOM

### Using Contextual information: Word category map

word category maps are SOMs which have organized words according to similarities in their roles in different contexts.

• Each unit on the SOM corresponds to a word category that consists of a set of similar words.

the use of SOM requires that the input is presented as numerical vectors and that a metric for comparing the vectors is available.

• not the appearance
• the sentential context

### Document maps based on a two-level architecture

the document map is formed with the SOM algorithm using the histograms as fingerprints of the documents.

Published in categories Note