Self-organizing Maps of Document Collections
This note is on Kaski, S., Honkela, T., Lagus, K., & Kohonen, T. (1998). WEBSOM – Self-organizing maps of document collections.
WEBSOM: a textual document collection may be organized onto a graphical map display that provides an overview of the collection and facilitates interactive browsing.
- each document is encoded as a histogram of word categories, which are formed by the self-organizing map (SOM) algorithm based on similarities in the contexts of the words
- the encoded documents are organized on another self-organizing map, a document map, on which nearby locations contain similar documents.
Introduction
basic approaches to information retrieval and data mining in textual document collections are:
- searching using keywords or key documents
- exploration of the collection referring to some organization or categorization of the documents
- filtering of interesting documents from the incoming document stream
one of the traditional methods of searching for texts that match a query is to index all the words (terms) that appear in the document collection (a toy inverted index is sketched after the list below).
- the query itself, typically a list of appropriate keywords, is compared with the term list of each document to find documents that match the query
- terms can be combined by Boolean logic in order to control the breadth of matching.
- three fundamental problems:
- recall and precision of retrieval are sensitive to the exact choice of search terms in the query
- no indication of how many valuable documents were not retrieved
- difficult to formulate the query if the domain of the query is not known well
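To make the traditional term-indexing approach concrete, here is a minimal Python sketch of an inverted index queried with Boolean AND/OR; the toy documents and helper names are invented for illustration.

```python
from collections import defaultdict

# Toy collection: document ids mapped to their texts.
docs = {
    0: "self organizing maps of document collections",
    1: "information retrieval with boolean queries",
    2: "exploring document collections with maps",
}

# Build an inverted index: each term points to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Boolean matching: AND narrows the result set, OR broadens it.
def boolean_and(*terms):
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(*terms):
    return set.union(*(index.get(t, set()) for t in terms))

print(boolean_and("document", "maps"))   # {0, 2}
print(boolean_or("boolean", "maps"))     # {0, 1, 2}
```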
one basic problem of information retrieval: the same idea or concept can be presented in many different ways
- enormous variation in expression
- authors use their unique style
one suggested solution: a thesaurus that maps the different expressions of the same concept onto common terms
The paper proposed WEBSOM:
- a tool especially for exploring a document collection, but also useful in searching and filtering tasks
Related methods for document encoding
Vector space model
each document $j$ is encoded as a vector
\[\bfn_j = \sum_i n_{ji}\bfe_i\]
the resulting vectors can be thought of as word histograms of the documents (a toy construction is sketched after the list below)
- $\bfe_i$: unit vector with one in the $i$-th component
- $n_{ji}$: the frequency of occurrence of the word indexed with $i$ in the $j$-th document
two major problems:
- the high dimensionality (the vectors are as long as the vocabulary of the collection)
- the orthogonality of the vectors $\bfe_k$ that correspond to the words, which means that semantic relationships between words are not taken into account
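A minimal sketch of the vector space encoding described above, assuming a toy vocabulary; each row of the resulting matrix is a document's word histogram $\bfn_j$.

```python
import numpy as np

# Toy vocabulary and tokenized documents; in the real setting the
# vocabulary of a large collection has tens of thousands of entries.
vocab = ["map", "document", "word", "search", "query"]
word_index = {w: i for i, w in enumerate(vocab)}

docs = [
    ["document", "map", "map", "word"],
    ["search", "query", "document"],
]

# n[j, i] = frequency of word i in document j; each row is the
# histogram vector n_j = sum_i n_{ji} e_i from the formula above.
n = np.zeros((len(docs), len(vocab)))
for j, doc in enumerate(docs):
    for w in doc:
        n[j, word_index[w]] += 1

print(n)
# [[2. 1. 1. 0. 0.]
#  [0. 1. 0. 1. 1.]]
```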
Latent semantic indexing
\[\bfn_j' = \sum_k n_{jk}\bfx_k'\]
where $\bfx_k'$ is the code that LSI forms for the $k$-th word by investigating the co-occurrences of the words within the documents.
Perform SVD on the term-by-document matrix, discard the factors that have the least influence on the matrix, and form $\bfx_k'$ using only the remaining factors.
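A rough sketch of this idea using NumPy's SVD; the toy matrix, the number of retained factors, and the scaling convention for the word codes are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Term-by-document matrix A (rows = terms, columns = documents);
# here simply the transpose of the toy histogram matrix n from above.
A = np.array([[2., 0.],
              [1., 1.],
              [1., 0.],
              [0., 1.],
              [0., 1.]])

# Full SVD, then keep only the d factors with the largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
d = 1  # toy number of latent factors retained

# Word code x_k': row k of the truncated, singular-value-weighted U.
word_codes = U[:, :d] * s[:d]

# Document code n_j' = sum_k n_{jk} x_k' (frequency-weighted sum of word codes).
doc_codes = A.T @ word_codes

print(word_codes.shape, doc_codes.shape)   # (5, 1) (2, 1)
```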
MatchPlus
rather like LSI, but the dimensions of $\bfx_k'$ are chosen manually to correspond to a set of important “basis terms”.
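As a loose illustration of that idea only (not the actual MatchPlus algorithm), one could give each word a vector whose $b$-th dimension records its co-occurrence with the $b$-th hand-picked basis term; the basis terms and sentences below are invented.

```python
import numpy as np

# Hypothetical, hand-picked basis terms; one vector dimension per term.
basis_terms = ["map", "search", "document"]

# Toy sentences used to count co-occurrences.
sentences = [
    ["the", "map", "shows", "the", "document"],
    ["search", "the", "document", "collection"],
]

vocab = sorted({w for s in sentences for w in s})
codes = {w: np.zeros(len(basis_terms)) for w in vocab}

# Dimension b of a word's code counts how often the word appears in the
# same sentence as basis term b.
for sent in sentences:
    for w in sent:
        for b, term in enumerate(basis_terms):
            if term in sent and term != w:
                codes[w][b] += 1

print(codes["document"])   # co-occurrence with map, search, document
```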
Random mapping method
replace the orthogonal unit vectors $\bfe_k$ with lower-dimensional random vectors $\bfr_k$
\[\bfx_j = \sum_k n_{jk}\bfr_k\]
or, in matrix form,
\[\bfx_j = R\bfn_j\]
where $R$ is the matrix whose $k$-th column is the random vector $\bfr_k$.
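A minimal sketch of random mapping, reusing the toy histogram from the vector space sketch; the dimensions are arbitrary toy values, and in practice the reduced dimension is much smaller than the vocabulary size.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 5     # toy vocabulary size (tens of thousands in practice)
reduced_dim = 3    # toy reduced dimension (much smaller than the vocabulary)

# One random column r_k per word replaces the orthogonal unit vector e_k.
R = rng.normal(size=(reduced_dim, vocab_size))

# Word histogram n_j taken from the vector space sketch above.
n_j = np.array([2., 1., 1., 0., 0.])

# Random mapping: x_j = sum_k n_{jk} r_k = R n_j.
x_j = R @ n_j
print(x_j)   # a 3-dimensional document code
```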
WEBSOM
Using contextual information: word category map
word category maps are SOMs which have organized words according to similarities in their roles in different contexts.
- Each unit on the SOM corresponds to a word category that consists of a set of similar words.
the use of the SOM requires that the input is presented as numerical vectors and that a metric for comparing the vectors is available; the question is how to encode the words:
- the words are not encoded by their appearance (their character strings)
- instead, each word is encoded by its averaged sentential context, i.e. the words that typically occur around it (a toy construction is sketched below)
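A toy construction of such context vectors, assuming each word is represented by the averaged random codes of its immediate predecessor and successor over all occurrences; the paper's exact context definition may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

corpus = "the map shows the document the document describes the map".split()
vocab = sorted(set(corpus))
dim = 4  # toy dimensionality of the random word codes

# A random code vector for every word (as in the random mapping sketch).
code = {w: rng.normal(size=dim) for w in vocab}

# Context vector of a word: the average code of its predecessors,
# concatenated with the average code of its successors, over all
# occurrences of the word in the corpus.
left = {w: np.zeros(dim) for w in vocab}
right = {w: np.zeros(dim) for w in vocab}
count = {w: 0 for w in vocab}
for i in range(1, len(corpus) - 1):
    w = corpus[i]
    left[w] += code[corpus[i - 1]]
    right[w] += code[corpus[i + 1]]
    count[w] += 1

context = {w: np.concatenate([left[w], right[w]]) / max(count[w], 1)
           for w in vocab}

# These context vectors are the inputs on which the word category map
# (itself a SOM) would be trained.
print(context["document"].shape)   # (8,)
```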
Document maps based on a two-level architecture
the document map is formed with the SOM algorithm using the word-category histograms as fingerprints of the documents: each word of a document is mapped onto the word category map, and the resulting histogram of hits is the input vector for the second SOM.
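A compact sketch of the two-level architecture with a minimal online SOM; the grid sizes, learning schedule, and random stand-in data are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, grid=(4, 4), epochs=50, lr0=0.5, radius0=2.0):
    """Minimal online SOM: winner search plus Gaussian-neighborhood
    updates with a shrinking learning rate and radius."""
    h, w = grid
    dim = data.shape[1]
    weights = rng.normal(size=(h * w, dim))
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            t = step / n_steps
            lr = lr0 * (1 - t)
            radius = radius0 * (1 - t) + 0.5
            winner = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
            d2 = ((coords - coords[winner]) ** 2).sum(axis=1)
            g = np.exp(-d2 / (2 * radius ** 2))                   # neighborhood function
            weights += lr * g[:, None] * (x - weights)
            step += 1
    return weights

def bmu(weights, x):
    return int(np.argmin(((weights - x) ** 2).sum(axis=1)))

# --- Level 1: word category map, trained on word context vectors ---
# (random stand-ins here; in WEBSOM these come from the sentential contexts)
n_words, ctx_dim = 30, 8
contexts = rng.normal(size=(n_words, ctx_dim))
word_map = train_som(contexts, grid=(3, 3))

# --- Level 2: encode each document as a histogram of hits on the word map ---
docs = [rng.integers(0, n_words, size=20) for _ in range(10)]  # toy docs: word ids
fingerprints = np.zeros((len(docs), word_map.shape[0]))
for j, doc in enumerate(docs):
    for word_id in doc:
        fingerprints[j, bmu(word_map, contexts[word_id])] += 1
fingerprints /= fingerprints.sum(axis=1, keepdims=True)

# --- The document map: a second SOM trained on the fingerprints ---
doc_map = train_som(fingerprints, grid=(4, 4))
print(doc_map.shape)   # (16, 9): one prototype fingerprint per map unit
```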