
Occurrence Matrix

In the previous segment, you learnt that there are two broad ways to represent how terms (words) occur in certain contexts: 1) the term-occurrence context matrix (or simply the occurrence matrix), where each row is a term and each column represents an occurrence context (such as a tweet, a book or a document), and 2) the term-term co-occurrence matrix (or the co-occurrence matrix), which is a square matrix with terms in both rows and columns.

Let’s study the occurrence matrix first.

Also, notice that each word and each document now has a corresponding vector representation – each row is a vector representing a word, while each column is a vector representing a document (or context, such as a tweet or a book). Thus, you can now perform all common vector operations on words and documents, as the sketch below illustrates. The following exercise will help you use some of those operations.
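To make this concrete, here is a minimal sketch in Python (using NumPy, with a toy vocabulary and made-up counts) of how the rows and columns of an occurrence matrix can be treated as word and document vectors:

```python
import numpy as np

# Toy vocabulary and documents; the counts below are assumed purely for illustration.
terms = ["fear", "beer", "fun", "magic", "wizard"]
documents = ["doc1", "doc2", "doc3"]

# Rows = terms, columns = documents; each entry is a term frequency.
X = np.array([
    [2, 0, 1],   # fear
    [0, 3, 0],   # beer
    [1, 2, 2],   # fun
    [4, 0, 0],   # magic
    [3, 0, 0],   # wizard
])

word_vector = X[terms.index("magic")]        # a row: the vector for 'magic'
doc_vector = X[:, documents.index("doc1")]   # a column: the vector for doc1

# Any standard vector operation applies, e.g. a dot product between two words:
similarity = np.dot(X[terms.index("magic")], X[terms.index("wizard")])
print(word_vector, doc_vector, similarity)
```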

Note that the occurrence matrix is also called a term-document matrix since its rows and columns represent terms and documents/occurrence contexts respectively.

Comprehension: The Term-Document Matrix

Consider four documents, each of which is a paragraph taken from a movie. Assume that your vocabulary contains only the following words: fear, beer, fun, magic and wizard.

The table below shows the term-document matrix, with each entry representing the frequency of a term in the corresponding movie's document:

| Term   | Harry Potter and the Sorcerer’s Stone | The Prestige | Wolf of Wall Street | Hangover |
|--------|---------------------------------------|--------------|---------------------|----------|
| fear   | 10                                    | 8            | 2                   | 0        |
| beer   | 0                                     | 0            | 2                   | 8        |
| fun    | 6                                     | 5            | 8                   | 8        |
| magic  | 18                                    | 25           | 0                   | 0        |
| wizard | 20                                    | 8            | 0                   | 0        |

Term-document matrices (or occurrence context matrices) are commonly used in tasks such as information retrieval. Two documents containing similar words will have similar column vectors, and the similarity between vectors can be computed using a standard measure such as the dot product. Thus, you can use such representations in tasks where, for example, you want to retrieve documents similar to a given document from a large corpus.
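As a sketch, the table above can be encoded as a NumPy array, and the similarity between two movies can then be computed from their column vectors. Here cosine similarity (the dot product normalised by the vector lengths) is used:

```python
import numpy as np

terms = ["fear", "beer", "fun", "magic", "wizard"]
movies = ["Harry Potter", "The Prestige", "Wolf of Wall Street", "Hangover"]

# The term-document matrix from the table above (rows = terms, columns = movies).
X = np.array([
    [10,  8, 2, 0],   # fear
    [ 0,  0, 2, 8],   # beer
    [ 6,  5, 8, 8],   # fun
    [18, 25, 0, 0],   # magic
    [20,  8, 0, 0],   # wizard
])

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Compare two document (column) vectors:
print(cosine(X[:, 0], X[:, 1]))  # Harry Potter vs The Prestige -> ~0.88 (high)
print(cosine(X[:, 0], X[:, 3]))  # Harry Potter vs Hangover     -> ~0.14 (low)
```

As you would expect, the two magic-themed movies come out far more similar to each other than either does to Hangover.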

However, note that a real term-document matrix will be much larger and extremely sparse, i.e. it will have as many rows as there are words in the vocabulary (typically tens of thousands), and most cells will have the value 0 (since most words do not occur in most documents).
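In practice, such matrices are therefore stored in a sparse format. The sketch below (the toy documents are assumptions, purely for illustration) uses scikit-learn's CountVectorizer, which returns a sparse document-term matrix that can be transposed into the terms-as-rows convention used here (assuming a recent scikit-learn version):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the wizard cast a magic spell",   # toy documents, assumed for illustration
    "beer and fun at the party",
    "fear the dark wizard",
]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(docs)   # scipy sparse matrix: documents x terms
term_doc = doc_term.T                       # transpose: terms x documents

print(term_doc.shape)                       # (vocabulary size, number of documents)
print(vectorizer.get_feature_names_out())   # the learnt vocabulary
```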

Using the term-document matrix to compare similarities between terms and documents has some serious shortcomings, one of which involves polysemous words, i.e. words having multiple meanings. For example, the term ‘Java’ is polysemous (a coffee, an island and a programming language), and it will occur in documents on programming, Indonesia and cuisine/beverages.

So if you imagine a high-dimensional space where each document represents one dimension, the (resultant) vector of the term ‘Java’ will be the vector sum of the term’s occurrences along the dimensions corresponding to all the documents in which ‘Java’ occurs. Thus, the vector of ‘Java’ will represent some sort of ‘average meaning’ rather than three distinct meanings (although if the term has a predominant sense, e.g. it occurs much more frequently as a programming language than in its other senses, this effect is reduced).
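A toy illustration of this averaging effect, with all counts assumed, is shown below: the row for ‘Java’ sums its occurrences across documents covering all three senses, collapsing them into a single vector:

```python
import numpy as np

# Three document dimensions, one per sense of 'Java' (toy setup).
docs = ["programming_doc", "indonesia_doc", "coffee_doc"]

# 'Java' as used in each sense contributes along one document dimension:
java_programming = np.array([5, 0, 0])  # occurrences in the programming doc
java_indonesia   = np.array([0, 3, 0])  # occurrences in the travel doc
java_coffee      = np.array([0, 0, 2])  # occurrences in the beverage doc

# The row for 'Java' in the term-document matrix is the vector sum of these,
# so all three senses collapse into one 'averaged' representation:
java_vector = java_programming + java_indonesia + java_coffee
print(java_vector)  # [5 3 2] -- a single vector conflating three meanings
```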

In the next segment, you will study an alternative way to generate a distributed representation of words – the term-term co-occurrence matrix, where both rows and columns represent terms (words).
