In the previous segments, you have learnt that the document-topic and topic-term distributions are modelled as a parametric distribution called the Dirichlet. The Dirichlet distribution, in turn, is parameterised by a parameter ‘alpha’.
While studying the Dirichlet distribution, you will often come across the term ‘prior’ (the Dirichlet prior, etc.). This simply means that by specifying a particular value of alpha, you define the rough ‘shape’ of the Dirichlet distribution (similar to how specifying μ and σ defines the ‘shape’ of a Gaussian distribution).
Let’s now understand the Dirichlet distribution through an example.
Understanding the Dirichlet distribution
In a previous discussion, you saw how we can model the document-topic and topic-term distributions using a multinomial distribution.
To reiterate, in the multinomial distribution, the random variable X is a vector whose components represent the number of times each face of a k-sided die appears, i.e. X = (X1=3, X2=5, …, Xk=10). For example, say you have five topics t1, t2, …, t5 in a document. The random variable X = (t1=4, t2=10, t3=14, t4=16, t5=11) represents the number of times each topic occurs in the document (i.e. topic-1 occurs 4 times, topic-2 occurs 10 times, and so on).
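As a quick illustration of this counting view, here is a minimal sketch in Python using NumPy (a library assumed here; the topic probabilities are made-up values, not numbers from the text):

```python
import numpy as np

# Hypothetical weights of the five topics t1..t5 (illustrative values only).
topic_probs = [0.08, 0.18, 0.25, 0.29, 0.20]

rng = np.random.default_rng(42)

# Assign one of the five topics to each of 55 words and count the occurrences.
topic_counts = rng.multinomial(n=55, pvals=topic_probs)

print(topic_counts)        # a vector of integer counts per topic
print(topic_counts.sum())  # always 55 -- the counts add up to the number of draws
```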
Now, there’s something we want to modify – rather than each topic taking an integer count, we want it to take a value between 0 and 1, i.e. we want each topic t1, t2, …, tk to appear with probabilities p1, p2, …, pk.
Similarly, we want to model the topic-term distribution such that each term w1, w2, …, wk appears with some probability p1, p2, …, pk in a topic.
The two distributions thus obtained, in which the random variable X represents the probabilities of occurrence of topics in a document (or of terms in a topic), are Dirichlet distributions (subject to some additional properties explained below).
In general, a Dirichlet distribution is similar to the multinomial distribution: it describes k ≥ 2 variables x1, x2, …, xk, where each xi lies in [0, 1] and ∑xi = 1.
The Dirichlet is parametrised by a vector α = (α1, α2, …, αk). The individual alphas can take any positive value (i.e. not necessarily between 0 and 1), though the xi’s lie between 0 and 1 (and sum to 1, since they are probabilities).
When all alphas are equal, i.e. α=α1=α2=…=αk, it is called a symmetric Dirichlet distribution. In LDA, we assume a symmetric Dirichlet distribution and say that it is parameterised by alpha.
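A minimal sketch of drawing one such distribution from a symmetric Dirichlet, again assuming NumPy (the number of topics and the value of alpha are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

k = 3         # number of topics (illustrative)
alpha = 1.0   # symmetric Dirichlet: the same alpha for every component

# One draw is a length-k vector of probabilities: every entry lies in [0, 1]
# and the entries sum to 1.
theta = rng.dirichlet(alpha * np.ones(k))
print(theta, theta.sum())
```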
Let’s visualize the Dirichlet using an example of a document-topic distribution. Say you have 3 topics (t1, t2, t3) across all your documents. Each document will have a certain topic distribution drawn from the Dirichlet.
Let’s say that the first document d1 has the topic distribution (t1=0.2, t2=0.6, t3=0.2), the document d2 has the distribution (t1=0.5, t2=0.2, t3=0.3), and so on till document dn. Also, consider an extreme case – the document d100 has only one topic t1, and thus has the distribution (t1=1.0, t2=0.0, t3=0.0).
Now, it would be nice to be able to plot the topic distributions of all these documents. One way to do that is with a simplex, as shown below.
The vertices represent the three topics – t1, t2, t3. The closer a point is to a vertex, the higher that vertex’s ‘weight’. For example, the document d100, having all its weight on t1, is a point at the vertex t1. The point representing document d1 = (t1=0.2, t2=0.6, t3=0.2) is closer to t2 and farther (and equidistant) from t1 and t3. Finally, d2 = (t1=0.5, t2=0.2, t3=0.3) is closest to t1, fairly close to t3 and far from t2.
Now, imagine you have N=10,000 documents each plotted as a point on this simplex. The shape of this distribution is controlled by the parameter alpha (assuming a symmetric distribution).
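As a rough sketch of how such a plot could be produced (assuming NumPy and Matplotlib, and a simple barycentric-to-Cartesian projection of the triangle):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

n_docs, alpha = 10_000, 1.0
samples = rng.dirichlet(alpha * np.ones(3), size=n_docs)   # one row per document

# Map each (t1, t2, t3) point into the 2-D triangle (barycentric -> Cartesian),
# with corners at t1=(0, 0), t2=(1, 0) and t3=(0.5, sqrt(3)/2).
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
xy = samples @ corners

plt.scatter(xy[:, 0], xy[:, 1], s=1, alpha=0.3)
plt.title(f"10,000 document-topic samples, alpha = {alpha}")
plt.axis("off")
plt.show()
```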
The four figures below show the different shapes of Dirichlet distributions as alpha varies (a short numerical sketch after this list illustrates the same effect).
- At values of alpha < 1 (figure-4), most points are dispersed towards the corners and edges, apart from a few at the centre (a sparse distribution – most topics have low probabilities while a few are dominant).
- At alpha = 1 (figure-1), the points are distributed uniformly across the simplex.
- At alpha > 1 (figure-2, top-right), the points are concentrated around the centre (i.e. all topics have comparable probabilities, such as (t1=0.32, t2=0.33, t3=0.35)).
- The figure on the bottom-left shows an asymmetric Dirichlet distribution (not used in LDA), with most points lying close to topic t2.
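To see the same effect numerically rather than visually, here is a small sketch (again assuming NumPy; the alpha values are chosen for illustration) that draws many samples for each alpha and checks how much weight the single largest topic carries on average:

```python
import numpy as np

rng = np.random.default_rng(0)

for alpha in (0.1, 1.0, 10.0):                 # illustrative alpha values
    samples = rng.dirichlet(alpha * np.ones(3), size=10_000)
    avg_max = samples.max(axis=1).mean()       # average weight of the dominant topic
    print(f"alpha = {alpha:>4}: dominant topic carries {avg_max:.2f} of the mass on average")
```

You should see the dominant topic’s average share close to 1 for small alpha (sparse samples near the corners) and shrinking towards 1/3 as alpha grows (samples clustered around the centre).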
Note that when you take a sample from a Dirichlet (such as T = (t1=0.32, t2=0.33, t3=0.35)), T is itself a probability distribution, and hence the Dirichlet is also often called a distribution over distributions.
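This ‘distribution over distributions’ idea is exactly how the sample is used: a draw from the Dirichlet can itself serve as the parameter vector of a multinomial. A minimal sketch (alpha, the number of topics, and the document length are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: draw a document's topic distribution theta from the Dirichlet prior.
theta = rng.dirichlet(0.5 * np.ones(3))      # a length-3 probability vector

# Step 2: theta is itself a distribution, so it can directly parameterise a
# multinomial -- here, assigning topics to the 50 words of a document.
topic_counts = rng.multinomial(n=50, pvals=theta)

print(theta, topic_counts)
```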
Rahim will demonstrate the effect of alpha on the shape of the Dirichlet in the upcoming lecture.