Latent Dirichlet allocation

In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's presence is attributable to one of the document's topics. LDA is an example of a topic model.

In the context of population genetics, LDA was proposed by J. K. Pritchard, M. Stephens and P. Donnelly in 2000. In the context of machine learning, where it is most widely applied today, LDA was rediscovered independently by David Blei, Andrew Ng and Michael I. Jordan in 2003 and presented as a graphical model for topic discovery. As of 2019, these two papers had 24,620 and 26,320 citations respectively, making them among the most cited in the fields of machine learning and artificial intelligence.

In LDA, each document may be viewed as a mixture of various topics, with a set of topics assigned to each document via LDA. This is identical to probabilistic latent semantic analysis (pLSA), except that in LDA the topic distribution is assumed to have a sparse Dirichlet prior. The sparse Dirichlet priors encode the intuition that documents cover only a small set of topics and that topics use only a small set of words frequently. In practice, this results in better disambiguation of words and a more precise assignment of documents to topics. LDA is a generalization of the pLSA model, which is equivalent to LDA under a uniform Dirichlet prior distribution.

For example, an LDA model might have topics that can be classified as CAT_related and DOG_related. A topic has probabilities of generating various words: milk, meow, and kitten, for instance, can be classified and interpreted by the viewer as 'CAT_related'. Naturally, the word cat itself will have high probability given this topic. The DOG_related topic likewise has probabilities of generating each word: puppy, bark, and bone might have high probability. Words without special relevance, such as 'the' (see function word), will have roughly even probability between classes (or can be placed into a separate category).

A topic is neither semantically nor epistemologically strongly defined. It is identified on the basis of automatic detection of the likelihood of term co-occurrence. A lexical word may occur in several topics with a different probability, but with a different typical set of neighboring words in each topic.

Each document is assumed to be characterized by a particular set of topics. This is similar to the standard bag-of-words model assumption, and makes the individual words exchangeable.
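To make the mixture view concrete, the following sketch uses made-up topic-word probabilities and made-up topic proportions for a single document, loosely mirroring the CAT_related and DOG_related example above; only the computation, marginalizing over topics to obtain a word's probability in a document, reflects the model itself.

    # Illustrative only: the probabilities below are invented for this sketch,
    # not estimated from data.
    topics = {
        "CAT_related": {"cat": 0.30, "meow": 0.20, "kitten": 0.20, "milk": 0.15, "the": 0.15},
        "DOG_related": {"dog": 0.30, "bark": 0.25, "puppy": 0.20, "bone": 0.10, "the": 0.15},
    }

    # One document modeled as a sparse mixture over topics (its theta).
    doc_topic_mix = {"CAT_related": 0.9, "DOG_related": 0.1}

    def word_probability(word):
        """P(word | document) = sum over topics k of P(word | k) * P(k | document)."""
        return sum(doc_topic_mix[k] * topics[k].get(word, 0.0) for k in doc_topic_mix)

    for w in ["meow", "bone", "the"]:
        print(f"P({w!r} | doc) = {word_probability(w):.3f}")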
With plate notation, which is often used to represent probabilistic graphical models (PGMs), the dependencies among the many variables can be captured concisely. The boxes are 'plates' representing replicates, i.e. repeated entities. The outer plate represents documents, while the inner plate represents the repeated word positions in a given document, each of which is associated with a choice of topic and word. $M$ denotes the number of documents and $N$ the number of words in a document. The variable names are defined as follows:

- $\alpha$ is the parameter of the Dirichlet prior on the per-document topic distributions,
- $\beta$ is the parameter of the Dirichlet prior on the per-topic word distribution,
- $\theta_i$ is the topic distribution for document $i$,
- $\varphi_k$ is the word distribution for topic $k$,
- $z_{ij}$ is the topic for the $j$-th word in document $i$, and
- $w_{ij}$ is the specific word.

In the plate diagram, $W$ is grayed out to indicate that the words $w_{ij}$ are the only observable variables; the other variables are latent. As proposed in the original paper, a sparse Dirichlet prior can be used to model the topic-word distribution, following the intuition that the probability distribution over words in a topic is skewed, so that only a small set of words have high probability. The resulting model is the most widely applied variant of LDA today. In the plate notation for this model, $K$ denotes the number of topics and $\varphi_1, \dots, \varphi_K$ are $V$-dimensional vectors storing the parameters of the Dirichlet-distributed topic-word distributions ($V$ is the number of words in the vocabulary).

It is helpful to think of the entities represented by $\theta$ and $\varphi$ as matrices created by decomposing the original document-word matrix that represents the corpus of documents being modeled. In this view, $\theta$ consists of rows defined by documents and columns defined by topics, while $\varphi$ consists of rows defined by topics and columns defined by words. Thus, $\varphi_1, \dots, \varphi_K$ refers to a set of rows, or vectors, each of which is a distribution over words, and $\theta_1, \dots, \theta_M$ refers to a set of rows, each of which is a distribution over topics.

To actually infer the topics in a corpus, we imagine a generative process whereby the documents are created, so that we may infer, or reverse engineer, it. Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over all the words. LDA assumes the following generative process for a corpus $D$ consisting of $M$ documents, each of length $N_i$ (a sampling sketch follows the steps below):

1. Choose $\theta_i \sim \operatorname{Dir}(\alpha)$, where $i \in \{1, \dots, M\}$ and $\operatorname{Dir}(\alpha)$ is a Dirichlet distribution with a symmetric parameter $\alpha$, typically sparse ($\alpha < 1$).
2. Choose $\varphi_k \sim \operatorname{Dir}(\beta)$, where $k \in \{1, \dots, K\}$ and $\beta$ is typically sparse.
3. For each word position $(i, j)$, where $i \in \{1, \dots, M\}$ and $j \in \{1, \dots, N_i\}$:
   (a) choose a topic $z_{ij} \sim \operatorname{Multinomial}(\theta_i)$;
   (b) choose a word $w_{ij} \sim \operatorname{Multinomial}(\varphi_{z_{ij}})$.
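A minimal sketch of this generative process, assuming a toy vocabulary of integer word ids and arbitrarily chosen sizes and Dirichlet parameters (none of the specific numbers come from the model definition):

    import numpy as np

    rng = np.random.default_rng(0)

    V, K, M = 8, 2, 4        # vocabulary size, topics, documents (arbitrary toy values)
    N = [10, 12, 9, 11]      # N_i: length of each document
    alpha, beta = 0.5, 0.1   # sparse symmetric Dirichlet parameters (illustrative)

    # Per-topic word distributions: phi_k ~ Dir(beta), shape (K, V)
    phi = rng.dirichlet(np.full(V, beta), size=K)

    # Per-document topic distributions: theta_i ~ Dir(alpha), shape (M, K)
    theta = rng.dirichlet(np.full(K, alpha), size=M)

    corpus = []
    for i in range(M):
        doc = []
        for j in range(N[i]):
            z_ij = rng.choice(K, p=theta[i])   # choose a topic for position (i, j)
            w_ij = rng.choice(V, p=phi[z_ij])  # choose a word from that topic
            doc.append(int(w_ij))
        corpus.append(doc)

    print(corpus[0])  # word ids of the first synthetic document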

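Inference reverses this process: given only the observed words, an algorithm approximates the posterior over $\theta$ and $\varphi$ (the original paper uses variational Bayes; collapsed Gibbs sampling is another common choice). As a usage sketch only, with four invented toy documents, scikit-learn's LatentDirichletAllocation recovers the two matrices described above: fit_transform returns per-document topic proportions (the rows of $\theta$), and components_ holds per-topic word weights that, once normalized, correspond to the rows of $\varphi$.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Toy documents invented for this sketch.
    docs = [
        "the cat drank milk and the kitten said meow",
        "the kitten chased the cat",
        "the puppy found a bone and started to bark",
        "the dog and the puppy bark at night",
    ]

    X = CountVectorizer().fit_transform(docs)  # document-word count matrix
    lda = LatentDirichletAllocation(n_components=2, random_state=0)

    theta = lda.fit_transform(X)  # rows: documents, columns: topic proportions
    phi = lda.components_         # rows: topics, columns: word weights (unnormalized)

    print(theta.round(2))
    print(phi.shape)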
Related topics: Topic model, biterm topic model, probabilistic topic modeling, Hierarchical Dirichlet process, Dynamic topic model.