Knowing me knowing you
Topic modeling is essentially a statistical technique for uncovering the hidden subjects, or topics, that occur in a collection of documents. The method is completely unsupervised: the learning is driven by subtle co-occurrences of words across documents. Despite the large number of possible word combinations, the algorithm picks up the patterns present in the corpus and generalizes them, clustering words into a set of groups, or topics.
The most widely used technique for topic modeling is Latent Dirichlet Allocation (LDA). In this model we assume that the observations are explained by a set of unobserved groups, or topics – hence the latent designation. This class of models is called generative in the sense that we assume the latent variables are sufficient to generate the data.
In our case the observations are words in documents (dating profile descriptions – see the previous blog entry), and we assume that each document is a mixture of a fixed number of topics (set in advance) and that each word’s creation is attributable to one of the document’s topics.
The topic model can be expressed as:

P(W, Z, θ, φ; α, β) = ∏_k P(φ_k; β) ∏_i P(θ_i; α) ∏_j P(z_ij | θ_i) P(w_ij | φ_{z_ij})

where P is the joint probability of the words W being generated by this stochastic process. The term φ_k is the word distribution for topic k, θ_i is the topic distribution for document i, z_ij is the topic assigned to the jth word in document i, and w_ij is that word itself. The parameters α and β are the Dirichlet priors on the document-topic and topic-word distributions in the Bayesian framework.
So we have two distributions: the word distribution per topic, φ, and the topic distribution per document, θ. Notice that a document is in general generated by more than one topic.
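To make the generative story concrete, here is a minimal Python sketch of how LDA assumes a corpus is produced; the function names, corpus sizes, and symmetric prior values are illustrative choices of mine, not taken from the original analysis:

```python
import random

def sample_dirichlet(alpha, dim, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) via Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index according to the probability vector `probs`."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, beta=0.1, seed=0):
    """Generate documents by the LDA generative process described above."""
    rng = random.Random(seed)
    # topic-word distributions: phi_k ~ Dirichlet(beta), one per topic
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # per-document topic mixture: theta_i ~ Dirichlet(alpha)
        theta = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)   # pick a topic for this word
            w = sample_categorical(phi[z], rng)  # draw the word from that topic
            words.append(w)
        docs.append(words)
    return docs
```

Inference runs this process in reverse: given only the words, it recovers plausible θ and φ.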
After some cleaning and stemming of the text we end up with a vocabulary of 8734 words. Since many words are not informative – they are either too common or too rare – we introduce two frequency cutoffs, at word frequency below 5 and above 80, ending up with a vocabulary of 2743 words.
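The cutoff step can be sketched as follows; I interpret the two thresholds as raw corpus counts here, which is an assumption – the original pipeline’s tokenisation and exact thresholds may differ:

```python
from collections import Counter

def build_vocabulary(docs, min_count=5, max_count=80):
    """Keep only words whose total corpus frequency lies in [min_count, max_count]."""
    freq = Counter(word for doc in docs for word in doc)
    return {w for w, c in freq.items() if min_count <= c <= max_count}

def filter_docs(docs, vocab):
    """Drop out-of-vocabulary words from each document."""
    return [[w for w in doc if w in vocab] for doc in docs]
```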
The model was trained using Gibbs sampling with 1000 iterations and K=10 topics using the R package “lda”.
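The R package “lda” implements collapsed Gibbs sampling; as an illustration of what happens under the hood, here is a minimal pure-Python collapsed Gibbs sampler for LDA. This is a sketch for intuition, not the code used for the post:

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=100, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                                # words per topic
    z = []                                              # topic of each word
    for d, doc in enumerate(docs):                      # random initialisation
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
        z.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k = z[d][j]                             # remove current assignment
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                # p(k) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                           / (n_k[t] + vocab_size * beta)
                           for t in range(n_topics)]
                r = rng.random() * sum(weights)
                k, acc = n_topics - 1, 0.0
                for t, wt in enumerate(weights):
                    acc += wt
                    if r < acc:
                        k = t
                        break
                z[d][j] = k                             # reassign with new topic
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    # point estimates of theta (doc-topic) and phi (topic-word)
    theta = [[(n_dk[d][t] + alpha) / (len(docs[d]) + n_topics * alpha)
              for t in range(n_topics)] for d in range(len(docs))]
    phi = [[(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
            for w in range(vocab_size)] for t in range(n_topics)]
    return theta, phi
```

Each sweep resamples every word’s topic conditioned on all the others; after enough iterations the counts yield the θ and φ estimates discussed next.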
After convergence, we obtain two dataframes, one for θ and another for φ; the θ dataframe contains the membership of each user profile with respect to the 10 topics considered. The results can be explored in this link on an interactive visualization dashboard.
Is it useful?
Now that we have clustered the profiles into 10 topics, the question is: how do we know the topics have any meaning? To answer it, I built a model that fits the topic memberships of each profile to several personality scores, obtained from the answers to Q&A on the dating website. I used a Random Forest algorithm, with some caution taken to avoid overfitting. The results were impressive.
The next figure presents the predicted versus real score on the “love driven” personality component (note that only a fraction of the profiles have a sufficiently large number of answers to get an estimate of this personality trait). Colours indicate the type of relationship users are looking for – I only consider female users here. Essentially, the algorithm can predict how love driven you are based on the description you wrote of yourself… Now, that’s something I wouldn’t expect, at least not with this level of accuracy.
I’m currently working on the approach by Salakhutdinov and Hinton based on Deep Belief Networks, to see how it compares with LDA.
Next, I used a self-organising map (SOM) to cluster the different profiles in a visually and topologically meaningful way. Each coloured region is a cluster formed by semantically similar profiles.
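A minimal SOM sketch, assuming each profile is represented as a dense topic-membership vector; the grid size, decay schedules, and Gaussian neighbourhood are illustrative choices, not the settings used for the figure:

```python
import math
import random

def train_som(data, rows, cols, n_iter=200, lr0=0.5, seed=0):
    """Train a small self-organising map on the vectors in `data`."""
    rng = random.Random(seed)
    dim = len(data[0])
    # initialise each grid cell with a random weight vector
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    sigma0 = max(rows, cols) / 2.0
    for it in range(n_iter):
        frac = it / n_iter
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
        x = rng.choice(data)
        # find the best-matching unit: the closest cell in Euclidean distance
        bmu = min(((r, c) for r in range(rows) for c in range(cols)),
                  key=lambda rc: sum((a - b) ** 2
                                     for a, b in zip(weights[rc[0]][rc[1]], x)))
        # pull the BMU and its neighbours toward the sample
        for r in range(rows):
            for c in range(cols):
                d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * sigma * sigma))  # neighbourhood kernel
                cell = weights[r][c]
                for i in range(dim):
                    cell[i] += lr * h * (x[i] - cell[i])
    return weights

def map_profile(weights, x):
    """Return the grid coordinates of the cell closest to profile vector x."""
    rows, cols = len(weights), len(weights[0])
    return min(((r, c) for r in range(rows) for c in range(cols)),
               key=lambda rc: sum((a - b) ** 2
                                  for a, b in zip(weights[rc[0]][rc[1]], x)))
```

After training, nearby cells hold similar weight vectors, so semantically similar profiles land in neighbouring regions of the grid.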
The next figure shows the semantic fingerprint obtained for a specific profile. The red points are semantic features this user dislikes; the green points are the semantic features the user considers important. Note that each cell in the graph has a meaning, so you can know precisely which components you match and which you don’t. This visual matching algorithm is much richer than keyword matching – by selecting (or deselecting) a specific cell you can fine-tune your search.
Note that with this vector representation of the profiles, users can query by words or desired personality traits, with something like “Profile X + love driven – arrogant”.
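Such a query can be implemented as simple vector arithmetic followed by cosine-similarity ranking. The two-topic vectors and profile names below are hypothetical, purely to illustrate the mechanics:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors, 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def query(profiles, base, plus, minus):
    """Rank (name, vector) profiles by similarity to base + plus - minus."""
    target = [b + p - m for b, p, m in zip(base, plus, minus)]
    return sorted(profiles, key=lambda item: -cosine(item[1], target))
```

For example, with axes (love driven, arrogant), the query “Profile X + love driven – arrogant” shifts Profile X’s vector toward the first axis and away from the second before ranking.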
In the next posts I will consider images and multimodal learning, combining text and images (profile photos) in the semantic search.