Combining LDA and K-Means clustering for automated persona generation

A large project that I'm working on for the past few months is all about natural language analysis. I've had a great deal of fun diving deeply into both cloud api powered and local ML and NLP techniques, and in the meantime collecting largish data sets that are verbatim quotes from specially crafted consumer surveys.

The background: There's this construct in marketing-land called a "Persona", and it's a useful abstraction of the idea that there are groups of people who behave similarly, or have similar intents in the market, and can be described with a few key wants, needs, and desires. 

Given the large amount of statements of consumers from the survey data, what can we do to help automatically create such personas? We may not be able to do the whole job with code, but there should be interesting ways to help find the data points and insights to do 80% of the job, and provide a starting point for human strategists to editorialize and craft into a useable persona document.

Previous attempts at this had leveraged IBM's Watson Personality Insights service to analyze the unique text of a single person's answers to a custom survey, then use the percentile values as N-dimensional vectors for a k-Means clustering algorithm. The results were not always interesting, but held some promise. Just looking at the personality traits doesn't give enough direction as to what are the natural groupings of how people respond to specific questions.

How could I try to model additional dimensions that would not only capture the personality attributes of the survey population, but also the actual topics they were talking about? One direction to try is automatic topic modeling - find a way to automatically classify a number of unique topics from a large group of user statements.

Enter the LDA algorithm (Latent Dirichlet Allocation), recommended to me by a colleague for the task of automatically detecting topics in a body of text. It's an unsupervised machine learning algorithm that uses a bunch of math to automatically figure out any N number of topics in a body of text. These topics are expressed as a list of key terms, aka keywords, with scoring for each term.

A full survey is a collection of these questions, so then we have a unique LDA topic set for each question. It's not useful to find topics across the entire survey response set, as individual questions can have different expected discussions.

Based on the LDA topics generated, we can then go back to each individual survey answer and rank the source text according to how many of the LDA topic keywords are included in the text. We do a sort to find the 'best expression of that topic', turning our topic terms from "luxury watch brand shopping" to "I like to shop for luxury watch brands and sometimes i do it online". In testing about 6 topics per question set seems to give a good spread of key terms that helps with differentiation.

Once we have a good set of topics for each question, we can go back and classify what topic someone is talking about in a specific answer to a specific question, by doing a similar keyword / key term comparison. The topic with the highest matches with the specific answer is the 'matched topic' of that answer according to the automatically discovered topics via the LDA algorithm. Then we can start using the classified topics as additional dimensions in our clustering, and really get somewhere.

Improving our K-means clustering with more dimensions: What if we encoded the classified topic of each respondent's answers as additional dimensions in the k-means clustering? So the first 35 dimensions are the same personality traits, but for each survey question we add another dimension, encoding the topic (either a zero for no topic detected, or 1-6 for an index of the per-question LDA topic) as an index.

What we're trying to get at here: Automatically finding the groups in a large population by the topics they're talking about in common, plus any shared personality traits as expressed by the language they're using to answer with. These can map pretty closely back to the concept in marketing known as a "Persona" - basically a useful abstraction to help marketers craft a message or strategy to address the perceived things this population of people want, need, and value.

Another piece of the puzzle is automating how to find the ideal number of clusters. Our `ml-kmeans` library reports an error value for each returned cluster. We can average the errors for each returned cluster to guess at an overall error rate, and apply a typical k-means cluster count optimization approach.

Typically the literature for analyzing k-means clustering results focuses on finding the 'elbow' of an error graph - as the number of clusters goes up, the error rate typically goes down, but there's a diminishing returns effect here, too many clusters aren't useful and are hard to interpret. Instead of having to manually set a cluster size each time (painful and non-automated), let's analyze the error values and find the size automatically.

Therefore we need a simple method to compare the error values across a range of cluster sizes (2 to 10) and automatically pick a good value for cluster size. Since 2 points makes a line, we can calculate the current slope of the error line, and find the first point at which the slope goes the most horizontal - meaning that the graph has "elbowed". Alternatively, we can also stop when the slope goes over a set threshold, meaning the graph has gone horizontal enough.

Then we can just return that set of cluster centers, sometimes it's 4, 5, 6, depending on our elbow detection code. Along with the membership of each cluster, mapping back to source record identifiers, we also return the topic predictions for each cluster centers. These are the additional vectors we encoded by topic index (1-6) earlier. Therefore, when we display the cluster's best matched topic, we can simply find the closest integer, and that becomes the estimated topic index for that question's dimension. We can further refine the display logic so that we only display the sample responses when there's more than one different topic across all the cluster centers. It's the nature of high-dimensional k-means clustering that for some dimensions, you may not see a big range of differences. It's good to take a slice of that dimension across your input space and do a histogram of it, so you can see that dimensions without wide variance in topics, also cluster without large differences, as would be expected.

Some questions lend themselves to common answers in a population and that's fine - think of the answers to "Have you ever been to the moon? How does it feel? What thoughts or emotions come to mind when you are on the surface of the moon looking back at earth?" Asking a few thousand people this question will get you a bunch of similar responses: "What are you talking about I've never been to the moon".

In testing, results are good and meaningful. Depending on the robustness and variety of topics in a set of survey responses, in practice this approach has helped find common topics that align to a fictional persona.

It's been 6 days since the last post. This is a longish post. I had fun writing it! Might need some revision, but it's a complete post.

Posted on June 17th, 2018 in Algorithms, Machine Learning by Kevin