Big Data Case: Natural Language Processing on Doodle polls

With this post I’d like to present a highly interesting productivity case based on big data findings. We asked ourselves what we can learn from Doodle users when it comes to measuring interaction between clients. For this project, our product management team collaborated with the analytics team of our parent company Tamedia to get a better understanding of how our platform is used, to improve our product, and ultimately to offer our users an even better experience. Sounds easy, but I can tell you: it is quite complex.

Big Data scientist Nicolas explains

I asked data scientist Nicolas Perony to explain in his own words (but in an understandable way :-)) how this project started. Here is his version, taken from an article he recently posted on the Tamedia Digital blog.

Doodle Data analysed

Understanding the interaction between users of an online platform and the platform itself involves asking precise questions; in the case of Doodle, such questions include for example:

–       how many polls does a user create per month, on average?

–       what is a typical topic for polls created on a Sunday?

To answer the first question, it is enough to analyse the data stored in a production database or data warehouse. To answer the second question, however, this is not sufficient: it requires systematically analysing the content of a large number of text documents, also known as unstructured data. When unstructured data consists mostly of written or transcribed text, the task of extracting information from it is called Natural Language Processing.
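As a rough illustration of why the first question is the easy one, here is a minimal Python sketch that computes it from structured records. The record layout (a user id and a creation date) and the sample data are invented for this example and do not reflect Doodle's actual schema. The sketch also assumes one particular definition of "average": it only counts months in which a user created at least one poll.

```python
from collections import defaultdict
from datetime import date

# Hypothetical poll records: (user_id, creation_date). Invented for illustration.
polls = [
    ("alice", date(2016, 1, 5)),
    ("alice", date(2016, 1, 20)),
    ("alice", date(2016, 2, 3)),
    ("bob", date(2016, 1, 11)),
    ("bob", date(2016, 3, 9)),
]


def average_polls_per_user_month(records):
    """Mean number of polls per (user, month) pair with at least one poll."""
    counts = defaultdict(int)
    for user_id, created in records:
        # Group by user and calendar month.
        counts[(user_id, created.year, created.month)] += 1
    return sum(counts.values()) / len(counts) if counts else 0.0


average = average_polls_per_user_month(polls)
print(f"{average:.2f} polls per active user-month")
```

No such simple grouping exists for the second question, because the topic of a poll is not a field in a table; it has to be inferred from free text.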

The work involved using machine learning tools to scan millions of anonymised polls (this is what we mean by big data) and to extract meaning from what Doodle users write in the poll title and description.

Natural Language Processing in Action

One of the most difficult tasks in a Natural Language Processing pipeline is that of building meaningful numerical word representations, or word vectors. In the context of Doodle polls, does the word “meeting” belong to a professional or a personal context? In other words, is it numerically closer to “project” (e.g. project meeting) or to “friends” (e.g. meeting with friends)? This difficult problem saw a significant breakthrough in 2013, when Google researchers led by Tomas Mikolov published a paper describing the word2vec technique, based on artificial neural network models. A strength of word2vec is that it is able to represent analogies and semantic similarities through vector operations. With a text corpus composed of Doodle polls, this makes the model very powerful in describing the context in which a word is used. With this model we were able to capture the professional context of the word “project” in English, but also the private context of the word “essen” in German, with both words frequently occurring in Doodle poll titles and descriptions.
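To make "numerically closer" concrete, here is a minimal Python sketch using cosine similarity, a common way to compare word vectors. The three-dimensional vectors below are made up by hand for illustration. Real word2vec vectors are learned from the corpus and usually have hundreds of dimensions, and the values here say nothing about what the actual model produced.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors, NOT output of a trained model.
vectors = {
    "meeting": [0.8, 0.5, 0.1],
    "project": [0.9, 0.2, 0.0],
    "friends": [0.1, 0.9, 0.3],
}


def closest(word, candidates):
    """Return the candidate whose vector is most similar to that of `word`."""
    return max(
        candidates,
        key=lambda c: cosine_similarity(vectors[word], vectors[c]),
    )


for other in ("project", "friends"):
    score = cosine_similarity(vectors["meeting"], vectors[other])
    print(f"meeting vs {other}: {score:.3f}")
print("closest to 'meeting':", closest("meeting", ["project", "friends"]))
```

With these toy numbers "meeting" ends up closer to "project" than to "friends". In a trained model the answer depends entirely on how the word is used in the corpus.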

Low-dimensional embedding of semantic context in a word map

The output of the model can be summarised in a single figure, or word map. On this map, words that appear close together have a similar meaning in the context of event scheduling with Doodle. This produces the following result:

Word map of Doodle Keywords in German
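The idea of colouring clusters of nearby words on such a map can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not say which clustering method was used, so plain k-means stands in here, and the 2D coordinates and most of the German words are invented for the example rather than taken from the actual map.

```python
import math


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then update."""
    centroids = list(initial_centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels


# Invented 2D positions on a hypothetical word map.
word_positions = {
    "projekt": (0.9, 1.1),
    "sitzung": (1.0, 0.9),
    "workshop": (1.2, 1.0),
    "essen": (4.0, 4.2),
    "grillen": (4.1, 3.9),
    "geburtstag": (3.8, 4.0),
}

words = list(word_positions)
points = [word_positions[w] for w in words]
# Start from one point in each region so the toy run is deterministic.
labels = kmeans(points, [points[0], points[3]])

for word, label in zip(words, labels):
    print(f"cluster {label}: {word}")
```

In this toy layout the work-related words fall into one cluster and the leisure-related words into the other, which is the kind of grouping the colours on the map express.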

This word map represents the 2’000 most frequent words present in the title and description of Doodle polls in the German language. Colours represent clusters of similar words. Zooming in on this map reveals some interesting features:

By characterising such relations between the words used in Doodle polls, the Tamedia Digital Analytics team was able to help the Doodle product managers to better understand how people use the scheduling tool, in private and business scenarios. Such insights are precious for the Doodle team, hard at work to make an already great product an excellent one!

Thanks, Nicolas, for this insight into big data science, presented through the use case of Doodle’s user interaction. This is digital transformation in practice, and several improvements to the platform resulted immediately from these findings!