As a programmer, I think a lot about automation, and I am always on the lookout for bottlenecks and repetitive tasks that could be done programmatically. My philosophy is this: if a task takes 15 minutes, and writing some quick code to automate it would take roughly the same amount of time, it is always worth trying to automate it.
At Wibbu Studios, we are building something unique, and the challenges we face are quite different from the engineering challenges I have faced in the past. Unlike in my previous roles, at Wibbu I talk to people from a wide variety of backgrounds, and I usually find their problems very interesting. So when Dean, our CEO, approached me to help him automate some language analysis, I was intrigued.
Dean introduced me to the Natural Language Toolkit, or NLTK, a Python library for natural language processing. Natural language processing is what powers assistants like Siri and Google Assistant on your phone, and it is also the technology behind the countless chatbots on Facebook and other messaging services.
The challenge at hand was that we needed to find out how frequently words from a given language-learning curriculum were being used in our game’s dialogue script. Before revealing how we managed to solve the problem, here is an explanation of a few key concepts and features of NLTK.
The first key concept in NLTK is tokenization, the process of breaking text up into individual words or sentences by splitting on spaces, tabs, new lines, or special characters.
NLTK’s tokenizers are smart: if we run the sentence tokenizer on “Hello, Mr. Jacobs. Nice to meet you!”, it understands that the full stop after ‘Mr’ does not end the sentence.
The next feature of NLTK that we use is Part of Speech (POS) tagging. NLTK’s POS tagger assigns a part of speech to each word, and it is very accurate. Because it works from context, it can differentiate between “play” used as a noun and “play” used as a verb.
Here is a list of all the POS tags and what they mean.
The third feature of NLTK that we used is lemmatization, which finds the root form of a given word: “say” for “said”, “good” for “better”, “be” for “is/was/are”, and so on. This was very helpful because the curriculum we had to check against contained only root words.
By default, the lemmatizer assumes the word is a noun, but it’s possible to specify the word type as noun (‘n’), verb (‘v’), adjective (‘a’), or adverb (‘r’).
So with these three core features, I had enough understanding of NLTK to begin coding a solution. Here is the code that reads a file of text (“input.txt”), finds the frequency of each word from a vocab list (“words.txt”), and writes that data to a file (“output.txt”).
NLTK does come with a few limitations, and we needed to find workarounds for these. Neither the POS tagger nor the lemmatizer is perfect, so in a few places I had to customise or update the POS tags for our own specific purposes. In the example here, I mark every occurrence of “not” with NEG (negation), a custom POS tag.
The analysis we have done so far has been on a per-word basis. However, the vocabulary list we use also contains phrases and expressions, and I look forward to finding a solution for those too.
The tool is in its early stages, but it’s great to work on something that will speed up our processes and free up valuable linguist and developer time in the near future. I’ll be updating the functionality of the tool and improving its ability to analyse grammatical structures over the coming weeks, while also making it more user-friendly. I’ll also be connecting this tool to Google Docs and Google Sheets so that we can do all the language analysis in real time, as our Scriptwriter is writing the story or our Language Team is updating the curriculum data. I’ll post again once I make further breakthroughs with this!
Written by Amaan Rizvi, Game Developer and Producer.