Language Learning with FOSS
Recently I got back into language learning: I wanted to learn to type properly on a Cyrillic keyboard, hoping that making it easier to write content in Russian would help me improve faster. Then I came across a video by Tsoding covering the hashmap implementation in the STB data structures library (stb_ds), which he used to build a byte encoding example. Inspired by this, I decided to write a C program of my own for frequency analysis.
The program is called
freqqy. It performs a frequency analysis of input text (.txt)
files and outputs the analysis to either a file of your choosing or the
console. I also realized that sometimes certain words are not relevant
to your analysis, like common words that appear in text ("as", "the",
"a", etc.). To overcome this, I also added the ability to pass in a
"filter file", which is just a collection of words you want excluded
from the analysis.
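The core of a tool like this is small. Here is a rough sketch of the idea, not freqqy's actual code: freqqy uses the stb_ds hashmap, while this sketch uses a plain dynamic array for clarity, and it handles ASCII only (Cyrillic text would need UTF-8-aware tokenizing).

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of word-frequency counting with a filter list.
 * freqqy itself uses the stb_ds hashmap; this uses a linear-scan
 * dynamic array and ASCII-only tokenizing for brevity. */
typedef struct { char word[64]; int count; } Entry;
typedef struct { Entry *items; int len, cap; } Freq;

/* Return index of word in the table, or -1 if absent. */
int freq_find(const Freq *f, const char *w) {
    for (int i = 0; i < f->len; i++)
        if (strcmp(f->items[i].word, w) == 0) return i;
    return -1;
}

/* Increment a word's count, inserting it if new. */
void freq_add(Freq *f, const char *w) {
    int i = freq_find(f, w);
    if (i >= 0) { f->items[i].count++; return; }
    if (f->len == f->cap) {
        f->cap = f->cap ? f->cap * 2 : 16;
        f->items = realloc(f->items, (size_t)f->cap * sizeof(Entry));
    }
    strncpy(f->items[f->len].word, w, 63);
    f->items[f->len].word[63] = '\0';
    f->items[f->len].count = 1;
    f->len++;
}

/* True if w appears in the filter list (the "filter file" contents). */
int is_filtered(const char *w, const char **filter, int nfilter) {
    for (int i = 0; i < nfilter; i++)
        if (strcmp(w, filter[i]) == 0) return 1;
    return 0;
}

/* Split text on non-alphabetic characters, lowercase each token,
 * and count everything not excluded by the filter list. */
void count_words(Freq *f, const char *text,
                 const char **filter, int nfilter) {
    char buf[64];
    int n = 0;
    for (const char *p = text;; p++) {
        if (isalpha((unsigned char)*p) && n < 63) {
            buf[n++] = (char)tolower((unsigned char)*p);
        } else if (n > 0) {
            buf[n] = '\0';
            if (!is_filtered(buf, filter, nfilter)) freq_add(f, buf);
            n = 0;
        }
        if (*p == '\0') break;
    }
}
```

Sorting the resulting array by count descending gives the analysis output.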
A text frequency analysis tool is a bit of a toy project, I'll admit. I
mostly wrote it to play with the STB library. However, in looking for
text to test it on, I realized that I have a pretty large collection of
research papers from my master's research sitting in a folder on my
desktop. I used pdftotext to extract the contents into
text, and then promptly added a bunch of artifacts to my filter file
(many of these papers include math notation like "dx", "dy", etc., which
I wanted excluded). What I ended up with was a pretty interesting
analysis summary that told me the top words across all the papers I had
read. I am studying game-theoretic optimal control methods for
multi-robot systems, so the top words were things like "game",
"controls", etc.
The thought that then crossed my mind was that this analysis solves a problem for me: the problem of coming up with words to learn in my target language.
Most of the vocabulary you want in language learning can be picked up just by listening to and reading comprehensible input. This is the 80/20 rule: you will learn roughly 80% of the words you need for fluency this way, because a relatively small set of words is disproportionately common in everyday language (think of English: pronouns like "I", "me", "you", articles, and so on are used constantly). The remaining 20% is trickier, though: those words are specific to particular topics, and without them you can't really follow content on those topics. Think of watching a documentary on space, and how hard it would be to follow without vocabulary like "cosmos", "planet", or "stars". Those words aren't common, so they fall in the 20%, but you would benefit greatly from knowing them in that context.
This is a problem that I would struggle with when travelling. For instance, when I visited Italy, I wanted to know some words that apply to my own life so that I could introduce myself and talk about myself to people I was meeting. I would sit in front of the computer and write up a list of words like "rocket" and "engineering", then translate them into my target language and make flashcards. This is a laborious task.
However, using freqqy, I can now feed in content that
I regularly interact with, run a frequency analysis, and immediately
see the top n words I would need to learn in order to talk
about 80% of that content. For instance, if I were going to a
conference in Italy, I could take the words that cover the top 80% of
occurrences in the analysis of my research paper collection and study
them with flashcards. I should then have enough vocabulary to discuss
my research with other conference attendees and understand what's
being said.
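The "top 80%" cutoff itself is just a cumulative sum over the sorted counts. A minimal sketch (a hypothetical helper, not something freqqy necessarily ships):

```c
/* Given word counts sorted in descending order, return how many of the
 * top words are needed to cover `coverage` (e.g. 0.8 for 80%) of all
 * word occurrences. Hypothetical helper for illustration. */
int words_for_coverage(const int *counts, int n, double coverage) {
    long total = 0, running = 0;
    for (int i = 0; i < n; i++) total += counts[i];
    for (int i = 0; i < n; i++) {
        running += counts[i];
        if ((double)running >= coverage * (double)total) return i + 1;
    }
    return n;
}
```

With counts like {50, 30, 10, 5, 5}, the top two words already cover 80% of occurrences, which is exactly the long-tail effect the 80/20 rule describes.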
This combines very nicely with Anki, FOSS software for spaced-repetition learning with flashcards. I wanted to try building a pipeline from a frequency analysis file to a collection of flashcards in a target language, but didn't get around to it. Filling out the English side would be straightforward, but the translation to the target language generally carries some nuance. I use Reverso Context a lot to make sure that a translation matches the context the original word was used in. For example, the English word "plane" can mean an airplane (the vehicle) or a 2D subset of 3D space in a linear algebra context. Google Translate does not capture this, so I check translations against context instead, and I don't see an easy way to automate that.
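Even without automated translation, getting from an analysis to a deck is mostly a matter of writing a file Anki's text importer understands: one note per line, fields separated by tabs. A sketch under that assumption (hypothetical helper name; the target-language column is left blank for manual translation):

```c
#include <stdio.h>

/* Write top words as tab-separated "front<TAB>back" lines for Anki's
 * text importer. The back (target-language) field is left empty, since
 * translation still needs human judgment. Hypothetical helper. */
int write_anki_tsv(const char *path, const char **words, int n) {
    FILE *fp = fopen(path, "w");
    if (!fp) return -1;
    for (int i = 0; i < n; i++)
        fprintf(fp, "%s\t\n", words[i]); /* empty second field */
    fclose(fp);
    return 0;
}
```

Importing the resulting file into Anki would create one card per word, ready for the translations to be filled in by hand.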
In any case, the current pipeline is to take some text in your native
language (video/audio transcripts, books, blog posts, research papers,
etc.) and feed it into freqqy to obtain an analysis
output. Then, the words covering the top 80% of occurrences can be
turned into Anki
flashcards for your target language and studied. By the time you master
the flashcard deck, you should be able to at least understand similar
content in your target language. This will allow you to target specific
topics that are of interest to you (and this seems to be a similar idea
to the science behind LingQ).