Language Discovery

Visualizing hidden features of languages and their connections to each other


The goal of this project is to uncover new and interesting facts about the languages we know and use every day. I am particularly interested in what connection a given language has with another, and whether there is a way to quantify how similar a set of languages is.

The data sets were mostly collected from frequency dictionaries that catalog the most commonly used words in a particular language. From each source, I took the first 900 words and organized them into a "language matrix." I also used a hash table, together with an algorithm I wrote, to find connections between the various languages. More details on that below.
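To make the hash-table idea concrete, here is a minimal sketch of one way such a lookup could work. It is not the algorithm used in this project: the function names `build_word_index` and `shared_word_counts`, the choice to match words by exact lowercase spelling, and the tiny word lists are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_word_index(word_lists):
    """Map each spelling to the set of languages whose list contains it."""
    index = defaultdict(set)
    for language, words in word_lists.items():
        for word in words:
            index[word.lower()].add(language)
    return index


def shared_word_counts(index):
    """Count, for every pair of languages, how many spellings they share."""
    counts = defaultdict(int)
    for languages in index.values():
        for pair in combinations(sorted(languages), 2):
            counts[pair] += 1
    return dict(counts)


# Toy lists for illustration only; the real lists hold 900 words each.
toy_lists = {
    "English": ["no", "a", "the"],
    "Spanish": ["no", "a", "el"],
    "French": ["a", "le"],
}

index = build_word_index(toy_lists)
counts = shared_word_counts(index)
print(counts)
```

A dictionary keyed by spelling keeps each lookup at roughly constant time, so comparing all nine lists only requires a single pass over the words. Matching by spelling alone will also count false friends, words that look the same but mean different things, so the counts are at best a rough signal of similarity.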

I tried to pick a diverse set of languages, but I was constrained by the fact that each language must use the Latin alphabet for my analysis to work. The languages included are:


  1. English
  2. Spanish
  3. French
  4. Dutch
  5. German
  6. Catalan
  7. Italian
  8. Turkish
  9. Esperanto


How long are the most common words?

We analyze this in two ways: first we compute the average word length among the top 10% (the first 90 of the 900 words), and then we do the same for the entire collection. What follows are visualizations describing both.

Interestingly, among the top 10% (the 90 most commonly used words), Turkish appears to have the longest words, followed by Esperanto, a constructed language created in the 19th century. Every other language averages around 3.3 characters, except Spanish and Catalan.

As more words are included, Spanish has the longest words in the language group, followed closely by Turkish and Catalan. The differences between the languages' average lengths have narrowed. Another obvious trend is that every language's average length increased. This suggests that the most common words also tend to be the shortest words.

French changes the most as more words are added, and the average increases for every language, which supports the previous claim.
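The two averages compared above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `average_length` and `length_summary` are hypothetical, the word list is assumed to be sorted from most to least frequent, and the sample words are made up rather than taken from the project's data.

```python
def average_length(words):
    """Mean number of characters per word."""
    return sum(len(word) for word in words) / len(words)


def length_summary(ranked_words, top_n):
    """Return (average over the top_n most frequent words, average over all words).

    ranked_words is assumed to be ordered from most to least frequent.
    """
    return average_length(ranked_words[:top_n]), average_length(ranked_words)


# Made-up sample; in the project each list has 900 words and top_n would be 90.
sample = ["a", "of", "the", "and", "that", "have", "with", "this", "from", "because"]

top_avg, overall_avg = length_summary(sample, top_n=2)
print(top_avg, overall_avg)
```

Note that `len` counts characters, so accented letters such as "é" or "ç" each count as one, provided the text is stored in a composed (single code point) form.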

The most common colors in every language

We model this problem as a graph in which each language is a node. An edge between two languages means that they have a color in common, and the edge is drawn in that shared color.
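As a rough sketch of this model, the edge list could be built as follows. It assumes that "a color in common" means the same color concept appears among both languages' most common words. The function name `color_edges` and the toy data are hypothetical and do not reflect the project's actual results.

```python
from itertools import combinations


def color_edges(colors_by_language):
    """Return one (language_a, language_b, color) edge per shared color."""
    edges = []
    for a, b in combinations(sorted(colors_by_language), 2):
        # Set intersection gives the colors both languages have in common.
        for color in sorted(colors_by_language[a] & colors_by_language[b]):
            edges.append((a, b, color))
    return edges


# Toy data for illustration only, not the project's findings.
toy_colors = {
    "English": {"red", "blue"},
    "Spanish": {"red"},
    "French": {"blue", "red"},
}

edges = color_edges(toy_colors)
for edge in edges:
    print(edge)
```

Two languages that share several colors get several parallel edges, one per color, so the result is a multigraph rather than a simple graph.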

The graph can be found here.