
Anthropology and biology frequently employ metrics that describe similarities and differences between languages. A key use is comparing whether the languages that individuals speak and the genes that they carry have diverged in similar or different ways. These comparisons can provide insight into the kinds of historical processes that have affected communities, as well as general information about broader human evolutionary processes.
To make language comparisons, we need a quantitative way to compare languages. The most commonly used approaches focus on a key unit of languages: words. Most metrics used for evolutionary comparisons are based on 'edit distances', the number of changes required to turn one word into another (such as the one change required to turn 'film' into 'firm').
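As a rough illustration of the idea, here is a minimal Python sketch of a standard (unweighted) edit distance, often called the Levenshtein distance. The function name `edit_distance` is our own choice, and this is not the ALINE method itself, only the simple baseline that it improves upon.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the processed prefix of a and the first j symbols of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning i symbols of a into an empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (free if they match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("film", "firm"))  # 1
print(edit_distance("film", "film"))  # 0
```

Every change costs exactly one unit here, whichever sounds are involved.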
However, some sounds change more easily than others, and standard edit distances do not take these phonetic features into account. We can see the reason for this in the figure above, which shows where some different vowel sounds are articulated in the mouth. The sounds in the English words 'bit' and 'bet' are articulated closer together in the mouth than 'bit' and 'bat'. As languages change over time, we might expect to observe more changes between similar sounds and fewer changes between quite different sounds.
We therefore need a computational method in which rare sound changes are weighted more heavily than common ones. Any language distance must also be flexible enough for use across a wide range of downstream linguistic, biological and evolutionary settings.
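To make the idea of weighting concrete, the following Python sketch replaces the fixed substitution cost of a standard edit distance with one that depends on how far apart two vowels are. The vowel positions in `VOWEL_HEIGHT` are invented for illustration and use plain letters rather than IPA symbols. They are assumptions, not the feature values or weights used by any published method.

```python
# Hypothetical vowel positions on a single "height" scale (illustrative values only).
VOWEL_HEIGHT = {"i": 0.0, "e": 1.0, "a": 2.0}
MAX_HEIGHT_GAP = 2.0


def substitution_cost(x: str, y: str) -> float:
    """Cost of swapping x for y: small for nearby vowels, 1.0 for unrelated symbols."""
    if x == y:
        return 0.0
    if x in VOWEL_HEIGHT and y in VOWEL_HEIGHT:
        return abs(VOWEL_HEIGHT[x] - VOWEL_HEIGHT[y]) / MAX_HEIGHT_GAP
    return 1.0


def weighted_edit_distance(a: str, b: str, indel_cost: float = 1.0) -> float:
    """Edit distance in which substitutions are priced by substitution_cost."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                   # delete ca
                curr[j - 1] + indel_cost,               # insert cb
                prev[j - 1] + substitution_cost(ca, cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]


print(weighted_edit_distance("bit", "bet"))  # 0.5
print(weighted_edit_distance("bit", "bat"))  # 1.0
```

Under these toy costs, 'bit' is closer to 'bet' than to 'bat', whereas a standard edit distance scores both pairs as one change.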
The ALINE metric is one such versatile approach. This method considers a pair of words with the same fundamental meaning from two languages, ideally cognates: words that descend from a common ancestral form, such as English 'hound' and German 'Hund'. The two words are aligned, and the phonetic changes between them are used to generate a similarity score. Crucially, the comparison explicitly considers features that are important to language change, including specialized concepts such as prosody (the tone or accent of a syllable), place, manner, phonation, vowel length and color.
Most written languages do not have enough symbols to represent all of their spoken sounds uniquely, so linguists use a special set of symbols, the International Phonetic Alphabet (IPA), which contains enough symbols to represent all sounds in all languages. You do not need to know the IPA symbols to follow how languages are compared, but if you are interested in knowing what sounds the symbols represent, click on the IPA link to find out.
To calculate ALINE similarity scores between languages, we can use the R package, alineR (Downey et al., 2017).
As an example, in the islands of eastern Indonesia, the word for 'moon' is 'bulan' in the Kolhua language of Timor, but 'wulaŋ' in the Wunga language of Sumba. Since they sound similar, we might think (correctly) that 'bulan' and 'wulaŋ' are cognates – related words – but how can we place a number on this similarity?
The ALINE algorithm first aligns the two words. Here, the same elements are shown in black, with different elements shown in red:
| – b u l a n |
| w – u l a ŋ |
The method then considers how similar the sounds are. For instance, 'b' and 'w' are produced by only a small change in how you place your lips. (Try saying 'b' and 'w' out loud). Similarly, 'n' and 'ŋ' – the 'ng' sound in English 'sing' [IPA sɪŋ] – are also quite similar sounds.
In this case, the ALINE algorithm assigns the difference between 'bulan' and 'wulaŋ' a distance of 0.28. Interested readers can see the math behind this number in Downey et al. (2008), but for our purposes, it is not important to know the details. The key idea is that similar words have small distances, with identical words being given a distance of zero. Words that are not related at all, such as Sumbanese 'wulaŋ' and English 'mun' (the IPA representation of 'moon'), would produce a much larger distance (here, 0.6).
Once a distance has been assigned for two words with the same meaning in two languages (like 'bulan' and 'wulaŋ' above), this process can be extended to many words, normalizing across them to obtain an 'average' linguistic distance between two languages. In a similar process, comparisons can be made in pairwise fashion across multiple languages, in this way generating a matrix of distances between many spoken languages.
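The two steps described above, averaging over words and then comparing every pair of languages, can be sketched in Python as follows. This is a simplified illustration under several assumptions: it uses a length-normalized standard edit distance in place of ALINE, takes a plain mean across meanings, and runs on three invented languages (`A`, `B`, `C`) whose word forms are made up and do not come from any real dataset.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Standard edit distance (a stand-in for a phonetic word distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Scale the edit distance to the range 0 (identical) to 1 (entirely different)."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))


def language_distance(words_x: dict, words_y: dict) -> float:
    """Mean word distance over the meanings recorded in both languages."""
    shared = sorted(set(words_x) & set(words_y))
    if not shared:
        raise ValueError("the two word lists share no meanings")
    total = sum(normalized_edit_distance(words_x[m], words_y[m]) for m in shared)
    return total / len(shared)


def distance_matrix(wordlists: dict) -> dict:
    """Symmetric matrix of pairwise language distances, stored as nested dicts."""
    names = sorted(wordlists)
    matrix = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = language_distance(wordlists[a], wordlists[b])
    return matrix


# Invented word forms for three hypothetical languages (not real data).
toy_wordlists = {
    "A": {"one": "pata", "two": "kuli"},
    "B": {"one": "bata", "two": "kuli"},
    "C": {"one": "sopo", "two": "meri"},
}

matrix = distance_matrix(toy_wordlists)
print(matrix["A"]["B"])  # 0.125
print(matrix["A"]["C"])  # 0.875
```

Restricting the average to shared meanings is a design choice in this sketch, so that a missing word in one list does not count as a difference.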
Once we have a matrix of distances, we can (in some cases) represent this by a tree – a branching diagram that places similar languages close together and different languages far apart. A tree representation is often the ultimate goal of language comparisons, as a key way of showing relationships between many different languages.
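One simple way to turn a distance matrix into a tree is average-linkage clustering (UPGMA), which repeatedly joins the two closest groups. The Python sketch below is only an illustration: the text does not say which tree-building method is used, other methods such as neighbor joining are also common, and the four languages and their distances are invented.

```python
def upgma_tree(names, distances):
    """Build a tree by average-linkage clustering; returns a Newick-style string.

    distances maps (name_a, name_b) tuples to a distance, one entry per pair.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of languages in it
    d = {frozenset(pair): value for pair, value in distances.items()}
    while len(sizes) > 1:
        a, b = sorted(min(d, key=d.get))  # the closest pair of clusters
        merged = f"({a},{b})"
        # Distance from the merged cluster to each other cluster: size-weighted average.
        for other in sizes:
            if other in (a, b):
                continue
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        # Discard all distances that involve the two clusters just merged.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)) + ";"


# Invented distances between four hypothetical languages (not real data).
toy_distances = {
    ("A", "B"): 0.1,
    ("C", "D"): 0.2,
    ("A", "C"): 0.6,
    ("A", "D"): 0.6,
    ("B", "C"): 0.6,
    ("B", "D"): 0.6,
}

print(upgma_tree(["A", "B", "C", "D"], toy_distances))  # ((A,B),(C,D));
```

The output groups A with B and C with D, matching the small distances within each pair. Branch lengths are omitted to keep the sketch short.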
You can begin to explore how this process works using the following app. It shows words for the same underlying idea (e.g., moon) across multiple languages found on the eastern Indonesian islands of Sumba, Flores and Timor. You can explore how the words differ between languages, and from these differences, you can build trees of relationships between languages on Sumba, Flores and Timor.
Questions:
- How are languages related between islands? Do languages tend to be more similar within islands, or are sister languages found on different islands?
- How do the trees change as you increase the number of words in the dataset? What is the effect of adding more words? What does this tell you about the importance of sample size, and about the distinction between signal and noise?
References:
Downey SS, Hallmark B, Cox MP, Norquest P, Lansing JS. 2008. Computational feature-sensitive reconstruction of language relationships: developing the ALINE distance for comparative historical linguistic reconstruction. Journal of Quantitative Linguistics 15:340-69.
Downey SS, Sun G, Norquest P. 2017. alineR: an R package for optimizing feature-weighted alignments and linguistic distances. The R Journal 9:138-52.