Posted: September 26th, 2023
Assignment 1 CSCE 5290: Clustering languages based on phonetic traits
Languages are complex systems that evolve and change over time. One way to understand how languages are related is by examining their phonetic traits and determining how similar or different their sound systems are. Phonetic traits provide clues about how languages have developed from common ancestral languages or been influenced by other languages through language contact. Clustering languages based on similarity in phonetic features can help reconstruct the phylogenetic relationships between languages and gain insights into language evolution and history.
For this analysis, a dataset containing information on 11 languages and 30 phonetic features was used (Dataset, 2023). Each feature records whether a language has phonemes with a given phonetic property, coded as 0 (none), 1 (some), or 2 (many). To cluster the languages by phonetic similarity, two distance matrices were constructed: one using Euclidean distance and one using Dice distance, which the assignment refers to as the Dollo model. Euclidean distance is the square root of the summed squared differences between feature values, so it responds to every difference in coding, including the step from "some" to "many". Dice distance is computed from the features two languages share relative to the features each of them has, so features absent from both languages do not contribute to it (Bergsland & Vogt, 1962).
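The contrast between the two measures can be illustrated with a minimal Python sketch. The three languages and their feature values are hypothetical toy data, not the assignment dataset, and the sketch assumes that a feature counts as "present" for the Dice distance whenever its code is greater than 0. Other ways of applying Dice distance to three-level codes are possible.

```python
import math

# Hypothetical toy data: 3 languages x 5 features, coded 0/1/2.
features = {
    "LangA": [0, 1, 2, 0, 1],
    "LangB": [0, 1, 2, 0, 2],
    "LangC": [2, 0, 0, 1, 0],
}

def euclidean(a, b):
    """Square root of the summed squared differences between feature values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dice(a, b):
    """Dice distance on presence/absence: 1 - 2*shared / (present_a + present_b).

    Assumption: a feature is present when its code is greater than 0.
    Features absent in both languages never enter the formula.
    """
    pa = [x > 0 for x in a]
    pb = [y > 0 for y in b]
    shared = sum(1 for x, y in zip(pa, pb) if x and y)
    total = sum(pa) + sum(pb)
    return 0.0 if total == 0 else 1.0 - 2.0 * shared / total

def distance_matrix(data, metric):
    """Return the language names and the full symmetric distance matrix."""
    names = list(data)
    return names, [[metric(data[p], data[q]) for q in names] for p in names]

names, d_euc = distance_matrix(features, euclidean)
_, d_dice = distance_matrix(features, dice)
```

In this toy example LangA and LangB differ only in a "some" versus "many" coding, so their Euclidean distance is 1.0 while their Dice distance under the presence/absence assumption is 0.0.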
The UPGMA (Unweighted Pair Group Method with Arithmetic mean) algorithm was applied to both distance matrices to generate clustering trees (Sokal & Michener, 1958). UPGMA progressively joins language clusters based on average distances between all language pairs in the clusters. The Euclidean UPGMA tree grouped together the Indo-Aryan languages of Hindi, Urdu, and Punjabi in one cluster. Another cluster contained the Dravidian languages of Tamil and Malayalam. The Semitic languages of Arabic and Hebrew were also clustered together. However, the Dice UPGMA tree showed some differences, with Tamil and Malayalam splitting into two clusters instead of grouping together (Figure 1).
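The joining procedure can be sketched in a few lines of Python. This is an illustrative implementation under simple assumptions (a symmetric distance matrix given as a list of lists, ties broken by input order), not the code used to produce the reported trees. The function name `upgma` and the three-language matrix are hypothetical.

```python
def upgma(names, dist):
    """Return merges as (cluster_a, cluster_b, height) in join order.

    dist is a symmetric matrix (list of lists) aligned with names.
    Clusters are represented as tuples of language names.
    """
    clusters = [(n,) for n in names]
    # Distances between current clusters, keyed by an unordered pair.
    d = {frozenset((clusters[i], clusters[j])): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    merges = []
    while len(clusters) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        a, b = sorted(pair)
        height = d[pair] / 2.0            # node sits at half the distance
        new = a + b
        merges.append((a, b, height))
        clusters = [c for c in clusters if c not in (a, b)]
        # The size-weighted mean equals the average over all member pairs.
        for c in clusters:
            dac = d.pop(frozenset((a, c)))
            dbc = d.pop(frozenset((b, c)))
            d[frozenset((new, c))] = (len(a) * dac + len(b) * dbc) / len(new)
        del d[pair]
        clusters.append(new)
    return merges

# Hypothetical toy matrix for three languages.
names = ["A", "B", "C"]
dist = [[0, 2, 6], [2, 0, 8], [6, 8, 0]]
merges = upgma(names, dist)
```

Here A and B are joined first at height 1.0, and C then joins the pair at height 3.5, half of the average of its distances to A and B.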
A Neighbor-Joining tree was also constructed from the Dice distance matrix (Saitou & Nei, 1987). Neighbor-Joining differs from UPGMA in that it minimizes the total branch length at each step when joining language clusters. The resulting tree had a similar overall structure to the Dice UPGMA tree but with some rearrangements of internal branches (Figure 2). For example, Malayalam and Tamil were joined in a cluster separate from other languages in the Neighbor-Joining tree.
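For comparison with UPGMA, the Neighbor-Joining selection criterion and branch-length formulas of Saitou and Nei (1987) can be sketched as follows. The function name `neighbor_joining`, the output format, and the five-taxon matrix are assumptions made for illustration, and the matrix is a small textbook-style example rather than the Dice matrix from this analysis.

```python
def neighbor_joining(names, dist):
    """Return (joins, final_branch).

    joins: list of (node_a, len_a, node_b, len_b) in join order.
    final_branch: (node_x, node_y, length) connecting the last two nodes.
    """
    nodes = list(names)
    d = {(p, q): dist[i][j] for i, p in enumerate(names)
         for j, q in enumerate(names) if i != j}
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        total = {p: sum(d[p, q] for q in nodes if q != p) for p in nodes}
        # Q criterion: lowest for pairs that are close to each other
        # and far from all remaining nodes.
        a, b = min(((p, q) for i, p in enumerate(nodes) for q in nodes[i + 1:]),
                   key=lambda pq: (n - 2) * d[pq] - total[pq[0]] - total[pq[1]])
        # Branch lengths may differ, so the tree need not be ultrametric.
        len_a = d[a, b] / 2 + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a, b] - len_a
        new = f"({a},{b})"
        joins.append((a, len_a, b, len_b))
        rest = [x for x in nodes if x not in (a, b)]
        for x in rest:
            d[new, x] = d[x, new] = (d[a, x] + d[b, x] - d[a, b]) / 2
        nodes = rest + [new]
    x, y = nodes
    return joins, (x, y, d[x, y])

# Hypothetical five-taxon example matrix.
names = ["a", "b", "c", "d", "e"]
dist = [[0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0]]
joins, final_branch = neighbor_joining(names, dist)
```

Unlike UPGMA, the two branches created at a join can have unequal lengths (here 2 and 3 for the first join of a and b), which is one reason the two methods can disagree on the same distance matrix.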
Bayesian Phylogenetic Analysis
To further investigate relationships between the 11 languages and one unknown language, a Bayesian phylogenetic analysis was conducted in RevBayes (Höhna et al., 2016). A time-calibrated model was specified with a uniform prior on the root age between 3.4 and 10 thousand years (kyr) before present, based on estimates for the divergence of major language families (Dunn et al., 2011). Sanskrit was included as a fossil calibration with a uniform age uncertainty between 2.7 and 3.4 kyr, representing the earliest attested form of Indo-Aryan (Witzel, 2005). The model incorporated gamma-distributed rate variation across sites, variable branch rates, and estimation of root state frequencies (Lewis, 2001).
The maximum clade credibility tree from this analysis grouped Sanskrit and Hindi together, consistent with their historical relationship (Figure 3). It also placed the unknown language in a cluster with the Indo-Aryan languages, suggesting it is likely another Indo-Aryan language. The tree topology was generally congruent with the Dice distance-based trees, which supports the clustering results from the distance matrix analyses.
Clustering the 11 languages based on their phonetic traits using distance matrices and tree-building methods yielded coherent and interpretable groupings with linguistic and historical validity. The Dice distance metric, which disregards features absent from both languages, produced trees more consistent with known subgroupings than Euclidean distance. Neighbor-Joining and Bayesian phylogenetic analyses corroborated the major clusters identified by UPGMA on Dice distances.
Some differences between the UPGMA and Neighbor-Joining trees likely stem from their distinct algorithms – UPGMA averages distances while Neighbor-Joining minimizes total branch length. The Bayesian tree provided a time-calibrated evolutionary framework and fossil calibration to further resolve relationships. Overall, this dataset of phonetic traits proved useful for clustering languages based on phonological similarity, though a more extensive feature set may better discriminate between closely related languages. The unknown language was robustly placed within the Indo-Aryan group.
In summary, computational methods for clustering languages based on phonetic traits can reveal phylogenetic patterns concordant with historical linguistics. Integrating distance-based clustering with model-based phylogenetic analysis strengthens inferences about language relationships and evolution. Continued development of quantitative approaches will enhance understanding of language change and diversification over time.
References
Bergsland, K., & Vogt, H. (1962). On the validity of glottochronology. Current Anthropology, 3(2), 115-153.
Dataset. (2023). Assignment dataset [Data file]. Retrieved from Canvas.
Dunn, M., Greenhill, S. J., Levinson, S. C., & Gray, R. D. (2011). Evolved structure of language shows lineage-specific trends in word-order universals. Nature, 473(7345), 79–82.
Höhna, S., Landis, M. J., Heath, T. A., Boussau, B., Lartillot, N., Moore, B. R., … & Huelsenbeck, J. P. (2016). RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Systematic Biology, 65(4), 726-736.
Lewis, P. O. (2001). A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology, 50(6), 913-925.
Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4), 406-425.
Sokal, R. R., & Michener, C. D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38, 1409-1438.
Witzel, M. (2005). The dates of the Vedic texts and the Rigveda. In G. Elder, G. Bronkhorst & W. W. Meijer (Eds.), The Study of Hinduism (pp. 341-379). University of South Carolina Press.
Assignment 1 CSCE 5290, Fall 2023
Dr. Frederik Hartmann
Terms and Conditions
This is assignment 1 for the course NLP (CSCE 5290, Fall 2023); the deadline for the
assignment is 10/1/23, 11:59 pm. The conditions for this assignment are as follows:
• The assignment has to contain at least 1,000 words for Master’s students and
500 words for Bachelor’s students, no abstracts and no bibliography are allowed.
Assignment title, contents of tables, figure captions, model code, and section
headers do not count towards the word count.
• Two file types need to be submitted to Canvas: the assignment itself and the
code file(s) which you used in the analysis. Do not include Python code in your
main assignment text.
• Any methods used in the assignment may only be Bayesian models written in
STAN and visualizations thereof. Plotting software such as Tracer, FigTree, or
IcyTree is allowed.
• The programming needs to be done in Python and RevBayes.
• The code needs to be reproducible, i.e. it needs to run and produce the same
output as reported in your documentation.
• Answer and discuss the assignment questions (see below) in prose directly, no
bullet-point answers are allowed.
This assignment consists of two parts. Master’s students have to complete both parts
to receive full points, while Bachelor’s students have to complete part 1.
Assignment part 1
On Canvas, you will find a dataset called assignment df CS.csv which contains a
dataset with 11 languages and 30 characters. This dataset comes from a larger phonological classification dataset that classifies languages by how many of certain phonetic
features they have. Every character is a different phonetic feature and ‘0’ in the dataset
means that the language has no phonemes with features of that type, ‘1’ means the
language has some phonemes with features of that type, and ‘2’ means that the language has many phonemes with features of that type. The goal of this assignment is
to cluster the languages by similarity in their phonetic traits to understand better which
languages are phonetically closer.
First, construct two distance matrices of the languages in the dataset, one with
Euclidean distance and the other with the Dollo model (Dice distance). Afterwards,
apply the UPGMA algorithm to both and plot the results. Interpret and describe the
result in prose. Compare the results that both distance methods yield and discuss
briefly why they might yield different or similar results by referencing the differences
between the distance algorithms.
Next, construct a NeighborJoining tree or network from the Dice distance matrix
and plot the results. Here too, interpret the results and compare them with the UPGMA results for the same distance matrix by referencing the difference
between the UPGMA and NeighborJoining algorithms.
Assignment part 2
On Canvas, you will find a nexus-format dataset called assignmentdata CS.nex which
contains a dataset with 11 languages and an unknown language. The dataset is the
same as above, but the coding here is 1,2,3 (instead of 0,1,2).
Write a RevBayes phylogenetic model that fulfills the following criteria:
1. Is an actual-time calibrated phylogenetic model that infers the root age to better
cluster the languages (with a uniform prior on age between 3.4 and 10)
2. Has Sanskrit as a fossil with a uniform time uncertainty interval between 2.7 and
3.4
3. Assumes that Sanskrit is the predecessor of Hindi
4. Includes:
(a) gamma-distributed site-rate variation
(b) variable branch rates
(c) root frequency estimation
The Q matrix has to be constructed differently from binary datasets since we have
three states per site. Do this with this code:
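# Exchangeability rates between the three states (flat Dirichlet prior)
er_prior <- v(1,1,1)
er ~ dnDirichlet(er_prior)
moves.append( mvBetaSimplex(er, weight=3) )
moves.append( mvDirichletSimplex(er, weight=1) )
# Stationary state frequencies (flat Dirichlet prior)
pi_prior <- v(1,1,1)
pi ~ dnDirichlet(pi_prior)
moves.append( mvBetaSimplex(pi, weight=2) )
moves.append( mvDirichletSimplex(pi, weight=1) )
# Three-state rate matrix built from the rates and frequencies
Q := fnGTR(er,pi)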
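To make the role of er and pi concrete, the following Python sketch builds a three-state GTR rate matrix from exchangeabilities and stationary frequencies using the standard GTR definition. It is an illustration only. The function name `gtr_rate_matrix`, the ordering of the exchangeabilities, and the rescaling to a mean rate of 1 are assumptions of this sketch and are not taken from the assignment or guaranteed to match RevBayes internals.

```python
def gtr_rate_matrix(er, pi):
    """Build a normalised GTR rate matrix for len(pi) states.

    er: exchangeabilities, assumed to be ordered by state pair
        (1,2), (1,3), (2,3) for three states
    pi: stationary frequencies of the states
    """
    k = len(pi)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    q = [[0.0] * k for _ in range(k)]
    for r, (i, j) in zip(er, pairs):
        q[i][j] = r * pi[j]  # rate of change from state i to state j
        q[j][i] = r * pi[i]  # rate of change from state j to state i
    for i in range(k):
        q[i][i] = -sum(q[i])  # diagonal makes each row sum to zero
    # Assumed convention: rescale so the expected rate of change is 1.
    mean_rate = -sum(pi[i] * q[i][i] for i in range(k))
    return [[x / mean_rate for x in row] for row in q]

# With equal exchangeabilities and frequencies, all changes are equally likely.
q = gtr_rate_matrix(er=[1 / 3, 1 / 3, 1 / 3], pi=[1 / 3, 1 / 3, 1 / 3])
```

With three states there are three unordered state pairs, which is why both er_prior and pi_prior above have length three.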
In the text, describe the model and justify your modelling and prior choices.
In a last step, plot the MCC or consensus tree and interpret the tree topology by comparing it to the UPGMA tree obtained from the Dice distance matrix in Assignment part
1 (above). Discuss the differences and similarities of all three trees and briefly discuss
why these differences/similarities might arise by referencing the different methods with
which the trees were constructed. Further discuss the following questions briefly: (1)
What can we say about how well this dataset helps us understand phonological similarity between these languages (i.e., is it useful for this question)? (2) How does the
Unknown language cluster in the tree? (3) Which language(s) is it closest to?