This project was developed for the Machine Learning course in the Master’s in BDMA at UPC. The goal is to predict the root node of syntactic dependency trees derived from sentences in a 21-language parallel corpus. Each tree is represented as a set of undirected edges, and root prediction is framed as a node-level binary classification problem: each node is either the root or not. Centrality measures (degree, closeness, betweenness, PageRank, and others) are computed for each node and used as features. The work covers data preprocessing, construction of an expanded node-level dataset, feature engineering, handling of class imbalance and dependency issues, and evaluation of several classification models. The repository includes the code for preprocessing, modeling, and evaluation.
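To make the node-level framing concrete, here is a minimal sketch of how one tree, given as undirected edges, could be expanded into one row per node with centrality features and a binary root label. It uses only the standard library and covers just degree and closeness centrality; the function names, the example tree, and the choice of root are hypothetical and are not taken from the repository, which may compute its features differently (for instance with a graph library).

```python
from collections import deque


def build_adjacency(edges):
    """Build an adjacency map from a list of undirected (u, v) edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_distances(adj, source):
    """Return hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def node_rows(edges, root):
    """Return one row per node: two centrality features and a root label.

    Assumes the edges form a connected tree with at least two nodes.
    """
    adj = build_adjacency(edges)
    n = len(adj)
    rows = []
    for node in sorted(adj):
        total_distance = sum(bfs_distances(adj, node).values())
        rows.append({
            "node": node,
            # Normalised degree: neighbours over the maximum possible (n - 1).
            "degree": len(adj[node]) / (n - 1),
            # Closeness: (n - 1) over the summed distance to all other nodes.
            "closeness": (n - 1) / total_distance,
            # Binary target: 1 for the root, 0 for every other node.
            "is_root": int(node == root),
        })
    return rows


# Hypothetical 5-node tree; node 3 is assumed to be the root.
example_edges = [(1, 3), (2, 3), (3, 4), (4, 5)]
rows = node_rows(example_edges, root=3)
for row in rows:
    print(row)
```

Because every tree contributes exactly one positive row and many negative ones, a dataset built this way is imbalanced by construction, which is why the class imbalance step mentioned above is needed.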
The key PDF files are:
- Final Report
- Appendix, containing all the visualizations
- Complete assignment description
Follow these steps to set up the environment and the Jupyter kernel:
- Create a virtual environment:
    python -m venv myenv
- Activate the virtual environment (the command below is for Windows; on macOS/Linux run source myenv/bin/activate instead):
    myenv\Scripts\activate
- Install the dependencies:
    pip install -r requirements.txt
- Add the virtual environment as a new Jupyter kernel:
    python -m ipykernel install --user --name=myenv --display-name "Python ML-project(myenv)"
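If it is unclear whether the environment was actually activated, or whether a notebook is running on the new kernel, a short check like the one below can help. This is only a sketch using the standard library; the helper name in_virtualenv is hypothetical and not part of the repository.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter:", sys.executable)
```

Run from a terminal after activation, or from a notebook cell after selecting the kernel, the interpreter path should point inside the myenv directory.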