
CELESTA


CELESTA is a hybrid Entity Disambiguation (ED) framework designed for low-resource languages. In a case study on Indonesian, CELESTA performs parallel mention expansion using both multilingual and monolingual Large Language Models (LLMs). It then applies a similarity-based selection mechanism to choose the expansion that is most semantically aligned with the original context. Finally, the selected expansion is linked to a knowledge base entity using an off-the-shelf ED model, without requiring any fine-tuning. The architecture of CELESTA is shown below:

![CELESTA architecture](images/celesta_architecture.jpg)
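To make the selection step concrete, here is a minimal sketch of similarity-based selection, assuming a multilingual sentence-embedding model from sentence-transformers. The model name, helper function, and example strings are illustrative, not the exact components CELESTA uses:

```python
# Illustrative sketch of a similarity-based selection step.
# Assumption: any multilingual sentence-embedding model works here; the
# exact model used by CELESTA may differ.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def select_expansion(context: str, candidates: list[str]) -> str:
    """Return the candidate expansion most similar to the original context."""
    ctx_emb = embedder.encode(context, convert_to_tensor=True)
    cand_embs = embedder.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(ctx_emb, cand_embs)[0]  # cosine similarity per candidate
    return candidates[int(scores.argmax())]

# Example: one expansion from a multilingual LLM, one from a monolingual LLM.
context = "Jokowi meresmikan jalan tol baru di Sumatera."
candidates = [
    "Joko Widodo (Presiden Indonesia)",    # hypothetical multilingual-LLM output
    "Jokowi, Presiden Republik Indonesia", # hypothetical monolingual-LLM output
]
print(select_expansion(context, candidates))
```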

📂 Repository Structure

CELESTA/
├── datasets/                     # Input datasets (IndGEL, IndQEL, IndEL-WIKI)
├── entity_disambiguation/        # CELESTA-mGENRE installation and run scripts
├── images/                       # Architecture visualizations
│   └── celesta_architecture.jpg
├── src/                          # Source code for CELESTA modules
│   └── mention_expansion/        # Mention expansion scripts
├── requirements.txt              # Python dependencies
├── README.md                     # Project overview
└── LICENSE                       # License file

⚙️ Installation

  1. Clone the repository
   
   git clone https://github.com/dice-group/CELESTA.git
   cd CELESTA 
  2. Create the environment

conda create -n celesta python=3.10
conda activate celesta
pip install -r requirements.txt

  3. Install CELESTA-mGENRE

# Change to the entity_disambiguation directory
cd entity_disambiguation

# Run the installation script for CELESTA-mGENRE
bash INSTALL-CELESTA-mGENRE.sh

Evaluation

📊 Datasets

CELESTA is evaluated on three Indonesian Entity Disambiguation (ED) datasets: IndGEL, IndQEL, and IndEL-WIKI.

  • IndGEL (general domain) and IndQEL (specific domain) are from the IndEL dataset.
  • IndEL-WIKI is a new dataset we created to provide additional evaluation data for CELESTA.
| Dataset property | IndGEL | IndQEL | IndEL-WIKI |
|---|---:|---:|---:|
| Sentences | 2,114 | 2,621 | 24,678 |
| Total entities | 4,765 | 2,453 | 24,678 |
| Unique entities | 55 | 16 | 24,678 |
| Entities per sentence | 2.4 | 1.6 | 1.0 |
| Train set sentences | 1,674 | 2,076 | 17,172 |
| Validation set sentences | 230 | 284 | 4,958 |
| Test set sentences | 230 | 284 | 4,958 |
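To get a feel for the data, the snippet below loads one split and prints basic statistics. This is a hypothetical example: it assumes the splits are stored as JSON files under datasets/, and the path and record schema shown here are illustrative.

```python
# Hypothetical example of inspecting one dataset split. Assumption: splits
# are stored as JSON under datasets/<dataset>/; the file name and record
# schema are illustrative, not guaranteed by the repository.
import json
from pathlib import Path

split_path = Path("datasets/IndGEL/test_set.json")  # illustrative path
with split_path.open(encoding="utf-8") as f:
    records = json.load(f)

print(f"{split_path}: {len(records)} records")
print(records[0])  # inspect one record to see the annotation schema
```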

🤖 Large Language Models (LLMs)

CELESTA pairs two LLMs and runs them in parallel: one multilingual model and one Indonesian monolingual model. The configurations evaluated in this repository are:

  • Multilingual LLMs: LLaMA-3, Mistral
  • Monolingual (Indonesian) LLMs: Komodo, Merak
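The sketch below shows how the parallel expansion step can be reproduced with the transformers library: the same prompt goes to one model from each group, producing the two candidates that the selection step compares. The Llama-3 checkpoint matches the usage example further below; the Komodo checkpoint ID and the prompt wording are assumptions, not CELESTA's exact prompts (those live in src/mention_expansion/).

```python
# Illustrative sketch of the parallel mention-expansion step. The prompt
# wording and the Komodo checkpoint ID are assumptions.
from transformers import pipeline

PROMPT = (
    "Expand the mention '{mention}' in the sentence below into an "
    "unambiguous description of the entity.\n"
    "Sentence: {sentence}\n"
    "Expansion:"
)

def expand(model_id: str, mention: str, sentence: str) -> str:
    """Ask one LLM for a mention expansion; loads the model per call for brevity."""
    generator = pipeline("text-generation", model=model_id)
    prompt = PROMPT.format(mention=mention, sentence=sentence)
    out = generator(prompt, max_new_tokens=32, return_full_text=False)
    return out[0]["generated_text"].strip()

sentence = "Jokowi meresmikan jalan tol baru di Sumatera."
candidates = [
    expand("meta-llama/Meta-Llama-3-70B-Instruct", "Jokowi", sentence),  # multilingual
    expand("Yellow-AI-NLP/komodo-7b-base", "Jokowi", sentence),          # monolingual (assumed ID)
]
# `candidates` then feeds the similarity-based selection step shown earlier.
```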

🚀 Usage

Mention Expansion

  1. Run Mention Expansion
# Change directory to the src folder
cd src

# Run the mention expansion script
# usage: mention_expansion.py [-h] [--model_name MODEL_NAME] [--prompt_type PROMPT_TYPE] [--dataset DATASET] [--split SPLIT] [--llm_name LLM_NAME] [--input_dir INPUT_DIR]
#                            [--output_dir OUTPUT_DIR] [--batch_size BATCH_SIZE] [--save_every SAVE_EVERY] [--save_interval SAVE_INTERVAL]

python mention_expansion.py --model_name meta-llama/Meta-Llama-3-70B-Instruct --prompt_type few-shot --dataset IndGEL --llm_name llama-3

  2. Entity Disambiguation

Entity Disambiguation with mGENRE

# Change to mGENRE directory
cd entity_disambiguation/GENRE/CELESTA-mGENRE

# Run CELESTA-mGENRE on a mention-expansion output file
bash run-CELESTA-mGENRE.sh  ../../../../results/mension_expansion/celesta/IndGEL/few-shot_llama-3_komodo/test_set.json
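For orientation, mGENRE is also available as a public Hugging Face checkpoint (facebook/mgenre-wiki), and the generic linking pattern looks like the sketch below, with the (expanded) mention wrapped in [START] ... [END] markers. This is a sketch of the generic pattern, not the repo's wrapper script; the example sentence is illustrative.

```python
# Generic mGENRE linking example with the public facebook/mgenre-wiki
# checkpoint; run-CELESTA-mGENRE.sh wraps its own setup, so this is an
# orientation sketch, not CELESTA's exact pipeline.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/mgenre-wiki")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mgenre-wiki").eval()

# The (expanded) mention is wrapped in [START] ... [END].
sentence = "[START] Joko Widodo, Presiden Indonesia [END] meresmikan jalan tol baru."

outputs = model.generate(
    **tokenizer(sentence, return_tensors="pt"),
    num_beams=5,
    num_return_sequences=5,
)
# mGENRE generates entity names in the form "<title> >> <language>".
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```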

📈 Results

The table below compares CELESTA with two baseline ED models (ReFinED and mGENRE) across the three evaluation datasets. Bold values indicate the highest score for each metric within a dataset.

| Dataset | Model | Precision | Recall | F1 |
|---|---|---:|---:|---:|
| IndGEL | ReFinED | **0.749** | 0.547 | 0.633 |
| | mGENRE | 0.742 | 0.718 | 0.730 |
| | CELESTA (ours) | 0.748 | **0.722** | **0.735** |
| IndQEL | ReFinED | 0.208 | 0.160 | 0.181 |
| | mGENRE | **0.298** | **0.298** | **0.298** |
| | CELESTA (ours) | **0.298** | **0.298** | **0.298** |
| IndEL-WIKI | ReFinED | **0.627** | 0.327 | 0.430 |
| | mGENRE | 0.601 | 0.489 | 0.539 |
| | CELESTA (ours) | 0.595 | **0.495** | **0.540** |

The table below reports Precision (P), Recall (R), and F1 for CELESTA and individual LLM configurations across the three datasets, under zero-shot and few-shot prompting. Bold values indicate the highest F1 score within each dataset and prompting setting. The following results are obtained when CELESTA uses ReFinED to generate candidate entities and retrieve the corresponding Wikidata URIs.

| Dataset | Model | Zero-shot P | Zero-shot R | Zero-shot F1 | Few-shot P | Few-shot R | Few-shot F1 |
|---|---|---:|---:|---:|---:|---:|---:|
| IndGEL | LLaMA-3 | 0.727 | 0.499 | **0.592** | 0.777 | 0.531 | 0.631 |
| | Mistral | 0.699 | 0.411 | 0.517 | 0.806 | 0.310 | 0.448 |
| | Komodo | 0.709 | 0.447 | 0.548 | 0.704 | 0.527 | 0.603 |
| | Merak | 0.654 | 0.441 | 0.526 | 0.749 | 0.547 | 0.633 |
| | *CELESTA with ReFinED:* | | | | | | |
| | LLaMA-3 & Komodo | 0.731 | 0.437 | 0.547 | 0.757 | 0.513 | 0.612 |
| | LLaMA-3 & Merak | 0.688 | 0.431 | 0.530 | 0.802 | 0.586 | **0.677** |
| | Mistral & Komodo | 0.719 | 0.390 | 0.506 | 0.781 | 0.344 | 0.478 |
| | Mistral & Merak | 0.678 | 0.402 | 0.505 | 0.779 | 0.503 | 0.611 |
| IndQEL | LLaMA-3 | 0.154 | 0.051 | 0.077 | 0.327 | 0.058 | 0.099 |
| | Mistral | 0.179 | 0.131 | 0.151 | 0.072 | 0.029 | 0.042 |
| | Komodo | 0.158 | 0.116 | 0.134 | 0.208 | 0.160 | **0.181** |
| | Merak | 0.203 | 0.149 | **0.172** | 0.142 | 0.106 | 0.121 |
| | *CELESTA with ReFinED:* | | | | | | |
| | LLaMA-3 & Komodo | 0.138 | 0.047 | 0.071 | 0.282 | 0.073 | 0.116 |
| | LLaMA-3 & Merak | 0.160 | 0.113 | 0.132 | 0.130 | 0.098 | 0.112 |
| | Mistral & Komodo | 0.138 | 0.095 | 0.112 | 0.107 | 0.047 | 0.066 |
| | Mistral & Merak | 0.196 | 0.146 | 0.167 | 0.128 | 0.095 | 0.109 |
| IndEL-WIKI | LLaMA-3 | 0.581 | 0.234 | 0.332 | 0.639 | 0.322 | 0.428 |
| | Mistral | 0.565 | 0.232 | 0.329 | 0.552 | 0.201 | 0.294 |
| | Komodo | 0.592 | 0.256 | 0.357 | 0.591 | 0.270 | 0.370 |
| | Merak | 0.591 | 0.285 | **0.385** | 0.548 | 0.293 | 0.382 |
| | *CELESTA with ReFinED:* | | | | | | |
| | LLaMA-3 & Komodo | 0.577 | 0.234 | 0.332 | 0.639 | 0.322 | 0.428 |
| | LLaMA-3 & Merak | 0.596 | 0.273 | 0.374 | 0.641 | 0.355 | **0.457** |
| | Mistral & Komodo | 0.576 | 0.231 | 0.330 | 0.575 | 0.219 | 0.317 |
| | Mistral & Merak | 0.564 | 0.248 | 0.345 | 0.581 | 0.270 | 0.369 |
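For context on the candidate-generation step used above, ReFinED exposes a small Python API (https://github.com/amazon-science/ReFinED). The sketch below shows the generic usage pattern, not the exact wrapper CELESTA uses; the model_name and entity_set values follow ReFinED's documented options, and the example sentence is illustrative.

```python
# Generic ReFinED usage sketch (not CELESTA's exact wrapper).
from refined.inference.processor import Refined

# Load a pretrained ReFinED model that links against Wikidata entities.
refined = Refined.from_pretrained(
    model_name="wikipedia_model",
    entity_set="wikidata",
)

# Process an (illustrative) Indonesian sentence containing an expanded mention.
spans = refined.process_text("Joko Widodo, Presiden Indonesia, lahir di Surakarta.")
for span in spans:
    # Each span carries the mention text and the predicted entity (Wikidata QID).
    print(span.text, span.predicted_entity)
```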

The following results are obtained when CELESTA uses mGENRE to generate candidate entities and retrieve the corresponding Wikidata URIs.

| Dataset | Model | Zero-shot P | Zero-shot R | Zero-shot F1 | Few-shot P | Few-shot R | Few-shot F1 |
|---|---|---:|---:|---:|---:|---:|---:|
| IndGEL | LLaMA-3 | 0.720 | 0.694 | **0.707** | 0.742 | 0.718 | 0.730 |
| | Mistral | 0.667 | 0.640 | 0.653 | 0.607 | 0.584 | 0.595 |
| | Komodo | 0.702 | 0.668 | 0.685 | 0.740 | 0.698 | 0.718 |
| | Merak | 0.611 | 0.576 | 0.594 | 0.696 | 0.672 | 0.684 |
| | *CELESTA with mGENRE:* | | | | | | |
| | LLaMA-3 & Komodo | 0.695 | 0.660 | 0.677 | 0.741 | 0.708 | 0.724 |
| | LLaMA-3 & Merak | 0.631 | 0.596 | 0.613 | 0.748 | 0.722 | **0.735** |
| | Mistral & Komodo | 0.657 | 0.632 | 0.644 | 0.623 | 0.602 | 0.612 |
| | Mistral & Merak | 0.620 | 0.588 | 0.603 | 0.702 | 0.676 | 0.686 |
| IndQEL | LLaMA-3 | 0.298 | 0.298 | **0.298** | 0.274 | 0.273 | **0.273** |
| | Mistral | 0.258 | 0.258 | 0.258 | 0.185 | 0.182 | 0.183 |
| | Komodo | 0.252 | 0.251 | 0.251 | 0.269 | 0.269 | 0.269 |
| | Merak | 0.233 | 0.233 | 0.233 | 0.255 | 0.255 | 0.255 |
| | *CELESTA with mGENRE:* | | | | | | |
| | LLaMA-3 & Komodo | 0.298 | 0.298 | **0.298** | 0.266 | 0.266 | 0.266 |
| | LLaMA-3 & Merak | 0.276 | 0.276 | 0.276 | 0.256 | 0.255 | 0.255 |
| | Mistral & Komodo | 0.262 | 0.262 | 0.262 | 0.185 | 0.182 | 0.183 |
| | Mistral & Merak | 0.236 | 0.236 | 0.236 | 0.202 | 0.200 | 0.201 |
| IndEL-WIKI | LLaMA-3 | 0.516 | 0.415 | 0.460 | 0.601 | 0.489 | 0.539 |
| | Mistral | 0.457 | 0.360 | 0.403 | 0.447 | 0.363 | 0.401 |
| | Komodo | 0.542 | 0.401 | 0.461 | 0.547 | 0.422 | 0.476 |
| | Merak | 0.474 | 0.371 | 0.417 | 0.428 | 0.353 | 0.387 |
| | *CELESTA with mGENRE:* | | | | | | |
| | LLaMA-3 & Komodo | 0.548 | 0.411 | **0.470** | 0.618 | 0.481 | 0.537 |
| | LLaMA-3 & Merak | 0.521 | 0.412 | 0.460 | 0.595 | 0.495 | **0.540** |
| | Mistral & Komodo | 0.500 | 0.368 | 0.424 | 0.484 | 0.382 | 0.427 |
| | Mistral & Merak | 0.447 | 0.349 | 0.392 | 0.507 | 0.413 | 0.455 |
