Latxa-Instruct is an open-source project for reproducible training, evaluation, and benchmarking of Basque instruction-tuned language models. It provides:
- Open-source Basque LLMs: Instruction-tuned models based on Llama 3.1 (8B and 70B parameters)
- Synthetic instruction datasets: Large-scale generated instruction-response pairs in both Basque and English for reproducible training
- Human preference dataset: The first preference dataset in Basque with real user prompts, model responses, and 12,890 preference annotations from 1,285 participants
- Complete training pipeline: End-to-end scripts for data preprocessing, model training, and instruction-tuning on the CINECA Leonardo supercomputer
- Benchmark evaluation framework: Benchmarking scripts and results across 29 datasets in Basque, English, and Spanish
- Human evaluation framework: Full frontend and backend implementation for community-driven human evaluation of language models
All models, datasets, and evaluation tools are released under open licenses.
- 📒 Blog Post: TBA
- 📖 Paper: Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque
- 🤗 Models and Data: HiTZ/Latxa-Instruct
🚧 Note: The codebase is currently a work in progress. Meanwhile, all datasets are available on Hugging Face (a loading sketch follows the list below).
- Corpus: Basque corpus of 4.3M documents
- Instructions: Magpie-generated Basque/English instructions
- Preferences: Human preferences on model outputs from 21 different LLMs, in response to real Basque prompts
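The datasets above can be pulled directly from the Hub with the 🤗 `datasets` library. The snippet below is a minimal sketch; the repository IDs are placeholders, so check the HiTZ/Latxa-Instruct collection on Hugging Face for the actual dataset names before running it.

```python
# Minimal sketch: loading the released resources with the Hugging Face `datasets` library.
# NOTE: the dataset repository IDs below are placeholders; look up the real IDs in the
# HiTZ/Latxa-Instruct collection on the Hugging Face Hub.
from datasets import load_dataset

corpus = load_dataset("HiTZ/latxa-corpus", split="train")                # hypothetical ID: Basque corpus
instructions = load_dataset("HiTZ/magpie-instructions", split="train")   # hypothetical ID: Magpie instructions
preferences = load_dataset("HiTZ/latxa-preferences", split="train")      # hypothetical ID: preference annotations

print(corpus[0])  # inspect one record
```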
🚧 Note: The codebase is currently a work in progress and will require adaptation to your specific environment and setup.
Training is performed on the CINECA Leonardo cluster using Axolotl (a HuggingFace-based framework).
See model_training/README.md for full instructions, environment setup, and troubleshooting.
- Prepare the data
  Preprocess instruction datasets and Basque corpora using the scripts in model_training/. For example:

  ```bash
  sbatch model_training/train_scripts/prepare_data.sh
  ```

  This will tokenize and cache the datasets as specified in the YAML configs.
- Train the models
  Launch training with the provided SLURM scripts:

  ```bash
  sbatch model_training/train_scripts/Latxa-Llama-3.1-70B-Instruct-exp_2_101.sh
  ```
- Merge and transfer checkpoints
  After training, merge the distributed checkpoints and transfer them to the target server using `merge_weights.sh` and `rsync_weights.sh`.
- All configs are in model_training/train_configs/.
- Datasets are in JSONL format, with user/assistant conversations (see the sketch after this list).
- Example config: exp_1_010_fixed.yaml
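To make the JSONL layout concrete, here is a small reading sketch. The field names (`messages`, `role`, `content`) are an assumption based on the common chat-style format; verify them against the actual data files and the Axolotl configs.

```python
# Sketch of reading a chat-style JSONL training file.
# ASSUMPTION: each line holds a JSON object with a "messages" list of
# {"role": ..., "content": ...} turns; check the released data for the exact schema.
import json

with open("train.jsonl", encoding="utf-8") as f:  # path is illustrative
    for line in f:
        example = json.loads(line)
        for turn in example["messages"]:
            print(f'{turn["role"]}: {turn["content"][:80]}')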
Our evaluation framework combines two complementary approaches to provide comprehensive model assessment: benchmark-based evaluation and human preferences.
See evaluation/README.md for full instructions, environment setup, and troubleshooting.
- 29 benchmarks across Basque, English, and Spanish
- 6 categories: reading comprehension, common sense, linguistic proficiency, knowledge, math & reasoning, and bias
- Automated evaluation using our fork of LM Evaluation Harness (see the sketch after this list)
- Full configuration files and results included
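As a rough illustration, a single checkpoint can be scored through the LM Evaluation Harness Python API as sketched below. The task names are assumptions (Basque tasks shipped with the upstream harness); the fork and the configuration files in evaluation/ define the exact task list used in the paper.

```python
# Rough sketch: scoring one checkpoint with the LM Evaluation Harness Python API.
# Task names are assumptions; see the configs in evaluation/ for the actual benchmark list.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=HiTZ/Latxa-Llama-3.1-8B-Instruct,dtype=bfloat16",
    tasks=["eus_proficiency", "eus_reading", "eus_trivia"],
    batch_size=8,
)
print(results["results"])  # per-task metrics
```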
- Community-driven evaluation with 1,285 Basque speakers
- 12,890 preference annotations collected over 14 days
- Multi-dimensional assessment: content quality, linguistic quality, and overall preference
- 21 models evaluated including our variants, GPT-4o, and Claude 3.5 Sonnet
- Complete arena platform implementation provided
We use different naming conventions in the paper versus the codebase, for clarity and simplicity respectively.

In the paper, we use descriptive names like:

- $INST_{EN} C_{EU} I_{EN}$ = Llama 3.1 Instruct + Basque Corpus + English Instructions
- $BASE_{EU} C_{EU} I_{EN+EU}$ = Latxa Base + Basque Corpus + English & Basque Instructions

In the actual code, models, and evaluation results, we use a simplified format:

exp_{BACKBONE}_{CORPUS_EU}{INST_EU}{INST_EN}

where:

- BACKBONE:
  - 0 = $BASE_{EN}$ (Llama 3.1 8B)
  - 1 = $BASE_{EU}$ (Latxa 3.1 8B)
  - 2 = $INST_{EN}$ (Llama 3.1 8B Instruct)
- CORPUS_EU / INST_EU / INST_EN:
  - 0 = not used
  - 1 = used

Examples (a small decoding helper is sketched after this list):

- exp_1_101 = $BASE_{EU}$ + Basque Corpus + English Instructions
- exp_2_111 = $INST_{EN}$ + Basque Corpus + Basque Instructions + English Instructions
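The helper below is purely illustrative (not part of the released codebase) and simply expands an exp_* identifier into its components according to the convention above.

```python
# Illustrative helper (not part of the released codebase): decode an exp_* identifier.
BACKBONES = {
    "0": "BASE_EN (Llama 3.1 8B)",
    "1": "BASE_EU (Latxa 3.1 8B)",
    "2": "INST_EN (Llama 3.1 8B Instruct)",
}

def decode_experiment(name: str) -> dict:
    """Parse names of the form exp_{BACKBONE}_{CORPUS_EU}{INST_EU}{INST_EN}."""
    _, backbone, flags = name.split("_")
    corpus_eu, inst_eu, inst_en = (flag == "1" for flag in flags)
    return {
        "backbone": BACKBONES[backbone],
        "basque_corpus": corpus_eu,
        "basque_instructions": inst_eu,
        "english_instructions": inst_en,
    }

print(decode_experiment("exp_2_111"))
# {'backbone': 'INST_EN (Llama 3.1 8B Instruct)', 'basque_corpus': True,
#  'basque_instructions': True, 'english_instructions': True}
```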
All 8B model variants are available as branches in the HuggingFace repository and can be loaded by selecting the corresponding branch (see the sketch after the links):
- Main model: https://huggingface.co/HiTZ/Latxa-Llama-3.1-8B-Instruct
- Example variant: https://huggingface.co/HiTZ/Latxa-Llama-3.1-8B-Instruct/tree/exp_0_010
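A branch can be selected with the `revision` argument in 🤗 Transformers. A minimal sketch, using the example variant above:

```python
# Minimal sketch: load a specific experiment variant from its branch (revision).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "HiTZ/Latxa-Llama-3.1-8B-Instruct"
revision = "exp_0_010"  # branch name of the variant; "main" holds the released model

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision, torch_dtype="bfloat16")
```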
If you use Latxa-Instruct, please cite:
```bibtex
@misc{sainz2025instructing,
    title = {Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque},
    author = {Oscar Sainz and Naiara Perez and Julen Etxaniz and Joseba Fernandez de Landa and Itziar Aldabe and Iker García-Ferrero and Aimar Zabala and Ekhi Azurmendi and German Rigau and Eneko Agirre and Mikel Artetxe and Aitor Soroa},
    year = {2025},
    eprint = {TBP},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```

This work has been partially supported by the Basque Government (IKER-GAITU project) and the Ministerio para la Transformación Digital y de la Función Pública (EU – NextGenerationEU, 2022/TL22/00215335). The models were trained on the Leonardo supercomputer at CINECA under the EuroHPC Joint Undertaking, project EHPC-EXT-2024E01-042.
