Arabic-Swahili Machine Translation
Graduation Project — Low-Resource Neural Machine Translation
MACHINE TRANSLATION BASELINES FOR ARABIC-SWAHILI
Tiny Papers @ ICLR 2023
Authors: Asim Awad Osman, Ahmed Emadeldin Almahady, Muhammed Saeed, Hiba Hassan Sayed
Abstract
Building neural machine translation (NMT) systems for low-resource languages poses several challenges, mainly due to the lack of parallel data. In this research, we propose a baseline NMT system for translating between Arabic and Swahili. Although the two languages are spoken by nearly 300 million people combined, parallel data between them remains severely underrepresented. To our knowledge, we scraped and processed the largest high-quality Arabic-Swahili parallel corpus to date.
The Problem
Africa is home to more than 2,000 languages, yet they remain severely underrepresented in NLP. Research has concentrated on high-resource languages such as English and Chinese, leaving millions of Arabic and Swahili speakers without adequate translation technology.
Dataset Contribution
- 32,000 sentence pairs
- Largest AR-SW corpus to date
- 300M speakers combined
We scraped and curated the largest high-quality parallel corpus for Arabic-Swahili translation, providing a foundation for future research in this language pair.
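The paper does not detail its exact cleaning pipeline here; below is a minimal sketch of typical curation steps for a scraped parallel corpus (deduplication and length-ratio filtering). The helper `clean_parallel_corpus` and its thresholds are hypothetical, written only for illustration.

```python
# Minimal sketch of parallel-corpus curation (hypothetical helper and
# thresholds; the paper's exact cleaning pipeline may differ).
def clean_parallel_corpus(pairs, max_len=200, min_ratio=0.4, max_ratio=2.5):
    """Deduplicate and length-filter (Arabic, Swahili) sentence pairs."""
    seen = set()
    cleaned = []
    for ar, sw in pairs:
        ar, sw = ar.strip(), sw.strip()
        if not ar or not sw:
            continue                      # drop pairs with an empty side
        if (ar, sw) in seen:
            continue                      # drop exact duplicates
        seen.add((ar, sw))
        ar_tokens, sw_tokens = ar.split(), sw.split()
        if len(ar_tokens) > max_len or len(sw_tokens) > max_len:
            continue                      # drop overly long sentences
        ratio = len(ar_tokens) / max(len(sw_tokens), 1)
        if not (min_ratio <= ratio <= max_ratio):
            continue                      # drop implausible length ratios
        cleaned.append((ar, sw))
    return cleaned

pairs = [("التعليم هو أساس التقدم", "Elimu ni msingi wa maendeleo")]
print(clean_parallel_corpus(pairs))
```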
Translation Examples
Example 1
- Arabic (source): السلام عليكم ورحمة الله وبركاته
- Swahili (target): Amani iwe juu yenu na rehema ya Mungu na baraka zake
- English gloss: Peace be upon you and God's mercy and blessings

Example 2
- Arabic (source): التعليم هو أساس التقدم والتنمية
- Swahili (target): Elimu ni msingi wa maendeleo na ustawi
- English gloss: Education is the foundation of progress and development

Example 3
- Arabic (source): نحن نعمل معاً من أجل مستقبل أفضل
- Swahili (target): Tunafanya kazi pamoja kwa mustakabali bora
- English gloss: We work together for a better future
Results
- Baseline model (Transformer trained from scratch): ~10 BLEU
- Fine-tuned multilingual model: 30.9 BLEU ⭐
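BLEU here is the standard corpus-level translation metric. A minimal sketch of computing it with the sacrebleu library follows; the hypothesis and reference strings are toy examples, and the paper's exact evaluation settings (tokenization, test split) are not specified here.

```python
# Minimal corpus-level BLEU sketch with sacrebleu (toy strings; the paper's
# exact evaluation configuration is an assumption).
import sacrebleu

hypotheses = ["Elimu ni msingi wa maendeleo na ustawi"]    # system outputs
references = [["Elimu ni msingi wa maendeleo na ustawi"]]  # one reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```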
Key Findings:
- Fine-tuning multilingual Transformers significantly outperforms training from scratch
- Back-translation further improved performance (a sketch of the technique follows this list)
- Even with limited data (32K pairs), fine-tuning achieves competitive results
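Back-translation augments the real parallel data with synthetic pairs: monolingual Swahili sentences are translated into Arabic by a reverse model, and each synthetic Arabic sentence is paired with its original Swahili sentence. A minimal sketch is below; the helper name `translate_sw_to_ar` and the overall setup are assumptions, not the paper's exact configuration.

```python
# Minimal back-translation sketch (the reverse model and the helper name
# `translate_sw_to_ar` are hypothetical; the paper's setup may differ).
def back_translate(monolingual_sw, translate_sw_to_ar):
    """Build synthetic (Arabic, Swahili) pairs from monolingual Swahili text.

    `translate_sw_to_ar` is any Swahili -> Arabic translation function,
    e.g. a reverse model trained on the real parallel data.
    """
    synthetic_pairs = []
    for sw_sentence in monolingual_sw:
        ar_synthetic = translate_sw_to_ar(sw_sentence)       # synthetic source
        synthetic_pairs.append((ar_synthetic, sw_sentence))  # real target side
    return synthetic_pairs

# The augmented training set mixes real and synthetic pairs:
# train_pairs = real_pairs + back_translate(sw_monolingual, reverse_model)
```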
Methodology
1. Data Collection: scraped and processed the largest AR-SW parallel corpus (32K pairs)
2. Baseline Models: trained Transformer models from scratch as a reference point
3. Fine-tuning: fine-tuned multilingual Transformer variants on our dataset (see the sketch after this list)
4. Back-Translation: used synthetic data generation to augment the training data
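The training code is not reproduced here; the following is a minimal sketch of fine-tuning a pretrained multilingual translation model with the Hugging Face transformers library. The choice of M2M-100 as the base model, the hyperparameters, and the one-pair toy dataset are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal fine-tuning sketch with Hugging Face transformers.
# Assumptions for illustration (not the paper's exact recipe): M2M-100 as the
# pretrained multilingual base, toy hyperparameters, and a one-pair dataset
# standing in for the 32K-pair corpus.
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/m2m100_418M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer.src_lang, tokenizer.tgt_lang = "ar", "sw"   # Arabic -> Swahili

raw = Dataset.from_dict({
    "ar": ["التعليم هو أساس التقدم والتنمية"],
    "sw": ["Elimu ni msingi wa maendeleo na ustawi"],
})

def preprocess(batch):
    # Tokenize the Arabic source; `text_target` tokenizes the Swahili labels.
    return tokenizer(batch["ar"], text_target=batch["sw"],
                     truncation=True, max_length=128)

train_data = raw.map(preprocess, batched=True, remove_columns=["ar", "sw"])

args = Seq2SeqTrainingArguments(
    output_dir="ar-sw-finetuned",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_data,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```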
Key Takeaways
- ✅ Published at ICLR 2023 (Tiny Papers track)
- ✅ Created largest AR-SW corpus (32,000 sentence pairs)
- ✅ 3x BLEU improvement with fine-tuning (10 → 30.9)
- ✅ Addresses underrepresentation of African languages in NLP