🌍 Published @ ICLR 2023

Arabic-Swahili Machine Translation

Graduation Project — Low-Resource Neural Machine Translation

NMT · Transformers · Low-Resource · Back-Translation

MACHINE TRANSLATION BASELINES FOR ARABIC-SWAHILI

Tiny Papers @ ICLR 2023

Authors: Asim Awad Osman, Ahmed Emadeldin Almahady, Muhammed Saeed, Hiba Hassan Sayed


Abstract

Building neural machine translation (NMT) systems for low-resource languages poses several challenges, mainly due to the lack of parallel data. In this research, we propose a baseline NMT system for translating between Arabic and Swahili. Although these languages are spoken by nearly 300 million people worldwide, parallel data between them is severely limited. To our knowledge, we scraped and processed the largest high-quality parallel corpus of Swahili and Arabic.

The Problem

Africa has more than 2,000 languages, yet they remain severely underrepresented in NLP. Research has concentrated on high-resource languages (English, Chinese, etc.), leaving millions of Arabic and Swahili speakers without adequate translation technology.

Dataset Contribution

  • 32,000 sentence pairs
  • Largest AR-SW corpus to date
  • 300M combined speakers

We scraped and curated the largest high-quality parallel corpus for Arabic-Swahili translation, providing a foundation for future research in this language pair.
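
The curation pipeline itself is not described on this page, so the following is only a minimal sketch of the kind of filtering commonly applied to a scraped parallel corpus (empty-line removal, length and length-ratio limits, exact deduplication). The function name, file layout, and thresholds are illustrative assumptions, not the authors' actual settings.

```python
# Illustrative cleaning pass over a scraped parallel corpus. Assumed layout:
# one Arabic and one Swahili sentence per line, aligned by line index.
# File names and thresholds are hypothetical, not the paper's settings.

def clean_parallel_corpus(ar_path, sw_path, max_len=200, max_ratio=3.0):
    seen = set()
    pairs = []
    with open(ar_path, encoding="utf-8") as f_ar, \
         open(sw_path, encoding="utf-8") as f_sw:
        for ar, sw in zip(f_ar, f_sw):
            ar, sw = ar.strip(), sw.strip()
            if not ar or not sw:
                continue                                  # drop empty lines
            n_ar, n_sw = len(ar.split()), len(sw.split())
            if max(n_ar, n_sw) > max_len:
                continue                                  # drop overlong pairs
            if max(n_ar, n_sw) / min(n_ar, n_sw) > max_ratio:
                continue                                  # drop likely misalignments
            if (ar, sw) in seen:
                continue                                  # drop exact duplicates
            seen.add((ar, sw))
            pairs.append((ar, sw))
    return pairs

pairs = clean_parallel_corpus("scraped.ar", "scraped.sw")
```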

Translation Examples

Arabic (Source): السلام عليكم ورحمة الله وبركاته
Swahili (Target): Amani iwe juu yenu na rehema ya Mungu na baraka zake
English: Peace be upon you and God's mercy and blessings

Arabic (Source): التعليم هو أساس التقدم والتنمية
Swahili (Target): Elimu ni msingi wa maendeleo na ustawi
English: Education is the foundation of progress and development

Arabic (Source): نحن نعمل معاً من أجل مستقبل أفضل
Swahili (Target): Tunafanya kazi pamoja kwa mustakabali bora
English: We work together for a better future
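
The examples above are shown as static text; as a rough illustration, Swahili output of this kind can be generated from a multilingual checkpoint with Hugging Face Transformers. The choice of facebook/m2m100_418M and the M2M-100 language codes are assumptions for illustration; a fine-tuned Arabic-Swahili checkpoint would be loaded the same way from its own path.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical multilingual checkpoint covering Arabic and Swahili; a
# fine-tuned Arabic->Swahili model would be loaded from its own directory.
model_name = "facebook/m2m100_418M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

tokenizer.src_lang = "ar"  # Arabic source
inputs = tokenizer("التعليم هو أساس التقدم والتنمية", return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.get_lang_id("sw"),  # force Swahili output
    max_new_tokens=64,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```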

Results

  • Baseline Model: ~10 BLEU
  • Fine-tuned Model: 30.9 BLEU ⭐

Key Findings:

  • Fine-tuning multilingual Transformers significantly outperforms training from scratch (see the sketch after this list)
  • The back-translation technique further improved performance
  • Even with limited data (32K pairs), fine-tuning achieves competitive results
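
A minimal sketch of the fine-tuning setup behind the first finding, using Hugging Face Transformers: a pretrained multilingual seq2seq model is adapted to the Arabic-Swahili pairs. The base checkpoint (facebook/m2m100_418M), the hyperparameters, and the toy dataset below are illustrative assumptions rather than the exact configuration used in the paper.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainingArguments,
                          Seq2SeqTrainer)

# Hypothetical multilingual base model; any seq2seq checkpoint that covers
# Arabic ("ar") and Swahili ("sw") could be substituted.
model_name = "facebook/m2m100_418M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer.src_lang, tokenizer.tgt_lang = "ar", "sw"

# In practice `pairs` would hold the full 32K-pair corpus; one pair here.
pairs = [("التعليم هو أساس التقدم والتنمية",
          "Elimu ni msingi wa maendeleo na ustawi")]
dataset = Dataset.from_dict({"ar": [a for a, _ in pairs],
                             "sw": [s for _, s in pairs]})

def preprocess(batch):
    # Tokenize Arabic sources and Swahili targets; `text_target` fills labels.
    return tokenizer(batch["ar"], text_target=batch["sw"],
                     max_length=128, truncation=True)

tokenized = dataset.map(preprocess, batched=True, remove_columns=["ar", "sw"])

args = Seq2SeqTrainingArguments(
    output_dir="ar-sw-finetuned",        # illustrative hyperparameters
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=3,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```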

Methodology

  1. Data Collection: scraped and processed the largest AR-SW parallel corpus (32K pairs)
  2. Baseline Models: trained Transformer models from scratch as a reference
  3. Fine-tuning: fine-tuned multilingual Transformer variants on our dataset
  4. Back-Translation: used synthetic data generation to augment training (see the sketch after this list)
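
Step 4 follows the standard back-translation recipe: a reverse-direction (Swahili to Arabic) model translates monolingual Swahili text to produce synthetic Arabic sources, and the synthetic pairs are mixed with the authentic 32K pairs for another round of fine-tuning. The sketch below assumes an M2M-100-style checkpoint for the reverse model; the paper's exact setup is not specified here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical reverse-direction (Swahili -> Arabic) model; in practice this
# would itself be fine-tuned on the parallel corpus in the opposite direction.
reverse_name = "facebook/m2m100_418M"
rev_tokenizer = AutoTokenizer.from_pretrained(reverse_name)
rev_model = AutoModelForSeq2SeqLM.from_pretrained(reverse_name)
rev_tokenizer.src_lang = "sw"

def back_translate(swahili_sentences, batch_size=8):
    """Generate synthetic Arabic sources for monolingual Swahili sentences."""
    synthetic_pairs = []
    for i in range(0, len(swahili_sentences), batch_size):
        batch = swahili_sentences[i:i + batch_size]
        inputs = rev_tokenizer(batch, return_tensors="pt",
                               padding=True, truncation=True)
        with torch.no_grad():
            generated = rev_model.generate(
                **inputs,
                forced_bos_token_id=rev_tokenizer.get_lang_id("ar"),
                max_new_tokens=128,
            )
        arabic = rev_tokenizer.batch_decode(generated, skip_special_tokens=True)
        # Pair synthetic Arabic (source) with authentic Swahili (target).
        synthetic_pairs.extend(zip(arabic, batch))
    return synthetic_pairs

# The synthetic pairs are then concatenated with the authentic corpus and the
# forward Arabic -> Swahili model is fine-tuned again on the mixture.
synthetic = back_translate(["Elimu ni msingi wa maendeleo na ustawi"])
```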

Key Takeaways

  • Published at ICLR 2023 (Tiny Papers track)
  • Created largest AR-SW corpus (32,000 sentence pairs)
  • 3x BLEU improvement with fine-tuning (10 → 30.9)
  • Addresses underrepresentation of African languages in NLP