Our Datasets

Open-source datasets for evaluating and training AI models on African cultural knowledge, languages, and safety.

UCCB

Ugandan Cultural Context Benchmark

Overview

UCCB is the first comprehensive question-answering dataset for evaluating how well Large Language Models understand Ugandan culture. It contains 1,039 carefully curated question-answer pairs across 24 cultural domains.

Dataset Statistics

Total Q&A Pairs: 1,039
Cultural Domains: 24
Primary Language: English
License: CC BY-NC-SA 4.0

Cultural Categories

• Education
• Ugandan Herbs
• Media
• Economy
• Notable Figures
• Literature
• Architecture
• Folklore
• Language
• Religion
• Sports
• And 13 more...

Use Cases

  • Question Answering evaluation
  • Knowledge probing of LLMs
  • Cultural awareness training
  • Bias evaluation in AI models

Usage Example

from datasets import load_dataset

# Load the UCCB dataset
dataset = load_dataset("CraneAILabs/UCCB")
test_data = dataset["test"]

# Access a sample question
print(test_data[0]["question"])
print(test_data[0]["answer"])
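
For the question-answering evaluation use case, one common scoring approach is normalized exact match. The sketch below is only an illustration under stated assumptions: the helper names `normalize` and `exact_match_score` are hypothetical, the toy strings are made up and not taken from UCCB, and the benchmark may well call for a more lenient metric for free-form answers.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match_score(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


# Made-up toy strings standing in for model outputs and dataset answers.
references = ["Kampala", "The Nile"]
predictions = ["kampala.", "Lake Victoria"]
score = exact_match_score(predictions, references)
print(f"Exact match: {score:.2f}")
```

In practice the `references` list would come from the dataset's `answer` field and `predictions` from the model under test.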

WildJailbreak Africa

Multilingual Safety Training Dataset

Overview

WildJailbreak Africa contains translations of 50,000 samples from the original WildJailbreak dataset into 5 African languages. It is designed for instruction tuning and safety training of language models in low-resource African languages.

Dataset Statistics

Total Samples: 299,283
Languages: 6 (incl. English)
Format: JSONL
License: ODC-BY-1.0

Languages Included

English: 50,000 samples
Acholi: 49,819 samples
Lugbara: 49,854 samples
Luganda: 49,864 samples
Swahili: 49,875 samples
Ateso: 49,871 samples
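
As a quick sanity check, the per-language counts listed here can be summed and compared against the reported total. The snippet also computes how far each translated subset falls short of the 50,000 English samples; the page does not say why the counts differ, so the shortfall is reported without interpretation.

```python
# Per-language sample counts as listed on this page.
samples_per_language = {
    "English": 50_000,
    "Acholi": 49_819,
    "Lugbara": 49_854,
    "Luganda": 49_864,
    "Swahili": 49_875,
    "Ateso": 49_871,
}

total = sum(samples_per_language.values())
print(f"Total samples: {total:,}")

# Shortfall of each translated subset relative to the English count.
shortfall = {
    lang: samples_per_language["English"] - count
    for lang, count in samples_per_language.items()
    if lang != "English"
}
print(shortfall)
```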

Use Cases

  • Safety training for AI models
  • Instruction tuning for conversational AI
  • Jailbreak research
  • Cross-lingual safety studies

Usage Example

from datasets import load_dataset

# Load Swahili subset
dataset = load_dataset("CraneAILabs/wildjailbreak-africa",
                       "swa")

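# Access the first five samples.
# Slicing a Dataset (e.g. [:5]) returns a dict of columns, so use select() to iterate over rows.
for sample in dataset['train'].select(range(5)):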
    print(sample)
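
Since the dataset is distributed as JSONL, the raw files can also be read with the standard library alone. The following is a minimal sketch, assuming one JSON object per line; the `parse_jsonl` helper and the `id` and `text` field names are hypothetical and do not reflect the dataset's actual schema.

```python
import json


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Two made-up records; the field names are illustrative only.
raw_jsonl = (
    '{"id": 1, "text": "first example"}\n'
    '{"id": 2, "text": "second example"}\n'
)

records = parse_jsonl(raw_jsonl)
print(len(records), records[0]["text"])
```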

Pedagogy Benchmark Multilingual

Educational Assessment Dataset

Overview

This benchmark translates Chilean teacher-training exam questions into African languages. The questions cover a range of pedagogical domains, education levels, and subject areas.

Dataset Statistics

Size: 1K - 10K rows
Format: Parquet
Task Type: Multiple Choice
License: Apache 2.0

Languages

  • Luganda (Ugandan language)
  • Nyankore (Ugandan language)
  • Swahili (coming soon)

Use Cases

  • Educational AI research
  • Multilingual question answering
  • Teacher training tools
  • Pedagogical content evaluation

Usage Example

from datasets import load_dataset

# Load the pedagogy benchmark dataset
dataset = load_dataset(
    "CraneAILabs/pedagogy-benchmark-multilingual"
)

# Access questions
print(dataset['train'][0])
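
Because the task type is multiple choice, a typical evaluation loop renders each question as a lettered prompt and scores the predicted letter against the gold one. The sketch below shows one way to do that; the `format_prompt` and `accuracy` helpers and the `question`, `choices`, and `answer` keys are assumptions for illustration, so check the dataset's actual column names before adapting it.

```python
def format_prompt(question: str, choices: list[str]) -> str:
    """Render a question and its options as a lettered multiple-choice prompt."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [question]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Share of predicted option letters that match the gold letters."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p.strip().upper() == g.strip().upper() for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical record; the real column names may differ.
record = {
    "question": "Placeholder question?",
    "choices": ["Option one", "Option two", "Option three"],
    "answer": "B",
}
prompt = format_prompt(record["question"], record["choices"])
print(prompt)
print(accuracy(["b", "C"], ["B", "A"]))
```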

Contribute to African Language AI Research

All our datasets are open source and available for research. Help us build better AI for African languages.

Visit HuggingFace