Our Datasets
Open-source datasets for evaluating and training AI models on African cultural knowledge, languages, and safety.
UCCB
Ugandan Cultural Context Benchmark
Overview
The first comprehensive question-answering dataset for evaluating Large Language Models' understanding of Ugandan culture. Contains 1,039 carefully curated question-answer pairs across 24 cultural domains.
Dataset Statistics
Cultural Categories
Use Cases
- • Question Answering evaluation
- • Knowledge probing of LLMs
- • Cultural awareness training
- • Bias evaluation in AI models
Usage Example
from datasets import load_dataset
# Load the UCCB dataset
dataset = load_dataset("CraneAILabs/UCCB")
test_data = dataset["test"]
# Access a sample question
print(test_data[0]["question"])
print(test_data[0]["answer"])
WildJailbreak Africa
Multilingual Safety Training Dataset
Overview
Contains translations of 50,000 samples from the original WildJailbreak dataset into 5 African languages. Designed for instruction tuning and safety training of language models in low-resource African languages.
Dataset Statistics
Languages Included
Use Cases
- • Safety training for AI models
- • Instruction tuning for conversational AI
- • Jailbreak research
- • Cross-lingual safety studies
Usage Example
from datasets import load_dataset
# Load Swahili subset
dataset = load_dataset("CraneAILabs/wildjailbreak-africa",
"swa")
# Access samples
for sample in dataset['train'][:5]:
print(sample)
Pedagogy Benchmark Multilingual
Educational Assessment Dataset
Overview
Multilingual translation of Chilean teacher training exam questions into African languages. Covers various pedagogical domains, education levels, and subject areas.
Dataset Statistics
Languages
- • Luganda (Ugandan language)
- • Nyankore (Ugandan language)
- • Swahili (coming soon)
Use Cases
- • Educational AI research
- • Multilingual question answering
- • Teacher training tools
- • Pedagogical content evaluation
Usage Example
from datasets import load_dataset
# Load the pedagogy benchmark dataset
dataset = load_dataset(
"CraneAILabs/pedagogy-benchmark-multilingual"
)
# Access questions
print(dataset['train'][0])
Contribute to African Language AI Research
All our datasets are open source and available for research. Help us build better AI for African languages.
Visit HuggingFace