DziriBERT — Algerian Darija Misinformation Detection

DziriBERT is a fine-tuned XLM-RoBERTa-large model for detecting misinformation in Algerian Darija text from social media and news.

Base model: xlm-roberta-large (roughly 560M parameters)
Task: Multi-class text classification (5 classes)
Classes:
- F: Fake
- R: Real
- N: Non-new
- M: Misleading
- S: Satire

Performance (Test set: 3,344 samples)

Accuracy: 78.32%
Macro F1: 68.22%
Weighted F1: 78.43%

Per-class F1:

Fake (F): 85.04%
Real (R): 80.44%
Non-new (N): 83.23%
Misleading (M): 64.57%
Satire (S): 27.83%
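
As a quick sanity check on the numbers above, macro F1 is the unweighted mean of the per-class F1 scores, so it can be recomputed from the table. The snippet below is a minimal sketch; the helper name `macro_f1` is illustrative and not part of any released code.

```python
# Per-class F1 scores (percent), copied from the table above.
per_class_f1 = {
    "Fake": 85.04,
    "Real": 80.44,
    "Non-new": 83.23,
    "Misleading": 64.57,
    "Satire": 27.83,
}


def macro_f1(scores):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(scores.values()) / len(scores)


macro = macro_f1(per_class_f1)
print(f"Macro F1: {macro:.2f}%")  # prints "Macro F1: 68.22%"
```

Because every class counts equally here, the low Satire score pulls macro F1 well below the weighted F1, which weights each class by its number of test samples (the per-class counts are not listed in this card).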

Training Summary

Max sequence length: 128
Epochs: 3 (early stopping)
Batch size: 8 (effective 16 with gradient accumulation)
Learning rate: 1e-5
Loss: Weighted CrossEntropy
Data augmentation: Applied to minority classes (M, S)
Seed: 42
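
To illustrate what a weighted cross-entropy loss does, here is a minimal pure-Python sketch. The class weights below are hypothetical (the actual values used in training are not given in this card), and the function name is illustrative. PyTorch's `torch.nn.CrossEntropyLoss` accepts a `weight` tensor and, with its default mean reduction, normalizes by the summed target weights in the same way.

```python
import math


def weighted_cross_entropy(logits_batch, targets, class_weights):
    """Mean weighted cross-entropy over a batch.

    Each sample's loss, -log softmax(logits)[target], is scaled by the
    weight of its target class; the sum is divided by the total weight.
    """
    total_loss = 0.0
    total_weight = 0.0
    for logits, target in zip(logits_batch, targets):
        # Numerically stable log-softmax: subtract the max logit first.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        log_prob = logits[target] - log_norm
        weight = class_weights[target]
        total_loss += -weight * log_prob
        total_weight += weight
    return total_loss / total_weight


# Hypothetical weights for (F, R, N, M, S); NOT the values used in training.
class_weights = [1.0, 1.0, 1.0, 2.0, 4.0]

# Two samples where the model favors class 0 (Fake):
# the first really is Fake (0), the second is Satire (4) and is missed.
logits_batch = [[2.0, 0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0, 0.0]]
targets = [0, 4]

plain = weighted_cross_entropy(logits_batch, targets, [1.0] * 5)
weighted = weighted_cross_entropy(logits_batch, targets, class_weights)
print(f"unweighted: {plain:.4f}  weighted: {weighted:.4f}")
```

With these assumed weights, the missed Satire sample contributes more to the loss than it would unweighted, which is the intended effect for minority classes. The effective batch size of 16 with a per-step batch of 8 implies two gradient accumulation steps.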

Strengths & Limitations

Strengths

Strong performance on the Fake, Real, and Non-new classes (F1 between 80% and 85%)
Handles Darija, Arabic, and French code-switching well

Limitations

Low performance on Satire (27.83% F1), attributed to the small number of Satire samples
Misleading remains challenging (64.57% F1)

Usage

import os

# Set these before importing transformers: the library reads them at import time.
os.environ["USE_TF"] = "0"
os.environ["USE_TORCH"] = "1"

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "Rahilgh/model4_2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE)
model.eval()

LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Fake",
    "R": "Real",
    "N": "Non-new",
    "M": "Misleading",
    "S": "Satire",
}

texts = [
    "الجزائر فازت ببطولة امم افريقيا 2019",
    "صورة زعيم عالمي يرتدي ملابس غريبة تثير السخرية",
]

for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
    ).to(DEVICE)

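    with torch.no_grad():
        outputs = model(**inputs)
        # Softmax over the class dimension turns logits into probabilities.
        probs = torch.softmax(outputs.logits, dim=1)
        # Reduce over the class dimension explicitly; the batch holds one text.
        pred_id = probs.argmax(dim=1).item()
        confidence = probs[0, pred_id].item()

    label = LABEL_MAP[pred_id]
    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}), confidence {confidence:.2%}")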
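
The loop above reports only the top class. Since Misleading and Satire are the weakest classes, it may help to look at the full probability distribution for a text. The sketch below ranks all five classes; `rank_labels` is a hypothetical helper, and the example probabilities are made up for illustration (in practice they could come from `probs[0].tolist()` inside the loop).

```python
LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Fake",
    "R": "Real",
    "N": "Non-new",
    "M": "Misleading",
    "S": "Satire",
}


def rank_labels(probs):
    """Return (name, code, probability) tuples, most likely class first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(LABEL_NAMES[LABEL_MAP[i]], LABEL_MAP[i], p) for i, p in ranked]


# Made-up probabilities for one text, in class-index order (F, R, N, M, S).
example_probs = [0.10, 0.05, 0.05, 0.45, 0.35]

for name, code, p in rank_labels(example_probs):
    print(f"{name} ({code}): {p:.2%}")
```

A close runner-up, as with Misleading and Satire in this made-up example, is a hint that the top prediction deserves a manual check.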
