DziriBERT — Algerian Darija Misinformation Detection

DziriBERT is a fine-tuned XLM-RoBERTa-large model for detecting misinformation in Algerian Darija text from social media and news.

  • Base model: xlm-roberta-large (about 0.6B parameters)
  • Task: Multi-class text classification (5 classes)
  • Classes:
    • F: Fake
    • R: Real
    • N: Non-new
    • M: Misleading
    • S: Satire

Performance (Test set: 3,344 samples)

  • Accuracy: 78.32%
  • Macro F1: 68.22%
  • Weighted F1: 78.43%

Per-class F1:

  • Fake (F): 85.04%
  • Real (R): 80.44%
  • Non-new (N): 83.23%
  • Misleading (M): 64.57%
  • Satire (S): 27.83%
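
The reported Macro F1 can be checked against the per-class scores, since macro averaging is the unweighted mean over classes. A minimal sketch using only the numbers listed above (the variable names are illustrative):

```python
# Per-class F1 scores (in percent) as reported in this model card.
per_class_f1 = {
    "Fake": 85.04,
    "Real": 80.44,
    "Non-new": 83.23,
    "Misleading": 64.57,
    "Satire": 27.83,
}

# Macro F1 is the unweighted mean, so each class counts equally
# regardless of how many test samples it has.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(f"Macro F1: {macro_f1:.2f}%")  # prints "Macro F1: 68.22%"
```

The gap between Macro F1 and Weighted F1 comes from this equal weighting: the low Satire score counts for a full fifth of the macro average. Weighted F1 cannot be reproduced the same way because it needs the per-class test counts, which are not listed here.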

Training Summary

  • Max sequence length: 128
  • Epochs: 3 (early stopping)
  • Batch size: 8 (effective 16 with gradient accumulation)
  • Learning rate: 1e-5
  • Loss: Weighted cross-entropy
  • Data augmentation: Applied to minority classes (M, S)
  • Seed: 42
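
To illustrate what a weighted cross-entropy loss does for minority classes, here is a minimal pure-Python sketch. The class counts and the inverse-frequency weighting scheme are assumptions for illustration only; the actual training distribution and weights are not given in this card. The normalisation (dividing by the sum of the target weights) follows the behaviour of PyTorch's `CrossEntropyLoss` with a `weight` argument and mean reduction.

```python
import math

# HYPOTHETICAL class counts, in label order F, R, N, M, S.
# These are not the real training counts.
class_counts = {"F": 4000, "R": 3500, "N": 3000, "M": 1200, "S": 300}


def inverse_frequency_weights(counts):
    """Assumed scheme: weight = total / (num_classes * class_count)."""
    total = sum(counts.values())
    return [total / (len(counts) * n) for n in counts.values()]


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def weighted_cross_entropy(batch_logits, targets, weights):
    """Weighted mean of per-sample losses, normalised by the target weights."""
    num = 0.0
    den = 0.0
    for logits, target in zip(batch_logits, targets):
        num += -weights[target] * log_softmax(logits)[target]
        den += weights[target]
    return num / den


weights = inverse_frequency_weights(class_counts)

# Two samples with the same (wrong) logits: one Fake (0), one Satire (4).
# The Satire sample dominates the loss because its weight is larger.
batch_logits = [[0.0, 2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0, 0.0]]
loss = weighted_cross_entropy(batch_logits, [0, 4], weights)
print(f"weights: {[round(w, 3) for w in weights]}")
print(f"weighted loss: {loss:.4f}")
```

Under this scheme rare classes such as Satire receive larger weights, so errors on them contribute more to the gradient than errors on frequent classes.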

Strengths & Limitations

Strengths

  • Strong performance on the Fake, Real, and Non-new classes (F1 between 80% and 85%)
  • Handles Darija, Arabic, and French code-switching well

Limitations

  • Low performance on Satire (F1 27.83%), attributed to the limited number of samples
  • The Misleading class remains challenging (F1 64.57%)

Usage

import os

# Set the backend flags before importing transformers,
# which reads them at import time.
os.environ["USE_TF"] = "0"
os.environ["USE_TORCH"] = "1"

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "Rahilgh/model4_2"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE)
model.eval()

LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Fake",
    "R": "Real",
    "N": "Non-new",
    "M": "Misleading",
    "S": "Satire",
}

texts = [
    "الجزائر فازت ببطولة امم افريقيا 2019",
    "صورة زعيم عالمي يرتدي ملابس غريبة تثير السخرية",
]

for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
    ).to(DEVICE)

    with torch.no_grad():
        outputs = model(**inputs)
        # Softmax over the class dimension; the batch holds a single text.
        probs = torch.softmax(outputs.logits, dim=1)
        pred_id = probs.argmax(dim=1).item()
        confidence = probs[0, pred_id].item()

    label = LABEL_MAP[pred_id]
    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}) — {confidence:.2%}")