mBERT - Algerian Darija Misinformation Detection
Fine-tuned BERT-base-multilingual-cased for detecting misinformation in Algerian Darija text.
- Base model: bert-base-multilingual-cased (170M parameters)
- Task: Multi-class text classification (5 classes)
- Classes: F (Factual), R (Reporting), N (Non-factual), M (Misleading), S (Satire)
Performance (Test set: 3,344 samples)
- Accuracy: 75.42%
- Macro F1: 64.48%
- Weighted F1: 75.70%
Per-class F1:
- Factual (F): 83.72%
- Reporting (R): 76.35%
- Non-factual (N): 81.01%
- Misleading (M): 61.46%
- Satire (S): 19.86%
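Macro F1 is the unweighted mean of the five per-class scores, so the weak Satire class drags it well below the weighted F1, which averages per-class F1 by class support. A minimal sketch of that relationship using scikit-learn (assumed here for illustration; the y_true / y_pred values are placeholders, not the actual test data):

from sklearn.metrics import f1_score

# Placeholder labels and predictions; substitute the real test-set outputs.
y_true = ["F", "F", "R", "N", "M", "S", "F", "N"]
y_pred = ["F", "R", "R", "N", "M", "F", "F", "N"]

labels = ["F", "R", "N", "M", "S"]
per_class = f1_score(y_true, y_pred, labels=labels, average=None)          # one F1 per class
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")        # unweighted mean
weighted_f1 = f1_score(y_true, y_pred, labels=labels, average="weighted")  # support-weighted mean
print(per_class, macro_f1, weighted_f1)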
Training Summary
- Max sequence length: 128
- Epochs: 3 (early stopping)
- Batch size: 16
- Learning rate: 2e-5
- Loss: Weighted CrossEntropy
- Seed: 42 (reproducibility)
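The weighted cross-entropy counteracts class imbalance (Satire in particular is rare) by scaling each class's contribution to the loss. The exact weighting scheme is not documented here; a minimal sketch assuming inverse-frequency weights computed from hypothetical training label counts:

import torch
from collections import Counter

# Placeholder label ids (0=F, 1=R, 2=N, 3=M, 4=S); use the real training labels.
train_labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 4]
counts = Counter(train_labels)
num_classes = 5

# Inverse-frequency weights: rarer classes get a larger weight.
weights = torch.tensor(
    [len(train_labels) / (num_classes * counts[c]) for c in range(num_classes)],
    dtype=torch.float,
)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)  # plugged into the training loop / Trainer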
Usage
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
MODEL_ID = "Rahilgh/model4_1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()
LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Factual",
    "R": "Reporting",
    "N": "Non-factual",
    "M": "Misleading",
    "S": "Satire",
}
texts = [
    "قالك بلي رايحين ينحو الباك هذا العام",  # "They say they're going to cancel the bac this year"
]
for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
    ).to(device)

    with torch.no_grad():
        outputs = model(**inputs)

    probs = torch.softmax(outputs.logits, dim=1)[0]
    pred_id = probs.argmax().item()
    confidence = probs[pred_id].item()
    label = LABEL_MAP[pred_id]

    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}) - {confidence:.2%}\n")
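The tokenizer also accepts a list of texts, so several inputs can be scored in one forward pass instead of looping; a short sketch reusing the model, tokenizer, and label maps defined above:

# Batched variant (illustrative): tokenize and classify all texts at once.
batch = tokenizer(
    texts,
    return_tensors="pt",
    max_length=128,
    truncation=True,
    padding=True,
).to(device)

with torch.no_grad():
    logits = model(**batch).logits

probs = torch.softmax(logits, dim=1)
for text, p in zip(texts, probs):
    pred_id = p.argmax().item()
    label = LABEL_MAP[pred_id]
    print(f"{text} -> {LABEL_NAMES[label]} ({label}) - {p[pred_id].item():.2%}")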