mBERT - Algerian Darija Misinformation Detection
Fine-tuned BERT-base-multilingual-cased for detecting misinformation in Algerian Darija text.
- Base model: bert-base-multilingual-cased (170M parameters)
- Task: Multi-class text classification (5 classes)
- Classes: F (Factual), R (Reporting), N (Non-factual), M (Misleading), S (Satire)
Performance (Test set: 3,344 samples)
- Accuracy: 75.42%
- Macro F1: 64.48%
- Weighted F1: 75.70%
Per-class F1:
- Factual (F): 83.72%
- Reporting (R): 76.35%
- Non-factual (N): 81.01%
- Misleading (M): 61.46%
- Satire (S): 19.86%
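Macro F1 is the unweighted mean of the five per-class scores, so the weak Satire class drags it well below the weighted F1, which averages per-class F1 by class support. A minimal sketch of that relationship using scikit-learn (assumed here for illustration; the y_true / y_pred values are placeholders, not the actual test data):

from sklearn.metrics import f1_score

# Placeholder labels and predictions; substitute the real test-set outputs.
y_true = ["F", "F", "R", "N", "M", "S", "F", "N"]
y_pred = ["F", "R", "R", "N", "M", "F", "F", "N"]

labels = ["F", "R", "N", "M", "S"]
per_class = f1_score(y_true, y_pred, labels=labels, average=None)          # one F1 per class
macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")        # unweighted mean
weighted_f1 = f1_score(y_true, y_pred, labels=labels, average="weighted")  # support-weighted mean
print(per_class, macro_f1, weighted_f1)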
Training Summary
- Max sequence length: 128
- Epochs: 3 (early stopping)
- Batch size: 16
- Learning rate: 2e-5
- Loss: Weighted CrossEntropy
- Seed: 42 (reproducibility)
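The weighted cross-entropy counteracts class imbalance (Satire in particular is rare) by scaling each class's contribution to the loss. The exact weighting scheme is not documented here; a minimal sketch assuming inverse-frequency weights computed from hypothetical training label counts:

import torch
from collections import Counter

# Placeholder label ids (0=F, 1=R, 2=N, 3=M, 4=S); use the real training labels.
train_labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 4]
counts = Counter(train_labels)
num_classes = 5

# Inverse-frequency weights: rarer classes get a larger weight.
weights = torch.tensor(
    [len(train_labels) / (num_classes * counts[c]) for c in range(num_classes)],
    dtype=torch.float,
)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)  # plugged into the training loop / Trainer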
Usage
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
MODEL_ID = "Rahilgh/model4_1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()
LABEL_MAP = {0: "F", 1: "R", 2: "N", 3: "M", 4: "S"}
LABEL_NAMES = {
    "F": "Factual",
    "R": "Reporting",
    "N": "Non-factual",
    "M": "Misleading",
    "S": "Satire",
}
texts = [
    "قالك بلي رايحين ينحو الباك هذا العام",  # "They say they're going to cancel the bac this year"
]
for text in texts:
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=128,
        truncation=True,
        padding=True,
    ).to(device)

    with torch.no_grad():
        outputs = model(**inputs)

    probs = torch.softmax(outputs.logits, dim=1)[0]
    pred_id = probs.argmax().item()
    confidence = probs[pred_id].item()
    label = LABEL_MAP[pred_id]

    print(f"Text: {text}")
    print(f"Prediction: {LABEL_NAMES[label]} ({label}) - {confidence:.2%}\n")
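The tokenizer also accepts a list of texts, so several inputs can be scored in one forward pass instead of looping; a short sketch reusing the model, tokenizer, and label maps defined above:

# Batched variant (illustrative): tokenize and classify all texts at once.
batch = tokenizer(
    texts,
    return_tensors="pt",
    max_length=128,
    truncation=True,
    padding=True,
).to(device)

with torch.no_grad():
    logits = model(**batch).logits

probs = torch.softmax(logits, dim=1)
for text, p in zip(texts, probs):
    pred_id = p.argmax().item()
    label = LABEL_MAP[pred_id]
    print(f"{text} -> {LABEL_NAMES[label]} ({label}) - {p[pred_id].item():.2%}")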