Algerian Darija Misinformation Detection - Linear SVM (Model 1.2)
This model is a Linear Support Vector Machine (SVM) classifier trained to detect misinformation in Algerian Darija (Algerian Arabic dialect) text.
Model Description
- Model Type: Linear Support Vector Machine (LinearSVC) with TF-IDF features
- Language: Algerian Darija with Arabic script
- Task: Multi-class text classification (5 classes)
- Member: 1
- Model ID: M1.2
Classes
The model classifies text into 5 categories:
- F (Fake): False or inaccurate information
- R (Real): Verified, factual content
- N (Non-News): Content that is not a news statement
- M (Misleading): Partially true but misleading content
- S (Satire): Satirical or humorous content
Model Architecture
Vectorization: TF-IDF
- Max features: 10,000
- N-gram range: (1, 3) - captures unigrams, bigrams, and trigrams
- Min document frequency: 2
- Max document frequency: 0.9
- Sublinear TF scaling: Enabled
Classifier: Linear Support Vector Machine
- Max iterations: 2000
- Regularization (C): 1.0
- Class weighting: balanced (to handle class imbalance)
- Solver: Auto-selected based on data characteristics
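As a minimal sketch, the configuration above maps onto the following scikit-learn setup (the actual training script is not included here, so treat the exact call as illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hyperparameters as listed above
tfidf = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1, 3),      # unigrams, bigrams, and trigrams
    min_df=2,
    max_df=0.9,
    sublinear_tf=True,
)
svm = LinearSVC(C=1.0, max_iter=2000, class_weight="balanced")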
Features (Enhanced Model)
The enhanced model combines TF-IDF features with 6 metadata features:
- Cleaned word count (z-score normalized)
- Punctuation density
- Exclamation count
- Question count
- All caps ratio
- Emoji count
All metadata features are normalized using StandardScaler (z-score normalization) fitted on the training data.
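A minimal sketch of that combination step, assuming the illustrative names train_texts (list of cleaned strings) and X_meta_train (an (n_samples, 6) array of the features above), with tfidf as in the previous sketch:

from scipy.sparse import hstack, csr_matrix
from sklearn.preprocessing import StandardScaler

# Fit the scaler on training metadata only, then stack with the TF-IDF matrix
scaler = StandardScaler().fit(X_meta_train)
X_train = hstack([
    tfidf.fit_transform(train_texts),            # sparse, (n_samples, <=10000)
    csr_matrix(scaler.transform(X_meta_train)),  # dense -> sparse, (n_samples, 6)
])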
Performance
Validation Set
- Macro F1: 0.6340
- Weighted F1: 0.7357
- Accuracy: 0.7360
Test Set (Final Evaluation)
- Macro F1: 0.6151
- Weighted F1: 0.7176
- Accuracy: 0.7165
Per-Class Performance (Test Set)
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| F | 0.8361 | 0.7876 | 0.8111 | 965 |
| R | 0.5978 | 0.5959 | 0.5969 | 641 |
| N | 0.7556 | 0.8104 | 0.7821 | 828 |
| M | 0.7450 | 0.7485 | 0.7468 | 851 |
| S | 0.2456 | 0.2222 | 0.2333 | 63 |
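The gap between macro F1 (0.6151) and weighted F1 (0.7176) is driven largely by the small Satire class (F1 = 0.2333, support = 63), which macro averaging weights equally with the four much larger classes.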
Usage
Installation
pip install huggingface_hub scikit-learn numpy scipy joblib
Quick Start (Complete Example)
import re

import joblib
import numpy as np
from scipy.special import softmax
from scipy.sparse import hstack, csr_matrix
from huggingface_hub import hf_hub_download

# Configuration
REPO_NAME = "Kenny-lalek/algerian-darija-misinformation-svm"

# Load model artifacts
print("Loading model from Hugging Face...")
tfidf_vectorizer = joblib.load(
    hf_hub_download(repo_id=REPO_NAME, filename="tfidf_vectorizer.joblib")
)
svm_model = joblib.load(
    hf_hub_download(repo_id=REPO_NAME, filename="svm_model.joblib")
)
metadata_scaler = joblib.load(
    hf_hub_download(repo_id=REPO_NAME, filename="metadata_scaler.joblib")
)
print("Model loaded successfully!")


# Preprocessing functions
def remove_diacritics(text):
    """Remove Arabic diacritics (tashkeel)."""
    arabic_diacritics = re.compile("""
        ّ    | # Tashdid
        َ    | # Fatha
        ً    | # Tanwin Fath
        ُ    | # Damma
        ٌ    | # Tanwin Damm
        ِ    | # Kasra
        ٍ    | # Tanwin Kasr
        ْ    | # Sukun
        ـ     # Tatwil/Kashida
    """, re.VERBOSE)
    return re.sub(arabic_diacritics, '', text)


def preprocess_text(text):
    """
    Preprocess text according to the training specifications:
    - Remove diacritics
    - Normalize whitespace
    """
    if text is None or text == "":
        return ""
    text = str(text)
    text = remove_diacritics(text)
    text = ' '.join(text.split())  # Collapse runs of whitespace
    return text.strip()


def extract_metadata_features(text):
    """
    Extract the 6 metadata features used by the enhanced model.
    The feature order must match the order used when fitting metadata_scaler.
    Returns a numpy array of shape (1, 6).
    """
    # Word count
    word_count = len(text.split())

    # Punctuation density
    punctuation_chars = ".,!?;:،؛"
    punctuation_count = sum(1 for char in text if char in punctuation_chars)
    punctuation_density = punctuation_count / len(text) if len(text) > 0 else 0

    # Exclamation and question counts
    exclamation_count = text.count('!')
    question_count = text.count('?') + text.count('؟')  # Include Arabic question mark

    # All-caps ratio (only meaningful for Latin-script characters)
    letters = [c for c in text if c.isalpha()]
    all_caps_ratio = sum(1 for c in letters if c.isupper()) / len(letters) if len(letters) > 0 else 0

    # Emoji count (simplified - a full implementation would need an emoji library)
    emoji_count = 0

    return np.array([[word_count, punctuation_density, exclamation_count,
                      question_count, all_caps_ratio, emoji_count]])


def predict(text):
    """
    Predict the misinformation class for input text.

    Args:
        text (str): Input text in Algerian Darija

    Returns:
        dict: Dictionary containing:
            - predicted_label: The predicted class (F, R, N, M, or S)
            - confidence: Confidence score (0-1)
            - probabilities: Dictionary of pseudo-probabilities per class
    """
    # Step 1: Preprocess text
    processed_text = preprocess_text(text)

    # Step 2: Extract TF-IDF features
    text_features = tfidf_vectorizer.transform([processed_text])

    # Step 3: Extract and normalize metadata features
    metadata = extract_metadata_features(processed_text)
    metadata_scaled = metadata_scaler.transform(metadata)

    # Step 4: Combine TF-IDF and metadata features
    features = hstack([text_features, csr_matrix(metadata_scaled)])

    # Step 5: Predict using the SVM
    prediction = svm_model.predict(features)[0]

    # Step 6: Get decision-function scores and convert to pseudo-probabilities
    # (LinearSVC has no predict_proba, so we apply softmax to decision_function)
    decision_scores = svm_model.decision_function(features)[0]
    probabilities = softmax(decision_scores)

    # Step 7: Format results. The score columns follow svm_model.classes_
    # (alphabetical order), so use that rather than a hardcoded label list.
    result = {
        'predicted_label': prediction,
        'confidence': float(np.max(probabilities)),
        'probabilities': {
            class_label: float(prob)
            for class_label, prob in zip(svm_model.classes_, probabilities)
        }
    }
    return result


# Example usage
if __name__ == "__main__":
    # Test examples
    test_texts = [
        "الحكومة الجزائرية تعلن عن إجراءات جديدة لدعم الاقتصاد",
        "هذا الخبر كذب ومفبرك بالكامل!",
        "تقرير رسمي من وزارة الصحة حول الوضع الصحي"
    ]

    print("Testing Algerian Darija Misinformation Detection (SVM)")
    for i, text in enumerate(test_texts, 1):
        print(f"\nExample {i}:")
        print(f"Text: {text}")
        result = predict(text)
        print(f"Predicted Label: {result['predicted_label']}")
        print(f"Confidence: {result['confidence']:.4f}")
        print("Class Probabilities:")
        for label, prob in result['probabilities'].items():
            print(f"  {label}: {prob:.4f}")
Using the Wrapper Class
For easier integration, use the provided AlgerianDarijaSVMClassifier wrapper class:
from huggingface_hub import hf_hub_download
import joblib
import numpy as np
import re
from scipy.special import softmax
from scipy.sparse import hstack, csr_matrix
class AlgerianDarijaSVMClassifier:
    """
    Wrapper class for Algerian Darija misinformation detection using SVM.

    Handles model loading, text preprocessing, feature extraction,
    and prediction behind a convenient interface.
    """

    def __init__(self, repo_name, model_type="enhanced"):
        """
        Initialize the classifier.

        Args:
            repo_name (str): Hugging Face repository name
            model_type (str): "enhanced" (with metadata) or "baseline" (text only)
        """
        self.repo_name = repo_name
        self.model_type = model_type

        print(f"Loading SVM model from {repo_name}...")

        # Load TF-IDF vectorizer
        self.tfidf_vectorizer = joblib.load(
            hf_hub_download(repo_id=repo_name, filename="tfidf_vectorizer.joblib")
        )

        # Load SVM model
        self.model = joblib.load(
            hf_hub_download(repo_id=repo_name, filename="svm_model.joblib")
        )

        # Use the fitted model's class ordering so labels line up with
        # the decision_function columns
        self.classes = list(self.model.classes_)

        # Load metadata scaler if enhanced model
        if model_type == "enhanced":
            self.metadata_scaler = joblib.load(
                hf_hub_download(repo_id=repo_name, filename="metadata_scaler.joblib")
            )
        else:
            self.metadata_scaler = None

        print("Model loaded successfully!")

    def _remove_diacritics(self, text):
        """Remove Arabic diacritics (tashkeel)."""
        arabic_diacritics = re.compile("""
            ّ    | # Tashdid
            َ    | # Fatha
            ً    | # Tanwin Fath
            ُ    | # Damma
            ٌ    | # Tanwin Damm
            ِ    | # Kasra
            ٍ    | # Tanwin Kasr
            ْ    | # Sukun
            ـ     # Tatwil/Kashida
        """, re.VERBOSE)
        return re.sub(arabic_diacritics, '', text)

    def _preprocess_text(self, text):
        """Remove diacritics and normalize whitespace."""
        if text is None or text == "":
            return ""
        text = str(text)
        text = self._remove_diacritics(text)
        text = ' '.join(text.split())
        return text.strip()

    def _extract_metadata_features(self, text):
        """Extract the 6 metadata features (order must match training)."""
        word_count = len(text.split())
        punctuation_chars = ".,!?;:،؛"
        punctuation_count = sum(1 for char in text if char in punctuation_chars)
        punctuation_density = punctuation_count / len(text) if len(text) > 0 else 0
        exclamation_count = text.count('!')
        question_count = text.count('?') + text.count('؟')
        letters = [c for c in text if c.isalpha()]
        all_caps_ratio = sum(1 for c in letters if c.isupper()) / len(letters) if len(letters) > 0 else 0
        emoji_count = 0  # Simplified; a full implementation would use an emoji library
        return np.array([[word_count, punctuation_density, exclamation_count,
                          question_count, all_caps_ratio, emoji_count]])

    def predict(self, text):
        """
        Predict the misinformation class for a single text.

        Args:
            text (str): Input text in Algerian Darija

        Returns:
            dict: Prediction results with label, confidence, and pseudo-probabilities
        """
        # Preprocess
        processed_text = self._preprocess_text(text)

        # Extract TF-IDF features
        text_features = self.tfidf_vectorizer.transform([processed_text])

        # Append scaled metadata features for the enhanced model
        if self.model_type == "enhanced":
            metadata = self._extract_metadata_features(processed_text)
            metadata_scaled = self.metadata_scaler.transform(metadata)
            features = hstack([text_features, csr_matrix(metadata_scaled)])
        else:
            features = text_features

        # Predict, then convert decision scores to pseudo-probabilities
        prediction = self.model.predict(features)[0]
        decision_scores = self.model.decision_function(features)[0]
        probabilities = softmax(decision_scores)

        return {
            'predicted_label': prediction,
            'confidence': float(np.max(probabilities)),
            'probabilities': {
                class_label: float(prob)
                for class_label, prob in zip(self.classes, probabilities)
            }
        }

    def predict_batch(self, texts):
        """
        Predict misinformation classes for multiple texts.

        Args:
            texts (list): List of input texts

        Returns:
            list: List of prediction dictionaries
        """
        return [self.predict(text) for text in texts]
# Usage example
classifier = AlgerianDarijaSVMClassifier(
    repo_name="Kenny-lalek/algerian-darija-misinformation-svm",
    model_type="enhanced"
)

# Single prediction
result = classifier.predict("النظام المافيوي العصاباباتي")
print(f"Predicted: {result['predicted_label']}")
print(f"Confidence: {result['confidence']:.4f}")

# Batch prediction
texts = [
    "الحكومة تعلن عن إجراءات جديدة",
    "هذا خبر كاذب ومفبرك",
    "تقرير رسمي من الوزارة"
]
results = classifier.predict_batch(texts)
for text, result in zip(texts, results):
    print(f"{text} -> {result['predicted_label']}")
Training Details
Training Configuration
- Training Date: December 24, 2024
- Framework: scikit-learn
- Random Seed: 42 (for reproducibility)
Dataset
- Training samples: 15,699 (14,871 unique groups)
- Validation samples: 3,348 (3,187 unique groups)
- Test samples: 3,344 (3,187 unique groups)
- Language: Algerian Darija with Arabic script
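(The "unique groups" counts suggest the splits were made at the group level, presumably so that near-duplicate texts do not leak across the train, validation, and test sets.)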
Preprocessing Steps
- Diacritics removal: All Arabic diacritical marks removed
- Whitespace normalization: Multiple spaces collapsed to single space
- No stopword removal: Dialectal context is preserved
- Pre-cleaned data: URLs, mentions, and hashtags already removed
Class Imbalance Handling
The dataset has severe class imbalance, particularly for the Satire class (~2% of data). To address this:
- Balanced class weights: with class_weight="balanced", each class weight is set inversely proportional to its frequency in the training data
- Formula:
weight[class] = n_samples / (n_classes * n_samples_per_class)
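For example, with 5 classes and Satire at roughly 2% of the training data, its weight works out to about n_samples / (5 × 0.02 × n_samples) = 10, so each Satire example contributes roughly ten times as much to the loss as an example from a perfectly balanced class would.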
Limitations and Considerations
Class Imbalance:
- The Satire class (S) is severely underrepresented (~2% of training data)
- This may result in lower recall for detecting satirical content
- Balanced class weights partially mitigate but don't fully solve this issue
Dialect Specificity:
- Model is specifically trained on Algerian Darija
- May not generalize well to other Arabic dialects (Moroccan, Tunisian, etc.)
- Performance on Modern Standard Arabic (MSA) may be suboptimal
Context Limitations:
- As a classical ML model, it relies on bag-of-words features
- May miss nuanced contextual information and semantic relationships
- Cannot capture long-range dependencies in text
Feature Engineering:
- Performance depends heavily on TF-IDF representation quality
- N-gram features (up to trigrams) capture local context only
- Metadata features provide additional signal but are hand-crafted
Probability Estimates:
- SVM doesn't naturally output probabilities
- Probabilities are derived via softmax of decision function scores
- These are pseudo-probabilities and should be interpreted with caution (see the calibration sketch below)
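If better-calibrated probabilities matter for a downstream application, one standard option at training time (not part of this model's released artifacts) is to wrap the SVM in scikit-learn's CalibratedClassifierCV. A sketch, where X_train, y_train, and X_test are illustrative names for the stacked features and labels:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Sigmoid (Platt-style) calibration with 5-fold cross-validation
calibrated_svm = CalibratedClassifierCV(
    LinearSVC(C=1.0, max_iter=2000, class_weight="balanced"),
    method="sigmoid",
    cv=5,
)
calibrated_svm.fit(X_train, y_train)
probs = calibrated_svm.predict_proba(X_test)  # proper probability estimates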
License
This model is released under the Apache 2.0 License.