Algerian Darija Misinformation Detection - Linear SVM (Model 1.2)

This model is a Linear Support Vector Machine (SVM) classifier trained to detect misinformation in Algerian Darija (Algerian Arabic dialect) text.

Model Description

  • Model Type: Linear Support Vector Machine (LinearSVC) with TF-IDF features
  • Language: Algerian Darija with Arabic script
  • Task: Multi-class text classification (5 classes)
  • Member: 1
  • Model ID: M1.2

Classes

The model classifies text into 5 categories:

  • F (Fake): False or inaccurate information
  • R (Real): Verified, factual content
  • N (Non-News): Statements that are not news content
  • M (Misleading): Partially true but misleading content
  • S (Satire): Satirical or humorous content

Model Architecture

Vectorization: TF-IDF

  • Max features: 10,000
  • N-gram range: (1, 3) - captures unigrams, bigrams, and trigrams
  • Min document frequency: 2
  • Max document frequency: 0.9
  • Sublinear TF scaling: Enabled

Classifier: Linear Support Vector Machine

  • Max iterations: 2000
  • Regularization (C): 1.0
  • Class weighting: balanced (to handle class imbalance)
  • Solver: liblinear (the backend behind scikit-learn's LinearSVC); the dual formulation is selected automatically from the data characteristics (see the construction sketch after this list)
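
Both components map directly onto scikit-learn. A minimal construction sketch, assuming library defaults for anything not listed above (random_state mirrors the seed reported under Training Details):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# TF-IDF configuration as listed above
tfidf = TfidfVectorizer(
    max_features=10_000,
    ngram_range=(1, 3),
    min_df=2,
    max_df=0.9,
    sublinear_tf=True,
)

# Linear SVM configuration as listed above
svm = LinearSVC(
    C=1.0,
    max_iter=2000,
    class_weight="balanced",
    random_state=42,
)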

Features (Enhanced Model)

The enhanced model combines TF-IDF features with 6 metadata features:

  1. Cleaned word count (z-score normalized)
  2. Punctuation density
  3. Exclamation count
  4. Question count
  5. All caps ratio
  6. Emoji count

All metadata features are normalized using StandardScaler (z-score normalization) fitted on the training data.
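
For illustration, the normalization step would look like the following sketch; the metadata matrix here is hypothetical:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative rows: [word count, punctuation density, exclamation count,
# question count, all-caps ratio, emoji count]
train_metadata = np.array([
    [12, 0.050, 1, 0, 0.00, 2],
    [45, 0.020, 0, 1, 0.10, 0],
    [30, 0.033, 3, 0, 0.05, 1],
])

# Fit on the training split only; the same fitted scaler is reused at inference
scaler = StandardScaler().fit(train_metadata)
scaled = scaler.transform(train_metadata)  # z-score per feature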

Performance

Validation Set

  • Macro F1: 0.6340
  • Weighted F1: 0.7357
  • Accuracy: 0.7360

Test Set (Final Evaluation)

  • Macro F1: 0.6151
  • Weighted F1: 0.7176
  • Accuracy: 0.7165

Per-Class Performance (Test Set)

Class  Precision  Recall  F1-Score  Support
F      0.8361     0.7876  0.8111    965
R      0.5978     0.5959  0.5969    641
N      0.7556     0.8104  0.7821    828
M      0.7450     0.7485  0.7468    851
S      0.2456     0.2222  0.2333    63
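
Per-class numbers in this form are what scikit-learn's classification_report produces. A self-contained sketch with made-up labels (not the real test split):

from sklearn.metrics import classification_report, f1_score

# Dummy gold labels and predictions, for illustration only
y_true = ["F", "F", "R", "N", "M", "S", "F", "N"]
y_pred = ["F", "R", "R", "N", "M", "S", "F", "N"]

print(classification_report(y_true, y_pred, digits=4))
print("Macro F1:   ", f1_score(y_true, y_pred, average="macro"))
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))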

Usage

Installation

pip install huggingface_hub scikit-learn numpy scipy

Quick Start (Complete Example)

import joblib
import numpy as np
import re
from scipy.special import softmax
from scipy.sparse import hstack, csr_matrix
from huggingface_hub import hf_hub_download

# Configuration
REPO_NAME = "Kenny-lalek/algerian-darija-misinformation-svm"

# Load model artifacts
print("Loading model from Hugging Face...")
tfidf_vectorizer = joblib.load(
    hf_hub_download(repo_id=REPO_NAME, filename="tfidf_vectorizer.joblib")
)
svm_model = joblib.load(
    hf_hub_download(repo_id=REPO_NAME, filename="svm_model.joblib")
)
metadata_scaler = joblib.load(
    hf_hub_download(repo_id=REPO_NAME, filename="metadata_scaler.joblib")
)
print("Model loaded successfully!")

# Preprocessing function
def remove_diacritics(text):
    """Remove Arabic diacritics (tashkeel)"""
    arabic_diacritics = re.compile("""
        ّ    | # Tashdid
        َ    | # Fatha
        ً    | # Tanwin Fath
        ُ    | # Damma
        ٌ    | # Tanwin Damm
        ِ    | # Kasra
        ٍ    | # Tanwin Kasr
        ْ    | # Sukun
        ـ     # Tatwil/Kashida
    """, re.VERBOSE)
    return re.sub(arabic_diacritics, '', text)

def preprocess_text(text):
    """
    Preprocess text according to training specifications
    - Remove diacritics
    - Normalize whitespace
    """
    if text is None or text == "":
        return ""
    
    text = str(text)
    text = remove_diacritics(text)
    text = ' '.join(text.split())  # Normalize whitespace
    
    return text.strip()

def extract_metadata_features(text):
    """
    Extract metadata features for the enhanced model
    Returns a numpy array of shape (1, 6)
    """
    # Word count
    word_count = len(text.split())
    
    # Punctuation density
    punctuation_chars = ".,!?;:،؛"
    punctuation_count = sum(1 for char in text if char in punctuation_chars)
    punctuation_density = punctuation_count / len(text) if len(text) > 0 else 0
    
    # Exclamation and question counts
    exclamation_count = text.count('!')
    question_count = text.count('?') + text.count('؟')  # Include Arabic question mark
    
    # All caps ratio (only meaningful for Latin characters; Arabic script has no case)
    letters = [c for c in text if c.isalpha()]
    all_caps_ratio = sum(1 for c in letters if c.isupper()) / len(letters) if len(letters) > 0 else 0
    
    # Emoji count (simplified: fixed at 0 here; full detection would need an
    # emoji library, so this feature may differ slightly from training)
    emoji_count = 0
    
    return np.array([[word_count, punctuation_density, exclamation_count, 
                      question_count, all_caps_ratio, emoji_count]])

def predict(text):
    """
    Predict the misinformation class for input text
    
    Args:
        text (str): Input text in Algerian Darija
        
    Returns:
        dict: Dictionary containing:
            - predicted_label: The predicted class (F, R, N, M, or S)
            - confidence: Confidence score (0-1)
            - probabilities: Dictionary of probabilities for each class
    """
    # Step 1: Preprocess text
    processed_text = preprocess_text(text)
    
    # Step 2: Extract TF-IDF features
    text_features = tfidf_vectorizer.transform([processed_text])
    
    # Step 3: Extract and normalize metadata features
    metadata = extract_metadata_features(processed_text)
    metadata_scaled = metadata_scaler.transform(metadata)
    
    # Step 4: Combine TF-IDF and metadata features
    features = hstack([text_features, csr_matrix(metadata_scaled)])
    
    # Step 5: Predict using SVM
    prediction = svm_model.predict(features)[0]
    
    # Step 6: Get decision function scores and convert to probabilities
    # Note: LinearSVC has no predict_proba, so we softmax the decision scores
    decision_scores = svm_model.decision_function(features)[0]
    probabilities = softmax(decision_scores)  # Convert to pseudo-probabilities
    
    # Step 7: Format results
    classes = svm_model.classes_  # label order from the fitted model matches decision_function columns
    result = {
        'predicted_label': prediction,
        'confidence': float(np.max(probabilities)),
        'probabilities': {
            class_label: float(prob) 
            for class_label, prob in zip(classes, probabilities)
        }
    }
    
    return result

# Example usage
if __name__ == "__main__":
    # Test examples
    test_texts = [
        # "The Algerian government announces new measures to support the economy"
        "الحكومة الجزائرية تعلن عن إجراءات جديدة لدعم الاقتصاد",
        # "This news is a complete lie and fabrication!"
        "هذا الخبر كذب ومفبرك بالكامل!",
        # "An official report from the Ministry of Health on the health situation"
        "تقرير رسمي من وزارة الصحة حول الوضع الصحي"
    ]
    
    print("Testing Algerian Darija Misinformation Detection (SVM)")
    
    for i, text in enumerate(test_texts, 1):
        print(f"\nExample {i}:")
        print(f"Text: {text}")
        
        result = predict(text)
        
        print(f"Predicted Label: {result['predicted_label']}")
        print(f"Confidence: {result['confidence']:.4f}")
        print(f"Class Probabilities:")
        for label, prob in result['probabilities'].items():
            print(f"  {label}: {prob:.4f}")

Using the Wrapper Class

For easier integration, use the provided AlgerianDarijaSVMClassifier wrapper class:

from huggingface_hub import hf_hub_download
import joblib
import numpy as np
import re
from scipy.special import softmax
from scipy.sparse import hstack, csr_matrix

class AlgerianDarijaSVMClassifier:
    """
    Wrapper class for Algerian Darija Misinformation Detection using SVM
    
    This class handles model loading, text preprocessing, feature extraction,
    and prediction in a convenient interface.
    """
    
    def __init__(self, repo_name, model_type="enhanced"):
        """
        Initialize the classifier
        
        Args:
            repo_name (str): HuggingFace repository name
            model_type (str): "enhanced" (with metadata) or "baseline" (text only)
        """
        self.repo_name = repo_name
        self.model_type = model_type
        
        print(f"Loading SVM model from {repo_name}...")
        
        # Load TF-IDF vectorizer
        self.tfidf_vectorizer = joblib.load(
            hf_hub_download(repo_id=repo_name, filename="tfidf_vectorizer.joblib")
        )
        
        # Load SVM model
        self.model = joblib.load(
            hf_hub_download(repo_id=repo_name, filename="svm_model.joblib")
        )
        
        # Take label order from the fitted model so that probabilities
        # line up with decision_function columns
        self.classes = list(self.model.classes_)
        
        # Load metadata scaler if enhanced model
        if model_type == "enhanced":
            self.metadata_scaler = joblib.load(
                hf_hub_download(repo_id=repo_name, filename="metadata_scaler.joblib")
            )
        else:
            self.metadata_scaler = None
        
        print("Model loaded successfully!")
    
    def _remove_diacritics(self, text):
        """Remove Arabic diacritics"""
        arabic_diacritics = re.compile("""
            ّ    | # Tashdid
            َ    | # Fatha
            ً    | # Tanwin Fath
            ُ    | # Damma
            ٌ    | # Tanwin Damm
            ِ    | # Kasra
            ٍ    | # Tanwin Kasr
            ْ    | # Sukun
            ـ     # Tatwil/Kashida
        """, re.VERBOSE)
        return re.sub(arabic_diacritics, '', text)
    
    def _preprocess_text(self, text):
        """Preprocess text"""
        if text is None or text == "":
            return ""
        text = str(text)
        text = self._remove_diacritics(text)
        text = ' '.join(text.split())
        return text.strip()
    
    def _extract_metadata_features(self, text):
        """Extract metadata features"""
        word_count = len(text.split())
        punctuation_chars = ".,!?;:،؛"
        punctuation_count = sum(1 for char in text if char in punctuation_chars)
        punctuation_density = punctuation_count / len(text) if len(text) > 0 else 0
        exclamation_count = text.count('!')
        question_count = text.count('?') + text.count('؟')
        letters = [c for c in text if c.isalpha()]
        all_caps_ratio = sum(1 for c in letters if c.isupper()) / len(letters) if len(letters) > 0 else 0
        emoji_count = 0  # simplified; see the note in the Quick Start example
        return np.array([[word_count, punctuation_density, exclamation_count, 
                          question_count, all_caps_ratio, emoji_count]])
    
    def predict(self, text):
        """
        Predict misinformation class for a single text
        
        Args:
            text (str): Input text in Algerian Darija
            
        Returns:
            dict: Prediction results with label, confidence, and probabilities
        """
        # Preprocess
        processed_text = self._preprocess_text(text)
        
        # Extract TF-IDF features
        text_features = self.tfidf_vectorizer.transform([processed_text])
        
        # Add metadata if enhanced model
        if self.model_type == "enhanced":
            metadata = self._extract_metadata_features(processed_text)
            metadata_scaled = self.metadata_scaler.transform(metadata)
            features = hstack([text_features, csr_matrix(metadata_scaled)])
        else:
            features = text_features
        
        # Predict
        prediction = self.model.predict(features)[0]
        decision_scores = self.model.decision_function(features)[0]
        probabilities = softmax(decision_scores)
        
        return {
            'predicted_label': prediction,
            'confidence': float(np.max(probabilities)),
            'probabilities': {
                class_label: float(prob) 
                for class_label, prob in zip(self.classes, probabilities)
            }
        }
    
    def predict_batch(self, texts):
        """
        Predict misinformation classes for multiple texts
        
        Args:
            texts (list): List of input texts
            
        Returns:
            list: List of prediction dictionaries
        """
        return [self.predict(text) for text in texts]

# Usage example
classifier = AlgerianDarijaSVMClassifier(
    repo_name="Kenny-lalek/algerian-darija-misinformation-svm",
    model_type="enhanced"
)

# Single prediction (gloss: "the mafia-like, gang-run regime")
result = classifier.predict("النظام المافيوي العصاباباتي")
print(f"Predicted: {result['predicted_label']}")
print(f"Confidence: {result['confidence']:.4f}")

# Batch prediction
texts = [
    # "The government announces new measures"
    "الحكومة تعلن عن إجراءات جديدة",
    # "This is fake, fabricated news"
    "هذا خبر كاذب ومفبرك",
    # "An official report from the ministry"
    "تقرير رسمي من الوزارة"
]
results = classifier.predict_batch(texts)
for text, result in zip(texts, results):
    print(f"{text} -> {result['predicted_label']}")

Training Details

Training Configuration

  • Training Date: December 24, 2024
  • Framework: scikit-learn
  • Random Seed: 42 (for reproducibility)

Dataset

  • Training samples: 15,699 (14,871 unique groups)
  • Validation samples: 3,348 (3,187 unique groups)
  • Test samples: 3,344 (3,187 unique groups)
  • Language: Algerian Darija with Arabic script

Preprocessing Steps

  1. Diacritics removal: All Arabic diacritical marks removed (compact demo after this list)
  2. Whitespace normalization: Multiple spaces collapsed to single space
  3. No stopword removal: Dialectal context is preserved
  4. Pre-cleaned data: URLs, mentions, and hashtags already removed
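
For steps 1 and 2, an equivalent compact formulation using a single character class (covering the same diacritics listed in the Quick Start code) would be:

import re

text = "مَرْحَبًا   بكم"  # contains diacritics and irregular spacing
# U+064B-U+0652 are the tashkeel marks; U+0640 is the tatweel/kashida
no_diacritics = re.sub(r"[\u064B-\u0652\u0640]", "", text)
normalized = " ".join(no_diacritics.split())
print(normalized)  # -> "مرحبا بكم"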

Class Imbalance Handling

The dataset has severe class imbalance, particularly for the Satire class (~2% of data). To address this:

  • Balanced class weights: the class_weight="balanced" option adjusts weights inversely proportional to class frequencies
  • Formula: weight[class] = n_samples / (n_classes * n_samples_per_class) (worked example below)
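
A worked example of this formula via scikit-learn's compute_class_weight, with hypothetical class counts chosen to mimic the ~2% Satire share:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label counts summing to ~15,700 training samples
counts = {"F": 4500, "M": 4000, "N": 3900, "R": 3000, "S": 300}
y = np.array([label for label, n in counts.items() for _ in range(n)])
classes = np.array(sorted(counts))

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
# weight[c] = n_samples / (n_classes * count[c]):
# Satire is upweighted to ~10.5 while majority classes drop to ~0.7
print(dict(zip(classes, np.round(weights, 3))))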

Limitations and Considerations

  1. Class Imbalance:

    • The Satire class (S) is severely underrepresented (~2% of training data)
    • This may result in lower recall for detecting satirical content
    • Balanced class weights partially mitigate but don't fully solve this issue
  2. Dialect Specificity:

    • Model is specifically trained on Algerian Darija
    • May not generalize well to other Arabic dialects (Moroccan, Tunisian, etc.)
    • Performance on Modern Standard Arabic (MSA) may be suboptimal
  3. Context Limitations:

    • As a classical ML model, it relies on bag-of-words features
    • May miss nuanced contextual information and semantic relationships
    • Cannot capture long-range dependencies in text
  4. Feature Engineering:

    • Performance depends heavily on TF-IDF representation quality
    • N-gram features (up to trigrams) capture local context only
    • Metadata features provide additional signal but are hand-crafted
  5. Probability Estimates:

    • SVM doesn't naturally output probabilities
    • Probabilities are derived via softmax of decision function scores
    • These are pseudo-probabilities and should be interpreted with caution (numeric sketch below)
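
To make the last point concrete, a small numeric sketch of the softmax conversion (the scores are made up):

import numpy as np
from scipy.special import softmax

# Hypothetical decision_function output for one text, one score per class
scores = np.array([1.8, -0.4, 0.9, 0.2, -1.1])
probs = softmax(scores)  # rescales scores to be positive and sum to 1
print(probs.round(3), probs.sum())  # pseudo-probabilities, not calibrated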

License

This model is released under the Apache 2.0 License.

