Ordis-1.5B-V355-VarGH
A 1.5B fine-tuned model focused on practical capabilities: anti-hallucination, honest refusal ("I don't know"), and structured reasoning.
Built on Qwen2.5-1.5B-Instruct using LoRA with a 4-stage Progressive Identity Training (PIT) pipeline, iterated through 16+ controlled experiments.
Standard Benchmarks
Evaluation: lm-eval v0.4.10, 0-shot, A100-80GB. Both models were tested under identical settings; a reproduction sketch follows the table below.
Fine-tuning introduced a small alignment tax on standard benchmarks; most scores sit slightly below the base model. This is expected: the model gained anti-hallucination, structured reasoning, and honest refusal capabilities, which consume part of the 1.5B parameter budget. The one exception is TruthfulQA (+1.02 pp), where training to resist hallucination directly improved truthfulness.
| Benchmark | Ordis 1.5B | Base Qwen2.5-1.5B | Delta (pp) |
|-----------|-----------|-------------------|-------|
| TruthfulQA MC2 | 47.73% | 46.71% | +1.02 |
| GPQA | 27.90% | 28.35% | -0.45 |
| HellaSwag | 68.14% | 68.22% | -0.08 |
| ARC-Challenge | 45.22% | 46.84% | -1.62 |
| MMLU | 57.93% | 60.15% | -2.22 |
| GSM8K (CoT) | 50.80% | — | Not directly comparable* |
| AIME 2024 | 0% | 0% | — |
*GSM8K uses free-form text generation and is sensitive to chat-template configuration, so we report the Ordis score but do not claim a delta. AIME is beyond 1.5B capability; the zero score is reported honestly.
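For reproducibility, here is a minimal sketch of the runs above using lm-eval's Python API. The task names, the `peft` adapter argument, and the dtype are assumptions based on lm-eval v0.4.x conventions; adjust for your installed version.

```python
# Sketch: reproduce the 0-shot standard-benchmark runs with lm-eval (v0.4.x).
# Assumes the HF backend, whose model_args accept a `peft` adapter path.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=Qwen/Qwen2.5-1.5B-Instruct,"
        "peft=sugiken/Ordis-1.5B-V355-VarGH,"
        "dtype=bfloat16"
    ),
    tasks=["truthfulqa_mc2", "hellaswag", "arc_challenge", "mmlu"],
    num_fewshot=0,  # all standard runs above are 0-shot
    batch_size=8,   # assumed; tune to GPU memory
)

# GSM8K (CoT) is generation-based and chat-template sensitive, which is
# why its delta is not claimed above.
for task, metrics in results["results"].items():
    print(task, metrics)
```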
Where Ordis differs from the base model is in practical capabilities that standard benchmarks do not measure: structured self-correction, three-layer cognition ("I don't know" honesty), and causal reasoning.
CLadder Causal Reasoning
CLadder is an academic benchmark based on Judea Pearl's Ladder of Causation (300 questions across the three rungs of the ladder); see the CLadder paper for details.
| Rung | Meaning | Score |
|------|---------|-------|
| Rung 1 (Association) | Statistical correlation | 46.0% (40/87) |
| Rung 2 (Intervention) | Active intervention | 50.6% (45/89) |
| Rung 3 (Counterfactual) | "What if" reasoning | 62.9% (78/124) |
| Overall | | 54.33% (163/300) |
For reference, the CLadder paper reported LLaMA-6.7B at ~50% and GPT-3.5 at ~55-60%.
BigBench CRASS AI & Causal Judgment
Independently tested by community members. BigBench CRASS AI is a counterfactual reasoning benchmark: it tests the model's ability to reason about hypothetical alternatives to a described situation.
| Benchmark | Shot | Ordis 1.5B | Base Qwen2.5-1.5B | Delta |
|-----------|------|-----------|-------------------|-------|
| CRASS AI | 0 | 34.09% | 52.27% | -18.18pp |
| CRASS AI | 25 | 81.82% | 88.64% | -6.82pp |
| Causal Judgment | 0 | 47.89% | 50.00% | -2.11pp |
| Causal Judgment | 25 | 55.79% | 53.68% | +2.11pp |
Key finding: CRASS AI 0-shot shows a significant -18.18pp regression. This is the alignment tax manifesting in counterfactual reasoning — the model's anti-hallucination training makes it more conservative on hypothetical scenarios. With 25-shot examples, the gap narrows to -6.82pp, confirming the capability exists but the 0-shot default behavior has shifted.
Custom Evaluation
| Benchmark | Score |
|-----------|-------|
| 60-Question Eval (6 dimensions) | 85.0% (51/60) |
| 114-Point Comprehensive | 75.4% (86/114) |
60-Question Breakdown:
| Dimension | Score |
|-----------|-------|
| Reasoning | 100% |
| Common Sense | 100% |
| Defense Overload | 100% |
| Anti-Hallucination | 90% |
| Identity | 60% |
| IDK Ability | 60% |
Key Capabilities
- Anti-Hallucination: Uncertainty from reasoning, not safety templates
- Three-Layer Cognition: Know what you know / know what you don't / assess before answering
- Structured Self-Correction: Acknowledge, attribute, correct, verify
- Causal Reasoning: Cross-domain causal structure transfer
Known Limitations
- Alignment tax: Standard benchmarks regress slightly vs the base model; CRASS AI counterfactual reasoning drops 18.18pp at 0-shot
- Anti-gaslighting: Cannot resist persistent false-memory injection (open-loop limitation)
- Mid-confidence instability: 1.5B capacity ceiling
- English identity leakage: Base model prior occasionally surfaces
- Proper noun hallucination: Limited parametric memory at 1.5B scale
Usage
This repo contains LoRA adapter weights. To use:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the Ordis LoRA adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = PeftModel.from_pretrained(base_model, "sugiken/Ordis-1.5B-V355-VarGH")

# LoRA does not change the tokenizer, so the base model's tokenizer is used.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
```
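A minimal generation sketch building on the snippet above. The prompt is a hypothetical anti-hallucination probe (the book title is made up), and the generation settings are illustrative assumptions:

```python
# Sketch: chat-template inference with the base-plus-adapter model loaded above.
# The question asks about a made-up novel to probe the "I don't know" behavior.
messages = [
    {"role": "user", "content": "Who wrote the 1987 novel 'The Glass Harbor'?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
reply = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(reply)
```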
For GGUF quantized versions (recommended for deployment), see Ordis-1.5B-V355-VarGH-GGUF.
Model Details
| Property | Value |
|----------|-------|
| Base Model | Qwen/Qwen2.5-1.5B-Instruct |
| Parameters | 1.5B |
| Fine-tuning | LoRA (r=32, alpha=64) |
| Training | 4-Stage PIT (Progressive Identity Training) |
| Context Length | 32K (base), trained at 2048 |
| Languages | Chinese (primary), English |
| License | Apache 2.0 |
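For reference, a minimal sketch of a peft LoraConfig consistent with the table above. Only r=32 and alpha=64 come from this card; the target modules, dropout, and task type are assumptions, not the training recipe:

```python
# Sketch: a LoRA configuration matching the r/alpha values in the table.
# Everything besides r and lora_alpha is an assumption.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    lora_dropout=0.05,  # assumed
    task_type="CAUSAL_LM",
)
```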