Ordis-1.5B-V355-VarGH

A 1.5B fine-tuned model focused on practical capabilities: anti-hallucination, honest refusal ("I don't know"), structured reasoning.

Built on Qwen2.5-1.5B-Instruct using LoRA with a 4-stage Progressive Identity Training (PIT) pipeline, iterated through 16+ controlled experiments.

Website | GGUF Versions | ModelScope


Standard Benchmarks

Evaluation: lm-eval v0.4.10, 0-shot, A100-80GB. Both models tested identically.

Fine-tuning introduced a small alignment tax on standard benchmarks. Most scores are slightly below the base model. This is expected: the model gained anti-hallucination, structured reasoning, and honest refusal capabilities, which consume some of the 1.5B parameter budget. The one exception is TruthfulQA (+1.02 pp), where training to resist hallucination directly improved truthfulness.

| Benchmark | Ordis 1.5B | Base Qwen2.5-1.5B | Delta (pp) |
|-----------|-----------|-------------------|------------|
| TruthfulQA MC2 | 47.73% | 46.71% | +1.02 |
| GPQA | 27.90% | 28.35% | -0.45 |
| HellaSwag | 68.14% | 68.22% | -0.08 |
| ARC-Challenge | 45.22% | 46.84% | -1.62 |
| MMLU | 57.93% | 60.15% | -2.22 |
| GSM8K (CoT) | 50.80% | — | Not directly comparable* |
| AIME 2024 | 0% | 0% | — |
*GSM8K uses text generation and is sensitive to chat template configuration. We report the Ordis score but do not claim a delta. AIME is beyond 1.5B capability; zero score reported honestly.
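The deltas above are plain percentage-point differences (Ordis minus base). A quick check of two rows from the table:

```python
# Scores from the benchmark table above, in percent.
ordis = {"TruthfulQA MC2": 47.73, "MMLU": 57.93}
base = {"TruthfulQA MC2": 46.71, "MMLU": 60.15}

for name in ordis:
    delta = ordis[name] - base[name]
    print(f"{name}: {delta:+.2f} pp")  # TruthfulQA MC2: +1.02 pp, MMLU: -2.22 pp
```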

Where Ordis differs from the base model is in practical capabilities that standard benchmarks do not measure: structured self-correction, three-layer cognition ("I don't know" honesty), causal reasoning.


CLadder Causal Reasoning

CLadder is an academic benchmark based on Judea Pearl's Ladder of Causation (300 questions, 3 levels). Paper

| Rung | Meaning | Score |
|------|---------|-------|
| Rung 1 (Association) | Statistical correlation | 46.0% (40/87) |
| Rung 2 (Intervention) | Active intervention | 50.6% (45/89) |
| Rung 3 (Counterfactual) | "What if" reasoning | 62.9% (78/124) |
| Overall | | 54.33% (163/300) |

For reference, the CLadder paper reported LLaMA-6.7B at ~50% and GPT-3.5 at ~55-60%.
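The overall score is the question-weighted aggregate of the three rungs, not a mean of the per-rung percentages. A quick check of the arithmetic from the table:

```python
# Per-rung (correct, total) counts from the CLadder table above.
rungs = {
    "Rung 1 (Association)": (40, 87),
    "Rung 2 (Intervention)": (45, 89),
    "Rung 3 (Counterfactual)": (78, 124),
}

correct = sum(c for c, _ in rungs.values())
total = sum(t for _, t in rungs.values())
overall = 100 * correct / total
print(f"{correct}/{total} = {overall:.2f}%")  # 163/300 = 54.33%
```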


BigBench CRASS AI & Causal Judgment

Independently tested by community members. The BigBench CRASS AI task probes counterfactual reasoning: the model's ability to reason about hypothetical scenarios.

| Benchmark | Shot | Ordis 1.5B | Base Qwen2.5-1.5B | Delta |
|-----------|------|-----------|-------------------|-------|
| CRASS AI | 0 | 34.09% | 52.27% | -18.18pp |
| CRASS AI | 25 | 81.82% | 88.64% | -6.82pp |
| Causal Judgment | 0 | 47.89% | 50.00% | -2.11pp |
| Causal Judgment | 25 | 55.79% | 53.68% | +2.11pp |

Key finding: CRASS AI 0-shot shows a significant -18.18pp regression. This is the alignment tax manifesting in counterfactual reasoning — the model's anti-hallucination training makes it more conservative on hypothetical scenarios. With 25-shot examples, the gap narrows to -6.82pp, confirming the capability exists but the 0-shot default behavior has shifted.

Custom Evaluation

| Benchmark | Score |
|-----------|-------|
| 60-Question Eval (6 dimensions) | 85.0% (51/60) |
| 124-Point Comprehensive | 75.4% (86/114) |

60-Question Breakdown:

| Dimension | Score |
|-----------|-------|
| Reasoning | 100% |
| Common Sense | 100% |
| Defense Overload | 100% |
| Anti-Hallucination | 90% |
| Identity | 60% |
| IDK Ability | 60% |


Key Capabilities

  • Anti-Hallucination: Uncertainty from reasoning, not safety templates

  • Three-Layer Cognition: Know what you know / know what you don't / assess before answering

  • Structured Self-Correction: Acknowledge, attribute, correct, verify

  • Causal Reasoning: Cross-domain causal structure transfer


Known Limitations

  • Alignment tax: Standard benchmarks regress slightly vs base model; CRASS AI counterfactual reasoning -18.18pp at 0-shot

  • Anti-gaslighting: Cannot resist persistent false memory injection (open-loop limitation)

  • Mid-confidence instability: 1.5B capacity ceiling

  • English identity leakage: Base model prior occasionally surfaces

  • Proper noun hallucination: Limited parametric memory at 1.5B scale


Usage

This repo contains LoRA adapter weights. To use:


```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then apply the Ordis LoRA adapter on top of it.
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = PeftModel.from_pretrained(base_model, "sugiken/Ordis-1.5B-V355-VarGH")

# The tokenizer is unchanged from the base model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
```
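Once loaded, generation follows the standard Qwen2.5 chat flow. A minimal sketch (the prompt and sampling settings are illustrative, not tuned values from this card):

```python
# Assumes `model` and `tokenizer` from the loading snippet above.
messages = [{"role": "user", "content": "What is the capital of France?"}]

# Format the conversation with the model's chat template.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

# Greedy decoding; adjust max_new_tokens / sampling as needed.
outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```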

For GGUF quantized versions (recommended for deployment), see Ordis-1.5B-V355-VarGH-GGUF.


Model Details

| Property | Value |
|----------|-------|
| Base Model | Qwen/Qwen2.5-1.5B-Instruct |
| Parameters | 1.5B |
| Fine-tuning | LoRA (r=32, alpha=64) |
| Training | 4-Stage PIT (Progressive Identity Training) |
| Context Length | 32K (base), trained at 2048 |
| Languages | Chinese (primary), English |
| License | Apache 2.0 |
