For now guide. Detailed version is here.
You are better off building the system layers first, then doing targeted post-training (SFT → preference tuning). Training a new base model from scratch is usually the worst ROI path for a small team unless “pretraining research” is explicitly the goal.
Below is a practical roadmap that matches “clarity, reasoning quality, reliability, real-world usability” and small-team constraints.
1) First reality check: what’s feasible for a small team
Training from scratch (pretraining)
Pretraining is dominated by (1) data scale + legal provenance, (2) compute scale, (3) infra and reproducibility. The compute-optimal scaling literature shows you generally need a lot of tokens and training runs to land a strong base model, and the “optimal” tokens increase with model size. (arXiv)
Small teams can absolutely pretrain tiny models as an educational exercise, but if your goal is a usable assistant, the best leverage is elsewhere.
Post-training (fine-tuning/alignment)
Post-training is where small teams can win:
- You start from a strong open-weights base.
- You adapt behavior (clarity, tool discipline, refusal calibration).
- You do it with tractable hardware using parameter-efficient methods.
Methods like LoRA and QLoRA exist specifically to make this feasible by reducing trainable parameters and memory. (arXiv)
2) Base model choice: how to choose and a short shortlist
Pick base models the way you pick an engine: license + serving + context + ecosystem.
Selection criteria that matter in practice
- License posture
- If you want fewer surprises for redistribution and commercial use, Apache-2.0 style licensing is simplest.
- “Open weights” models sometimes have custom license terms that create obligations or restrictions.
- Serving compatibility
- Decide early what you will serve with (common choices: vLLM, llama.cpp, TGI).
- If you want structured outputs (JSON schema, regex, grammars), check backend support. vLLM supports multiple guided decoding modes including JSON schema. (docs.vllm.ai)
- Context length and KV-cache cost
- Long context is not free. It increases KV cache memory pressure and often becomes the bottleneck before raw FLOPs. vLLM’s docs explicitly call out controlling context length and batch sizing to manage memory. (docs.vllm.ai)
Practical late-2025-ish shortlist (small-to-mid models)
Apache-2.0 options (cleaner for “open” projects):
- Qwen2.5-7B-Instruct (Apache-2.0; strong generalist; long-context support is a focus). (Hugging Face)
- Mistral-Nemo-Instruct-2407 (12B) (Apache-2.0; 128k context; positioned as a drop-in for “small model” deployments). (Hugging Face)
Custom-license options (still popular, but read the terms carefully):
- Llama 3.1: the license includes redistribution/attribution obligations and a large-scale MAU commercial term (the “>700M monthly active users” clause). (Hugging Face)
- Gemma: distributed under Google’s terms with specific distribution conditions and a prohibited use policy reference. (Google AI for Developers)
A pragmatic recommendation
Pick two bases:
- Fast/small (developer experience, iteration speed): ~7B–8B
- Better reasoning (still tractable): ~12B–14B
Then force both through the same orchestration, tool schema, and evaluation harness so the base model is swappable.
3) Fine-tuning vs training from scratch: what to do first
Recommended order for a small team
- System reliability layer (orchestration, context policy, tool execution, schema validation, tracing)
- SFT (supervised fine-tuning) on a small, high-quality set aligned to your rubric
- Preference tuning (DPO-style) driven by failure cases
Why this order:
- Wrappers can enforce many constraints deterministically (schemas, tool execution verification).
- Training is best used for behaviors the wrapper cannot reliably enforce (decision policies, grounding habits, refusal calibration).
What to use for post-training
- LoRA: add small trainable low-rank adapters instead of updating all weights. (arXiv)
- QLoRA: LoRA on top of a 4-bit quantized base; the original paper shows feasibility even for very large models under constrained VRAM. (OpenReview)
- DPO (Direct Preference Optimization): aligns to pairwise preferences without the full RLHF reward-model pipeline complexity. (OpenReview)
4) Dataset strategy: quality beats quantity, and provenance matters
Data for post-training (SFT + preference)
For your goals (clarity, reasoning, usability), you want:
- Rubric-tagged instruction conversations
- Failure-trace mining (collect real failures from your runtime and turn them into training examples)
- Length-controlled preference pairs so “verbosity doesn’t auto-win”
If you ever do continued pretraining (optional)
Use large open corpora mainly as a reference point and for controlled experiments:
- FineWeb (web-derived, processed/deduplicated; created as a high-quality pretraining dataset artifact). (Hugging Face)
- Dolma (3T tokens, documented mixture; released for open pretraining research). (arXiv)
- RedPajama-V2 (web-only dataset with quality signals and deduped subsets). (arXiv)
Synthetic data: use it, but cap it
Synthetic data is useful for:
- Expanding coverage of rare formats (tool calls, structured outputs)
- Creating preference pairs at scale
But it can also create self-reinforcing artifacts if it dominates training. Use “real anchors” and hard validation gates.
5) Hardware and serving constraints: plan around VRAM and KV-cache
Training
- PEFT saves memory because you avoid storing optimizer states and gradients for the full base model. (Hugging Face)
- 4-bit quantization with bitsandbytes is a common path for QLoRA-style workflows. (Hugging Face)
Rule of thumb: if you cannot afford multiple GPUs, bias toward:
- smaller base models
- LoRA/QLoRA
- shorter training contexts first, then scale context later
Inference
- Long context can make the KV cache your primary scaling pain.
- vLLM explicitly documents memory controls (max context, batch sizing, GPU memory utilization). (docs.vllm.ai)
- Techniques like quantized KV cache exist to reduce KV storage memory. (docs.vllm.ai)
If your product needs reliable structured outputs, verify that your inference stack supports guided decoding (vLLM does). (docs.vllm.ai)
6) Evaluation beyond “one benchmark”: build a layered scorecard
You want evaluations that match your target behaviors.
Layer A: deterministic validators
- JSON schema validity
- tool-call argument validation
- citation/grounding checks (if using retrieval)
These are cheap, fast, and regression-friendly.
Layer B: standardized task suites
- lm-evaluation-harness gives broad coverage across classic NLP/LLM tasks and is actively maintained/releases are tracked. (Zenodo)
Layer C: judge-based, debiased comparisons
- MT-Bench / Chatbot Arena judge methodology is widely referenced, with analysis of judge vs human agreement. (arXiv)
- AlpacaEval 2 emphasizes length-controlled win rates to reduce verbosity bias. (GitHub)
- Arena-Hard-Auto is designed to correlate with Chatbot Arena-style signals. (GitHub)
Layer D: tool and agent evaluations (if you support tools)
- ToolBench / ToolLLM provides data and an eval framing for tool use. (arXiv)
- AgentBench evaluates LLMs as agents in interactive environments. (arXiv)
- If you care about “real coding usefulness,” SWE-bench is a strong reality check because it is grounded in real repo issues and tests. (GitHub)
Security regression pack
If you do retrieval or tools, prompt injection is not optional. OWASP’s LLM Top 10 is a decent baseline threat checklist. (OWASP)
7) Wrappers/layers vs “new base model”: the highest-leverage decision
If the goal is a usable assistant system, prioritize:
- Orchestration + context policy (windowing, summarization slots, token budgets)
- Tool execution with verification (no fake tool traces)
- Structured outputs with bounded repair loops
- Tracing/observability (prompt hashes, tool I/O, validator results, latency, token counts)
- Evaluation harness that gates releases
Then do targeted post-training to improve:
- tool selection policies
- refusal calibration
- clarity and reasoning style consistency
- grounding discipline
This approach also makes base models swappable without rewriting your product surface.
8) Forming a small open team: what attracts serious contributors
Make the project legible and testable:
-
A public “north star” spec (3–5 workflows, success criteria, non-negotiables)
-
A reproducible eval harness that anyone can run
-
A clear data policy (sources, licenses, dedup strategy, what you will not use)
-
A contribution ladder:
- add eval cases
- add tools and validators
- improve training recipes
- ship reference adapters
A surprisingly effective recruiting move is: “Here is the eval suite. Here are the current failures. Pick one.”
Curated references worth keeping open
Summary
- Build the assistant system layer first. It produces reliability and the best training signals.
- Start from two base models (small + mid) and standardize serving + eval so swapping is easy.
- Do LoRA/QLoRA SFT, then DPO-style preference tuning from real failures. (OpenReview)
- Treat context length and KV cache as first-class deployment constraints. (docs.vllm.ai)
- Use a layered eval scorecard and include security regression tests. (OWASP)