Qwen3.5-35B-A3B Layer-Split GGUF
Qwen3.5-35B-A3B (Qwen3.5-MoE) decomposed into per-layer GGUF files -- a novel layer-split format (`lemonslice-gguf-v1`) that achieves a 17.4x peak-memory reduction via sequential loading and enables per-layer inspection.
Overview
This repository presents the Qwen3.5-35B-A3B Mixture-of-Experts model restructured into individual GGUF files -- one per transformer block. Starting from the Q4_K_M quantized base model (by Unsloth), the weights were manually decomposed, layer tensors were reorganized, shared components were physically separated, and a complete manifest was built -- resulting in 40 layer files + 1 shared file. This custom lemonslice-gguf-v1 format enables:
- Sequential layer loading -- peak memory reduced by 17.4x (from 21.2 GB to ~1.13 GB)
- Layer-level inspection -- examine, modify, or swap individual blocks
- Selective inference -- run only a subset of layers for ablation studies
- Custom pipeline parallelism -- distribute layers across multiple devices
- Per-layer fine-tuning -- target specific layer types (SSM vs Attention)
What Makes This Different From a Standard GGUF
Standard GGUF (monolithic)
A normal GGUF file contains all 733 tensors in a single file. Loading it requires the entire 21.2 GB to reside in RAM/VRAM simultaneously. Every tensor -- embeddings, all 40 blocks, and the LM head -- is mapped into memory at once.
This Repository (layer-split / lemonslice-gguf-v1)
This format represents a fundamentally different file organization:
| Aspect | Standard GGUF | Layer-Split GGUF |
|---|---|---|
| File count | 1 file | 42 files (1 shared + 40 layers + manifest) |
| Peak memory | 21.2 GB (entire model) | 1.13 GB (one layer at a time) |
| Memory reduction | 1x | 17.4x |
| KV metadata | In single file | Replicated in each layer file |
| Self-contained | Yes | Each layer file carries full model metadata |
| Tensor isolation | No (all tensors together) | Yes (per-block boundaries) |
Key Architectural Differences
- **Every layer file is a standalone GGUF** -- each `layer_XXXX.gguf` contains its own complete GGUF header with all 54 KV metadata entries (architecture, hyperparameters, tokenizer info, sampling params), making it independently inspectable by any GGUF parser. This is unlike a standard GGUF, where metadata lives once in the single file.
- **Shared tensors are physically separated** -- `token_embd.weight`, `output_norm.weight`, and `output.weight` are extracted into `shared.gguf`, decoupling the embedding/projection layers from the transformer body.
- **Layer files preserve tensor count boundaries** -- the manifest tracks exactly how many tensors each layer contains (16 for attention-only, 19 for SSM layers), making it possible to verify file integrity independently.
- **Hybrid SSM + Attention architecture exposed** -- the layer split makes visible what is hidden in a monolithic file: the regular alternation between SSM (Mamba) and attention layers every 4 blocks.
Model Hyperparameters
From the GGUF KV metadata (shared.gguf):
| Parameter | Value |
|---|---|
| Architecture | qwen35moe |
| Model name | Qwen3.5-35B-A3B |
| Block count | 40 |
| Context length | 262,144 (256K) |
| Embedding dimension | 2,048 |
| Attention heads | 16 |
| KV heads (GQA) | 2 |
| Key/Value length | 256 |
| RoPE dimensions | 64 (sections: [11, 11, 10, 0]) |
| RoPE freq base | 10,000,000 |
| RMS norm epsilon | 1e-6 |
| Total experts | 256 |
| Active experts | 8 |
| Expert FFN dim | 512 |
| Expert shared FFN dim | 512 |
| SSM state size | 128 |
| SSM conv kernel | 4 |
| SSM group count | 16 |
| SSM time step rank | 32 |
| SSM inner size | 4,096 |
| Full attention interval | Every 4 layers |
| Tokenizer | GPT2 (qwen35 pre-tokenizer) |
| Vocabulary | 248,320 tokens |
| Quantization | Q4_K_M (by Unsloth) |
| Sampling | top_k=20, top_p=0.95, temp=1.0 |
| License | Apache 2.0 |
Architecture
Qwen3.5-35B-A3B is a hybrid SSM + Attention Mixture-of-Experts model with 40 layers and 733 total tensors (Q4_K_M quantized).
File Inventory
| Component | File | Tensors | Size |
|---|---|---|---|
| Shared (embeddings, norm, output) | `shared.gguf` | 3 | 714 MB |
| SSM layers (30 blocks) | `layer_0000-0002, 0004-0006, ...` | 19 each | 472-548 MB |
| Attention layers (10 blocks) | `layer_0003, 0007, 0011, ...` | 16 each | 472-552 MB |
Hybrid Layer Pattern
The model alternates between two layer types in a regular every-4-layers pattern (full_attention_interval = 4):
| Layer Type | Count | Pattern | Tensors | Key Differentiator |
|---|---|---|---|---|
| SSM (Mamba) | 30 | 0-2, 4-6, 8-10, 12-14, 16-18, 20-22, 24-26, 28-30, 32-34, 36-38 | 19 | Has ssm_* tensors: ssm_a, ssm_alpha, ssm_beta, ssm_conv1d, ssm_dt, ssm_norm, ssm_out |
| Attention-only | 10 | 3, 7, 11, 15, 19, 23, 27, 31, 35, 39 | 16 | Has separate attn_k, attn_q, attn_v, attn_output + K/Q norm tensors |
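The every-4-layers alternation in the table above reduces to a one-line predicate; this sketch simply encodes `full_attention_interval = 4` and is not part of any shipped tooling:

```python
def layer_type(i: int) -> str:
    """full_attention_interval = 4: every 4th block (3, 7, ..., 39) is attention."""
    return "ATTN" if i % 4 == 3 else "SSM"

print([i for i in range(40) if layer_type(i) == "ATTN"])
# [3, 7, 11, 15, 19, 23, 27, 31, 35, 39]
```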
SSM Layer Tensors (19 per block)
blk.N.attn_gate.weight # SSM gate control
blk.N.attn_norm.weight # Pre-norm RMS
blk.N.attn_qkv.weight # Combined QKV projection (SSM-style)
blk.N.ffn_down_exps.weight # Expert down projection
blk.N.ffn_down_shexp.weight # Shared expert down projection
blk.N.ffn_gate_exps.weight # Expert gate up projection
blk.N.ffn_gate_inp.weight # Expert routing weights
blk.N.ffn_gate_inp_shexp.weight # Shared expert routing
blk.N.ffn_gate_shexp.weight # Shared expert gate
blk.N.ffn_up_exps.weight # Expert up projection
blk.N.ffn_up_shexp.weight # Shared expert up projection
blk.N.post_attention_norm.weight # Post-norm RMS
blk.N.ssm_a # SSM discretization parameter
blk.N.ssm_alpha.weight # SSM alpha scaling
blk.N.ssm_beta.weight # SSM beta scaling
blk.N.ssm_conv1d.weight # SSM causal conv1d kernel
blk.N.ssm_dt.bias # SSM delta timestep bias
blk.N.ssm_norm.weight # SSM normalization
blk.N.ssm_out.weight # SSM output projection
Attention-Only Layer Tensors (16 per block)
blk.N.attn_k.weight # Separate K projection
blk.N.attn_k_norm.weight # K normalization
blk.N.attn_norm.weight # Pre-norm RMS
blk.N.attn_output.weight # Attention output projection
blk.N.attn_q.weight # Separate Q projection
blk.N.attn_q_norm.weight # Q normalization
blk.N.attn_v.weight # Separate V projection
blk.N.ffn_down_exps.weight # Expert down projection
blk.N.ffn_down_shexp.weight # Shared expert down projection
blk.N.ffn_gate_exps.weight # Expert gate up projection
blk.N.ffn_gate_inp.weight # Expert routing weights
blk.N.ffn_gate_inp_shexp.weight # Shared expert routing
blk.N.ffn_gate_shexp.weight # Shared expert gate
blk.N.ffn_up_exps.weight # Expert up projection
blk.N.ffn_up_shexp.weight # Shared expert up projection
blk.N.post_attention_norm.weight # Post-norm RMS
MoE Expert Structure
Both layer types share the same 256 experts, 8 active MoE pattern:
| Tensor | Purpose |
|---|---|
| `ffn_gate_inp.weight` | Top-8 expert routing gate |
| `ffn_gate_inp_shexp.weight` | Shared expert routing |
| `ffn_gate_exps.weight` | Per-expert gate projection |
| `ffn_gate_shexp.weight` | Shared expert gate projection |
| `ffn_up_exps.weight` | Per-expert up projection (FFN dim 512) |
| `ffn_up_shexp.weight` | Shared expert up projection |
| `ffn_down_exps.weight` | Per-expert down projection |
| `ffn_down_shexp.weight` | Shared expert down projection |
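The 256-experts / top-8 routing pattern can be sketched in NumPy. This is an illustrative toy (`route_top_k` and the random tensors are not the actual Qwen router, which has its own gating details); it only shows the shape of top-k selection through `ffn_gate_inp.weight`:

```python
import numpy as np

def route_top_k(hidden, gate_w, k=8):
    """Illustrative top-k MoE routing: pick k of 256 experts per token.

    hidden: (d_model,) token hidden state
    gate_w: (n_experts, d_model) routing matrix (ffn_gate_inp.weight)
    """
    logits = gate_w @ hidden                      # (256,) router logits
    top = np.argsort(logits)[-k:][::-1]           # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                      # softmax over the selected k
    return top, weights

# Toy dimensions matching the hyperparameter table: 256 experts, d_model = 2048
rng = np.random.default_rng(0)
experts, w = route_top_k(rng.standard_normal(2048),
                         rng.standard_normal((256, 2048)))
print(len(experts), float(w.sum()))  # 8 active experts, mixture weights sum to 1
```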
Shared Tensors
shared.gguf contains 3 tensors that are not layer-specific:
| Tensor | Purpose |
|---|---|
| `token_embd.weight` | Input token embeddings (248,320 x 2,048) |
| `output_norm.weight` | Final RMS layer normalization |
| `output.weight` | Output projection / LM head |
Layer Size Map
| Layer | Type | Size | Tensors | Notes |
|---|---|---|---|---|
| 00 | SSM | 523 MB | 19 | |
| 01 | SSM | 523 MB | 19 | |
| 02 | SSM | 523 MB | 19 | |
| 03 | ATTN | 516 MB | 16 | First attention layer |
| 04 | SSM | 523 MB | 19 | |
| 05 | SSM | 457 MB | 19 | Smaller SSM block |
| 06 | SSM | 457 MB | 19 | |
| 07 | ATTN | 516 MB | 16 | |
| 08 | SSM | 457 MB | 19 | |
| 09 | SSM | 457 MB | 19 | |
| 10 | SSM | 523 MB | 19 | |
| 11 | ATTN | 450 MB | 16 | Smallest attention |
| 12 | SSM | 457 MB | 19 | |
| 13 | SSM | 523 MB | 19 | |
| 14 | SSM | 457 MB | 19 | |
| 15 | ATTN | 450 MB | 16 | |
| 16 | SSM | 523 MB | 19 | |
| 17 | SSM | 457 MB | 19 | |
| 18 | SSM | 457 MB | 19 | |
| 19 | ATTN | 516 MB | 16 | Mid-model attention |
| 20 | SSM | 457 MB | 19 | |
| 21 | SSM | 457 MB | 19 | |
| 22 | SSM | 523 MB | 19 | |
| 23 | ATTN | 450 MB | 16 | |
| 24 | SSM | 457 MB | 19 | |
| 25 | SSM | 523 MB | 19 | |
| 26 | SSM | 457 MB | 19 | |
| 27 | ATTN | 450 MB | 16 | |
| 28 | SSM | 523 MB | 19 | |
| 29 | SSM | 457 MB | 19 | |
| 30 | SSM | 457 MB | 19 | |
| 31 | ATTN | 516 MB | 16 | |
| 32 | SSM | 457 MB | 19 | |
| 33 | SSM | 457 MB | 19 | |
| 34 | SSM | 523 MB | 19 | |
| 35 | ATTN | 516 MB | 16 | |
| 36 | SSM | 523 MB | 19 | Deep model SSM |
| 37 | SSM | 523 MB | 19 | |
| 38 | SSM | 523 MB | 19 | |
| 39 | ATTN | 526 MB | 16 | Final attention layer |
Total: ~19.76 GB across 41 GGUF files
Memory Efficiency
| Metric | Standard Load | Sequential Load | Improvement |
|---|---|---|---|
| Peak memory | 21.2 GB | 1.13 GB | 17.4x reduction |
| Files needed | 1 | 2 (shared + 1 layer) | -- |
The sequential loading pattern loads shared.gguf + one layer at a time, processes the layer, releases it, then moves to the next. This enables running a 35B parameter model on hardware with as little as ~2 GB of available memory.
Format Specification
- Format: `lemonslice-gguf-v1` (layer-split GGUF)
- GGUF version: 3
- Quantization: Q4_K_M (4-bit mixed, by Unsloth)
- Alignment: 32 bytes
- Byte order: little-endian
- Source: `Qwen3.5-35B-A3B-Q4_K_M.gguf`
- Architecture ID: `qwen35moe`
- Total tensors: 733 (3 shared + 730 across layers)
File Manifest
Qwen35_35B_Layers/
├── README.md          # This file
├── manifest.json      # Complete layer/tensor manifest
├── shared.gguf        # Shared tensors (714 MB, 3 tensors)
├── layer_0000.gguf    # Block 0 -- SSM (523 MB, 19 tensors)
├── layer_0001.gguf    # Block 1 -- SSM (523 MB, 19 tensors)
├── layer_0002.gguf    # Block 2 -- SSM (523 MB, 19 tensors)
├── layer_0003.gguf    # Block 3 -- Attention (516 MB, 16 tensors)
├── layer_0004.gguf    # Block 4 -- SSM (523 MB, 19 tensors)
├── layer_0005.gguf    # Block 5 -- SSM (457 MB, 19 tensors)
├── layer_0006.gguf    # Block 6 -- SSM (457 MB, 19 tensors)
├── layer_0007.gguf    # Block 7 -- Attention (516 MB, 16 tensors)
├── ...
└── layer_0039.gguf    # Block 39 -- Attention (526 MB, 16 tensors)
Usage
Inspecting Individual Layers
# View tensor names and KV metadata in a layer file
gguf-dump layer_0000.gguf
# View shared tensors and full model hyperparameters
gguf-dump shared.gguf
# View the complete manifest
cat manifest.json | python -m json.tool
Sequential Loading (Python)
import gguf
import numpy as np

# Load shared tensors (embeddings, norms, output).
# gguf-py's GGUFReader exposes tensors as a list, so index them by name.
shared = gguf.GGUFReader("shared.gguf", "r")
shared_by_name = {str(t.name): t for t in shared.tensors}
embeddings = np.asarray(shared_by_name["token_embd.weight"].data)
output_weight = np.asarray(shared_by_name["output.weight"].data)

# Sequentially process each layer
for i in range(40):
    layer_file = f"layer_{i:04d}.gguf"
    reader = gguf.GGUFReader(layer_file, "r")
    for tensor in reader.tensors:
        name = str(tensor.name)
        data = np.asarray(tensor.data)
        # ... forward pass logic
    del reader  # Release layer from memory
Selective Layer Ablation
# Run inference with only SSM layers (skip the attention-only layers)
# load_layer / forward are placeholders for your own pipeline
ssm_layers = [i for i in range(40) if i % 4 != 3]
for i in ssm_layers:
    layer = load_layer(f"layer_{i:04d}.gguf")
    hidden = forward(hidden, layer)
The LemonSlice Logic
The lemonslice-gguf-v1 format gets its name from slicing a monolithic GGUF into discrete, self-contained "slices" -- much like cutting a lemon into wedges where each wedge retains the full flavor profile. Each file is exactly one transformer block.
Each GGUF File = One Transformer Block
In a standard GGUF, the entire model -- embedding table, all 40 transformer blocks, layer norms, and the LM head -- is packed into a single 21.2 GB binary. You load the whole file or nothing.
LemonSlice changes this at the file level:
Standard GGUF: model.gguf = [embd][blk.0][blk.1]...[blk.39][output_norm][output]
LemonSlice: shared.gguf = [embd]
layer_0000.gguf = [blk.0] (19 tensors, 523 MB)
layer_0001.gguf = [blk.1] (19 tensors, 523 MB)
...
layer_0039.gguf = [blk.39] (16 tensors, 527 MB)
One file per block. No more, no less. The mapping is deterministic and encoded in the filename: layer_NNNN.gguf contains only tensors with the prefix blk.N..
This means:
- **File = Block** -- `layer_0012.gguf` is literally just transformer block 12, nothing else
- **Blocks are independently loadable** -- you can open, inspect, modify, or replace any single block without touching the other 39
- **Blocks are independently valid** -- every `layer_XXXX.gguf` is a complete GGUF v3 file with full architecture metadata (54 KV entries), so any GGUF tool can read it
- **The model is a playlist, not a monolith** -- inference is a sequential walk through the layer files, one at a time
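The deterministic filename-to-prefix mapping fits in a few lines; this is a sketch of the naming convention, not shipped tooling:

```python
def layer_filename(block: int) -> str:
    """Map a block index to its layer file, per the layer_NNNN.gguf convention."""
    return f"layer_{block:04d}.gguf"

def tensor_prefix(block: int) -> str:
    """All tensors inside that file share this blk.N. prefix."""
    return f"blk.{block}."

print(layer_filename(12))  # layer_0012.gguf
print(tensor_prefix(12))   # blk.12.
assert "blk.12.attn_norm.weight".startswith(tensor_prefix(12))
```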
Why One-File-Per-Block Matters for Inference
1. Memory: Run 35B Models on 2 GB
The most immediate impact is peak memory. Standard GGUF loading mmap()s the entire 21.2 GB file into virtual memory. With LemonSlice:
Forward pass for token t:
1. mmap(shared.gguf) -> 714 MB [token_embd.weight resident]
2. mmap(layer_0000.gguf) -> 523 MB [block 0 weights]
-> compute block 0 -> unmap layer_0000.gguf [memory freed]
3. mmap(layer_0001.gguf) -> 523 MB [block 1 weights]
-> compute block 1 -> unmap layer_0001.gguf [memory freed]
...
40. mmap(layer_0039.gguf) -> 527 MB [block 39 weights]
-> compute block 39 -> unmap
41. mmap(shared.gguf) -> already resident [output_norm + LM head]
-> produce logits
Peak memory: ~1.2 GB (shared + largest single block)
The 17.4x reduction (21.2 GB -> ~1.2 GB) means a model that previously needed over 21 GB of RAM/VRAM at once can now run in system RAM on machines with as little as 4 GB total.
2. Selective Loading: Skip What You Don't Need
With blocks as files, you can choose which ones to load:
- Early exit -- if block 15 already produces high-confidence logits, skip blocks 16-39. This trades quality for speed dynamically.
- Layer ablation -- remove attention-only layers (3, 7, 11, ...) to test their contribution. Remove SSM layers to understand state-space dynamics.
- Progressive depth -- start with blocks 0-9 for fast responses, progressively load more blocks as the conversation deepens.
- Hot-path caching -- if certain layers are accessed repeatedly (common in batch inference), keep them memory-resident while streaming others.
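The early-exit idea above can be reduced to a confidence test on intermediate logits. A minimal sketch, assuming a softmax-probability threshold (the 0.9 value is an illustrative tunable, not something the format prescribes):

```python
import numpy as np

def confident_enough(logits: np.ndarray, threshold: float = 0.9) -> bool:
    """Early-exit test: stop loading further layer files once the top
    token's softmax probability clears the threshold."""
    z = logits - logits.max()            # stabilized softmax
    probs = np.exp(z) / np.exp(z).sum()
    return bool(probs.max() >= threshold)

peaked = np.zeros(16); peaked[3] = 10.0  # one dominant logit -> exit early
flat = np.zeros(16)                      # uniform logits -> keep loading blocks
print(confident_enough(peaked), confident_enough(flat))  # True False
```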
3. Per-Layer Quantization
In a monolithic GGUF, the entire model shares one quantization scheme (Q4_K_M in this case). With LemonSlice, each layer file can be independently re-quantized:
shared.gguf -> Q8_0 (embeddings need precision)
layer_0000-0009 -> Q4_K_M (early layers: moderate precision)
layer_0010-0029 -> Q3_K_M (middle layers: aggressive quant)
layer_0030-0039 -> Q5_K_M (late layers: higher precision for output quality)
This lets you optimize the precision/speed/size trade-off per layer rather than globally.
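A per-layer quantization plan like the one sketched above is just data. The tier boundaries here mirror the example and are illustrative, not a recommendation:

```python
def quant_for_layer(i: int) -> str:
    """Encode the illustrative per-layer quantization plan as a lookup."""
    if i < 10:
        return "Q4_K_M"   # early layers: moderate precision
    if i < 30:
        return "Q3_K_M"   # middle layers: aggressive quant
    return "Q5_K_M"       # late layers: higher precision for output quality

print(quant_for_layer(0), quant_for_layer(15), quant_for_layer(35))
# Q4_K_M Q3_K_M Q5_K_M
```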
4. Multi-GPU and Hybrid Execution
Blocks as files enable natural distribution:
GPU 0: shared.gguf + layer_0000 to layer_0019 (first 20 blocks)
GPU 1: layer_0020 to layer_0039 (last 20 blocks)
Or hybrid CPU/GPU:
GPU: shared.gguf + layer_0000 to layer_0009 (first 10 blocks in VRAM)
CPU: layer_0010 to layer_0039 (remaining 30 blocks in RAM)
The file boundary makes it trivial to decide "which blocks live where."
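Because placement is per file, a distribution plan is a plain mapping. This sketch mirrors the 20/20 split above; the device names and split point are illustrative assumptions:

```python
def assign_devices(n_layers: int = 40, split: int = 20) -> dict:
    """Static layer-file -> device placement table (any split works)."""
    plan = {"shared.gguf": "gpu0"}  # embeddings + LM head stay with the first half
    for i in range(n_layers):
        plan[f"layer_{i:04d}.gguf"] = "gpu0" if i < split else "gpu1"
    return plan

plan = assign_devices()
print(plan["layer_0019.gguf"], plan["layer_0020.gguf"])  # gpu0 gpu1
```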
5. Bandwidth and Streaming
When model weights are loaded from disk (not pre-loaded into RAM), LemonSlice streams blocks on-demand:
- **Only read the bytes you need** -- each `layer_XXXX.gguf` is ~500 MB, not 21.2 GB
- **Predictable I/O pattern** -- sequential file reads, no random access into a 21 GB file
- Interruptibility -- if inference is cancelled mid-forward-pass, you've only loaded the blocks you reached, not wasted I/O loading everything
The Inference Conveyor Belt: How Hidden States Flow Between Layers
A common question: what passes from one layer file to the next? It's not the KV cache being "transferred" between files -- it's the hidden state, which is a tiny vector compared to the weights.
Here's exactly what happens during a forward pass:
Input token IDs
|
v
[shared.gguf: token_embd.weight]
| produces hidden state h_0 (2048-dim vector per token)
v
┌──────────────────────────────────────────────────────────────┐
│ mmap(layer_0000.gguf) -> load block 0 weights (523 MB)       │
│ h_0 --[block 0 forward]--> h_1                               │
│ (KV cache for block 0 computed and stored)                   │
│ munmap(layer_0000.gguf) -> release 523 MB                    │
└──────────────────────────────────────────────────────────────┘
  | hidden state h_1 passes in RAM (just a few KB)
  v
┌──────────────────────────────────────────────────────────────┐
│ mmap(layer_0001.gguf) -> load block 1 weights (523 MB)       │
│ h_1 --[block 1 forward]--> h_2                               │
│ (KV cache for block 1 accumulated)                           │
│ munmap(layer_0001.gguf) -> release 523 MB                    │
└──────────────────────────────────────────────────────────────┘
| hidden state h_2 passes in RAM
v
...repeat for all 40 blocks...
|
v
[shared.gguf: output_norm.weight + output.weight]
| produces logits (248,320-dim vector)
v
Softmax -> token probabilities
Three things exist in memory during this process:
| What | Size | Lifetime |
|---|---|---|
| Layer weights | ~500 MB | Only while that layer file is mmap'd |
| Hidden state | ~8 KB per token (2048 * 4 bytes) | Lives in RAM throughout, passed between layers |
| KV cache | Grows per layer (~256 KB per layer per token) | Accumulates as each layer is processed, freed after logits |
The hidden state is the "product" on the conveyor belt -- each layer file is a "station" that transforms it. The hidden state is tiny (a few KB) compared to the weights (hundreds of MB), so passing it between layer files is essentially free.
The KV cache is a "side buffer" that accumulates as each layer computes its K/V projections. In a standard GGUF, the KV cache is allocated separately from the weights and works identically. The difference with LemonSlice is that the KV cache builds up incrementally as layers are loaded one by one, rather than all at once.
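The "tiny hidden state vs. huge weights" claim is simple arithmetic, using the embedding dimension from the hyperparameter table (fp32 activations assumed):

```python
# Size of the "product on the conveyor belt": one token's hidden state
d_model = 2048          # embedding dimension from the hyperparameter table
bytes_per_float = 4     # assuming fp32 activations
hidden_bytes = d_model * bytes_per_float
print(hidden_bytes)     # 8192 bytes (~8 KB), vs ~500 MB for one layer's weights
```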
How It Works: The Split Process
- **Parse the monolithic header** -- read `GGUF.version`, `GGUF.tensor_count`, `GGUF.kv_count`, and all 54 metadata key-value pairs (architecture, hyperparameters, tokenizer, sampling config)
- **Identify tensor boundaries** -- each tensor has a name, shape, dtype, and byte offset. The `blk.N.` prefix in tensor names (e.g., `blk.0.attn_qkv.weight`) naturally partitions tensors into layer groups
- **Extract shared tensors** -- tensors without a `blk.N.` prefix (`token_embd.weight`, `output_norm.weight`, `output.weight`) are separated into `shared.gguf` because they span the entire model, not a single layer
- **Write each layer as its own GGUF** -- for each layer N, write a complete GGUF v3 file with:
  - Full header (magic + version)
  - All 54 KV metadata entries replicated from the original
  - Only the tensors belonging to that layer (`blk.N.*`)
  - 32-byte alignment padding for memory-mapped loading
Why Full Metadata in Every File
Unlike splitting approaches that store metadata externally, every layer_XXXX.gguf is a valid, standalone GGUF file:
- `gguf-dump layer_0005.gguf` shows the full model architecture (context length, head count, expert count, SSM parameters, tokenizer)
- Any layer file can be loaded by standard GGUF parsers without external config
- Layers can be shared independently -- send someone a single layer and they can inspect it
- The KV metadata duplication is ~2 KB per layer file -- a deliberate trade-off for complete self-containment
Tensor Count Variation
SSM layers have 19 tensors while attention-only layers have 16 because:
- SSM layers include 7 `ssm_*` tensors (state-space model: `ssm_a`, `ssm_alpha`, `ssm_beta`, `ssm_conv1d`, `ssm_dt`, `ssm_norm`, `ssm_out`) but use a combined `attn_qkv.weight`
- Attention layers have separate `attn_k`, `attn_q`, `attn_v`, `attn_output` plus K/Q normalization, but no SSM tensors
- Both layer types share the same 8 MoE FFN tensors (gate, up, down for per-expert and shared)
Future Vision: Where LemonSlice Goes
The layer-split format is a foundation, not a destination. Here's where this can go:
Layer Surgery
Mix and match transformer blocks from different models. Take the first 20 blocks from Qwen3.5-35B-A3B and the last 20 from Llama-3-70B. Swap in attention-only layers from a model with better reasoning capabilities. The file-per-block format makes this as simple as copying .gguf files into a directory and updating manifest.json.
Incremental Model Updates
When a model is fine-tuned and only 5 of 40 layers change, you only need to download those 5 layer files (~2.5 GB) instead of the entire 21.2 GB model. This is particularly powerful for:
- LoRA/DoRA merges that affect specific layers
- Safety fine-tunes that modify late-layer behavior
- Domain-specific adapters (medical, legal, code) as layer replacements
Progressive Inference
Start generating tokens with just the first 10 layers. If the output is confident, you saved 75% of the compute. If not, load more layers and refine. This creates a natural speed/quality trade-off that adapts per-token.
Edge and Constrained Deployment
Run a 35B model on a Raspberry Pi by loading one layer at a time from an SD card. It will be slow, but it will work -- and that's impossible with a monolithic GGUF. LemonSlice makes the full model spectrum accessible: from "runs in 2 GB RAM but takes 5 minutes per token" to "runs in 48 GB VRAM at 100 tokens/sec" with the same files.
Layer-Level Benchmarking and Analysis
Benchmark individual blocks for compute time, memory bandwidth, and activation patterns. Identify which layers are bottlenecks. Compare SSM vs attention-only layer performance. The file-per-block format makes it trivial to isolate and profile any layer independently.
P2P Model Distribution
Share individual layers peer-to-peer. Someone can share just the attention-only layers (10 files, ~5 GB) while you already have the SSM layers from another source. The manifest verifies integrity regardless of download source.
Layer-Specific Fine-Tuning
Fine-tune only layers 30-39 (the "output-facing" layers) while keeping layers 0-29 frozen. With LemonSlice, this is just replacing files 30-39 with your fine-tuned versions. No need to save the entire 21.2 GB model when only 5 GB changed.
Hybrid Architecture Research
The Qwen3.5-35B-A3B model already uses a hybrid SSM + Attention pattern (visible in the layer files). LemonSlice makes it easy to experiment with:
- Different SSM/Attention ratios
- Reordering layers (attention layers at the start vs end)
- Inserting custom layers between existing blocks
- Removing the MoE experts from specific layers to test their contribution
Citation
Original model: Qwen/Qwen3.5-35B-A3B
@misc{qwen3.5,
title={Qwen3.5 Technical Report},
author={Qwen Team},
year={2025},
publisher={Hugging Face}
}
License
Apache 2.0 -- inherited from the original Qwen3.5-35B-A3B license.
Quantization by Unsloth.