Qwen3.5-35B-A3B Layer-Split GGUF

Qwen3.5-35B-A3B (Qwen3.5-MoE) decomposed into per-layer GGUF files -- a custom layer-split format (lemonslice-gguf-v1) that enables per-layer inspection and sequential loading with a 17.4x reduction in peak memory.

Overview

This repository presents the Qwen3.5-35B-A3B Mixture-of-Experts model restructured into individual GGUF files -- one per transformer block. Starting from the Q4_K_M quantized base model (by Unsloth), the weights were manually decomposed, layer tensors were reorganized, shared components were physically separated, and a complete manifest was built -- resulting in 40 layer files + 1 shared file. This custom lemonslice-gguf-v1 format enables:

  • Sequential layer loading -- peak memory reduced by 17.4x (from 21.2 GB to ~1.13 GB)
  • Layer-level inspection -- examine, modify, or swap individual blocks
  • Selective inference -- run only a subset of layers for ablation studies
  • Custom pipeline parallelism -- distribute layers across multiple devices
  • Per-layer fine-tuning -- target specific layer types (SSM vs Attention)

What Makes This Different From a Standard GGUF

Standard GGUF (monolithic)

A normal GGUF file contains all 733 tensors in a single file. Loading it requires the entire 21.2 GB to reside in RAM/VRAM simultaneously. Every tensor -- embeddings, all 40 blocks, and the LM head -- is mapped into memory at once.

This Repository (layer-split / lemonslice-gguf-v1)

This format represents a fundamentally different file organization:

Aspect            Standard GGUF               Layer-Split GGUF
File count        1 file                      42 files (1 shared + 40 layers + manifest)
Peak memory       21.2 GB (entire model)      1.13 GB (one layer at a time)
Memory reduction  1x                          17.4x
KV metadata       In single file              Replicated in each layer file
Self-contained    Yes                         Yes (each layer file carries full model metadata)
Tensor isolation  No (all tensors together)   Yes (per-block boundaries)

Key Architectural Differences

  1. Every layer file is a standalone GGUF -- each layer_XXXX.gguf contains its own complete GGUF header with all 54 KV metadata entries (architecture, hyperparameters, tokenizer info, sampling params). This is unlike a standard GGUF where metadata lives once in the single file.

  2. Shared tensors are physically separated -- token_embd.weight, output_norm.weight, and output.weight are extracted into shared.gguf, decoupling the embedding/projection layer from the transformer body.

  3. Layer files preserve tensor count boundaries -- the manifest tracks exactly how many tensors each layer contains (16 for attention-only, 19 for SSM layers), making it possible to verify file integrity independently.

  4. Hybrid SSM + Attention architecture exposed -- the layer split makes visible what's hidden in a monolithic file: the regular alternation between SSM (Mamba) and Attention layers every 4 blocks.

  5. Self-contained layer files -- each layer_XXXX.gguf carries full model metadata (54 KV entries), making it independently inspectable by any GGUF parser.

Model Hyperparameters

From the GGUF KV metadata (shared.gguf):

Parameter                Value
Architecture             qwen35moe
Model name               Qwen3.5-35B-A3B
Block count              40
Context length           262,144 (256K)
Embedding dimension      2,048
Attention heads          16
KV heads (GQA)           2
Key/Value length         256
RoPE dimensions          64 (sections: [11, 11, 10, 0])
RoPE freq base           10,000,000
RMS norm epsilon         1e-6
Total experts            256
Active experts           8
Expert FFN dim           512
Expert shared FFN dim    512
SSM state size           128
SSM conv kernel          4
SSM group count          16
SSM time step rank       32
SSM inner size           4,096
Full attention interval  Every 4 layers
Tokenizer                GPT2 (qwen35 pre-tokenizer)
Vocabulary               248,320 tokens
Quantization             Q4_K_M (by Unsloth)
Sampling                 top_k=20, top_p=0.95, temp=1.0
License                  Apache 2.0

Architecture

Qwen3.5-35B-A3B is a hybrid SSM + Attention Mixture-of-Experts model with 40 layers and 733 total tensors (Q4_K_M quantized).

File Inventory

Component                          File                             Tensors  Size
Shared (embeddings, norm, output)  shared.gguf                      3        714 MB
SSM layers (30 blocks)             layer_0000-0002, 0004-0006, ...  19 each  472-548 MB
Attention layers (10 blocks)       layer_0003, 0007, 0011, ...      16 each  472-552 MB

Hybrid Layer Pattern

The model alternates between two layer types in a regular every-4-layers pattern (full_attention_interval = 4):

Layer Type      Count  Pattern                                                          Tensors  Key Differentiator
SSM (Mamba)     30     0-2, 4-6, 8-10, 12-14, 16-18, 20-22, 24-26, 28-30, 32-34, 36-38  19       ssm_* tensors: ssm_a, ssm_alpha, ssm_beta, ssm_conv1d, ssm_dt, ssm_norm, ssm_out
Attention-only  10     3, 7, 11, 15, 19, 23, 27, 31, 35, 39                             16       separate attn_k, attn_q, attn_v, attn_output + K/Q norm tensors
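
The alternation follows directly from full_attention_interval = 4, so layer types can be derived in code. A minimal sketch (the function name layer_type is purely illustrative):

def layer_type(i: int, full_attention_interval: int = 4) -> str:
    # Blocks 3, 7, 11, ..., 39 are full-attention; everything else is SSM (Mamba).
    return "attention" if (i + 1) % full_attention_interval == 0 else "ssm"

assert [i for i in range(40) if layer_type(i) == "attention"] == list(range(3, 40, 4))
assert sum(layer_type(i) == "ssm" for i in range(40)) == 30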

SSM Layer Tensors (19 per block)

blk.N.attn_gate.weight          # SSM gate control
blk.N.attn_norm.weight          # Pre-norm RMS
blk.N.attn_qkv.weight           # Combined QKV projection (SSM-style)
blk.N.ffn_down_exps.weight      # Expert down projection
blk.N.ffn_down_shexp.weight     # Shared expert down projection
blk.N.ffn_gate_exps.weight      # Expert gate projection
blk.N.ffn_gate_inp.weight       # Expert routing weights
blk.N.ffn_gate_inp_shexp.weight # Shared expert routing
blk.N.ffn_gate_shexp.weight     # Shared expert gate
blk.N.ffn_up_exps.weight        # Expert up projection
blk.N.ffn_up_shexp.weight       # Shared expert up projection
blk.N.post_attention_norm.weight # Post-norm RMS
blk.N.ssm_a                     # SSM discretization parameter
blk.N.ssm_alpha.weight          # SSM alpha scaling
blk.N.ssm_beta.weight           # SSM beta scaling
blk.N.ssm_conv1d.weight         # SSM causal conv1d kernel
blk.N.ssm_dt.bias               # SSM delta timestep bias
blk.N.ssm_norm.weight           # SSM normalization
blk.N.ssm_out.weight            # SSM output projection

Attention-Only Layer Tensors (16 per block)

blk.N.attn_k.weight             # Separate K projection
blk.N.attn_k_norm.weight        # K normalization
blk.N.attn_norm.weight          # Pre-norm RMS
blk.N.attn_output.weight        # Attention output projection
blk.N.attn_q.weight             # Separate Q projection
blk.N.attn_q_norm.weight        # Q normalization
blk.N.attn_v.weight             # Separate V projection
blk.N.ffn_down_exps.weight      # Expert down projection
blk.N.ffn_down_shexp.weight     # Shared expert down projection
blk.N.ffn_gate_exps.weight      # Expert gate projection
blk.N.ffn_gate_inp.weight       # Expert routing weights
blk.N.ffn_gate_inp_shexp.weight # Shared expert routing
blk.N.ffn_gate_shexp.weight     # Shared expert gate
blk.N.ffn_up_exps.weight        # Expert up projection
blk.N.ffn_up_shexp.weight       # Shared expert up projection
blk.N.post_attention_norm.weight # Post-norm RMS

MoE Expert Structure

Both layer types use the same MoE pattern -- 256 experts with 8 active per token:

Tensor                     Purpose
ffn_gate_inp.weight        Top-8 expert routing gate
ffn_gate_inp_shexp.weight  Shared expert routing
ffn_gate_exps.weight       Per-expert gate projection
ffn_gate_shexp.weight      Shared expert gate projection
ffn_up_exps.weight         Per-expert up projection (FFN dim 512)
ffn_up_shexp.weight        Shared expert up projection
ffn_down_exps.weight       Per-expert down projection
ffn_down_shexp.weight      Shared expert down projection
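
For intuition, here is a schematic top-8 routing sketch in NumPy. It mirrors the tensor roles above but is not the exact Qwen3.5 kernel: the real implementation operates on quantized weights, and the shared-expert path (the *_shexp tensors) is added separately:

import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def moe_block(h, gate_inp, gate_exps, up_exps, down_exps, k=8):
    # h: (2048,) hidden state; gate_inp: (256, 2048) router weights
    # gate_exps / up_exps: (256, 512, 2048); down_exps: (256, 2048, 512)
    scores = gate_inp @ h                     # one routing score per expert
    top = np.argsort(scores)[-k:]             # indices of the 8 winning experts
    w = np.exp(scores[top]); w /= w.sum()     # softmax over the selected experts
    out = np.zeros_like(h)
    for wi, e in zip(w, top):
        a = silu(gate_exps[e] @ h) * (up_exps[e] @ h)  # SwiGLU, expert FFN dim 512
        out += wi * (down_exps[e] @ a)
    return out                                # shared-expert output added elsewhere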

Shared Tensors

shared.gguf contains 3 tensors that are not layer-specific:

Tensor              Purpose
token_embd.weight   Input token embeddings (248,320 x 2,048)
output_norm.weight  Final RMS layer normalization
output.weight       Output projection / LM head

Layer Size Map

Layer  Type  Size    Tensors  Notes
00     SSM   523 MB  19
01     SSM   523 MB  19
02     SSM   523 MB  19
03     ATTN  516 MB  16       First attention layer
04     SSM   523 MB  19
05     SSM   457 MB  19       Smaller SSM block
06     SSM   457 MB  19
07     ATTN  516 MB  16
08     SSM   457 MB  19
09     SSM   457 MB  19
10     SSM   523 MB  19
11     ATTN  450 MB  16       Smallest attention layer
12     SSM   457 MB  19
13     SSM   523 MB  19
14     SSM   457 MB  19
15     ATTN  450 MB  16
16     SSM   523 MB  19
17     SSM   457 MB  19
18     SSM   457 MB  19
19     ATTN  516 MB  16       Mid-model attention
20     SSM   457 MB  19
21     SSM   457 MB  19
22     SSM   523 MB  19
23     ATTN  450 MB  16
24     SSM   457 MB  19
25     SSM   523 MB  19
26     SSM   457 MB  19
27     ATTN  450 MB  16
28     SSM   523 MB  19
29     SSM   457 MB  19
30     SSM   457 MB  19
31     ATTN  516 MB  16
32     SSM   457 MB  19
33     SSM   457 MB  19
34     SSM   523 MB  19
35     ATTN  516 MB  16
36     SSM   523 MB  19       Deep-model SSM
37     SSM   523 MB  19
38     SSM   523 MB  19
39     ATTN  526 MB  16       Final attention layer

Total: ~19.76 GB across 41 GGUF files

Memory Efficiency

Metric        Standard Load  Sequential Load       Improvement
Peak memory   21.2 GB        1.13 GB               17.4x reduction
Files needed  1              2 (shared + 1 layer)

The sequential loading pattern loads shared.gguf + one layer at a time, processes the layer, releases it, then moves to the next. This enables running a 35B parameter model on hardware with as little as ~2 GB of available memory.

Format Specification

  • Format: lemonslice-gguf-v1 (layer-split GGUF)
  • GGUF Version: 3
  • Quantization: Q4_K_M (4-bit mixed, by Unsloth)
  • Alignment: 32 bytes
  • Byte order: Little-endian
  • Source: Qwen3.5-35B-A3B-Q4_K_M.gguf
  • Architecture ID: qwen35moe
  • Total tensors: 733 (3 shared + 730 across layers)

File Manifest

Qwen35_35B_Layers/
├── README.md                # This file
├── manifest.json            # Complete layer/tensor manifest
├── shared.gguf              # Shared tensors (714 MB, 3 tensors)
├── layer_0000.gguf          # Block 0 -- SSM (523 MB, 19 tensors)
├── layer_0001.gguf          # Block 1 -- SSM (523 MB, 19 tensors)
├── layer_0002.gguf          # Block 2 -- SSM (523 MB, 19 tensors)
├── layer_0003.gguf          # Block 3 -- Attention (516 MB, 16 tensors)
├── layer_0004.gguf          # Block 4 -- SSM (523 MB, 19 tensors)
├── layer_0005.gguf          # Block 5 -- SSM (457 MB, 19 tensors)
├── layer_0006.gguf          # Block 6 -- SSM (457 MB, 19 tensors)
├── layer_0007.gguf          # Block 7 -- Attention (516 MB, 16 tensors)
├── ...
└── layer_0039.gguf          # Block 39 -- Attention (526 MB, 16 tensors)

Usage

Inspecting Individual Layers

# View tensor names and KV metadata in a layer file
gguf-dump layer_0000.gguf

# View shared tensors and full model hyperparameters
gguf-dump shared.gguf

# View the complete manifest
cat manifest.json | python -m json.tool
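
The same inspection can be scripted with the gguf Python package (pip install gguf); a small sketch:

from gguf import GGUFReader

reader = GGUFReader("layer_0005.gguf")
print(len(reader.fields), "KV metadata entries")   # expected: 54
print(len(reader.tensors), "tensors")              # 19 for an SSM block
for t in reader.tensors:
    print(t.name, list(t.shape), t.tensor_type.name)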

Sequential Loading (Python)

import gguf
import numpy as np

# Load shared tensors (embeddings, final norm, LM head).
# GGUFReader exposes tensors as a list, so index them by name first.
shared = gguf.GGUFReader("shared.gguf")
shared_tensors = {t.name: np.asarray(t.data) for t in shared.tensors}
embeddings = shared_tensors["token_embd.weight"]
output_weight = shared_tensors["output.weight"]

# Sequentially process each layer
for i in range(40):
    reader = gguf.GGUFReader(f"layer_{i:04d}.gguf")
    for tensor in reader.tensors:
        name = tensor.name
        data = np.asarray(tensor.data)  # raw (quantized) tensor bytes
        # ... forward pass logic
    del reader  # release the memory-mapped layer

Selective Layer Ablation

# Run inference with only the SSM layers (skip attention layers at 3, 7, 11, ...).
# load_layer() and forward() are placeholders for your own pipeline.
ssm_layers = [i for i in range(40) if i % 4 != 3]
for i in ssm_layers:
    layer = load_layer(f"layer_{i:04d}.gguf")
    hidden = forward(hidden, layer)

The LemonSlice Logic

The lemonslice-gguf-v1 format gets its name from slicing a monolithic GGUF into discrete, self-contained "slices" -- much like cutting a lemon into wedges, where each wedge retains the full flavor profile. Each layer file is exactly one transformer block.

Each GGUF File = One Transformer Block

In a standard GGUF, the entire model -- embedding table, all 40 transformer blocks, layer norms, and the LM head -- is packed into a single 21.2 GB binary. You load the whole file or nothing.

LemonSlice changes this at the file level:

Standard GGUF:     model.gguf          =  [embd][blk.0][blk.1]...[blk.39][output_norm][output]
LemonSlice:        shared.gguf         =  [embd][output_norm][output]
                   layer_0000.gguf     =  [blk.0]       (19 tensors, 523 MB)
                   layer_0001.gguf     =  [blk.1]       (19 tensors, 523 MB)
                   ...
                   layer_0039.gguf     =  [blk.39]      (16 tensors, 526 MB)

One file per block. No more, no less. The mapping is deterministic and encoded in the filename: layer_NNNN.gguf contains only the tensors prefixed blk.N.

This means:

  • File = Block -- layer_0012.gguf is literally just transformer block 12, nothing else
  • Blocks are independently loadable -- you can open, inspect, modify, or replace any single block without touching the other 39
  • Blocks are independently valid -- every layer_XXXX.gguf is a complete GGUF v3 file with full architecture metadata (54 KV entries), so any GGUF tool can read it
  • The model is a playlist, not a monolith -- inference is a sequential walk through the layer files, one at a time

Why One-File-Per-Block Matters for Inference

1. Memory: Run 35B Models on 2 GB

The most immediate impact is peak memory. Standard GGUF loading mmap()s the entire 21.2 GB file into virtual memory. With LemonSlice:

Forward pass for token t:
  1. mmap(shared.gguf)      -> 714 MB  [token_embd.weight resident]
  2. mmap(layer_0000.gguf)  -> 523 MB  [block 0 weights]
     -> compute block 0 -> unmap layer_0000.gguf  [memory freed]
  3. mmap(layer_0001.gguf)  -> 523 MB  [block 1 weights]
     -> compute block 1 -> unmap layer_0001.gguf  [memory freed]
  ...
  41. mmap(layer_0039.gguf) -> 526 MB  [block 39 weights]
      -> compute block 39 -> unmap
  42. shared.gguf           -> already resident [output_norm + LM head]
      -> produce logits

Peak memory: ~1.2 GB (shared + largest single block)

The 17.4x reduction (21.2 GB -> 1.2 GB) means models that previously required 48 GB of VRAM can now run in system RAM on machines with as little as 4 GB total.

2. Selective Loading: Skip What You Don't Need

With blocks as files, you can choose which ones to load:

  • Early exit -- if block 15 already produces high-confidence logits, skip blocks 16-39. This trades quality for speed dynamically.
  • Layer ablation -- remove attention-only layers (3, 7, 11, ...) to test their contribution. Remove SSM layers to understand state-space dynamics.
  • Progressive depth -- start with blocks 0-9 for fast responses, progressively load more blocks as the conversation deepens.
  • Hot-path caching -- if certain layers are accessed repeatedly (common in batch inference), keep them memory-resident while streaming others.
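
A hypothetical early-exit loop might look like the sketch below; embed, load_layer, forward, and lm_head stand in for your own pipeline and are not provided by this repo:

import numpy as np

hidden = embed(token_ids)                       # embeddings from shared.gguf
for i in range(40):
    layer = load_layer(f"layer_{i:04d}.gguf")   # mmap one ~500 MB block
    hidden = forward(hidden, layer)
    del layer                                   # release it immediately
    if i >= 9:                                  # consider exiting after 10 blocks
        logits = lm_head(hidden)                # output_norm + output.weight
        p = np.exp(logits - logits.max()); p /= p.sum()
        if p.max() > 0.9:                       # confident enough -> stop early
            break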

3. Per-Layer Quantization

In a monolithic GGUF, the entire model shares one quantization scheme (Q4_K_M in this case). With LemonSlice, each layer file can be independently re-quantized:

shared.gguf         -> Q8_0  (embeddings need precision)
layer_0000-0009     -> Q4_K_M (early layers: moderate precision)
layer_0010-0029     -> Q3_K_M (middle layers: aggressive quant)
layer_0030-0039     -> Q5_K_M (late layers: higher precision for output quality)

This lets you optimize the precision/speed/size trade-off per layer rather than globally.
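
In principle, such a plan could be driven by llama.cpp's llama-quantize tool. Whether llama-quantize accepts a layer-only GGUF is untested -- treat this as a sketch of the per-layer scheme, not a verified workflow:

# Hypothetical per-layer re-quantization plan (layer ranges -> target type).
import subprocess

plan = {range(0, 10): "Q4_K_M", range(10, 30): "Q3_K_M", range(30, 40): "Q5_K_M"}
for layers, qtype in plan.items():
    for i in layers:
        subprocess.run(["llama-quantize", "--allow-requantize",
                        f"layer_{i:04d}.gguf", f"layer_{i:04d}.{qtype}.gguf", qtype],
                       check=True)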

4. Multi-GPU and Hybrid Execution

Blocks as files enable natural distribution:

GPU 0: shared.gguf + layer_0000 to layer_0019  (first 20 blocks)
GPU 1: layer_0020 to layer_0039                (last 20 blocks)

Or hybrid CPU/GPU:

GPU:   shared.gguf + layer_0000 to layer_0009  (first 10 blocks in VRAM)
CPU:   layer_0010 to layer_0039                 (remaining 30 blocks in RAM)

The file boundary makes it trivial to decide "which blocks live where."
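
A placement table for either scheme is a few lines to build; the device strings here are illustrative:

placement = {"shared.gguf": "cuda:0"}
for i in range(40):
    # Two-GPU pipeline split; for hybrid CPU/GPU use "cuda:0" if i < 10 else "cpu"
    placement[f"layer_{i:04d}.gguf"] = "cuda:0" if i < 20 else "cuda:1"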

5. Bandwidth and Streaming

When model weights are loaded from disk (not pre-loaded into RAM), LemonSlice streams blocks on-demand:

  • Only read the bytes you need -- each layer_XXXX.gguf is ~500 MB, not 21.2 GB
  • Predictable I/O pattern -- sequential file reads, no random access into a 21 GB file
  • Interruptibility -- if inference is cancelled mid-forward-pass, you've only loaded the blocks you reached, not wasted I/O loading everything

The Inference Conveyor Belt: How Hidden States Flow Between Layers

A common question: what passes from one layer file to the next? It's not the KV cache being "transferred" between files -- it's the hidden state, which is a tiny vector compared to the weights.

Here's exactly what happens during a forward pass:

Input token IDs
      |
      v
[shared.gguf: token_embd.weight]
      |  produces hidden state h_0  (2048-dim vector per token)
      v
  ┌──────────────────────────────────────────────────────────────┐
  │  mmap(layer_0000.gguf)    -> load block 0 weights (523 MB)   │
  │  h_0 --[block 0 forward]--> h_1                              │
  │  (KV cache for block 0 computed and stored)                  │
  │  munmap(layer_0000.gguf)  -> release 523 MB                  │
  └──────────────────────────────────────────────────────────────┘
      |  hidden state h_1 passes in RAM (just a few KB)
      v
  ┌──────────────────────────────────────────────────────────────┐
  │  mmap(layer_0001.gguf)    -> load block 1 weights (523 MB)   │
  │  h_1 --[block 1 forward]--> h_2                              │
  │  (KV cache for block 1 accumulated)                          │
  │  munmap(layer_0001.gguf)  -> release 523 MB                  │
  └──────────────────────────────────────────────────────────────┘
      |  hidden state h_2 passes in RAM
      v
  ...repeat for all 40 blocks...
      |
      v
[shared.gguf: output_norm.weight + output.weight]
      |  produces logits (248,320-dim vector)
      v
  Softmax -> token probabilities

Three things exist in memory during this process:

What           Size                                                Lifetime
Layer weights  ~500 MB                                             Only while that layer file is mmap'd
Hidden state   ~8 KB per token (2048 * 4 bytes)                    Lives in RAM throughout, passed between layers
KV cache       ~2 KB per attention layer per token (2 KV heads x   Accumulates as attention layers are processed;
               256-dim K and V, fp16)                              retained across the generation

The hidden state is the "product" on the conveyor belt -- each layer file is a "station" that transforms it. The hidden state is tiny (a few KB) compared to the weights (hundreds of MB), so passing it between layer files is essentially free.

The KV cache is a "side buffer" that accumulates as each layer computes its K/V projections. In a standard GGUF, the KV cache is allocated separately from the weights and works identically. The difference with LemonSlice is that the KV cache builds up incrementally as layers are loaded one by one, rather than all at once.
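
The size gap is easy to verify from the hyperparameters above, assuming fp32 activations and an fp16 KV cache:

hidden_state_bytes = 2048 * 4               # fp32 hidden state -> 8 KB per token
kv_bytes = 2 * 256 * 2 * 2                  # 2 KV heads x 256 dims x (K+V) x fp16 -> 2 KB
layer_file_bytes = 523 * 1024 ** 2          # one typical SSM layer file -> ~523 MB
print(layer_file_bytes // hidden_state_bytes)   # weights are ~67,000x the hidden state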

How It Works: The Split Process

  1. Parse the monolithic header -- read GGUF.version, GGUF.tensor_count, GGUF.kv_count, and all 54 metadata key-value pairs (architecture, hyperparameters, tokenizer, sampling config)
  2. Identify tensor boundaries -- each tensor has a name, shape, dtype, and byte offset. The blk.N. prefix in tensor names (e.g., blk.0.attn_qkv.weight) naturally partitions tensors into layer groups
  3. Extract shared tensors -- tensors without a blk.N. prefix (token_embd.weight, output_norm.weight, output.weight) are separated into shared.gguf because they span the entire model, not a single layer
  4. Write each layer as its own GGUF -- for each layer N, write a complete GGUF v3 file with:
    • Full header (magic + version)
    • All 54 KV metadata entries replicated from the original
    • Only the tensors belonging to that layer (blk.N.*)
    • 32-byte alignment padding for memory-mapped loading
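
A simplified sketch of these four steps, assuming the gguf Python package's GGUFReader/GGUFWriter; metadata replication is elided and shape/alignment handling for quantized tensors is glossed over:

from gguf import GGUFReader, GGUFWriter

src = GGUFReader("Qwen3.5-35B-A3B-Q4_K_M.gguf")   # step 1: parse monolithic file
for n in range(40):
    out = GGUFWriter(f"layer_{n:04d}.gguf", arch="qwen35moe")
    # ... replicate the 54 KV metadata entries from src.fields here ...
    prefix = f"blk.{n}."
    for t in src.tensors:                          # steps 2-3: partition by blk.N. prefix
        if t.name.startswith(prefix):
            out.add_tensor(t.name, t.data, raw_dtype=t.tensor_type)
    out.write_header_to_file()                     # step 4: header, KV data, tensor data
    out.write_kv_data_to_file()
    out.write_tensors_to_file()
    out.close()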

Why Full Metadata in Every File

Unlike splitting approaches that store metadata externally, every layer_XXXX.gguf is a valid, standalone GGUF file:

  • gguf-dump layer_0005.gguf shows the full model architecture (context length, head count, expert count, SSM parameters, tokenizer)
  • Any layer file can be loaded by standard GGUF parsers without external config
  • Layers can be shared independently -- send someone a single layer and they can inspect it
  • The KV metadata duplication adds ~2 KB per layer file -- a deliberate trade-off for complete self-containment

Tensor Count Variation

SSM layers have 19 tensors while attention-only layers have 16 because:

  • SSM layers include 7 ssm_* tensors (state-space model: ssm_a, ssm_alpha, ssm_beta, ssm_conv1d, ssm_dt, ssm_norm, ssm_out) but use a combined attn_qkv.weight
  • Attention layers have separate attn_k, attn_q, attn_v, attn_output plus K/Q normalization, but no SSM tensors
  • Both layer types share the same 8 MoE FFN tensors (gate, up, down for per-expert and shared)
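
These per-layer tensor counts are what the manifest-based integrity check verifies. The field names used below ("layers", "file", "tensor_count") are guesses at the manifest.json schema -- adjust to the actual keys:

import json
from gguf import GGUFReader

with open("manifest.json") as f:
    manifest = json.load(f)

for entry in manifest["layers"]:                   # hypothetical schema
    reader = GGUFReader(entry["file"])
    assert len(reader.tensors) == entry["tensor_count"], entry["file"]
    del reader                                     # release before the next file
print("all layer files match the manifest")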

Future Vision: Where LemonSlice Goes

The layer-split format is a foundation, not a destination. Here's where this can go:

Layer Surgery

Mix and match transformer blocks from different models. Take the first 20 blocks from Qwen3.5-35B-A3B and the last 20 from Llama-3-70B. Swap in attention-only layers from a model with better reasoning capabilities. The file-per-block format makes this as simple as copying .gguf files into a directory and updating manifest.json.

Incremental Model Updates

When a model is fine-tuned and only 5 of 40 layers change, you only need to download those 5 layer files (~2.5 GB) instead of the entire 21.2 GB model. This is particularly powerful for:

  • LoRA/DoRA merges that affect specific layers
  • Safety fine-tunes that modify late-layer behavior
  • Domain-specific adapters (medical, legal, code) as layer replacements

Progressive Inference

Start generating tokens with just the first 10 layers. If the output is confident, you saved 75% of the compute. If not, load more layers and refine. This creates a natural speed/quality trade-off that adapts per-token.

Edge and Constrained Deployment

Run a 35B model on a Raspberry Pi by loading one layer at a time from an SD card. It will be slow, but it will work -- and that's impossible with a monolithic GGUF. LemonSlice makes the full model spectrum accessible: from "runs in 2 GB RAM but takes 5 minutes per token" to "runs in 48 GB VRAM at 100 tokens/sec" with the same files.

Layer-Level Benchmarking and Analysis

Benchmark individual blocks for compute time, memory bandwidth, and activation patterns. Identify which layers are bottlenecks. Compare SSM vs attention-only layer performance. The file-per-block format makes it trivial to isolate and profile any layer independently.

P2P Model Distribution

Share individual layers peer-to-peer. Someone can share just the attention-only layers (10 files, ~5 GB) while you already have the SSM layers from another source. The manifest verifies integrity regardless of download source.

Layer-Specific Fine-Tuning

Fine-tune only layers 30-39 (the "output-facing" layers) while keeping layers 0-29 frozen. With LemonSlice, this is just replacing files 30-39 with your fine-tuned versions. No need to save the entire 21.2 GB model when only 5 GB changed.

Hybrid Architecture Research

The Qwen3.5-35B-A3B model already uses a hybrid SSM + Attention pattern (visible in the layer files). LemonSlice makes it easy to experiment with:

  • Different SSM/Attention ratios
  • Reordering layers (attention layers at the start vs end)
  • Inserting custom layers between existing blocks
  • Removing the MoE experts from specific layers to test their contribution

Citation

Original model: Qwen/Qwen3.5-35B-A3B

@misc{qwen3.5,
  title={Qwen3.5 Technical Report},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face}
}

License

Apache 2.0 -- inherited from the original Qwen3.5-35B-A3B license.

Quantization by Unsloth.
