Qwen3.5-35B-A3B Layer-Split GGUF

Qwen3.5-35B-A3B (Qwen3.5-MoE) decomposed into per-layer GGUF files -- a custom layer-split format (lemonslice-gguf-v1) that enables per-layer inspection and sequential loading with a 17.4x reduction in peak memory.

Overview

This repository presents the Qwen3.5-35B-A3B Mixture-of-Experts model restructured into individual GGUF files -- one per transformer block. Starting from the Q4_K_M quantized base model (by Unsloth), the weights were manually decomposed, layer tensors were reorganized, shared components were physically separated, and a complete manifest was built -- resulting in 40 layer files + 1 shared file. This custom lemonslice-gguf-v1 format enables:

  • Sequential layer loading -- peak memory reduced by 17.4x (from 21.2 GB to ~1.13 GB)
  • Layer-level inspection -- examine, modify, or swap individual blocks
  • Selective inference -- run only a subset of layers for ablation studies
  • Custom pipeline parallelism -- distribute layers across multiple devices
  • Per-layer fine-tuning -- target specific layer types (SSM vs Attention)

What Makes This Different From a Standard GGUF

Standard GGUF (monolithic)

A normal GGUF file contains all 733 tensors in a single file. Loading it requires the entire 21.2 GB to reside in RAM/VRAM simultaneously. Every tensor -- embeddings, all 40 blocks, and the LM head -- is mapped into memory at once.

This Repository (layer-split / lemonslice-gguf-v1)

This format represents a fundamentally different file organization:

Aspect            Standard GGUF               Layer-Split GGUF
File count        1 file                      42 files (1 shared + 40 layers + manifest)
Peak memory       21.2 GB (entire model)      1.13 GB (one layer at a time)
Memory reduction  1x                          17.4x
KV metadata       In single file              Replicated in each layer file
Self-contained    Yes                         Yes (each layer file carries full model metadata)
Tensor isolation  No (all tensors together)   Yes (per-block boundaries)

Key Architectural Differences

  1. Every layer file is a standalone GGUF -- each layer_XXXX.gguf contains its own complete GGUF header with all 54 KV metadata entries (architecture, hyperparameters, tokenizer info, sampling params). This is unlike a standard GGUF where metadata lives once in the single file.

  2. Shared tensors are physically separated -- token_embd.weight, output_norm.weight, and output.weight are extracted into shared.gguf, decoupling the embedding/projection layer from the transformer body.

  3. Layer files preserve tensor count boundaries -- the manifest tracks exactly how many tensors each layer contains (16 for attention-only, 19 for SSM layers), making it possible to verify file integrity independently.

  4. Hybrid SSM + Attention architecture exposed -- the layer split makes visible what's hidden in a monolithic file: the regular alternation between SSM (Mamba) and Attention layers every 4 blocks.

  5. Self-contained layer files -- each layer_XXXX.gguf carries full model metadata (54 KV entries), making it independently inspectable by any GGUF parser.

Model Hyperparameters

From the GGUF KV metadata (shared.gguf):

Parameter                Value
Architecture             qwen35moe
Model name               Qwen3.5-35B-A3B
Block count              40
Context length           262,144 (256K)
Embedding dimension      2,048
Attention heads          16
KV heads (GQA)           2
Key/Value length         256
RoPE dimensions          64 (sections: [11, 11, 10, 0])
RoPE freq base           10,000,000
RMS norm epsilon         1e-6
Total experts            256
Active experts           8
Expert FFN dim           512
Expert shared FFN dim    512
SSM state size           128
SSM conv kernel          4
SSM group count          16
SSM time step rank       32
SSM inner size           4,096
Full attention interval  Every 4 layers
Tokenizer                GPT2 (qwen35 pre-tokenizer)
Vocabulary               248,320 tokens
Quantization             Q4_K_M (by Unsloth)
Sampling                 top_k=20, top_p=0.95, temp=1.0
License                  Apache 2.0

Architecture

Qwen3.5-35B-A3B is a hybrid SSM + Attention Mixture-of-Experts model with 40 layers and 733 total tensors (Q4_K_M quantized).

File Inventory

Component                          File                             Tensors  Size
Shared (embeddings, norm, output)  shared.gguf                      3        714 MB
SSM layers (30 blocks)             layer_0000-0002, 0004-0006, ...  19 each  472-548 MB
Attention layers (10 blocks)       layer_0003, 0007, 0011, ...      16 each  472-552 MB

Hybrid Layer Pattern

The model alternates between two layer types in a regular every-4-layers pattern (full_attention_interval = 4):

Layer Type      Count  Pattern                                                          Tensors  Key Differentiator
SSM (Mamba)     30     0-2, 4-6, 8-10, 12-14, 16-18, 20-22, 24-26, 28-30, 32-34, 36-38  19       ssm_* tensors: ssm_a, ssm_alpha, ssm_beta, ssm_conv1d, ssm_dt, ssm_norm, ssm_out
Attention-only  10     3, 7, 11, 15, 19, 23, 27, 31, 35, 39                             16       separate attn_k, attn_q, attn_v, attn_output + K/Q norm tensors
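
The alternation follows directly from full_attention_interval = 4, so layer types can be derived in code. A minimal sketch (the function name layer_type is purely illustrative):

def layer_type(i: int, full_attention_interval: int = 4) -> str:
    # Blocks 3, 7, 11, ..., 39 are full-attention; everything else is SSM (Mamba).
    return "attention" if (i + 1) % full_attention_interval == 0 else "ssm"

assert [i for i in range(40) if layer_type(i) == "attention"] == list(range(3, 40, 4))
assert sum(layer_type(i) == "ssm" for i in range(40)) == 30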

SSM Layer Tensors (19 per block)

blk.N.attn_gate.weight          # SSM gate control
blk.N.attn_norm.weight          # Pre-norm RMS
blk.N.attn_qkv.weight           # Combined QKV projection (SSM-style)
blk.N.ffn_down_exps.weight      # Expert down projection
blk.N.ffn_down_shexp.weight     # Shared expert down projection
blk.N.ffn_gate_exps.weight      # Expert gate projection
blk.N.ffn_gate_inp.weight       # Expert routing weights
blk.N.ffn_gate_inp_shexp.weight # Shared expert routing
blk.N.ffn_gate_shexp.weight     # Shared expert gate
blk.N.ffn_up_exps.weight        # Expert up projection
blk.N.ffn_up_shexp.weight       # Shared expert up projection
blk.N.post_attention_norm.weight # Post-norm RMS
blk.N.ssm_a                     # SSM discretization parameter
blk.N.ssm_alpha.weight          # SSM alpha scaling
blk.N.ssm_beta.weight           # SSM beta scaling
blk.N.ssm_conv1d.weight         # SSM causal conv1d kernel
blk.N.ssm_dt.bias               # SSM delta timestep bias
blk.N.ssm_norm.weight           # SSM normalization
blk.N.ssm_out.weight            # SSM output projection

Attention-Only Layer Tensors (16 per block)

blk.N.attn_k.weight             # Separate K projection
blk.N.attn_k_norm.weight        # K normalization
blk.N.attn_norm.weight          # Pre-norm RMS
blk.N.attn_output.weight        # Attention output projection
blk.N.attn_q.weight             # Separate Q projection
blk.N.attn_q_norm.weight        # Q normalization
blk.N.attn_v.weight             # Separate V projection
blk.N.ffn_down_exps.weight      # Expert down projection
blk.N.ffn_down_shexp.weight     # Shared expert down projection
blk.N.ffn_gate_exps.weight      # Expert gate projection
blk.N.ffn_gate_inp.weight       # Expert routing weights
blk.N.ffn_gate_inp_shexp.weight # Shared expert routing
blk.N.ffn_gate_shexp.weight     # Shared expert gate
blk.N.ffn_up_exps.weight        # Expert up projection
blk.N.ffn_up_shexp.weight       # Shared expert up projection
blk.N.post_attention_norm.weight # Post-norm RMS

MoE Expert Structure

Both layer types use the same MoE pattern -- 256 experts with 8 active per token:

Tensor                     Purpose
ffn_gate_inp.weight        Top-8 expert routing gate
ffn_gate_inp_shexp.weight  Shared expert routing
ffn_gate_exps.weight       Per-expert gate projection
ffn_gate_shexp.weight      Shared expert gate projection
ffn_up_exps.weight         Per-expert up projection (FFN dim 512)
ffn_up_shexp.weight        Shared expert up projection
ffn_down_exps.weight       Per-expert down projection
ffn_down_shexp.weight      Shared expert down projection
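
For intuition, here is a schematic top-8 routing sketch in NumPy. It mirrors the tensor roles above but is not the exact Qwen3.5 kernel: the real implementation operates on quantized weights, and the shared-expert path (the *_shexp tensors) is added separately:

import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def moe_block(h, gate_inp, gate_exps, up_exps, down_exps, k=8):
    # h: (2048,) hidden state; gate_inp: (256, 2048) router weights
    # gate_exps / up_exps: (256, 512, 2048); down_exps: (256, 2048, 512)
    scores = gate_inp @ h                     # one routing score per expert
    top = np.argsort(scores)[-k:]             # indices of the 8 winning experts
    w = np.exp(scores[top]); w /= w.sum()     # softmax over the selected experts
    out = np.zeros_like(h)
    for wi, e in zip(w, top):
        a = silu(gate_exps[e] @ h) * (up_exps[e] @ h)  # SwiGLU, expert FFN dim 512
        out += wi * (down_exps[e] @ a)
    return out                                # shared-expert output added elsewhere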

Shared Tensors

shared.gguf contains 3 tensors that are not layer-specific:

Tensor              Purpose
token_embd.weight   Input token embeddings (248,320 x 2,048)
output_norm.weight  Final RMS layer normalization
output.weight       Output projection / LM head

Layer Size Map

Layer  Type  Size    Tensors  Notes
00     SSM   523 MB  19
01     SSM   523 MB  19
02     SSM   523 MB  19
03     ATTN  516 MB  16       First attention layer
04     SSM   523 MB  19
05     SSM   457 MB  19       Smaller SSM block
06     SSM   457 MB  19
07     ATTN  516 MB  16
08     SSM   457 MB  19
09     SSM   457 MB  19
10     SSM   523 MB  19
11     ATTN  450 MB  16       Smallest attention layer
12     SSM   457 MB  19
13     SSM   523 MB  19
14     SSM   457 MB  19
15     ATTN  450 MB  16
16     SSM   523 MB  19
17     SSM   457 MB  19
18     SSM   457 MB  19
19     ATTN  516 MB  16       Mid-model attention
20     SSM   457 MB  19
21     SSM   457 MB  19
22     SSM   523 MB  19
23     ATTN  450 MB  16
24     SSM   457 MB  19
25     SSM   523 MB  19
26     SSM   457 MB  19
27     ATTN  450 MB  16
28     SSM   523 MB  19
29     SSM   457 MB  19
30     SSM   457 MB  19
31     ATTN  516 MB  16
32     SSM   457 MB  19
33     SSM   457 MB  19
34     SSM   523 MB  19
35     ATTN  516 MB  16
36     SSM   523 MB  19       Deep-model SSM
37     SSM   523 MB  19
38     SSM   523 MB  19
39     ATTN  526 MB  16       Final attention layer

Total: ~19.76 GB across 41 GGUF files

Memory Efficiency

Metric        Standard Load  Sequential Load       Improvement
Peak memory   21.2 GB        1.13 GB               17.4x reduction
Files needed  1              2 (shared + 1 layer)

The sequential loading pattern loads shared.gguf + one layer at a time, processes the layer, releases it, then moves to the next. This enables running a 35B parameter model on hardware with as little as ~2 GB of available memory.

Format Specification

  • Format: lemonslice-gguf-v1 (layer-split GGUF)
  • GGUF Version: 3
  • Quantization: Q4_K_M (4-bit mixed, by Unsloth)
  • Alignment: 32 bytes
  • Byte order: Little-endian
  • Source: Qwen3.5-35B-A3B-Q4_K_M.gguf
  • Architecture ID: qwen35moe
  • Total tensors: 733 (3 shared + 730 across layers)

File Manifest

Qwen35_35B_Layers/
├── README.md                # This file
├── manifest.json            # Complete layer/tensor manifest
├── shared.gguf              # Shared tensors (714 MB, 3 tensors)
├── layer_0000.gguf          # Block 0 -- SSM (523 MB, 19 tensors)
├── layer_0001.gguf          # Block 1 -- SSM (523 MB, 19 tensors)
├── layer_0002.gguf          # Block 2 -- SSM (523 MB, 19 tensors)
├── layer_0003.gguf          # Block 3 -- Attention (516 MB, 16 tensors)
├── layer_0004.gguf          # Block 4 -- SSM (523 MB, 19 tensors)
├── layer_0005.gguf          # Block 5 -- SSM (457 MB, 19 tensors)
├── layer_0006.gguf          # Block 6 -- SSM (457 MB, 19 tensors)
├── layer_0007.gguf          # Block 7 -- Attention (516 MB, 16 tensors)
├── ...
└── layer_0039.gguf          # Block 39 -- Attention (526 MB, 16 tensors)

Usage

Inspecting Individual Layers

# View tensor names and KV metadata in a layer file
gguf-dump layer_0000.gguf

# View shared tensors and full model hyperparameters
gguf-dump shared.gguf

# View the complete manifest
cat manifest.json | python -m json.tool
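
The same inspection can be scripted with the gguf Python package (pip install gguf); a small sketch:

from gguf import GGUFReader

reader = GGUFReader("layer_0005.gguf")
print(len(reader.fields), "KV metadata entries")   # expected: 54
print(len(reader.tensors), "tensors")              # 19 for an SSM block
for t in reader.tensors:
    print(t.name, list(t.shape), t.tensor_type.name)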

Sequential Loading (Python)

import gguf
import numpy as np

# Load shared tensors (embeddings, final norm, LM head).
# GGUFReader exposes tensors as a list, so index them by name first.
shared = gguf.GGUFReader("shared.gguf")
shared_tensors = {t.name: np.asarray(t.data) for t in shared.tensors}
embeddings = shared_tensors["token_embd.weight"]
output_weight = shared_tensors["output.weight"]

# Sequentially process each layer
for i in range(40):
    reader = gguf.GGUFReader(f"layer_{i:04d}.gguf")
    for tensor in reader.tensors:
        name = tensor.name
        data = np.asarray(tensor.data)  # raw (quantized) tensor bytes
        # ... forward pass logic
    del reader  # release the memory-mapped layer

Selective Layer Ablation

# Run inference with only the SSM layers (skip attention layers at 3, 7, 11, ...).
# load_layer() and forward() are placeholders for your own pipeline.
ssm_layers = [i for i in range(40) if i % 4 != 3]
for i in ssm_layers:
    layer = load_layer(f"layer_{i:04d}.gguf")
    hidden = forward(hidden, layer)

The LemonSlice Logic

The lemonslice-gguf-v1 format gets its name from slicing a monolithic GGUF into discrete, self-contained "slices" -- much like cutting a lemon into wedges, where each wedge retains the full flavor profile. Each layer file is exactly one transformer block.

Each GGUF File = One Transformer Block

In a standard GGUF, the entire model -- embedding table, all 40 transformer blocks, layer norms, and the LM head -- is packed into a single 21.2 GB binary. You load the whole file or nothing.

LemonSlice changes this at the file level:

Standard GGUF:     model.gguf          =  [embd][blk.0][blk.1]...[blk.39][output_norm][output]
LemonSlice:        shared.gguf         =  [embd][output_norm][output]
                   layer_0000.gguf     =  [blk.0]       (19 tensors, 523 MB)
                   layer_0001.gguf     =  [blk.1]       (19 tensors, 523 MB)
                   ...
                   layer_0039.gguf     =  [blk.39]      (16 tensors, 526 MB)

One file per block. No more, no less. The mapping is deterministic and encoded in the filename: layer_NNNN.gguf contains only the tensors prefixed blk.N.

This means:

  • File = Block -- layer_0012.gguf is literally just transformer block 12, nothing else
  • Blocks are independently loadable -- you can open, inspect, modify, or replace any single block without touching the other 39
  • Blocks are independently valid -- every layer_XXXX.gguf is a complete GGUF v3 file with full architecture metadata (54 KV entries), so any GGUF tool can read it
  • The model is a playlist, not a monolith -- inference is a sequential walk through the layer files, one at a time

Why One-File-Per-Block Matters for Inference

1. Memory: Run 35B Models on 2 GB

The most immediate impact is peak memory. Standard GGUF loading mmap()s the entire 21.2 GB file into virtual memory. With LemonSlice:

Forward pass for token t:
  1. mmap(shared.gguf)      -> 714 MB  [token_embd.weight resident]
  2. mmap(layer_0000.gguf)  -> 523 MB  [block 0 weights]
     -> compute block 0 -> unmap layer_0000.gguf  [memory freed]
  3. mmap(layer_0001.gguf)  -> 523 MB  [block 1 weights]
     -> compute block 1 -> unmap layer_0001.gguf  [memory freed]
  ...
  41. mmap(layer_0039.gguf) -> 526 MB  [block 39 weights]
      -> compute block 39 -> unmap
  42. shared.gguf           -> already resident [output_norm + LM head]
      -> produce logits

Peak memory: ~1.2 GB (shared + largest single block)

The 17.4x reduction (21.2 GB -> 1.2 GB) means models that previously required 48 GB of VRAM can now run in system RAM on machines with as little as 4 GB total.

2. Selective Loading: Skip What You Don't Need

With blocks as files, you can choose which ones to load:

  • Early exit -- if block 15 already produces high-confidence logits, skip blocks 16-39. This trades quality for speed dynamically.
  • Layer ablation -- remove attention-only layers (3, 7, 11, ...) to test their contribution. Remove SSM layers to understand state-space dynamics.
  • Progressive depth -- start with blocks 0-9 for fast responses, progressively load more blocks as the conversation deepens.
  • Hot-path caching -- if certain layers are accessed repeatedly (common in batch inference), keep them memory-resident while streaming others.
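
A hypothetical early-exit loop might look like the sketch below; embed, load_layer, forward, and lm_head stand in for your own pipeline and are not provided by this repo:

import numpy as np

hidden = embed(token_ids)                       # embeddings from shared.gguf
for i in range(40):
    layer = load_layer(f"layer_{i:04d}.gguf")   # mmap one ~500 MB block
    hidden = forward(hidden, layer)
    del layer                                   # release it immediately
    if i >= 9:                                  # consider exiting after 10 blocks
        logits = lm_head(hidden)                # output_norm + output.weight
        p = np.exp(logits - logits.max()); p /= p.sum()
        if p.max() > 0.9:                       # confident enough -> stop early
            break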

3. Per-Layer Quantization

In a monolithic GGUF, the entire model shares one quantization scheme (Q4_K_M in this case). With LemonSlice, each layer file can be independently re-quantized:

shared.gguf         -> Q8_0  (embeddings need precision)
layer_0000-0009     -> Q4_K_M (early layers: moderate precision)
layer_0010-0029     -> Q3_K_M (middle layers: aggressive quant)
layer_0030-0039     -> Q5_K_M (late layers: higher precision for output quality)

This lets you optimize the precision/speed/size trade-off per layer rather than globally.
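
In principle, such a plan could be driven by llama.cpp's llama-quantize tool. Whether llama-quantize accepts a layer-only GGUF is untested -- treat this as a sketch of the per-layer scheme, not a verified workflow:

# Hypothetical per-layer re-quantization plan (layer ranges -> target type).
import subprocess

plan = {range(0, 10): "Q4_K_M", range(10, 30): "Q3_K_M", range(30, 40): "Q5_K_M"}
for layers, qtype in plan.items():
    for i in layers:
        subprocess.run(["llama-quantize", "--allow-requantize",
                        f"layer_{i:04d}.gguf", f"layer_{i:04d}.{qtype}.gguf", qtype],
                       check=True)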

4. Multi-GPU and Hybrid Execution

Blocks as files enable natural distribution:

GPU 0: shared.gguf + layer_0000 to layer_0019  (first 20 blocks)
GPU 1: layer_0020 to layer_0039                (last 20 blocks)

Or hybrid CPU/GPU:

GPU:   shared.gguf + layer_0000 to layer_0009  (first 10 blocks in VRAM)
CPU:   layer_0010 to layer_0039                 (remaining 30 blocks in RAM)

The file boundary makes it trivial to decide "which blocks live where."
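
A placement table for either scheme is a few lines to build; the device strings here are illustrative:

placement = {"shared.gguf": "cuda:0"}
for i in range(40):
    # Two-GPU pipeline split; for hybrid CPU/GPU use "cuda:0" if i < 10 else "cpu"
    placement[f"layer_{i:04d}.gguf"] = "cuda:0" if i < 20 else "cuda:1"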

5. Bandwidth and Streaming

When model weights are loaded from disk (not pre-loaded into RAM), LemonSlice streams blocks on-demand:

  • Only read the bytes you need -- each layer_XXXX.gguf is ~500 MB, not 21.2 GB
  • Predictable I/O pattern -- sequential file reads, no random access into a 21 GB file
  • Interruptibility -- if inference is cancelled mid-forward-pass, you've only loaded the blocks you reached, not wasted I/O loading everything

The Inference Conveyor Belt: How Hidden States Flow Between Layers

A common question: what passes from one layer file to the next? It's not the KV cache being "transferred" between files -- it's the hidden state, which is a tiny vector compared to the weights.

Here's exactly what happens during a forward pass:

Input token IDs
      |
      v
[shared.gguf: token_embd.weight]
      |  produces hidden state h_0  (2048-dim vector per token)
      v
  ┌──────────────────────────────────────────────────────────────┐
  │  mmap(layer_0000.gguf)    -> load block 0 weights (523 MB)   │
  │  h_0 --[block 0 forward]--> h_1                              │
  │  (KV cache for block 0 computed and stored)                  │
  │  munmap(layer_0000.gguf)  -> release 523 MB                  │
  └──────────────────────────────────────────────────────────────┘
      |  hidden state h_1 passes in RAM (just a few KB)
      v
  ┌──────────────────────────────────────────────────────────────┐
  │  mmap(layer_0001.gguf)    -> load block 1 weights (523 MB)   │
  │  h_1 --[block 1 forward]--> h_2                              │
  │  (KV cache for block 1 accumulated)                          │
  │  munmap(layer_0001.gguf)  -> release 523 MB                  │
  └──────────────────────────────────────────────────────────────┘
      |  hidden state h_2 passes in RAM
      v
  ...repeat for all 40 blocks...
      |
      v
[shared.gguf: output_norm.weight + output.weight]
      |  produces logits (248,320-dim vector)
      v
  Softmax -> token probabilities

Three things exist in memory during this process:

What           Size                                                Lifetime
Layer weights  ~500 MB                                             Only while that layer file is mmap'd
Hidden state   ~8 KB per token (2048 * 4 bytes)                    Lives in RAM throughout, passed between layers
KV cache       ~2 KB per attention layer per token (2 KV heads x   Accumulates as attention layers are processed;
               256-dim K and V, fp16)                              retained across the generation

The hidden state is the "product" on the conveyor belt -- each layer file is a "station" that transforms it. The hidden state is tiny (a few KB) compared to the weights (hundreds of MB), so passing it between layer files is essentially free.

The KV cache is a "side buffer" that accumulates as each layer computes its K/V projections. In a standard GGUF, the KV cache is allocated separately from the weights and works identically. The difference with LemonSlice is that the KV cache builds up incrementally as layers are loaded one by one, rather than all at once.
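
The size gap is easy to verify from the hyperparameters above, assuming fp32 activations and an fp16 KV cache:

hidden_state_bytes = 2048 * 4               # fp32 hidden state -> 8 KB per token
kv_bytes = 2 * 256 * 2 * 2                  # 2 KV heads x 256 dims x (K+V) x fp16 -> 2 KB
layer_file_bytes = 523 * 1024 ** 2          # one typical SSM layer file -> ~523 MB
print(layer_file_bytes // hidden_state_bytes)   # weights are ~67,000x the hidden state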

How It Works: The Split Process

  1. Parse the monolithic header -- read GGUF.version, GGUF.tensor_count, GGUF.kv_count, and all 54 metadata key-value pairs (architecture, hyperparameters, tokenizer, sampling config)
  2. Identify tensor boundaries -- each tensor has a name, shape, dtype, and byte offset. The blk.N. prefix in tensor names (e.g., blk.0.attn_qkv.weight) naturally partitions tensors into layer groups
  3. Extract shared tensors -- tensors without a blk.N. prefix (token_embd.weight, output_norm.weight, output.weight) are separated into shared.gguf because they span the entire model, not a single layer
  4. Write each layer as its own GGUF -- for each layer N, write a complete GGUF v3 file with:
    • Full header (magic + version)
    • All 54 KV metadata entries replicated from the original
    • Only the tensors belonging to that layer (blk.N.*)
    • 32-byte alignment padding for memory-mapped loading
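
A simplified sketch of these four steps, assuming the gguf Python package's GGUFReader/GGUFWriter; metadata replication is elided and shape/alignment handling for quantized tensors is glossed over:

from gguf import GGUFReader, GGUFWriter

src = GGUFReader("Qwen3.5-35B-A3B-Q4_K_M.gguf")   # step 1: parse monolithic file
for n in range(40):
    out = GGUFWriter(f"layer_{n:04d}.gguf", arch="qwen35moe")
    # ... replicate the 54 KV metadata entries from src.fields here ...
    prefix = f"blk.{n}."
    for t in src.tensors:                          # steps 2-3: partition by blk.N. prefix
        if t.name.startswith(prefix):
            out.add_tensor(t.name, t.data, raw_dtype=t.tensor_type)
    out.write_header_to_file()                     # step 4: header, KV data, tensor data
    out.write_kv_data_to_file()
    out.write_tensors_to_file()
    out.close()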

Why Full Metadata in Every File

Unlike splitting approaches that store metadata externally, every layer_XXXX.gguf is a valid, standalone GGUF file:

  • gguf-dump layer_0005.gguf shows the full model architecture (context length, head count, expert count, SSM parameters, tokenizer)
  • Any layer file can be loaded by standard GGUF parsers without external config
  • Layers can be shared independently -- send someone a single layer and they can inspect it
  • The KV metadata duplication adds ~2 KB per layer file -- a deliberate trade-off for complete self-containment

Tensor Count Variation

SSM layers have 19 tensors while attention-only layers have 16 because:

  • SSM layers include 7 ssm_* tensors (state-space model: ssm_a, ssm_alpha, ssm_beta, ssm_conv1d, ssm_dt, ssm_norm, ssm_out) but use a combined attn_qkv.weight
  • Attention layers have separate attn_k, attn_q, attn_v, attn_output plus K/Q normalization, but no SSM tensors
  • Both layer types share the same 8 MoE FFN tensors (gate, up, down for per-expert and shared)
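
These per-layer tensor counts are what the manifest-based integrity check verifies. The field names used below ("layers", "file", "tensor_count") are guesses at the manifest.json schema -- adjust to the actual keys:

import json
from gguf import GGUFReader

with open("manifest.json") as f:
    manifest = json.load(f)

for entry in manifest["layers"]:                   # hypothetical schema
    reader = GGUFReader(entry["file"])
    assert len(reader.tensors) == entry["tensor_count"], entry["file"]
    del reader                                     # release before the next file
print("all layer files match the manifest")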

Future Vision: Where LemonSlice Goes

The layer-split format is a foundation, not a destination. Here's where this can go:

Layer Surgery

Mix and match transformer blocks from different models. Take the first 20 blocks from Qwen3.5-35B-A3B and the last 20 from Llama-3-70B. Swap in attention-only layers from a model with better reasoning capabilities. The file-per-block format makes this as simple as copying .gguf files into a directory and updating manifest.json.

Incremental Model Updates

When a model is fine-tuned and only 5 of 40 layers change, you only need to download those 5 layer files (~2.5 GB) instead of the entire 21.2 GB model. This is particularly powerful for:

  • LoRA/DoRA merges that affect specific layers
  • Safety fine-tunes that modify late-layer behavior
  • Domain-specific adapters (medical, legal, code) as layer replacements

Progressive Inference

Start generating tokens with just the first 10 layers. If the output is confident, you saved 75% of the compute. If not, load more layers and refine. This creates a natural speed/quality trade-off that adapts per-token.

Edge and Constrained Deployment

Run a 35B model on a Raspberry Pi by loading one layer at a time from an SD card. It will be slow, but it will work -- and that's impossible with a monolithic GGUF. LemonSlice makes the full model spectrum accessible: from "runs in 2 GB RAM but takes 5 minutes per token" to "runs in 48 GB VRAM at 100 tokens/sec" with the same files.

Layer-Level Benchmarking and Analysis

Benchmark individual blocks for compute time, memory bandwidth, and activation patterns. Identify which layers are bottlenecks. Compare SSM vs attention-only layer performance. The file-per-block format makes it trivial to isolate and profile any layer independently.

P2P Model Distribution

Share individual layers peer-to-peer. Someone can share just the attention-only layers (10 files, ~5 GB) while you already have the SSM layers from another source. The manifest verifies integrity regardless of download source.

Layer-Specific Fine-Tuning

Fine-tune only layers 30-39 (the "output-facing" layers) while keeping layers 0-29 frozen. With LemonSlice, this is just replacing files 30-39 with your fine-tuned versions. No need to save the entire 21.2 GB model when only 5 GB changed.

Hybrid Architecture Research

The Qwen3.5-35B-A3B model already uses a hybrid SSM + Attention pattern (visible in the layer files). LemonSlice makes it easy to experiment with:

  • Different SSM/Attention ratios
  • Reordering layers (attention layers at the start vs end)
  • Inserting custom layers between existing blocks
  • Removing the MoE experts from specific layers to test their contribution

Citation

Original model: Qwen/Qwen3.5-35B-A3B

@misc{qwen3.5,
  title={Qwen3.5 Technical Report},
  author={Qwen Team},
  year={2025},
  publisher={Hugging Face}
}

License

Apache 2.0 -- inherited from the original Qwen3.5-35B-A3B license.

Quantization by Unsloth.
