SDXL–Aleph (geolip-sdxl-aleph)
Experimental research artifact — Phase 0. This is not a finished general-purpose text-to-image model. It is the first stage of an effort to retrain Stable Diffusion XL around a different text encoder (Qwen in place of CLIP-G) plus an encoder-invariant geometric address, using a rectified-flow objective rather than the original ε-prediction one. It is part of the wider geolip program on geometric, constraint-driven representation learning.
Phase 0 trains a cross-attention LoRA on the SDXL UNet together with a small conditioning front-end, while every large component (UNet base, VAE, CLIP-L, Qwen) stays frozen. CLIP-L is kept real; only the CLIP-G half of the conditioning is replaced.
9-run LoRA sweep
Nine full training runs determine the best target for flow-matching conversion.
This is how we choose the conversion target before feeding in any major data.
We then calculate FID scores.
Bug present
CFG isn't calibrated correctly, so a lower CFG is required to get sensible output from the model. The symptoms are similar to what happened to Lune before it was retrained, so this model will very likely need a reinfusion as well.
Bug Solution
When the runs conclude, I will run one sweep per 10 epochs over the 4 image samples at cfg 1.0, cfg 2.0, and cfg 3.0.
Lune took a very long time to figure out CFG after conversion, so I expect something similar from our new SDXL model.
I have yet to decide on a name for the weights.
Preliminary results in
And they are UUUUGLYYYY, as expected. I didn't expect CFG to be so rigid, however.
https://huggingface.co/datasets/AbstractPhil/geolip-sdxl-fid-scoring
Based on this sweep, the most cost-effective option WAS IN FACT my original choice.
I have a very good track record just guessing.
THE TARGET will not be #1, however. We have two viable alternatives, the third on the list being the most promising.
We snap out CLIP_L and we snap out the CLIP_G POOLED vector, RETAINING the clip_g sequence.
This is our phase 1 goal: retrain SDXL to flow matching, using Qwen in place of clip_l and Qwen in place of the CLIP_G pooled vector.
I am fairly certain I know WHY the clip_g sequence failed, and it's because it was projected upward.
Typically, dense vectors projected upward like this will conform to the stronger of the two signals. In effect, the chosen Qwen model simply wasn't strong enough to fill the CLIP_G role, even with the aleph scaffolding.
SIMPLY TOO WEAK!
Without CLIP_G's strength, the model entered catastrophic fault. Essentially, sequential solidity couldn't fill the bag-of-tokens role. That's fine, and actually expected.
I have a viable solution for this, and it involves an aleph transformer. For now, we'll train this one!
86,000 images, ~12 epochs: roughly 1 million samples for a full finetune.
REGULAR CHECKUPS
Phase 1 training will get regular checkups and adjustments:
- FID score checkups every epoch, with no fewer than 100 images per cfg.
  - This is quite time consuming, but absolutely mission critical.
- Common prompt testing.
  - 4 times per epoch, 8 prompts, 3 cfg types to check.
- CFG tweaking, to ensure we are getting the most out of dropout.
  - SCHEDULED decrease in dropout over time, with high dropout at first.
- SHIFT timing adjusted TWICE through the training.
  - Epoch 0 = shift 2 <- sd15-flow-lune native pretrain
  - Epoch 4 = shift 2.25
  - Epoch 8 = shift 2.5
  - Lune trained under SHIFT 2 for a HUGE portion of pretraining.
  - LATER Lune was moved to shift 2.5 AFTER coalescence and convergence.
  - MUCH LATER Lune finetunes were moved BACK to 2 and 3 as testing agents.
  - Shift 2 and Shift 3 both provided usable FID offsets once shift-agnostic behavior had been trained in.
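The shift schedule in the list above can be sketched as a small lookup. This is a minimal illustration, not the project's scheduler: the helper name `shift_for_epoch` is hypothetical, and it assumes the shift changes in hard steps at epochs 4 and 8 rather than being interpolated between them.

```python
# Hypothetical helper: step-wise shift schedule keyed by 0-indexed epoch.
# The (start_epoch, shift) pairs come from the checklist; the hard-step
# behaviour between them is an assumption.
SHIFT_SCHEDULE = [(0, 2.0), (4, 2.25), (8, 2.5)]


def shift_for_epoch(epoch: int) -> float:
    """Return the shift value in effect during the given epoch."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    current = SHIFT_SCHEDULE[0][1]
    for start_epoch, shift in SHIFT_SCHEDULE:
        if epoch >= start_epoch:
            current = shift
    return current


for epoch in range(12):
    print(f"epoch {epoch:2d} -> shift {shift_for_epoch(epoch)}")
```

Keeping the schedule as data makes it easy to log next to each FID checkup, so a jump in score can be matched to a shift change.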
Why this exists
A cheaper path — freezing the UNet and training an adapter to impersonate CLIP-G from a different encoder — was measured and ruled out: a pooled text vector transfers across encoder architectures (a rotation lands it in CLIP-G's space at reference quality), but CLIP's per-token sequence does not (cosine caps around 0.5, i.e. texture, not content). So the conditioning here is built from a pooled Qwen representation, and the UNet is allowed to co-adapt to it rather than being asked to read a faithful CLIP imitation.
Separately, the aleph address is computed from the bytes of the caption, not from any text encoder. It is therefore identical across the CLIP-G→Qwen swap — a deterministic, scale- and patch-agnostic geometric scaffold the UNet can lean on while the encoder changes underneath it. It enters as a near-zero-initialised anchor so the model starts well-behaved and learns to use the address as training proceeds.
Architecture
Frozen: SDXL UNet base weights · SDXL VAE · SDXL CLIP-L text encoder · Qwen3.5 (0.8B).
Trainable: cross-attention LoRA (attn2, rank 16) + the conditioning front-end
(pool_proj, a learned positional table, and a near-zero addr_adapter).
Text representation (Qwen, pooled — the proven causal-LM extraction): the caption is chat-templated, optionally re-described via two-shot generation, then encoded with last-token pooling (the [EOS]-position hidden state that aggregates the sequence under causal masking) — not mean-pooling, which smears that aggregate. This yields one rich pooled vector per caption.
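To make the pooling distinction concrete, here is a toy sketch of last-token pooling next to mean-pooling. It uses plain Python lists so it runs without a tensor library; the function names are illustrative and are not the project's `QwenPooledEncoder`. It assumes an attention mask where 1 marks real tokens and 0 marks padding.

```python
from typing import List

Vector = List[float]


def last_token_pool(hidden: List[List[Vector]], mask: List[List[int]]) -> List[Vector]:
    """Per sequence, return the hidden state at the last non-padded position."""
    pooled = []
    for states, m in zip(hidden, mask):
        real_positions = [i for i, flag in enumerate(m) if flag == 1]
        if not real_positions:
            raise ValueError("sequence contains no real tokens")
        pooled.append(list(states[real_positions[-1]]))
    return pooled


def mean_pool(hidden: List[List[Vector]], mask: List[List[int]]) -> List[Vector]:
    """Per sequence, average the hidden states over non-padded positions."""
    pooled = []
    for states, m in zip(hidden, mask):
        real = [s for s, flag in zip(states, m) if flag == 1]
        width = len(real[0])
        pooled.append([sum(s[d] for s in real) / len(real) for d in range(width)])
    return pooled


# Two toy sequences of length 3 with 2-dim hidden states; the first is padded.
hidden = [
    [[1.0, 0.0], [2.0, 0.0], [9.0, 9.0]],
    [[0.0, 1.0], [0.0, 3.0], [0.0, 5.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]

print(last_token_pool(hidden, mask))
print(mean_pool(hidden, mask))
```

Under causal masking only the final real position has attended to the whole caption, which is why the last-token state is taken as the aggregate here.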
Conditioning assembly (cross-attention encoder_hidden_states, width 2048):
| positions | CLIP-L half (768) | CLIP-G half (1280) |
|---|---|---|
| 0…76 | real CLIP-L penultimate hidden states | pool_proj(qwen_pool) broadcast + learned positional offsets |
| 77…77+N | zeros | addr_adapter(aleph_address) (near-zero init) |
The same pooled-projected vector also supplies the SDXL added-condition text_embeds
(1280). Micro-conditioning time_ids are the standard [H, W, 0, 0, H, W].
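The layout in the table can be sketched with plain lists to check the shapes. This is an assumption-laden illustration rather than the real `SDXLQwenFrontEnd`: it assumes the positional offsets are added to the broadcast pooled vector, that the projections have already been applied, and that N = 32 address tokens (taken from the (32, 128) address shape in the usage sketch). The helper names are hypothetical.

```python
CLIP_L_WIDTH = 768    # CLIP-L half
CLIP_G_WIDTH = 1280   # half formerly filled by CLIP-G
SEQ_LEN = 77          # text positions 0..76


def assemble_conditioning(clip_l_seq, pooled_g, pos_table, addr_tokens):
    """Build encoder_hidden_states rows of width 768 + 1280 = 2048.

    clip_l_seq  : 77 rows x 768   real CLIP-L penultimate hidden states
    pooled_g    : 1280            stand-in for pool_proj(qwen_pool)
    pos_table   : 77 rows x 1280  learned positional offsets
    addr_tokens : N rows x 1280   stand-in for addr_adapter(aleph_address)
    """
    rows = []
    for l_row, offsets in zip(clip_l_seq, pos_table):
        # Broadcast the single pooled vector to every position, then offset it.
        g_row = [p + o for p, o in zip(pooled_g, offsets)]
        rows.append(list(l_row) + g_row)
    for a_row in addr_tokens:
        # Address positions carry zeros in the CLIP-L half.
        rows.append([0.0] * CLIP_L_WIDTH + list(a_row))
    return rows


def make_time_ids(height, width):
    """SDXL micro-conditioning: original size, crop offset (0, 0), target size."""
    return [height, width, 0, 0, height, width]


# Toy inputs with the right shapes (values are placeholders, not real features).
clip_l_seq = [[1.0] * CLIP_L_WIDTH for _ in range(SEQ_LEN)]
pooled_g = [0.5] * CLIP_G_WIDTH
pos_table = [[0.0] * CLIP_G_WIDTH for _ in range(SEQ_LEN)]
addr_tokens = [[0.0] * CLIP_G_WIDTH for _ in range(32)]  # near-zero init stand-in

ehs = assemble_conditioning(clip_l_seq, pooled_g, pos_table, addr_tokens)
print(len(ehs), len(ehs[0]))
print(make_time_ids(1024, 1024))
```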
Objective (rectified flow, "Lune" recipe): with shift = 2.5,
s ~ U(0,1); s' = shift·s / (1 + (shift-1)·s); t = s'·1000
x_t = (1 - s')·x0 + s'·noise; target v = noise - x0; loss = MSE(unet(x_t, t, cond), v)
10% classifier-free-guidance dropout (the full conditioning is zeroed). SDXL VAE scaling is
0.13025. Sampling is an Euler integration of the velocity field from σ=1 to σ=0 — not
the default SDXL scheduler.
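The training-side quantities above can be sketched on flat lists of floats. This is a minimal illustration of the formulas as written, not the training script: the helper names are hypothetical, latents are reduced to short 1-D lists, and the UNet call is left out.

```python
import random


def shift_time(s, shift=2.5):
    """Warp a uniform s in [0, 1]: s' = shift*s / (1 + (shift - 1)*s)."""
    return shift * s / (1.0 + (shift - 1.0) * s)


def make_training_pair(x0, noise, s, shift=2.5):
    """Return (x_t, t, v) for one sample under the rectified-flow recipe."""
    s_shifted = shift_time(s, shift)
    t = s_shifted * 1000.0                       # timestep fed to the UNet
    x_t = [(1.0 - s_shifted) * a + s_shifted * n for a, n in zip(x0, noise)]
    v = [n - a for a, n in zip(x0, noise)]      # velocity target: noise - x0
    return x_t, t, v


def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def drop_conditioning(cond, rng, p_drop=0.10):
    """CFG dropout: with probability p_drop, zero the whole conditioning."""
    return [0.0] * len(cond) if rng.random() < p_drop else list(cond)


rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(8)]     # stand-in for a scaled latent
noise = [rng.gauss(0.0, 1.0) for _ in range(8)]
x_t, t, v = make_training_pair(x0, noise, s=rng.random())
print(f"t = {t:.1f}, loss against a zero prediction = {mse([0.0] * 8, v):.4f}")
```

A useful sanity check is that `x_t - s'·v` recovers `x0` exactly, which follows directly from the two definitions.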
Training data
AbstractPhil/sdxl-qwen-phase0
— SDXL-self-generated (image, caption, aleph_address) triples. The SDXL render is the
flow-matching target (x0 source); the student learns to reproduce it from the new
conditioning. See that dataset's card for construction details.
Training procedure
| Setting | Value |
|---|---|
| Optimizer | AdamW (β=(0.9, 0.999), wd 0.01) — to be replaced with pure Adam in a later pass |
| Learning rate | 1e-4 with linear warmup (LoRA + front-end are trained from scratch) |
| Grad clip | 1.0 |
| Precision | fp32 (bf16 optional) |
| LoRA | rank 16, α 16, targets attn2.{to_q,to_k,to_v,to_out.0} |
| Resolution | 1024×1024 (128×128 latent) |
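As a rough illustration of what the LoRA row selects, the sketch below filters module names by the listed `attn2` targets. The helper `is_lora_target` is hypothetical, and the example module names only imitate the dotted naming style of a diffusers UNet. The `alpha / rank` scaling is the usual LoRA convention and is assumed here rather than confirmed by the training script.

```python
# Suffixes taken from the table: cross-attention (attn2) projections only.
LORA_TARGET_SUFFIXES = (
    "attn2.to_q",
    "attn2.to_k",
    "attn2.to_v",
    "attn2.to_out.0",
)
LORA_RANK = 16
LORA_ALPHA = 16


def is_lora_target(module_name: str) -> bool:
    """True if the module is one of the cross-attention projections."""
    return module_name.endswith(LORA_TARGET_SUFFIXES)


def lora_scale(alpha: float = LORA_ALPHA, rank: int = LORA_RANK) -> float:
    """Conventional LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


# Illustrative names only; real names depend on the UNet implementation.
example_names = [
    "down_blocks.1.attentions.0.transformer_blocks.0.attn1.to_q",
    "down_blocks.1.attentions.0.transformer_blocks.0.attn2.to_q",
    "mid_block.attentions.0.transformer_blocks.0.attn2.to_out.0",
    "mid_block.attentions.0.transformer_blocks.0.ff.net.2",
]
for name in example_names:
    print(is_lora_target(name), name)
print("scale:", lora_scale())
```

With rank and α both at 16, the scale comes out to 1.0, so the low-rank update is applied without extra rescaling.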
Latents, CLIP-L sequences, and Qwen-pooled vectors are precomputed once and cached; the trainable parts then run on the cached frozen features.
How to use
This is a checkpoint of LoRA + front-end weights, not a standard diffusers pipeline. Loading and sampling require the front-end module and the rectified-flow sampler that ship with the training code. Sketch:
# 1. load frozen SDXL UNet, apply the cross-attn LoRA, load LoRA weights
# 2. instantiate SDXLQwenFrontEnd, load its state_dict
# 3. load frozen Qwen (rich-pooled encoder), CLIP-L, VAE, and the aleph model
# 4. for a prompt:
#      qwen_pool  = QwenPooledEncoder.encode([prompt])    # rich, last-token pooled
#      clip_l_seq = CLIP-L penultimate hidden states      # real, 77×768
#      address    = aleph address from the caption bytes  # (32, 128)
#      ehs, text_embeds = frontend(qwen_pool, clip_l_seq, address)
# 5. Euler-integrate the velocity field (σ: 1→0) with added_cond_kwargs, then VAE-decode
(The QwenPooledEncoder, SDXLQwenFrontEnd, and the sampler are defined in the project's
Phase-0 training script.)
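Step 5 can be sketched independently of the model. The following is a toy Euler integrator over plain lists, not the project's sampler: `velocity_fn` stands in for the UNet call, the step count is arbitrary, and the use of the same shift warp on the sampling grid is an assumption. The CFG helper uses the standard guidance combination.

```python
def shift_time(s, shift=2.5):
    """Same warp as in training: s' = shift*s / (1 + (shift - 1)*s)."""
    return shift * s / (1.0 + (shift - 1.0) * s)


def cfg_velocity(v_cond, v_uncond, cfg):
    """Classifier-free guidance on velocities; cfg = 1.0 returns v_cond."""
    return [u + cfg * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(velocity_fn, x, num_steps=30, shift=2.5):
    """Integrate dx/dsigma = v from sigma = 1 (noise) down to sigma = 0 (data).

    velocity_fn(x, t) -> predicted velocity, with t = sigma * 1000.
    """
    sigmas = [shift_time(1.0 - i / num_steps, shift) for i in range(num_steps + 1)]
    for sigma_cur, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        v = velocity_fn(x, sigma_cur * 1000.0)
        # sigma_next < sigma_cur, so each step moves against v, toward the data.
        x = [xi + (sigma_next - sigma_cur) * vi for xi, vi in zip(x, v)]
    return x


# Toy check with an oracle velocity field for a known target x0:
# on the straight path x = x0 + sigma*(noise - x0), v = (x - x0) / sigma.
target = [0.5, -1.0, 2.0]
start_noise = [1.0, 1.0, 1.0]


def oracle(x, t):
    sigma = t / 1000.0
    return [(xi - ti) / sigma for xi, ti in zip(x, target)]


result = euler_sample(oracle, list(start_noise), num_steps=8)
print([round(r, 6) for r in result])
```

In real use, `velocity_fn` would run the UNet twice per step (conditioned and zeroed conditioning), combine the two with `cfg_velocity`, and pass `added_cond_kwargs` along; the final latent is then divided by the VAE scaling factor and decoded.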
Status, limitations, intended use
- Research only. Phase 0 is a feasibility stage; expect rough or incomplete generations. It is not a drop-in SDXL replacement and does not use the standard SDXL sampler.
- The open question Phase 0 answers is qualitative: do samples show content (the pooled Qwen signal + the aleph address actually steering the denoise) or merely texture? Content advances the program to a full-UNet Phase 1; texture sends the conditioning design back for revision.
- Only the CLIP-G→Qwen swap is made here; CLIP-L is still the original encoder.
- The aleph address is a surface-form reconstruction code, not a semantic embedding (see the dataset card) — it contributes geometric structure, not meaning.
- Inherits the biases and failure modes of SDXL, Qwen, and the caption source.
Licensing
The UNet/VAE/CLIP components and any weights derived from them are governed by the
CreativeML Open RAIL++-M license of SDXL base, and use is subject to that license's
restrictions. The Qwen text encoder carries its own license (Apache-2.0 for the Qwen3 line);
the aleph model comes from AbstractEyes/geolip-svae.
Confirm each component's terms before redistribution or deployment.
Acknowledgements / context
Built on Stability AI's Stable Diffusion XL and the Qwen text models. The rectified-flow recipe follows the Lune SD1.5 flow-matching lineage; the aleph address comes from the geolip-svae spectral-VAE work. Part of the geolip program (geofractal towers/routers, geolip-core constellation, geovocab) on parameter-efficient geometric learning.
Maintainer: AbstractPhil. Status: active research, Phase 0.
