Breadcrumb boxes summarize what you just learned, connect it to the tenets, and point to what’s coming Next. Think of them as narrative signposts to help you keep track.
We will get started by enumerating the tenets. Then we’ll look at concrete examples that show how they shape our decision-making. These examples are necessarily detailed, and sometimes complex, because they illustrate the challenges of maintaining and growing a large codebase that caters to multiple collectives, has millions of users and hundreds of contributors, and always strives for simplicity and consistency.
We summarize the foundations on which we’ve built everything, and write down the “tenets” of the library. They behave like software interfaces, so it is crucial that they are explicitly written down. However opinionated they are, they have evolved over time.
These principles were not decided in a vacuum. The library evolved towards them, and once they emerged, they were recognized as critical.
Source of Truth
We aim to be a source of truth for all model definitions. This is more of a goal than a tenet, but it strongly guides our decisions. Model implementations should be reliable, reproducible, and faithful to the original implementations. If we are successful, they should become reference baselines for the ecosystem, so they’ll be easily adopted by downstream libraries and projects. It’s much easier for a project to always refer to the transformers implementation, than to learn a different research codebase every time a new architecture is released.
This overarching guideline ensures quality and reproducibility across all models in the library, and aspires to make the community work easier.
One Model, One File
All inference and training core logic has to be visible, top‑to‑bottom, to maximize each model’s hackability.
Every model should be completely understandable and hackable by reading a single file from top to bottom.
Code is Product
Optimize for reading, diffing, and tweaking; our users are power users. Variable names should be explicit full words, even several words: readability is paramount.
Code quality matters as much as functionality - optimize for human readers, not just computers.
Standardize, Don’t Abstract
If it’s model behavior, keep it in the file; use abstractions only for generic infra.
Model-specific logic belongs in the model file, not hidden behind abstractions.
DRY* (DO Repeat Yourself)
Copy when it helps users; keep successors in sync without centralizing behavior.
Evolution: With the introduction and global adoption of modular transformers, we do not repeat any logic in the modular files, but end user files remain faithful to the original tenet.
Strategic duplication can improve readability and maintainability when done thoughtfully.
Minimal User API
Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. We want the fewest possible codepaths. Reading should be obvious, configurations should be obvious.
Keep the public interface simple and predictable, users should know what to expect.
Backwards Compatibility
Evolve by additive standardization, never break public APIs.
Any artifact that was once on the hub and loadable with transformers should be usable indefinitely with the same interface. Further, public methods should not change, to avoid breaking dependencies. If we do deprecate something, it comes with very long deprecation cycles.
Once something is public, it stays public, evolution through addition, not breaking changes.
Consistent Public Surface
Same argument names, same outputs, hidden states and attentions exposed, enforced by tests. This is a goal as well as a tenet.
All models should feel familiar - consistent interfaces reduce cognitive load.
When a PR is merged, it is because the contribution is worthwhile, and because the transformers team finds the design of the contribution to be aligned with the tenets.
Does all the code in the library strictly follow these tenets? No. The library is a gigantic house with connected nooks, corridors, and crannies everywhere, built by thousands of different workers. We try to make sure all added code is compliant, because if we fail and merge it, we cannot change it later lest we break backwards compatibility.
To see what constitutes adherence to the tenets, let’s take the example of code repetition.
The following function, essential to the implementation of Rotary Positional Embeddings, can be found in more than 70 modeling_<file>.py files across src/transformers/models/. Why keep it? Because we want all the model logic to be contained in the modeling file (one model, one file). In order to do that, we do repeat ourselves (DO repeat yourself).
```python
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)
```
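For context, here is a minimal, self-contained sketch of how rotate_half is consumed, modeled on the library's apply_rotary_pos_emb helper (tensor shapes and the trivial cos/sin values below are illustrative):

```python
import torch

def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    # Broadcast cos/sin over the head dimension, then mix each vector
    # with its rotated half, as in the library's helper.
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

# Illustrative shapes: (batch, heads, seq_len, head_dim)
q = torch.randn(1, 2, 4, 8)
k = torch.randn(1, 2, 4, 8)
cos = torch.ones(1, 4, 8)   # angle 0 everywhere: cos=1, sin=0 is the identity rotation
sin = torch.zeros(1, 4, 8)
q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)
```

With a zero angle the rotation is the identity, which makes the broadcasting easy to sanity-check before plugging in real position-dependent cos/sin tables.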
We want all models to have self-contained modeling code. Every core functionality must be in the modeling code, every non-core functionality can be outside of it.
This comes at a great cost. For years, we have used what we call the #Copied from... mechanism: we added comments of a specific format documenting that some code was copied from another model, saving time both for the reviewers and for the CI: we had tooling to ensure that the copied blocks remained in sync.
But the LOC count kept creeping up. Each new model copied over hundreds of lines that we considered largely boilerplate, yet, we could not remove them.
We needed to separate two principles that had so far been intertwined: repetition (DO repeat yourself) and hackability (one model, one file).
What was the solution to this? Let’s talk about modular transformers.
Next: how modular transformers honor these while removing boilerplate.
Modular transformers
Transformers is an opinionated library. The previous philosophy page, and the blog post, were already pointing at the drawbacks mentioned just above, which have been iteratively addressed. Modular transformers was introduced to allow a form of inheritance without breaking the one model, one file rule.
We amended the principle of DRY* by progressively removing all pieces of code that were “copied from” another file.
It works as follows: in order to contribute a model, GLM for instance, we define a modular_ file that can inherit from any function across all the modeling, configuration, and processor files already existing in the library.
The modular file can use inheritance across models, and it will then be unravelled into a fully functional modeling file.
Left: Clean modular definition with inheritance. Right: Auto-expanded version with all inherited functionality visible.
As you can see, we can define a new model as a modular combination of fragments taken from others.
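Conceptually, the relationship can be sketched with plain Python inheritance (a toy; class names here are illustrative, not the real transformers classes):

```python
# Toy stand-ins for existing modeling code (illustrative names only).
class LlamaMLPSketch:
    def forward(self, x):
        return [v * 2 for v in x]

# In a "modular" definition we would only write the differences; here there are none.
class GlmMLPSketch(LlamaMLPSketch):
    pass

# The modular converter's job, conceptually: materialize inherited members into
# the generated modeling file so it reads top-to-bottom in one place, while the
# runtime behavior is identical either way.
glm = GlmMLPSketch()
```

The difference from ordinary inheritance, as the article explains next, is that the expansion is written out rather than left implicit.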
You might think “well, that’s just how inheritance works”. The crucial difference is that we visibly do what is essentially the compiler’s job: by unrolling the inheritances, we make all of the modeling code visible, keeping it in one piece (one model, one file).
You can see below the difference between GlmAttention and LlamaAttention, the former having been copied from the latter with minimal changes.
Figure 1: Comparison of attention implementations between Llama and GLM, showing code reuse with minimal modifications.
What is the consequence? When adding a model, we do not need to go over the entire modeling file. The modular (left side above) is enough.
When AutoModel.from_pretrained(...) is called, it is indeed the modeling code (right side) that is run, and all the tests run on that modeling code.
More importantly, the auto-generated modeling file is what users read to understand the code, what they step through in their debuggers and what they hack for their needs.
What does that give us?
TL;DR: a small modular_*.py declares reuse; the expanded modeling file stays visible and unique (one model, one file preserved). Reviewers and contributors maintain the shard, not the repetition.
Next: the measurable effect on effective LOC and maintenance cost.
The effect of modular can be measured in lines of code (LOC). If a model only has a modeling file, we add its LOC count.
However, if a model has a modular_*.py and a corresponding automatically generated modeling_*.py, we only count the LOC under the modular file. The modeling code has no maintenance cost, as it is strictly dependent on the modular file.
That gives an “effective LOC” curve: the maintenance surface.
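The counting rule above can be sketched as follows (model names and LOC figures are made up for illustration):

```python
def effective_loc(models):
    """Sum each model's maintained LOC: the modular shard when one exists,
    otherwise the full modeling file."""
    total = 0
    for counts in models.values():
        if counts.get("modular") is not None:
            total += counts["modular"]  # generated modeling file has no maintenance cost
        else:
            total += counts["modeling"]
    return total

repo = {
    "llama": {"modeling": 1200, "modular": None},  # no modular file: full cost
    "glm": {"modeling": 1100, "modular": 180},     # only the shard counts
}
```

Here the effective maintenance surface is 1200 + 180 lines, even though the repository ships 2300 lines of modeling code.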
Measured on git history, raw modeling_*.py grew at ~362 LOC/day before modular; counting only modular shards yields ~25 LOC/day after — about 15× lower. The effective curve (blue line below) represents the maintenance surface today: what maintainers actually read and review.
Less code to hand-maintain means fewer places to break. Naturally, LOC is not a direct measure of complexity, but it correlates with review effort and change risk.
The blue line (effective) is the sum of the red + green, whereas the yellow would have been the progression without modular. We can see that the maintenance surface is essentially constant (in LOC) since the implementation of modular.
If you zoom in, you’ll notice there’s a sharp drop near the end; it’s essentially due to us removing support for Jax and TensorFlow library-wide.
But this was not the only effort that allowed us to reduce maintenance load.
We recently underwent a deep refactor of the attention implementation. You’ve likely heard about flash attention and its several variants.
The attention computation itself happens at a lower level of abstraction than the model itself.
However, we were adding specific torch operations for each backend (sdpa, the several flash-attention iterations, flex attention), and the result wasn’t a minimal user API. The next section explains what we did.
Evidence: effective (i.e., maintainable) LOC growth drops ~15× when counting shards instead of expanded modeling files. Less code to read, fewer places to break.
Next: how the attention interface stays standard without hiding semantics.
External Attention classes
The solution for the “attention abstraction problem” was to move to a standard attention interface that allows the following:
The naive implementation of attention, called “eager”, is available by default. We use a Callable called eager_attention_forward, which can run as long as the user has PyTorch installed – which is a requirement anyway.
Instead of using a class interface and a class hierarchy, we just moved to a function interface. When a more complex attention implementation is needed, we use other Callables, including much faster kernel bindings when available. The decision to use a different attention implementation is based on the model configuration file we download from the Hub, and it can also be overridden by the user.
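As a self-contained sketch of that function-interface idea (a simplified toy; the function and registry names below are illustrative, not the exact transformers internals):

```python
from typing import Callable, Dict

def eager_attention_forward(query, key, value, **kwargs):
    # Placeholder body: a real implementation computes softmax(QK^T / sqrt(d))V.
    return "eager", query

def sdpa_attention_forward(query, key, value, **kwargs):
    # Placeholder for a torch.nn.functional.scaled_dot_product_attention binding.
    return "sdpa", query

# Illustrative registry mapping implementation names to callables.
ATTENTION_FUNCTIONS: Dict[str, Callable] = {
    "eager": eager_attention_forward,
    "sdpa": sdpa_attention_forward,
}

def select_attention(attn_implementation: str) -> Callable:
    # The chosen name comes from the model configuration downloaded from
    # the Hub, and can be overridden by the user; unknown names fall back
    # to the always-available eager path.
    return ATTENTION_FUNCTIONS.get(attn_implementation, eager_attention_forward)
```

The point of the sketch: swapping implementations is a dictionary lookup over plain functions, not a class-hierarchy change.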
This is a clear example that we prefer an interface that is standard, but not abstract.
Having the attention interfaces functionalized also allows dynamic switching of attention implementations, increasing their hackability.
Another strength of the new attention interface is the possibility to enforce specific kwargs, which are needed by kernel providers and other dependencies.
Backend integrations sometimes require specific kwargs.
We know that kwargs are often a necessary evil that plagues tools with widespread compatibility; they are something we have aimed to reduce, and will continue to reduce, in order to improve readability. With them, the current system is a minimal user API.
We reduce that surface and document expectations; where flexibility is necessary, we plan to use typing.Annotated to convey shapes and invariants without constraining integrations. Such an implementation could look like this in the future:
```python
from typing import Annotated

# Hypothetical sketch: convey the expected shape as metadata (names illustrative).
AttentionMask = Annotated["torch.Tensor", "batch, 1, query_len, kv_len"]
```
Why does it matter?
Because we want to avoid code modifications that are unrelated to the model.
We choose to place the level of abstraction higher than the device placement: a matrix multiplication - a nn.Linear layer - should be always expressed in the same way, regardless of how it is placed.
Hence, we want to touch the modeling code as little as possible, and only modify it when architectural changes are involved – not depending on the way you run it. For tensor parallelism, we simply specify a tp_plan:

```python
# In the model's config (example: ERNIE 4.5-style decoder blocks)
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}

# Runtime
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your/model-or-local-checkpoint"
model = AutoModelForCausalLM.from_pretrained(  # will automatically map to the plan defined above
    model_id,
    dtype=torch.bfloat16,
)
tok = AutoTokenizer.from_pretrained(model_id)
inputs = tok("Hello", return_tensors="pt").to(model.device)
out = model(**inputs)
```
The plan is written once, saved as part of the config and passed to .from_pretrained(). It maps module name patterns to partitioning strategies. Strategies are resolved by the internal ParallelInterface, which wires to sharding implementations ColwiseParallel, RowwiseParallel, packed variants, and so on.
The alternative would be to modify classes depending on supported types of parallelism.
The tp_plan solution allows users to run the same model on a single GPU, or distribute it using multiple processes per node, e.g. 4 GPUs:
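How a plan's name patterns map onto concrete modules can be sketched with simple glob matching (a toy; the real resolution via ParallelInterface does more):

```python
import fnmatch

# Toy tp_plan: wildcard module-name patterns mapped to strategy names
# (strings stand in for the real ColwiseParallel / RowwiseParallel objects).
base_model_tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
}

def resolve_strategy(module_name, plan):
    # First matching pattern wins; unmatched modules stay replicated.
    for pattern, strategy in plan.items():
        if fnmatch.fnmatch(module_name, pattern):
            return strategy
    return None
```

Because the plan lives in the config rather than the code, the same nn.Linear definition serves both the single-GPU and the sharded case.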
This is minimal to implement on the user side, and it keeps the modeling code untouched. It is also easy to tweak.
Allowed layer types are explicit; schedules (e.g., sliding/full alternation) live in config. This keeps the file readable and easy to tweak.
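As a hypothetical sketch of such a schedule (the key and value names follow the sliding/full alternation described above; the exact layout varies per model):

```python
# Hypothetical config fragment (illustrative; not taken from a specific model).
layer_types = [
    "sliding_attention",
    "full_attention",
    "sliding_attention",
    "full_attention",
]
```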
Next: speedups come from kernels that don’t change semantics.
Community Kernels
The same principle extends to normalization, activation, and other code paths. The model defines semantics; a kernel defines how to execute them faster. We annotate the module to borrow a community-provided forward, keeping a consistent public surface:

```python
@use_kernel_forward_from_hub("RMSNorm")  # decorates the model's RMSNorm class
```
So I wanted to take a look at the current state of modularity across the library.
So what do we see?
(Graph reading guide: nodes are models; edges are modular imports).
Check out the full viewer here (tab “dependency graph”, hit “build graph”) for better manipulation and exploration.
Let’s walk through some sections of this graph together.
First, Llama is a basis and an influence for many models, and it is very visible.
Figure 2: Llama as a central model influencing many other models in the dependency graph.
The models linked sometimes pull components from models other than Llama, of course. Radically different architectures such as Mamba have spawned their own dependency subgraph.
Audio models form sparser archipelagos; see for instance wav2vec2, which is a significant basis for a dozen of them.
Figure 3: Cluster of audio architectures based on wav2vec2, forming a specialized archipelago.
In the case of VLMs, which have massively grown in popularity since 2024, there are far too many vision-based architectures that are not yet defined as modulars of other existing architectures. In other words, there is no strong software reference point for vision models.
As you can see, there is a small DETR island:
Figure 4: Small DETR archipelago for vision models, less centralized than Llama for text.
There is also a little llava pocket, and so on, but it’s not comparable to the centrality observed for llama.
Another problem: this visualization only shows modular models. Several models still do NOT have a modular file. If we zoom out significantly, we can see them; the red nodes are models that do not have a modular file yet.
Figure 5: Overview showing red nodes (models without modular files) to be modularized.
Hence the next question: how do we identify modularisable models?
Llama-lineage is a hub; several VLMs remain islands, an engineering opportunity for shared parents.
Next: timeline + similarity signals to spot modularisable candidates.
I looked into Jaccard similarity, which we use to measure set differences, to find similarities across models. I know that code is more than a set of characters strung together. We also tried code-embedding models, which ranked candidates better in practice, but for this post we stick to the deterministic Jaccard index.
It is interesting, for our comparison, to look at when we deployed the modular logic and what its rippling effect on the library was. Looking at the timeline makes it obvious: adding modular allowed us to connect more and more models to solid reference points.
Yet, we still have a lot of gaps to fill.
Zoom out below: it’s full of models. You can click on a node to see its connections better, or use the text box to search for a model. You can use the full viewer (tab “timeline”, hit “build timeline”) for better exploration.
Let’s look at a few highly connected models, starting with the foundational work of LLaVA.
Figure 6: LLaVA and its variants in the timeline, with llava_video as a candidate for modularization.
You see that llava_video is a red node, connected by a red edge to llava: it’s a candidate, something we could likely remodularize without touching the actual model (backwards compatibility), while becoming much more readable with DRY*.
The same can be identified with the classical encoders family, centered on BERT:
Here roberta, xlm_roberta, and ernie are modulars of BERT, while models like mobilebert are likely candidates.
Figure 7: Family of classical encoders centered on BERT, with several models already modularized.
Similarity metrics (Jaccard index or embeddings) surface likely parents; the timeline shows consolidation after modular landed. Red nodes/edges = candidates (e.g., llava_video → llava) for refactors that preserve behavior.
Next: concrete VLM choices that avoid leaky abstractions.
We don’t yet have a cookbook for common VLM patterns (image token scatter, multi‑tower encoders, cross‑attention bridges). This is one of the main improvement points where we can work.
For instance, we thought of abstracting away the mixing of inputs_embeds, the tensor fed into the LLM decoder in 95% of existing VLMs. It would have looked something like a shared helper, but that would go against standardize, don’t abstract: model-specific logic belongs in the model file, not hidden behind abstractions. The embedding mixing is part of the model, and removing it would break it. A user opening modeling_qwen2.5_vl (check out the Qwen2.5VL collection) should not have to go to another file to understand how it works.
What is the current state of these “abstractions” across the codebase?
You will see all the imports around a modeling file, here Gemma3n.
Figure 8: Gemma3n import graph showing dependency complexity, with GenerationMixin very central.
As you can see, the GenerationMixin node is already very heavy. It encompasses all of the utilities around .generate, it is second only to nn.Module.
That means every decision we make to abstract something else has to be extremely careful.
The following Pull request to standardize placeholder masking is a good example of what kind of changes are acceptable. In a VLM, we always need to insert embeddings from various encoders at various positions, so we can have a function to do it. For Qwen2 VL, for instance, it will look like this:
```python
# Sketch (simplified): boolean masks marking where image/video placeholder
# tokens sit in the input sequence, so encoder outputs can be scattered there.
def get_placeholder_mask(input_ids, image_token_id, video_token_id):
    special_image_mask = input_ids == image_token_id
    special_video_mask = input_ids == video_token_id
    return special_image_mask, special_video_mask
```
But this is within the modeling file, not in the PreTrainedModel base class. It will not move away from it, because that would break the one model, one file tenet.
What do we conclude? Going forward, we should aim for VLMs to have a form of centrality similar to that of Llama for text-only models. This centrality should not be achieved at the cost of abstracting and hiding away crucial inner workings of said models.
Keep VLM embedding mix in the modeling file (semantics), standardize safe helpers (e.g., placeholder masking), don’t migrate behavior to PreTrainedModel.
Next: pipeline-level wins that came from PyTorch-first choices (fast processors).
Deciding to become a torch-first library meant shedding a tremendous amount of support code for jax and TensorFlow, and it also meant that we could be more lenient about the amount of torch-dependent utilities we were able to accept. One of these is the fast processing of images. Where inputs were once minimally assumed to be ndarrays, enforcing native torch and torchvision inputs allowed us to massively improve processing speed for each model.
The gains in performance are immense: up to 20x speedup for most models when using compiled torchvision ops. Furthermore, it lets us run the whole pipeline solely on GPU.
Figure 9: Performance gains of fast image processors, up to 20x acceleration with compiled torchvision.
PyTorch-first lets processors assume torch/torchvision and run the whole pipeline on GPU; big per-model speedups.
Next: how this lowers friction for contributors and downstream users.
This is an overall objective: there’s no transformers without its community.
The shape of a contribution: add a model (or variant) with a small modular shard; the community and serving stacks pick it up immediately. Popularity trends (encoders/embeddings) guide where we invest.
Next: power tools enabled by a consistent API.
Model popularity
Talking about dependencies, we can take a look at the number of downloads as a measure of popularity. One thing we see is the prominence of encoders, despite the apparent prevalence of decoder LLMs. The reason is that encoders are used to generate embeddings, which have multiple downstream uses. Just check out EmbeddingGemma for a modern recap. Hence, it is vital to keep the encoders portion of the library viable, usable, and fine-tunable.
As the codebase grows, we need to maintain it in coordination with our friends at the Sentence Transformers codebase. Retrieval use-cases, smart databases, and FAISS-based indexing rely on it, and thus indirectly on transformers.
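To illustrate why encoder embeddings matter downstream, here is a toy, brute-force version of the retrieval pattern. A real stack would use an encoder model (e.g., via Sentence Transformers) to produce the vectors and FAISS to index them; the vectors below are made up.

```python
# Brute-force embedding retrieval: rank documents by cosine
# similarity to a query vector. (Toy vectors; a real system would
# encode text with an encoder model and use an ANN index like FAISS.)
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, index):
    """index: list of (doc_id, vector). Returns entries sorted by similarity."""
    return sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)

index = [("doc_a", [1.0, 0.0]), ("doc_b", [0.0, 1.0]), ("doc_c", [0.7, 0.7])]
ranked = search([1.0, 0.1], index)
```

Swapping the made-up vectors for real encoder outputs is all it takes to turn this into semantic search, which is why keeping encoders healthy pays off across the ecosystem.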
In that regard, we DO want to be a modular toolbox, being minimal enough and well documented enough so any ML/AI developer can use transformers without having to think about it. We aim to reduce the cognitive load brought about by model development, not increase it.
minimal-user-api
Config, model, preprocessing; from_pretrained, save_pretrained, push_to_hub. Least amount of codepaths.
So, how do these design choices, these “tenets” influence development of models and overall usage of transformers?
Encoders remain critical for embeddings and retrieval; maintaining them well benefits the broader ecosystem (e.g., Sentence Transformers, FAISS).
Next: dev tools that leverage unified attention APIs and PyTorch-only internals.
This uniformity allows us to build cool tools to visualize the inner workings of the attention mechanism.
One particular piece of machinery is the attention mask. Here you see the famous bidirectional attention pattern for the whole prefix (text + image) in PaliGemma and all Gemma2+ models, contrasting with the usual “causal-only” models.
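The two patterns can be sketched with plain nested lists, where True means "may attend". This is a simplified illustration, ignoring batching, padding, and the 4D mask format used in practice.

```python
# Causal vs. prefix-LM attention masks. A causal mask lets token i
# attend only to tokens <= i; a prefix-LM mask (as in PaliGemma and
# Gemma2+ prefills) is fully bidirectional over the prefix
# (text + image) and causal after it.
def causal_mask(seq_len):
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def prefix_lm_mask(seq_len, prefix_len):
    return [
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

With `prefix_lm_mask(4, 2)`, the first two (prefix) rows attend to each other in both directions, while the last two rows fall back to the causal pattern.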
Because everything is PyTorch, we can easily debug any model when we want to add it to transformers. We now have a power-user tool for porting or adding models, that wraps a forward pass, intercepts every submodule call, and logs shapes, dtypes, and sample statistics of inputs/outputs to nested JSON.
It just works with PyTorch models and is especially useful when aligning outputs with a reference implementation, to match our Source of Truth guideline.
source-of-truth
Model implementations should be reliable, reproducible, and faithful to original performances.
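Ignoring PyTorch specifics, the interception idea can be sketched with wrapped callables and a nested dict that serializes to JSON. The real tool hooks torch submodules and records shapes, dtypes, and sample statistics rather than raw values; everything below is an illustrative stand-in.

```python
# Dependency-free sketch of forward interception: wrap each "layer"
# callable so every call logs input/output summaries into a nested
# dict, dumpable as JSON for side-by-side comparison with a
# reference implementation.
import json

def summarize(x):
    """Tiny stand-in for shape/dtype/statistics logging."""
    if isinstance(x, list):
        return {"type": "list", "len": len(x)}
    return {"type": type(x).__name__, "value": x}

def intercept(name, fn, log):
    def wrapped(x):
        entry = {"input": summarize(x)}
        out = fn(x)
        entry["output"] = summarize(out)
        log.setdefault(name, []).append(entry)
        return out
    return wrapped

log = {}
double = intercept("double", lambda xs: [2 * v for v in xs], log)
total = intercept("total", lambda xs: sum(xs), log)
result = total(double([1, 2, 3]))
print(json.dumps(log, indent=2))
```

Diffing two such logs, one from the port and one from the reference, pinpoints the first submodule where outputs diverge.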
Figure 10: Model debugger interface intercepting calls and logging statistics in nested JSON.
Forward interception and nested JSON logging align ports to reference implementations, reinforcing “Source of Truth.”
Next: CUDA warmup reduces load-time without touching modeling semantics.
Having a clean external API allows us to work on the true inner workings of transformers. One of a few recent additions was the CUDA warmup via caching_allocator_warmup, which dramatically improved loading times by pre-allocating GPU memory to avoid malloc bottlenecks during model loading. It can achieve a 7x speedup factor for an 8B model, or 6x for a 32B one, as you can check in the PR!
Memory allocation patterns during model loading
It’s hard to overstate how much of a lifesaver that is when you’re trying to load a model as fast as possible, as it’s the narrowest bottleneck for your iteration speed.
Pre-allocating GPU memory removes malloc spikes (e.g., 7× for 8B, 6× for 32B in the referenced PR).
Keeping every model consistent with the shared interface brings concrete benefits, such as:
- having it immediately usable in vLLM, SGLang, and so on without additional code. In the case of vLLM, transformers was added as a backend to run models on vLLM, which optimizes throughput/latency on top of existing transformers architectures, as seen in this great vLLM x HF blog post.
- being the reference code for implementations in MLX, llama.cpp and other libraries.
This further cements the need for a consistent public surface: we are a backend and a reference, and there’s more software than us to handle serving. At the time of writing, more effort is being invested in that direction. We already have compatible configs for VLMs in vLLM (say that three times fast); check here for GLM4 video support, and here for MoE support, for instance.
consistent-public-surface
Uniform naming, signatures, and conventions across all models for predictability.
Being a good backend consumer requires a consistent public surface; modular shards and configs make that stability practical.
Next: what changes in v5 without breaking the promise of visible semantics.
The next major version of transformers is just around the corner (and will have another blog post to its name when it comes out). When v5 is released, we aim to keep backwards compatibility as solid as possible. The changes we make now are in service of that goal.
author={Pablo Montalvo and Lysandre Debut and Pedro Cuenca and Yoni Gozlan},
year={2025},
}
Special thanks to all the reviewers on this! Vaibhav Srivastav, Cyril Vallez, Yoni Gozlan (also for his excellent work on fast image processors), Arthur Zucker for his guidance, and of course the wonderful Thibaud Frere for designing this template and helping me out with it!
Most importantly: thanks to the entire Open-Source community, sincerely.