What’s the easiest way to load a pre-trained Hugging Face model in Python or in a notebook?

How can I quickly use an already-trained AI model from Hugging Face using Python (for example in Jupyter Notebook or Google Colab)?


The pipeline() function from Transformers is probably the simplest approach.
If you want to do something more elaborate, it’s often better to use from_pretrained() from the AutoModel / AutoTokenizer classes, etc.


Easiest way in Python or a notebook: transformers.pipeline()

If you want “download a pretrained model from the Hugging Face Hub and run inference in 1–3 lines,” use Transformers Pipelines. Pipelines wrap the messy parts (tokenization or image/audio preprocessing, model forward pass, decoding) behind a task-focused API. (Hugging Face)


Mental model in 30 seconds

When you “load a model from Hugging Face,” you are doing three things:

  1. Pick a model repo on the Hub (by model id like "distilgpt2").
  2. Download + cache weights, config, and tokenizer/processor locally. The Hub cache is shared across HF libraries. (Hugging Face)
  3. Run inference either with:
  • pipeline(...) for the simplest “just run it” workflow. (Hugging Face)
  • AutoTokenizer + AutoModel...from_pretrained(...) when you want more control. (Hugging Face)

Copy-paste notebook recipe (Jupyter or Colab)

1) Install

!pip -q install -U transformers accelerate torch huggingface_hub

accelerate matters later for large models and device_map="auto". (Hugging Face)

2) Pick CPU vs GPU once

import torch
device = 0 if torch.cuda.is_available() else -1
device

Pipelines use device=-1 for CPU and device=0 for the first CUDA GPU. (Hugging Face)

3) Run a pretrained model immediately

Example A: sentiment analysis

from transformers import pipeline

clf = pipeline("sentiment-analysis", device=device)
clf("Hugging Face models are easy to use in notebooks.")

Example B: text generation

from transformers import pipeline

gen = pipeline("text-generation", model="distilgpt2", device=device)
gen("Write a short haiku about Tokyo in winter:", max_new_tokens=40)

Using max_new_tokens is the recommended way to control “how much to generate.” (Hugging Face)
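
Other generation controls can be passed the same way, reusing gen from Example B; a minimal sketch (do_sample, temperature, and top_p are standard generate() arguments, and the values here are only illustrative):

gen(
    "Write a short haiku about Tokyo in winter:",
    max_new_tokens=40,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.7,   # lower values make output more deterministic
    top_p=0.9,         # nucleus-sampling cutoff
)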


How to pick the right model fast (so it “just works”)

On each model page, the model card metadata often includes a pipeline_tag. That tag indicates the intended task, helps the Hub pick the right widget, and is a strong hint for which pipeline("...") task string you should use. (Hugging Face)

Practical mapping:

  • pipeline_tag: text-classification → pipeline("text-classification", ...) or task-specific ones like "sentiment-analysis".
  • pipeline_tag: text-generation → pipeline("text-generation", ...).
  • Vision/audio tasks also have pipeline tags and corresponding pipelines. (Hugging Face)

If you only remember one rule: match the pipeline task to the model’s intended task. It avoids many “wrong head / wrong outputs” problems.
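
If you prefer to check the tag from code instead of the model page, here is a minimal sketch using huggingface_hub.model_info(), which reads pipeline_tag from the Hub metadata:

from huggingface_hub import model_info

info = model_info("distilbert-base-uncased-finetuned-sst-2-english")
print(info.pipeline_tag)  # e.g. "text-classification"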


What “loading” actually does (cache, paths, and why reruns are faster)

Downloads go into a local cache. The default is typically ~/.cache/huggingface/hub (configurable). Transformers documents the cache location and the environment-variable precedence (HF_HUB_CACHE, then HF_HOME). (Hugging Face)

If you want to move the cache (common in Colab, shared servers, small disks), set the env var before importing Transformers:

import os
os.environ["HF_HUB_CACHE"] = "/some/bigger/disk/hf_hub_cache"

(Transformers and huggingface_hub both document this cache behavior and variables.) (Hugging Face)
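
To see what is already cached (and how much disk it uses), huggingface_hub also ships a cache scanner; a minimal sketch:

from huggingface_hub import scan_cache_dir

cache = scan_cache_dir()
print(f"cache size: {cache.size_on_disk / 1e9:.2f} GB")
for repo in cache.repos:
    print(repo.repo_id, repo.size_on_disk_str)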


When pipeline() is not enough: use Auto classes

Use Auto classes when you want batching, custom preprocessing, direct control over tensors, or you want to call model.generate() yourself.

Classification example (manual but flexible)

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode (disables dropout)

inputs = tok("This is great.", return_tensors="pt")
with torch.no_grad():  # no gradients needed for inference
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()  # class index; model.config.id2label maps it to a label
pred

The Auto Classes docs describe this from_pretrained() instantiation pattern across many task heads. (Hugging Face)
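
The same pattern handles batches, which is one of the reasons to drop down to Auto classes in the first place; a minimal sketch that reuses tok and model from the block above (padding aligns sequence lengths so one tensor can hold the whole batch):

texts = ["This is great.", "This is terrible.", "Not sure yet."]

batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits

preds = logits.argmax(dim=-1).tolist()
[model.config.id2label[p] for p in preds]  # e.g. ["POSITIVE", "NEGATIVE", ...]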


Large models: the easiest “make it fit” knob is device_map="auto"

If a model is too large for your GPU, let Accelerate dispatch layers across devices:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "distilgpt2"  # swap to a larger causal LM if you want
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

inputs = tok("Explain model caching in one sentence:", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))

Accelerate documents that device_map="auto" fills GPU memory first, then CPU, then disk if needed. (Hugging Face)
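
The same knob also works through pipeline(); a minimal sketch (device_map and torch_dtype are forwarded to from_pretrained(), so don't also pass device= in this case):

from transformers import pipeline

gen = pipeline(
    "text-generation",
    model="distilgpt2",
    device_map="auto",
    torch_dtype="auto",
)
gen("Explain model caching in one sentence:", max_new_tokens=40)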


Gated/private models: login in the notebook

If a model is gated or private, you need authentication. The Hub docs cover programmatic login (and related token management). (Hugging Face)

from huggingface_hub import login
login()  # prompts in many notebook environments
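
If the environment cannot prompt interactively, a token can be supplied directly instead; a minimal sketch ("some-org/gated-model" is a placeholder id, and HF_TOKEN is assumed to be set in the environment):

import os
from transformers import AutoTokenizer

# token= is passed along with the Hub request; alternatively, a set
# HF_TOKEN environment variable is usually picked up automatically.
tok = AutoTokenizer.from_pretrained(
    "some-org/gated-model",             # placeholder: replace with the gated repo id
    token=os.environ.get("HF_TOKEN"),
)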

Offline notebooks and reproducibility

Transformers documents an “offline mode”: download everything first, then run without network access. It explicitly points to using snapshot_download() ahead of time. (Hugging Face)

from huggingface_hub import snapshot_download

local_dir = snapshot_download("distilgpt2")  # downloads full repo snapshot
local_dir

Then load from that folder (or load by id with local_files_only=True if everything is already cached).
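
A minimal sketch of that loading step, assuming local_dir is the path returned by snapshot_download() above:

import os
os.environ["HF_HUB_OFFLINE"] = "1"  # optional hard switch: set before imports to block all Hub calls

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load straight from the downloaded snapshot folder...
tok = AutoTokenizer.from_pretrained(local_dir)
model = AutoModelForCausalLM.from_pretrained(local_dir)

# ...or by model id, relying only on the local cache:
# tok = AutoTokenizer.from_pretrained("distilgpt2", local_files_only=True)
# model = AutoModelForCausalLM.from_pretrained("distilgpt2", local_files_only=True)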


Common notebook pitfalls (quick prevention)

1) Wrong device handling

Set the device at pipeline creation time. Pipelines define CPU as device=-1 and GPU as a CUDA ordinal like device=0. (Hugging Face)

2) Confusion about “which task string do I use”

Use the model card pipeline_tag as your hint. (Hugging Face)

3) Cache surprises

If downloads are going to the “wrong place,” configure HF_HUB_CACHE or HF_HOME before imports. Transformers lists the precedence explicitly. (Hugging Face)



Summary

  • Fastest path: pipeline(task, model=..., device=0) in Colab, or device=-1 on CPU. (Hugging Face)
  • Pick the right task by checking the model card pipeline_tag. (Hugging Face)
  • For control: AutoTokenizer.from_pretrained() + AutoModel...from_pretrained(). (Hugging Face)
  • For big models: device_map="auto" (GPU then CPU then disk). (Hugging Face)
  • If gated: huggingface_hub.login(). If offline: snapshot_download() first. (Hugging Face)