Tokenizers documentation

Decoders


DecodeStream

class tokenizers.decoders.DecodeStream

( ids = None skip_special_tokens = False )

Parameters

  • skip_special_tokens (bool, defaults to False) — Whether to skip special tokens (e.g. [CLS], [SEP], <s>) when decoding.

Provides incremental decoding of token IDs as they are generated, yielding decoded text chunks as soon as they are available.

Unlike batch decoding, streaming decode is designed for use with autoregressive generation — tokens arrive one at a time and the decoder needs to handle multi-byte sequences (e.g., UTF-8 characters split across token boundaries) and byte-fallback tokens gracefully.

The decoder internally buffers tokens until it can produce a valid UTF-8 string chunk, then yields that chunk and advances its internal state. This means individual calls to step() may return None when the current token only extends a partial sequence that cannot yet be decoded.

Example:

>>> from tokenizers import Tokenizer
>>> from tokenizers.decoders import DecodeStream
>>> tokenizer = Tokenizer.from_pretrained("gpt2")
>>> stream = DecodeStream(skip_special_tokens=True)
>>> # Simulate streaming token-by-token generation
>>> token_ids = tokenizer.encode("Hello, streaming world!").ids
>>> for token_id in token_ids:
...     chunk = stream.step(tokenizer, token_id)
...     if chunk is not None:
...         print(chunk, end="", flush=True)

step

( tokenizer id ) → Optional[str]

Parameters

  • tokenizer (Tokenizer) — The tokenizer whose decoder pipeline will be used.
  • id (int or List[int]) — The next token ID, or a list of token IDs to append to the stream.

Returns

Optional[str]

The next decoded text chunk if enough tokens have accumulated, or None if more tokens are still needed.

Add the next token ID (or list of IDs) to the stream and return the next decoded text chunk if one is available.

Because some characters span multiple tokens (e.g. multi-byte UTF-8 sequences or byte-fallback tokens), this method may return None when the provided token does not yet complete a decodable unit. Callers should simply continue feeding tokens until a non-None value is returned.
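The partial-sequence situation is the same one an incremental UTF-8 decoder faces at the byte level. As an analogy (using only the Python standard library, not the tokenizers internals):

```python
import codecs

# "é" is two bytes in UTF-8 (0xC3 0xA9). An incremental decoder buffers
# the first byte and yields an empty chunk until the sequence completes,
# which mirrors step() returning None across token boundaries.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"\xc3")   # incomplete sequence
second = decoder.decode(b"\xa9")  # sequence now complete
```

Here `first` is the empty string and `second` is `'é'`, just as `step()` yields nothing until a decodable chunk is ready.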

BPEDecoder

class tokenizers.decoders.BPEDecoder

( suffix = '</w>' )

Parameters

  • suffix (str, optional, defaults to </w>) — The suffix that was used to characterize an end-of-word. This suffix will be replaced by a whitespace during decoding.

BPEDecoder Decoder

Decodes BPE output that marks word boundaries with an end-of-word suffix, replacing each suffix with a space to reconstruct the original text.

Example:

>>> from tokenizers.decoders import BPEDecoder
>>> decoder = BPEDecoder()
>>> decoder.decode(["Hello</w>", "world</w>"])
'Hello world'
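The suffix rule can be sketched in a few lines of plain Python. This is an illustration of the behavior described above, not the library's implementation; `bpe_decode` is a hypothetical helper:

```python
def bpe_decode(tokens, suffix="</w>"):
    # the suffix marks an end-of-word: replace it with a space,
    # except on the last token, where it is simply dropped
    last = len(tokens) - 1
    return "".join(
        tok.replace(suffix, " " if i != last else "")
        for i, tok in enumerate(tokens)
    )
```

Mid-word pieces carry no suffix, so `bpe_decode(["Hel", "lo</w>", "world</w>"])` also yields `'Hello world'`.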

ByteFallback

class tokenizers.decoders.ByteFallback

( )

ByteFallback Decoder

ByteFallback is a decoder that handles tokens representing raw bytes in the <0xNN> format (e.g., <0x61> for the byte 0x61 = 'a'). It converts such tokens to their corresponding bytes and attempts to decode the resulting byte sequence as UTF-8. This is used in LLaMA/SentencePiece models that use byte fallback for unknown characters. Inconvertible byte tokens are replaced with the Unicode replacement character (U+FFFD).

Example:

>>> from tokenizers.decoders import ByteFallback, Fuse, Sequence
>>> decoder = Sequence([ByteFallback(), Fuse()])
>>> decoder.decode(["<0x48>", "<0x65>", "<0x6C>", "<0x6C>", "<0x6F>"])
'Hello'
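The token-to-byte conversion can be sketched in plain Python. `byte_fallback_decode` is a hypothetical helper that also fuses the result, so it approximates the `Sequence([ByteFallback(), Fuse()])` chain rather than the ByteFallback decoder alone:

```python
import re

BYTE_TOKEN = re.compile(r"<0x([0-9A-Fa-f]{2})>")

def byte_fallback_decode(tokens):
    out, buf = [], bytearray()
    for tok in tokens:
        m = BYTE_TOKEN.fullmatch(tok)
        if m:
            # accumulate consecutive raw-byte tokens
            buf.append(int(m.group(1), 16))
        else:
            if buf:
                # invalid byte runs become U+FFFD via errors="replace"
                out.append(buf.decode("utf-8", errors="replace"))
                buf.clear()
            out.append(tok)
    if buf:
        out.append(buf.decode("utf-8", errors="replace"))
    return "".join(out)
```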

ByteLevel

class tokenizers.decoders.ByteLevel

( )

ByteLevel Decoder

This decoder is to be used in tandem with the ByteLevel pre-tokenizer. It reverses the byte-to-unicode mapping applied during pre-tokenization, converting the special Unicode characters back into the original bytes to reconstruct the original string.

Example:

>>> from tokenizers.decoders import ByteLevel
>>> decoder = ByteLevel()
>>> decoder.decode(["ĠHello", "Ġworld"])
' Hello world'
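The table being reversed is GPT-2's byte-to-unicode mapping: printable bytes map to themselves, and the remaining bytes are shifted into unused code points starting at 256 (so byte 0x20, the space, becomes Ġ, U+0120). A sketch of the inverse mapping with hypothetical helper names:

```python
def bytes_to_unicode():
    # GPT-2's reversible byte-to-unicode table: printable bytes map to
    # themselves, the rest are shifted to unused code points from 256 up
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(0xA1, 0xAC + 1))
          + list(range(0xAE, 0xFF + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

def byte_level_decode(tokens):
    # map every character back to its byte, then decode as UTF-8
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    data = bytes(char_to_byte[c] for c in "".join(tokens))
    return data.decode("utf-8", errors="replace")
```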

CTC

class tokenizers.decoders.CTC

( pad_token = '<pad>' word_delimiter_token = '|' cleanup = True )

Parameters

  • pad_token (str, optional, defaults to <pad>) — The pad token used by CTC to delimit a new token.
  • word_delimiter_token (str, optional, defaults to |) — The word delimiter token. It will be replaced by a space during decoding.
  • cleanup (bool, optional, defaults to True) — Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.

CTC Decoder

Decodes the output of a CTC (Connectionist Temporal Classification) model by collapsing consecutive repeated tokens into one, removing the pad token, and replacing the word delimiter token with a space.

Example:

>>> from tokenizers.decoders import CTC
>>> decoder = CTC()
>>> decoder.decode(["h", "e", "e", "l", "<pad>", "l", "o", "|", "w", "o", "r", "l", "d"])
'hello world'
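The collapse-then-substitute logic can be sketched with a hypothetical helper (the `cleanup` behavior is not modeled here):

```python
from itertools import groupby

def ctc_decode(tokens, pad_token="<pad>", word_delimiter_token="|"):
    # 1) collapse consecutive duplicates, 2) drop pad tokens,
    # 3) turn the word delimiter into a space
    collapsed = [tok for tok, _ in groupby(tokens) if tok != pad_token]
    return "".join(" " if tok == word_delimiter_token else tok
                   for tok in collapsed)
```

Because duplicates are collapsed before the pad token is removed, a genuinely doubled letter must be separated by a pad token in the model output.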

Fuse

class tokenizers.decoders.Fuse

( )

Fuse Decoder

Fuse simply concatenates every token into a single string without any separator. This is typically the last step in a decoder chain when other decoders need to operate on individual tokens before they are joined together.

Example:

>>> from tokenizers.decoders import Fuse
>>> decoder = Fuse()
>>> decoder.decode(["Hello", ",", " ", "world", "!"])
'Hello, world!'

Metaspace

class tokenizers.decoders.Metaspace

( replacement = '▁' prepend_scheme = 'always' )

Parameters

  • replacement (str, optional, defaults to ▁) — The replacement character. Must be exactly one character. By default we use the ▁ (U+2581) meta symbol (same as in SentencePiece).
  • prepend_scheme (str, optional, defaults to "always") — Whether to add a space to the first word if there isn't already one. This lets us treat hello exactly like say hello. Choices: "always", "never", "first". "first" means the space is only added on the first token (relevant when special tokens or other pre-tokenizers are used).

Metaspace Decoder

Example:

>>> from tokenizers.decoders import Metaspace
>>> decoder = Metaspace()
>>> decoder.decode(["▁Hello", "▁my", "▁friend"])
'Hello my friend'
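A sketch of the marker substitution with a hypothetical helper (the real decoder works token by token, but the effect is the same for the common case):

```python
def metaspace_decode(tokens, replacement="\u2581", prepend_scheme="always"):
    # every replacement marker stands for a space
    text = "".join(tokens).replace(replacement, " ")
    # under "always"/"first", a space was prepended during
    # pre-tokenization, so strip the leading one back off
    if prepend_scheme != "never" and text.startswith(" "):
        text = text[1:]
    return text
```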

Replace

class tokenizers.decoders.Replace

( pattern content )

Parameters

  • pattern (str or Regex) — The pattern that was used as the replacement target during encoding.
  • content (str) — The string to replace each match of the pattern with during decoding.

Replace Decoder

This decoder is to be used in tandem with the Replace normalizer or a similar replace operation. It reverses a string replacement by substituting the replacement content back with the original pattern.

Example:

>>> from tokenizers.decoders import Replace
>>> decoder = Replace("▁", " ")
>>> decoder.decode(["▁Hello", "▁world"])
' Hello world'

Sequence

class tokenizers.decoders.Sequence

( decoders )

Parameters

  • decoders (List[Decoder]) — The list of decoders to chain together.

Sequence Decoder

Chains multiple decoders together, applying them in order. Each decoder in the sequence processes the output of the previous one, allowing complex decoding pipelines to be built from simpler components.

Example:

>>> from tokenizers.decoders import ByteFallback, Fuse, Metaspace, Sequence
>>> decoder = Sequence([ByteFallback(), Fuse(), Metaspace()])
>>> decoder.decode(["▁Hello", "▁world"])
'Hello world'
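Conceptually, each stage maps a list of token pieces to a new list, and the final pieces are concatenated into the decoded string. A sketch with hypothetical toy stages standing in for real decoders:

```python
def replace_marker(pieces):
    # toy stage: map the ▁ marker back to spaces
    return [p.replace("\u2581", " ") for p in pieces]

def strip_leading(pieces):
    # toy stage: drop the leading space the first marker produced
    return [pieces[0].lstrip(" ")] + pieces[1:] if pieces else pieces

def sequence_decode(stages, tokens):
    pieces = list(tokens)
    for stage in stages:
        pieces = stage(pieces)  # each stage sees the previous output
    return "".join(pieces)
```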

Strip

class tokenizers.decoders.Strip

( content = ' ' left = 0 right = 0 )

Parameters

  • content (str, defaults to " ") — The character to strip from each token.
  • left (int, defaults to 0) — The number of occurrences of content to remove from the left side of each token.
  • right (int, defaults to 0) — The number of occurrences of content to remove from the right side of each token.

Strip Decoder

Strips a given number of occurrences of a character from the left and/or right side of each token. This is useful for removing padding characters or special prefix/suffix markers added during tokenization.

Example:

>>> from tokenizers.decoders import Strip
>>> decoder = Strip(content="▁", left=1)
>>> decoder.decode(["▁Hello", "▁world"])
'Helloworld'
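Note that Strip only trims each token; it inserts no separators, so the stripped pieces are joined directly. A sketch with a hypothetical helper:

```python
def strip_decode(tokens, content=" ", left=0, right=0):
    out = []
    for tok in tokens:
        # remove up to `left` leading occurrences of `content`...
        for _ in range(left):
            if tok.startswith(content):
                tok = tok[len(content):]
        # ...and up to `right` trailing occurrences
        for _ in range(right):
            if tok.endswith(content):
                tok = tok[:-len(content)]
        out.append(tok)
    return "".join(out)
```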

WordPiece

class tokenizers.decoders.WordPiece

( prefix = '##' cleanup = True )

Parameters

  • prefix (str, optional, defaults to ##) — The prefix used for subwords that are not a beginning-of-word.
  • cleanup (bool, optional, defaults to True) — Whether to clean up some tokenization artifacts, mainly spaces before punctuation and some abbreviated English forms.

WordPiece Decoder

Decodes WordPiece output by merging subword tokens that carry the prefix into the preceding token, joining the remaining tokens with spaces, and optionally cleaning up spacing artifacts around punctuation.

Example:

>>> from tokenizers.decoders import WordPiece
>>> decoder = WordPiece()
>>> decoder.decode(["un", "##believ", "##able"])
'unbelievable'
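The prefix merging and cleanup rules can be sketched with a hypothetical helper (the cleanup list here is a representative subset, applied to the joined text rather than per token as the real decoder does):

```python
def wordpiece_decode(tokens, prefix="##", cleanup=True):
    pieces = []
    for i, tok in enumerate(tokens):
        if i == 0:
            pieces.append(tok)
        elif tok.startswith(prefix):
            # continuation subword: glue onto the previous piece
            pieces.append(tok[len(prefix):])
        else:
            # word-initial token: join with a space
            pieces.append(" " + tok)
    text = "".join(pieces)
    if cleanup:
        for dirty, clean in [(" .", "."), (" ?", "?"), (" !", "!"),
                             (" ,", ","), (" n't", "n't"), (" 's", "'s")]:
            text = text.replace(dirty, clean)
    return text
```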