class transformers.NllbTokenizer

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/nllb/tokenization_nllb.py#L38

( vocab_file, bos_token = '<s>', eos_token = '</s>', sep_token = '</s>', cls_token = '<s>', unk_token = '<unk>', pad_token = '<pad>', mask_token = '<mask>', tokenizer_file = None, src_lang = None, tgt_lang = None, sp_model_kwargs: typing.Optional[dict[str, typing.Any]] = None, additional_special_tokens = None, legacy_behaviour = False, **kwargs )

Parameters:

- **vocab_file** (`str`) -- Path to the vocabulary file.
- **bos_token** (`str`, *optional*, defaults to `"<s>"`) -- The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the `cls_token`.
- **eos_token** (`str`, *optional*, defaults to `"</s>"`) -- The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the `sep_token`.
- **sep_token** (`str`, *optional*, defaults to `"</s>"`) -- The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
- **cls_token** (`str`, *optional*, defaults to `"<s>"`) -- The classifier token, which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
- **unk_token** (`str`, *optional*, defaults to `"<unk>"`) -- The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- **pad_token** (`str`, *optional*, defaults to `"<pad>"`) -- The token used for padding, for example when batching sequences of different lengths.
- **mask_token** (`str`, *optional*, defaults to `"<mask>"`) -- The token used for masking values. This is the token used when training this model with masked language modeling, and the token the model will try to predict.
- **tokenizer_file** (`str`, *optional*) -- The path to a tokenizer file to use instead of the vocab file.
- **src_lang** (`str`, *optional*) -- The language to use as source language for translation.
- **tgt_lang** (`str`, *optional*) -- The language to use as target language for translation.
- **sp_model_kwargs** (`dict[str, Any]`, *optional*) -- Additional keyword arguments to pass to the model initialization.

Construct an NLLB tokenizer.

Adapted from [RobertaTokenizer](/docs/transformers/v4.57.0/en/model_doc/roberta#transformers.RobertaTokenizer) and [XLNetTokenizer](/docs/transformers/v4.57.0/en/model_doc/xlnet#transformers.XLNetTokenizer). Based on [SentencePiece](https://github.com/google/sentencepiece).

The tokenization method is `<tokens> <eos> <language code>` for source language documents, and `<language code> <tokens> <eos>` for target language documents.

Examples:

```python
>>> from transformers import NllbTokenizer

>>> tokenizer = NllbTokenizer.from_pretrained(
...     "facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="fra_Latn"
... )
>>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
>>> expected_translation_french = "Le chef de l'ONU affirme qu'il n'y a pas de solution militaire en Syrie."
>>> inputs = tokenizer(example_english_phrase, text_target=expected_translation_french, return_tensors="pt")
```
build_inputs_with_special_tokens

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/nllb/tokenization_nllb.py#L245

( token_ids_0: list, token_ids_1: typing.Optional[list[int]] = None ) → `list[int]`

Parameters:

- **token_ids_0** (`list[int]`) -- List of IDs to which the special tokens will be added.
- **token_ids_1** (`list[int]`, *optional*) -- Optional second list of IDs for sequence pairs.

Returns: `list[int]` -- List of [input IDs](../glossary#input-ids) with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. An NLLB sequence has the following format, where `X` represents the sequence:

- `input_ids` (for encoder): `X [eos, src_lang_code]`
- `decoder_input_ids` (for decoder): `X [eos, tgt_lang_code]`

BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.
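
The layout described above can be sketched in plain Python. This is a minimal illustration, not the library implementation: `build_inputs_sketch` is a hypothetical helper, and the integer IDs are placeholders rather than real NLLB vocabulary entries.

```python
from typing import Optional


def build_inputs_sketch(
    token_ids_0: list[int],
    token_ids_1: Optional[list[int]] = None,
    *,
    eos_id: int,
    lang_code_id: int,
) -> list[int]:
    """Hypothetical helper mirroring the documented `X [eos, lang_code]` layout."""
    suffix = [eos_id, lang_code_id]
    if token_ids_1 is None:
        return token_ids_0 + suffix
    # Pairs are concatenated directly, with no separator between them.
    return token_ids_0 + token_ids_1 + suffix


# Placeholder IDs chosen for illustration only (assumptions, not real vocabulary IDs).
EOS_ID, SRC_LANG_ID = 2, 1000

single = build_inputs_sketch([10, 11, 12], eos_id=EOS_ID, lang_code_id=SRC_LANG_ID)
pair = build_inputs_sketch([10, 11], [20, 21], eos_id=EOS_ID, lang_code_id=SRC_LANG_ID)
print(single)  # [10, 11, 12, 2, 1000]
print(pair)    # [10, 11, 20, 21, 2, 1000]
```

For decoder inputs the same sketch applies with the target language code passed as `lang_code_id`.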

class transformers.NllbTokenizerFast

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/nllb/tokenization_nllb_fast.py#L42

( vocab_file = None, tokenizer_file = None, bos_token = '<s>', eos_token = '</s>', sep_token = '</s>', cls_token = '<s>', unk_token = '<unk>', pad_token = '<pad>', mask_token = '<mask>', src_lang = None, tgt_lang = None, additional_special_tokens = None, legacy_behaviour = False, **kwargs )

Parameters:

- **vocab_file** (`str`, *optional*) -- Path to the vocabulary file.
- **bos_token** (`str`, *optional*, defaults to `"<s>"`) -- The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token. When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the `cls_token`.
- **eos_token** (`str`, *optional*, defaults to `"</s>"`) -- The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the `sep_token`.
- **sep_token** (`str`, *optional*, defaults to `"</s>"`) -- The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
- **cls_token** (`str`, *optional*, defaults to `"<s>"`) -- The classifier token, which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
- **unk_token** (`str`, *optional*, defaults to `"<unk>"`) -- The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- **pad_token** (`str`, *optional*, defaults to `"<pad>"`) -- The token used for padding, for example when batching sequences of different lengths.
- **mask_token** (`str`, *optional*, defaults to `"<mask>"`) -- The token used for masking values. This is the token used when training this model with masked language modeling, and the token the model will try to predict.
- **tokenizer_file** (`str`, *optional*) -- The path to a tokenizer file to use instead of the vocab file.
- **src_lang** (`str`, *optional*) -- The language to use as source language for translation.
- **tgt_lang** (`str`, *optional*) -- The language to use as target language for translation.

Construct a "fast" NLLB tokenizer (backed by HuggingFace's *tokenizers* library). Based on [BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models).

This tokenizer inherits from [PreTrainedTokenizerFast](/docs/transformers/v4.57.0/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast), which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

The tokenization method is `<tokens> <eos> <language code>` for source language documents, and `<language code> <tokens> <eos>` for target language documents.

Examples:

```python
>>> from transformers import NllbTokenizerFast

>>> tokenizer = NllbTokenizerFast.from_pretrained(
...     "facebook/nllb-200-distilled-600M", src_lang="eng_Latn", tgt_lang="fra_Latn"
... )
>>> example_english_phrase = " UN Chief Says There Is No Military Solution in Syria"
>>> expected_translation_french = "Le chef de l'ONU affirme qu'il n'y a pas de solution militaire en Syrie."
>>> inputs = tokenizer(example_english_phrase, text_target=expected_translation_french, return_tensors="pt")
```
build_inputs_with_special_tokens

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/nllb/tokenization_nllb_fast.py#L178

( token_ids_0: list, token_ids_1: typing.Optional[list[int]] = None ) → `list[int]`

Parameters:

- **token_ids_0** (`list[int]`) -- List of IDs to which the special tokens will be added.
- **token_ids_1** (`list[int]`, *optional*) -- Optional second list of IDs for sequence pairs.

Returns: `list[int]` -- List of [input IDs](../glossary#input-ids) with the appropriate special tokens.

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. The special tokens depend on calling set_lang.

An NLLB sequence has the following format, where `X` represents the sequence:

- `input_ids` (for encoder): `X [eos, src_lang_code]`
- `decoder_input_ids` (for decoder): `X [eos, tgt_lang_code]`

BOS is never used. Pairs of sequences are not the expected use case, but they will be handled without a separator.

create_token_type_ids_from_sequences

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/nllb/tokenization_nllb_fast.py#L207

( token_ids_0: list, token_ids_1: typing.Optional[list[int]] = None ) → `list[int]`

Parameters:

- **token_ids_0** (`list[int]`) -- List of IDs.
- **token_ids_1** (`list[int]`, *optional*) -- Optional second list of IDs for sequence pairs.

Returns: `list[int]` -- List of zeros.

Create a mask from the two sequences passed, to be used in a sequence-pair classification task. NLLB does not make use of token type IDs, therefore a list of zeros is returned.
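
Since NLLB ignores token type IDs, the returned mask carries no segment information. The sketch below illustrates that idea only; `zero_token_type_ids` is a hypothetical helper, and the `num_special_tokens` parameter is an assumption introduced here so the caller states how many special tokens pad the length (the library computes this internally).

```python
from typing import Optional


def zero_token_type_ids(
    token_ids_0: list[int],
    token_ids_1: Optional[list[int]] = None,
    num_special_tokens: int = 2,  # assumed count, supplied by the caller
) -> list[int]:
    """Hypothetical helper: an all-zero mask for one sequence or a pair."""
    second_len = len(token_ids_1) if token_ids_1 is not None else 0
    total = len(token_ids_0) + second_len + num_special_tokens
    # Every position gets type 0, whichever sequence it came from.
    return [0] * total


mask_single = zero_token_type_ids([10, 11, 12])
mask_pair = zero_token_type_ids([10, 11], [20, 21, 22])
print(mask_single)  # [0, 0, 0, 0, 0]
print(mask_pair)    # [0, 0, 0, 0, 0, 0, 0]
```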

set_src_lang_special_tokens

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/nllb/tokenization_nllb_fast.py#L262

( src_lang )

Reset the special tokens to the source language setting.

- In legacy mode: no prefix and suffix=[eos, src_lang_code].
- In default mode: prefix=[src_lang_code], suffix=[eos].

set_tgt_lang_special_tokens

Source: https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/nllb/tokenization_nllb_fast.py#L285

( lang: str )

Reset the special tokens to the target language setting.

- In legacy mode: no prefix and suffix=[eos, tgt_lang_code].
- In default mode: prefix=[tgt_lang_code], suffix=[eos].
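
The difference between legacy and default mode can be sketched without loading a tokenizer. This is an illustration under stated assumptions, not the library code: `lang_special_tokens` and `wrap` are hypothetical helpers, and the IDs are placeholders. The same logic applies whether the language code is the source or the target one.

```python
def lang_special_tokens(
    lang_code_id: int, eos_id: int, legacy_behaviour: bool = False
) -> tuple[list[int], list[int]]:
    """Hypothetical helper returning (prefix, suffix) for a language code."""
    if legacy_behaviour:
        # Legacy mode: no prefix, suffix = [eos, lang_code]
        return [], [eos_id, lang_code_id]
    # Default mode: prefix = [lang_code], suffix = [eos]
    return [lang_code_id], [eos_id]


def wrap(token_ids: list[int], prefix: list[int], suffix: list[int]) -> list[int]:
    """Surround the token IDs with the chosen prefix and suffix."""
    return prefix + token_ids + suffix


# Placeholder IDs chosen for illustration only (assumptions, not real vocabulary IDs).
EOS_ID, LANG_ID = 2, 1000
tokens = [10, 11, 12]

default_ids = wrap(tokens, *lang_special_tokens(LANG_ID, EOS_ID))
legacy_ids = wrap(tokens, *lang_special_tokens(LANG_ID, EOS_ID, legacy_behaviour=True))
print(default_ids)  # [1000, 10, 11, 12, 2]
print(legacy_ids)   # [10, 11, 12, 2, 1000]
```

In the real tokenizer the mode is selected with the `legacy_behaviour` constructor argument shown in the class signature.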