# FastSpeech2Conformer

<div class="flex flex-wrap space-x-1">
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
</div>

## Overview

The FastSpeech2Conformer model was proposed in the paper [Recent Developments on ESPnet Toolkit Boosted by Conformer](https://huggingface.co/papers/2010.13956) by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang.

The abstract from the original FastSpeech2 paper is the following:

*Non-autoregressive text to speech (TTS) models such as FastSpeech (Ren et al., 2019) can synthesize speech significantly faster than previous autoregressive models with comparable quality. The training of FastSpeech model relies on an autoregressive teacher model for duration prediction (to provide more information as input) and knowledge distillation (to simplify the data distribution in output), which can ease the one-to-many mapping problem (i.e., multiple speech variations correspond to the same text) in TTS. However, FastSpeech has several disadvantages: 1) the teacher-student distillation pipeline is complicated and time-consuming, 2) the duration extracted from the teacher model is not accurate enough, and the target mel-spectrograms distilled from teacher model suffer from information loss due to data simplification, both of which limit the voice quality. In this paper, we propose FastSpeech 2, which addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS by 1) directly training the model with ground-truth target instead of the simplified output from teacher, and 2) introducing more variation information of speech (e.g., pitch, energy and more accurate duration) as conditional inputs. Specifically, we extract duration, pitch and energy from speech waveform and directly take them as conditional inputs in training and use predicted values in inference. We further design FastSpeech 2s, which is the first attempt to directly generate speech waveform from text in parallel, enjoying the benefit of fully end-to-end inference. Experimental results show that 1) FastSpeech 2 achieves a 3x training speed-up over FastSpeech, and FastSpeech 2s enjoys even faster inference speed; 2) FastSpeech 2 and 2s outperform FastSpeech in voice quality, and FastSpeech 2 can even surpass autoregressive models. Audio samples are available at https://speechresearch.github.io/fastspeech2/.*

This model was contributed by [Connor Henderson](https://huggingface.co/connor-henderson). The original code can be found [here](https://github.com/espnet/espnet/blob/master/espnet2/tts/fastspeech2/fastspeech2.py).

## 🤗 Model Architecture
This implementation follows FastSpeech2's general structure with a mel-spectrogram decoder, but replaces the traditional transformer blocks with conformer blocks, as done in the ESPnet library.

#### FastSpeech2 Model Architecture
![FastSpeech2 Model Architecture](https://www.microsoft.com/en-us/research/uploads/prod/2021/04/fastspeech2-1.png)

#### Conformer Blocks
![Conformer Blocks](https://www.researchgate.net/profile/Hirofumi-Inaguma-2/publication/344911155/figure/fig2/AS:951455406108673@1603856054097/An-overview-of-Conformer-block.png)

#### Convolution Module
![Convolution Module](https://d3i71xaburhd42.cloudfront.net/8809d0732f6147d4ad9218c8f9b20227c837a746/2-Figure1-1.png)

## 🤗 Transformers Usage

You can run FastSpeech2Conformer locally with the 🤗 Transformers library.

1. First install the 🤗 [Transformers library](https://github.com/huggingface/transformers) and g2p-en. The examples below also use soundfile to write the audio to disk:

```bash
pip install --upgrade pip
pip install --upgrade transformers g2p-en soundfile
```

2. Run inference via the Transformers modeling code with the model and the HiFi-GAN vocoder loaded separately:

```python
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerModel, FastSpeech2ConformerHifiGan
import soundfile as sf

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]

model = FastSpeech2ConformerModel.from_pretrained("espnet/fastspeech2_conformer")
output_dict = model(input_ids, return_dict=True)
spectrogram = output_dict["spectrogram"]

hifigan = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
waveform = hifigan(spectrogram)

sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
```

3. Run inference via the Transformers modeling code with the model and the HiFi-GAN vocoder combined:

```python
from transformers import FastSpeech2ConformerTokenizer, FastSpeech2ConformerWithHifiGan
import soundfile as sf

tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
inputs = tokenizer("Hello, my dog is cute.", return_tensors="pt")
input_ids = inputs["input_ids"]

model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")
output_dict = model(input_ids, return_dict=True)
waveform = output_dict["waveform"]

sf.write("speech.wav", waveform.squeeze().detach().numpy(), samplerate=22050)
```

4. Run inference with a pipeline, specifying which vocoder to use:

```python
from transformers import pipeline, FastSpeech2ConformerHifiGan
import soundfile as sf

vocoder = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
synthesiser = pipeline(model="espnet/fastspeech2_conformer", vocoder=vocoder)

speech = synthesiser("Hello, my dog is cooler than you!")

sf.write("speech.wav", speech["audio"].squeeze(), samplerate=speech["sampling_rate"])
```
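
The examples above write audio with soundfile at a 22050 Hz sampling rate. As a minimal sketch for environments where soundfile is unavailable, the standard-library `wave` module can write the same mono signal as 16-bit PCM. The helper name `write_wav_pcm16` is hypothetical, and the sketch assumes the waveform has already been converted to a flat sequence of floats in the range [-1, 1]; a synthetic tone stands in for model output here.

```python
import math
import os
import struct
import tempfile
import wave


def write_wav_pcm16(path, samples, sampling_rate=22050):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    # Clip to the valid range, then scale to signed 16-bit integers.
    pcm = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


# Stand-in for a model output: one second of a 440 Hz tone at 22050 Hz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 22050) for n in range(22050)]
out_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_wav_pcm16(out_path, tone)
```

With a tensor such as the `waveform` from the earlier steps, something like `waveform.squeeze().detach().tolist()` should produce a suitable input for this helper.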

## FastSpeech2ConformerConfig[[transformers.FastSpeech2ConformerConfig]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>class transformers.FastSpeech2ConformerConfig</name><anchor>transformers.FastSpeech2ConformerConfig</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/fastspeech2_conformer/configuration_fastspeech2_conformer.py#L26</source><parameters>[{"name": "hidden_size", "val": " = 384"}, {"name": "vocab_size", "val": " = 78"}, {"name": "num_mel_bins", "val": " = 80"}, {"name": "encoder_num_attention_heads", "val": " = 2"}, {"name": "encoder_layers", "val": " = 4"}, {"name": "encoder_linear_units", "val": " = 1536"}, {"name": "decoder_layers", "val": " = 4"}, {"name": "decoder_num_attention_heads", "val": " = 2"}, {"name": "decoder_linear_units", "val": " = 1536"}, {"name": "speech_decoder_postnet_layers", "val": " = 5"}, {"name": "speech_decoder_postnet_units", "val": " = 256"}, {"name": "speech_decoder_postnet_kernel", "val": " = 5"}, {"name": "positionwise_conv_kernel_size", "val": " = 3"}, {"name": "encoder_normalize_before", "val": " = False"}, {"name": "decoder_normalize_before", "val": " = False"}, {"name": "encoder_concat_after", "val": " = False"}, {"name": "decoder_concat_after", "val": " = False"}, {"name": "reduction_factor", "val": " = 1"}, {"name": "speaking_speed", "val": " = 1.0"}, {"name": "use_macaron_style_in_conformer", "val": " = True"}, {"name": "use_cnn_in_conformer", "val": " = True"}, {"name": "encoder_kernel_size", "val": " = 7"}, {"name": "decoder_kernel_size", "val": " = 31"}, {"name": "duration_predictor_layers", "val": " = 2"}, {"name": "duration_predictor_channels", "val": " = 256"}, {"name": "duration_predictor_kernel_size", "val": " = 3"}, {"name": "energy_predictor_layers", "val": " = 2"}, {"name": "energy_predictor_channels", "val": " = 256"}, {"name": "energy_predictor_kernel_size", "val": " = 3"}, {"name": "energy_predictor_dropout", "val": " = 0.5"}, {"name": "energy_embed_kernel_size", "val": " = 1"}, {"name": "energy_embed_dropout", "val": " = 0.0"}, {"name": 
"stop_gradient_from_energy_predictor", "val": " = False"}, {"name": "pitch_predictor_layers", "val": " = 5"}, {"name": "pitch_predictor_channels", "val": " = 256"}, {"name": "pitch_predictor_kernel_size", "val": " = 5"}, {"name": "pitch_predictor_dropout", "val": " = 0.5"}, {"name": "pitch_embed_kernel_size", "val": " = 1"}, {"name": "pitch_embed_dropout", "val": " = 0.0"}, {"name": "stop_gradient_from_pitch_predictor", "val": " = True"}, {"name": "encoder_dropout_rate", "val": " = 0.2"}, {"name": "encoder_positional_dropout_rate", "val": " = 0.2"}, {"name": "encoder_attention_dropout_rate", "val": " = 0.2"}, {"name": "decoder_dropout_rate", "val": " = 0.2"}, {"name": "decoder_positional_dropout_rate", "val": " = 0.2"}, {"name": "decoder_attention_dropout_rate", "val": " = 0.2"}, {"name": "duration_predictor_dropout_rate", "val": " = 0.2"}, {"name": "speech_decoder_postnet_dropout", "val": " = 0.5"}, {"name": "max_source_positions", "val": " = 5000"}, {"name": "use_masking", "val": " = True"}, {"name": "use_weighted_masking", "val": " = False"}, {"name": "num_speakers", "val": " = None"}, {"name": "num_languages", "val": " = None"}, {"name": "speaker_embed_dim", "val": " = None"}, {"name": "is_encoder_decoder", "val": " = True"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **hidden_size** (`int`, *optional*, defaults to 384) --
  The dimensionality of the hidden layers.
- **vocab_size** (`int`, *optional*, defaults to 78) --
  The size of the vocabulary.
- **num_mel_bins** (`int`, *optional*, defaults to 80) --
  The number of mel filters used in the filter bank.
- **encoder_num_attention_heads** (`int`, *optional*, defaults to 2) --
  The number of attention heads in the encoder.
- **encoder_layers** (`int`, *optional*, defaults to 4) --
  The number of layers in the encoder.
- **encoder_linear_units** (`int`, *optional*, defaults to 1536) --
  The number of units in the linear layer of the encoder.
- **decoder_layers** (`int`, *optional*, defaults to 4) --
  The number of layers in the decoder.
- **decoder_num_attention_heads** (`int`, *optional*, defaults to 2) --
  The number of attention heads in the decoder.
- **decoder_linear_units** (`int`, *optional*, defaults to 1536) --
  The number of units in the linear layer of the decoder.
- **speech_decoder_postnet_layers** (`int`, *optional*, defaults to 5) --
  The number of layers in the post-net of the speech decoder.
- **speech_decoder_postnet_units** (`int`, *optional*, defaults to 256) --
  The number of units in the post-net layers of the speech decoder.
- **speech_decoder_postnet_kernel** (`int`, *optional*, defaults to 5) --
  The kernel size in the post-net of the speech decoder.
- **positionwise_conv_kernel_size** (`int`, *optional*, defaults to 3) --
  The size of the convolution kernel used in the position-wise layer.
- **encoder_normalize_before** (`bool`, *optional*, defaults to `False`) --
  Specifies whether to normalize before encoder layers.
- **decoder_normalize_before** (`bool`, *optional*, defaults to `False`) --
  Specifies whether to normalize before decoder layers.
- **encoder_concat_after** (`bool`, *optional*, defaults to `False`) --
  Specifies whether to concatenate after encoder layers.
- **decoder_concat_after** (`bool`, *optional*, defaults to `False`) --
  Specifies whether to concatenate after decoder layers.
- **reduction_factor** (`int`, *optional*, defaults to 1) --
  The factor by which the speech frame rate is reduced.
- **speaking_speed** (`float`, *optional*, defaults to 1.0) --
  The speed of the speech produced.
- **use_macaron_style_in_conformer** (`bool`, *optional*, defaults to `True`) --
  Specifies whether to use macaron style in the conformer.
- **use_cnn_in_conformer** (`bool`, *optional*, defaults to `True`) --
  Specifies whether to use convolutional neural networks in the conformer.
- **encoder_kernel_size** (`int`, *optional*, defaults to 7) --
  The kernel size used in the encoder.
- **decoder_kernel_size** (`int`, *optional*, defaults to 31) --
  The kernel size used in the decoder.
- **duration_predictor_layers** (`int`, *optional*, defaults to 2) --
  The number of layers in the duration predictor.
- **duration_predictor_channels** (`int`, *optional*, defaults to 256) --
  The number of channels in the duration predictor.
- **duration_predictor_kernel_size** (`int`, *optional*, defaults to 3) --
  The kernel size used in the duration predictor.
- **energy_predictor_layers** (`int`, *optional*, defaults to 2) --
  The number of layers in the energy predictor.
- **energy_predictor_channels** (`int`, *optional*, defaults to 256) --
  The number of channels in the energy predictor.
- **energy_predictor_kernel_size** (`int`, *optional*, defaults to 3) --
  The kernel size used in the energy predictor.
- **energy_predictor_dropout** (`float`, *optional*, defaults to 0.5) --
  The dropout rate in the energy predictor.
- **energy_embed_kernel_size** (`int`, *optional*, defaults to 1) --
  The kernel size used in the energy embed layer.
- **energy_embed_dropout** (`float`, *optional*, defaults to 0.0) --
  The dropout rate in the energy embed layer.
- **stop_gradient_from_energy_predictor** (`bool`, *optional*, defaults to `False`) --
  Specifies whether to stop gradients from the energy predictor.
- **pitch_predictor_layers** (`int`, *optional*, defaults to 5) --
  The number of layers in the pitch predictor.
- **pitch_predictor_channels** (`int`, *optional*, defaults to 256) --
  The number of channels in the pitch predictor.
- **pitch_predictor_kernel_size** (`int`, *optional*, defaults to 5) --
  The kernel size used in the pitch predictor.
- **pitch_predictor_dropout** (`float`, *optional*, defaults to 0.5) --
  The dropout rate in the pitch predictor.
- **pitch_embed_kernel_size** (`int`, *optional*, defaults to 1) --
  The kernel size used in the pitch embed layer.
- **pitch_embed_dropout** (`float`, *optional*, defaults to 0.0) --
  The dropout rate in the pitch embed layer.
- **stop_gradient_from_pitch_predictor** (`bool`, *optional*, defaults to `True`) --
  Specifies whether to stop gradients from the pitch predictor.
- **encoder_dropout_rate** (`float`, *optional*, defaults to 0.2) --
  The dropout rate in the encoder.
- **encoder_positional_dropout_rate** (`float`, *optional*, defaults to 0.2) --
  The positional dropout rate in the encoder.
- **encoder_attention_dropout_rate** (`float`, *optional*, defaults to 0.2) --
  The attention dropout rate in the encoder.
- **decoder_dropout_rate** (`float`, *optional*, defaults to 0.2) --
  The dropout rate in the decoder.
- **decoder_positional_dropout_rate** (`float`, *optional*, defaults to 0.2) --
  The positional dropout rate in the decoder.
- **decoder_attention_dropout_rate** (`float`, *optional*, defaults to 0.2) --
  The attention dropout rate in the decoder.
- **duration_predictor_dropout_rate** (`float`, *optional*, defaults to 0.2) --
  The dropout rate in the duration predictor.
- **speech_decoder_postnet_dropout** (`float`, *optional*, defaults to 0.5) --
  The dropout rate in the speech decoder postnet.
- **max_source_positions** (`int`, *optional*, defaults to 5000) --
  If `"relative"` position embeddings are used, defines the maximum source input positions.
- **use_masking** (`bool`, *optional*, defaults to `True`) --
  Specifies whether to use masking in the model.
- **use_weighted_masking** (`bool`, *optional*, defaults to `False`) --
  Specifies whether to use weighted masking in the model.
- **num_speakers** (`int`, *optional*) --
  Number of speakers. If set to > 1, assume that the speaker ids will be provided as the input and use the
  speaker id embedding layer.
- **num_languages** (`int`, *optional*) --
  Number of languages. If set to > 1, assume that the language ids will be provided as the input and use the
  language id embedding layer.
- **speaker_embed_dim** (`int`, *optional*) --
  Speaker embedding dimension. If set to > 0, assume that speaker_embedding will be provided as the input.
- **is_encoder_decoder** (`bool`, *optional*, defaults to `True`) --
  Specifies whether the model is an encoder-decoder.</paramsdesc><paramgroups>0</paramgroups></docstring>

This is the configuration class to store the configuration of a [FastSpeech2ConformerModel](/docs/transformers/v4.57.0/en/model_doc/fastspeech2_conformer#transformers.FastSpeech2ConformerModel). It is used to
instantiate a FastSpeech2Conformer model according to the specified arguments, defining the model architecture.
Instantiating a configuration with the defaults will yield a similar configuration to that of the
FastSpeech2Conformer [espnet/fastspeech2_conformer](https://huggingface.co/espnet/fastspeech2_conformer)
architecture.

Configuration objects inherit from [PretrainedConfig](/docs/transformers/v4.57.0/en/main_classes/configuration#transformers.PretrainedConfig) and can be used to control the model outputs. Read the
documentation from [PretrainedConfig](/docs/transformers/v4.57.0/en/main_classes/configuration#transformers.PretrainedConfig) for more information.



<ExampleCodeBlock anchor="transformers.FastSpeech2ConformerConfig.example">

Example:

```python
>>> from transformers import FastSpeech2ConformerModel, FastSpeech2ConformerConfig

>>> # Initializing a FastSpeech2Conformer style configuration
>>> configuration = FastSpeech2ConformerConfig()

>>> # Initializing a model from the FastSpeech2Conformer style configuration
>>> model = FastSpeech2ConformerModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

</ExampleCodeBlock>

</div>

## FastSpeech2ConformerHifiGanConfig[[transformers.FastSpeech2ConformerHifiGanConfig]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>class transformers.FastSpeech2ConformerHifiGanConfig</name><anchor>transformers.FastSpeech2ConformerHifiGanConfig</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/fastspeech2_conformer/configuration_fastspeech2_conformer.py#L328</source><parameters>[{"name": "model_in_dim", "val": " = 80"}, {"name": "upsample_initial_channel", "val": " = 512"}, {"name": "upsample_rates", "val": " = [8, 8, 2, 2]"}, {"name": "upsample_kernel_sizes", "val": " = [16, 16, 4, 4]"}, {"name": "resblock_kernel_sizes", "val": " = [3, 7, 11]"}, {"name": "resblock_dilation_sizes", "val": " = [[1, 3, 5], [1, 3, 5], [1, 3, 5]]"}, {"name": "initializer_range", "val": " = 0.01"}, {"name": "leaky_relu_slope", "val": " = 0.1"}, {"name": "normalize_before", "val": " = True"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **model_in_dim** (`int`, *optional*, defaults to 80) --
  The number of frequency bins in the input log-mel spectrogram.
- **upsample_initial_channel** (`int`, *optional*, defaults to 512) --
  The number of input channels into the upsampling network.
- **upsample_rates** (`tuple[int]` or `list[int]`, *optional*, defaults to `[8, 8, 2, 2]`) --
  A tuple of integers defining the stride of each 1D convolutional layer in the upsampling network. The
  length of *upsample_rates* defines the number of convolutional layers and has to match the length of
  *upsample_kernel_sizes*.
- **upsample_kernel_sizes** (`tuple[int]` or `list[int]`, *optional*, defaults to `[16, 16, 4, 4]`) --
  A tuple of integers defining the kernel size of each 1D convolutional layer in the upsampling network. The
  length of *upsample_kernel_sizes* defines the number of convolutional layers and has to match the length of
  *upsample_rates*.
- **resblock_kernel_sizes** (`tuple[int]` or `list[int]`, *optional*, defaults to `[3, 7, 11]`) --
  A tuple of integers defining the kernel sizes of the 1D convolutional layers in the multi-receptive field
  fusion (MRF) module.
- **resblock_dilation_sizes** (`tuple[tuple[int]]` or `list[list[int]]`, *optional*, defaults to `[[1, 3, 5], [1, 3, 5], [1, 3, 5]]`) --
  A nested tuple of integers defining the dilation rates of the dilated 1D convolutional layers in the
  multi-receptive field fusion (MRF) module.
- **initializer_range** (`float`, *optional*, defaults to 0.01) --
  The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- **leaky_relu_slope** (`float`, *optional*, defaults to 0.1) --
  The angle of the negative slope used by the leaky ReLU activation.
- **normalize_before** (`bool`, *optional*, defaults to `True`) --
  Whether or not to normalize the spectrogram before vocoding using the vocoder's learned mean and variance.</paramsdesc><paramgroups>0</paramgroups></docstring>

This is the configuration class to store the configuration of a `FastSpeech2ConformerHifiGan`. It is used to
instantiate a FastSpeech2Conformer HiFi-GAN vocoder model according to the specified arguments, defining the model
architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the
FastSpeech2Conformer
[espnet/fastspeech2_conformer_hifigan](https://huggingface.co/espnet/fastspeech2_conformer_hifigan) architecture.

Configuration objects inherit from [PretrainedConfig](/docs/transformers/v4.57.0/en/main_classes/configuration#transformers.PretrainedConfig) and can be used to control the model outputs. Read the
documentation from [PretrainedConfig](/docs/transformers/v4.57.0/en/main_classes/configuration#transformers.PretrainedConfig) for more information.



<ExampleCodeBlock anchor="transformers.FastSpeech2ConformerHifiGanConfig.example">

Example:

```python
>>> from transformers import FastSpeech2ConformerHifiGan, FastSpeech2ConformerHifiGanConfig

>>> # Initializing a FastSpeech2ConformerHifiGan configuration
>>> configuration = FastSpeech2ConformerHifiGanConfig()

>>> # Initializing a model (with random weights) from the configuration
>>> model = FastSpeech2ConformerHifiGan(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

</ExampleCodeBlock>
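
The relationship between `upsample_rates`, `upsample_kernel_sizes`, and the length of the generated waveform can be checked with a little arithmetic. The sketch below is plain Python and does not call the library. It assumes each upsampling layer is a 1D transposed convolution with padding `(kernel_size - rate) // 2`, which is a common HiFi-GAN generator choice but is not stated on this page; the helper names are hypothetical. Under that assumption, each layer multiplies the sequence length by exactly its stride, so one spectrogram frame maps to the product of the rates in samples.

```python
from math import prod

# Defaults quoted from FastSpeech2ConformerHifiGanConfig above.
upsample_rates = [8, 8, 2, 2]
upsample_kernel_sizes = [16, 16, 4, 4]


def conv_transpose_length(length, kernel_size, stride, padding):
    """Output length of a 1D transposed convolution (dilation 1, no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel_size


def upsampled_length(num_frames, rates, kernel_sizes):
    """Number of waveform samples produced from `num_frames` spectrogram frames."""
    if len(rates) != len(kernel_sizes):
        raise ValueError("upsample_rates and upsample_kernel_sizes must have the same length")
    length = num_frames
    for rate, kernel_size in zip(rates, kernel_sizes):
        # Assumed padding; it makes each layer scale the length by exactly `rate`.
        padding = (kernel_size - rate) // 2
        length = conv_transpose_length(length, kernel_size, rate, padding)
    return length


samples_per_frame = prod(upsample_rates)
num_samples = upsampled_length(100, upsample_rates, upsample_kernel_sizes)
print(samples_per_frame)  # 256
print(num_samples)        # 25600
```

At the 22050 Hz sampling rate used in the usage examples, 100 frames would then correspond to 25600 / 22050, or roughly 1.16 seconds of audio.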

</div>

## FastSpeech2ConformerWithHifiGanConfig[[transformers.FastSpeech2ConformerWithHifiGanConfig]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>class transformers.FastSpeech2ConformerWithHifiGanConfig</name><anchor>transformers.FastSpeech2ConformerWithHifiGanConfig</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/fastspeech2_conformer/configuration_fastspeech2_conformer.py#L408</source><parameters>[{"name": "model_config", "val": ": typing.Optional[dict] = None"}, {"name": "vocoder_config", "val": ": typing.Optional[dict] = None"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **model_config** (`typing.Dict`, *optional*) --
  Configuration of the text-to-speech model.
- **vocoder_config** (`typing.Dict`, *optional*) --
  Configuration of the vocoder model.</paramsdesc><paramgroups>0</paramgroups></docstring>

This is the configuration class to store the configuration of a [FastSpeech2ConformerWithHifiGan](/docs/transformers/v4.57.0/en/model_doc/fastspeech2_conformer#transformers.FastSpeech2ConformerWithHifiGan). It is used to
instantiate a `FastSpeech2ConformerWithHifiGan` model according to the specified sub-model configurations,
defining the model architecture.

Instantiating a configuration with the defaults will yield a similar configuration to that of the
FastSpeech2ConformerModel [espnet/fastspeech2_conformer](https://huggingface.co/espnet/fastspeech2_conformer) and
FastSpeech2ConformerHifiGan
[espnet/fastspeech2_conformer_hifigan](https://huggingface.co/espnet/fastspeech2_conformer_hifigan) architectures.

Configuration objects inherit from [PretrainedConfig](/docs/transformers/v4.57.0/en/main_classes/configuration#transformers.PretrainedConfig) and can be used to control the model outputs. Read the
documentation from [PretrainedConfig](/docs/transformers/v4.57.0/en/main_classes/configuration#transformers.PretrainedConfig) for more information.



- **model_config** ([FastSpeech2ConformerConfig](/docs/transformers/v4.57.0/en/model_doc/fastspeech2_conformer#transformers.FastSpeech2ConformerConfig), *optional*) --
  Configuration of the text-to-speech model.
- **vocoder_config** (`FastSpeech2ConformerHifiGanConfig`, *optional*) --
  Configuration of the vocoder model.

<ExampleCodeBlock anchor="transformers.FastSpeech2ConformerWithHifiGanConfig.example">

Example:

```python
>>> from transformers import (
...     FastSpeech2ConformerConfig,
...     FastSpeech2ConformerHifiGanConfig,
...     FastSpeech2ConformerWithHifiGanConfig,
...     FastSpeech2ConformerWithHifiGan,
... )

>>> # Initializing FastSpeech2ConformerWithHifiGan sub-modules configurations.
>>> model_config = FastSpeech2ConformerConfig()
>>> vocoder_config = FastSpeech2ConformerHifiGanConfig()

>>> # Initializing a FastSpeech2ConformerWithHifiGan module style configuration
>>> configuration = FastSpeech2ConformerWithHifiGanConfig(model_config.to_dict(), vocoder_config.to_dict())

>>> # Initializing a model (with random weights)
>>> model = FastSpeech2ConformerWithHifiGan(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

</ExampleCodeBlock>


</div>

## FastSpeech2ConformerTokenizer[[transformers.FastSpeech2ConformerTokenizer]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>class transformers.FastSpeech2ConformerTokenizer</name><anchor>transformers.FastSpeech2ConformerTokenizer</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/fastspeech2_conformer/tokenization_fastspeech2_conformer.py#L32</source><parameters>[{"name": "vocab_file", "val": ""}, {"name": "bos_token", "val": " = '<sos/eos>'"}, {"name": "eos_token", "val": " = '<sos/eos>'"}, {"name": "pad_token", "val": " = '<blank>'"}, {"name": "unk_token", "val": " = '<unk>'"}, {"name": "should_strip_spaces", "val": " = False"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **vocab_file** (`str`) --
  Path to the vocabulary file.
- **bos_token** (`str`, *optional*, defaults to `"<sos/eos>"`) --
  The beginning of sequence token. Note that for FastSpeech2, it is the same as the `eos_token`.
- **eos_token** (`str`, *optional*, defaults to `"<sos/eos>"`) --
  The end of sequence token. Note that for FastSpeech2, it is the same as the `bos_token`.
- **pad_token** (`str`, *optional*, defaults to `"<blank>"`) --
  The token used for padding, for example when batching sequences of different lengths.
- **unk_token** (`str`, *optional*, defaults to `"<unk>"`) --
  The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
  token instead.
- **should_strip_spaces** (`bool`, *optional*, defaults to `False`) --
  Whether or not to strip the spaces from the list of tokens.</paramsdesc><paramgroups>0</paramgroups></docstring>

Construct a FastSpeech2Conformer tokenizer.





<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>__call__</name><anchor>transformers.FastSpeech2ConformerTokenizer.__call__</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/tokenization_utils_base.py#L2855</source><parameters>[{"name": "text", "val": ": typing.Union[str, list[str], list[list[str]], NoneType] = None"}, {"name": "text_pair", "val": ": typing.Union[str, list[str], list[list[str]], NoneType] = None"}, {"name": "text_target", "val": ": typing.Union[str, list[str], list[list[str]], NoneType] = None"}, {"name": "text_pair_target", "val": ": typing.Union[str, list[str], list[list[str]], NoneType] = None"}, {"name": "add_special_tokens", "val": ": bool = True"}, {"name": "padding", "val": ": typing.Union[bool, str, transformers.utils.generic.PaddingStrategy] = False"}, {"name": "truncation", "val": ": typing.Union[bool, str, transformers.tokenization_utils_base.TruncationStrategy, NoneType] = None"}, {"name": "max_length", "val": ": typing.Optional[int] = None"}, {"name": "stride", "val": ": int = 0"}, {"name": "is_split_into_words", "val": ": bool = False"}, {"name": "pad_to_multiple_of", "val": ": typing.Optional[int] = None"}, {"name": "padding_side", "val": ": typing.Optional[str] = None"}, {"name": "return_tensors", "val": ": typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None"}, {"name": "return_token_type_ids", "val": ": typing.Optional[bool] = None"}, {"name": "return_attention_mask", "val": ": typing.Optional[bool] = None"}, {"name": "return_overflowing_tokens", "val": ": bool = False"}, {"name": "return_special_tokens_mask", "val": ": bool = False"}, {"name": "return_offsets_mapping", "val": ": bool = False"}, {"name": "return_length", "val": ": bool = False"}, {"name": "verbose", "val": ": bool = True"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **text** (`str`, `list[str]`, `list[list[str]]`, *optional*) --
  The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
  (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
  `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
- **text_pair** (`str`, `list[str]`, `list[list[str]]`, *optional*) --
  The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
  (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
  `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
- **text_target** (`str`, `list[str]`, `list[list[str]]`, *optional*) --
  The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
  list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
  you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
- **text_pair_target** (`str`, `list[str]`, `list[list[str]]`, *optional*) --
  The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
  list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
  you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).

- **add_special_tokens** (`bool`, *optional*, defaults to `True`) --
  Whether or not to add special tokens when encoding the sequences. This will use the underlying
  `PretrainedTokenizerBase.build_inputs_with_special_tokens` function, which defines which tokens are
  automatically added to the input ids. This is useful if you want to add `bos` or `eos` tokens
  automatically.
- **padding** (`bool`, `str` or [PaddingStrategy](/docs/transformers/v4.57.0/en/internal/file_utils#transformers.utils.PaddingStrategy), *optional*, defaults to `False`) --
  Activates and controls padding. Accepts the following values:

  - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
    sequence is provided).
  - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
    acceptable input length for the model if that argument is not provided.
  - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
    lengths).
- **truncation** (`bool`, `str` or [TruncationStrategy](/docs/transformers/v4.57.0/en/internal/tokenization_utils#transformers.tokenization_utils_base.TruncationStrategy), *optional*, defaults to `False`) --
  Activates and controls truncation. Accepts the following values:

  - `True` or `'longest_first'`: Truncate to a maximum length specified with the argument `max_length` or
    to the maximum acceptable input length for the model if that argument is not provided. This will
    truncate token by token, removing a token from the longest sequence in the pair if a pair of
    sequences (or a batch of pairs) is provided.
  - `'only_first'`: Truncate to a maximum length specified with the argument `max_length` or to the
    maximum acceptable input length for the model if that argument is not provided. This will only
    truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  - `'only_second'`: Truncate to a maximum length specified with the argument `max_length` or to the
    maximum acceptable input length for the model if that argument is not provided. This will only
    truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  - `False` or `'do_not_truncate'` (default): No truncation (i.e., can output batch with sequence lengths
    greater than the model maximum admissible input size).
- **max_length** (`int`, *optional*) --
  Controls the maximum length to use by one of the truncation/padding parameters.

  If left unset or set to `None`, this will use the predefined model maximum length if a maximum length
  is required by one of the truncation/padding parameters. If the model has no specific maximum input
  length (like XLNet), truncation/padding to a maximum length will be deactivated.
- **stride** (`int`, *optional*, defaults to 0) --
  If set to a number along with `max_length`, the overflowing tokens returned when
  `return_overflowing_tokens=True` will contain some tokens from the end of the truncated sequence,
  which provides some overlap between the truncated and overflowing sequences. The value of this
  argument defines the number of overlapping tokens.
- **is_split_into_words** (`bool`, *optional*, defaults to `False`) --
  Whether or not the input is already pre-tokenized (e.g., split into words). If set to `True`, the
  tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace)
  which it will tokenize. This is useful for NER or token classification.
- **pad_to_multiple_of** (`int`, *optional*) --
  If set, will pad the sequence to a multiple of the provided value. Requires `padding` to be activated.
  This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
  `>= 7.0` (Volta).
- **padding_side** (`str`, *optional*) --
  The side on which the model should have padding applied. Must be either `'right'` or `'left'`.
  Default value is picked from the class attribute of the same name.
- **return_tensors** (`str` or [TensorType](/docs/transformers/v4.57.0/en/internal/file_utils#transformers.TensorType), *optional*) --
  If set, will return tensors instead of lists of Python integers. Acceptable values are:

  - `'tf'`: Return TensorFlow `tf.constant` objects.
  - `'pt'`: Return PyTorch `torch.Tensor` objects.
  - `'np'`: Return NumPy `np.ndarray` objects.

- **return_token_type_ids** (`bool`, *optional*) --
  Whether to return token type IDs. If left to the default, will return the token type IDs according to
  the specific tokenizer's default, defined by the `return_outputs` attribute.

  [What are token type IDs?](../glossary#token-type-ids)
- **return_attention_mask** (`bool`, *optional*) --
  Whether to return the attention mask. If left to the default, will return the attention mask according
  to the specific tokenizer's default, defined by the `return_outputs` attribute.

  [What are attention masks?](../glossary#attention-mask)
- **return_overflowing_tokens** (`bool`, *optional*, defaults to `False`) --
  Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch
  of pairs) is provided with `truncation_strategy = longest_first` or `True`, an error is raised instead
  of returning overflowing tokens.
- **return_special_tokens_mask** (`bool`, *optional*, defaults to `False`) --
  Whether or not to return special tokens mask information.
- **return_offsets_mapping** (`bool`, *optional*, defaults to `False`) --
  Whether or not to return `(char_start, char_end)` for each token.

  This is only available on fast tokenizers inheriting from [PreTrainedTokenizerFast](/docs/transformers/v4.57.0/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast). With a
  Python (slow) tokenizer, this method raises `NotImplementedError`.
- **return_length**  (`bool`, *optional*, defaults to `False`) --
  Whether or not to return the lengths of the encoded inputs.
- **verbose** (`bool`, *optional*, defaults to `True`) --
  Whether or not to print more information and warnings.
- ****kwargs** -- passed to the `self.tokenize()` method</paramsdesc><paramgroups>0</paramgroups><rettype>[BatchEncoding](/docs/transformers/v4.57.0/en/main_classes/tokenizer#transformers.BatchEncoding)</rettype><retdesc>A [BatchEncoding](/docs/transformers/v4.57.0/en/main_classes/tokenizer#transformers.BatchEncoding) with the following fields:

- **input_ids** -- List of token ids to be fed to a model.

  [What are input IDs?](../glossary#input-ids)

- **token_type_ids** -- List of token type ids to be fed to a model (when `return_token_type_ids=True` or
  if *"token_type_ids"* is in `self.model_input_names`).

  [What are token type IDs?](../glossary#token-type-ids)

- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
  `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names`).

  [What are attention masks?](../glossary#attention-mask)

- **overflowing_tokens** -- List of overflowing tokens sequences (when a `max_length` is specified and
  `return_overflowing_tokens=True`).
- **num_truncated_tokens** -- Number of tokens truncated (when a `max_length` is specified and
  `return_overflowing_tokens=True`).
- **special_tokens_mask** -- List of 0s and 1s, with 1 specifying added special tokens and 0 specifying
  regular sequence tokens (when `add_special_tokens=True` and `return_special_tokens_mask=True`).
- **length** -- The length of the inputs (when `return_length=True`)</retdesc></docstring>

Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of
sequences.
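
The interaction between `padding`, `max_length`, `pad_to_multiple_of`, and `padding_side` can be easier to follow in code. The following is a minimal plain-Python sketch of the semantics described above, not the library's implementation. The function name `pad_batch` and the pad id `0` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


def pad_batch(
    batch: List[List[int]],
    pad_id: int = 0,  # hypothetical pad token id, for illustration only
    padding: str = "longest",  # "longest", "max_length" or "do_not_pad"
    max_length: Optional[int] = None,
    pad_to_multiple_of: Optional[int] = None,
    padding_side: str = "right",
) -> Dict[str, List[List[int]]]:
    """Pad lists of token ids and build the matching attention masks."""
    if padding == "do_not_pad":
        # Sequences keep their own lengths; every position is a real token.
        return {
            "input_ids": [list(ids) for ids in batch],
            "attention_mask": [[1] * len(ids) for ids in batch],
        }

    if padding == "longest":
        target = max(len(ids) for ids in batch)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("this sketch requires max_length for 'max_length' padding")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")

    # Round the target length up to the next multiple, if requested.
    if pad_to_multiple_of is not None and target % pad_to_multiple_of != 0:
        target = (target // pad_to_multiple_of + 1) * pad_to_multiple_of

    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = max(target - len(ids), 0)  # padding never truncates
        pad, zeros, ones = [pad_id] * n_pad, [0] * n_pad, [1] * len(ids)
        if padding_side == "right":
            input_ids.append(list(ids) + pad)
            attention_mask.append(ones + zeros)
        else:
            input_ids.append(pad + list(ids))
            attention_mask.append(zeros + ones)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([[5, 6, 7], [8]], pad_to_multiple_of=8)
print(padded["input_ids"][1])
print(padded["attention_mask"][1])
```

Here the longest sequence has three tokens, so the target length is rounded up to 8. The attention mask marks padded positions with `0`, which is the convention the `attention_mask` field uses.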








</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>save_vocabulary</name><anchor>transformers.FastSpeech2ConformerTokenizer.save_vocabulary</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/fastspeech2_conformer/tokenization_fastspeech2_conformer.py#L146</source><parameters>[{"name": "save_directory", "val": ": str"}, {"name": "filename_prefix", "val": ": typing.Optional[str] = None"}]</parameters><paramsdesc>- **save_directory** (`str`) --
  The directory in which to save the vocabulary.</paramsdesc><paramgroups>0</paramgroups><rettype>`Tuple(str)`</rettype><retdesc>Paths to the files saved.</retdesc></docstring>

Save the vocabulary and special tokens file to a directory.








</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>decode</name><anchor>transformers.FastSpeech2ConformerTokenizer.decode</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/fastspeech2_conformer/tokenization_fastspeech2_conformer.py#L133</source><parameters>[{"name": "token_ids", "val": ""}, {"name": "**kwargs", "val": ""}]</parameters></docstring>


</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>batch_decode</name><anchor>transformers.FastSpeech2ConformerTokenizer.batch_decode</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/tokenization_utils_base.py#L3860</source><parameters>[{"name": "sequences", "val": ": typing.Union[list[int], list[list[int]], ForwardRef('np.ndarray'), ForwardRef('torch.Tensor'), ForwardRef('tf.Tensor')]"}, {"name": "skip_special_tokens", "val": ": bool = False"}, {"name": "clean_up_tokenization_spaces", "val": ": typing.Optional[bool] = None"}, {"name": "**kwargs", "val": ""}]</parameters><paramsdesc>- **sequences** (`Union[list[int], list[list[int]], np.ndarray, torch.Tensor, tf.Tensor]`) --
  List of tokenized input ids. Can be obtained using the `__call__` method.
- **skip_special_tokens** (`bool`, *optional*, defaults to `False`) --
  Whether or not to remove special tokens in the decoding.
- **clean_up_tokenization_spaces** (`bool`, *optional*) --
  Whether or not to clean up the tokenization spaces. If `None`, will default to
  `self.clean_up_tokenization_spaces`.
- **kwargs** (additional keyword arguments, *optional*) --
  Will be passed to the underlying model specific decode method.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[str]`</rettype><retdesc>The list of decoded sentences.</retdesc></docstring>

Convert a list of lists of token ids into a list of strings by calling decode.








</div></div>

## FastSpeech2ConformerModel[[transformers.FastSpeech2ConformerModel]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>class transformers.FastSpeech2ConformerModel</name><anchor>transformers.FastSpeech2ConformerModel</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/fastspeech2_conformer/modeling_fastspeech2_conformer.py#L1016</source><parameters>[{"name": "config", "val": ": FastSpeech2ConformerConfig"}]</parameters><paramsdesc>- **config** ([FastSpeech2ConformerConfig](/docs/transformers/v4.57.0/en/model_doc/fastspeech2_conformer#transformers.FastSpeech2ConformerConfig)) --
  Model configuration class with all the parameters of the model. Initializing with a config file does not
  load the weights associated with the model, only the configuration. Check out the
  [from_pretrained()](/docs/transformers/v4.57.0/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.</paramsdesc><paramgroups>0</paramgroups></docstring>

FastSpeech2Conformer Model.

This model inherits from [PreTrainedModel](/docs/transformers/v4.57.0/en/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads,
etc.).

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.





<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>forward</name><anchor>transformers.FastSpeech2ConformerModel.forward</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/fastspeech2_conformer/modeling_fastspeech2_conformer.py#L1093</source><parameters>[{"name": "input_ids", "val": ": LongTensor"}, {"name": "attention_mask", "val": ": typing.Optional[torch.LongTensor] = None"}, {"name": "spectrogram_labels", "val": ": typing.Optional[torch.FloatTensor] = None"}, {"name": "duration_labels", "val": ": typing.Optional[torch.LongTensor] = None"}, {"name": "pitch_labels", "val": ": typing.Optional[torch.FloatTensor] = None"}, {"name": "energy_labels", "val": ": typing.Optional[torch.FloatTensor] = None"}, {"name": "speaker_ids", "val": ": typing.Optional[torch.LongTensor] = None"}, {"name": "lang_ids", "val": ": typing.Optional[torch.LongTensor] = None"}, {"name": "speaker_embedding", "val": ": typing.Optional[torch.FloatTensor] = None"}, {"name": "return_dict", "val": ": typing.Optional[bool] = None"}, {"name": "output_attentions", "val": ": typing.Optional[bool] = None"}, {"name": "output_hidden_states", "val": ": typing.Optional[bool] = None"}]</parameters><paramsdesc>- **input_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`) --
  Indices of the input sequence tokens in the vocabulary, as produced by the tokenizer.
- **attention_mask** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) --
  Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

  - 1 for tokens that are **not masked**,
  - 0 for tokens that are **masked**.

  [What are attention masks?](../glossary#attention-mask)
- **spectrogram_labels** (`torch.FloatTensor` of shape `(batch_size, max_spectrogram_length, num_mel_bins)`, *optional*, defaults to `None`) --
  Batch of padded target features.
- **duration_labels** (`torch.LongTensor` of shape `(batch_size, sequence_length + 1)`, *optional*, defaults to `None`) --
  Batch of padded durations.
- **pitch_labels** (`torch.FloatTensor` of shape `(batch_size, sequence_length + 1, 1)`, *optional*, defaults to `None`) --
  Batch of padded token-averaged pitch.
- **energy_labels** (`torch.FloatTensor` of shape `(batch_size, sequence_length + 1, 1)`, *optional*, defaults to `None`) --
  Batch of padded token-averaged energy.
- **speaker_ids** (`torch.LongTensor` of shape `(batch_size, 1)`, *optional*, defaults to `None`) --
  Speaker ids used to condition features of speech output by the model.
- **lang_ids** (`torch.LongTensor` of shape `(batch_size, 1)`, *optional*, defaults to `None`) --
  Language ids used to condition features of speech output by the model.
- **speaker_embedding** (`torch.FloatTensor` of shape `(batch_size, embedding_dim)`, *optional*, defaults to `None`) --
  Embedding containing conditioning signals for the features of the speech.
- **return_dict** (`bool`, *optional*) --
  Whether or not to return a [ModelOutput](/docs/transformers/v4.57.0/en/main_classes/output#transformers.utils.ModelOutput) instead of a plain tuple.
- **output_attentions** (`bool`, *optional*) --
  Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
  tensors for more detail.
- **output_hidden_states** (`bool`, *optional*) --
  Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
  more detail.</paramsdesc><paramgroups>0</paramgroups><rettype>`transformers.models.fastspeech2_conformer.modeling_fastspeech2_conformer.FastSpeech2ConformerModelOutput` or `tuple(torch.FloatTensor)`</rettype><retdesc>A `transformers.models.fastspeech2_conformer.modeling_fastspeech2_conformer.FastSpeech2ConformerModelOutput` or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([FastSpeech2ConformerConfig](/docs/transformers/v4.57.0/en/model_doc/fastspeech2_conformer#transformers.FastSpeech2ConformerConfig)) and inputs.

- **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when the label tensors (`spectrogram_labels`, `duration_labels`, `pitch_labels`, `energy_labels`) are provided) -- Spectrogram generation loss.
- **spectrogram** (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_bins)`, *optional*, defaults to `None`) -- The predicted spectrogram.
- **encoder_last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*, defaults to `None`) -- Sequence of hidden-states at the output of the last layer of the encoder of the model.
- **encoder_hidden_states** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
  one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

  Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- **encoder_attentions** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
  sequence_length)`.

  Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the
  self-attention heads.
- **decoder_hidden_states** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
  one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

  Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- **decoder_attentions** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
  sequence_length)`.

  Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the
  self-attention heads.
- **duration_outputs** (`torch.LongTensor` of shape `(batch_size, max_text_length + 1)`, *optional*) -- Outputs of the duration predictor.
- **pitch_outputs** (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*) -- Outputs of the pitch predictor.
- **energy_outputs** (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*) -- Outputs of the energy predictor.</retdesc></docstring>
The [FastSpeech2ConformerModel](/docs/transformers/v4.57.0/en/model_doc/fastspeech2_conformer#transformers.FastSpeech2ConformerModel) forward method overrides the `__call__` special method.

<Tip>

Although the forward pass is defined within this function, call the `Module` instance rather than `forward`
directly: calling the instance runs the pre- and post-processing steps, whereas calling `forward` directly
silently skips them.

</Tip>







<ExampleCodeBlock anchor="transformers.FastSpeech2ConformerModel.forward.example">

Example:

```python
>>> from transformers import (
...     FastSpeech2ConformerTokenizer,
...     FastSpeech2ConformerModel,
...     FastSpeech2ConformerHifiGan,
... )

>>> tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
>>> inputs = tokenizer("some text to convert to speech", return_tensors="pt")
>>> input_ids = inputs["input_ids"]

>>> model = FastSpeech2ConformerModel.from_pretrained("espnet/fastspeech2_conformer")
>>> output_dict = model(input_ids, return_dict=True)
>>> spectrogram = output_dict["spectrogram"]

>>> vocoder = FastSpeech2ConformerHifiGan.from_pretrained("espnet/fastspeech2_conformer_hifigan")
>>> waveform = vocoder(spectrogram)
>>> print(waveform.shape)
torch.Size([1, 49664])
```

</ExampleCodeBlock>


</div></div>
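
The `duration_labels` and `duration_outputs` tensors hold one frame count per input token, and they link the text length to the spectrogram length. As a rough illustration of that relationship, here is a plain-Python sketch of a FastSpeech-style length regulator, which repeats each token's hidden vector as many times as its duration. The function name `regulate_length` and the toy values are hypothetical. This is not the library's implementation, which operates on batched tensors.

```python
from typing import List, Sequence


def regulate_length(
    token_states: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Expand per-token vectors into per-frame vectors using integer durations."""
    if len(token_states) != len(durations):
        raise ValueError("expected one duration per token")
    frames: List[List[float]] = []
    for state, duration in zip(token_states, durations):
        # A duration of 0 drops the token; a duration of d emits d copies.
        frames.extend(list(state) for _ in range(duration))
    return frames


# Three toy tokens with 2-dimensional hidden states (illustrative values).
states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
frames = regulate_length(states, durations=[2, 0, 3])
print(len(frames))  # equals sum(durations)
```

Under this view, the number of spectrogram frames for an utterance is the sum of its token durations.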

## FastSpeech2ConformerHifiGan[[transformers.FastSpeech2ConformerHifiGan]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>class transformers.FastSpeech2ConformerHifiGan</name><anchor>transformers.FastSpeech2ConformerHifiGan</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/fastspeech2_conformer/modeling_fastspeech2_conformer.py#L1354</source><parameters>[{"name": "config", "val": ": FastSpeech2ConformerHifiGanConfig"}]</parameters><paramsdesc>- **config** ([FastSpeech2ConformerHifiGanConfig](/docs/transformers/v4.57.0/en/model_doc/fastspeech2_conformer#transformers.FastSpeech2ConformerHifiGanConfig)) --
  Model configuration class with all the parameters of the model. Initializing with a config file does not
  load the weights associated with the model, only the configuration. Check out the
  [from_pretrained()](/docs/transformers/v4.57.0/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.</paramsdesc><paramgroups>0</paramgroups></docstring>

HiFi-GAN vocoder.

This model inherits from [PreTrainedModel](/docs/transformers/v4.57.0/en/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads,
etc.).

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.





<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>forward</name><anchor>transformers.FastSpeech2ConformerHifiGan.forward</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/fastspeech2_conformer/modeling_fastspeech2_conformer.py#L1423</source><parameters>[{"name": "spectrogram", "val": ": FloatTensor"}]</parameters><paramsdesc>- **spectrogram** (`torch.FloatTensor`) --
  Tensor containing the log-mel spectrograms. Can be batched and of shape `(batch_size, sequence_length,
  config.model_in_dim)`, or un-batched and of shape `(sequence_length, config.model_in_dim)`.</paramsdesc><paramgroups>0</paramgroups><rettype>`torch.FloatTensor`</rettype><retdesc>Tensor containing the speech waveform. If the input spectrogram is batched, will be of
shape `(batch_size, num_frames)`. If un-batched, will be of shape `(num_frames,)`.</retdesc></docstring>

Converts a log-mel spectrogram into a speech waveform. Passing a batch of log-mel spectrograms returns a batch
of speech waveforms. Passing a single, un-batched log-mel spectrogram returns a single, un-batched speech
waveform.








</div></div>
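
The batched and un-batched shapes described above can be worked through as arithmetic. The sketch below assumes that the vocoder turns each spectrogram frame into exactly `prod(upsample_rates)` audio samples and that the rates are `(8, 8, 2, 2)`. Both are assumptions about the checkpoint configuration and should be checked against the actual `FastSpeech2ConformerHifiGanConfig`. The helper name `expected_waveform_shape` is hypothetical.

```python
from math import prod
from typing import Sequence, Tuple

# Assumed upsampling factors; verify against the vocoder config in use.
ASSUMED_UPSAMPLE_RATES = (8, 8, 2, 2)


def expected_waveform_shape(
    spectrogram_shape: Sequence[int],
    upsample_rates: Sequence[int] = ASSUMED_UPSAMPLE_RATES,
) -> Tuple[int, ...]:
    """Estimate the waveform shape for a batched or un-batched spectrogram shape."""
    samples_per_frame = prod(upsample_rates)
    if len(spectrogram_shape) == 3:
        # Batched input: (batch_size, sequence_length, model_in_dim)
        batch_size, num_frames, _ = spectrogram_shape
        return (batch_size, num_frames * samples_per_frame)
    if len(spectrogram_shape) == 2:
        # Un-batched input: (sequence_length, model_in_dim)
        num_frames, _ = spectrogram_shape
        return (num_frames * samples_per_frame,)
    raise ValueError("expected a 2D or 3D spectrogram shape")


print(expected_waveform_shape((1, 194, 80)))  # batched
print(expected_waveform_shape((194, 80)))  # un-batched
```

With these assumed rates, a 194-frame spectrogram maps to 194 × 256 = 49664 samples, which is consistent with the `torch.Size([1, 49664])` printed in the examples on this page. The frame count of 194 is itself an inference, not a documented value.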

## FastSpeech2ConformerWithHifiGan[[transformers.FastSpeech2ConformerWithHifiGan]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>class transformers.FastSpeech2ConformerWithHifiGan</name><anchor>transformers.FastSpeech2ConformerWithHifiGan</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/fastspeech2_conformer/modeling_fastspeech2_conformer.py#L1478</source><parameters>[{"name": "config", "val": ": FastSpeech2ConformerWithHifiGanConfig"}]</parameters><paramsdesc>- **config** ([FastSpeech2ConformerWithHifiGanConfig](/docs/transformers/v4.57.0/en/model_doc/fastspeech2_conformer#transformers.FastSpeech2ConformerWithHifiGanConfig)) --
  Model configuration class with all the parameters of the model. Initializing with a config file does not
  load the weights associated with the model, only the configuration. Check out the
  [from_pretrained()](/docs/transformers/v4.57.0/en/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.</paramsdesc><paramgroups>0</paramgroups></docstring>

The FastSpeech2ConformerModel with a FastSpeech2ConformerHifiGan vocoder head that performs text-to-speech (waveform).

This model inherits from [PreTrainedModel](/docs/transformers/v4.57.0/en/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads,
etc.).

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.





<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>forward</name><anchor>transformers.FastSpeech2ConformerWithHifiGan.forward</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/fastspeech2_conformer/modeling_fastspeech2_conformer.py#L1489</source><parameters>[{"name": "input_ids", "val": ": LongTensor"}, {"name": "attention_mask", "val": ": typing.Optional[torch.LongTensor] = None"}, {"name": "spectrogram_labels", "val": ": typing.Optional[torch.FloatTensor] = None"}, {"name": "duration_labels", "val": ": typing.Optional[torch.LongTensor] = None"}, {"name": "pitch_labels", "val": ": typing.Optional[torch.FloatTensor] = None"}, {"name": "energy_labels", "val": ": typing.Optional[torch.FloatTensor] = None"}, {"name": "speaker_ids", "val": ": typing.Optional[torch.LongTensor] = None"}, {"name": "lang_ids", "val": ": typing.Optional[torch.LongTensor] = None"}, {"name": "speaker_embedding", "val": ": typing.Optional[torch.FloatTensor] = None"}, {"name": "return_dict", "val": ": typing.Optional[bool] = None"}, {"name": "output_attentions", "val": ": typing.Optional[bool] = None"}, {"name": "output_hidden_states", "val": ": typing.Optional[bool] = None"}]</parameters><paramsdesc>- **input_ids** (`torch.LongTensor` of shape `(batch_size, sequence_length)`) --
  Indices of the input sequence tokens in the vocabulary, as produced by the tokenizer.
- **attention_mask** (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*) --
  Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

  - 1 for tokens that are **not masked**,
  - 0 for tokens that are **masked**.

  [What are attention masks?](../glossary#attention-mask)
- **spectrogram_labels** (`torch.FloatTensor` of shape `(batch_size, max_spectrogram_length, num_mel_bins)`, *optional*, defaults to `None`) --
  Batch of padded target features.
- **duration_labels** (`torch.LongTensor` of shape `(batch_size, sequence_length + 1)`, *optional*, defaults to `None`) --
  Batch of padded durations.
- **pitch_labels** (`torch.FloatTensor` of shape `(batch_size, sequence_length + 1, 1)`, *optional*, defaults to `None`) --
  Batch of padded token-averaged pitch.
- **energy_labels** (`torch.FloatTensor` of shape `(batch_size, sequence_length + 1, 1)`, *optional*, defaults to `None`) --
  Batch of padded token-averaged energy.
- **speaker_ids** (`torch.LongTensor` of shape `(batch_size, 1)`, *optional*, defaults to `None`) --
  Speaker ids used to condition features of speech output by the model.
- **lang_ids** (`torch.LongTensor` of shape `(batch_size, 1)`, *optional*, defaults to `None`) --
  Language ids used to condition features of speech output by the model.
- **speaker_embedding** (`torch.FloatTensor` of shape `(batch_size, embedding_dim)`, *optional*, defaults to `None`) --
  Embedding containing conditioning signals for the features of the speech.
- **return_dict** (`bool`, *optional*) --
  Whether or not to return a [ModelOutput](/docs/transformers/v4.57.0/en/main_classes/output#transformers.utils.ModelOutput) instead of a plain tuple.
- **output_attentions** (`bool`, *optional*) --
  Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
  tensors for more detail.
- **output_hidden_states** (`bool`, *optional*) --
  Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
  more detail.</paramsdesc><paramgroups>0</paramgroups><rettype>`transformers.models.fastspeech2_conformer.modeling_fastspeech2_conformer.FastSpeech2ConformerModelOutput` or `tuple(torch.FloatTensor)`</rettype><retdesc>A `transformers.models.fastspeech2_conformer.modeling_fastspeech2_conformer.FastSpeech2ConformerModelOutput` or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([FastSpeech2ConformerConfig](/docs/transformers/v4.57.0/en/model_doc/fastspeech2_conformer#transformers.FastSpeech2ConformerConfig)) and inputs.

- **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when the label tensors (`spectrogram_labels`, `duration_labels`, `pitch_labels`, `energy_labels`) are provided) -- Spectrogram generation loss.
- **spectrogram** (`torch.FloatTensor` of shape `(batch_size, sequence_length, num_bins)`, *optional*, defaults to `None`) -- The predicted spectrogram.
- **encoder_last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*, defaults to `None`) -- Sequence of hidden-states at the output of the last layer of the encoder of the model.
- **encoder_hidden_states** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
  one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

  Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
- **encoder_attentions** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
  sequence_length)`.

  Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the
  self-attention heads.
- **decoder_hidden_states** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
  one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

  Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
- **decoder_attentions** (`tuple[torch.FloatTensor]`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
  sequence_length)`.

  Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the
  self-attention heads.
- **duration_outputs** (`torch.LongTensor` of shape `(batch_size, max_text_length + 1)`, *optional*) -- Outputs of the duration predictor.
- **pitch_outputs** (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*) -- Outputs of the pitch predictor.
- **energy_outputs** (`torch.FloatTensor` of shape `(batch_size, max_text_length + 1, 1)`, *optional*) -- Outputs of the energy predictor.</retdesc></docstring>
The [FastSpeech2ConformerWithHifiGan](/docs/transformers/v4.57.0/en/model_doc/fastspeech2_conformer#transformers.FastSpeech2ConformerWithHifiGan) forward method overrides the `__call__` special method.

<Tip>

Although the forward pass is defined within this function, call the `Module` instance rather than `forward`
directly: calling the instance runs the pre- and post-processing steps, whereas calling `forward` directly
silently skips them.

</Tip>







<ExampleCodeBlock anchor="transformers.FastSpeech2ConformerWithHifiGan.forward.example">

Example:

```python
>>> from transformers import (
...     FastSpeech2ConformerTokenizer,
...     FastSpeech2ConformerWithHifiGan,
... )

>>> tokenizer = FastSpeech2ConformerTokenizer.from_pretrained("espnet/fastspeech2_conformer")
>>> inputs = tokenizer("some text to convert to speech", return_tensors="pt")
>>> input_ids = inputs["input_ids"]

>>> model = FastSpeech2ConformerWithHifiGan.from_pretrained("espnet/fastspeech2_conformer_with_hifigan")
>>> output_dict = model(input_ids, return_dict=True)
>>> waveform = output_dict["waveform"]
>>> print(waveform.shape)
torch.Size([1, 49664])
```

</ExampleCodeBlock>


</div></div>
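
Once a waveform has been generated, it usually needs to be written to an audio file. The sketch below shows one way to do that with only the Python standard library. It assumes mono audio with float samples in `[-1, 1]` and a sampling rate of 22050 Hz. The sampling rate is an assumption about the checkpoint, and the helper name `write_wav` is hypothetical.

```python
import struct
import wave
from typing import Iterable


def write_wav(path: str, samples: Iterable[float], sample_rate: int = 22050) -> None:
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file."""
    pcm = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)  # assumed rate; check the checkpoint
        wav_file.writeframes(bytes(pcm))
```

For a waveform tensor of shape `(1, num_samples)` like the one in the example above, the samples could be passed as `waveform.squeeze().detach().tolist()`.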
