BeigeTTS: Research Release for Neural Speech Synthesis

Overview

BeigeTTS is a research release from BlandAI, representing a scaled-down version of our production Khaki TTS system. This model demonstrates state-of-the-art neural speech synthesis capabilities by combining Google's Gemma-3 4B architecture with NeuCodec audio token generation. We're releasing BeigeTTS to the research community to advance the field of neural speech synthesis and enable academic exploration of large-scale TTS architectures.

Research Context & Motivation

BeigeTTS serves as a public research artifact derived from our larger Khaki TTS system, which powers BlandAI's production speech synthesis infrastructure. While Khaki operates at significantly larger scale with enhanced capabilities including:

  • Multi-speaker voice cloning (10,000+ voices)
  • Real-time multilingual synthesis (57 languages)
  • Emotion and prosody transfer
  • Sub-50ms streaming latency
  • Production-grade robustness

BeigeTTS represents the core architectural innovations in a more accessible 4B parameter model suitable for research purposes.

Technical Architecture

Model Foundation

  • Base Model: Google Gemma-3 4B Instruct
  • Parameter Count: ~4 billion parameters (Khaki uses 70B+)
  • Audio Codec: NeuCodec (24kHz, single codebook)
  • Training Steps: 1,435,000 steps
  • Context Length: 2048 tokens
  • Vocabulary Size: Extended to 327,690 tokens (includes NeuCodec token space)

Research Implications

This release enables researchers to explore:

  1. Unified Text-Audio Modeling: How large language models can be adapted for audio generation tasks
  2. Token-Based Audio Synthesis: Advantages of discrete token representations over continuous methods
  3. Efficient Streaming: Real-time generation with minimal latency
  4. Cross-Modal Learning: Transfer learning between text and audio modalities

Token Space Design

The model employs a unified token space combining text and audio:

Standard Gemma Tokens: 0-262,144
Special Audio Markers:
  - AUDIO_START: 262,145
  - AUDIO_END: 262,146
NeuCodec Audio Tokens: 262,154-327,689 (65,536 tokens)
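A minimal Python sketch of this layout, using the boundaries listed above (the helper names and the "special" label for the reserved ids between AUDIO_END and the codec range are illustrative, not part of the release):

```python
# Token-space boundaries from the table above
GEMMA_MAX = 262_144          # last standard Gemma token id
AUDIO_START = 262_145
AUDIO_END = 262_146
NEUCODEC_BASE = 262_154      # first NeuCodec audio token
NEUCODEC_LAST = 327_689      # 65,536 audio tokens in total

def classify_token(token_id: int) -> str:
    """Map a unified-vocabulary token id to its region."""
    if token_id <= GEMMA_MAX:
        return "text"
    if token_id == AUDIO_START:
        return "audio_start"
    if token_id == AUDIO_END:
        return "audio_end"
    if NEUCODEC_BASE <= token_id <= NEUCODEC_LAST:
        return "audio"
    return "special"  # reserved ids between the markers and the codec range

def to_codec_index(token_id: int) -> int:
    """Rebase a NeuCodec token id into the codec's own 0..65,535 range."""
    assert classify_token(token_id) == "audio"
    return token_id - NEUCODEC_BASE
```

Note that the 65,536 audio tokens exactly fill the extended vocabulary: 262,154 + 65,536 = 327,690, the vocabulary size quoted above.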

Capabilities & Limitations

Current Capabilities (BeigeTTS)

  • High-quality English speech synthesis
  • Natural prosody and intonation
  • Streaming generation support
  • Adjustable speaking rate and style
  • Context-aware generation

Production Capabilities (Khaki - Not Released)

  • Multilingual: 57 languages with accent control
  • Voice Cloning: Zero-shot and few-shot speaker adaptation
  • Emotion Control: 12 distinct emotional states
  • Ultra-Low Latency: <50ms time-to-first-audio
  • Long-Form: Stable generation for 30+ minute audio
  • Voice Conversion: Real-time voice transformation
  • Singing Synthesis: Musical vocal generation

Research Limitations

BeigeTTS is released for non-commercial research purposes only. Key limitations include:

  • English-only synthesis (multilingual reserved for Khaki)
  • Single speaker (multi-speaker in Khaki)
  • 10-second maximum generation (unlimited in Khaki)
  • No voice cloning (available in Khaki)
  • Research license only
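The 10-second cap translates into a token budget for `max_new_tokens`. A back-of-envelope sketch, assuming a hypothetical codec frame rate (NeuCodec's actual tokens-per-second figure is not stated in this card):

```python
def audio_token_budget(seconds: float, tokens_per_second: int) -> int:
    """Audio tokens needed for `seconds` of speech at a given codec
    frame rate (the rate is an assumption here, check the NeuCodec docs),
    plus 2 for the AUDIO_START / AUDIO_END markers."""
    return int(seconds * tokens_per_second) + 2

# With an assumed 50 tokens/s, a 10 s clip needs roughly 502 generated tokens,
# in the same ballpark as the max_new_tokens=500 used in the Quick Start below.
print(audio_token_budget(10.0, 50))
```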

Installation

pip install torch transformers accelerate
pip install git+https://github.com/neuphonic/neucodec.git
pip install soundfile numpy scipy

Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from neucodec import NeuCodec
import soundfile as sf

# Load model
model = AutoModelForCausalLM.from_pretrained("BlandAI/BeigeTTS", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("BlandAI/BeigeTTS")
neucodec = NeuCodec.from_pretrained("neuphonic/neucodec")

# Generate speech
text = "Hello! This is BeigeTTS, a research release from BlandAI."
prompt = f"<start_of_turn>user\n{text}<end_of_turn>\n<start_of_turn>model\n<start_of_speech>"

# Tokenize and generate
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=500,
        do_sample=True,  # temperature/top_p are ignored under greedy decoding
        temperature=0.1,
        top_p=0.97,
        eos_token_id=[tokenizer.eos_token_id, 262146],  # also stop at AUDIO_END
    )

# Decode audio (see inference script for full implementation)
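The decode step elided above amounts to slicing the audio tokens out of the generated sequence and rebasing them into the codec's index range. A minimal sketch (the helper is illustrative; the release's inference script is authoritative, and the actual NeuCodec decode call is not shown here):

```python
def extract_audio_codes(output_ids, audio_start=262_145, audio_end=262_146,
                        codec_base=262_154):
    """Pull NeuCodec code indices out of a generated token sequence.

    Takes everything after the first AUDIO_START marker, up to AUDIO_END
    (or end of sequence), and rebases into the codec's 0..65,535 range.
    """
    ids = list(output_ids)
    start = ids.index(audio_start) + 1 if audio_start in ids else 0
    codes = []
    for t in ids[start:]:
        if t == audio_end:
            break
        if t >= codec_base:
            codes.append(t - codec_base)
    return codes

# The resulting code list would then be passed to the NeuCodec decoder
# (see the inference script) and written out with soundfile, e.g.
# sf.write("output.wav", waveform, 24000) for the 24kHz codec.
```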

Research Applications

Suggested Research Directions

  1. Prosody Modeling: Investigating controllable prosody generation
  2. Cross-Lingual Transfer: Adapting to new languages with minimal data
  3. Emotion Synthesis: Fine-tuning for emotional speech generation
  4. Compression Studies: Analyzing audio token efficiency
  5. Streaming Optimization: Reducing latency for real-time applications
  6. Robustness Analysis: Handling out-of-distribution text inputs

Academic Collaborations

We welcome academic collaborations. For research partnerships or access to evaluation datasets, contact research@bland.ai.

Performance Characteristics

  • Inference Speed: ~150 tokens/second on A100
  • Audio Quality: 24kHz (Khaki supports 48kHz)
  • Latency: <500ms first audio (Khaki: <50ms)
  • Memory Usage: ~16GB VRAM
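As a rough wall-clock estimate (illustrative arithmetic only, assuming the quoted A100 throughput applies to audio-token generation):

```python
# Time for one full generation pass at the quoted throughput
tokens_per_second = 150   # ~150 tokens/s on A100, per the figures above
max_new_tokens = 500      # matches the Quick Start example
wall_seconds = max_new_tokens / tokens_per_second
print(round(wall_seconds, 1))  # roughly 3.3 s for a 500-token generation
```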

Multilingual Research Notes

While BeigeTTS is English-only, the architecture supports multilingual synthesis through:

  • Language-specific token embeddings
  • Cross-lingual phoneme mapping
  • Accent and dialect modeling
  • Code-switching capabilities

The full Khaki system demonstrates these capabilities across 57 languages with accent preservation and cross-lingual voice transfer. Researchers interested in multilingual TTS can use BeigeTTS as a foundation for exploring these directions.

Ethical Considerations & License

Non-Commercial Use Only

BeigeTTS is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). This means:

  • ✅ Research and academic use
  • ✅ Personal experimentation
  • ✅ Open-source contributions
  • ❌ Commercial applications
  • ❌ Production deployment
  • ❌ Monetized services

For commercial licensing of our full Khaki system, contact partnerships@bland.ai.

Responsible AI Guidelines

  • Always disclose AI-generated content
  • Do not use for impersonation without consent
  • Respect privacy and intellectual property
  • Consider potential biases in synthesis
  • Implement appropriate safety measures

Citation

If you use BeigeTTS in your research, please cite:

@misc{blandai2024beigetts,
  title={BeigeTTS: A Research Release for Large-Scale Neural Speech Synthesis},
  author={BlandAI Research Team},
  year={2024},
  publisher={HuggingFace},
  note={Scaled research version of the Khaki TTS system}
}

Related Work

BeigeTTS builds upon:

  • Gemma (Google, 2024)
  • NeuCodec (Neuphonic, 2024)
  • Our production Khaki TTS system (not publicly available)

Future Research Releases

We plan to release additional research artifacts:

  • TaupeVC: Voice conversion research model
  • EcruTTS: Lightweight edge deployment model
  • SandAlign: Forced alignment for TTS training

Acknowledgments

We thank the open-source community and our research partners. Special recognition to:

  • Google for the Gemma foundation model
  • Neuphonic for NeuCodec
  • The broader TTS research community

Disclaimer

BeigeTTS is a research release with no warranties. The full production capabilities described for Khaki are not available in this release. For production-grade TTS, please contact BlandAI for commercial licensing options.


BeigeTTS is a research artifact from BlandAI's speech synthesis team. For production applications, explore our commercial Khaki TTS API at bland.ai
