# TTS (Text To Speech) API - NvrTtsEnUs

The **Text-to-Speech** (TTS) API endpoint allows you to obtain speech synthesis from raw text.

## Introduction

**Text-to-Speech** (TTS) is a subfield of Artificial Intelligence (AI) that converts written text into spoken words. This TTS API operates as a two-stage pipeline: a first model generates a mel spectrogram, then a second model uses this mel spectrogram to generate speech.

This speech synthesis system enables you to synthesize natural speech from raw transcriptions without any additional information. AI Endpoints makes it easy, with ready-to-use inference APIs. Discover how to use them:

## Model concept and configuration

These TTS models were developed by NVIDIA. The TTS AI Endpoint takes text as input and returns an audio stream or audio buffer, along with optional metadata.

**Model configuration:**

- **Transcription mode**: offline
- **Language support**: en-US, es-ES, de-DE, it-IT (choose the corresponding endpoint, e.g. *nvr-tts-en-us*)
- **Input type**: raw text
- **Voice name**: this parameter specifies the voice to use for speech synthesis, allowing selection of speaker gender and emotional style. Available options vary depending on the model language: some languages offer both male and female voices, while others have only one gender available, and emotional variations (such as neutral, calm, or happy) are limited to certain languages. Voices are prefixed with the language code (e.g. English-US). The suffix "-1" (e.g. Female-1 or Male-1) indicates the base voice, which has natural characteristics such as timbre and accent, without any specific emotional modification.
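The naming scheme above can be sketched as a small helper. The voice names and language codes come from this page; the `pick_voice` function itself is purely illustrative and not part of the API:

```python
# Base ("-1") voices per language code, as listed in this document.
# Emotional variants (Calm, Happy, ...) exist for en-US only.
BASE_VOICES = {
    "en-US": ["English-US.Female-1", "English-US.Male-1"],
    "es-ES": ["Spanish-ES-Female-1", "Spanish-ES-Male-1"],
    "de-DE": ["German-DE-Male-1"],
    "it-IT": ["Italian-IT-Female-1", "Italian-IT-Male-1"],
}

def pick_voice(language_code: str, gender: str = "Female") -> str:
    """Return a base voice for the given language.

    Falls back to the first available voice when the requested
    gender does not exist for that language (e.g. de-DE only
    offers a male voice).
    """
    voices = BASE_VOICES[language_code]
    for voice in voices:
        if gender in voice:
            return voice
    return voices[0]
```

The returned string can be passed directly as the `voice_name` field of a request payload.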
  The available voice names are: English-US.Female-1, English-US.Male-1, English-US.Female-Calm, English-US.Female-Neutral, English-US.Female-Happy, English-US.Female-Angry, English-US.Female-Fearful, English-US.Female-Sad, English-US.Male-Calm, English-US.Male-Neutral, English-US.Male-Happy, English-US.Male-Angry, Spanish-ES-Female-1, Spanish-ES-Male-1, German-DE-Male-1, Italian-IT-Female-1, Italian-IT-Male-1.
- **Sample rate**: typically 22,050 Hz or 44,100 Hz

## How to?

The **TTS** endpoint offers you a wide range of transcription options. Learn how to use them with the following example:

### With a simple HTTP client (requests)

First, install the *requests* library:

```bash
pip install requests
```

Next, export your access token to the *OVH_AI_ENDPOINTS_ACCESS_TOKEN* environment variable:

```bash
export OVH_AI_ENDPOINTS_ACCESS_TOKEN=
```

*If you do not have an access token key yet, follow the instructions in the [AI Endpoints – Getting Started](https://help.ovhcloud.com/csm/en-gb-public-cloud-ai-endpoints-getting-started?id=kb_article_view&sysparm_article=KB0065401).*

Finally, run the following Python code:

```python
import os

import requests

url = "https://nvr-tts-en-us.endpoints.kepler.ai.cloud.ovh.net/api/v1/tts/text_to_audio"

headers = {
    "accept": "application/octet-stream",
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.getenv('OVH_AI_ENDPOINTS_ACCESS_TOKEN')}",
}

data = {
    "encoding": 1,
    "language_code": "en-US",
    "sample_rate_hz": 16000,
    "text": "We provide a set of managed tools designed for building your Machine Learning projects: AI Notebooks, AI Training, AI Deploy and AI Endpoints.",
    "voice_name": "English-US.Female-1"
}

response = requests.post(url, headers=headers, json=data)

if response.status_code == 200:
    # Save the audio content to a file
    with open("output_audio.wav", "wb") as audio_file:
        audio_file.write(response.content)
    print("Audio file saved as output_audio.wav")
else:
    print("Error:", response.status_code, response.text)
```

Returning the following result:

```
Audio file saved as output_audio.wav
```

You are now able to play and use your generated audio file.

### With the gRPC RIVA client

Install the RIVA client and audio libraries:

```bash
pip install nvidia-riva-client numpy
```

This basic example returns the audio speech generated by the model:

```python
import numpy as np
import IPython.display as ipd
import riva.client

# Connect to the RIVA TTS server
tts_service = riva.client.SpeechSynthesisService(
    riva.client.Auth(
        uri="nvr-tts-en-us.endpoints-grpc.kepler.ai.cloud.ovh.net:443",
        use_ssl=True,
    )
)

# Set up the request configuration
sample_rate_hz = 44100
req = {
    "language_code": "en-US",  # choose the corresponding language in the list: en-US / es-ES / de-DE / it-IT
    "encoding": riva.client.AudioEncoding.LINEAR_PCM,
    "sample_rate_hz": sample_rate_hz,  # sample rate: 44.1 kHz audio
    "voice_name": "English-US.Female-1"
    # voices: `English-US.Female-1`, `English-US.Male-1`,
    # `English-US.Female-Calm`, `English-US.Female-Neutral`,
    # `English-US.Female-Happy`, `English-US.Female-Angry`,
    # `English-US.Female-Fearful`, `English-US.Female-Sad`,
    # `English-US.Male-Calm`, `English-US.Male-Neutral`,
    # `English-US.Male-Happy`, `English-US.Male-Angry`,
    # `Spanish-ES-Female-1`, `Spanish-ES-Male-1`,
    # `German-DE-Male-1`, `Italian-IT-Female-1`,
    # `Italian-IT-Male-1`
}

# Input text
req["text"] = "We provide a set of managed tools designed for building your Machine Learning projects: AI Notebooks, AI Training, AI Deploy and AI Endpoints."

# Run synthesis and decode the returned audio buffer
response = tts_service.synthesize(**req)
audio_samples = np.frombuffer(response.audio, dtype=np.int16)

# Play the output audio
ipd.Audio(audio_samples, rate=sample_rate_hz)
```

## Model rate limit

When using AI Endpoints, the **following rate limits apply**:

- **Anonymous**: 2 requests per minute, per IP and per model.
- **Authenticated with an API access key**: 400 requests per minute, per Public Cloud project and per model.

If you exceed this limit, a **429 error code** will be returned. If you require higher usage, please **[get in touch with us](https://help.ovhcloud.com/csm?id=csm_get_help)** to discuss increasing your rate limits.

## References

For more information about the TTS model features, please refer to the RIVA TTS [documentation](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-overview.html).

## Going Further

For a broader overview of AI Endpoints, explore the full [AI Endpoints Documentation](https://help.ovhcloud.com/csm/en-gb-documentation-public-cloud-ai-and-machine-learning-ai-endpoints?id=kb_browse_cat&kb_id=574a8325551974502d4c6e78b7421938&kb_category=ea1d6daa918a1a541e11d3d71f8624aa).

Reach out to our support team or join the *#ai-endpoints* channel on the [OVHcloud Discord](https://discord.gg/ovhcloud) to share your questions, feedback, and suggestions with the team and the community.
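When a **429 error code** is returned, a client can wait and retry rather than fail outright. Below is a minimal, library-agnostic sketch of exponential backoff; the `call_with_backoff` helper and its parameters are illustrative, not part of the API:

```python
import time

def call_with_backoff(send, max_retries: int = 3, base_delay: float = 1.0):
    """Call send() until it stops returning HTTP 429.

    send() must return a (status_code, body) tuple, e.g. a small
    wrapper around the requests.post(...) call from the HTTP example.
    Waits base_delay * 2**attempt seconds between attempts
    (1 s, 2 s, 4 s, ... with the defaults).
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        if status != 429:
            return status, body
        if attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))
    # Still rate-limited after all retries; return the last response
    return status, body
```

To use it, wrap the `requests.post` call from the HTTP example in a function that returns `(response.status_code, response.content)` and pass that function as `send`.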