Speech Generation API

Beta Access Required: The Speech API requires whitelisted access.

To request access, email sales@demeterics.com with:

  • Subject: "Feature Access Request"
  • Feature name: "Text-to-Speech (TTS)"

For multi-speaker podcast generation, also request: "TTS Multi-Speaker"

The Demeterics Speech API provides a unified Text-to-Speech (TTS) interface across multiple providers. Convert text to natural-sounding audio with a single API while automatically tracking usage, costs, and storing generated audio for analysis.

Overview

Base URL: https://api.demeterics.com/tts/v1

Features:

  • Unified API: Single endpoint for OpenAI, ElevenLabs, Google Cloud TTS, Murf.ai, Groq Orpheus, and Google Gemini
  • Multi-Speaker: Generate podcasts and dialogues either as a single Gemini call (up to 2 speakers, native prosody) or via the dialog meta-providers (gemini-dialog, openai-dialog, elevenlabs-dialog, murf-dialog) for unlimited speakers and large dialogues
  • Auto-tracking: Every request logged to BigQuery with full observability
  • Audio Storage: Generated audio stored in GCS with 15-minute signed URLs
  • BYOK Support: Use your own provider API keys with dual-key authentication
  • Cost Control: Automatic credit billing with 15% managed or 10% BYOK fee

Authentication

Managed Keys (Default)

Use only your Demeterics API key:

curl -X POST https://api.demeterics.com/tts/v1/generate \
  -H "Authorization: Bearer dmt_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{...}'

Bring Your Own Key (BYOK)

Use the dual-key format to provide your own provider API key:

curl -X POST https://api.demeterics.com/tts/v1/generate \
  -H "Authorization: Bearer dmt_your_api_key;sk-your_openai_key" \
  -H "Content-Type: application/json" \
  -d '{...}'

The format is: [demeterics_api_key];[provider_api_key]

BYOK Benefits:

  • 10% service fee instead of 15%
  • Use your own rate limits and quotas
  • Provider costs billed directly to your account
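
The dual-key string is easy to assemble incorrectly in client code. A minimal sketch (the helper name is ours, not part of any Demeterics SDK) that produces the Authorization header value for both modes:

```python
def build_auth_header(demeterics_key, provider_key=None):
    """Authorization header value for managed (single-key) or BYOK (dual-key) mode."""
    if provider_key:
        # BYOK: [demeterics_api_key];[provider_api_key]
        return f"Bearer {demeterics_key};{provider_key}"
    return f"Bearer {demeterics_key}"
```

Pass the result as the Authorization header in any of the examples below.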

Endpoints

Generate Speech

POST /tts/v1/generate

Convert text to speech audio.

Request Body:

Field Type Required Description
provider string Yes Target provider: openai, elevenlabs, google, murf, groq, gemini, openai-dialog, elevenlabs-dialog, murf-dialog, gemini-dialog
model string No TTS model (provider-specific)
voice string No Voice identifier (single speaker)
input string Yes Text to convert (max varies by provider)
format string No Output format: mp3, wav, opus, flac
speed float No Playback speed: 0.25-4.0 (default: 1.0)
language string No Language code (ISO 639-1)
speakers array No Multi-speaker config — required for gemini (max 2 speakers) and the *-dialog providers (unlimited speakers)
temperature float No Sampling temperature, 0.0–2.0. Honored by gemini and gemini-dialog for prosody control; silently ignored by all other providers, so it is safe to set in a request that may fall back to a non-Gemini provider.

Example Request:

curl -X POST https://api.demeterics.com/tts/v1/generate \
  -H "Authorization: Bearer dmt_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "openai",
    "model": "tts-1",
    "voice": "alloy",
    "input": "Hello, welcome to Demeterics!",
    "format": "mp3"
  }'

Response:

{
  "id": "01JARV4HZ6XPQMWVCS9N1GKEFD",
  "provider": "openai",
  "model": "tts-1",
  "voice": "alloy",
  "audio_url": "https://storage.googleapis.com/demeterics-data/tts/...",
  "duration_seconds": 2.3,
  "cost_usd": 0.00023,
  "usage": {
    "input_chars": 31
  },
  "metadata": {
    "format": "mp3",
    "sample_rate": 24000,
    "channels": 1,
    "generation_ms": 450
  }
}
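
Because signed URLs expire after 15 minutes, download the audio promptly rather than storing the URL. A standard-library sketch (function names are ours):

```python
import urllib.request

def audio_filename(resp):
    """Local filename derived from a /generate response: <id>.<format>."""
    return f"{resp['id']}.{resp['metadata']['format']}"

def download_audio(resp, dest_dir="."):
    """Fetch the audio before the 15-minute signed URL expires."""
    path = f"{dest_dir}/{audio_filename(resp)}"
    urllib.request.urlretrieve(resp["audio_url"], path)
    return path
```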

List Voices

GET /tts/v1/voices?provider={provider}

List available voices for a provider.

Query Parameters:

Parameter Type Required Description
provider string Yes Provider: openai, elevenlabs, google, murf

Example Request:

curl -X GET "https://api.demeterics.com/tts/v1/voices?provider=openai" \
  -H "Authorization: Bearer dmt_your_api_key"

Response:

{
  "voices": [
    {
      "id": "alloy",
      "name": "Alloy",
      "description": "Neutral and balanced",
      "gender": "neutral"
    },
    {
      "id": "echo",
      "name": "Echo",
      "description": "Clear and articulate",
      "gender": "male"
    }
  ]
}

Providers

OpenAI

Models:

  • gpt-4o-mini-tts - Latest model with better steerability. Dual-priced: $0.60/1M input tokens + $12/1M audio output tokens — effective $0.015/min of audio ($20 per 1M chars). Substantially cheaper than ElevenLabs.
  • tts-1 - Fast and efficient (legacy)
  • tts-1-hd - Higher quality (legacy)

Voices:

  • alloy - Neutral and balanced
  • ash - Warm and conversational
  • ballad - Soft and melodic
  • coral - Friendly and approachable
  • echo - Clear and articulate
  • fable - Expressive and dynamic
  • onyx - Deep and authoritative
  • nova - Friendly and warm
  • sage - Calm and measured
  • shimmer - Bright and optimistic
  • verse - Dynamic and engaging

Supported Formats: mp3, opus, aac, flac, wav, pcm

Max Characters: 4,096

ElevenLabs

Models:

  • eleven_v3 - Most expressive model — human-like speech with high emotional range, 70+ languages, supports vocal directions (recommended for high-quality content)
  • eleven_multilingual_v2 - Premium quality, 29 languages, 10K char limit
  • eleven_turbo_v2_5 - High quality + speed (~250-300ms), 32 languages, 40K char limit
  • eleven_turbo_v2 - Fast, English only
  • eleven_flash_v2_5 - Ultra-fast (~75ms), 32 languages, 50% lower cost — great for drafts and real-time
  • eleven_monolingual_v1 - Deprecated February 28, 2026 — migrate to eleven_v3

Vocal Directions (eleven_v3):

ElevenLabs v3 supports inline audio tags to direct the performance style:

[cheerful] Welcome to our channel!
[whisper] But here's a secret...
[dramatic] Everything is about to change.
[sarcastic] Oh sure, that went exactly as planned.

Available directions include: [cheerful], [whisper], [dramatic], [sarcastic], [excited], [friendly], [warm], [professionally], [authoritatively], [breathy], and more.
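
A small helper (our own naming, not an ElevenLabs API) can keep direction tags consistent when assembling an input string:

```python
def with_direction(text, direction=None):
    """Prefix a line with an eleven_v3 audio tag such as [cheerful]."""
    return f"[{direction}] {text}" if direction else text

# Assemble a performance-directed input string:
script = "\n".join([
    with_direction("Welcome to our channel!", "cheerful"),
    with_direction("But here's a secret...", "whisper"),
    with_direction("Everything is about to change.", "dramatic"),
])
```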

Voices: Over 100 pre-made voices plus custom voice cloning

Supported Formats: mp3, pcm, ulaw

Max Characters: 5,000 (eleven_v3), 10,000 (multilingual_v2), 40,000 (turbo/flash v2.5)
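
Because per-request character caps differ by provider, long inputs must be split client-side before calling /generate. A rough sentence-boundary chunker, assuming the limits quoted in this document (helper names are ours; a single sentence longer than the limit still overflows):

```python
PROVIDER_LIMITS = {          # per-request character caps quoted in this doc
    "openai": 4096,
    "elevenlabs": 5000,      # eleven_v3; turbo/flash models allow more
    "google": 5000,
    "gemini": 8000,
    "murf": 10_000,
}

def chunk_text(text, provider):
    """Greedily pack sentences into chunks that fit under the provider's limit."""
    limit = PROVIDER_LIMITS[provider]
    chunks, current = [], ""
    for sentence in text.split(". "):
        candidate = f"{current}. {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Generate one request per chunk and concatenate the resulting audio client-side.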

Google Cloud TTS

Models:

  • standard - Basic quality
  • neural2 - Neural network based
  • wavenet - High quality WaveNet
  • journey - Conversational style
  • studio - Professional quality

Voices: 220+ voices across 40+ languages

Supported Formats: mp3, wav, ogg

Max Characters: 5,000

Murf.ai

Models:

  • GEN2 - Latest generation, highest quality ($0.03/1000 chars)
  • FALCON - Fast streaming model ($0.01/1000 chars) ← Recommended for Voice-to-Voice

Voices: 120+ voices across 20+ languages including:

  • en-US-natalie - Natalie (US English, female) — clear, professional
  • en-US-samantha - Samantha (US English, female) — warm, conversational
  • en-US-terrell - Terrell (US English, male) — deep, authoritative
  • en-US-wayne - Wayne (US English, male) — friendly, casual
  • en-UK-hazel - Hazel (UK English, female) — British accent
  • en-UK-ruby - Ruby (UK English, female) — British, professional
  • en-UK-maisie - Maisie (UK English, female) — British, youthful
  • en-AU-lincoln - Lincoln (Australian, male) — Australian accent

Supported Formats: mp3, wav, flac, ogg, pcm, alaw, ulaw

Max Characters: 10,000

Features:

  • Voice styles (conversational, newscast, etc.)
  • Speed and pitch control
  • Multi-language support with native locales
  • Streaming support via /v1/speech/stream endpoint

Murf Falcon Streaming (Widget Integration)

The FALCON model supports real-time audio streaming, used internally by the AI Chat Widget's Voice-to-Voice feature.

Note: Murf Falcon streaming is not exposed as a standalone Demeterics API endpoint. It's used automatically when Voice-to-Voice is enabled on your AI Chat Widget. For direct TTS generation, use POST /tts/v1/generate with provider: "murf" and model: "FALCON".

How Voice-to-Voice Works:

When Voice-to-Voice is enabled, the widget uses a two-phase streaming architecture:

  1. Phase 1: POST /api/widget/voice

    • Uploads the user's audio recording
    • Returns the transcript, response text, and a stream_token
    • Text is displayed in the widget immediately
  2. Phase 2: GET /api/widget/voice/stream?token=X

    • A Server-Sent Events (SSE) stream delivers audio chunks
    • The Web Audio API plays chunks as they arrive
    • ~130ms time-to-first-audio (TTFA)

Additional streaming endpoints (internal use):

  • GET /api/widget/voice/stream/mp3 — MP3 format stream
  • GET /api/widget/voice/stream/raw — Raw audio stream
  • WS /api/widget/voice/ws — WebSocket streaming
  • WS /api/widget/voice/live — Full-duplex WebSocket

Performance:

  • ~130ms time-to-first-audio
  • WAV format at 24kHz mono
  • Optimized for low-latency conversational AI

Cost: $0.01 per 1,000 characters (billed when stream is consumed)

Google Gemini TTS

Beta Access: Gemini TTS with multi-speaker support is available to whitelisted users. Contact support to request access.

Models:

  • gemini-3.1-flash-tts-preview - Best quality, 70+ languages, audio tags for vocal style control (default)
  • gemini-2.5-flash-preview-tts - Older, cheaper alternative
  • gemini-2.5-pro-preview-tts - Higher quality (Pro tier)

Voices (30 prebuilt voices):

  • Puck - Upbeat
  • Kore - Firm
  • Charon - Informative
  • Zephyr - Bright
  • Fenrir - Excitable
  • Leda - Youthful
  • Aoede - Breezy
  • Sulafat - Warm
  • Achird - Friendly
  • And 21 more...

Supported Formats: wav

Max Characters: 8,000

Features:

  • Multi-speaker support: Up to 2 speakers with different voices
  • 30 prebuilt voice options
  • Ideal for podcasts, dialogues, and conversational content

Multi-Speaker Mode (Podcasts & Dialogues)

Generate conversational audio with up to 2 distinct speakers, each with their own voice. Perfect for:

  • Podcasts with host and guest
  • Dialogues between characters
  • Interview-style content
  • Educational back-and-forth explanations

Request Body (Multi-Speaker):

Field Type Required Description
provider string Yes Must be gemini
model string No gemini-3.1-flash-tts-preview (default)
input string Yes Dialogue with speaker labels
speakers array Yes Speaker-to-voice mapping (max 2)
format string No Output format (default: wav)

Speaker Configuration:

Each speaker object has:

Field Type Required Description
name string Yes Speaker label (must match input text)
voice string Yes Voice ID (e.g., Puck, Kore)

Example: Podcast Generation

curl -X POST https://api.demeterics.com/tts/v1/generate \
  -H "Authorization: Bearer dmt_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "gemini",
    "model": "gemini-3.1-flash-tts-preview",
    "input": "Host: Welcome to the AI Insights podcast! Today we explore the future of voice AI.\nGuest: Thanks for having me! Voice technology is transforming how we interact with machines.",
    "speakers": [
      {"name": "Host", "voice": "Puck"},
      {"name": "Guest", "voice": "Kore"}
    ],
    "format": "wav"
  }'

Response:

{
  "id": "tts_01JARV4HZ6XPQMWVCS9N1GKEFD",
  "provider": "gemini",
  "model": "gemini-3.1-flash-tts-preview",
  "audio_url": "https://storage.googleapis.com/demeterics-data/tts/...",
  "duration_seconds": 8.5,
  "cost_usd": 0.00125,
  "usage": {
    "input_chars": 156
  }
}

Python Example:

import requests

response = requests.post(
    "https://api.demeterics.com/tts/v1/generate",
    headers={"Authorization": "Bearer dmt_your_api_key"},
    json={
        "provider": "gemini",
        "input": """Host: What's the biggest challenge in AI today?
Guest: I'd say it's making AI accessible to everyone, not just tech companies.""",
        "speakers": [
            {"name": "Host", "voice": "Puck"},
            {"name": "Guest", "voice": "Kore"}
        ]
    }
)

audio_url = response.json()["audio_url"]
print(f"Podcast audio: {audio_url}")

Best Practices for Multi-Speaker:

  1. Consistent labels: Use the same speaker names throughout (e.g., Host: not Announcer:)
  2. Clear formatting: Start each line with Speaker: followed by their dialogue
  3. Voice pairing: Choose voices with distinct characteristics (e.g., upbeat + firm)
  4. Keep turns short: Shorter dialogue turns sound more natural
  5. Max 2 speakers: Gemini currently supports up to 2 distinct speakers
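
The practices above can be folded into a small request builder (names are ours) that enforces consistent labels and the 2-speaker limit before the API call is made:

```python
def build_dialogue(turns, voices):
    """Build a Gemini multi-speaker request body from (speaker, text) turns.

    Enforces Gemini's 2-speaker limit and checks every label has a voice.
    """
    names = {speaker for speaker, _ in turns}
    if len(names) > 2:
        raise ValueError("gemini multi-speaker supports at most 2 speakers")
    missing = names - voices.keys()
    if missing:
        raise ValueError(f"no voice mapped for: {sorted(missing)}")
    return {
        "provider": "gemini",
        "input": "\n".join(f"{speaker}: {text}" for speaker, text in turns),
        "speakers": [{"name": n, "voice": voices[n]} for n in sorted(names)],
        "format": "wav",
    }
```

The returned dict can be passed directly as the JSON body of POST /tts/v1/generate.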

Dialog Providers (Gemini / OpenAI / ElevenLabs / Murf)

Beta Access: Dialog providers require the TTS Multi-Speaker feature flag — same access as Gemini multi-speaker.

The gemini-dialog, openai-dialog, elevenlabs-dialog, and murf-dialog providers accept the same multi-speaker request shape as Gemini but synthesize each speaker turn independently through the underlying single-speaker provider, then concatenate the per-turn PCM with 250ms inter-turn silence.

When to use:

  • Gemini multi-speaker is unavailable (fallback path) — pick elevenlabs-dialog or openai-dialog (independent infrastructure)
  • You need more than 2 speakers (Gemini's hard limit)
  • Your dialogue exceeds Gemini's 4000-byte multi-speaker cap — pick gemini-dialog to keep Gemini voice quality, or any other dialog provider
  • You want OpenAI / ElevenLabs / Murf voices in dialog form

Tradeoff — read this before adopting: Each turn is synthesized in isolation. There is no cross-turn prosody conditioning, so a reply does not react to the question's intonation. The output sounds like two voice actors reading separate lines, not a real conversation. For high-fidelity podcast dialogue, prefer native gemini multi-speaker; use the dialog providers as a length-relief tier (gemini-dialog), a reliability tier (elevenlabs-dialog/openai-dialog), or when Gemini's caps don't fit.

gemini-dialog does NOT add reliability. It calls the same generativelanguage.googleapis.com endpoint as native gemini multi-speaker. When Google has an outage, both go down together. Use elevenlabs-dialog or openai-dialog for the outage-fallback role — they have independent infrastructure.

Available providers:

Provider ID Underlying Default Model Cost / 1M chars (managed)
gemini-dialog Google Gemini gemini-3.1-flash-tts-preview $46.00
openai-dialog OpenAI TTS gpt-4o-mini-tts $23.00
elevenlabs-dialog ElevenLabs eleven_v3 $345.00
murf-dialog Murf.ai FALCON $15.30

elevenlabs-dialog default is eleven_v3 — highest emotional range and supports inline audio tags ([cheerful], [whisper], [dramatic]). For a cheaper rescue tier, override with model: "eleven_flash_v2_5" ($69.00 / 1M chars) or use openai-dialog instead.

Request shape: Identical to Gemini multi-speaker (speakers array + input with <Name>: line prefixes).

Field Type Required Description
provider string Yes gemini-dialog, openai-dialog, elevenlabs-dialog, or murf-dialog
model string No Underlying provider model — see table above for defaults
input string Yes Dialogue text with speaker labels (Host: ...\nGuest: ...)
speakers array Yes Speaker-to-voice mapping (no upper limit)
format string No Currently wav only — mp3 transcoding requires ffmpeg, which the runtime image does not include

Limits:

  • Total input: 50,000 chars across all turns
  • Per turn: Each individual turn must fit the underlying provider's per-call limit — OpenAI 4096 chars, ElevenLabs 5000 chars, Murf 10000 chars
  • Speakers: No upper limit; speaker names must be alphanumeric only
  • Sample rate: All turns must produce the same sample rate. Defaults are aligned to 24kHz; Murf GEN2 emits 44.1kHz and would mismatch — stick with FALCON for murf-dialog
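
These limits are cheap to enforce client-side before paying for a request that will be rejected. A validation sketch using the per-turn limits quoted above (helper names are ours):

```python
TOTAL_LIMIT = 50_000
PER_TURN_LIMITS = {          # per-call limits of the underlying providers
    "openai-dialog": 4096,
    "elevenlabs-dialog": 5000,
    "murf-dialog": 10_000,
}

def validate_dialog(provider, turns):
    """Raise ValueError if a dialog request would be rejected; return None if OK."""
    if sum(len(text) for _, text in turns) > TOTAL_LIMIT:
        raise ValueError("total input exceeds 50,000 chars")
    limit = PER_TURN_LIMITS.get(provider)
    for speaker, text in turns:
        if not speaker.isalnum():
            raise ValueError(f"speaker name must be alphanumeric: {speaker!r}")
        if limit is not None and len(text) > limit:
            raise ValueError(f"turn for {speaker} exceeds the {limit}-char limit")
```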

Example — three-turn dialog with ElevenLabs:

curl -X POST https://api.demeterics.com/tts/v1/generate \
  -H "Authorization: Bearer dmt_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "elevenlabs-dialog",
    "model": "eleven_flash_v2_5",
    "input": "Host: Welcome to the show.\nGuest: Thanks for having me.\nHost: Lets dive in.",
    "speakers": [
      {"name": "Host",  "voice": "JBFqnCBsd6RMkjVDRZzb"},
      {"name": "Guest", "voice": "EXAVITQu4vr4xnSDxMaL"}
    ]
  }'

How it works:

  1. Input is split into ordered turns by matching line-leading speaker tags against the configured speakers array. Mid-text colons (I said: "hi") are not mistaken for tags because the candidate must be in the speaker whitelist.
  2. Each turn is sent to the underlying single-speaker adapter independently with the speaker's mapped voice.
  3. Per-turn audio is decoded to raw PCM (WAV header stripped if present), 250ms of silence is inserted between turns, and the result is wrapped as a single 16-bit mono WAV.
  4. The response audio_url, duration_seconds, and cost_usd reflect the joined audio. Per-turn cost is summed.
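
Steps 2 and 3 can be illustrated with a toy version of the joining logic, assuming every turn is already 16-bit mono PCM at 24 kHz. This is a sketch of the described behavior, not the service's actual code:

```python
SAMPLE_RATE = 24_000                           # 16-bit mono PCM; all turns must match
GAP = b"\x00\x00" * int(SAMPLE_RATE * 0.25)    # 250 ms of silence between turns

def join_turns(pcm_turns):
    """Concatenate per-turn PCM buffers with a 250 ms gap between each pair."""
    return GAP.join(pcm_turns)
```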

Failure mode: If any single turn fails (after the underlying provider's own retries), the whole request fails with a provider_error. There is no per-turn retry-with-different-provider logic in v1.
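
Since the service does not retry a failed turn against another provider, a client-side fallback chain is the practical workaround. A standard-library sketch (function names are ours) that honors the infrastructure note above by excluding gemini-dialog from the outage tier:

```python
import json
import urllib.error
import urllib.request

def fallback_chain(primary="gemini"):
    """Ordered providers to try. gemini-dialog shares Gemini's infrastructure,
    so the fallback tiers use providers with independent backends."""
    return [primary, "elevenlabs-dialog", "openai-dialog"]

def generate_with_fallback(body, api_key):
    """Try each provider in turn; return the first successful response."""
    last_err = None
    for provider in fallback_chain():
        req = urllib.request.Request(
            "https://api.demeterics.com/tts/v1/generate",
            data=json.dumps({**body, "provider": provider}).encode(),
            headers={
                "Authorization": f"Bearer {api_key}",
                "Content-Type": "application/json",
            },
        )
        try:
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            last_err = err   # e.g. 502 provider_error; try the next tier
    raise RuntimeError(f"all providers failed (last status {last_err.code})")
```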

Voices: Use the same voice IDs documented above for OpenAI / ElevenLabs / Murf. Each speaker.voice is interpreted by the underlying provider, so e.g. elevenlabs-dialog accepts ElevenLabs voice IDs (or names like Rachel) and openai-dialog accepts OpenAI voice names (alloy, echo, etc.).


Groq Orpheus (Canopy Labs)

Migration Notice: PlayAI TTS models (playai-tts, playai-tts-arabic) are deprecated and will be decommissioned on December 31, 2025. Please migrate to canopylabs/orpheus-v1-english.

Models:

  • canopylabs/orpheus-v1-english - Expressive English TTS with vocal direction support

Voices (8 voices):

  • tara - Female, conversational (default)
  • leah - Female, professional
  • jess - Female, friendly
  • leo - Male, conversational
  • dan - Male, professional
  • mia - Female, warm
  • zac - Male, casual
  • zoe - Female, clear

Supported Formats: wav only

Max Characters: 200 per request

Features:

  • Vocal Directions: Control speech style with bracketed commands:
    • Conversational: [cheerful], [friendly], [casual], [warm]
    • Professional: [professionally], [authoritatively], [formally]
    • Expressive: [whisper], [excited], [dramatic], [deadpan], [sarcastic]
    • Vocal qualities: [gravelly whisper], [rapid babbling], [singsong], [breathy]
  • Fast generation via Groq infrastructure
  • More directions make the output more expressive; few or no directions keep it natural and casual
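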
  • 56% cheaper than PlayAI ($22/1M chars vs $50/1M chars)

Pricing

Managed Keys

Character-based pricing with 15% service fee:

Provider Model Cost per 1M chars
OpenAI gpt-4o-mini-tts $23.00
OpenAI tts-1 $17.25
OpenAI tts-1-hd $34.50
ElevenLabs eleven_v3 $345.00
ElevenLabs eleven_multilingual_v2 $138.00
ElevenLabs eleven_turbo_v2_5 $69.00
ElevenLabs eleven_flash_v2_5 $69.00
Google wavenet $18.40
Google neural2 $18.40
Google standard $4.60
Murf GEN2 $34.50
Murf FALCON $15.30
Groq canopylabs/orpheus-v1-english $22.00
Gemini gemini-2.5-flash-preview-tts $23.00
Gemini gemini-2.5-pro-preview-tts $57.50
Gemini gemini-3.1-flash-tts-preview $46.00
gemini-dialog gemini-2.5-flash-preview-tts $23.00
gemini-dialog gemini-3.1-flash-tts-preview $46.00
openai-dialog gpt-4o-mini-tts $23.00
elevenlabs-dialog eleven_flash_v2_5 $69.00
elevenlabs-dialog eleven_v3 $345.00
murf-dialog FALCON $15.30

Dialog provider pricing equals the underlying provider's rate. Cost is the sum of per-turn calls plus the standard 15% managed (or 10% BYOK) service fee.
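
A straight reading of the table gives a back-of-envelope estimator for dialog requests (our own helper, using each provider's default-model managed rate; actual billing is computed server-side and may differ):

```python
DIALOG_RATE_PER_M = {        # managed rate for each provider's default model (USD / 1M chars)
    "gemini-dialog": 46.00,
    "openai-dialog": 23.00,
    "elevenlabs-dialog": 345.00,
    "murf-dialog": 15.30,
}

def estimate_managed_cost(provider, turns):
    """Summed per-turn characters times the managed rate for the default model."""
    chars = sum(len(text) for text in turns)
    return round(chars / 1_000_000 * DIALOG_RATE_PER_M[provider], 6)
```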

BYOK

10% service fee on top of provider costs. Provider costs billed directly to your account.

Error Handling

Error Response Format:

{
  "error": {
    "type": "invalid_request",
    "message": "Input text exceeds maximum length",
    "code": "text_too_long"
  }
}

Common Error Codes:

Code HTTP Status Description
invalid_provider 400 Unknown provider specified
invalid_voice 400 Voice not available for provider
text_too_long 400 Input exceeds provider limit
insufficient_credits 402 Not enough credits
provider_error 502 Provider API failed
rate_limited 429 Too many requests
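
Of these, 429 (rate_limited) and 502 (provider_error) are worth retrying; the 4xx validation errors are not. A retry sketch with jittered exponential backoff (helper names are ours):

```python
import random
import time

RETRYABLE = {429, 502}       # rate_limited, provider_error

def backoff_delays(attempts=5, base=0.5, cap=30.0):
    """Exponential delays: 0.5s, 1s, 2s, ... capped at `cap` seconds."""
    return [min(base * 2 ** i, cap) for i in range(attempts)]

def call_with_retry(do_request, attempts=5):
    """do_request() -> (status, body); retried on 429/502 with jittered backoff."""
    for delay in backoff_delays(attempts):
        status, body = do_request()
        if status not in RETRYABLE:
            return status, body
        time.sleep(delay + random.uniform(0, delay / 2))
    return status, body      # give up: return the last retryable response
```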

Data Tracking

Every speech generation is automatically tracked in BigQuery with:

  • Transaction ID (ULID)
  • User and API key identifiers
  • Provider, model, and voice used
  • Input character count and text hash (privacy-safe)
  • Audio duration and format
  • GCS storage path
  • Cost breakdown (provider cost, service fee, total)
  • Latency metrics
  • Error information (if failed)

Query your speech generations:

SELECT
  transaction_id,
  provider,
  model,
  tts.voice,
  tts.input_chars,
  tts.duration_sec,
  total_cost
FROM `demeterics.demeterics.interactions`
WHERE interaction_type = 'tts'
  AND user_id = @user_id
  AND timing.question_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY timing.question_time DESC

SDK Support

Python

import requests

response = requests.post(
    "https://api.demeterics.com/tts/v1/generate",
    headers={"Authorization": "Bearer dmt_your_api_key"},
    json={
        "provider": "openai",
        "voice": "alloy",
        "input": "Hello, world!",
        "format": "mp3"
    }
)

audio_url = response.json()["audio_url"]

Node.js

const response = await fetch("https://api.demeterics.com/tts/v1/generate", {
  method: "POST",
  headers: {
    "Authorization": "Bearer dmt_your_api_key",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    provider: "openai",
    voice: "alloy",
    input: "Hello, world!",
    format: "mp3"
  })
});

const { audio_url } = await response.json();

Best Practices

  1. Choose the right provider: OpenAI for speed, ElevenLabs eleven_v3 for highest quality (YouTube, podcasts), ElevenLabs eleven_flash_v2_5 for real-time, Google for language coverage
  2. Pick the right multi-speaker mode: Gemini for highest-fidelity dialogue (cross-turn prosody, max 2 speakers, 4000-byte cap); the *-dialog providers when you need more speakers, larger dialogues, or a Gemini-outage fallback
  3. Cache audio: Store frequently-used audio locally to reduce API calls
  4. Use appropriate formats: MP3 for web, WAV for editing, Opus for streaming
  5. Monitor costs: Track usage in your Demeterics dashboard
  6. Handle errors gracefully: Implement retry logic with exponential backoff
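
For practice 3, a deterministic cache key makes repeat generations free. A sketch (names are ours, not part of any SDK):

```python
import hashlib
from pathlib import Path

def cache_key(provider, voice, text, fmt="mp3"):
    """Deterministic filename: identical requests map to the same cached file."""
    digest = hashlib.sha256(f"{provider}|{voice}|{fmt}|{text}".encode()).hexdigest()
    return f"{digest[:16]}.{fmt}"

def cached_audio(cache_dir, provider, voice, text, fmt="mp3"):
    """Return the cached file path, or None if this request was never generated."""
    path = Path(cache_dir) / cache_key(provider, voice, text, fmt)
    return path if path.exists() else None
```

Check `cached_audio` before calling /generate, and write the downloaded audio to `cache_key(...)` afterward.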