Larry Voice Catalog

Model: gemini-3.1-flash-tts-preview
Date: 2026-04-16 · Phrase: "Hi Igor. This is Larry, testing the Gemini voice catalog at regular pacing. Let me walk you through a longer sample so you can hear how this voice handles prosody, pauses, and sentence variety…"
Audio: 24 kHz mono · Pricing: ~$0.12 per 30 s utterance · Rate limit: 10 rpm (preview)
Details: How was this generated? → · Soprano iteration loop →
Repo: github.com/idvorkin-ai-tools/larry-voice-samples
ℹ️ These samples are pre-generated, not live. Each MP3 was produced once via the Gemini 3.1 Flash TTS preview endpoint and committed to this repo. The speed slider uses the browser's HTMLMediaElement.playbackRate and never calls the API, which keeps the page free and instant.
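
The 10 rpm preview limit in the header puts a floor on how fast a batch of clips can be requested. A minimal sketch of that arithmetic, assuming requests are spaced evenly and ignoring generation time. The names min_interval_s and batch_floor_s are illustrative, not part of any Gemini SDK:

```python
def min_interval_s(requests_per_minute: float) -> float:
    """Smallest even spacing between requests that stays within the limit."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    return 60.0 / requests_per_minute


def batch_floor_s(clip_count: int, requests_per_minute: float) -> float:
    """Minimum time from the first request to the last one in a batch."""
    if clip_count <= 0:
        return 0.0
    return (clip_count - 1) * min_interval_s(requests_per_minute)


RATE_LIMIT_RPM = 10  # preview limit quoted in the header

print(min_interval_s(RATE_LIMIT_RPM))     # seconds between requests
print(batch_floor_s(17, RATE_LIMIT_RPM))  # spacing floor for a 17-voice sweep
```

At 10 rpm that works out to one request every 6 s, so a 17-voice sweep needs at least 96 s between its first and last request.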

Custom voice presets

Base voice (Charon unless noted) + natural-language style directive.

Playback speed: 1.00× (live: audio that is already playing continues at the new rate)
Larry (plain Charon): No style directive, baseline
Freud preset: Thoughtful Viennese baritone; clinical, measured
Soprano preset v2 · 6/10: Enceladus base · tournament-tuned prompt · Raspy North-Jersey, quiet menace. How we got this voice →
Catalog sweep: 17 working Gemini voices

Each voice read the same phrase. The speed slider above affects all of them.

Charon ⭐: Larry default · baritone storyteller
Fenrir: Excitable baritone · widest dynamic
Orus: "Lively male"
Puck: Upbeat, playful male
Enceladus: Male-coded, neutral
Iapetus: Neutral-to-cool
Algenib: Neutral
Algieba: Neutral
Umbriel: Neutral
Autonoe: Neutral
Zephyr: Airy
Aoede: "Warm"
Leda: Neutral-warm
Callirrhoe: Fastest generation
Despina: Smooth
Erinome: Neutral
Kore: "Firm" · reads feminine
Not available on preview endpoint: 13 voices

These names appear in Google's documented catalog but return silent failures on gemini-3.1-flash-tts-preview as of 2026-04-16: Achernar, Achird, Alnilam, Gacrux, Laomedeia, Pulcherrima, Rasalgethi, Sadachbia, Sadaltager, Schedar, Sulafat, Vindemiatrix, Zubenelgenubi.
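
The page does not say what a silent failure looks like on the wire. One plausible reading is a response that succeeds but carries empty or all-zero audio. Under that assumption, and assuming the raw audio is little-endian 16-bit PCM (neither is confirmed here), a sweep script could flag such clips with a check like this sketch. The function names and the threshold of 50 are illustrative:

```python
import struct


def peak_amplitude(pcm: bytes) -> int:
    """Largest absolute sample value in little-endian 16-bit PCM (assumed format)."""
    usable = len(pcm) - (len(pcm) % 2)  # ignore a trailing odd byte, if any
    if usable == 0:
        return 0
    return max(abs(sample) for (sample,) in struct.iter_unpack("<h", pcm[:usable]))


def looks_silent(pcm: bytes, threshold: int = 50) -> bool:
    """True when the clip is empty or never rises above the threshold."""
    return peak_amplitude(pcm) < threshold


# Hypothetical sweep results: one real-looking clip, one all-zero clip, one empty.
clips = {
    "voice_a": struct.pack("<4h", 0, 1200, -2000, 300),
    "voice_b": b"\x00\x00" * 100,
    "voice_c": b"",
}
for name, pcm in clips.items():
    print(name, "silent" if looks_silent(pcm) else "ok")
```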

Stats: latency & cost

Catalog reference

Cost per minute of output is fixed at $0.24 (24 000 audio tokens × $10 per 1M tokens). Wall-clock time scales roughly linearly with audio length; voice choice shifts it by only about ±10%.

Audio length        Wall clock (typical)        Cost
5 s utterance       ~5 s (range 4.2–6.8 s)      $0.02
30 s utterance      ~14 s (range 11–18 s)       $0.12
1 min utterance     ~60 s                       $0.24
10 min long-form    ~10 min                     $2.40
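
The cost column follows directly from the fixed rate of 24 000 audio tokens per minute at $10 per 1M tokens. A small sketch that reproduces it; the constant and function names are illustrative, and the rate is the preview-tier figure quoted on this page:

```python
AUDIO_TOKENS_PER_MINUTE = 24_000      # figure quoted on this page
USD_PER_MILLION_AUDIO_TOKENS = 10.0   # preview-tier price quoted on this page


def audio_cost_usd(audio_seconds: float) -> float:
    """Output cost for a clip of the given length, in US dollars."""
    tokens = audio_seconds / 60.0 * AUDIO_TOKENS_PER_MINUTE
    return tokens * USD_PER_MILLION_AUDIO_TOKENS / 1_000_000


for label, seconds in [("5 s", 5), ("30 s", 30), ("1 min", 60), ("10 min", 600)]:
    print(f"{label}: ${audio_cost_usd(seconds):.2f}")
```

The same function gives roughly $0.026 for the 6.4 s plain Charon clip, which matches the measured preset table.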

Text input, at roughly 150 chars per minute of audio, costs about $0.00002 (negligible). These are preview-tier prices and may shift at GA.

Custom preset latency (measured)

Style directives add no measurable API-side overhead: Gemini reads the directive as metadata, not as text to speak. Styled output is typically 15–25% longer because the model adds persona pauses and beats. Same phrase (except where noted), different prompts:

Preset                              Wall clock    Audio length    Cost
Plain Charon                        4.7 s         6.4 s           $0.026
+ Soprano style                     4.7 s         7.8 s           $0.031
+ Freud style (different phrase)    ~5 s          17.3 s          ~$0.07

Takeaway: style presets cost ~20% more per utterance because the audio runs longer; the extra pays for persona quality, not API overhead.
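
The ~20% figure can be checked against the two same-phrase rows of the table (plain Charon vs. Soprano style). A quick sketch of that comparison; percent_increase is an illustrative helper, and the Freud row is left out because it used a different phrase:

```python
def percent_increase(baseline: float, styled: float) -> float:
    """Relative growth of styled over baseline, in percent."""
    return (styled - baseline) / baseline * 100.0


plain = {"audio_s": 6.4, "cost_usd": 0.026}    # Plain Charon row
soprano = {"audio_s": 7.8, "cost_usd": 0.031}  # + Soprano style row

print(f"audio length: +{percent_increase(plain['audio_s'], soprano['audio_s']):.0f}%")
print(f"cost:         +{percent_increase(plain['cost_usd'], soprano['cost_usd']):.0f}%")
```

Audio length grows by about 22% and cost by about 19%, both inside the 15–25% band quoted above, while wall clock stays at 4.7 s.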

Generated 2026-04-16 · regenerated 2026-04-15 at ~30 s per clip