The speed slider sets HTMLMediaElement.playbackRate in the browser; it doesn't call the API. This keeps the page free and instant.
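A minimal sketch of how a client-side speed control like this could work, assuming the clips are ordinary `<audio>` elements. `setPlaybackRate` and `RateControllable` are hypothetical names, and the 0.5 to 2.0 clamp is an assumed slider range; the only real API involved is the standard `playbackRate` property.

```typescript
// Anything with a writable playbackRate, e.g. an HTMLAudioElement.
interface RateControllable {
  playbackRate: number;
}

// Hypothetical helper: apply one slider value to every clip.
// The 0.5 to 2.0 clamp is an assumed slider range, not a browser limit.
function setPlaybackRate(clips: readonly RateControllable[], rate: number): number {
  const clamped = Math.min(2.0, Math.max(0.5, rate));
  for (const clip of clips) {
    clip.playbackRate = clamped; // changes local playback only; no network request
  }
  return clamped;
}
```

In a browser this might be wired up as `setPlaybackRate(Array.from(document.querySelectorAll("audio")), Number(slider.value))`, where `slider` is whatever range input the page uses.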
Charon base voice + natural-language style directive.
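As a rough illustration of "base voice + style directive", the sketch below builds a request body in which the directive is simply prepended to the phrase as natural language. The field names follow the request shape shown in Google's public speech-generation examples, but treat them as an assumption to check against the current API reference. `buildStyledRequest` and the sample directive are hypothetical, not the presets used on this page.

```typescript
// Request body shape modeled on Google's published speech-generation examples.
// Assumption: verify field names against the current API reference.
interface TtsRequestBody {
  contents: { parts: { text: string }[] }[];
  generationConfig: {
    responseModalities: string[];
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: string } };
    };
  };
}

// Hypothetical helper: prepend a natural-language style directive to the phrase.
function buildStyledRequest(
  voiceName: string,
  styleDirective: string,
  phrase: string
): TtsRequestBody {
  // An empty directive yields the plain (unstyled) prompt.
  const text = styleDirective ? `${styleDirective}: ${phrase}` : phrase;
  return {
    contents: [{ parts: [{ text }] }],
    generationConfig: {
      responseModalities: ["AUDIO"],
      speechConfig: {
        voiceConfig: { prebuiltVoiceConfig: { voiceName } },
      },
    },
  };
}
```

For example, `buildStyledRequest("Charon", "Say in a calm, measured tone", "Welcome back.")` keeps the voice fixed and changes only the prompt text.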
Each voice ran the same phrase. Slider above affects all of them.
These names appear in Google's documented catalog but return silent failures on gemini-3.1-flash-tts-preview as of 2026-04-16: Achernar, Achird, Alnilam, Gacrux, Laomedeia, Pulcherrima, Rasalgethi, Sadachbia, Sadaltager, Schedar, Sulafat, Vindemiatrix, Zubenelgenubi.
Cost per minute of output audio is fixed: $0.24 = 24 000 audio tokens × $10 per 1M tokens. Wall-clock time scales ≈ linearly with audio length; voice choice shifts it by only ±10%.
| Audio length | Wall clock (typical) | Cost |
|---|---|---|
| 5 s utterance | ~5 s (range 4.2–6.8) | $0.02 |
| 30 s utterance | ~14 s (range 11–18) | $0.12 |
| 1 min utterance | ~60 s | $0.24 |
| 10 min long-form | ~10 min | $2.40 |
Text input at ~150 chars/min costs $0.00002 (negligible). These are preview-tier prices and may shift at GA.
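The cost column follows directly from the per-minute rate stated above. A small sketch of that arithmetic, using only the figures from the text (24 000 audio tokens per minute, $10 per 1M tokens); `estimateAudioCostUsd` is a hypothetical helper name, and the constants would need updating if preview pricing changes.

```typescript
// Figures taken from the text above; preview-tier pricing, so provisional.
const AUDIO_TOKENS_PER_MINUTE = 24_000;
const USD_PER_MILLION_AUDIO_TOKENS = 10;

// Hypothetical helper: estimated output cost in USD for a clip of the given length.
function estimateAudioCostUsd(audioSeconds: number): number {
  const tokens = (audioSeconds / 60) * AUDIO_TOKENS_PER_MINUTE;
  return (tokens / 1_000_000) * USD_PER_MILLION_AUDIO_TOKENS;
}

// Reproduce the table's cost column for 5 s, 30 s, 1 min and 10 min.
for (const seconds of [5, 30, 60, 600]) {
  console.log(`${seconds} s -> $${estimateAudioCostUsd(seconds).toFixed(2)}`);
}
```

Cost depends only on audio length here, which is why voice choice affects wall-clock time but not price.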
Style directives add zero API-side overhead — Gemini reads the prose as metadata, not speech. Styled output is typically 15–25% longer because the model adds persona pauses/beats. Same phrase, different prompts:
| Preset | Wall clock | Audio length | Cost |
|---|---|---|---|
| Plain Charon | 4.7 s | 6.4 s | $0.026 |
| + Soprano style | 4.7 s | 7.8 s | $0.031 |
| + Freud style (different phrase) | ~5 s | 17.3 s | ~$0.07 |
Takeaway: style presets cost ~20% more per utterance because the audio runs longer; the extra buys persona quality, not API overhead.
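The ~20% figure can be checked against the two same-phrase rows in the table. The sketch below copies those values and computes the relative increase in audio length and cost; `percentIncrease` is a hypothetical helper, and the Freud row is left out because it used a different phrase.

```typescript
// Rows copied from the table above (same phrase, plain vs. styled).
const plain = { audioSeconds: 6.4, costUsd: 0.026 };
const styled = { audioSeconds: 7.8, costUsd: 0.031 };

// Hypothetical helper: relative increase of `after` over `before`, in percent.
function percentIncrease(before: number, after: number): number {
  return ((after - before) / before) * 100;
}

const lengthOverhead = percentIncrease(plain.audioSeconds, styled.audioSeconds);
const costOverhead = percentIncrease(plain.costUsd, styled.costUsd);
console.log(`length +${lengthOverhead.toFixed(1)}%, cost +${costOverhead.toFixed(1)}%`);
```

Both values come out near 20% (roughly 22% for length and 19% for cost), and wall clock is unchanged at 4.7 s, which is consistent with the extra cost coming from longer audio.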
Generated 2026-04-16 · regenerated 2026-04-15 at ~30 s per clip