Every sample on the catalog page was produced through this flow:
- … generate-tts.py) decodes + wraps in a standard WAV header.
- ffmpeg -codec:a libmp3lame -qscale:a 4 (nice-wrapped) for browser playback.

POST https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent
Content-Type: application/json
x-goog-api-key: $GOOGLE_API_KEY

{
  "contents": [{ "parts": [{ "text": "..." }] }],
  "generationConfig": {
    "responseModalities": ["AUDIO"],
    "speechConfig": {
      "voiceConfig": {
        "prebuiltVoiceConfig": { "voiceName": "Charon" }
      }
    }
  }
}
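
A minimal Python sketch of the request above plus the decode-and-wrap step. This is not the actual generate-tts.py: the function names are hypothetical, and both the response path (`candidates[0].content.parts[0].inlineData.data`) and the raw-PCM format (24 kHz, 16-bit, mono) are assumptions that should be checked against a real response before relying on them.

```python
import base64
import io
import json
import os
import urllib.request
import wave

API_URL = ("https://generativelanguage.googleapis.com/v1beta/models/"
           "gemini-3.1-flash-tts-preview:generateContent")


def build_tts_payload(text, voice="Charon"):
    """Build the JSON body shown in the request above."""
    return {
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}
            },
        },
    }


def pcm_to_wav(pcm, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a standard WAV header (format values are assumed)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def synthesize(text, voice="Charon"):
    """POST the request and return WAV bytes. Response path is an assumption."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_tts_payload(text, voice)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "x-goog-api-key": os.environ["GOOGLE_API_KEY"],
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    b64_audio = body["candidates"][0]["content"]["parts"][0]["inlineData"]["data"]
    return pcm_to_wav(base64.b64decode(b64_audio))
```

The resulting WAV bytes can be written to disk and handed to the ffmpeg step for MP3 conversion.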
| Dimension | Rate | Typical Larry utterance |
|---|---|---|
| Input text tokens | $0.50 / 1M | ~75 chars ≈ 19 tokens ≈ $0.0000095 |
| Output audio tokens | $10.00 / 1M | ~5s audio ≈ ~2,000 tokens ≈ $0.02 |
| All-in per 5s utterance | ~$0.02 | |
| Rate limit (preview) | 10 requests / minute | |
Back-of-envelope: a Larry day with 20–30 voice replies runs roughly $0.40–$0.60, or $12–$18/month. Preview tier rates may change when the model graduates to GA.
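
The arithmetic behind the table and the back-of-envelope figures can be reproduced with a short sketch. The rates and token counts are the ones listed above; the function names and the 30-day month are assumptions made for illustration.

```python
# Preview-tier rates from the table above, in dollars per 1M tokens.
INPUT_RATE_PER_M = 0.50
OUTPUT_RATE_PER_M = 10.00


def utterance_cost(input_tokens=19, output_tokens=2000):
    """Cost of one utterance; defaults match the ~75 char / ~5 s example."""
    total = input_tokens * INPUT_RATE_PER_M + output_tokens * OUTPUT_RATE_PER_M
    return total / 1_000_000


def monthly_cost(replies_per_day, days=30):
    """Monthly cost assuming every reply is a typical 5 s utterance."""
    return replies_per_day * days * utterance_cost()


per_utterance = utterance_cost()
print(f"per utterance: ${per_utterance:.4f}")
for replies in (20, 30):
    daily = replies * per_utterance
    print(f"{replies} replies/day: ${daily:.2f}/day, ${monthly_cost(replies):.2f}/month")
```

Output audio tokens dominate: the input text contributes well under a hundredth of a cent per utterance, so the monthly totals land within about a cent of the rounded $12 and $18 figures.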
Short answer: No — Claude Opus 4.6 does not accept audio input natively. The Messages API content blocks are text, image, document, tool_use/tool_result, search_result, server_tool_use, and thinking. Audio is not in the list.
Open feature request: anthropic-sdk-python #1198, filed Feb 2026, no ETA as of today.
This means that, as long as the judge has to be Claude, voice-quality judgment has to be done by Igor's ears, not by Larry. Larry can still:
- … gen-tts
- … gen-stt (Parakeet ONNX) to verify transcription fidelity (a proxy for intelligibility, not quality)

Yes, if you swap providers. gemini-3.1-pro-preview and gemini-3-pro-preview both accept audio input natively on generateContent: you pass the WAV inline as inlineData with mimeType: "audio/wav", and the model reasons about the audio the same way it does about an image or PDF.

This closed the loop we've wanted for voice work: generator (gemini-3.1-flash-tts-preview) → critic (gemini-3.1-pro-preview) → revised prompt → regenerate. Entirely inside Google's own API.
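
A sketch of what the critic request body could look like, assuming the audio part sits next to a text instruction in the same `parts` array. The function name and the instruction text are hypothetical; only `inlineData` and `mimeType: "audio/wav"` come from the description above.

```python
import base64


def build_critic_payload(wav_bytes, instruction):
    """Build a generateContent body carrying a text instruction plus inline WAV audio."""
    return {
        "contents": [{
            "parts": [
                {"text": instruction},
                {"inlineData": {
                    "mimeType": "audio/wav",
                    # Inline binary data travels as base64 text inside the JSON body.
                    "data": base64.b64encode(wav_bytes).decode("ascii"),
                }},
            ]
        }]
    }


# Hypothetical usage: the placeholder bytes stand in for a real WAV file.
payload = build_critic_payload(
    b"RIFF....WAVE",
    "Rate this voice sample from 1 to 10 and explain the score.",
)
```

The body would be POSTed to the `gemini-3.1-pro-preview:generateContent` endpoint with the same headers as the TTS request.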
We tried this on the Soprano preset: render with the current style directive, ask Gemini 3.1 Pro to rate 1–10 against the Gandolfini reference and propose 3 concrete prompt modifications, append the mods, regenerate. Three iterations. Scores went 3 → 3 → 4 — modest gains, limited by a naive append-only fusion (the generator ends up obeying an early "insert heavy breathing" mod even when a later critic complains the breathing became cartoonish). Full write-up, audio samples, and per-iteration critic JSON: soprano iteration →.
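
The loop described above, including its append-only fusion, can be sketched as follows. `render` and `critique` are hypothetical callables standing in for the TTS call and the Gemini 3.1 Pro critic call; this is an illustration of the control flow, not the code used for the Soprano run.

```python
from dataclasses import dataclass


@dataclass
class Round:
    prompt: str   # style directive used for this render
    score: int    # critic's 1-10 rating
    mods: list    # critic's proposed prompt modifications


def refine(base_prompt, render, critique, iterations=3):
    """Run generate -> critique -> append mods -> regenerate, returning every round."""
    prompt = base_prompt
    history = []
    for _ in range(iterations):
        audio = render(prompt)            # hypothetical: returns audio bytes
        score, mods = critique(audio)     # hypothetical: returns (score, [mod, ...])
        history.append(Round(prompt, score, mods))
        # Append-only fusion: earlier mods are never removed or reconciled,
        # so an early instruction stays in force even if later critics object.
        prompt = prompt + "\n" + "\n".join(mods)
    return history
```

Because nothing is ever dropped from the prompt, a first-round mod is still present in the final directive. A fusion step that rewrites or prunes earlier mods would be the natural thing to try next.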
Key findings:
- … gemini-3.1-pro-preview or fall back to gemini-2.5-pro / gemini-2.5-flash (both confirmed audio-in on this repo's test probe).

For this catalog: Igor's ears are still the final judge, but Gemini 3.1 Pro can now shortlist before the ear test.
gen-tts Python tool

Lives at chop-conventions/skills/gen-tts/ (PR #138). Single-file Python script, uv run --script shebang, stdlib-only. Flags (the style comes from one of --style-preset, --style-prompt "...", or --style-file voices/foo.txt):

    ./generate-tts.py \
      --text "line to speak" \
      --voice Charon \
      --style-preset soprano \
      --output /tmp/larry.wav
Batch mode uses ThreadPoolExecutor for parallel calls. Respect the 10 rpm cap: 6 concurrent workers is the most that doesn't trigger quota errors. Directorial tags like [short pause] work inline; [whisper] and [excited] currently trip Gemini's safety filter, as documented in SKILL.md.
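
One way to combine a six-worker pool with the 10 rpm cap is to hand out evenly spaced start slots. The sketch below is an assumption about how such a batch mode could look, not the script's actual implementation; `RateLimiter`, `run_batch`, and the `synth` callable are hypothetical names.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Hands out start slots spaced 60 / per_minute seconds apart."""

    def __init__(self, per_minute=10, clock=time.monotonic, sleep=time.sleep):
        self._interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def wait(self):
        # Reserve the next free slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
        if slot > now:
            self._sleep(slot - now)


def run_batch(texts, synth, max_workers=6, per_minute=10):
    """Call synth(text) for each text in parallel, returning results in input order."""
    limiter = RateLimiter(per_minute)

    def task(text):
        limiter.wait()
        return synth(text)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, texts))
```

With slots 6 s apart, more than one worker is busy only when a single call takes longer than 6 s. Sleep jitter means the spacing approximates the cap rather than guaranteeing it, so quota errors should still be handled by the caller.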
All 17 voice samples on the catalog page were generated with identical input text:
"Hi Igor, this is Larry testing the Gemini voice catalog at regular pacing."
Same phrase, different voices, no style directive. Each file's name is the catalog voice name. Wall-clock timings listed on the catalog page are the observed generation time for that specific sample.
Custom-preset samples (freud-preset.wav, soprano-preset.wav) use Charon as the base voice + the natural-language directive from the preset file as a prefix.
Generated 2026-04-16. If you find voice quality has shifted or pricing has changed materially, re-ping Larry for a regeneration sweep.