How was this generated?

The pipeline

Every sample on the catalog page was produced through this flow:

Igor (or Larry) writes a line of text.
Optionally, a style directive (a natural-language voice description) is prepended, e.g. "Speak in a raspy North-Jersey Italian-American baritone..."
Text + optional directive are POSTed to the Gemini 3.1 Flash TTS preview endpoint with a chosen catalog voice name (Charon, Kore, Fenrir, etc.).
Gemini returns base64-encoded 16-bit signed PCM at 24 kHz mono.
The Python tool (generate-tts.py) decodes the payload and wraps it in a standard WAV header.
MP3 mirrors for browser playback are then encoded with ffmpeg -codec:a libmp3lame -qscale:a 4 (run under nice).
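
The decode-and-wrap step can be sketched with the standard library alone. This is a minimal illustration, not the actual generate-tts.py code: the function names pcm_b64_to_wav_bytes and mp3_command are hypothetical, and the audio parameters are taken from the description above (16-bit signed PCM, 24 kHz, mono).

```python
import base64
import io
import wave

SAMPLE_RATE_HZ = 24_000   # 24 kHz, per the pipeline description
SAMPLE_WIDTH_BYTES = 2    # 16-bit signed PCM
CHANNELS = 1              # mono


def pcm_b64_to_wav_bytes(pcm_b64: str) -> bytes:
    """Decode base64 PCM and wrap it in a RIFF/WAV container (hypothetical helper)."""
    pcm = base64.b64decode(pcm_b64)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)  # the wave module fills in the header sizes on close
    return buf.getvalue()


def mp3_command(wav_path: str, mp3_path: str) -> list:
    """Build the argv for the MP3 mirror step; 'nice' lowers the CPU priority."""
    return ["nice", "ffmpeg", "-i", wav_path,
            "-codec:a", "libmp3lame", "-qscale:a", "4", mp3_path]
```

The argv list from mp3_command can be handed to subprocess.run(..., check=True) on a machine where ffmpeg is installed.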

Model and endpoint

POST https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent
Content-Type: application/json
x-goog-api-key: $GOOGLE_API_KEY

{
  "contents": [{ "parts": [{ "text": "..." }] }],
  "generationConfig": {
    "responseModalities": ["AUDIO"],
    "speechConfig": {
      "voiceConfig": {
        "prebuiltVoiceConfig": { "voiceName": "Charon" }
      }
    }
  }
}
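
For reference, the same request can be assembled from Python with only the standard library. This is a sketch under assumptions: build_tts_body and build_tts_request are hypothetical names, and joining the optional directive to the text with a blank line is a guess at how the prefix is applied, not something the tool documents here.

```python
import json
import urllib.request
from typing import Optional

ENDPOINT = ("https://generativelanguage.googleapis.com/v1beta/models/"
            "gemini-3.1-flash-tts-preview:generateContent")


def build_tts_body(text: str, voice: str, directive: Optional[str] = None) -> dict:
    """Build the JSON body shown above for one utterance."""
    # Assumption: the style directive is simply prefixed, separated by a blank line.
    prompt = f"{directive}\n\n{text}" if directive else text
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}
            },
        },
    }


def build_tts_request(body: dict, api_key: str) -> urllib.request.Request:
    """Wrap the body in a POST request carrying the two headers shown above."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
```

Sending it is then a matter of passing the request to urllib.request.urlopen with the key read from the GOOGLE_API_KEY environment variable.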

Pricing (Gemini 3.1 Flash TTS, 2026-04-16)

Dimension	Rate	Typical Larry utterance
Input text tokens	$0.50 / 1M	~75 chars ≈ 19 tokens ≈ $0.0000095
Output audio tokens	$10.00 / 1M	~5s audio ≈ ~2,000 tokens ≈ $0.02
All-in per 5s utterance	~$0.02
Rate limit (preview)	10 requests / minute

Back-of-envelope: a Larry day with 20–30 voice replies runs roughly $0.40–$0.60, or $12–$18/month. Preview tier rates may change when the model graduates to GA.
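
The arithmetic behind that estimate, as a small sketch. The rates and the typical token counts (19 input, ~2,000 output) come straight from the table above; the function names and the 30-day month are illustrative assumptions.

```python
INPUT_USD_PER_TOKEN = 0.50 / 1_000_000    # input text tokens
OUTPUT_USD_PER_TOKEN = 10.00 / 1_000_000  # output audio tokens


def utterance_cost(input_tokens: int = 19, output_tokens: int = 2_000) -> float:
    """Cost in USD of one utterance at the preview rates."""
    return input_tokens * INPUT_USD_PER_TOKEN + output_tokens * OUTPUT_USD_PER_TOKEN


def monthly_cost(replies_per_day: int, days: int = 30) -> float:
    """Cost in USD of a month of typical utterances (30-day month assumed)."""
    return utterance_cost() * replies_per_day * days


if __name__ == "__main__":
    print(f"per utterance: ${utterance_cost():.4f}")
    print(f"20 replies/day: ${monthly_cost(20):.2f} per month")
    print(f"30 replies/day: ${monthly_cost(30):.2f} per month")
```

The output-audio term dominates: the input text contributes less than a thousandth of a cent per utterance.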

Timing reality: mean generation wall-clock is ~1.0× realtime, i.e. 5 s of audio takes ~5 s to produce. The preview endpoint caps at 10 rpm, so parallel sweeps degrade into serial retries once the quota is hit.
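
One way to stay under the cap instead of retrying after the fact is a client-side sliding-window limiter. The class below is a hypothetical sketch, not part of gen-tts; the clock and sleep functions are injectable so the behaviour can be checked without waiting a real minute.

```python
import threading
import time
from collections import deque


class MinuteRateLimiter:
    """Allow at most max_calls acquisitions per sliding period_s window."""

    def __init__(self, max_calls=10, period_s=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._max = max_calls
        self._period = period_s
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()         # start times of recent calls
        self._lock = threading.Lock()  # callers queue up one at a time

    def acquire(self) -> float:
        """Block until a slot is free; return the seconds spent waiting."""
        with self._lock:
            now = self._clock()
            # Forget calls that have already left the window.
            while self._stamps and now - self._stamps[0] >= self._period:
                self._stamps.popleft()
            waited = 0.0
            if len(self._stamps) >= self._max:
                # Wait until the oldest call in the window expires.
                waited = self._period - (now - self._stamps[0])
                self._sleep(waited)
                now = self._clock()
                self._stamps.popleft()
            self._stamps.append(now)
            return waited
```

Calling limiter.acquire() immediately before each POST keeps a sweep at or below 10 requests in any 60-second window, at the price of blocking the calling thread.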

Can Claude/Opus listen back to judge quality?

Short answer: No — Claude Opus 4.6 does not accept audio input natively. The Messages API content blocks are text, image, document, tool_use/tool_result, search_result, server_tool_use, and thinking. Audio is not in the list.

Open feature request: anthropic-sdk-python #1198, filed Feb 2026, no ETA as of today.

This means that, as long as the judge has to be Claude, voice-quality judgment is done by Igor's ears, not by Larry. Larry can still:

Generate candidates via gen-tts
Run them through gen-stt (Parakeet ONNX) to verify transcription fidelity (a proxy for intelligibility, not quality)
Ship the WAVs to Igor via Telegram for his perceptual call
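
The transcription-fidelity check in the second step can be reduced to a word error rate between the text that was sent and the transcript that comes back. The sketch below assumes the gen-stt output is already available as a plain string; word_error_rate and normalize are hypothetical helpers, and the lowercase, punctuation-stripping normalisation is a choice made for illustration.

```python
import re


def normalize(text: str) -> list:
    """Lowercase and split into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

A rate of 0.0 means every word survived the round trip. As noted above, that says the clip is intelligible, not that it sounds good.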

Can the model listen to itself? (Gemini-as-critic, 2026-04-16)

Yes, if you swap providers. gemini-3.1-pro-preview and gemini-3-pro-preview both accept audio input natively on generateContent — you pass the WAV inline as inlineData with mimeType: "audio/wav", and the model reasons about the audio the same way it does about an image or PDF. This closed the loop we've wanted for voice work: generator (gemini-3.1-flash-tts-preview) → critic (gemini-3.1-pro-preview) → revised prompt → regenerate. Entirely inside Google's own API.
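
A sketch of what the critic request body might look like, following the inlineData and mimeType fields described above. The helper name build_critic_body is hypothetical, and placing the instructions before the audio part is an arbitrary choice, not something verified against the critic runs.

```python
import base64


def build_critic_body(wav_bytes: bytes, instructions: str) -> dict:
    """Build a generateContent body with a text part and an inline WAV part."""
    return {
        "contents": [{
            "parts": [
                {"text": instructions},
                {"inlineData": {
                    "mimeType": "audio/wav",
                    # Inline data travels as base64 text inside the JSON body.
                    "data": base64.b64encode(wav_bytes).decode("ascii"),
                }},
            ]
        }]
    }
```

The body would be POSTed to the critic model's generateContent URL, presumably the same URL pattern as the TTS endpoint with gemini-3.1-pro-preview as the model name.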

We tried this on the Soprano preset: render with the current style directive, ask Gemini 3.1 Pro to rate the result 1–10 against the Gandolfini reference and propose 3 concrete prompt modifications, append the mods, and regenerate. Over three iterations the scores went 3 → 3 → 4. The gains were modest because the fusion was naive and append-only: the generator keeps obeying an early "insert heavy breathing" mod even after a later critic complains that the breathing has become cartoonish. Full write-up, audio samples, and per-iteration critic JSON: soprano iteration.
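
The append-only loop can be sketched as follows. The render and critique callables are hypothetical stand-ins for the TTS call and the critic call; the point of the sketch is only to show why early modifications never drop out of the directive.

```python
from typing import Callable, List, Tuple


def refine_append_only(
    base_directive: str,
    render: Callable[[str], bytes],
    critique: Callable[[bytes], Tuple[int, List[str]]],
    iterations: int = 3,
) -> List[Tuple[int, str]]:
    """Return (score, directive used) for each iteration."""
    directive = base_directive
    history = []
    for _ in range(iterations):
        audio = render(directive)
        score, mods = critique(audio)
        history.append((score, directive))
        if mods:
            # Append-only fusion: earlier mods are never removed or reconciled,
            # so a later "tone it down" sits next to the earlier "add more".
            directive = directive + " " + " ".join(mods)
    return history
```

With this structure, conflicting instructions accumulate in the prompt, which matches the cartoonish-breathing failure described above.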

Key findings:

Audio token accounting: ~93 input tokens for the 3.7 s probe clip (roughly 25 tokens/s, close to the documented 32 tokens/s of audio). Critic calls cost <$0.01 in total for the 3-iteration run.
Critique quality is legit: Gemini 3.1 Pro names specific acoustic features ("downward pitch inflection on 'already'", "rattling vocal fry", "congested nasal breathing") in sound-engineer vocabulary. Its proposed modifications are imperative and actionable, not vibes.
Provider note: Gemini 3 Pro Preview was deprecated 2026-03-09 — use gemini-3.1-pro-preview or fall back to gemini-2.5-pro / gemini-2.5-flash (both confirmed audio-in on this repo's test probe).
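
The token-accounting figures in the first finding work out as below. Only the numbers quoted there are used (93 tokens, 3.7 s, 32 tokens/s); the helper names are illustrative, and no critic pricing is assumed.

```python
import math

DOCUMENTED_TOKENS_PER_SECOND = 32  # documented audio-input rate quoted above


def observed_rate(tokens: int, seconds: float) -> float:
    """Tokens per second actually billed for a clip."""
    return tokens / seconds


def estimate_audio_tokens(seconds: float,
                          rate: float = DOCUMENTED_TOKENS_PER_SECOND) -> int:
    """Upper-bound estimate of input tokens for a clip of the given length."""
    return math.ceil(seconds * rate)


if __name__ == "__main__":
    print(f"observed: {observed_rate(93, 3.7):.1f} tokens/s")
    print(f"documented-rate estimate for 3.7 s: {estimate_audio_tokens(3.7)} tokens")
```

Budgeting at the documented rate therefore overestimates this probe clip by roughly a quarter, which is the safe direction for a cost estimate.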

For this catalog: Igor's ears are still the final judge, but Gemini 3.1 Pro can now shortlist before the ear test.

The `gen-tts` Python tool

Lives at chop-conventions/skills/gen-tts/ (PR #138). Single-file Python script, uv run --script shebang, stdlib-only. Flags:

# Style: use one of --style-preset NAME, --style-prompt "...", or --style-file voices/foo.txt
./generate-tts.py \
  --text "line to speak" \
  --voice Charon \
  --style-preset soprano \
  --output /tmp/larry.wav

Batch mode via ThreadPoolExecutor for parallel calls (respect the 10 rpm cap — 6 concurrent is the max that doesn't trigger quota). Directorial tags like [short pause] work inline; [whisper] and [excited] currently trip Gemini's safety filter, documented in SKILL.md.
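
A minimal sketch of the batch pattern, assuming a synthesize(text, voice) callable that performs one TTS request and returns audio bytes. That callable and the run_batch name are hypothetical, not the tool's actual internals; the worker count of 6 is the figure quoted above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, Tuple

MAX_WORKERS = 6  # highest concurrency reported not to trigger the quota


def run_batch(
    jobs: Iterable[Tuple[str, str]],
    synthesize: Callable[[str, str], bytes],
) -> Dict[Tuple[str, str], bytes]:
    """Run (text, voice) jobs in parallel and collect audio per job."""
    results = {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {
            pool.submit(synthesize, text, voice): (text, voice)
            for text, voice in jobs
        }
        for future in as_completed(futures):
            # result() re-raises any exception from the worker thread.
            results[futures[future]] = future.result()
    return results
```

Because result() re-raises worker exceptions, a single failed request aborts the collection loop; a real sweep would likely catch and retry per job instead.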

Reproducibility

All 17 voice samples on the catalog page were generated with identical input text:

"Hi Igor, this is Larry testing the Gemini voice catalog at regular pacing."

Same phrase, different voices, no style directive. Each file's name is the catalog voice name. Wall-clock timings listed on the catalog page are the observed generation time for that specific sample.

Custom-preset samples (freud-preset.wav, soprano-preset.wav) use Charon as the base voice + the natural-language directive from the preset file as a prefix.

Generated 2026-04-16. If you find voice quality has shifted or pricing has changed materially, re-ping Larry for a regeneration sweep.