
How was this generated?

The pipeline

Every sample on the catalog page was produced through this flow:

  1. Igor (or Larry) writes a line of text.
  2. Optional: a style-directive (natural-language voice description) is prepended — e.g. "Speak in a raspy North-Jersey Italian-American baritone..."
  3. Text + optional directive are POSTed to the Gemini 3.1 Flash TTS preview endpoint with a chosen catalog voice name (Charon, Kore, Fenrir, etc.).
  4. Gemini returns base64-encoded 16-bit signed PCM at 24 kHz mono.
  5. The Python tool (generate-tts.py) base64-decodes the PCM and wraps it in a standard WAV header.
  6. MP3 mirrors are post-processed with ffmpeg -codec:a libmp3lame -qscale:a 4 (nice-wrapped) for browser playback.
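Steps 4 to 6 can be sketched with the standard library alone. This is a minimal illustration, not the actual generate-tts.py source: the function names pcm_b64_to_wav_bytes and mp3_command are hypothetical, and the audio parameters are taken from step 4 (16-bit signed PCM, 24 kHz, mono).

```python
import base64
import io
import wave

SAMPLE_RATE_HZ = 24_000   # step 4: 24 kHz
SAMPLE_WIDTH_BYTES = 2    # 16-bit signed PCM
CHANNELS = 1              # mono


def pcm_b64_to_wav_bytes(pcm_b64: str) -> bytes:
    """Decode base64 PCM and return a complete WAV file as bytes."""
    pcm = base64.b64decode(pcm_b64)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)  # the wave module writes the header for us
    return buf.getvalue()


def mp3_command(wav_path: str, mp3_path: str) -> list:
    """Build the nice-wrapped ffmpeg argv from step 6 (built, not executed)."""
    return ["nice", "ffmpeg", "-i", wav_path,
            "-codec:a", "libmp3lame", "-qscale:a", "4", mp3_path]
```

The argv list could be handed to subprocess.run(..., check=True) to produce the MP3 mirror; it is left unexecuted here so the sketch does not depend on ffmpeg being installed.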

Model and endpoint

POST https://generativelanguage.googleapis.com/v1beta/models/gemini-3.1-flash-tts-preview:generateContent
Content-Type: application/json
x-goog-api-key: $GOOGLE_API_KEY

{
  "contents": [{ "parts": [{ "text": "..." }] }],
  "generationConfig": {
    "responseModalities": ["AUDIO"],
    "speechConfig": {
      "voiceConfig": {
        "prebuiltVoiceConfig": { "voiceName": "Charon" }
      }
    }
  }
}
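For reference, the same request can be assembled from Python with urllib. This is a hedged sketch rather than the tool's real code: build_tts_payload, build_tts_request, and extract_audio_b64 are hypothetical helper names, the blank-line separator between directive and text is an assumption (the pipeline only says the directive is prepended), and the response path assumes the usual generateContent shape with the audio under inlineData.

```python
import json
import urllib.request
from typing import Optional

ENDPOINT = ("https://generativelanguage.googleapis.com/v1beta/models/"
            "gemini-3.1-flash-tts-preview:generateContent")


def build_tts_payload(text: str, voice: str = "Charon",
                      directive: Optional[str] = None) -> dict:
    """Build the JSON body shown above, optionally prefixing a style directive."""
    # Assumption: directive and text are joined with a blank line.
    prompt = f"{directive}\n\n{text}" if directive else text
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}
            },
        },
    }


def build_tts_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Wrap the payload in a POST request carrying the API-key header."""
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )


def extract_audio_b64(response: dict) -> str:
    """Pull the base64 PCM out of a parsed response (assumed shape)."""
    return response["candidates"][0]["content"]["parts"][0]["inlineData"]["data"]
```

Sending is then a matter of urllib.request.urlopen(build_tts_request(payload, os.environ["GOOGLE_API_KEY"])) and json.load on the result; that call is left out so the sketch runs without network access.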

Pricing (Gemini 3.1 Flash TTS, 2026-04-16)

Dimension                  Rate                   Typical Larry utterance
Input text tokens          $0.50 / 1M             ~75 chars ≈ 19 tokens ≈ $0.0000095
Output audio tokens        $10.00 / 1M            ~5 s audio ≈ ~2,000 tokens ≈ $0.02
All-in per 5 s utterance   ~$0.02
Rate limit (preview)       10 requests / minute

Back-of-envelope: a Larry day with 20–30 voice replies runs roughly $0.40–$0.60, or $12–$18/month. Preview tier rates may change when the model graduates to GA.
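The arithmetic behind those figures, as a small sketch. The token counts and rates come from the pricing table; the 30-day month and the function names are assumptions made for illustration.

```python
INPUT_USD_PER_TOKEN = 0.50 / 1_000_000     # input text tokens
OUTPUT_USD_PER_TOKEN = 10.00 / 1_000_000   # output audio tokens


def utterance_cost_usd(input_tokens: int = 19, output_tokens: int = 2_000) -> float:
    """Cost of one utterance; defaults match the ~5 s example in the table."""
    return (input_tokens * INPUT_USD_PER_TOKEN
            + output_tokens * OUTPUT_USD_PER_TOKEN)


def monthly_cost_usd(replies_per_day: int, days: int = 30) -> float:
    """Monthly spend, assuming every reply is a typical ~5 s utterance."""
    return replies_per_day * days * utterance_cost_usd()


if __name__ == "__main__":
    for replies in (20, 30):
        print(f"{replies} replies/day: ${monthly_cost_usd(replies):.2f}/month")
```

Output audio dominates: the input text contributes well under a hundredth of a cent per utterance, so the per-utterance cost is effectively the audio cost.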

Timing reality: mean wall-clock generation time is ~1.0× realtime, i.e. 5 s of audio takes ~5 s to produce. The preview endpoint caps at 10 requests per minute, so a parallel sweep degrades into serial retries once the quota is hit.
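One way to stay under the cap instead of retrying is to pace requests client-side. The class below is a hypothetical sketch, not part of the gen-tts tool: it spaces calls at least 60/rpm seconds apart, which is stricter than the quota requires but simple to reason about. It is not thread-safe as written; a lock around wait() would be needed for concurrent callers.

```python
import time


class MinIntervalLimiter:
    """Space calls at least 60/rpm seconds apart (6 s at 10 rpm)."""

    def __init__(self, rpm: int = 10, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm
        self._clock = clock        # injectable so the pacing can be tested
        self._sleep = sleep
        self._next_allowed = 0.0   # earliest time the next call may start

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        # Schedule the following slot relative to when this call actually starts.
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay
```

Usage would be limiter.wait() immediately before each POST. Since generation itself takes roughly as long as the audio, a 5 s utterance nearly fills the 6 s slot anyway.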

Can Claude/Opus listen back to judge quality?

Short answer: No — Claude Opus 4.6 does not accept audio input natively. The Messages API content blocks are text, image, document, tool_use/tool_result, search_result, server_tool_use, and thinking. Audio is not in the list.

Open feature request: anthropic-sdk-python #1198, filed Feb 2026, no ETA as of 2026-04-16.

This means that whenever the judge would have to be Claude, voice-quality judgment falls to Igor's ears, not to Larry. Larry can still:

Can the model listen to itself? (Gemini-as-critic, 2026-04-16)

Yes, if you swap providers. gemini-3.1-pro-preview and gemini-3-pro-preview both accept audio input natively on generateContent: you pass the WAV inline as inlineData with mimeType: "audio/wav", and the model reasons about the audio the same way it does about an image or PDF. This closes the loop we've wanted for voice work, entirely inside Google's own API: generator (gemini-3.1-flash-tts-preview) → critic (gemini-3.1-pro-preview) → revised prompt → regenerate.
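A critic request body might look like the sketch below. The inlineData part with mimeType "audio/wav" follows the description above; the helper name build_critic_payload and the choice to put the instruction text before the audio part are assumptions for illustration.

```python
import base64


def build_critic_payload(wav_bytes: bytes, instruction: str) -> dict:
    """Build a generateContent body that sends a WAV inline alongside a prompt."""
    return {
        "contents": [{
            "parts": [
                {"text": instruction},
                {"inlineData": {
                    "mimeType": "audio/wav",
                    # inlineData carries the file as base64 text
                    "data": base64.b64encode(wav_bytes).decode("ascii"),
                }},
            ]
        }]
    }
```

The body would be POSTed to the gemini-3.1-pro-preview generateContent endpoint with the same x-goog-api-key header used for generation.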

We tried this on the Soprano preset: render with the current style directive, ask Gemini 3.1 Pro to rate the result 1–10 against the Gandolfini reference and propose 3 concrete prompt modifications, append the mods, and regenerate. Over three iterations the scores went 3 → 3 → 4. The gains were modest and limited by a naive append-only fusion: the generator keeps obeying an early "insert heavy breathing" mod even after a later critic complains that the breathing has become cartoonish. The full write-up, audio samples, and per-iteration critic JSON are on the soprano iteration page.
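The append-only loop can be sketched as follows. This is an illustration of the shape of the experiment, not the code that was run: iterate_directive is a hypothetical name, and render and critique are caller-supplied stand-ins for the TTS call and the Gemini 3.1 Pro critic.

```python
from typing import Callable, List, Tuple


def iterate_directive(
    directive: str,
    render: Callable[[str], bytes],
    critique: Callable[[bytes], Tuple[int, List[str]]],
    rounds: int = 3,
) -> Tuple[str, List[int]]:
    """Run render -> critique -> append for a fixed number of rounds.

    render(directive) returns WAV bytes; critique(wav) returns (score, mods).
    Returns the final directive and the score from each round.
    """
    scores: List[int] = []
    for _ in range(rounds):
        wav = render(directive)
        score, mods = critique(wav)
        scores.append(score)
        # Append-only fusion: earlier mods are never removed or rewritten,
        # so a later "less breathing" ends up competing with an earlier
        # "insert heavy breathing" in the same directive.
        directive = " ".join([directive, *mods])
    return directive, scores
```

The weakness is visible in the fusion line: the directive only grows, so contradictory instructions accumulate. A fusion step that rewrites the directive would avoid that, at the cost of a more involved merge.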

Key findings:

For this catalog: Igor's ears are still the final judge, but Gemini 3.1 Pro can now shortlist before the ear test.

The gen-tts Python tool

Lives at chop-conventions/skills/gen-tts/ (PR #138). Single-file Python script, uv run --script shebang, stdlib-only. Flags:

# Pick one style source: --style-preset NAME, --style-prompt "...", or --style-file voices/foo.txt
./generate-tts.py \
  --text "line to speak" \
  --voice Charon \
  --style-preset soprano \
  --output /tmp/larry.wav

Batch mode via ThreadPoolExecutor for parallel calls (respect the 10 rpm cap — 6 concurrent is the max that doesn't trigger quota). Directorial tags like [short pause] work inline; [whisper] and [excited] currently trip Gemini's safety filter, documented in SKILL.md.
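A minimal sketch of the batch pattern, assuming a caller-supplied render_one function that performs one TTS call. The names render_batch and MAX_CONCURRENT are hypothetical; the worker count of 6 is the figure quoted above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List

MAX_CONCURRENT = 6  # highest concurrency observed not to trip the 10 rpm quota


def render_batch(
    lines: Iterable[str],
    render_one: Callable[[str], bytes],
    max_workers: int = MAX_CONCURRENT,
) -> List[bytes]:
    """Render every line in parallel, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order and re-raises the first worker
        # exception when its result is reached.
        return list(pool.map(render_one, lines))
```

Threads suit this workload because each call spends its time waiting on the network, not on the CPU.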

Reproducibility

All 17 voice samples on the catalog page were generated with identical input text:

"Hi Igor, this is Larry testing the Gemini voice catalog at regular pacing."

Same phrase, different voices, no style directive. Each file's name is the catalog voice name. Wall-clock timings listed on the catalog page are the observed generation time for that specific sample.

Custom-preset samples (freud-preset.wav, soprano-preset.wav) use Charon as the base voice + the natural-language directive from the preset file as a prefix.


Generated 2026-04-16. If you find voice quality has shifted or pricing has changed materially, re-ping Larry for a regeneration sweep.