I want a voice that sounds like Tony Soprano. Gemini 3.1 Flash TTS accepts natural-language voice prompts — so in theory I can just describe the voice and the model does the rest. In practice, how do you actually tune one of these prompts? What happens when the first attempt sounds wrong? How do you iterate toward the thing in your head without shipping each draft to a human listener?
This is a hill-climbing problem. You have a scoring function (does it sound like Tony?), a search space (every possible voice prompt), and you need to climb toward better scores. The interesting question isn't whether hill climbing works — it's whether a model can be the judge you climb against. If yes, you cut the human out of the inner loop and tune faster than any human could listen.
Experiment (2026-04-16): close the loop entirely inside Google's Gemini family — one model generates, another model listens and critiques, the critique moves you uphill, repeat. Generator: gemini-3.1-flash-tts-preview. Critic: gemini-3.1-pro-preview — which, unlike Claude Opus 4.6, accepts audio input natively. Target: James Gandolfini as Tony Soprano.
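The closed loop can be sketched as a plain hill climb with the generator and critic injected as callables. This is a minimal sketch, not the code in `iterate.py`: the names `hill_climb`, `generate`, `critique`, and `revise` and their signatures are hypothetical, and no real Gemini API call is shown.

```python
from typing import Callable, Tuple

# Hypothetical signatures: the source does not show its actual API calls.
Generate = Callable[[str], bytes]               # voice prompt -> audio bytes
Critique = Callable[[bytes], Tuple[int, str]]   # audio -> (score 1-10, advice)
Revise = Callable[[str, str], str]              # (prompt, advice) -> new prompt


def hill_climb(seed_prompt: str, generate: Generate, critique: Critique,
               revise: Revise, max_iters: int = 5) -> Tuple[str, int]:
    """Keep a revised prompt only if the critic scores it strictly higher."""
    best_prompt = seed_prompt
    best_score, advice = critique(generate(best_prompt))
    for _ in range(max_iters):
        candidate = revise(best_prompt, advice)
        score, new_advice = critique(generate(candidate))
        if score > best_score:  # uphill move: accept the candidate
            best_prompt, best_score, advice = candidate, score, new_advice
    return best_prompt, best_score
```

In a real run `generate` would wrap the TTS model and `critique` the audio-capable critic; keeping them as parameters means the loop can be exercised with stubs before spending any API budget.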
Final score: 6/10, using the Enceladus base voice and a tournament-tuned 6-slot prompt.
"Pacing and heavy breathing are spot-on, but the timbre lacks the deep chest resonance and forward nasal placement needed to fully sell the illusion." — Gemini 3.1 Pro, final critic verdict
| Phase | Voice | Overall | What changed | Critic verdict (abbreviated) | Listen |
|---|---|---|---|---|---|
| v1 iter 1 | Charon | 3 | Baseline Soprano preset (lock-in) | "Whiny, cartoonish mob henchman." | |
| v1 iter 2 | Charon | 3 | +3 mods appended | "Too light, nasal, rushed." | |
| v1 iter 3 | Charon | 4 | +3 more mods (6 total, monotonic append) | "Cartoonish caricature, pleading inflection." | |
| v2 redesign: base-voice sweep + dimensional scorecard + tournament + regression guard | | | | | |
| v2 sweep | Enceladus | 5 | Untouched seed prompt, new base voice | "Chest resonance + pitch authority out of the box." | |
| v2 R1 | Enceladus | 5 | NASAL_QUALITY slot replaced | "Pacing lifted, chest still thin." | not kept |
| v2 R2 (final) | Enceladus | 6 | PITCH slot replaced — "flat authoritative baritone, no upward lilt" | "Pacing + breathing spot-on; timbre still missing chest." | |
| v2 R3 | Enceladus | 6 | CHEST_RESONANCE mutation → tied but pitch regressed; regression guard fired, reverted | Two flat rounds → early stop. | reverted by guard |
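Net gain: +2 overall (v1 final 4 → v2 final 6). The base-voice swap did most of the work: with an identical seed prompt, Enceladus scored 5 where Charon scored 3, before any prompt tuning. Two tournament rounds then added one more point.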
Total end-to-end ≈ $0.20 across both loops. v1: ~$0.10 (3 TTS + 3 critic). v2: ~$0.20 (17 TTS + 17 critic + 1 final render, budget was $0.50). Critic calls are ~$0.001 each; generator dominates spend by ~10×. Wall-clock ~15 min per loop at 7s inter-call pacing to respect the 10 rpm rate limit.
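The pacing arithmetic is simple: a 10 rpm limit means at least 60 / 10 = 6 s between calls, so 7 s leaves a one-second margin. The sketch below shows one way to enforce a fixed inter-call gap. The `Pacer` class and `min_interval_seconds` helper are hypothetical names, not taken from the source scripts; the clock and sleep functions are injectable so the logic can be checked without waiting.

```python
import time
from typing import Callable, Optional


def min_interval_seconds(requests_per_minute: int) -> float:
    """Smallest gap between calls that stays within the rate limit."""
    return 60.0 / requests_per_minute


class Pacer:
    """Enforces a fixed minimum gap between successive calls (hypothetical helper)."""

    def __init__(self, interval_s: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval_s = interval_s
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Block until interval_s has passed since the last call; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.interval_s - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `pacer.wait()` immediately before each generator or critic request keeps both call types under one shared budget.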
All five male-coded voices were rendered with an identical seed prompt. Critic scored each on the same dimensional scorecard. Winner advances.
| Voice | chest | nasal | pacing | pitch | fry | overall | Listen |
|---|---|---|---|---|---|---|---|
| Charon (v1 lock-in) | 4 | 3 | 5 | 4 | 2 | 3 | |
| Fenrir | 2 | 4 | 5 | 2 | 3 | 3 | |
| Orus | 3 | 4 | 5 | 3 | 2 | 3 | |
| Puck | 4 | 4 | 6 | 4 | 3 | 4 | |
| Enceladus ← winner | 6 | 4 | 5 | 6 | 5 | 5 | |
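Picking the sweep winner reduces to a max over the scorecard. The sketch below uses the scores from the table above; the function name `pick_base_voice` and the tie-break rule (sum of the five dimensional scores) are assumptions, since the source only says the winner advances.

```python
from typing import Dict, Tuple

# Column order: chest, nasal, pacing, pitch, fry, overall (scores from the sweep table).
SWEEP: Dict[str, Tuple[int, int, int, int, int, int]] = {
    "Charon":    (4, 3, 5, 4, 2, 3),
    "Fenrir":    (2, 4, 5, 2, 3, 3),
    "Orus":      (3, 4, 5, 3, 2, 3),
    "Puck":      (4, 4, 6, 4, 3, 4),
    "Enceladus": (6, 4, 5, 6, 5, 5),
}


def pick_base_voice(sweep: Dict[str, Tuple[int, ...]]) -> str:
    """Highest overall wins; ties fall back to the dimensional total (assumed rule)."""
    def rank(voice: str) -> Tuple[int, int]:
        *dims, overall = sweep[voice]
        return (overall, sum(dims))
    return max(sweep, key=rank)


winner = pick_base_voice(SWEEP)
```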
Each round generates three candidates, one for each of the three lowest-scoring dimensions. Each candidate replaces a single prompt slot with the critic's slot-specific directive. The winner carries forward; the losers are dropped. A regression guard reverts any slot change that drops a strength dimension by more than 1 point from the Enceladus baseline.
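The two rules above, target selection and the regression guard, can be sketched as small pure functions. The function names are hypothetical, ties are broken by column order (an assumption), and treating chest and pitch as the "strength" dimensions is inferred from the critic's sweep verdict rather than stated by the source.

```python
from typing import Dict, Iterable, List

Scores = Dict[str, int]
DIMENSIONS = ("chest", "nasal", "pacing", "pitch", "fry")


def lowest_dimensions(scores: Scores, k: int = 3) -> List[str]:
    """Return the k weakest dimensions; sorted() is stable, so ties keep column order."""
    return sorted(DIMENSIONS, key=lambda d: scores[d])[:k]


def guard_violations(baseline: Scores, candidate: Scores,
                     strengths: Iterable[str], max_drop: int = 1) -> List[str]:
    """List the strength dimensions that fell more than max_drop below the baseline."""
    return [d for d in strengths if baseline[d] - candidate[d] > max_drop]


# Enceladus seed scores (R0) and the R3/C1 candidate, both from the tournament table.
seed = {"chest": 6, "nasal": 4, "pacing": 5, "pitch": 6, "fry": 5}
r3_c1 = {"chest": 5, "nasal": 6, "pacing": 8, "pitch": 4, "fry": 7}

targets = lowest_dimensions(seed)
violations = guard_violations(seed, r3_c1, strengths=("chest", "pitch"))
```

For the seed scores this selects nasal, pacing, and fry, which are the three slots mutated in round 1, and it flags pitch for R3/C1.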
| Round | Candidate | Mutated slot | chest | nasal | pacing | pitch | fry | overall | Disposition |
|---|---|---|---|---|---|---|---|---|---|
| R0 | seed | — | 6 | 4 | 5 | 6 | 5 | 5 | Enceladus baseline |
| R1 | C1 | NASAL_QUALITY | 4 | 5 | 7 | 4 | 5 | 5 | accept |
| R1 | C2 | PACING | 5 | 4 | 4 | 4 | 5 | 4 | reject |
| R1 | C3 | FRY | 4 | 4 | 6 | 4 | 3 | 4 | reject |
| R2 | C1 | CHEST_RESONANCE | 7 | 5 | 4 | 6 | 5 | 5 | reject (pacing collapsed) |
| R2 | C2 | PITCH | 5 | 6 | 8 | 6 | 7 | 6 | accept ← final |
| R2 | C3 | FRY | 4 | 5 | 6 | 4 | 6 | 4 | reject |
| R3 | C1 | CHEST_RESONANCE | 5 | 6 | 8 | 4 | 7 | 6 | tied but pitch -2; revert-guard fired |
| R3 | C2 | NASAL_QUALITY | 2 | 4 | 5 | 3 | 4 | 3 | reject |
| R3 | C3 | PITCH | 4 | 5 | 4 | 4 | 5 | 4 | reject |
Early stop on round 3: no candidate improved on R2/C2's overall=6, and the best (R3/C1) dropped pitch 2 below the Enceladus baseline. Regression guard reverted; two-rounds-flat early stop triggered.
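A patience counter is one way to express the two-rounds-flat stop. The source does not say whether the flat rounds must be consecutive; this sketch assumes a cumulative count, which is the reading that stops at round 3 given the table (R1 tied the seed at 5, R3 tied R2 at 6). The function name and the `patience` parameter are hypothetical.

```python
from typing import List, Optional


def early_stop_round(best_overall_by_round: List[int], patience: int = 2) -> Optional[int]:
    """Return the round at which `patience` non-improving rounds have accumulated.

    best_overall_by_round[0] is the baseline (R0); later entries are the best
    overall score carried forward after each round. Returns None if no stop.
    """
    flat_rounds = 0
    for rnd in range(1, len(best_overall_by_round)):
        if best_overall_by_round[rnd] <= best_overall_by_round[rnd - 1]:
            flat_rounds += 1  # no improvement this round
            if flat_rounds >= patience:
                return rnd
    return None


# Carried-forward overall scores for R0..R3, from the tournament table.
stop_at = early_stop_round([5, 5, 6, 6])
```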
v1 locked onto Charon from step 1 and appended each round's critic mods to a growing prompt without reconciling contradictions. It reached 4/10 and stalled.
Iter 1, missing: chest resonance · heavy nasal breathing · downward pitch authority.
Iter 2, missing: deep chest weight · sluggish labored pacing · rattling vocal fry.
Iter 3, missing: subtle breathing (iter 1's "insert heavy breathing" mod produced theatrical sighs; critic now wants them dialed back) · flat dismissive authority · seamless fry integration.
Takeaway from v1: iter 3's verdict explicitly called out over-correction from iter 1's mod, which is evidence that monotonic append could not reconcile contradictions between rounds. This motivated v2's single-slot replacement and tournament selection.
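The difference between the two prompt-update strategies is easy to see in code. This is an illustrative sketch only: the helper names, the seed text, the "keep breathing subtle" mod, and the placeholder slot contents are hypothetical. Only the slot names and the quoted PITCH directive come from the tables above.

```python
from typing import Dict, List


def append_mods(prompt: str, mods: List[str]) -> str:
    """v1 style: every new directive is appended; earlier ones are never removed."""
    return " ".join([prompt, *mods])


def replace_slot(slots: Dict[str, str], name: str, directive: str) -> Dict[str, str]:
    """v2 style: overwrite exactly one named slot and return a new slot mapping."""
    if name not in slots:
        raise KeyError(f"unknown slot: {name}")
    updated = dict(slots)  # copy so the previous round can be restored on revert
    updated[name] = directive
    return updated


def render(slots: Dict[str, str]) -> str:
    """Flatten the slots into a single voice prompt string."""
    return " ".join(f"{name}: {text}." for name, text in slots.items())


# v1: contradictory breathing directives end up side by side in one prompt.
v1_prompt = append_mods("Gruff New Jersey mob boss.",
                        ["Insert heavy breathing.", "Keep breathing subtle."])

# v2: the PITCH slot holds one directive at a time (other slot text is placeholder).
slots = {"PITCH": "low", "PACING": "slow"}
tuned = replace_slot(slots, "PITCH", "flat authoritative baritone, no upward lilt")
```

Because `replace_slot` returns a copy, reverting a rejected candidate is just a matter of keeping the previous mapping.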
Generated 2026-04-16 · v1 source: /tmp/larry-soprano-iter/iterate.py · v2 source: /tmp/larry-soprano-iter-v2/iterate-v2.py + resume-v2.py