Soprano iteration — can Gemini critique its own voice?

The user problem

I want a voice that sounds like Tony Soprano. Gemini 3.1 Flash TTS accepts natural-language voice prompts — so in theory I can just describe the voice and the model does the rest. In practice, how do you actually tune one of these prompts? What happens when the first attempt sounds wrong? How do you iterate toward the thing in your head without shipping each draft to a human listener?

This is a hill-climbing problem. You have a scoring function (does it sound like Tony?), a search space (every possible voice prompt), and you need to climb toward better scores. The interesting question isn't whether hill climbing works — it's whether a model can be the judge you climb against. If yes, you cut the human out of the inner loop and tune faster than any human could listen.

Experiment (2026-04-16): close the loop entirely inside Google's Gemini family — one model generates, another model listens and critiques, the critique moves you uphill, repeat. Generator: gemini-3.1-flash-tts-preview. Critic: gemini-3.1-pro-preview — which, unlike Claude Opus 4.6, accepts audio input natively. Target: James Gandolfini as Tony Soprano.
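
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code from iterate.py: `generate`, `critique`, and `revise` are hypothetical callables standing in for the TTS call, the audio-critic call, and whatever turns a critique into a revised prompt. The stop-on-no-improvement rule is also an assumption.

```python
from typing import Callable, Tuple


def hill_climb(
    seed_prompt: str,
    generate: Callable[[str], bytes],               # hypothetical: voice prompt -> audio bytes
    critique: Callable[[bytes], Tuple[int, str]],   # hypothetical: audio -> (score 0-10, directive)
    revise: Callable[[str, str], str],              # hypothetical: (prompt, directive) -> new prompt
    max_iters: int = 5,
) -> Tuple[str, int]:
    """Simple hill climbing: keep a candidate only if the critic scores it higher."""
    best_prompt = seed_prompt
    best_score, directive = critique(generate(seed_prompt))
    for _ in range(max_iters):
        candidate = revise(best_prompt, directive)
        score, new_directive = critique(generate(candidate))
        if score <= best_score:
            break  # no uphill step found; stop at the current best
        best_prompt, best_score, directive = candidate, score, new_directive
    return best_prompt, best_score
```

Passing the three steps in as functions keeps the sketch independent of any particular SDK; in a real run they would wrap the generator and critic models named above.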

🏆 Winner — v2 final

6/10   Enceladus base voice + tournament-tuned 6-slot prompt

"Pacing and heavy breathing are spot-on, but the timbre lacks the deep chest resonance and forward nasal placement needed to fully sell the illusion." — Gemini 3.1 Pro, final critic verdict

Winning prompt (structured, 6 slots)
Speak from deep in the chest with a heavy bass-baritone resonance, as if the voice is anchored in the sternum rather than the head. Introduce heavy sinus congestion and audible, labored nasal breathing between phrases to simulate years of heavy smoking and physical weight. Short sentences with a shrugged, unhurried cadence. Pauses are deliberate beats, not hesitations. Slightly lethargic, never rushed. Anchor the pitch in a low, authoritative baritone register, eliminating any upward lilt to maintain a flat, quietly menacing presence. Blend subtle, sustained vocal fry into the mid-vowels, not isolated rattles at sentence endings. Channel James Gandolfini as Tony Soprano in an intimate indoor scene — North-Jersey Italian-American, affectionate default, quiet menace underneath. Drop the final '-g' in gerunds (thinkin', doin'). Never shout.

Combined score trajectory — v1 → v2

| Phase | Voice | Overall | What changed | Critic verdict (abbreviated) | Listen |
|---|---|---|---|---|---|
| v1 iter 1 | Charon | 3 | Baseline Soprano preset (lock-in) | "Whiny, cartoonish mob henchman." | |
| v1 iter 2 | Charon | 3 | +3 mods appended | "Too light, nasal, rushed." | |
| v1 iter 3 | Charon | 4 | +3 more mods (6 total, monotonic append) | "Cartoonish caricature, pleading inflection." | |

v2 redesign: base-voice sweep + dimensional scorecard + tournament + regression guard

| Phase | Voice | Overall | What changed | Critic verdict (abbreviated) | Listen |
|---|---|---|---|---|---|
| v2 sweep | Enceladus | 5 | Untouched seed prompt, new base voice | "Chest resonance + pitch authority out of the box." | |
| v2 R1 | Enceladus | 5 | NASAL_QUALITY slot replaced | "Pacing lifted, chest still thin." | not kept |
| v2 R2 (final) | Enceladus | 6 | PITCH slot replaced: "flat authoritative baritone, no upward lilt" | "Pacing + breathing spot-on; timbre still missing chest." | |
| v2 R3 | Enceladus | 6 | CHEST_RESONANCE mutation tied, but pitch regressed; regression guard fired, reverted | Two flat rounds → early stop. | reverted by guard |

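Net gain: +2 overall (v1 final 4 → v2 final 6). The base-voice swap did most of the work: on the identical seed prompt, Enceladus scored 5 against Charon's 3 before any prompt tuning, and the tournament then added one more point.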

Key findings (what hill climbing with a model-judge actually looks like)

Cost

Total end-to-end ≈ $0.20 across both loops. v1: ~$0.10 (3 TTS + 3 critic calls). v2: ~$0.20 (17 TTS + 17 critic calls + 1 final render, against a $0.50 budget). Critic calls cost ~$0.001 each; the generator dominates spend by ~10×. Wall-clock was ~15 min per loop, with 7 s inter-call pacing to stay under the 10 rpm rate limit.
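
The pacing figures can be checked with a little arithmetic. The sketch below assumes the 7 s sleep is applied before every call, TTS and critic alike, which the source does not spell out. The helper name is illustrative.

```python
RATE_LIMIT_RPM = 10   # from the text
PACING_S = 7.0        # from the text

# Tightest spacing the rate limit allows, and the rate actually achieved.
min_interval_s = 60.0 / RATE_LIMIT_RPM   # 6.0 s
calls_per_min = 60.0 / PACING_S          # about 8.57 calls/min, under the limit


def pacing_floor_minutes(n_calls: int, pacing_s: float = PACING_S) -> float:
    """Lower bound on wall-clock from the sleeps alone (ignores request latency)."""
    return n_calls * pacing_s / 60.0


# v2 made 17 TTS + 17 critic + 1 final render = 35 calls.
v2_floor = pacing_floor_minutes(17 + 17 + 1)   # about 4.1 min
```

On these assumptions the sleeps account for only about 4 of the ~15 minutes per loop; the remainder would presumably be request latency, though that is an inference rather than something the run logged.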

v2 deep-dive: base-voice sweep, tournament rounds, scorecard

Base-voice sweep

All five male-coded voices were rendered with an identical seed prompt. Critic scored each on the same dimensional scorecard. Winner advances.

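| Voice | chest | nasal | pacing | pitch | fry | overall |
|---|---|---|---|---|---|---|
| Charon (v1 lock-in) | 4 | 3 | 5 | 4 | 2 | 3 |
| Fenrir | 2 | 4 | 5 | 2 | 3 | 3 |
| Orus | 3 | 4 | 5 | 3 | 2 | 3 |
| Puck | 4 | 4 | 6 | 4 | 3 | 4 |
| Enceladus ← winner | 6 | 4 | 5 | 6 | 5 | 5 |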
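
The "winner advances" step amounts to an argmax over the sweep scores. A minimal sketch using the numbers from the table above; the tie-break on the sum of the dimension scores is an assumption, and it never comes into play here because Enceladus leads outright.

```python
DIMENSIONS = ("chest", "nasal", "pacing", "pitch", "fry")

# Scores copied from the base-voice sweep table.
SWEEP = {
    "Charon":    {"chest": 4, "nasal": 3, "pacing": 5, "pitch": 4, "fry": 2, "overall": 3},
    "Fenrir":    {"chest": 2, "nasal": 4, "pacing": 5, "pitch": 2, "fry": 3, "overall": 3},
    "Orus":      {"chest": 3, "nasal": 4, "pacing": 5, "pitch": 3, "fry": 2, "overall": 3},
    "Puck":      {"chest": 4, "nasal": 4, "pacing": 6, "pitch": 4, "fry": 3, "overall": 4},
    "Enceladus": {"chest": 6, "nasal": 4, "pacing": 5, "pitch": 6, "fry": 5, "overall": 5},
}


def pick_base_voice(sweep: dict) -> str:
    """Highest overall wins; ties fall back to the dimension total (assumed rule)."""
    def rank(voice: str) -> tuple:
        scores = sweep[voice]
        return (scores["overall"], sum(scores[d] for d in DIMENSIONS))
    return max(sweep, key=rank)


print(pick_base_voice(SWEEP))  # Enceladus
```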

Tournament rounds

Each round targets the three lowest-scoring dimensions with three candidates, one per dimension; each candidate replaces a single prompt slot with the critic's slot-specific directive. The winner carries forward and the losers are dropped. The regression guard reverts any slot change that drops a strength dimension by more than 1 point from the Enceladus baseline.
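
Choosing which slots to mutate can be sketched as a sort over the current scorecard. The source does not state how ties are broken; the version below prefers the later scorecard column on a tie, an arbitrary choice that happens to reproduce the slots listed for R1 through R3 in the table below. The function name is illustrative.

```python
DIMENSIONS = ("chest", "nasal", "pacing", "pitch", "fry")

# Scorecard dimension -> prompt slot, using the slot names from the tournament table.
SLOT_FOR = {
    "chest": "CHEST_RESONANCE",
    "nasal": "NASAL_QUALITY",
    "pacing": "PACING",
    "pitch": "PITCH",
    "fry": "FRY",
}


def target_slots(scores: dict, k: int = 3) -> list:
    """Return the slots for the k lowest-scoring dimensions (assumed tie-break: later column first)."""
    ranked = sorted(DIMENSIONS, key=lambda d: (scores[d], -DIMENSIONS.index(d)))
    return [SLOT_FOR[d] for d in ranked[:k]]


seed = {"chest": 6, "nasal": 4, "pacing": 5, "pitch": 6, "fry": 5}   # R0 row
print(sorted(target_slots(seed)))  # ['FRY', 'NASAL_QUALITY', 'PACING']
```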

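| Round | Candidate | Mutated slot | chest | nasal | pacing | pitch | fry | overall | Disposition |
|---|---|---|---|---|---|---|---|---|---|
| R0 | seed | | 6 | 4 | 5 | 6 | 5 | 5 | Enceladus baseline |
| R1 | C1 | NASAL_QUALITY | 4 | 5 | 7 | 4 | 5 | 5 | accept |
| R1 | C2 | PACING | 5 | 4 | 4 | 4 | 5 | 4 | reject |
| R1 | C3 | FRY | 4 | 4 | 6 | 4 | 3 | 4 | reject |
| R2 | C1 | CHEST_RESONANCE | 7 | 5 | 4 | 6 | 5 | 5 | reject (pacing collapsed) |
| R2 | C2 | PITCH | 5 | 6 | 8 | 6 | 7 | 6 | accept ← final |
| R2 | C3 | FRY | 4 | 5 | 6 | 4 | 6 | 4 | reject |
| R3 | C1 | CHEST_RESONANCE | 5 | 6 | 8 | 4 | 7 | 6 | tied but pitch -2; revert-guard fired |
| R3 | C2 | NASAL_QUALITY | 2 | 4 | 5 | 3 | 4 | 3 | reject |
| R3 | C3 | PITCH | 4 | 5 | 4 | 4 | 5 | 4 | reject |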

Early stop on round 3: no candidate improved on R2/C2's overall=6, and the best (R3/C1) dropped pitch 2 below the Enceladus baseline. Regression guard reverted; two-rounds-flat early stop triggered.
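
The guard check on R3/C1 can be sketched as below. How the run defined a "strength dimension" is not stated; this version assumes it means the baseline's highest-scoring dimensions (chest and pitch, both 6). The function names are illustrative, not taken from iterate-v2.py.

```python
DIMENSIONS = ("chest", "nasal", "pacing", "pitch", "fry")
BASELINE = {"chest": 6, "nasal": 4, "pacing": 5, "pitch": 6, "fry": 5}  # R0 row


def strength_dimensions(baseline: dict) -> list:
    """Assumed definition: the dimensions tied for the baseline's top score."""
    top = max(baseline[d] for d in DIMENSIONS)
    return [d for d in DIMENSIONS if baseline[d] == top]


def regressed(candidate: dict, baseline: dict = BASELINE, max_drop: int = 1) -> list:
    """Strength dimensions that fell more than max_drop below the baseline."""
    return [d for d in strength_dimensions(baseline)
            if baseline[d] - candidate[d] > max_drop]


R2_C2 = {"chest": 5, "nasal": 6, "pacing": 8, "pitch": 6, "fry": 7}
R3_C1 = {"chest": 5, "nasal": 6, "pacing": 8, "pitch": 4, "fry": 7}
print(regressed(R2_C2))  # []
print(regressed(R3_C1))  # ['pitch']
```

Under this assumed definition R1/C1 (chest 4, pitch 4) would also trip the guard, although the table lists it as accepted; the source does not say how the run treated that case.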

v1 deep-dive: the naive-append run that plateaued

v1 locked onto Charon from step 1 and appended each round of critic mods to a growing prompt without reconciling contradictions. It reached 4/10 and stalled.

Iter 1: baseline preset (overall 3/10)

"Whiny, cartoonish mob henchman rather than the physically imposing, deeply resonant Tony Soprano."

Missing: chest resonance · heavy nasal breathing · downward pitch authority.

Iter 2: 3 mods appended (overall 3/10)

"Too light, nasal, rushed — generic mob caricature."

Missing: deep chest weight · sluggish labored pacing · rattling vocal fry.

Iter 3: 6 mods appended, prompt now 250 words (overall 4/10)

"Exaggerated breathing and pleading upward inflection turn Tony into a cartoonish caricature rather than a quiet, menacing boss."

Missing: subtle breathing (iter 1's "insert heavy breathing" mod produced theatrical sighs; critic now wants them dialed back) · flat dismissive authority · seamless fry integration.

Takeaway from v1: iter 3's verdict explicitly called out over-correction from iter 1's mod, which shows that monotonic append couldn't reconcile contradictions between rounds. This motivated v2's single-slot replacement and tournament selection.
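
The difference between the two update rules is easy to see in miniature. In this sketch the slot names come from the tournament table (the sixth slot is left out because the source never names it), the "late" directive is a hypothetical stand-in for the critic's dial-it-back request, and the placeholder slot texts are illustrative.

```python
from typing import Dict, List

SLOT_ORDER = ["CHEST_RESONANCE", "NASAL_QUALITY", "PACING", "PITCH", "FRY"]


def append_mods(prompt: str, mods: List[str]) -> str:
    """v1 style: every directive is appended, so old and new instructions coexist."""
    return " ".join([prompt, *mods])


def replace_slot(slots: Dict[str, str], name: str, text: str) -> Dict[str, str]:
    """v2 style: a directive overwrites exactly one slot, leaving the others untouched."""
    if name not in slots:
        raise KeyError(f"unknown slot: {name}")
    return {**slots, name: text}


def render(slots: Dict[str, str]) -> str:
    """Join the slots in a fixed order to form the final voice prompt."""
    return " ".join(slots[name] for name in SLOT_ORDER)


early = "Insert heavy breathing."   # paraphrase of the iter 1 mod
late = "Keep breathing subtle."     # hypothetical later directive

appended = append_mods("Base preset.", [early, late])   # both instructions survive

slots = {name: f"<{name} text>" for name in SLOT_ORDER}  # placeholder slot contents
slots = replace_slot(slots, "NASAL_QUALITY", early)
slots = replace_slot(slots, "NASAL_QUALITY", late)       # the early directive is gone
```

With append, the generator sees two conflicting breathing instructions; with slot replacement it only ever sees the latest one for that dimension.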


Generated 2026-04-16 · v1 source: /tmp/larry-soprano-iter/iterate.py · v2 source: /tmp/larry-soprano-iter-v2/iterate-v2.py + resume-v2.py