Soprano iteration — can Gemini critique its own voice?

The user problem

I want a voice that sounds like Tony Soprano. Gemini 3.1 Flash TTS accepts natural-language voice prompts — so in theory I can just describe the voice and the model does the rest. In practice, how do you actually tune one of these prompts? What happens when the first attempt sounds wrong? How do you iterate toward the thing in your head without shipping each draft to a human listener?

This is a hill-climbing problem. You have a scoring function (does it sound like Tony?), a search space (every possible voice prompt), and you need to climb toward better scores. The interesting question isn't whether hill climbing works — it's whether a model can be the judge you climb against. If yes, you cut the human out of the inner loop and tune faster than any human could listen.

Experiment (2026-04-16): close the loop entirely inside Google's Gemini family — one model generates, another model listens and critiques, the critique moves you uphill, repeat. Generator: gemini-3.1-flash-tts-preview. Critic: gemini-3.1-pro-preview — which, unlike Claude Opus 4.6, accepts audio input natively. Target: James Gandolfini as Tony Soprano.
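The generate, critique, revise loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in iterate.py: `hill_climb`, `propose`, `generate`, and `critique` are hypothetical names, and both model calls are injected as plain callables so that no real Gemini API signature is assumed. The toy stand-ins at the bottom exist only to make the sketch runnable.

```python
from typing import Callable, Tuple


def hill_climb(
    seed_prompt: str,
    propose: Callable[[str, str], str],            # (prompt, critique text) -> revised prompt
    generate: Callable[[str], bytes],              # prompt -> audio bytes (the TTS call)
    critique: Callable[[bytes], Tuple[int, str]],  # audio -> (overall score, verdict text)
    max_iters: int = 3,
) -> Tuple[str, int]:
    """Keep a revised prompt only when the critic scores it strictly higher."""
    best_prompt = seed_prompt
    best_score, verdict = critique(generate(seed_prompt))
    for _ in range(max_iters):
        candidate = propose(best_prompt, verdict)
        score, candidate_verdict = critique(generate(candidate))
        if score > best_score:
            best_prompt, best_score, verdict = candidate, score, candidate_verdict
    return best_prompt, best_score


# Toy stand-ins (assumptions for illustration only, no model is called).
def fake_generate(prompt: str) -> bytes:
    return prompt.encode("utf-8")


def fake_critique(audio: bytes) -> Tuple[int, str]:
    text = audio.decode("utf-8")
    return min(10, 3 + text.count("chest")), "needs more chest resonance"


def fake_propose(prompt: str, verdict: str) -> str:
    return prompt + " Speak from the chest."


best_prompt, best_score = hill_climb(
    "Sound like Tony.", fake_propose, fake_generate, fake_critique
)
```

Injecting the callables keeps the search logic separate from whichever client library makes the actual TTS and critic requests.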

🏆 Winner — v2 final

6/10 overall: Enceladus base voice + tournament-tuned 6-slot prompt

"Pacing and heavy breathing are spot-on, but the timbre lacks the deep chest resonance and forward nasal placement needed to fully sell the illusion." — Gemini 3.1 Pro, final critic verdict

Winning prompt (structured, 6 slots):

Speak from deep in the chest with a heavy bass-baritone resonance, as if the voice is anchored in the sternum rather than the head. Introduce heavy sinus congestion and audible, labored nasal breathing between phrases to simulate years of heavy smoking and physical weight. Short sentences with a shrugged, unhurried cadence. Pauses are deliberate beats, not hesitations. Slightly lethargic, never rushed. Anchor the pitch in a low, authoritative baritone register, eliminating any upward lilt to maintain a flat, quietly menacing presence. Blend subtle, sustained vocal fry into the mid-vowels, not isolated rattles at sentence endings. Channel James Gandolfini as Tony Soprano in an intimate indoor scene — North-Jersey Italian-American, affectionate default, quiet menace underneath. Drop the final '-g' in gerunds (thinkin', doin'). Never shout.
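One way to picture a 6-slot prompt is as a small mapping that is rendered in a fixed order, with a mutation replacing exactly one slot. The sketch below is an assumption about the structure, not the author's code: five slot names come from the tournament table, `PERSONA` is a hypothetical name for the sixth slot, the slot order and sentence-to-slot assignment are inferred, and the slot texts are shortened from the winning prompt.

```python
from typing import Dict

# Rendering order is an assumption based on the order of sentences in the prompt.
SLOT_ORDER = ["CHEST_RESONANCE", "NASAL_QUALITY", "PACING", "PITCH", "FRY", "PERSONA"]

# Shortened slot texts; "PERSONA" is a hypothetical label for the sixth slot.
WINNING_SLOTS: Dict[str, str] = {
    "CHEST_RESONANCE": "Speak from deep in the chest with a heavy bass-baritone resonance.",
    "NASAL_QUALITY": "Introduce heavy sinus congestion and audible, labored nasal breathing between phrases.",
    "PACING": "Short sentences with a shrugged, unhurried cadence.",
    "PITCH": "Anchor the pitch in a low, authoritative baritone register.",
    "FRY": "Blend subtle, sustained vocal fry into the mid-vowels.",
    "PERSONA": "Channel James Gandolfini as Tony Soprano in an intimate indoor scene. Never shout.",
}


def render_prompt(slots: Dict[str, str]) -> str:
    """Join the slots into a single voice prompt in a fixed order."""
    return " ".join(slots[name] for name in SLOT_ORDER)


def mutate_slot(slots: Dict[str, str], name: str, text: str) -> Dict[str, str]:
    """Return a copy with one slot replaced; the other five stay untouched."""
    if name not in slots:
        raise KeyError(f"unknown slot: {name}")
    mutated = dict(slots)
    mutated[name] = text
    return mutated


candidate = mutate_slot(WINNING_SLOTS, "FRY", "Add a faint rasp on long vowels.")
candidate_prompt = render_prompt(candidate)
```

Replacing a slot rather than appending text keeps the prompt length stable from round to round, in contrast to the monotonic appends of v1.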

Combined score trajectory — v1 → v2

Phase	Voice	Overall	What changed	Critic verdict (abbreviated)	Listen
v1 iter 1	Charon	3	Baseline Soprano preset (lock-in)	"Whiny, cartoonish mob henchman."
v1 iter 2	Charon	3	+3 mods appended	"Too light, nasal, rushed."
v1 iter 3	Charon	4	+3 more mods (6 total, monotonic append)	"Cartoonish caricature, pleading inflection."
v2 redesign — base-voice sweep + dimensional scorecard + tournament + regression guard
v2 sweep	Enceladus	5	Untouched seed prompt, new base voice	"Chest resonance + pitch authority out of the box."
v2 R1	Enceladus	5	NASAL_QUALITY slot replaced	"Pacing lifted, chest still thin."	not kept
v2 R2 (final)	Enceladus	6	PITCH slot replaced — "flat authoritative baritone, no upward lilt"	"Pacing + breathing spot-on; timbre still missing chest."
v2 R3	Enceladus	6	CHEST_RESONANCE mutation → tied but pitch regressed; regression guard fired, reverted	Two flat rounds → early stop.	reverted by guard

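Net gain: +2 overall (v1 final 4 → v2 final 6). The base-voice swap did most of the work before any prompt tuning: in the sweep, Enceladus scored 5 where Charon scored 3, and the tournament then added one more point.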

Key findings (what hill climbing with a model-judge actually looks like)

Cost

Total end-to-end ≈ $0.20 across both loops. v1: ~$0.10 (3 TTS + 3 critic). v2: ~$0.20 (17 TTS + 17 critic + 1 final render, budget was $0.50). Critic calls are ~$0.001 each; generator dominates spend by ~10×. Wall-clock time is ~15 min per loop, with calls paced 7 s apart to stay under the rate limit of 10 requests per minute (rpm).
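The pacing can be sketched as a helper that enforces a minimum gap between calls. This is an illustration under assumptions: `run_paced` is a hypothetical name, the gap is measured between call starts (the source does not say how the 7 s is measured), and the clock and sleep functions are injectable so the logic can be checked without waiting.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def run_paced(
    calls: Iterable[Callable[[], T]],
    min_interval_s: float = 7.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Run calls in order, leaving at least min_interval_s between call starts."""
    results: List[T] = []
    last_start = None
    for call in calls:
        if last_start is not None:
            wait = min_interval_s - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(call())
    return results


# A 7 s gap allows at most 60 / 7 calls per minute, which stays under 10 rpm.
max_calls_per_minute = 60 / 7.0
```

In a real loop, each entry in `calls` would wrap one TTS or critic request.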

Base-voice sweep

Voice	chest	nasal	pacing	pitch	fry	overall
Charon (v1 lock-in)	4	3	5	4	2	3
Fenrir	2	4	5	2	3	3
Orus	3	4	5	3	2	3
Puck	4	4	6	4	3	4
Enceladus ← winner	6	4	5	6	5	5
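Picking the base voice from this scorecard amounts to taking the row with the highest overall score. The sketch below uses the figures from the table; the function name `pick_base_voice` and the tie-break (sum of the five dimension scores) are assumptions, since the sweep produced no tie at the top.

```python
from typing import Dict

DIMS = ("chest", "nasal", "pacing", "pitch", "fry")


def row(chest: int, nasal: int, pacing: int, pitch: int, fry: int, overall: int) -> Dict[str, int]:
    """Build one scorecard row from the table's column order."""
    return dict(zip(DIMS + ("overall",), (chest, nasal, pacing, pitch, fry, overall)))


# Scores copied from the sweep table above.
SWEEP = {
    "Charon": row(4, 3, 5, 4, 2, 3),
    "Fenrir": row(2, 4, 5, 2, 3, 3),
    "Orus": row(3, 4, 5, 3, 2, 3),
    "Puck": row(4, 4, 6, 4, 3, 4),
    "Enceladus": row(6, 4, 5, 6, 5, 5),
}


def pick_base_voice(sweep: Dict[str, Dict[str, int]]) -> str:
    """Highest overall wins; ties fall back to the dimension total (assumed rule)."""
    return max(
        sweep,
        key=lambda voice: (sweep[voice]["overall"], sum(sweep[voice][d] for d in DIMS)),
    )


best_voice = pick_base_voice(SWEEP)
```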

Tournament rounds

Round	Candidate	Mutated slot	chest	nasal	pacing	pitch	fry	overall	Disposition
R0	seed	—	6	4	5	6	5	5	Enceladus baseline
R1	C1	NASAL_QUALITY	4	5	7	4	5	5	accept
R1	C2	PACING	5	4	4	4	5	4	reject
R1	C3	FRY	4	4	6	4	3	4	reject
R2	C1	CHEST_RESONANCE	7	5	4	6	5	5	reject (pacing collapsed)
R2	C2	PITCH	5	6	8	6	7	6	accept ← final
R2	C3	FRY	4	5	6	4	6	4	reject
R3	C1	CHEST_RESONANCE	5	6	8	4	7	6	tied but pitch -2; revert-guard fired
R3	C2	NASAL_QUALITY	2	4	5	3	4	3	reject
R3	C3	PITCH	4	5	4	4	5	4	reject
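The round-by-round decisions in this table can be replayed with a short sketch. The source does not spell out the regression guard's exact rule, so the one below is an assumption chosen to match the recorded dispositions of each round's top candidate: a higher overall score is accepted, a lower one is rejected, and a tie is reverted only when some dimension drops by 2 or more while no dimension improves. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

DIMS = ("chest", "nasal", "pacing", "pitch", "fry")
Scores = Dict[str, int]


def row(chest: int, nasal: int, pacing: int, pitch: int, fry: int, overall: int) -> Scores:
    return dict(zip(DIMS + ("overall",), (chest, nasal, pacing, pitch, fry, overall)))


def best_candidate(candidates: Dict[str, Scores]) -> str:
    """Tournament step: the candidate with the highest overall score."""
    return max(candidates, key=lambda name: candidates[name]["overall"])


def decide(incumbent: Scores, challenger: Scores, max_drop: int = 2) -> str:
    """Assumed guard rule; the threshold of 2 is taken from the 'pitch -2' note."""
    if challenger["overall"] < incumbent["overall"]:
        return "reject"
    if challenger["overall"] == incumbent["overall"]:
        drops = [d for d in DIMS if incumbent[d] - challenger[d] >= max_drop]
        gains = [d for d in DIMS if challenger[d] > incumbent[d]]
        if drops and not gains:
            return "revert"
    return "accept"


# Scores copied from the tournament table above.
SEED = row(6, 4, 5, 6, 5, 5)
ROUNDS = [
    {"C1": row(4, 5, 7, 4, 5, 5), "C2": row(5, 4, 4, 4, 5, 4), "C3": row(4, 4, 6, 4, 3, 4)},
    {"C1": row(7, 5, 4, 6, 5, 5), "C2": row(5, 6, 8, 6, 7, 6), "C3": row(4, 5, 6, 4, 6, 4)},
    {"C1": row(5, 6, 8, 4, 7, 6), "C2": row(2, 4, 5, 3, 4, 3), "C3": row(4, 5, 4, 4, 5, 4)},
]

incumbent = SEED
log: List[Tuple[str, str]] = []
for candidates in ROUNDS:
    name = best_candidate(candidates)
    verdict = decide(incumbent, candidates[name])
    log.append((name, verdict))
    if verdict == "accept":
        incumbent = candidates[name]
```

Under this assumed rule the replay ends with the R2 C2 scorecard as the incumbent, which matches the final row marked in the table.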

v1 iterations (overall score)

Iter 1: baseline preset, overall 3
Iter 2: 3 mods appended, overall 3
Iter 3: 6 mods appended (prompt now 250 words), overall 4

Generated 2026-04-16 · v1 source: /tmp/larry-soprano-iter/iterate.py · v2 source: /tmp/larry-soprano-iter-v2/iterate-v2.py + resume-v2.py