Speech Score · Ableton for the voice

Part tracker, part Ableton, but the instrument is the human voice.

A music producer never types a song into a paragraph. They lay it on a timeline — tracks, takes, things they can move, mute, and version. So why is a scene of dialogue still a flat block of script text? Speech Score treats a performance the way a DAW treats music — and it runs today.

A working instrument

Where it is: real, running, and playable right here. The whole instrument, including all four scores and ~1.4 MB of real neural voices, rides inside one page: no install, no signup, and it works offline. A live mode lets a human actor perform one lane while the AI holds the others on the beat. What it can't do yet: perform in your voice. That reference-read layer is next.

Dialogue as a score, not a paragraph

The instrument is tracker-brained and Ableton-bodied. Voices are lanes, lines are clips, the page is a timeline. A playhead descends and each line sounds as it's struck — in real neural voices, polyphonic and latency-tight, so three witches can land on the same row and phase into one another. In the editor you drag a clip to retime it, drag it across lanes to recast it, set how many beats it spans, and warp it to lock a word to the grid — the same knob that snaps it to the beat also stretches the voice toward madness. It's explicitly not text-to-speech: the point isn't to make a machine talk, it's to make a performance you can compose, rehearse, and stage.
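The retime, recast, and snap moves described above can be sketched as pure functions over a clip. This is a minimal sketch assuming, hypothetically, that each clip stores its lane, start, and length in beats; the Clip shape, snapToGrid, and dragClip are illustrative names, not Speech Score's actual code, and warp is left out because it also reshapes the audio.

```typescript
// Hypothetical clip shape (assumption): position and length measured in beats.
interface Clip {
  lane: string;        // which voice plays the line
  startBeat: number;   // where the line sits on the timeline, in beats
  lengthBeats: number; // how many beats the line spans
}

// Snap a beat position to the nearest multiple of 1/subdivisions of a beat.
// subdivisions = 1 snaps to whole beats, 3 to triplets, 4 to quarter-beats.
function snapToGrid(beat: number, subdivisions: number): number {
  if (!Number.isInteger(subdivisions) || subdivisions < 1) {
    throw new RangeError("subdivisions must be a positive integer");
  }
  return Math.round(beat * subdivisions) / subdivisions;
}

// Drag a clip: shift it by deltaBeats (retime), optionally move it to another
// lane (recast), then snap its start to the grid and clamp it at zero.
// Returns a new clip; the original is left untouched.
function dragClip(clip: Clip, deltaBeats: number, subdivisions: number, lane: string = clip.lane): Clip {
  const startBeat = Math.max(0, snapToGrid(clip.startBeat + deltaBeats, subdivisions));
  return { ...clip, lane, startBeat };
}
```

With subdivisions set to 3 the same drag lands on triplets instead of half-beats, which is one plausible way a sub-beat snap could support polyrhythm.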

The arrangement: Macbeth's three witches · rows = time · columns = voices

t	Witch I	Witch II	Witch III	Chorus
0.0	"When shall we three—"
0.5	↳ meet again?	"In thunder—"
1.0		lightning,	"or in rain?"
1.5	✻	✻	✻	"Fair is foul"
2.0				↳ "and foul is fair"

A real score in the library. On the shared row (t=1.5) the three lanes strike at once and the chorus refrain overlaps — the phasing you can hear, not just read.
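One way to read the table above is as a list of cues keyed by row. The sketch below assumes t is measured in beats and uses hypothetical names (Cue, rows, schedule); it groups cues that share a row and hands each one the same start time, which is what makes the t=1.5 strike land simultaneously. The startVoice callback is injected so the logic runs without a browser; in Web Audio it might call start(when) on an AudioBufferSourceNode holding that line's pre-rendered voice.

```typescript
// Hypothetical cue shape (assumption): one spoken line on one lane at one row.
interface Cue {
  lane: string; // voice lane, e.g. "Witch I"
  beat: number; // row on the timeline, in beats (ideally already snapped to a grid)
  text: string; // the words to speak
}

// Convert a beat position to seconds at a given tempo.
function beatToSeconds(beat: number, bpm: number): number {
  return (beat * 60) / bpm;
}

// Group cues by row, in ascending beat order, so every cue on a row can start together.
function rows(cues: Cue[]): Map<number, Cue[]> {
  const byBeat = new Map<number, Cue[]>();
  for (const cue of cues.slice().sort((a, b) => a.beat - b.beat)) {
    const row = byBeat.get(cue.beat) ?? [];
    row.push(cue);
    byBeat.set(cue.beat, row);
  }
  return byBeat;
}

// Schedule every cue relative to t0 (for example AudioContext.currentTime).
// Cues on the same row receive the same `when`, so their onsets coincide.
function schedule(
  cues: Cue[],
  bpm: number,
  t0: number,
  startVoice: (cue: Cue, when: number) => void,
): void {
  rows(cues).forEach((row, beat) => {
    const when = t0 + beatToSeconds(beat, bpm);
    row.forEach((cue) => startVoice(cue, when));
  });
}
```

Keying rows by exact beat values works because snapped positions are exact fractions; unsnapped floating-point times would need rounding before grouping.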

What's built — and what's next

Shipped & working

Real neural voices — polyphonic, zero-latency Web Audio playback, per-voice humanization.
The drag-clip editor: retime, recast across lanes, resize in beats, sub-beat snap for polyrhythm, warp — mouse, touchpad, and finger alike.
Live human+AI cue mode: Space or a footswitch pedal advances the cue, with a count-in, passage looping, and per-voice mute or solo. The human sets the pace; the AI answers on its rows.
Per-clip trim, gain, and fades over a live waveform. A one-file offline share you can double-click.
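The cue-mode controls in the list above can be modeled as a small piece of state. This is a sketch under stated assumptions: the CueState shape and function names are hypothetical, and the mute/solo rule used here (mute always wins, and any active solo silences the unsoloed lanes) is a common DAW convention, not something the page specifies for Speech Score.

```typescript
// Hypothetical live-mode state (assumption, not Speech Score's data model).
interface CueState {
  humanLane: string;   // the lane a live actor performs; the AI never voices it
  muted: Set<string>;  // lanes silenced by mute
  soloed: Set<string>; // soloed lanes; if any exist, only these can sound
}

// Assumed convention: a muted lane is silent; otherwise, if anything is
// soloed, only soloed lanes are audible.
function isAudible(lane: string, s: CueState): boolean {
  if (s.muted.has(lane)) return false;
  return s.soloed.size === 0 || s.soloed.has(lane);
}

// Lanes the AI should voice on a row: the audible ones, minus the human's lane.
function aiLanes(rowLanes: string[], s: CueState): string[] {
  return rowLanes.filter((lane) => lane !== s.humanLane && isAudible(lane, s));
}

// Each press of Space (or the footswitch) moves one row forward;
// at loopEnd it wraps back to loopStart, which loops the passage.
function advance(row: number, loopStart: number, loopEnd: number): number {
  return row >= loopEnd ? loopStart : row + 1;
}
```

Keeping the human's lane out of aiLanes is what lets the actor set the pace while the AI only answers on the remaining rows.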

Next on the bench

Voice cloning: a ~10 s reference read makes a lane speak as a specific real person.
Pitch-preserving warp and true audio recomposition — "actually edit the voice."

Four scores, one engine

Philip Glass Buys a Loaf of Bread · 4 lanes · the first score

Richard & Anne · Shakespeare stichomythia · 2 lanes

The Importance of Being Earnest · Wilde · a human + AI duet

Macbeth's Three Witches · 3 lanes + a chorus refrain

Every score is portable JSON running on one engine, so a new play is added in a single place. The point of four is the point of the instrument: it isn't one effect but a surface that plays any scene you bring it.
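A score format of this kind might look like the following. The schema (title, bpm, lanes, and clips with beat and beats fields) and the library map are hypothetical, since the page does not publish the real format; the sketch only illustrates how validating each score once at load time lets a single engine trust any play added to it.

```typescript
// Hypothetical score schema (assumption); Speech Score's real format may differ.
interface ScoreClip {
  lane: string;  // must name one of the score's declared lanes
  beat: number;  // start row, in beats
  beats: number; // how many beats the line spans
  text: string;  // the words to perform
}

interface Score {
  title: string;
  bpm: number;
  lanes: string[];
  clips: ScoreClip[];
}

// Parse a score from JSON and reject anything the engine could not play.
function parseScore(json: string): Score {
  const raw: unknown = JSON.parse(json);
  if (typeof raw !== "object" || raw === null) throw new Error("score must be an object");
  const s = raw as Score;
  if (typeof s.title !== "string" || !(s.bpm > 0) || !Array.isArray(s.lanes) || !Array.isArray(s.clips)) {
    throw new Error("malformed score");
  }
  const declared = new Set(s.lanes);
  for (const c of s.clips) {
    if (!declared.has(c.lane)) throw new Error(`clip uses undeclared lane "${c.lane}"`);
    if (!(c.beat >= 0) || !(c.beats > 0)) throw new Error(`bad timing on "${c.text}"`);
  }
  return s;
}

// Hypothetical library: adding a play means adding one validated entry here.
const library = new Map<string, Score>();

function addScore(json: string): Score {
  const score = parseScore(json);
  library.set(score.title, score);
  return score;
}
```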

"If actors ever needed confirmation of their worth, it's this application." … then, watching the voices phase together: "the discordant music reveals itself!" — "that's pretty good."

— Chris, an actor, going from the jab to recognition on the first prototype. The flagship live human-plus-AI mode came out of that thread.

The same instinct as the Script Doctor, pointed at the ear instead of the page: take a judgment a director makes by feel — that pause is too long, those two lines should collide — and give it a track you can actually move.

Hear a scored take — right now.

Open it, hit play, and the scene performs in real neural voices. It opens on Philip Glass Buys a Loaf of Bread; switch to Macbeth's witches or the Wilde duet in the library. Runs in your browser — no install, no signup, works offline.

▶ Open the instrument

Act or direct? Put your name on a lane →