Acoustic Intelligence

AI that learns to speak, not from data — from trying.

A neural controller learns to produce human vowels by controlling a real-time vocal instrument. No training data. No phoneme models. No text. Just an AI, a voice box, and feedback.

Research in progress. First vowels achieved March 2026.

The Idea

Human infants don't learn to speak from datasets. They babble, hear the result, and adjust. They discover how their vocal cords work through trial and feedback. No one programs them.

We asked: can an AI do the same?

Not text-to-speech

No text input. No phoneme pipeline. The AI controls a physical vocal instrument directly.

Not voice cloning

No recorded speech dataset. The AI discovers how to produce sound through its own exploration.

Teacher-student only

A teacher scores the output (closer/further). It never tells the AI what to do. Scoring is the only control lever.

Emergence over control

All learning must be emergent. No programmatic sound shaping. If the AI can't discover it, it doesn't happen.

The Journey

From random noise to recognisable vowels — a timeline of discovery.

Phase 1 — Evolutionary Discovery

March 2026 — Week 1

From noise to speech-like sound

An evolutionary search (MAP-Elites) explored the parameter space of a DSP vocal instrument. Given only a speech detector as feedback, 12 independent seeds all converged on speech-like acoustic structure, reaching a speech score of 0.97.
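The search loop itself is small. Below is a minimal MAP-Elites sketch under toy assumptions: render, speech_score, and descriptor are trivial stand-ins for the real DSP instrument, speech detector, and behaviour descriptors, not the project's code.

```python
# MAP-Elites in miniature: keep the best individual per behaviour cell,
# and breed new candidates by mutating randomly chosen elites.
import random

def render(params):        # stand-in for the DSP vocal instrument
    return params          # real version: synth params -> audio buffer

def speech_score(audio):   # stand-in for the speech detector (fitness)
    return 1.0 - sum(abs(x - 0.5) for x in audio) / len(audio)

def descriptor(audio):     # behaviour descriptor -> archive cell
    mean = sum(audio) / len(audio)
    return (int(mean * 10), int(max(audio) * 10))

archive = {}               # cell -> (score, params)
for _ in range(2500):
    if archive and random.random() < 0.9:
        _, parent = random.choice(list(archive.values()))
        child = [min(1.0, max(0.0, p + random.gauss(0, 0.05))) for p in parent]
    else:
        child = [random.random() for _ in range(8)]
    audio = render(child)
    score, cell = speech_score(audio), descriptor(audio)
    if cell not in archive or score > archive[cell][0]:
        archive[cell] = (score, child)   # new elite for that cell
```

The essential property is the archive: elites are kept per behaviour cell, so many different speech-like solutions survive side by side instead of collapsing to one.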

Evolution Progression

Random noise evolving toward speech-like structure over 2,500 generations

Discovered Speech Onset (Segment 106)

A 50ms acoustic gesture discovered independently by every seed — a universal speech initiator

Best Composed Utterance (0.97 speech score)

Discovered primitives composed into the highest-scoring output

The limitation

Evolutionary search found speech — but it had no understanding, no memory, no ability to generalise. It stumbled onto speech without knowing what it was doing. Time for a different approach.

Phase 2 — Neural Motor Control

March 2026 — Week 2 (current)

An AI brain learns to control vocal cords

A GRU neural controller learns to produce sound by sending motor commands to a real-time Rust vocal instrument. It hears the result through an inner ear and adjusts. Like an infant learning to speak.

8

Motor parameters per 10ms frame

3

Distinct vowels learned

1

Controller brain for all vowels

Hear It Speak

One neural controller producing three distinct vowels — and transitioning between them.

Three Vowels — One Brain

The same controller receives a different target signal and produces a different vowel. All learned through feedback, not programmed.

"ah"

"ee"

"oo"

Vowel Transitions

The controller smoothly transitions between vowel states — learned, not interpolated.

"ah" to "ee"

"ah" to "oo"

Spectral Analysis

Spectrogram comparison — Richard's voice (top) vs AI vowels. Each vowel has a distinct formant signature.

Spectrogram: Richard vs AI vowels
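For readers who want to check formant signatures themselves, one standard approach is LPC root-finding over a short frame. This is a sketch of that method, an assumption about technique rather than how the figure above was produced.

```python
import numpy as np

def formants(frame: np.ndarray, sr: int = 22_000, order: int = 10):
    """Estimate the first few formants of one frame via LPC root-finding."""
    frame = frame * np.hamming(len(frame))
    # Autocorrelation method: solve the LPC normal equations R a = r.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][: order + 1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.lstsq(R, r[1 : order + 1], rcond=None)[0]
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]          # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    return freqs[freqs > 90][:3]               # first three formant candidates

# Example: a synthetic frame with energy near 700 Hz should report it back.
t = np.arange(220) / 22_000
print(formants(np.sin(2 * np.pi * 700 * t)))
```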

How It Works

1

Babble

The AI sends random motor commands to a vocal instrument and hears what comes out

2

Listen

An inner ear extracts compact acoustic features from each 10ms frame of output

3

Learn

The controller learns which motor commands produce which sounds through closed-loop feedback

4

Imitate

Given a target sound, the controller figures out how to make the instrument match it
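In code, the first two steps amount to a babbling loop that pairs motor commands with what the ear heard. Everything below is a toy stand-in, assuming nothing about the real system beyond the 8-motor, 10ms-frame interface described on this page.

```python
# Steps 1-2 as a babbling loop: random motor frames go in, the "ear" logs
# what came out. `instrument` and `inner_ear` are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def instrument(motors, n=220, sr=22_000):
    """Stand-in synth: motors[0] sets pitch, motors[1] sets loudness."""
    t = np.arange(n) / sr
    return motors[1] * np.sin(2 * np.pi * (80 + 300 * motors[0]) * t)

def inner_ear(frame):
    """Stand-in ear: RMS energy plus a coarse 15-band spectrum = 16 floats."""
    spec = np.abs(np.fft.rfft(frame))
    bands = [b.mean() for b in np.array_split(spec, 15)]
    return np.concatenate([[np.sqrt(np.mean(frame ** 2))], bands])

babble_log = []                            # (motor command, what it sounded like)
for _ in range(1000):
    motors = rng.random(8)                 # 1. Babble: random 8-float frame
    heard = inner_ear(instrument(motors))  # 2. Listen: 16-float features
    babble_log.append((motors, heard))     # steps 3-4 learn from pairs like these
```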

Architecture

Vocal Instrument (Rust)

Real-time source-filter synthesis. Glottal pulse generator + cascade all-pole formant filter. 8 motor inputs, 220 audio samples out, every 10ms.
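A toy Python rendering of the same source-filter idea. The real instrument is Rust and real-time; the 22kHz rate (implied by 220 samples per 10ms), the formant values, and the filter details here are assumptions.

```python
import numpy as np

SR = 22_000   # 220 samples per 10ms frame implies a 22 kHz sample rate

def glottal_frame(f0, phase=0.0, n=220):
    """Crude glottal source: a sawtooth-like pulse train at f0 Hz."""
    t = phase + np.arange(n) / SR
    return 2.0 * ((t * f0) % 1.0) - 1.0

def formant_filter(x, freqs, bw=80.0):
    """Cascade of two-pole (all-pole) resonators, one per formant."""
    y = np.asarray(x, dtype=float)
    for f in freqs:
        r = np.exp(-np.pi * bw / SR)       # pole radius from bandwidth
        a1, a2 = -2.0 * r * np.cos(2 * np.pi * f / SR), r * r
        out, y1, y2 = np.zeros_like(y), 0.0, 0.0
        for i, s in enumerate(y):          # y[n] = x[n] - a1*y[n-1] - a2*y[n-2]
            out[i] = s - a1 * y1 - a2 * y2
            y1, y2 = out[i], y1
        y = out
    return y

# One 10ms frame of an "ah"-like vowel (formants roughly 700/1100/2600 Hz).
frame = formant_filter(glottal_frame(f0=120.0), freqs=[700.0, 1100.0, 2600.0])
```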

Motor Controller (PyTorch)

2-layer GRU neural network. Takes target + acoustic feedback, outputs 8 motor commands per frame. One brain controls all vowels.
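A minimal PyTorch sketch of such a controller, with assumed dimensions (16-float target and 16-float feedback, matching the inner-ear description below):

```python
import torch
import torch.nn as nn

class MotorController(nn.Module):
    """2-layer GRU: (target, acoustic feedback) -> 8 motor commands per frame.
    The 16/16/128 dimensions are assumptions consistent with the text."""
    def __init__(self, target_dim=16, ear_dim=16, hidden=128, motors=8):
        super().__init__()
        self.gru = nn.GRU(target_dim + ear_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, motors)

    def forward(self, target, feedback, h=None):
        x = torch.cat([target, feedback], dim=-1)    # (batch, frames, 32)
        y, h = self.gru(x, h)
        return torch.sigmoid(self.head(y)), h        # motors squashed to [0, 1]

ctrl = MotorController()
motors, h = ctrl(torch.zeros(1, 1, 16), torch.zeros(1, 1, 16))  # one 10ms frame
```

Because the same network sees a different target signal for each vowel, one set of weights can serve all vowels.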

Inner Ear

Compact 16-float feedback vector per frame. F0, voicing, energy, spectral features. Causal — only hears what's already been produced.
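One plausible packing of such a vector, shown as a sketch rather than the project's exact feature set:

```python
import numpy as np

def inner_ear(frame: np.ndarray, sr: int = 22_000) -> np.ndarray:
    """16 floats per 10ms frame: energy, F0, voicing, coarse spectrum.
    An illustrative packing; causal, since it only sees the produced frame."""
    energy = np.sqrt(np.mean(frame ** 2))
    # Autocorrelation pitch estimate, searched over ~100-400 Hz lags.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = sr // 400, min(sr // 100, len(ac) - 1)
    lag = lo + int(np.argmax(ac[lo:hi]))
    f0 = sr / lag
    voicing = ac[lag] / (ac[0] + 1e-9)               # periodicity strength
    spec = np.abs(np.fft.rfft(frame))
    bands = np.array([b.mean() for b in np.array_split(spec, 13)])
    bands /= bands.sum() + 1e-9                      # 13-band spectral envelope
    return np.concatenate([[energy, f0 / 400.0, voicing], bands]).astype(np.float32)
```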

Training

Closed-loop behaviour cloning with the instrument in the training loop. No reinforcement learning needed for basic vowel control. Deterministic. Reproducible.
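A minimal closed-loop behaviour-cloning step under the same toy assumptions (stand-in instrument and ear, with a random reference trajectory standing in for real training targets):

```python
# The controller clones a reference motor trajectory while its acoustic
# feedback comes from the instrument it is actually driving. The stand-in
# instrument/ear and all shapes below are assumptions, not the project's code.
import torch
import torch.nn as nn

torch.manual_seed(0)                      # deterministic, reproducible

def instrument(m):                        # stand-in synth: 8 motors -> 220 samples
    t = torch.arange(220) / 22_000
    return m[1] * torch.sin(2 * torch.pi * (80 + 300 * m[0]) * t)

def inner_ear(a):                         # stand-in ear: 16-float feature vector
    spec = torch.fft.rfft(a).abs()
    return torch.cat([a.pow(2).mean().sqrt().view(1), spec[:15] / (spec.sum() + 1e-8)])

gru, head = nn.GRU(32, 64, num_layers=2, batch_first=True), nn.Linear(64, 8)
opt = torch.optim.Adam(list(gru.parameters()) + list(head.parameters()), lr=1e-3)

target = torch.rand(50, 16)               # per-frame target features
expert = torch.rand(50, 8)                # reference motor trajectory to clone

h, feedback, loss = None, torch.zeros(16), 0.0
for t in range(50):
    x = torch.cat([target[t], feedback]).view(1, 1, 32)
    y, h = gru(x, h)
    motors = torch.sigmoid(head(y[0, 0]))
    loss = loss + (motors - expert[t]).pow(2).mean()
    with torch.no_grad():                 # hear what was actually produced
        feedback = inner_ear(instrument(motors))
opt.zero_grad()
loss.backward()
opt.step()
```

The point of keeping the instrument in the loop is that training and deployment see the same feedback path: the controller never learns from feedback it won't have at inference time.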

Built on AWS

Phase 1's evolutionary search ran on AWS EC2: parameter sweeps on 192-core Graviton instances and multi-seed validation campaigns. Phase 2's neural training was developed locally on Apple Silicon, with cloud compute reserved for scaling experiments. Supported by the AWS Activate startup program.

About

Acoustic Intelligence is an Australian deep-tech AI research startup. We're building AI systems that learn the skill of speaking — not from data, but from physics and feedback.

Founded by Richard James. Based in Brisbane, Australia.

This work demonstrates that an AI can learn to produce distinct human vowels by controlling a physical vocal instrument through feedback alone — no speech data, no linguistic knowledge, no programmatic shaping.

Get in Touch

Research collaborations, partnerships, and investment inquiries welcome.

richard@acousticintelligence.ai