The phonics speech recognition problem: why no existing solution works — and how IndiLearn is solving it

Updated May 2026 · 6 min read · IndiLearn · Education Technology
Unsolved: no existing speech recognition reliably assesses isolated phonemes from young readers
AU+US: accent variants supported, toggled at school configuration level
3: tracing scoring algorithms (pixel, Fréchet, Hausdorff) with zone-specific feedback
0: cloud calls during speech recognition; all processing is on-device or on-site

In this article

  1. The problem standard speech recognition cannot solve
  2. Why children's phoneme production is uniquely hard to recognise
  3. The Australian accent challenge
  4. How IndiLearn is solving it
  5. PencilKit tracing: three-algorithm scoring for handwriting
  6. Personalised decodable content from session data

When we set out to build a phonics assessment app for primary students, we made the obvious first move: test every major commercial speech recognition API against the task. Google Speech-to-Text. AWS Transcribe. Whisper. Every leading option available.

None of them worked reliably on isolated phoneme production from young children. Not even close.

This is not a criticism of those products. They are excellent at what they are built for: recognising continuous speech, at word and sentence level, from adult speakers. That is not what phonics instruction requires. Phonics instruction requires recognising whether a five-year-old said /sh/ correctly in isolation. These are fundamentally different problems — and no existing solution addresses the second one reliably. This is the problem IndiLearn is solving.

The problem standard speech recognition cannot solve

Speech recognition APIs are trained on continuous speech. The underlying models learn from context: knowing what word came before and after dramatically improves recognition accuracy. A model that sees "she sells sea—" has very high confidence about what comes next. This contextual disambiguation is central to how modern speech recognition achieves high accuracy.

Isolated phoneme production removes all of that context. A child saying /b/ produces a brief burst of sound with no preceding or following context. The model that was trained on millions of hours of continuous adult speech has very little signal to work with — and the signal it has is systematically different from its training data.

The technical gap

Commercial speech recognition achieves 95%+ word accuracy (a word error rate under 5%) on adult continuous speech. On isolated phoneme production from primary-aged children, the same APIs produce error rates that make them unusable for assessment. In our testing, false acceptance rates were high enough to make the assessment meaningless: the system accepted incorrect phoneme productions as correct at rates no teacher would tolerate.

Why children's phoneme production is uniquely hard to recognise

Children's vocal tracts are structurally different from adults'. The formant frequencies — the resonance bands that characterise vowels and many consonants — are shifted relative to adult speech in ways that are not simply a scaled version of adult acoustics. A child producing /e/ sounds different from an adult producing /e/ in ways that a model trained on adult speech does not automatically accommodate.

For phonics instruction specifically, the challenge compounds. Children learning to read are learning to produce phonemes, often for the first time in a formal context. Approximations, half-formed sounds, and attempts that are educationally valid but acoustically imprecise are part of the learning process. An assessment engine that cannot tolerate appropriate variation will reject correct attempts; one that is too permissive will accept incorrect ones. Calibrating this tolerance for young learners is a design problem that has not been solved off the shelf.
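The tolerance trade-off described above can be made concrete with a small sketch. It assumes a hypothetical classifier that returns a confidence score between 0 and 1 for each attempt, plus teacher-labelled ground truth; the `Attempt` type, the `error_rates` helper, and the sample scores are all invented for illustration and are not IndiLearn data or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    score: float   # hypothetical classifier confidence that the target phoneme was produced
    correct: bool  # teacher-labelled ground truth for the attempt


def error_rates(attempts, threshold):
    """Return (false_acceptance_rate, false_rejection_rate) at a given threshold."""
    incorrect = [a for a in attempts if not a.correct]
    correct = [a for a in attempts if a.correct]
    # A false acceptance is an incorrect attempt scored at or above the threshold.
    false_accepts = sum(a.score >= threshold for a in incorrect)
    # A false rejection is a correct attempt scored below the threshold.
    false_rejects = sum(a.score < threshold for a in correct)
    far = false_accepts / len(incorrect) if incorrect else 0.0
    frr = false_rejects / len(correct) if correct else 0.0
    return far, frr


# Invented sample data: four correct and four incorrect attempts.
ATTEMPTS = [
    Attempt(0.92, True), Attempt(0.71, True), Attempt(0.55, True), Attempt(0.48, True),
    Attempt(0.62, False), Attempt(0.40, False), Attempt(0.33, False), Attempt(0.15, False),
]

for threshold in (0.3, 0.5, 0.7):
    far, frr = error_rates(ATTEMPTS, threshold)
    print(f"threshold={threshold:.1f}  FAR={far:.2f}  FRR={frr:.2f}")
```

On this toy data, the lenient threshold (0.3) accepts three of the four incorrect attempts, while the strict threshold (0.7) rejects two of the four correct ones. No single threshold removes both kinds of error, which is why the tolerance has to be calibrated rather than guessed.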

The Australian accent challenge

Most available speech recognition training data skews heavily towards American English, with British English as a secondary corpus. Australian English has distinct acoustic features that create systematic recognition errors in models trained on US or UK data.

Australian vowels are notably different. The TRAP vowel (/æ/ as in "cat") is raised towards DRESS (/e/) in many Australian speakers. The GOAT vowel has a different starting position. Australian English is also non-rhotic: the /r/ sound is pronounced only when a vowel follows it, which affects the acoustic realisation of neighbouring vowels. These are not minor variations. They are systematic differences that produce predictable error patterns when US-trained models assess Australian speakers.

Built for Australia first

IndiLearn's neural engine is trained with Australian accent variants as a primary constraint. The AU/US toggle is not an afterthought: it adjusts both the recognition engine and the text-to-speech output simultaneously, so the pronunciation the child hears in the app matches the accent the engine expects to recognise. This consistency matters for young learners who are mapping sound to symbol for the first time.
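One way to keep the two sides of the toggle in step is to derive both from a single setting, so they cannot drift apart. The sketch below is illustrative only: `Accent`, `AccentProfile`, `SchoolConfig`, and the model and voice identifiers are hypothetical names, not IndiLearn's configuration format.

```python
from dataclasses import dataclass
from enum import Enum


class Accent(Enum):
    AU = "en-AU"
    US = "en-US"


@dataclass(frozen=True)
class AccentProfile:
    recogniser_model: str  # hypothetical identifier for the phoneme classifier variant
    tts_voice: str         # hypothetical identifier for the text-to-speech voice


# Each accent maps to exactly one recogniser/voice pair.
PROFILES = {
    Accent.AU: AccentProfile("phoneme-classifier-au", "tts-voice-au"),
    Accent.US: AccentProfile("phoneme-classifier-us", "tts-voice-us"),
}


@dataclass(frozen=True)
class SchoolConfig:
    school_name: str
    accent: Accent  # the single school-level toggle

    @property
    def profile(self) -> AccentProfile:
        # Both components are looked up from the same key, so the voice the
        # child hears and the recogniser that scores them always match.
        return PROFILES[self.accent]


config = SchoolConfig("Example Primary School", Accent.AU)
print(config.profile.recogniser_model, config.profile.tts_voice)
```

Because the recogniser and the voice are never set independently, a mismatched pairing (for example, a US voice with an AU recogniser) cannot be configured by accident.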

How IndiLearn is solving it

IndiLearn is developing a purpose-built phoneme classifier for isolated phoneme production from primary-aged children, with accent variants as explicit design constraints. Unlike general speech recognisers that transcribe arbitrary speech, our approach targets the narrow task of classifying whether a child produced a specific phoneme correctly. The task is narrower in scope but acoustically harder, and existing APIs were not designed for it.

The classifier runs on-site within the school's infrastructure: all processing happens inside the school network, so no audio leaves the building and no child's voice is sent to an external server.

Approach | What it solves | What it doesn't
Commercial speech API (word-level) | Continuous adult speech, high accuracy | Isolated phonemes, children's voices, Australian accents
Keyword spotting | Fixed vocabulary recognition | Phoneme-level acoustic discrimination, accent variation
IndiLearn neural classifier | Isolated phoneme production, children, AU/US accents | Continuous speech, languages outside scope

PencilKit tracing: three-algorithm scoring for handwriting

The writing component of the phonics app uses Apple's PencilKit framework to capture stylus input on iPad. Assessment of whether a child has traced a letter correctly requires more than pixel coverage — a child who traces a messy but complete shape needs different feedback to one who traces a precise but incorrectly proportioned form.

IndiLearn uses three algorithms simultaneously: pixel coverage, Fréchet distance, and Hausdorff distance.

The combination of these three scores enables zone-specific feedback: which part of the letter was correct, and which part needs attention. A child whose top loop is right but lower stroke is too shallow gets different feedback to one whose stroke direction is reversed. Generic pass/fail scoring cannot generate this level of specificity.
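To show why three scores say more than one, here is a minimal sketch of the textbook forms of the three measures, applied to strokes represented as ordered lists of (x, y) points. It is a Python illustration under simplifying assumptions, not the app's code: the function names, the grid-cell approximation of pixel coverage, and the sample strokes are all made up for the example.

```python
import math


def hausdorff(a, b):
    """Symmetric Hausdorff distance: worst-case gap between two point sets, ignoring order."""
    def directed(p, q):
        return max(min(math.dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))


def discrete_frechet(a, b):
    """Discrete Fréchet distance: like Hausdorff, but respects the order points were drawn in."""
    n, m = len(a), len(b)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(a[i], b[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[-1][-1]


def pixel_coverage(template, trace, cell=1.0):
    """Fraction of template grid cells that the trace also touches (a crude stand-in for pixels)."""
    def cells(points):
        return {(math.floor(x / cell), math.floor(y / cell)) for x, y in points}
    target = cells(template)
    return len(target & cells(trace)) / len(target)


# Invented strokes: a vertical line, the same line drawn in the opposite
# direction, and the same line drawn one unit to the right.
TEMPLATE = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
REVERSED = list(reversed(TEMPLATE))
SHIFTED = [(x + 1.0, y) for x, y in TEMPLATE]

for name, trace in (("reversed", REVERSED), ("shifted", SHIFTED)):
    print(name,
          pixel_coverage(TEMPLATE, trace),
          hausdorff(TEMPLATE, trace),
          discrete_frechet(TEMPLATE, trace))
```

The reversed stroke covers every template cell and has a Hausdorff distance of zero, yet its Fréchet distance equals the full length of the stroke, because Fréchet compares the points in drawing order. That is the kind of disagreement between scores that lets feedback distinguish a reversed stroke direction from a misplaced or misshapen one.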

Personalised decodable content from session data

After each session, IndiLearn's content system generates new decodable words and sentences constrained to the student's active grapheme set — only phoneme-grapheme correspondences that have been explicitly taught — and targeted to the phonemes where the student struggled in the current session.

This generation runs on the school's on-site server, using an AI model hosted there. The constraint system ensures decodability: every word the child reads or hears can be decoded using phonics rules they have been taught. No sight-word memorisation is required and no guessing from context, which is precisely what systematic synthetic phonics instruction demands.
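The decodability constraint can be sketched as a filter over candidate words. The example below is a deliberate simplification: it works on spelling alone, treating a word as decodable if it can be split entirely into taught graphemes, whereas a real system would also need the phoneme each grapheme represents in that word. The `segment` and `select_practice` functions and the word lists are hypothetical.

```python
def segment(word, taught):
    """Split a word into taught graphemes, preferring longer graphemes; None if impossible."""
    if not word:
        return []
    longest = max((len(g) for g in taught), default=0)
    for size in range(min(len(word), longest), 0, -1):
        head = word[:size]
        if head in taught:
            rest = segment(word[size:], taught)
            if rest is not None:
                return [head] + rest
    return None


def is_decodable(word, taught):
    """A word is decodable (in this simplified sense) if it splits fully into taught graphemes."""
    return segment(word, taught) is not None


def select_practice(candidates, taught, struggled):
    """Keep decodable words that exercise at least one grapheme the student struggled with."""
    selected = []
    for word in candidates:
        graphemes = segment(word, taught)
        if graphemes is not None and struggled & set(graphemes):
            selected.append(word)
    return selected


# Invented session data for illustration.
TAUGHT = {"s", "a", "t", "p", "i", "n", "sh"}
STRUGGLED = {"sh"}
CANDIDATES = ["ship", "shin", "pat", "chip", "tin", "fish"]

print(select_practice(CANDIDATES, TAUGHT, STRUGGLED))
```

Here "pat" and "tin" are decodable but do not practise the target grapheme, while "chip" and "fish" are excluded because "ch" and "f" have not been taught. Only "ship" and "shin" satisfy both constraints.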

Why this matters for teachers

A teacher running phonics sessions with 25 students cannot generate individualised decodable content for each child based on that morning's session data. IndiLearn is building this capability — so the teacher sees which phonemes each student struggled with and the app prepares targeted practice content for next time. The teacher decides what to do; the logistics are handled.

The phonics technology Australian teachers have been waiting for.

Register your interest to be part of the solution — pilot schools are shaping what we build.
