The phonics recognition problem nobody solved: how IndiLearn built a proprietary neural engine

Updated May 2026 · 6 min read · IndiLearn · Education Technology
First: on-device isolated phoneme assessment app for young readers
AU+US: accent variants supported, with a toggle at school configuration level
3: tracing scoring algorithms (pixel, Fréchet, Hausdorff) with zone-specific feedback
0: cloud calls during speech recognition; everything runs on-device or on-site

In this article

  1. The problem standard speech recognition cannot solve
  2. Why children's phoneme production is uniquely hard to recognise
  3. The Australian accent challenge
  4. What IndiLearn built: the proprietary neural engine
  5. PencilKit tracing: three-algorithm scoring for handwriting
  6. Personalised decodable content from session data

When we set out to build a phonics assessment app for primary students, we made the obvious first move: test every major commercial speech recognition API against the task. Apple's Speech framework. Google Speech-to-Text. Amazon Transcribe. Whisper.

None of them worked reliably on isolated phoneme production from young children. Not even close.

This is not a criticism of those products. They are excellent at what they are built for: recognising continuous speech, at word and sentence level, from adult speakers. That is not what phonics instruction requires. Phonics instruction requires recognising whether a five-year-old said /sh/ correctly in isolation. These are different problems.

The problem standard speech recognition cannot solve

Speech recognition APIs are trained on continuous speech. The underlying models learn from context: knowing what word came before and after dramatically improves recognition accuracy. A model that sees "she sells sea—" has very high confidence about what comes next. This contextual disambiguation is central to how modern speech recognition achieves high accuracy.

Isolated phoneme production removes all of that context. A child saying /b/ produces a brief burst of sound with no preceding or following context. The model that was trained on millions of hours of continuous adult speech has very little signal to work with — and the signal it has is systematically different from its training data.

The technical gap

Commercial speech recognition achieves 95%+ word accuracy on adult continuous speech. On isolated phoneme production from primary-aged children, the same APIs produce error rates that make them unusable for assessment. In our testing, false acceptance rates were high enough to make the assessment meaningless: the system accepted incorrect phoneme productions as correct at rates no teacher would tolerate.
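To make the metric concrete, here is a minimal Python sketch of how false acceptance and false rejection rates can be computed from labelled attempts. The `Attempt` type and the sample data are hypothetical and purely illustrative; they are not IndiLearn's measurements or evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    truly_correct: bool  # judgement from a human rater (assumed ground truth)
    accepted: bool       # what the recogniser decided


def error_rates(attempts):
    """Return (false_acceptance_rate, false_rejection_rate)."""
    incorrect = [a for a in attempts if not a.truly_correct]
    correct = [a for a in attempts if a.truly_correct]
    # False acceptance: an incorrect production that the system marked correct.
    far = sum(a.accepted for a in incorrect) / len(incorrect) if incorrect else 0.0
    # False rejection: a correct production that the system marked incorrect.
    frr = sum(not a.accepted for a in correct) / len(correct) if correct else 0.0
    return far, frr


# Made-up toy data: 2 correct productions, 4 incorrect ones.
toy_attempts = [
    Attempt(True, True), Attempt(True, False),
    Attempt(False, True), Attempt(False, True),
    Attempt(False, False), Attempt(False, False),
]
far, frr = error_rates(toy_attempts)
print(f"false acceptance {far:.0%}, false rejection {frr:.0%}")
```

For assessment, false acceptance is the more damaging of the two: it tells a child and a teacher that a sound has been mastered when it has not.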

Why children's phoneme production is uniquely hard to recognise

Children's vocal tracts are structurally different from adults'. The formant frequencies (the resonance bands that characterise vowels and many consonants) are higher than in adult speech, and the shift is not a simple uniform scaling of adult acoustics. A child producing /e/ sounds different from an adult producing /e/ in ways that a model trained on adult speech does not automatically accommodate.
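A rough first-order picture of the formant shift comes from the textbook model of the vocal tract as a uniform tube closed at one end, whose resonances fall at odd multiples of c / 4L. The Python sketch below uses that model only; the tract lengths are assumed round numbers chosen for illustration, not measurements, and real children's speech departs from this simple scaling, which is the point made above.

```python
SPEED_OF_SOUND_CM_S = 35_000  # approximate speed of sound in warm, humid air


def tube_resonances(length_cm, count=3):
    """Resonances (Hz) of a uniform tube closed at one end: (2n - 1) * c / (4L)."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4 * length_cm)
            for n in range(1, count + 1)]


# Assumed illustrative tract lengths, not measured values.
adult = tube_resonances(17.5)
child = tube_resonances(12.5)

for a, c in zip(adult, child):
    print(f"adult {a:.0f} Hz -> child {c:.0f} Hz (ratio {c / a:.2f})")
```

In this idealised model every resonance moves by the same ratio. Real vocal tracts do not grow uniformly, so the actual shift differs from formant to formant, and a single frequency-warping factor applied to an adult-trained model only partly compensates.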

For phonics instruction specifically, the challenge compounds. Children learning to read are learning to produce phonemes, often for the first time in a formal context. Approximations, half-formed sounds, and attempts that are educationally valid but acoustically imprecise are part of the learning process. An assessment engine that cannot tolerate appropriate variation will reject correct attempts; one that is too permissive will accept incorrect ones. Calibrating this tolerance for young learners is a design problem that has not been solved off the shelf.
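The tolerance trade-off can be sketched as a threshold sweep over classifier confidence scores. The Python example below is a hedged illustration: the scored attempts are invented, and `pick_threshold` with its `max_far` parameter is a hypothetical helper, not a description of how IndiLearn calibrates its engine.

```python
def rates_at(threshold, scored):
    """scored: list of (confidence, truly_correct). Accept when confidence >= threshold."""
    correct = [c for c, ok in scored if ok]
    incorrect = [c for c, ok in scored if not ok]
    far = sum(c >= threshold for c in incorrect) / len(incorrect)  # too permissive
    frr = sum(c < threshold for c in correct) / len(correct)       # too strict
    return far, frr


def pick_threshold(scored, max_far):
    """Lowest observed threshold whose false acceptance rate is within max_far.

    Taking the lowest such threshold keeps false rejections as low as the
    false acceptance limit allows. Returns (threshold, far, frr) or None.
    """
    for t in sorted({c for c, _ in scored}):
        far, frr = rates_at(t, scored)
        if far <= max_far:
            return t, far, frr
    return None


# Invented confidence scores with human labels (True = correct production).
scored = [
    (0.95, True), (0.80, True), (0.60, True), (0.40, True),
    (0.70, False), (0.50, False), (0.30, False), (0.20, False),
]
print(pick_threshold(scored, max_far=0.25))
```

Tightening `max_far` pushes the threshold up and the false rejection rate with it, which is exactly the tension between rejecting valid approximations and accepting incorrect ones.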

The Australian accent challenge

Most available speech recognition training data skews heavily towards American English, with British English as a secondary corpus. Australian English has distinct acoustic features that create systematic recognition errors in models trained on US or UK data.

Australian vowels are notably different. The TRAP vowel (/æ/ as in "cat") is raised towards DRESS (/e/) in many Australian speakers. The GOAT vowel has a different starting position. Australian English is also non-rhotic — the /r/ sound is not produced after vowels — which affects the acoustic realisation of phonemes adjacent to /r/. These are not minor variations. They are systematic differences that produce predictable error patterns when US-trained models assess Australian speakers.

Built for Australia first

IndiLearn's neural engine is trained with Australian accent variants as a primary constraint. The AU/US toggle is not an afterthought: it adjusts both the recognition engine and the text-to-speech output simultaneously, so the model pronunciation the child hears in the app matches the accent the engine expects to recognise. This consistency matters for young learners who are mapping sound to symbol for the first time.
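One way to keep recognition and speech output in step is to derive both from a single setting, so they cannot drift apart. The Python sketch below shows that idea only; `Accent`, `AccentProfile`, and the variant naming scheme are hypothetical names invented for illustration, not IndiLearn's configuration format.

```python
from dataclasses import dataclass
from enum import Enum


class Accent(Enum):
    AU = "en-AU"
    US = "en-US"


@dataclass(frozen=True)
class AccentProfile:
    recogniser_variant: str  # hypothetical identifier for the classifier variant
    tts_voice_locale: str    # locale used for the spoken model pronunciation


def profile_for(accent: Accent) -> AccentProfile:
    """Derive both settings from the one school-level toggle."""
    return AccentProfile(
        recogniser_variant=f"phoneme-classifier-{accent.value}",
        tts_voice_locale=accent.value,
    )


school_profile = profile_for(Accent.AU)
print(school_profile)
```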

What IndiLearn built: the proprietary neural engine

IndiLearn's phoneme recognition engine is a purpose-built neural classifier trained specifically for isolated phoneme production from primary-aged children, with Australian and US accent variants as explicit training dimensions. It runs on-device using Apple's Core ML framework on iPad, with more compute-intensive classification falling back to the school's Mac mini over the local network.

The engine is not a general speech recogniser. It is a phoneme classifier — it produces a confidence score for each possible grapheme-phoneme correspondence given the audio input, rather than attempting to transcribe arbitrary speech. This narrower task allows much higher accuracy on the specific problem it is solving.
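The difference from transcription can be sketched in a few lines of Python: a closed-set classifier scores every candidate and the assessment checks the target's score against a threshold. The softmax normalisation, the label set, the logit values, and the 0.6 threshold are all assumptions made for illustration; the article does not specify how the engine computes or thresholds its confidence scores.

```python
import math


def softmax(logits):
    """Turn raw per-class scores into confidences that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def assess(logits, target, threshold=0.6):
    """Accept the attempt if the target's confidence reaches the threshold."""
    scores = softmax(logits)
    return scores[target] >= threshold, scores


# Invented classifier outputs for one attempt at /sh/.
example_logits = {"sh": 2.0, "s": 0.5, "ch": 0.1, "th": -1.0}
accepted, scores = assess(example_logits, target="sh")
print(accepted, {label: round(p, 3) for label, p in scores.items()})
```

Because the output space is a small fixed set rather than arbitrary text, the per-class scores also show which sound the attempt was confused with, which is useful for feedback.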

Approach | What it solves | What it doesn't
Commercial speech API (word-level) | Continuous adult speech, high accuracy | Isolated phonemes, children's voices, Australian accents
Keyword spotting | Fixed vocabulary recognition | Phoneme-level acoustic discrimination, accent variation
IndiLearn neural classifier | Isolated phoneme production, children, AU/US accents | Continuous speech, languages outside scope

PencilKit tracing: three-algorithm scoring for handwriting

The writing component of the phonics app uses Apple's PencilKit framework to capture stylus input on iPad. Assessment of whether a child has traced a letter correctly requires more than pixel coverage — a child who traces a messy but complete shape needs different feedback to one who traces a precise but incorrectly proportioned form.

IndiLearn uses three algorithms simultaneously: pixel coverage, Fréchet distance, and Hausdorff distance.

The combination of these three scores enables zone-specific feedback: which part of the letter was correct, and which part needs attention. A child whose top loop is right but lower stroke is too shallow gets different feedback to one whose stroke direction is reversed. Generic pass/fail scoring cannot generate this level of specificity.
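To show why the three measures complement each other, here is a minimal Python sketch of each, using the standard textbook definitions. It assumes strokes arrive as densely sampled lists of (x, y) points; the function names and the grid `cell` size are illustrative choices, and this is not IndiLearn's scoring code.

```python
import math


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sets (ignores stroke order)."""
    def directed(p, q):
        return max(min(math.dist(x, y) for y in q) for x in p)
    return max(directed(a, b), directed(b, a))


def discrete_frechet(a, b):
    """Discrete Fréchet distance between two polylines (respects stroke order)."""
    n, m = len(a), len(b)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(a[i], b[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[-1][-1]


def pixel_coverage(template, trace, cell=1.0):
    """Fraction of grid cells touched by the template that the trace also touches.

    Only the sample points are rasterised, so this assumes dense sampling.
    """
    def cells(points):
        return {(math.floor(x / cell), math.floor(y / cell)) for x, y in points}
    template_cells = cells(template)
    return len(template_cells & cells(trace)) / len(template_cells)


# A vertical stroke, and the same stroke drawn in the opposite direction.
template = [(0, 0), (0, 1), (0, 2), (0, 3)]
reversed_trace = list(reversed(template))

print(pixel_coverage(template, reversed_trace))    # full coverage
print(hausdorff(template, reversed_trace))         # identical shape
print(discrete_frechet(template, reversed_trace))  # order mismatch shows up here
```

In this example coverage and Hausdorff distance both report a perfect shape, while the Fréchet distance is large because it compares the strokes in drawing order. Computing the same scores per region of the letter, rather than over the whole stroke, is one plausible route to zone-specific feedback.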

Personalised decodable content from session data

After each session, IndiLearn's content system generates new decodable words and sentences constrained to the student's active grapheme set — only phoneme-grapheme correspondences that have been explicitly taught — and targeted to the phonemes where the student struggled in the current session.
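Choosing which phonemes to target can be as simple as ranking per-phoneme accuracy from the session log. The Python sketch below assumes a log of (phoneme, was_correct) pairs; the `max_accuracy` and `min_attempts` cut-offs are hypothetical parameters for illustration, not IndiLearn's actual selection rule.

```python
from collections import defaultdict


def struggled_phonemes(attempts, max_accuracy=0.7, min_attempts=3):
    """Return phonemes at or below max_accuracy, weakest first.

    attempts: iterable of (phoneme, was_correct) pairs from one session.
    Phonemes with fewer than min_attempts tries are skipped as too noisy.
    """
    totals = defaultdict(lambda: [0, 0])  # phoneme -> [correct, total]
    for phoneme, ok in attempts:
        totals[phoneme][0] += int(ok)
        totals[phoneme][1] += 1
    flagged = [
        (correct / total, phoneme)
        for phoneme, (correct, total) in totals.items()
        if total >= min_attempts and correct / total <= max_accuracy
    ]
    return [phoneme for _, phoneme in sorted(flagged)]


# Invented session log.
session = [
    ("sh", True), ("sh", False), ("sh", False),
    ("b", True), ("b", True), ("b", True),
    ("th", False), ("th", False), ("th", True), ("th", False),
    ("m", False),
]
print(struggled_phonemes(session))
```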

This generation runs on the school's Mac mini using a local language model. The constraint system ensures decodability: every word the child reads or hears can be decoded using phonics rules they have been taught. No sight-word memorisation required, no guessing from context — which is precisely what systematic synthetic phonics instruction demands.
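A decodability constraint of this kind can be approximated as a filter applied to whatever candidate words the generator proposes. The Python sketch below checks whether a word can be split entirely into taught graphemes; it works on spelling alone, which is a simplification (a real check would also confirm that each grapheme carries the taught sound in that word). The function names and the sample grapheme set are illustrative assumptions.

```python
def segment(word, taught):
    """Return one split of `word` into taught graphemes, or None if impossible."""
    if not word:
        return []
    longest = max((len(g) for g in taught), default=0)
    # Try longer graphemes first so "sh" is preferred over "s" + "h".
    for size in range(min(longest, len(word)), 0, -1):
        head = word[:size]
        if head in taught:
            rest = segment(word[size:], taught)
            if rest is not None:  # backtrack if the remainder cannot be split
                return [head] + rest
    return None


def decodable(words, taught):
    """Keep only the candidate words built entirely from taught graphemes."""
    return [w for w in words if segment(w, taught) is not None]


# Hypothetical active grapheme set for one student.
taught = {"s", "a", "t", "p", "i", "n", "sh"}
candidates = ["sat", "ship", "chip", "pin"]  # e.g. proposed by any generator

print(segment("ship", taught))
print(decodable(candidates, taught))
```

Filtering after generation keeps the guarantee independent of the language model: whatever the model proposes, nothing outside the taught set reaches the child.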

Why this matters for teachers

A teacher running phonics sessions with 25 students cannot generate individualised decodable content for each child based on that morning's session data. IndiLearn does this automatically, before the next session starts. The teacher sees which phonemes each student struggled with; the app has already prepared targeted practice content. The teacher decides what to do; the logistics are handled.

The phonics technology Australian teachers have been waiting for.

Register your school's interest for pilot access in 2026.
