Why can't standard speech recognition assess children's phonics?

Standard speech recognition systems are trained on continuous adult speech at word and sentence level. They perform poorly on isolated phoneme production for two compounding reasons: first, children's vocal tracts produce substantially different acoustic patterns to adults; second, isolated phonemes — a single /sh/ or /oo/ — lack the context that word-level models rely on to disambiguate ambiguous sounds. A child saying /b/ in isolation produces an acoustic signal that most commercial APIs simply cannot reliably identify.

What makes Australian phonics assessment technically harder?

Australian English has distinct accent features that differ from the US and UK English most speech recognition models are trained on. Australian vowel shifts — particularly the raised TRAP and DRESS vowels — and the non-rhotic character of Australian English create systematic differences in how phonemes are produced. A model trained on US children's speech will make predictable errors on Australian children's speech. IndiLearn's engine is trained with Australian accent variants as a primary design constraint, not an afterthought.

How does IndiLearn's phonics app assess letter tracing and handwriting?

IndiLearn uses Apple's PencilKit framework combined with multi-algorithm scoring across three dimensions: pixel coverage (how much of the expected letter shape was traced), Fréchet distance (how closely the stroke path matches the target), and Hausdorff distance (maximum deviation from the ideal form). This combination allows zone-specific feedback — 'your top loop is correct, the lower stroke needs to come further left' — rather than a generic pass/fail. The scoring runs entirely on-device with no cloud call.

Does IndiLearn support different Australian and US accents?

Yes. IndiLearn includes an AU/US accent toggle that adjusts both the speech recognition engine and the text-to-speech output simultaneously. This allows schools to configure the app for their student population's accent context. The toggle is a system-level setting, not a per-session option, ensuring consistency across a class.

How does IndiLearn generate decodable content for individual students?

IndiLearn's content generation uses a local language model running on the school's Mac mini to generate decodable words and sentences constrained to each student's active graphemes — the phoneme-grapheme correspondences they have been taught. The model also takes session struggle data as input, generating content that targets the specific phonemes the student found difficult in the current session. All generation happens on-site with no cloud call.

The Phonics Recognition Problem Nobody Solved: IndiLearn's Proprietary Neural Engine

When we set out to build a phonics assessment app for primary students, we made the obvious first move: test every major commercial speech recognition API against the task. Apple's Speech framework. Google Speech-to-Text. AWS Transcribe. Whisper.

None of them worked reliably on isolated phoneme production from young children. Not even close.

This is not a criticism of those products. They are excellent at what they are built for: recognising continuous speech, at word and sentence level, from adult speakers. That is not what phonics instruction requires. Phonics instruction requires recognising whether a five-year-old said /sh/ correctly in isolation. These are different problems.

The problem standard speech recognition cannot solve

Speech recognition APIs are trained on continuous speech. The underlying models learn from context: knowing what word came before and after dramatically improves recognition accuracy. A model that sees "she sells sea—" has very high confidence about what comes next. This contextual disambiguation is central to how modern speech recognition achieves high accuracy.

Isolated phoneme production removes all of that context. A child saying /b/ produces a brief burst of sound with no preceding or following context. The model that was trained on millions of hours of continuous adult speech has very little signal to work with — and the signal it has is systematically different from its training data.

The technical gap

Commercial speech recognition achieves 95%+ word accuracy (a word error rate below 5%) on adult continuous speech. On isolated phoneme production from primary-aged children, the same APIs produce error rates that make them unusable for assessment. We measured false acceptance rates high enough to make the assessment meaningless: the system would accept an incorrect phoneme production as correct at rates no teacher would tolerate.
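For readers unfamiliar with the metric, word error rate counts the word-level substitutions, insertions and deletions needed to turn a transcript into the reference, divided by the reference length, so lower is better. The sketch below is a generic illustration in Python, not part of IndiLearn's engine; the function name and example sentences are ours.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("she sells sea shells", "she sells the shells"))
```

The metric only makes sense when there are words to align, which is part of why it says little about a single isolated phoneme.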

Why children's phoneme production is uniquely hard to recognise

Children's vocal tracts are structurally different from adults'. The formant frequencies — the resonance bands that characterise vowels and many consonants — are shifted relative to adult speech in ways that are not simply a scaled version of adult acoustics. A child producing /e/ sounds different from an adult producing /e/ in ways that a model trained on adult speech does not automatically accommodate.

For phonics instruction specifically, the challenge compounds. Children learning to read are learning to produce phonemes, often for the first time in a formal context. Approximations, half-formed sounds, and attempts that are educationally valid but acoustically imprecise are part of the learning process. An assessment engine that cannot tolerate appropriate variation will reject correct attempts; one that is too permissive will accept incorrect ones. Calibrating this tolerance for young learners is a design problem that has not been solved off the shelf.
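One way to picture the calibration trade-off is as a threshold on the recogniser's confidence: raising it rejects more correct attempts, lowering it accepts more incorrect ones. The Python sketch below is illustrative only. The trial data, the candidate thresholds and the function names are invented for the example and do not describe how IndiLearn calibrates its engine.

```python
def rates_at_threshold(scored_trials, threshold):
    """Return (false_acceptance_rate, false_rejection_rate) at one threshold.

    scored_trials is a list of (confidence, produced_correctly) pairs, where
    produced_correctly is a human rater's judgement (hypothetical labels).
    """
    false_accepts = false_rejects = n_correct = n_incorrect = 0
    for confidence, produced_correctly in scored_trials:
        accepted = confidence >= threshold
        if produced_correctly:
            n_correct += 1
            false_rejects += not accepted
        else:
            n_incorrect += 1
            false_accepts += accepted
    far = false_accepts / n_incorrect if n_incorrect else 0.0
    frr = false_rejects / n_correct if n_correct else 0.0
    return far, frr


def pick_threshold(scored_trials, max_far, candidates):
    """Lowest-rejection threshold whose false acceptance rate stays within max_far."""
    best = None
    for threshold in candidates:
        far, frr = rates_at_threshold(scored_trials, threshold)
        if far <= max_far and (best is None or frr < best[1]):
            best = (threshold, frr)
    return best[0] if best is not None else None


# Invented example data: (confidence, rater says the phoneme was correct).
scored = [(0.95, True), (0.80, True), (0.62, True), (0.55, True),
          (0.70, False), (0.40, False), (0.30, False), (0.58, False)]
candidates = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

print(pick_threshold(scored, max_far=0.25, candidates=candidates))
```

Tightening `max_far` pushes the chosen threshold up and the rejection rate with it, which is the tension described above.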

The Australian accent challenge

Most available speech recognition training data skews heavily towards American English, with British English as a secondary corpus. Australian English has distinct acoustic features that create systematic recognition errors in models trained on US or UK data.

Australian vowels are notably different. The TRAP vowel (/æ/ as in "cat") is raised towards DRESS (/e/) in many Australian speakers. The GOAT vowel has a different starting position. Australian English is also non-rhotic: /r/ is pronounced only when a vowel follows it, which affects the acoustic realisation of phonemes adjacent to /r/. These are not minor variations. They are systematic differences that produce predictable error patterns when US-trained models assess Australian speakers.

Built for Australia first

IndiLearn's neural engine is trained with Australian accent variants as a primary constraint. The AU/US toggle is not an afterthought — it adjusts both the recognition engine and the text-to-speech output simultaneously, so the model the child hears in the app matches the accent the engine expects to recognise. This consistency matters for young learners who are mapping sound to symbol for the first time.
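The point that one setting drives both sides can be sketched as a single accent value that resolves to a paired recogniser and voice. This is a minimal Python illustration under our own assumptions: the model and voice identifiers are made up, and the real app's configuration is not described in this article.

```python
from dataclasses import dataclass
from enum import Enum


class Accent(Enum):
    AU = "en-AU"
    US = "en-US"


@dataclass(frozen=True)
class AccentProfile:
    recogniser_model: str  # hypothetical model identifier
    tts_voice: str         # hypothetical voice identifier


# One table keeps the recogniser and the voice in step for each accent.
PROFILES = {
    Accent.AU: AccentProfile("phoneme-classifier-au", "tts-voice-au"),
    Accent.US: AccentProfile("phoneme-classifier-us", "tts-voice-us"),
}


def profile_for(accent: Accent) -> AccentProfile:
    """Resolve both the recogniser and the voice from the single accent setting."""
    return PROFILES[accent]


print(profile_for(Accent.AU))
```

Because both values come from the same lookup, there is no way to select an Australian voice with a US recogniser by accident.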

What IndiLearn built: the proprietary neural engine

IndiLearn's phoneme recognition engine is a purpose-built neural classifier trained specifically for isolated phoneme production from primary-aged children, with Australian and US accent variants as explicit training dimensions. It runs on-device using Apple's Core ML framework on iPad, with more compute-intensive classification falling back to the school's Mac mini over the local network.
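The article does not say what triggers the fallback to the Mac mini, so the Python sketch below simply assumes a confidence cut-off: if the on-device result is not confident enough, the audio is sent to a classifier on the local network, and the on-device result is kept if that machine cannot be reached. Every name and the 0.8 cut-off are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

Scores = Dict[str, float]
Classifier = Callable[[bytes], Scores]


def classify_with_fallback(
    audio: bytes,
    on_device: Classifier,
    local_server: Optional[Classifier],
    min_confidence: float = 0.8,  # assumed cut-off, not a documented value
) -> Tuple[Scores, str]:
    """Return (scores, source), preferring the on-device classifier."""
    scores = on_device(audio)
    if local_server is None or max(scores.values()) >= min_confidence:
        return scores, "on-device"
    try:
        return local_server(audio), "local-server"
    except OSError:
        # Local network unavailable: keep the on-device answer.
        return scores, "on-device"


# Stub classifiers standing in for the real models.
def unsure_on_device(audio: bytes) -> Scores:
    return {"sh": 0.5, "s": 0.5}


def server(audio: bytes) -> Scores:
    return {"sh": 0.85, "s": 0.15}


print(classify_with_fallback(b"", unsure_on_device, server))
```

Either path stays inside the school: nothing in this flow leaves the iPad or the local network.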

The engine is not a general speech recogniser. It is a phoneme classifier — it produces a confidence score for each possible grapheme-phoneme correspondence given the audio input, rather than attempting to transcribe arbitrary speech. This narrower task allows much higher accuracy on the specific problem it is solving.
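To make the difference from transcription concrete, here is a small Python sketch of what "a confidence score for each candidate" can look like: raw classifier outputs are normalised with a softmax, and the attempt is accepted only if the target phoneme is both the top candidate and above a threshold. The raw scores, the candidate labels and the 0.6 threshold are invented for the example, and the softmax itself is our assumption rather than a documented detail of the engine.

```python
import math


def softmax(logits):
    """Convert raw scores into confidences that sum to 1."""
    peak = max(logits.values())
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def assess(logits, target, threshold=0.6):
    """Accept only if the target is the top candidate and confident enough."""
    scores = softmax(logits)
    top = max(scores, key=scores.get)
    return {
        "target": target,
        "target_confidence": scores[target],
        "top_candidate": top,
        "accepted": top == target and scores[target] >= threshold,
    }


# Invented raw scores for a child attempting /sh/.
raw = {"sh": 2.0, "s": 0.5, "ch": 0.1, "b": -1.0}
print(assess(raw, target="sh"))
```

A transcriber would have to pick words from an open vocabulary. Here the output space is a short, closed list of candidates, which is what makes the narrower task tractable.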

Approach	What it solves	What it doesn't solve
Commercial speech API (word-level)	Continuous adult speech, high accuracy	Isolated phonemes, children's voices, Australian accents
Keyword spotting	Fixed vocabulary recognition	Phoneme-level acoustic discrimination, accent variation
IndiLearn neural classifier	Isolated phoneme production, children, AU/US accents	Continuous speech, languages outside scope

PencilKit tracing: three-algorithm scoring for handwriting

The writing component of the phonics app uses Apple's PencilKit framework to capture stylus input on iPad. Assessment of whether a child has traced a letter correctly requires more than pixel coverage — a child who traces a messy but complete shape needs different feedback to one who traces a precise but incorrectly proportioned form.

IndiLearn uses three algorithms simultaneously:

Pixel coverage: what proportion of the expected letter form was traced? Low coverage indicates incomplete strokes.
Fréchet distance: how closely does the stroke path follow the target trajectory, taking the order of the points into account? This penalises strokes that reach the correct endpoints by the wrong path.
Hausdorff distance: what is the largest gap between any point on one curve and the nearest point on the other? This catches strokes that are mostly correct but have a significant outlier region.
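The three measures can be sketched in a few lines of Python. This is a textbook-style illustration, not IndiLearn's implementation: strokes are simplified to short lists of (x, y) points, the letter form to a set of grid cells, and the Fréchet distance uses the standard discrete approximation. All names and sample points are ours.

```python
import math


def pixel_coverage(target_pixels, traced_pixels):
    """Fraction of the target's grid cells that the child's trace touched."""
    if not target_pixels:
        raise ValueError("target must contain at least one pixel")
    return len(target_pixels & traced_pixels) / len(target_pixels)


def directed_hausdorff(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance; ignores the order of the points."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def discrete_frechet(p, q):
    """Discrete Fréchet distance; sensitive to the order of the points."""
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[n - 1][m - 1]


target = [(0, 0), (1, 0), (2, 0), (3, 0)]
backwards = list(reversed(target))  # same shape, drawn in the wrong direction

print(hausdorff(target, backwards))         # 0.0: the point sets are identical
print(discrete_frechet(target, backwards))  # 3.0: the traversal order differs
```

The reversed stroke shows why more than one measure is needed: the shape is perfect by Hausdorff distance, while the Fréchet distance exposes that it was drawn in the wrong direction.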

The combination of these three scores enables zone-specific feedback: which part of the letter was correct, and which part needs attention. A child whose top loop is right but lower stroke is too shallow gets different feedback to one whose stroke direction is reversed. Generic pass/fail scoring cannot generate this level of specificity.
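As a simplified picture of zone-specific feedback, the Python sketch below splits a letter into named zones and scores each one separately. It uses pixel coverage alone to keep the example short, whereas the approach described here combines three scores. The zone names, grid cells and 0.8 cut-off are assumptions made for illustration.

```python
def zone_feedback(zones, traced_pixels, min_coverage=0.8):
    """Score each named zone of the letter on its own.

    zones maps a zone name to the set of grid cells that belong to it.
    Returns {zone_name: (status, coverage)}.
    """
    report = {}
    for name, pixels in zones.items():
        coverage = len(pixels & traced_pixels) / len(pixels)
        status = "ok" if coverage >= min_coverage else "needs attention"
        report[name] = (status, coverage)
    return report


# Hypothetical zones for a letter with a loop above a descending stroke.
zones = {
    "top loop": {(0, 0), (1, 0), (0, 1), (1, 1)},
    "lower stroke": {(0, 2), (0, 3), (0, 4), (0, 5)},
}
traced = {(0, 0), (1, 0), (0, 1), (1, 1), (0, 2)}

for zone, (status, coverage) in zone_feedback(zones, traced).items():
    print(f"{zone}: {status} ({coverage:.0%} covered)")
```

A single whole-letter score for this trace would hide that the loop is complete and only the lower stroke falls short.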

Personalised decodable content from session data

After each session, IndiLearn's content system generates new decodable words and sentences. The content is constrained to the student's active grapheme set (only phoneme-grapheme correspondences that have been explicitly taught) and targeted to the phonemes where the student struggled in that session.

This generation runs on the school's Mac mini using a local language model. The constraint system ensures decodability: every word the child reads or hears can be decoded using phonics rules they have been taught. No sight-word memorisation required, no guessing from context — which is precisely what systematic synthetic phonics instruction demands.
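The decodability constraint can be illustrated with a simple check: a word passes only if it can be split, left to right, into graphemes the student has been taught. The Python sketch below is our own simplification, not IndiLearn's constraint system. It looks at spelling only and does not verify that each grapheme carries its taught sound in that particular word. The grapheme set and word list are invented.

```python
def is_decodable(word, taught_graphemes):
    """True if the word can be segmented entirely into taught graphemes."""
    word = word.lower()
    n = len(word)
    # reachable[i] is True when word[:i] can be built from taught graphemes.
    reachable = [False] * (n + 1)
    reachable[0] = True
    for i in range(n):
        if not reachable[i]:
            continue
        for grapheme in taught_graphemes:
            if grapheme and word.startswith(grapheme, i):
                reachable[i + len(grapheme)] = True
    return n > 0 and reachable[n]


def targeted_words(candidates, taught_graphemes, struggled):
    """Keep decodable words, listing those that practise struggled graphemes first."""
    decodable = [w for w in candidates if is_decodable(w, taught_graphemes)]
    return sorted(decodable, key=lambda w: -sum(w.count(g) for g in struggled))


taught = {"s", "a", "t", "p", "i", "n", "sh"}
candidates = ["sat", "ship", "chip", "pin", "shop", "tin"]

print(targeted_words(candidates, taught, struggled={"sh"}))
```

A filter like this could sit after a language model as a hard gate, so that nothing outside the taught set reaches the student regardless of what the model proposes.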

Why this matters for teachers

A teacher running phonics sessions with 25 students cannot generate individualised decodable content for each child based on that morning's session data. IndiLearn does this automatically, before the next session starts. The teacher sees which phonemes each student struggled with; the app has already prepared targeted practice content. The teacher decides what to do; the logistics are handled.
