Why your voice-activated device does not work on my grandmother's voice

My medicine reminder system achieves 96.4% keyword spotting accuracy. I am proud of that number. I also want to be honest about what it means, because the difference between what that number says and what it does not say is, I think, one of the most important and least discussed problems in applied machine learning.

96.4% on a held-out test set drawn from the same distribution as the training data. One speaker. One room. One accent. Controlled conditions, consistent microphone placement, no background television, no overlapping conversation, no cognitive impairment affecting speech patterns. The number is real. The conditions that produced it are not the conditions under which the system would actually be used.

This is not a confession of failure. It is a description of a gap that exists in nearly every deployed speech system, from the most resource-constrained embedded device to the most heavily funded commercial assistant. The difference is that cloud-based systems have the compute budget to partially close the gap. On-device systems running on microcontrollers with 320KB of usable RAM do not. And the population that most needs offline, low-cost, embedded voice interaction is precisely the population whose speech is least represented in any training dataset.

What 96.4% actually measures

When a paper reports accuracy on a speech task, that number describes performance on a test set. The test set is a held-out portion of the full dataset, usually split randomly or by stratified sampling. If the dataset was collected from one speaker in one environment, the test set is also from one speaker in one environment. The model has never seen those exact samples, but it has seen samples from the same distribution. The accuracy number measures how well the model interpolates within a known distribution, not how well it generalises to distributions it has never encountered. In the research literature this is the distinction between in-distribution performance and out-of-distribution (OOD) robustness. Most TinyML papers report the former. Deployment demands the latter.
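The difference between the two kinds of evaluation can be made concrete with a small sketch. This is not code from the project: the `samples` list, the `speaker` field, and both function names are hypothetical. It contrasts a random split, where the same speaker can appear on both sides, with a speaker-disjoint split, where every test speaker is unseen during training.

```python
import random
from collections import defaultdict


def random_split(samples, test_fraction=0.2, seed=0):
    """Shuffle all clips together, so one speaker can land on both sides."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def speaker_disjoint_split(samples, test_fraction=0.2, seed=0):
    """Hold out whole speakers, so no test speaker is heard in training."""
    by_speaker = defaultdict(list)
    for sample in samples:
        by_speaker[sample["speaker"]].append(sample)
    speakers = sorted(by_speaker)
    rng = random.Random(seed)
    rng.shuffle(speakers)
    # Always hold out at least one speaker.
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [s for s in samples if s["speaker"] not in test_speakers]
    test = [s for s in samples if s["speaker"] in test_speakers]
    return train, test


# Hypothetical toy dataset: 5 speakers, 20 clips each.
samples = [
    {"speaker": f"spk{i}", "clip": f"spk{i}_{j}.wav"}
    for i in range(5)
    for j in range(20)
]

train, test = speaker_disjoint_split(samples)
train_speakers = {s["speaker"] for s in train}
test_speakers = {s["speaker"] for s in test}
print(sorted(train_speakers), sorted(test_speakers))
```

With a single-speaker dataset the speaker-disjoint split leaves the training set empty, which is the limitation in miniature: an out-of-distribution number cannot be produced from data that contains only one distribution.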

This distinction matters enormously in practice and is frequently elided in papers, including my own. The reported accuracy is the best-case number. It is the number you get when the person using the system sounds like the person who built it.

The target user of my medicine reminder is an elderly person in a low-connectivity setting, potentially with cognitive impairment affecting speech clarity, speaking a regional dialect of a language my training data barely covers. That person is as far from the training distribution as it is possible to be while still speaking the same language.

I did not formally benchmark performance on unseen speakers. I tested informally with a few other voices and the degradation was noticeable. I did not include those numbers in the paper because they were not from a controlled evaluation. I am including the observation here because pretending the limitation does not exist would be worse than acknowledging it without precise numbers.

Why this problem is specifically hard on embedded hardware

On a cloud system or a server, the standard approaches to this problem are well understood. You train on a large, diverse dataset covering many speakers, accents, and acoustic conditions. You use data augmentation to simulate variation you did not capture. You apply domain adaptation techniques to shift the model toward new distributions at inference time. You fine-tune on user-specific data after deployment. None of these approaches are straightforwardly available on an MCU.

Available SRAM: 320KB. Total usable after firmware. Both models and arenas must fit here.

On-device fine-tuning: not viable. Backpropagation requires storing activations and gradients. The memory cost is prohibitive.

Model download: no guarantee. The deployment context assumes intermittent or no connectivity.

Cloud fallback: absent by design. Offline operation is a hard requirement, not a nice-to-have.

Fine-tuning on device is the most obvious solution and the most obviously unavailable one. Training a neural network requires storing the forward pass activations to compute gradients during backpropagation. On a device with 320KB of total usable memory, where the inference-only forward pass already requires careful arena management just to fit, there is no room for the additional memory that training requires. This is not an engineering problem waiting for a clever implementation. It is a fundamental constraint of the hardware class.
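A back-of-envelope sketch shows why the memory cost changes so sharply. The per-layer activation sizes below are hypothetical, not measurements from the actual model, and the accounting is deliberately simplified: it assumes int8 activations at inference, float32 activations during training, and ignores weight gradients and optimiser state, which would only add to the training figure.

```python
# Hypothetical per-layer activation sizes in bytes for a small int8 model.
ACTIVATION_BYTES = [16000, 32000, 32000, 16000, 8000, 12]

# 320KB of usable SRAM, as stated in the post.
BUDGET_BYTES = 320 * 1024


def inference_peak_bytes(acts):
    """Inference can reuse buffers: only one layer's input and output are live."""
    return max(a + b for a, b in zip(acts, acts[1:]))


def training_activation_bytes(acts, bytes_per_value=4):
    """Backpropagation keeps every layer's activations until the backward pass.

    Assumes 1 byte per value at inference and float32 (4 bytes) for training.
    """
    return sum(acts) * bytes_per_value


peak = inference_peak_bytes(ACTIVATION_BYTES)
train_bytes = training_activation_bytes(ACTIVATION_BYTES)
print("inference peak:", peak, "bytes")
print("training activations:", train_bytes, "bytes")
print("fits in budget:", train_bytes <= BUDGET_BYTES)
```

Under these assumed sizes the inference peak is 64,000 bytes while the retained training activations come to 416,048 bytes, already over the 327,680-byte budget before a single weight or gradient is counted.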

Larger models trained on more diverse data would help, but larger models do not fit. My KWS model occupies approximately 130KB of SRAM and 0.7MB of flash. Getting meaningfully better generalisation from a model this size, trained on the data I could collect as one person working alone, runs into a hard upper bound. You can push against that bound, but you cannot remove it without more data, more model capacity, or a fundamentally different approach to how the model learns.

The data problem compounds the hardware problem

Training data for low-resource languages and regional accents is scarce. Training data for elderly speakers with mild cognitive impairment is scarcer still. Training data for any of these populations recorded specifically for medicine-related conversational commands, in realistic home acoustic conditions, essentially does not exist at the scale needed to train a robust model.

I recorded my dataset myself. Approximately 2,500 samples for keyword spotting, 180 to 190 per class for intent classification, all in one voice, in one room, over several sessions. The augmentation pipeline I built expanded this roughly sixfold through noise addition, time stretching, and pitch shifting. This is standard practice and it helps. It does not solve the underlying problem. Augmenting one speaker's voice does not produce another speaker's voice. It produces variations of the same speaker under different acoustic conditions.
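A minimal sketch of this kind of augmentation, using only the Python standard library. It is not the project's pipeline. The function names, SNR values, and rates are assumptions, and speed perturbation by resampling is swapped in for separate time stretching and pitch shifting, because resampling changes tempo and pitch together while the independent versions usually need a phase vocoder or an audio library.

```python
import math
import random


def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise scaled to a target signal-to-noise ratio in dB."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def speed_perturb(signal, rate):
    """Resample by linear interpolation; rate > 1 shortens the clip and raises pitch."""
    n_out = max(2, int(round(len(signal) / rate)))
    out = []
    for i in range(n_out):
        pos = i * (len(signal) - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out


def augment(signal, seed=0):
    """Return the original clip plus five variants, a sixfold expansion."""
    rng = random.Random(seed)
    return [
        signal,
        add_noise(signal, 20, rng),
        add_noise(signal, 10, rng),
        speed_perturb(signal, 0.9),
        speed_perturb(signal, 1.1),
        add_noise(speed_perturb(signal, 1.05), 15, rng),
    ]


# Hypothetical input: a 0.1 s, 440 Hz tone at 16 kHz standing in for a recorded clip.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
variants = augment(tone)
print([len(v) for v in variants])
```

Every variant is derived from the same waveform, so the vocal tract, accent, and speaking habits behind it never change. That is why the expansion adds acoustic variety but not speaker variety.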

The populations most underserved by current voice technology -- elderly speakers, non-native speakers, speakers of low-resource languages, speakers with speech differences -- are also the populations for whom offline, low-cost, embedded voice interaction would be most valuable. The people who cannot rely on a smartphone or a stable internet connection are the people who need a device that works locally. The device that works locally is the device least equipped to handle the acoustic diversity of the population that needs it.

This circularity is not accidental. It is a structural feature of how voice technology has been developed: optimised for the connected, the English-speaking, and the demographically central, deployed first where the market is largest, and reaching the edges of the distribution last if at all.

What the literature says and what it leaves out

The TinyML literature has made remarkable progress on model compression, quantization, and efficient architecture design for constrained hardware. The results are genuinely impressive. A model that achieves competitive keyword spotting accuracy at 130KB of SRAM, with int8 quantization and under 30ms inference latency, would have been considered implausible five years ago.

What the literature reports less consistently is speaker-independent accuracy. Many papers evaluate on held-out samples from the same recording sessions as the training data. Some papers use established benchmarks like Google Speech Commands, which provides better speaker diversity but is still primarily English, primarily non-elderly, and primarily recorded in controlled conditions. The gap between benchmark performance and deployment performance in genuinely diverse real-world conditions is rarely quantified, and when it is, the numbers are sobering.

I am not criticising this literature. I am a product of it and my work depends on it. I am noting that the evaluation methodology systematically overstates the readiness of these systems for the populations that need them most, and that this overstating is rarely acknowledged explicitly in the papers themselves.

The approaches that might actually work

This is where the problem becomes genuinely interesting as a research question rather than just a critique of existing work.

One direction is personalisation at enrollment time rather than continuous fine-tuning. If a new user can provide a small number of samples when they first set up the device, and the model can adapt to their voice without a full retraining cycle, you get meaningful generalisation without the memory cost of backpropagation. Techniques like prototypical networks and metric learning have shown promise in few-shot settings. Whether they can be compressed to the memory budget of an MCU while retaining meaningful adaptation capability is an open question.
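The core of the prototypical approach is small enough to sketch. The example below assumes a frozen embedding model already exists and shows only the enrollment and classification steps. The three-dimensional embeddings, the keyword labels, and the function names are all hypothetical. The property that matters is that adaptation is an average and a distance comparison, with no backpropagation.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def enroll(enrollment):
    """Build one prototype per keyword from a few embeddings of the new user."""
    return {label: mean_vector(vecs) for label, vecs in enrollment.items()}


def classify(prototypes, embedding):
    """Return the keyword whose prototype is nearest to the embedding."""
    return min(
        prototypes,
        key=lambda label: squared_distance(prototypes[label], embedding),
    )


# Hypothetical embeddings from two enrollment utterances per keyword.
enrollment = {
    "yes": [[0.9, 0.1, 0.0], [1.1, -0.1, 0.1]],
    "no": [[-1.0, 0.2, 0.0], [-0.8, 0.0, 0.2]],
}
prototypes = enroll(enrollment)
print(classify(prototypes, [0.7, 0.0, 0.1]))
```

The stored state per user is one vector per keyword, which is what makes the idea attractive on an MCU. Whether a compressed embedding model separates unfamiliar voices well enough for the prototypes to be useful is the open part.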

Another direction is architecture search specifically targeting generalisation within fixed memory budgets. The models I built were designed to fit within the hardware constraint while maximising accuracy on the training distribution. A model designed from the start to maximise speaker-independent accuracy within the same constraint might look architecturally quite different. Neural architecture search at this scale, with generalisation as the explicit optimisation target rather than just accuracy, is not well explored in the TinyML literature.

A third direction is federated approaches where devices contribute anonymised updates to a shared model without transmitting raw audio. This preserves privacy, addresses the data scarcity problem over time, and does not require connectivity at inference time. The coordination infrastructure is non-trivial and the communication constraints for low-power devices are significant, but the direction is promising.
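The aggregation step at the centre of this idea can be sketched as a weighted average, in the style of federated averaging. The weight vectors and sample counts below are hypothetical. The sketch covers only the server side and leaves open how an MCU-class device would compute its update within the memory budget, as well as any anonymisation of the updates.

```python
def federated_average(client_updates):
    """Average client weight vectors, weighted by each client's sample count.

    client_updates is a list of (weights, n_samples) pairs, where weights is a
    flat list of floats with the same length for every client.
    """
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [
        sum(weights[i] * n for weights, n in client_updates) / total
        for i in range(dim)
    ]


# Hypothetical updates from two devices; no raw audio is involved.
updates = [
    ([1.0, 2.0], 10),
    ([3.0, 4.0], 30),
]
global_weights = federated_average(updates)
print(global_weights)
```

Weighting by sample count means a device that has heard more speech pulls the shared model further, which helps coverage but also means sparsely used devices, possibly the least represented speakers, contribute least.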

None of these are solved. All of them are active areas of research. The constraint that makes them hard, fitting generalisation capability inside the memory envelope of an MCU-class device without connectivity, is exactly the constraint that the deployment context demands.

Why I think about this problem the way I do

I built a system that works reliably for one person. The paper I published describes that system accurately and reports its performance honestly. What the paper cannot fully convey, and what I want to convey here, is that building it changed my understanding of what the unsolved problem actually is.

Before building it, I thought the hard problem was getting a neural network to run on a microcontroller. That problem is mostly solved. The frameworks exist, the quantization pipelines exist, the architecture patterns are well understood. Getting inference to work within a memory budget is an engineering challenge and a real one, but it is a tractable one.

The hard problem is getting a neural network that was trained on data you could collect to work on a voice you have never heard, in an acoustic environment you have never recorded, on a device that cannot be updated without physical access, for a person whose needs motivated the entire project but whose voice was never in the training set.

That problem sits at the intersection of machine learning, systems architecture, and the practical realities of deployment in resource-constrained settings. It is the problem I want to work on. The medicine reminder was not the answer. It was the question, stated precisely enough to be worth asking.

The system described in this post is documented in full at Shiv07ansh/AIoT-Medicine-Reminder. The paper is available as a preprint at zenodo.org/records/19034554. The story of how the system got built is in the first post in this series.