How I built a voice-activated medicine reminder on a $4 microcontroller

This post is not a clean technical walkthrough. Those exist, and they are useful, but they leave out the part that actually matters: how a project like this gets built by one person with no funding, no lab, and no supervisor, while climbing the steep learning curve of bare-metal systems programming from scratch. This is that story.

The paper that came out of this work was accepted to ICDECT-2025 and published in Springer LNNS. It reports 96.4% keyword spotting accuracy and 94.1% intent classification accuracy on an ESP32-S3 microcontroller that costs about four dollars. What it does not report is that the first version of this system had none of that. It had an ultrasonic sensor and an email.

Semester six, a borrowed breadboard, and a missing mentor

It started as a semester project. The brief was vague enough that you could build almost anything, and I had been curious about IoT for a while without ever having touched the hardware. The problem was that I had no hardware. No microcontroller, no sensors, no breadboard, nothing. A Raspberry Pi, which most of my classmates were using for their projects, was simply out of budget. Looking back, that constraint saved me from building something generic.

I found an ESP8266 through someone in my extended friend group who did electronics as a hobby. The breadboard and jumper cables came from the electronics engineering department, borrowed with varying degrees of formality. The ultrasonic sensor came from a third person entirely. My first working circuit was assembled from components that belonged to at least four different people, none of whom fully understood what I was building.

The first version was genuinely simple. The ultrasonic sensor detected when someone reached into a box. The ESP8266 sent an email when it did. That was it. Medicine taken, email sent. It worked the first time I tested it in my hostel room and I felt disproportionately proud of something that was, in hindsight, about thirty lines of code.

Around this time, my faculty mentor left for another job. No warning, no handover, no replacement assigned for weeks. I was suddenly working on a graded project with no supervision and no one to ask when things went wrong. At the time this felt like a disaster. In practice it forced something valuable: I had to think independently about every decision, justify it to myself, and live with the consequences when it was wrong. I did not know it then, but this is more or less what research feels like.

The itch that would not go away

After that semester I moved on. I learned machine learning through subsequent projects, built classifiers, worked with neural networks, deployed things to cloud APIs. But every time a project touched on AI I found myself wanting to push it onto hardware. The cloud felt like cheating somehow. I wanted to know if the device itself could be intelligent.

That itch sat dormant for two years. Then I joined Cadence Design Systems and spent a year building EDA tooling in C++, working inside constraints that made cloud connectivity not just impractical but architecturally inappropriate. You do not phone home when you are synthesizing silicon. Software has to know what the hardware costs. There is something about working in EDA that recalibrates your sense of what software actually is. You are not moving pixels or serving JSON. The code you write defines constraints that eventually become physical silicon. For the first time, my work felt consequential in a way that is hard to articulate but impossible to ignore once you have felt it. When I came back to the medicine reminder, I came back with a completely different sense of what it meant to build something real.

The question was no longer "how do I remind someone to take medicine." The question was: can a microcontroller understand what someone says, in real time, with no internet connection, on hardware cheap enough that a family in rural India could actually own it?

The constraint that shaped everything

No cloud. This was not a preference. It was a hard requirement that eliminated most of the obvious approaches immediately. Cloud speech-to-text APIs need connectivity, and the models behind them are far too large to run locally on an MCU. A ChatGPT API call costs money per inference, requires connectivity, and adds hundreds of milliseconds of latency. Even small transformer models were far outside the memory budget of any device I could afford.

I needed to run inference locally. That meant TensorFlow Lite Micro. That meant I needed a microcontroller that could actually support it. By now I had moved on from the ESP8266, which had nowhere near enough memory for what I was planning. Finding the right board was its own research project.

Device	The problem
Arduino Nano	2KB SRAM. Cannot run TFLite Micro at all.
Raspberry Pi	650mA draw, $55. Wrong tool entirely.
ESP32 (original)	Viable but no PSRAM, tight on memory.
ESP32-S3 Mini	External PSRAM, vector extensions, $5. This one.

I could not afford the devkit version of the ESP32-S3 with more onboard memory. I ordered the Mini variant and worked within what it had. The external PSRAM turned out to be the feature that made everything else possible.

The dataset problem: one voice, one room, one shot

Before I could train anything I needed data. There was no existing dataset of medicine-related commands in the conversational style I needed. I recorded everything myself, in my room, across multiple sessions, trying to capture the natural variation in how someone might say "yes I took it" or "remind me in thirty minutes" when they are half-awake at seven in the morning.

About 2,500 samples for keyword spotting. Around 180 to 190 samples per intent class for the spoken language understanding model, across eight categories. All in .wav format at 16kHz. All me.

This is the primary limitation of the work, and I want to be honest about it: the system works well on my voice and degrades on voices it has not seen. That is not a minor caveat. It is the central open problem the work surfaces. I will come back to this.

Getting the augmentation pipeline right took longer than I expected. Noise addition at different SNR levels, time stretching, pitch shifting. Each technique had to be calibrated so it introduced genuine variance without making the samples unrecognisable. The augmentation script expanded each class roughly sixfold. Without it, the models overfit badly to a single speaker's vocal characteristics.
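As a rough illustration of the noise-addition step, here is a minimal sketch of mixing a noise clip into a speech clip at a chosen SNR. It is not the augmentation script from the project; the function names mean_power and mix_at_snr are hypothetical, and it assumes both clips are already the same length and sample rate.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean power (average of squared samples) of a signal.
double mean_power(const std::vector<float>& x) {
    assert(!x.empty());
    double sum = 0.0;
    for (float v : x) sum += static_cast<double>(v) * v;
    return sum / static_cast<double>(x.size());
}

// Hypothetical helper: add `noise` to `speech` so the mix has the requested
// SNR in dB. Inputs must have equal length and the noise must not be silent.
std::vector<float> mix_at_snr(const std::vector<float>& speech,
                              const std::vector<float>& noise,
                              double snr_db) {
    assert(speech.size() == noise.size());
    const double p_speech = mean_power(speech);
    const double p_noise = mean_power(noise);
    assert(p_noise > 0.0);
    // SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)), solved for gain.
    const double gain =
        std::sqrt(p_speech / (p_noise * std::pow(10.0, snr_db / 10.0)));
    std::vector<float> out(speech.size());
    for (std::size_t i = 0; i < speech.size(); ++i) {
        out[i] = speech[i] + static_cast<float>(gain * noise[i]);
    }
    return out;
}
```

Calling mix_at_snr on the same clip with several SNR values is one way to get graded difficulty from a single recording. Lower values bury the speech deeper in noise, which is where the calibration mentioned above starts to matter.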

The first model was wrong

My first instinct was a single end-to-end model: audio in, intent out. I spent time on this, tuned it, got reasonable validation accuracy in Colab. Then I tried to fit it on the device. It exceeded the SRAM budget immediately and by a large margin. Not close. Not fixable by quantization alone. The architecture was fundamentally wrong for the hardware.

Going back to the drawing board after investing that much time in a model was genuinely deflating. But it forced a better question: what does the device actually need to do at each moment in time?

Most of the time the device is just listening for a trigger. It does not need to understand language. It needs to detect whether a medicine-related keyword was just spoken. That is a much simpler problem. Only when it detects one does it need to understand what was said.

This is the cascade. A lightweight Keyword Spotting model runs continuously, doing binary detection at about 30ms per inference. When it fires, a heavier Spoken Language Understanding model activates and classifies the intent. Two models, each small enough to fit, each doing one thing well. The architecture is not novel. Alexa and Siri use the same pattern at commercial scale. The difference is that mine runs on a chip that costs less than a cup of tea, with no network connection, in real time.
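The control flow of the cascade can be sketched in a few lines. This is an illustration only, not the firmware: the Cascade class, the kNoIntent sentinel, and the callbacks standing in for the two TFLite Micro models are all hypothetical, and the detection threshold is an assumed parameter.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

using AudioFrame = std::vector<float>;

// Sentinel returned while the keyword stage has not fired (hypothetical).
constexpr int kNoIntent = -1;

// Two-stage cascade: a cheap detector gates an expensive classifier.
class Cascade {
public:
    Cascade(std::function<float(const AudioFrame&)> kws_score,
            std::function<int(const AudioFrame&)> slu_intent,
            float threshold)
        : kws_score_(std::move(kws_score)),
          slu_intent_(std::move(slu_intent)),
          threshold_(threshold) {}

    // Runs the keyword stage on every frame; the intent stage runs only
    // when the keyword score reaches the threshold.
    int process(const AudioFrame& frame) const {
        if (kws_score_(frame) < threshold_) {
            return kNoIntent;  // stay in the low-cost listening state
        }
        return slu_intent_(frame);
    }

private:
    std::function<float(const AudioFrame&)> kws_score_;
    std::function<int(const AudioFrame&)> slu_intent_;
    float threshold_;
};
```

The point of the structure is that the expensive callback is never invoked for frames the detector rejects, so the average cost per frame stays close to the cost of the small model.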

The electronics were a different kind of hard

I had forgotten, or perhaps never fully appreciated, how much of hardware work is physical. The INMP441 microphone pins needed soldering to headers before they would fit a breadboard. I do not own a soldering iron. I had to find someone who did, explain what I needed, and negotiate the use of their time and equipment. This happened more than once across different components.

The HC-SR04 ultrasonic sensor outputs 5V on its echo pin. The ESP32-S3 GPIO pins run at 3.3V and are not 5V tolerant. This is not something you discover in a tutorial. You discover it when the readings start behaving strangely and you trace it back through the datasheet. The fix is a resistor voltage divider: a 10kΩ resistor in series with the echo pin and a 20kΩ resistor from there to ground, with the GPIO reading the junction between them. That drops the 5V signal to about 3.33V. Simple once you know it. Invisible until you do.
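The 3.33V figure follows from the standard unloaded divider formula, V_out = V_in * R_bottom / (R_top + R_bottom). A small sketch that checks the arithmetic at compile time, using the resistor values from the text (the function and constant names are made up for illustration):

```cpp
#include <cassert>

// Output voltage of an unloaded two-resistor divider.
// r_top sits between the source and the tap; r_bottom between the tap and ground.
constexpr double divider_out(double v_in, double r_top, double r_bottom) {
    return v_in * r_bottom / (r_top + r_bottom);
}

// Values from the text: 5 V echo signal, 10 kOhm on top, 20 kOhm to ground.
constexpr double kEchoAtGpio = divider_out(5.0, 10e3, 20e3);

// 5 * 20 / 30 = 3.33..., checked by the compiler.
static_assert(kEchoAtGpio > 3.33 && kEchoAtGpio < 3.34,
              "divider should land at about 3.33 V");
```

Swapping the two resistors would give about 1.67V instead, which may be too low to register reliably as a logic high, so the orientation matters as much as the values.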

The microphone brought a different class of problem. The INMP441 is an I2S MEMS microphone configured at 16kHz, 16-bit mono. Getting clean audio out of it required understanding the full signal chain: the DMA buffer size determines how much audio you capture before processing, the bit depth determines your dynamic range, and the sampling rate sets your Nyquist ceiling at 8kHz, which is the upper bound of the frequency content your models can ever see. When the WiFi radio was active during capture, it induced noise that showed up as spurious high-frequency content in the spectrogram. You cannot filter out noise you do not know is there.
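The relationships in that signal chain are simple arithmetic, sketched below. The sample rate and bit depth come from the text; the 512-sample buffer length used in the note after the code is an assumed example value, not the project's actual DMA configuration, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kSampleRateHz = 16000;  // from the text
constexpr std::uint32_t kBytesPerSample = 2;    // 16-bit mono, from the text

// Highest frequency representable at a given sample rate.
constexpr std::uint32_t nyquist_hz(std::uint32_t sample_rate_hz) {
    return sample_rate_hz / 2;
}

// Raw data rate of the mono stream in bytes per second.
constexpr std::uint32_t bytes_per_second(std::uint32_t sample_rate_hz,
                                         std::uint32_t bytes_per_sample) {
    return sample_rate_hz * bytes_per_sample;
}

// Milliseconds of audio held by a buffer of `samples` samples.
constexpr double buffer_ms(std::uint32_t samples, std::uint32_t sample_rate_hz) {
    return 1000.0 * samples / sample_rate_hz;
}

static_assert(nyquist_hz(kSampleRateHz) == 8000, "8 kHz ceiling at 16 kHz");
```

With these numbers the stream produces 32,000 bytes per second, and a hypothetical 512-sample buffer holds 32ms of audio. A larger buffer means fewer interrupts per second but more delay before the first sample in it can be processed.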

The PAM8403 amplifier would reset the entire board on startup due to a voltage surge on the 5V rail. A 100µF bulk capacitor fixed it. The WiFi interference on the I2S microphone took a different fix: increasing the DMA buffer size and disabling WiFi power saving during recording windows. None of these solutions were things I knew going in. Each one was an afternoon of reading datasheets, forum posts, and GitHub issues, followed by a test, followed by either relief or another afternoon.

Three things that broke spectacularly

The bidirectional LSTM

My SLU model initially used bidirectional LSTM layers. Validation accuracy in Colab was noticeably better than with unidirectional layers. The model converted to TFLite without complaint. On device, it failed immediately with an op-resolution error. TFLite Micro does not include the ReverseV2 kernel that bidirectional layers require at runtime. The error message was not helpful. Finding the cause took two days and a GitHub issue from 2021 that had twelve upvotes and no official response.

I replaced the bidirectional layers with stacked unidirectional ones. Accuracy dropped by less than one percent. Two days of debugging to learn that the simpler architecture was fine all along.

AllOpsResolver

The first time I loaded both models simultaneously the device rebooted. Immediately. Every time. The cause was AllOpsResolver, which registers every operator TFLite Micro supports, including the many kernels my models never call. On a device with 320KB of usable SRAM this exhausted memory before a single forward pass could run.

// Every reboot traced back to this one line:
//   tflite::AllOpsResolver resolver;

// The fix: register only what the model actually uses.
// The template argument is the maximum number of ops, not the exact count.
tflite::MicroMutableOpResolver<8> resolver;
resolver.AddDepthwiseConv2D();
resolver.AddConv2D();
resolver.AddMaxPool2D();
resolver.AddFullyConnected();
resolver.AddSoftmax();
resolver.AddReshape();

I worked out which ops to register from the runtime errors and from reading micro_mutable_op_resolver.h, not from any documentation. The documentation does not tell you this. Switching resolvers saved approximately 40KB of SRAM and stopped the reboots.

The C++ segfaults

I had been writing C++ at Cadence for a year by this point, in a large production codebase, with experienced engineers around me. Arduino C++ is a different experience. Segfaults with no stack trace, crashes that only appeared after thirty seconds of operation, memory corruption that manifested as wrong inference results rather than crashes. The tensor arena needed alignas(16) to prevent load-store alignment faults. Feature extraction buffers needed to be allocated in PSRAM rather than the stack or they would silently corrupt adjacent memory. None of this was in any tutorial I could find.
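The alignment requirement is easy to state and easy to check. The sketch below runs on a desktop compiler and only illustrates what alignas(16) guarantees; the arena size and the is_aligned helper are hypothetical, and it does not reproduce the PSRAM placement used on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kArenaSize = 8 * 1024;  // hypothetical size for illustration

// alignas(16) guarantees the first byte sits on a 16-byte boundary.
alignas(16) static std::uint8_t tensor_arena[kArenaSize];

// True when `p` is a multiple of `alignment`, which must be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

A check like is_aligned(tensor_arena, 16) at startup turns a silent misalignment into an explicit failure, which is far cheaper than chasing it through wrong inference results.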

Testing without the hardware was its own problem. The firmware could not be unit tested in any conventional sense. I ended up simulating individual components in isolation and testing inference in Colab, then flashing and testing the integration on device. Each flash cycle was slow. Each test required physical interaction. It was the least efficient development process I have ever worked in, and somehow it produced something that actually worked.

The architecture that almost did not fit

Quantization was not optional. Float32 models consumed four times the memory of their int8 equivalents and exceeded the budget immediately. Post-training int8 quantization reduced model sizes by about 75% with less than two percent accuracy drop. But even quantized, fitting both models alongside the firmware required moving the tensor arenas out of internal SRAM entirely.
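The four-to-one ratio comes straight from the storage types: each float32 value takes four bytes and each int8 value takes one. TensorFlow Lite's int8 scheme maps between them with an affine transform, real = scale * (q - zero_point). A minimal sketch of that mapping follows; the struct and function names are hypothetical, and in practice the converter chooses scale and zero_point per tensor during post-training quantization.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// One byte per value instead of four is where the ~75% saving comes from.
static_assert(sizeof(float) == 4 * sizeof(std::int8_t),
              "float32 is four times the size of int8");

// Affine quantization parameters: real ~= scale * (q - zero_point).
struct QuantParams {
    float scale;
    int zero_point;
};

// Map a real value to int8, saturating at the ends of the range.
inline std::int8_t quantize(float x, QuantParams p) {
    const long q = std::lround(x / p.scale) + p.zero_point;
    return static_cast<std::int8_t>(std::min(127L, std::max(-128L, q)));
}

// Map an int8 value back to the real number it represents.
inline float dequantize(std::int8_t q, QuantParams p) {
    return p.scale * static_cast<float>(static_cast<int>(q) - p.zero_point);
}
```

The accuracy cost comes from the rounding in quantize: any value is recovered only to within half a step of scale, and values outside the representable range are clipped.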

// Both arenas live in PSRAM, not internal SRAM
EXT_RAM_BSS_ATTR static uint8_t kws_arena[KWS_ARENA_SIZE];
EXT_RAM_BSS_ATTR static uint8_t slu_arena[SLU_ARENA_SIZE];

The arena sizes themselves had to be determined empirically. There is no formula. You start high, run inference on device, reduce by 1KB until it crashes, add 2KB margin. I did this separately for both models, on hardware, which meant many flash cycles and many crashes before I had stable numbers.
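The search procedure itself is mechanical enough to write down. In the sketch below, the runs_ok callback stands in for the manual step of flashing the board and seeing whether inference survives; it, the function name, and the choice to add the margin to the last size that worked are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Walk the arena size down in fixed steps until a trial fails, then pad the
// last size that worked. `runs_ok(size)` is a stand-in for "flash, boot, run
// inference on device, and report whether it survived".
std::size_t tune_arena_bytes(std::size_t start_bytes,
                             std::size_t step_bytes,
                             std::size_t margin_bytes,
                             const std::function<bool(std::size_t)>& runs_ok) {
    assert(step_bytes > 0);
    const bool start_ok = runs_ok(start_bytes);
    assert(start_ok);  // the starting size must already work
    (void)start_ok;

    std::size_t last_good = start_bytes;
    while (last_good > step_bytes && runs_ok(last_good - step_bytes)) {
        last_good -= step_bytes;
    }
    return last_good + margin_bytes;
}
```

Because every call to runs_ok is a flash cycle on real hardware, the step size trades precision for time: 1KB steps give tight numbers at the cost of many more crashes along the way.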

I had spent real time squeezing accuracy out of these models in Colab, going back and fixing the dataset, adjusting augmentation parameters, running ablation studies on architecture choices. Finding out that the bidirectional layers I had worked hard to keep had to be removed entirely was the kind of setback that tests whether you actually care about the outcome or just about your prior work. I cared about the outcome. I removed them.

The paper: page limits, LaTeX, and a cold email

I had not planned to write a paper. The project had grown well beyond its original scope and I had results that I thought were worth documenting properly, but I had no co-author and no institutional affiliation to submit under. I wrote a cold email to a professor at Delhi Technological University whose research overlapped with what I had built, explained what I had done, and asked if they would be willing to co-author and provide the institutional backing for submission.

They said yes. I am grateful for that.

Learning LaTeX in parallel with writing the paper was its own adventure. The conference had a page limit, paying for extra pages was not something I could afford, and the full version of what I wanted to document was significantly longer than what fit. The extended version, with all the tables, the full engineering challenges section, and the decision logs, is in the GitHub repository.

What I know now that I did not know at the start

The system works. It achieves 96.4% keyword spotting accuracy and 94.1% intent classification accuracy, runs entirely offline, costs under fifteen dollars in hardware, and lasts nearly eight days on a standard power bank. Those numbers are real. They also come from an evaluation on my voice, in my environment, under conditions I controlled.

The hard problem I did not solve is generalisation. The models degrade on accents, on speakers they have not seen, on linguistic variation I did not capture in a dataset recorded by one person in one room. This is not a small limitation. It is the reason the system cannot be deployed as-is to the population it was designed for. Solving it within the memory constraints of an MCU-class device, without retraining on device, is a genuinely hard research problem. It is the one I want to work on next.

What this project taught me is harder to quantify. I learned that constraints are not obstacles to work around. They are the thing that forces interesting decisions. I learned that being unsupervised is uncomfortable but not fatal. I learned that curiosity sustained over a long enough period will eventually produce something, even when the path is not clear and the resources are borrowed from other people's hobby shelves.

I started this with an ultrasonic sensor and a function that sent an email. I ended it with a published paper about cascading neural networks on embedded hardware. The distance between those two things was covered entirely on foot, one problem at a time.

The complete implementation, including every engineering decision, failure mode, and workaround documented in detail, is at Shiv07ansh/AIoT-Medicine-Reminder. The paper is available as a preprint at zenodo.org/records/19034554.