Calibronic Blog | TinyML Tutorials & Embedded AI Projects

Recording Audio with STM32L4 and MEMS Microphones For TinyML Voice Applications

Acronyms

DFSDM: Digital filter for sigma-delta modulators
FOSR: Filter Oversampling Ratio
IOSR: Integrator Oversampling Ratio

TinyML applications are expanding rapidly across domains, and one of the most exciting areas is voice interfaces, where humans can control and interact with devices—like microwaves, coffee makers, or other home appliances—using their voice. To build such a system, three main components are required:

MEMS Microphone: To capture voice input
MCU: To run ML/DL inference on the audio
Actuator: To communicate with the MCU and take action in the system

In this post, I’ll guide you through the first step: recording audio in WAV format using a MEMS microphone and the STM32L4’s audio peripherals. Subsequent posts will cover the other components of a complete voice assistant system. As an STM32 fan, I’ll be using STM32 MCUs for all my TinyML projects. STM32 offers three common built-in audio peripherals; in this tutorial, I’ll focus on DFSDM.

Understanding DFSDM

DFSDM stands for Digital Filter for Sigma-Delta Modulators. It has two main stages:

Input stage – receives the high-frequency bitstream from a digital microphone
Filtering stage – includes a low-pass digital filter and a decimator

Why filtering and decimation?

Low-pass filter removes high-frequency noise.
Decimator downsamples the PDM bitstream from the MEMS microphone (e.g., the MP34DT01, clocked at 2–3 MHz) to a usable 16 kHz PCM signal for voice recognition.

Each PCM audio sample has 24-bit resolution, which we need to consider when writing MCU programs.
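As a rough illustration of what "filter and decimate" means, here is a minimal C sketch of the simplest possible case: a first-order (Sinc1) moving-sum decimator. It is not the DFSDM's hardware implementation and not the FastSinc filter used later in this post. The function name pdm_decimate_sinc1 and the one-bit-per-byte input layout are assumptions made for readability.

```c
#include <stddef.h>
#include <stdint.h>

/* Sinc1 (moving-sum) decimator sketch.
 * Every `ratio` PDM bits are summed into one PCM sample.
 * A PDM bit of 1 counts as +1 and a bit of 0 as -1, so each
 * output sample lies in the range [-ratio, +ratio].
 * Assumption: pdm_bits holds one bit per byte (0 or non-zero).
 * Returns the number of PCM samples written to pcm_out. */
size_t pdm_decimate_sinc1(const uint8_t *pdm_bits, size_t n_bits,
                          unsigned ratio, int32_t *pcm_out)
{
    size_t n_out = 0;

    if (ratio == 0) {
        return 0;
    }
    for (size_t i = 0; i + ratio <= n_bits; i += ratio) {
        int32_t acc = 0;
        for (unsigned k = 0; k < ratio; ++k) {
            acc += pdm_bits[i + k] ? 1 : -1;
        }
        pcm_out[n_out++] = acc;
    }
    return n_out;
}
```

A constant stream of ones produces the maximum output (+ratio), and a stream that alternates between 0 and 1 averages out to zero, which is the PDM encoding of silence.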

Program Overview

In this tutorial, the goal is to write a program that receives real-time 24-bit PCM data from the DFSDM via DMA1 Channel 2.

The raw 24-bit samples often have more resolution than needed, so we downscale them to 16-bit PCM.
These samples are stored in a ring buffer:

#define kAudioCaptureBufferSize (1024 * 16) // 16,384 samples
int16_t g_audio_capture_buffer[kAudioCaptureBufferSize]; // 16 bits per sample

Memory usage: 16,384 × 2 bytes = 32 KB

Duration: 16,384 samples ÷ 16,000 Hz ≈ 1.024 seconds
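Writing into such a ring buffer could look like the sketch below. The write index g_write_index and the function audio_ring_write are hypothetical names introduced only for this example; the buffer and its size follow the declaration above.

```c
#include <stddef.h>
#include <stdint.h>

#define kAudioCaptureBufferSize (1024 * 16) /* 16,384 samples */

static int16_t g_audio_capture_buffer[kAudioCaptureBufferSize];
static size_t g_write_index = 0; /* hypothetical: next slot to write */

/* Append `count` 16-bit samples, wrapping to the start of the
 * buffer when the end is reached (oldest samples are overwritten). */
void audio_ring_write(const int16_t *samples, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        g_audio_capture_buffer[g_write_index] = samples[i];
        g_write_index = (g_write_index + 1) % kAudioCaptureBufferSize;
    }
}
```

Because the buffer holds about one second of audio at 16 kHz, a consumer has to read the samples out within that time before they are overwritten.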

Storing Audio in WAV Format

DMA1 Channel 1 is set up to send audio samples from the ring buffer over USART. On the PC, a Python script receives the data and saves it as a WAV file. This keeps the work on the MCU small: the microcontroller only streams samples, and the PC turns them into a file for listening or later use.
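For readers who want to see what "saving as WAV" involves, the sketch below fills the standard 44-byte header of a PCM WAV file. In this post the file is written by a Python script on the PC, so this C version is only an illustration of the header fields, not the code used here. The function name wav_fill_header is an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Store 16-bit and 32-bit values in little-endian byte order. */
static void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xFF);
    p[1] = (uint8_t)(v >> 8);
}

static void put_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; ++i) {
        p[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Fill a canonical 44-byte RIFF/WAVE header for uncompressed PCM.
 * data_bytes is the size of the audio payload that follows. */
void wav_fill_header(uint8_t hdr[44], uint32_t sample_rate,
                     uint16_t channels, uint16_t bits_per_sample,
                     uint32_t data_bytes)
{
    uint16_t block_align = (uint16_t)(channels * (bits_per_sample / 8));

    memcpy(hdr, "RIFF", 4);
    put_le32(hdr + 4, 36 + data_bytes);          /* file size minus 8 */
    memcpy(hdr + 8, "WAVE", 4);
    memcpy(hdr + 12, "fmt ", 4);
    put_le32(hdr + 16, 16);                      /* fmt chunk size (PCM) */
    put_le16(hdr + 20, 1);                       /* audio format 1 = PCM */
    put_le16(hdr + 22, channels);
    put_le32(hdr + 24, sample_rate);
    put_le32(hdr + 28, sample_rate * block_align); /* bytes per second */
    put_le16(hdr + 32, block_align);             /* bytes per frame */
    put_le16(hdr + 34, bits_per_sample);
    memcpy(hdr + 36, "data", 4);
    put_le32(hdr + 40, data_bytes);
}
```

For this project's format the call would be wav_fill_header(hdr, 16000, 1, 16, data_bytes), followed by the raw 16-bit little-endian samples.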

In this tutorial, I am using the B-L4S5I-IOT01A evaluation board, which has an STM32L4S5VIT6 and two MP34DT01-M microphones. The two microphones share the data and clock lines and are connected as shown in the diagram below. The LR pin (Left/Right channel select) of one microphone is connected to VCC, and that of the other is connected to GND. A microphone with LR tied to VCC sends its bits on the clock’s rising edge, and one with LR tied to GND sends them on the clock’s falling edge. This is important to remember when configuring the DFSDM channel. Since I am using only one microphone (rising-edge data), I will configure one DFSDM channel and one filter.

DFSDM Configuration

Before using the DFSDM, we need to answer some important questions.

What sample rate will you use for recording, like 16 kHz, 32 kHz, or 48 kHz?
What is the bit depth for each sample, for example 16-bit or 24-bit?
How many channels do you need, 1 for mono or 2 for stereo?
What is the maximum frequency your MEMS microphone can handle?
How are the microphones connected to your PCB?

Answering these questions tells us how to set up the DFSDM correctly. For this tutorial, my answers are:

The sample rate is 16 kHz because it is suitable for low-bitrate audio and speech recognition on the STM32.
The bit depth is 24-bit since the STM32 DFSDM has a 24-bit resolution.
I am using mono because only one microphone is active.
The microphone clock is 3.2 MHz because I am using a digital MEMS microphone, the MP34DT01-M from STMicroelectronics, which supports up to 3.25 MHz.

The two microphones share the PDM data and clock lines and connect to the MCU’s DFSDM DATIN2 and DFSDM CKOUT pins. One microphone sends data on the falling edge and the other on the rising edge, creating interleaved left/right data. For this wiring, STM32CubeIDE offers the option of PDM/SPI input from channel 2 with an internal clock. You might ask about SPI. Yes, SPI is a separate peripheral, but STM32CubeIDE calls this SPI mode because the PDM interface, with one clock line and one data line, behaves like SPI.

DFSDM Clock Configuration:

The first step is to configure the clock frequency that the DFSDM will use to drive the microphone. Since I run the MEMS microphone at 3.2 MHz and my MCU clock is 80 MHz, we need to calculate the proper divider value. The divider can be found using the formula:

Divider = DFSDM clock ÷ microphone clock = 80 MHz ÷ 3.2 MHz = 25
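The divider calculation can be sketched as a small C helper. The function name dfsdm_ckout_divider is hypothetical, and the allowed range of 2 to 256 is my understanding of the DFSDM output clock divider, so check it against the reference manual for your part. The helper rounds up so that the resulting microphone clock never exceeds the requested maximum.

```c
#include <stdint.h>

/* Smallest divider for which dfsdm_clk_hz / divider <= mic_max_hz.
 * Assumed valid divider range: 2..256.
 * Returns 0 if no valid divider exists. */
uint32_t dfsdm_ckout_divider(uint32_t dfsdm_clk_hz, uint32_t mic_max_hz)
{
    uint32_t div;

    if (mic_max_hz == 0) {
        return 0;
    }
    div = (dfsdm_clk_hz + mic_max_hz - 1) / mic_max_hz; /* round up */
    if (div < 2) {
        div = 2;
    }
    if (div > 256) {
        return 0;
    }
    return div;
}
```

With an 80 MHz DFSDM clock, both a 3.2 MHz target and the microphone's 3.25 MHz limit give a divider of 25, that is, an actual microphone clock of 3.2 MHz.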

DFSDM Filter Configuration

There are three important parameters that must be set before using the DFSDM. These are two low-pass filter parameters, FOSR and IOSR, and one output scaling value called RightBitShift. The filter parameters control the decimation of the audio signal, while RightBitShift scales the final output to avoid overflow. Based on the information given earlier, the values of FOSR and IOSR can be calculated using the formula:

Sample rate = microphone clock ÷ (FOSR × IOSR)
FOSR × IOSR = 3.2 MHz ÷ 16 kHz = 200

With IOSR = 1, this gives FOSR = 200.
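The same arithmetic as a C sketch, assuming the simple relation sample rate = microphone clock ÷ (FOSR × IOSR). The function name dfsdm_fosr_for_rate is hypothetical. It returns 0 when the requested rate cannot be reached exactly, which is a convenient way to spot combinations that would give an off-target sample rate.

```c
#include <stdint.h>

/* FOSR needed to reach sample_rate_hz from mic_clk_hz with a given IOSR.
 * Assumes sample_rate = mic_clk / (FOSR * IOSR).
 * Returns 0 if the ratio is not an exact integer. */
uint32_t dfsdm_fosr_for_rate(uint32_t mic_clk_hz, uint32_t sample_rate_hz,
                             uint32_t iosr)
{
    uint32_t total;

    if (sample_rate_hz == 0 || iosr == 0) {
        return 0;
    }
    total = mic_clk_hz / sample_rate_hz;  /* FOSR * IOSR */
    if (total * sample_rate_hz != mic_clk_hz) {
        return 0;                         /* rate does not divide evenly */
    }
    if (total % iosr != 0) {
        return 0;
    }
    return total / iosr;
}
```

For example, a 3.2 MHz clock reaches 16 kHz exactly, but it cannot reach 48 kHz exactly, so a different microphone clock would be needed for that rate.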

What is RightBitShift and how do we choose the correct value for it?

To answer this question, we first need to understand how the Sinc filter is configured and what the bit depth of each data sample is. In this audio recording tutorial, I want to downscale each data sample from 32-bit to 16-bit PCM. The main reason for this is to save memory, and for speech recognition on STM32, this reduction does not affect performance.

Let’s review our setup:

FOSR (Filter Oversampling Ratio) = 200
IOSR (Integrator Oversampling Ratio) = 1
Filter type: FastSinc
DFSDM input clock: 3.2 MHz
Target: 16-bit PCM (int16_t)

Before calculating the RightBitShift value, let’s go over a few audio signal concepts. In audio, a DC value corresponds to a steady, unchanging input. The gain at DC of a filter shows how much a constant input will be amplified. For the DFSDM, which converts 1-bit PDM into multi-bit samples:

Gain at DC = maximum multiplier the filter applies to a constant PDM signal

Let’s clarify this with an example. Suppose the PDM input is always “1” (maximum +1). If the filter has a DC gain of 400, then after decimation, the output will be 400 in digital units (before any scaling).

Since our filter type is FastSinc, the gain at DC is:

Gain = 2 × FOSR² = 2 × 200² = 80,000

Now let’s calculate the maximum output of the DFSDM. With the integrator included, the formula is:

Max output = Gain × IOSR = 80,000 × 1 = 80,000

Because the STM32’s DFSDM has 24-bit resolution, the maximum value of the PCM data sample can be expressed as:

2^23 − 1 = 8,388,607

Without shifting, the raw DFSDM output range is:

−80,000 to +80,000

This fits easily in 24 bits but exceeds the 16-bit range of −32,768 to +32,767. If we apply RightBitShift = 0x03 (shift by 3):

80,000 ÷ 2^3 = 10,000

So the scaled output range becomes:

−10,000 to +10,000

This ensures the signal fits within the desired 16-bit PCM range without overflow.
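The check behind choosing a shift can be written as a small helper that takes the peak filter output as a parameter and returns the smallest right shift that fits a signed target width. This is a host-side sketch of the arithmetic only; on the MCU the shift is applied by the DFSDM hardware. The function name min_right_shift is hypothetical.

```c
#include <stdint.h>

/* Smallest right shift so that `peak` fits in a signed integer of
 * `out_bits` bits, i.e. (peak >> shift) <= 2^(out_bits - 1) - 1.
 * Valid for out_bits in 2..32; returns 0 for other widths. */
unsigned min_right_shift(uint32_t peak, unsigned out_bits)
{
    uint32_t limit;
    unsigned shift = 0;

    if (out_bits < 2 || out_bits > 32) {
        return 0;
    }
    limit = (1u << (out_bits - 1)) - 1u;
    while ((peak >> shift) > limit) {
        ++shift;
    }
    return shift;
}
```

The result is a lower bound. Choosing a slightly larger shift gives up a little resolution in exchange for extra headroom.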

In this tutorial, I am going to show how to use DMA with DFSDM and USART to reduce the MCU’s workload. Think of the Cortex-M core as the king of the MCU. It is smart, makes decisions, and runs the main logic, but it should not waste time on repetitive tasks. For that, it relies on DMA, which acts like a dedicated worker. The DMA is very efficient at doing one simple job over and over: moving data between peripherals and RAM. Since audio recording fits into this model, we will use two DMA1 channels. The first one is responsible for capturing the filtered samples from the DFSDM and saving them into RAM.

Each sample is stored as a 32-bit integer (4 bytes), and we use a buffer of 1024 elements. Then, the second DMA1 channel takes the processed data and sends it through USART1 Tx, where it is received by a Python script on the PC and saved as a WAV file. Here is the overall process:

In the DFSDM DMA configuration, enable Memory Increment so that the samples are stored sequentially in the buffer, and set the data width to Word (32 bits), since each 24-bit sample is delivered in a 32-bit word. For the DMA mode, choose Circular. This mode triggers two callback functions: one when the first half of the buffer is filled, and another when the second half of the buffer is filled.
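The half-buffer scheme can be sketched in plain C as follows. The names on_dma_half_complete, on_dma_full_complete, and next_ready_half are hypothetical. In an STM32 HAL project, the two notification functions would be called from the HAL's DFSDM half-complete and complete conversion callbacks. The flags are kept deliberately simple and do not handle the case where the main loop falls behind.

```c
#include <stddef.h>
#include <stdint.h>

#define DFSDM_DMA_BUFFER_LEN 1024

int32_t dfsdm_dma_buffer[DFSDM_DMA_BUFFER_LEN]; /* filled by circular DMA */

/* Hypothetical flags set from interrupt context. */
static volatile int g_first_half_ready = 0;
static volatile int g_second_half_ready = 0;

/* Call when the DMA has filled the first half of the buffer. */
void on_dma_half_complete(void)
{
    g_first_half_ready = 1;
}

/* Call when the DMA has filled the second half of the buffer. */
void on_dma_full_complete(void)
{
    g_second_half_ready = 1;
}

/* Returns the half that is safe to read, or NULL if none is ready.
 * While one half is processed, the DMA keeps writing the other. */
const int32_t *next_ready_half(size_t *len)
{
    *len = DFSDM_DMA_BUFFER_LEN / 2;
    if (g_first_half_ready) {
        g_first_half_ready = 0;
        return &dfsdm_dma_buffer[0];
    }
    if (g_second_half_ready) {
        g_second_half_ready = 0;
        return &dfsdm_dma_buffer[DFSDM_DMA_BUFFER_LEN / 2];
    }
    return NULL;
}
```

The main loop would poll next_ready_half and process 512 samples at a time, always working on the half that the DMA is not currently writing.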

Before the data is sent to USART, we need to process the DFSDM output. The DFSDM produces 32-bit signed values, and the buffer dfsdm_dma_buffer[] is declared as int32_t. After decimation, the valid audio sample width is typically 24 bits, but most audio pipelines expect 16-bit PCM. For this reason, we convert the data before transmission.

Given a sample rate of 16 kHz and a buffer size of 1024, a DMA callback fires every 32 ms (512 samples ÷ 16,000 Hz). Therefore, the USART baud rate must be fast enough to send out each half of the buffer within this time frame. For example, if we use a baud rate of 921600 and assume the usual 8N1 framing (10 bits on the wire per byte), the transmission time for one half of the buffer is:

512 samples × 2 bytes × 10 bits ÷ 921,600 bit/s ≈ 11.1 ms

This is well within the 32 ms window.

This is a great advantage because we can start sending the first half of the buffer to USART as soon as the half-complete callback is triggered, and then send the second half when the complete callback is called. In this way, audio samples are streamed in near real time.
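The conversion and the timing budget can be sketched together in C. Two assumptions are built in and should be verified against your setup: the 24-bit sample is taken to sit in the upper 24 bits (bits 31:8) of each 32-bit word, and the UART frame is taken to be 10 bits per byte (8N1). All function names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert one raw 32-bit DFSDM word to 16-bit PCM.
 * Assumption: the sample occupies bits 31:8; the low 8 bits are not audio.
 * Relies on arithmetic right shift of negative values, as GCC/Clang do. */
int16_t dfsdm_word_to_pcm16(int32_t raw)
{
    int32_t v = raw >> 8;          /* recover the 24-bit signed sample */

    if (v > INT16_MAX) {
        v = INT16_MAX;             /* saturate instead of wrapping */
    }
    if (v < INT16_MIN) {
        v = INT16_MIN;
    }
    return (int16_t)v;
}

/* Convert a block of raw words, e.g. one half of the DMA buffer. */
void dfsdm_block_to_pcm16(const int32_t *in, int16_t *out, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        out[i] = dfsdm_word_to_pcm16(in[i]);
    }
}

/* Time in microseconds to send n_bytes at `baud`, with
 * bits_per_frame bits on the wire per byte (10 for 8N1). */
uint32_t uart_tx_time_us(size_t n_bytes, uint32_t baud,
                         unsigned bits_per_frame)
{
    if (baud == 0) {
        return 0;
    }
    return (uint32_t)(((uint64_t)n_bytes * bits_per_frame * 1000000u) / baud);
}
```

For one half-buffer of 512 samples (1024 bytes), uart_tx_time_us(1024, 921600, 10) evaluates to 11111 µs, comfortably below the 32 ms between callbacks. Saturating rather than truncating keeps an unexpectedly loud sample from wrapping around into a click.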