Implementation
We are building a real-time system, so the output data rate will necessarily be equal to the input data rate. In the previous section we saw that grains are produced via a periodic pattern whose period is equal to the stride length. It would make perfect sense, therefore, to set the length of the DMA buffer equal to the stride and let that be the cadence of the processing function.
Unfortunately this simple approach clashes with the capabilities of the hardware and so we need to trade resources for some extra code complexity: welcome to the world of embedded DSP!
Memory limitations
If we play around with the Jupyter notebook implementation of granular synthesis, we can quickly verify that the voice changer works best with a grain length of about 30ms and an overlap factor of about 50%. Using the formula derived in the previous section, this gives us a grain stride of 20ms.
Now, remember that the smallest sampling frequency of our digital microphone is 32 kHz, so that 20 ms correspond to 640 samples. Each sample is 2 bytes and the I2S protocol requires us to allocate a stereo buffer. This means that each DMA half-buffer will be 640 samples × 2 bytes × 2 channels = 2560 bytes.
Since we need to use double buffering for DMA, and since we need symmetric input and output buffers, in total we will need to allocate over 10KB of RAM to the DMA buffers alone; when we start adding the internal buffering required for computation, we are going to quickly exceed the 16KB available on the Nucleo F072RB!
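As a sanity check on these numbers, the naive memory budget can be tallied with a quick illustrative computation (not project code):

```c
// Back-of-the-envelope check of the naive DMA memory budget at 32 kHz:
// a 20 ms stride is 640 samples; each sample is 2 bytes and I2S is stereo.
#define STRIDE_SAMPLES   640
#define BYTES_PER_SAMPLE 2
#define CHANNELS         2

int naive_dma_bytes(void) {
    int half_buffer = STRIDE_SAMPLES * BYTES_PER_SAMPLE * CHANNELS; // 2560 bytes
    int per_stream  = 2 * half_buffer;   // double buffering
    return 2 * per_stream;               // symmetric input and output buffers
}
```

This evaluates to 10240 bytes, i.e. well over half of the 16 KB available before we have written a single line of processing code.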
(As a side note, although 16KB may seem ludicrously low these days, remember that small memory footprints are absolutely essential for all devices that are not a personal computer. The success of IoT hinges upon low memory and low power consumption!)
To avoid the need of large DMA buffers, we will implement granular synthesis using the following tricks:
to save memory, all processing will be carried out on a mono signal;
we will use a single internal circular buffer that holds enough data to build the grains; we have seen in the previous section that we need a buffer at most as long as the grain. Using mono samples, this will require a length of 1024 samples, for a memory footprint of 2 KBytes.
we will fill the internal buffer with short DMA input transfers and compute a corresponding amount of output samples for each DMA call; DMA transfers can be as short as 16 or 32 samples each, thereby reducing the amount of memory required by the DMA buffers.
we will use a "smart choice" for the size of the grain, the tapering and the DMA transfer, so as to minimize processing.
The code
To code the granular synthesis algorithm, copy and paste the Alien Voice project from within the STM32CubeIDE environment. We recommend choosing a name with the current date and "granular" in it. Remember to delete the old binary (ELF) file inside the copied project.
Here, we will set up the code for the "Darth Vader" voice transformer and will consider more advanced modifications in the next section.
DMA size
As we explained, the idea is to fill the main audio buffer in small increments to save memory. To this end, set the DMA half-buffer size to 32 samples in the USER CODE BEGIN PV section:
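For instance, with illustrative macro names (reuse whatever your Alien Voice project already defines):

```c
/* USER CODE BEGIN PV */
// Small DMA half-transfers keep the buffers tiny: 32 mono samples each.
#define HALF_BUFFER_SIZE 32
#define FULL_BUFFER_SIZE (2 * HALF_BUFFER_SIZE)   // double buffering
int16_t dma_in[2 * FULL_BUFFER_SIZE];   // stereo (L/R interleaved) input
int16_t dma_out[2 * FULL_BUFFER_SIZE];  // stereo output
/* USER CODE END PV */
```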
Grain size and taper
We will use a grain length of 1024 samples, which corresponds to about 30 ms for a sampling rate of 32 kHz. The overlap is set at 50%, i.e., we will use a tapering slope of 512 samples. The resulting grain stride is 1024 − 512 = 512 samples.
TASK 1: Write a short Python function that returns the values of a tapering slope for a given length.
Add the following lines to the USER CODE BEGIN 4 section in main.c, where the values for the tapering slope are those computed by your Python function:
Main buffer
We choose the buffer length to be equal to the size of the grain, since the voice transformer doesn't sound too good for longer grains anyway. With a size equal to a power of two, we will be able to use bit masking to enforce circular access to the buffer. Add the following lines after the previous ones:
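A sketch of such declarations, with illustrative names (the mask trick works precisely because the length is a power of two):

```c
#include <stdint.h>

#define GRAIN_LEN   1024              // buffer size = grain size, a power of two
#define BUFFER_MASK (GRAIN_LEN - 1)   // bit mask implementing circular wraparound
int16_t audio_buffer[GRAIN_LEN];

// Circular read: masking the index replaces an expensive modulo operation.
static inline int16_t buffer_read(uint32_t ix) {
    return audio_buffer[ix & BUFFER_MASK];
}
```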
With these values the buffers are set up for causal operation (i.e., for lowering the voice pitch); we will tackle the problem of noncausal operation later.
You can now examine the memory footprint of the application by compiling the code and looking at the "Build Analyzer" tab in the lower right corner of the IDE. You should see that we are using less than 30% of the onboard RAM.
Processing function
This is the main processing function:
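The actual firmware listing is in the solution project; to make the structure concrete, here is a hedged, self-contained offline sketch of the same overlap-add mechanism (all names, the Q15 taper arithmetic, and the nearest-neighbor resampler are illustrative, not the project's code):

```c
#include <stdint.h>
#include <stddef.h>

#define GRAIN_LEN 1024                     // grain size in samples
#define TAPER_LEN 512                      // linear tapering slope
#define STRIDE    (GRAIN_LEN - TAPER_LEN)  // a new grain starts every STRIDE samples

// Nearest-neighbor resampler: grain sample at output position m, read at
// rate alpha (alpha < 1 stretches the grain, i.e. lowers the pitch).
static int16_t resample_nn(const int16_t *x, size_t len, float alpha, size_t m) {
    size_t k = (size_t)(alpha * (float)m + 0.5f);
    return (k < len) ? x[k] : 0;
}

// Offline overlap-add: each grain is resampled, weighted by a trapezoidal
// taper and accumulated; overlapping taper regions sum to unit gain.
void granular(const int16_t *in, int16_t *out, size_t n, float alpha) {
    for (size_t i = 0; i < n; i++) out[i] = 0;
    for (size_t start = 0; start + GRAIN_LEN <= n; start += STRIDE) {
        for (size_t m = 0; m < GRAIN_LEN; m++) {
            int32_t v = resample_nn(in + start, GRAIN_LEN, alpha, m);
            int32_t w = 32767;  // flat part of the taper (Q15 weight)
            if (m < TAPER_LEN)
                w = (int32_t)m * 32767 / TAPER_LEN;                // fade in
            else if (m >= STRIDE)
                w = (int32_t)(GRAIN_LEN - m) * 32767 / TAPER_LEN;  // fade out
            out[start + m] += (int16_t)((v * w) >> 15);
        }
    }
}
```

The real-time version runs the same loop incrementally, producing HALF_BUFFER_SIZE output samples per DMA callback and reading its input from the circular buffer rather than from a flat array.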
The processing loop uses an auxiliary function Resample(uint16_t m, uint16_t N) that is supposed to return the interpolated value x(αm), where α is the resampling factor. A simplistic implementation is to return the sample with integer index closest to αm:
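For example (a sketch; the value of ALPHA and the buffer declaration are illustrative):

```c
#include <stdint.h>

#define ALPHA 0.7f       // illustrative resampling factor (< 1 lowers the pitch)
int16_t buffer[1024];    // grain buffer, filled by the DMA callback

// Nearest-neighbor "interpolation": round alpha*m to the closest integer index.
// The grain length N is unused in this crude version.
int16_t Resample(uint16_t m, uint16_t N) {
    (void)N;
    uint16_t k = (uint16_t)(ALPHA * (float)m + 0.5f);  // round to nearest
    return buffer[k];
}
```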
TASK 2: Write a version of Resample()
that performs proper linear interpolation between neighboring samples.
Benchmarking
Since our processing function is becoming a bit more complex than before, it is interesting to start benchmarking its performance.
Remember that, at 32 kHz, we can use at most 1/32000 s ≈ 31.25 µs per sample; we can modify the timing function to return the number of microseconds per sample like so:
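A possible shape for such a helper, assuming millisecond timestamps taken with HAL_GetTick() around a batch of processing calls (the names and the batch size are illustrative):

```c
#include <stdint.h>

// Average microseconds per sample over a batch of processing calls.
#define BENCH_CALLS      1000  // how many DMA callbacks we time in a row
#define HALF_BUFFER_SIZE 32    // samples produced per callback

float us_per_sample(uint32_t start_ms, uint32_t stop_ms) {
    float total_us = (float)(stop_ms - start_ms) * 1000.0f;   // ms -> us
    return total_us / (float)(BENCH_CALLS * HALF_BUFFER_SIZE);
}
```

Averaging over many calls compensates for the coarse 1 ms resolution of the tick counter.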
If we now use the method described before, we can see that the current implementation (with the full fractional resampling code) alternates between two per-sample execution times, both well below the limit. The oscillation between the two values reflects the larger computational requirements during the tapering slope.
Solutions
Are you ready to see the answer? :)