Implementation

We are building a real-time system, so the output data rate will necessarily be equal to the input data rate. In the previous section we saw that grains are produced via a periodic pattern whose period is equal to the stride length. It would make perfect sense, therefore, to set the length of the DMA buffer equal to the stride and let that be the cadence of the processing function.

Unfortunately this simple approach clashes with the capabilities of the hardware and so we need to trade resources for some extra code complexity: welcome to the world of embedded DSP!

Memory limitations

If we play around with the Jupyter notebook implementation of granular synthesis, we can quickly verify that the voice changer works best with a grain length of about 30ms and an overlap factor of about 50%. Using the formula derived in the previous section, this gives us a grain stride of 20ms.

Now, remember that the smallest sampling frequency of our digital microphone is 32KHz so that 20ms correspond to 640 samples. Each sample is 2 bytes and the I2S protocol requires us to allocate a stereo buffer. This means that each DMA half-buffer will be

$$2 \times 2 \times 640 = 2560 \text{ bytes.}$$

Since we need to use double buffering for DMA, and since we need symmetric input and output buffers, in total we will need to allocate four such half-buffers, that is, 10240 bytes (10KB) of RAM for the DMA buffers alone; when we start adding the internal buffering required for computation, we are going to quickly exceed the 16KB available on the Nucleo F072RB!
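The budget above can be checked with a few lines of C. This is only a sketch of the arithmetic: the helper names `dma_half_buffer_bytes` and `dma_total_bytes` are hypothetical and are not part of the project code.

```c
#include <stdint.h>

// Assumptions taken from the text: 16-bit samples, stereo I2S frames.
#define BYTES_PER_SAMPLE 2u
#define CHANNELS 2u

// Size in bytes of one DMA half-buffer holding `frames` stereo frames.
static uint32_t dma_half_buffer_bytes(uint32_t frames) {
  return BYTES_PER_SAMPLE * CHANNELS * frames;
}

// Total DMA memory: two halves (double buffering) for each of the
// two directions (input and output).
static uint32_t dma_total_bytes(uint32_t frames) {
  return 2u * 2u * dma_half_buffer_bytes(frames);
}
```

With 640 frames per half-buffer the total is 10240 bytes, whereas with 32 frames it drops to 512 bytes, which is the motivation for the short DMA transfers used in the rest of this section.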

(As a side note, although 16KB may seem ludicrously low these days, remember that small memory footprints are absolutely essential for all devices that are not a personal computer. The success of IoT hinges upon low memory and low power consumption!)

To avoid the need for large DMA buffers, we will implement granular synthesis using the following tricks:

  • to save memory, all processing will be carried out on a mono signal;

  • we will use a single internal circular buffer that holds enough data to build the grains; we have seen in the previous section that we need a buffer at most as long as the grain. Using mono samples, this will require a length of 1024 samples, for a memory footprint of 2 KBytes.

  • we will fill the internal buffer with short DMA input transfers and compute a corresponding amount of output samples for each DMA call; DMA transfers can be as short as 16 or 32 samples each, thereby reducing the amount of memory required by the DMA buffers.

  • we will use a "smart choice" for the size of the grain, the tapering and the DMA transfer, so as to minimize processing.

The code

To code the granular synthesis algorithm, copy and paste the Alien Voice project from within the STM32CubeIDE environment. We recommend choosing a name with the current date and "granular" in it. Remember to delete the old binary (ELF) file inside the copied project.

Here, we will set up the code for the "Darth Vader" voice transformer and will consider more advanced modifications in the next section.

DMA size

As we explained, the idea is to fill the main audio buffer in small increments to save memory. To this end, set the DMA half-buffer size to 32 samples in the USER CODE BEGIN PV section:

#define FRAMES_PER_BUFFER 32

Grain size and taper

We will use a grain length of $L=1024$ samples, which corresponds to 32ms at a sampling rate of 32KHz. The overlap is set at 50%, i.e., we will use a tapering slope of $W=384$ samples. The resulting grain stride is $S = L - W = 640$ samples, or 20ms.

TASK 1: Write a short Python function that returns the values of a tapering slope for a given length.

Add the following lines to the USER CODE BEGIN 4 section in main.c, where the values for the tapering slope are those computed by your Python function:

// grain length; 1024 samples correspond to 32ms @ 32KHz
#define GRAIN_LEN 1024
// length of the tapering slope using 50% overlap
#define TAPER_LEN 384
#define GRAIN_STRIDE (GRAIN_LEN - TAPER_LEN)

// tapering slope, from 0 to 1 in TAPER_LEN steps
static int32_t TAPER[TAPER_LEN] = {...};

Main buffer

We choose the buffer length to be equal to the size of the grain, since the voice transformer doesn't sound too good for $\alpha > 1.5$ anyway. With a size equal to a power of two, we will be able to use bit masking to enforce circular access to the buffer. Add the following lines after the previous ones:

#define BUF_LEN 1024
#define BUFLEN_MASK (BUF_LEN-1)
static int16_t buffer[BUF_LEN];

// input index for inserting DMA data
static uint16_t buf_ix = 0;
// index to beginning of current grain
static uint16_t curr_ix = 0;
// index to beginning of previous grain
static uint16_t prev_ix = BUF_LEN - GRAIN_STRIDE;
// index of sample within grain
static uint16_t grain_m = 0;

With these values the buffers are set up for causal operation (i.e., for lowering the voice pitch); we will tackle the problem of noncausal operation later.
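To see why a power-of-two buffer length makes circular access cheap, here is a minimal sketch of the masking trick. The helpers `wrap_index` and `advance_grain` are hypothetical names introduced only for illustration; the constants are the ones defined above.

```c
#include <stdint.h>

#define BUF_LEN 1024
#define BUFLEN_MASK (BUF_LEN - 1)
#define GRAIN_STRIDE 640

// For a power-of-two BUF_LEN, masking with BUF_LEN - 1 gives the same
// result as ix % BUF_LEN, without a division.
static uint16_t wrap_index(uint32_t ix) {
  return (uint16_t)(ix & BUFLEN_MASK);
}

// Move a grain start index forward by one stride, wrapping around
// the end of the circular buffer.
static uint16_t advance_grain(uint16_t ix) {
  return wrap_index((uint32_t)ix + GRAIN_STRIDE);
}
```

Note that advancing the initial value of `prev_ix` (that is, `BUF_LEN - GRAIN_STRIDE = 384`) by one stride lands exactly on the initial value of `curr_ix` (zero), so the two indices start out one stride apart.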

You can now examine the memory footprint of the application by compiling the code and looking at the "Build Analyzer" tab on the lower right corner of the IDE. You should see that we are only using less than 30% of the onboard RAM.

Processing function

This is the main processing function:

inline static void VoiceEffect(int16_t *pIn, int16_t *pOut, uint16_t size) {
  // put LEFT channel samples to mono buffer
  for (int n = 0; n < size; n += 2) {
    buffer[buf_ix++] = pIn[n];
    buf_ix &= BUFLEN_MASK;
  }

  // compute output samples
  for (int n = 0; n < size; n += 2) {
    // sample from current grain
    int16_t y = Resample(grain_m, curr_ix);
    // if we are in the overlap zone, compute sample from previous grain and mix using tapering slope
    if (grain_m < TAPER_LEN) {
      int32_t z = Resample(grain_m + GRAIN_STRIDE, prev_ix) * (0x07FFF - TAPER[grain_m]);
      z += y * TAPER[grain_m];
      y = (int16_t)(z >> 15);
    }
    // put sample into both LEFT and RIGHT output slots
    pOut[n] = pOut[n+1] = y;
    // update index inside grain; if we are at the end of the stride, update buffer indices
    if (++grain_m >= GRAIN_STRIDE) {
      grain_m = 0;
      prev_ix = curr_ix;
      curr_ix = (curr_ix + GRAIN_STRIDE) & BUFLEN_MASK;
    }
  }
}
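The mixing step in the overlap zone can be looked at in isolation. The sketch below assumes, based on the `0x07FFF` constant and the 15-bit shift in the loop, that the taper values are stored in Q15 format (integers between 0 and 0x7FFF representing weights between 0 and 1). The function name `crossfade_q15` is hypothetical.

```c
#include <stdint.h>

// Full-scale weight in the Q15 convention assumed for the taper table.
#define Q15_ONE 0x7FFF

// Mix the current-grain sample with the previous-grain sample.
// w is the taper value in [0, Q15_ONE]: w = 0 selects prev,
// w = Q15_ONE selects cur (both up to a tiny scaling error).
static int16_t crossfade_q15(int16_t cur, int16_t prev, int32_t w) {
  // the sum of the two products stays below 2^30, so it fits in 32 bits
  int32_t z = (int32_t)prev * (Q15_ONE - w);
  z += (int32_t)cur * w;
  // scale back to 16 bits; right-shifting a negative value is
  // implementation-defined in C (arithmetic shift with GCC on ARM)
  return (int16_t)(z >> 15);
}
```

Because 0x7FFF is slightly less than a full-scale weight of one, a weight at either end of the range returns the selected sample reduced by at most one unit (for instance 2000 becomes 1999), which is inaudible in practice.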

The processing loop uses an auxiliary function Resample(uint16_t m, uint16_t N) that is supposed to return the interpolated value $x(N + \alpha m)$.

A simplistic implementation is to return the sample at the integer part of $N + \alpha m$ (the fractional part is discarded by the right shift, so the index is truncated rather than rounded):

// rate change factor
static int32_t alpha = (int32_t)(0x7FFF * 2.0 / 3.0);

inline static int16_t Resample(uint16_t m, uint16_t start) {
  // non-integer index
  int32_t t = alpha * (int32_t)m;
  int16_t T = (int16_t)(t >> 15) + (int16_t)start;
  return buffer[T & BUFLEN_MASK];
}
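The fixed-point index computation is worth checking by hand. The following sketch reproduces only the scaling step of the function above; `alpha_q15` and `scaled_index` are hypothetical names, and the rate change factor is the same 2/3 used above.

```c
#include <stdint.h>

// Rate change factor 2/3 in Q15: (int32_t)(32767 * 2 / 3) = 21844.
static const int32_t alpha_q15 = (int32_t)(0x7FFF * 2.0 / 3.0);

// Integer part of alpha * m, computed with a 32-bit product and a
// 15-bit right shift; the fractional part is simply dropped.
static int32_t scaled_index(uint16_t m) {
  return (alpha_q15 * (int32_t)m) >> 15;
}
```

Since 21844/32768 is slightly smaller than 2/3, products that should be exact integers fall just short: for m = 3 the ideal value is 2 but the function returns 1. This is one more reason why proper interpolation between neighboring samples is preferable to picking a single sample.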

TASK 2: Write a version of Resample() that performs proper linear interpolation between neighboring samples.

Benchmarking

Since our processing function is becoming a bit more complex than before, it is interesting to start benchmarking its performance.

Remember that, at 32KHz, the sampling period is $1/32000$ s $= 31.25\mu s$, so we can use at most about $31\mu s$ per sample; we can modify the timing function to return the number of microseconds per sample like so:

#define STOP_TIMER {\
  timer_value_us = 1000 * __HAL_TIM_GET_COUNTER(&htim2) / FRAMES_PER_BUFFER;\
  HAL_TIM_Base_Stop(&htim2); }

If we now use the method described before, we can see that the current implementation (with the full fractional resampling code) requires between $5.2\mu s$ and $8.5\mu s$ per sample, which is well below the limit. The oscillation between the two values reflects the larger computational requirements during the tapering slope.
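To put these figures in perspective, the measured time can be expressed as a fraction of the time available per sample. This is a small sketch of the arithmetic only; the helper names are hypothetical and the code is meant to run on a desktop machine, not on the board.

```c
// Time available per sample, in microseconds, at sampling rate fs_hz.
static double sample_period_us(double fs_hz) {
  return 1e6 / fs_hz;
}

// Fraction of the per-sample budget used, as a percentage.
static double cpu_load_percent(double us_per_sample, double fs_hz) {
  return 100.0 * us_per_sample / sample_period_us(fs_hz);
}
```

At 32KHz, the worst-case measurement of 8.5 microseconds per sample corresponds to a load of roughly 27% of the available time, and the best case of 5.2 microseconds to roughly 17%.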

Solutions

Are you ready to see the answer? :)
