The Formulas

Simple pitch shifting works by playing the audio either more slowly (to lower the pitch) or faster (to raise the pitch). This is just like spinning an old record at a speed different from its nominal RPM value.

In the digital world we can always simulate the effect of changing an analog playing speed by using fractional resampling; for a refresher on this technique please refer to Lecture 3.3.2 on Coursera. Resampling, however, has two problems:

  • the pitch of speech is changed but so is the overall speed, which we do not want;

  • the ratio of output to input samples for the operation is not one, so it cannot be implemented in real time.
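To make the second problem concrete, here is a minimal sketch of fractional resampling. It assumes linear interpolation between neighboring samples, which is only one of several possible interpolators; the function name `resample` and the parameter `factor` are hypothetical and chosen for illustration. A `factor` greater than one reads the input faster (higher pitch) and a `factor` smaller than one reads it more slowly (lower pitch).

```python
def resample(x, factor):
    """Read x at positions 0, factor, 2*factor, ... (assumed linear interpolation)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    # number of read positions that fall inside the input signal
    n_out = int((len(x) - 1) / factor) + 1
    y = []
    for m in range(n_out):
        t = m * factor            # fractional read position
        i = int(t)                # sample to the left of t
        frac = t - i              # distance from the left sample
        nxt = x[i + 1] if i + 1 < len(x) else x[i]
        y.append((1 - frac) * x[i] + frac * nxt)
    return y


x = [float(n) for n in range(9)]
faster = resample(x, 2.0)   # 5 output samples for 9 input samples
slower = resample(x, 0.5)   # 17 output samples for 9 input samples
```

In both cases the number of output samples differs from the number of input samples, which is why plain resampling changes the duration of the signal and cannot keep up with a real-time stream.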

To overcome these limitations, we can use granular synthesis: we split the input signal into chunks of a given length (the grains) and we perform resampling on each grain independently to produce a sequence of equal-length output grains.
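The idea can be sketched as follows, under simplifying assumptions: grains do not overlap, resampling uses linear interpolation, and reads past the end of the signal are clamped to the last sample. The function `granular_pitch_shift` and its parameters are hypothetical names used only for this illustration.

```python
def granular_pitch_shift(x, factor, grain_len):
    """Resample each grain independently; output has the same length as x."""
    y = [0.0] * len(x)
    last = len(x) - 1
    for start in range(0, len(x), grain_len):
        # every output grain starts at the same instant as its input grain
        for m in range(min(grain_len, len(x) - start)):
            t = start + m * factor        # fractional read position
            i = min(int(t), last)         # clamp reads past the end
            j = min(i + 1, last)
            frac = t - int(t)
            y[start + m] = (1 - frac) * x[i] + frac * x[j]
    return y


x = [float(n) for n in range(12)]
y = granular_pitch_shift(x, 2.0, 4)
# y == [0, 2, 4, 6, 4, 6, 8, 10, 8, 10, 11, 11] (as floats)
```

Each grain is read at the new speed, but the read position is reset at every grain boundary, so the output has exactly as many samples as the input.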

Grain rate vs length

To implement granular synthesis in real time we need to take into account the concepts of grain length and grain stride. A grain should be long enough to contain enough pitched speech for resampling to work, but short enough that it doesn't straddle too many different sounds in an utterance. Experimentally, the best results for speech are obtained using grains between 20 and 40 ms long.
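As a quick worked example, the grain length in samples follows directly from the sampling rate. The 16 kHz rate below is only an assumed value for illustration, and `ms_to_samples` is a hypothetical helper.

```python
def ms_to_samples(ms, sample_rate_hz):
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(sample_rate_hz * ms / 1000))


SAMPLE_RATE_HZ = 16000  # assumed sampling rate, not specified in the text
shortest = ms_to_samples(20, SAMPLE_RATE_HZ)  # 320 samples
longest = ms_to_samples(40, SAMPLE_RATE_HZ)   # 640 samples
```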

The grain stride indicates the displacement in samples between successive grains and it is a function of grain length and of the overlap between successive grains. With no overlap, the grain stride is equal to the grain length; however, overlap between neighboring grains is essential to reduce the artifacts due to the segmentation. Overlapping output grains are blended together using a tapering window; the window is designed so that it performs linear interpolation between samples from overlapping grains.
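A minimal sketch of such a window is shown below. It assumes that the overlap is expressed as a number of samples shared by two successive grains, so that the stride is the grain length minus the overlap; the function `tapering_window` and its parameters are hypothetical names. The window ramps up linearly over the overlap, stays at one in the middle, and ramps down linearly at the end.

```python
def tapering_window(grain_len, overlap_len):
    """Return (window, stride) for a linear crossfade over overlap_len samples."""
    if overlap_len < 0 or 2 * overlap_len > grain_len:
        raise ValueError("overlap must be between 0 and half the grain length")
    stride = grain_len - overlap_len      # assumed relation, see lead-in
    win = [1.0] * grain_len
    for i in range(overlap_len):
        win[i] = i / overlap_len                                  # fade in
        win[grain_len - overlap_len + i] = 1.0 - i / overlap_len  # fade out
    return win, stride


win, stride = tapering_window(8, 2)
# win == [0.0, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5], stride == 6

# overlap-add three windows: away from the edges the weights sum to one
acc = [0.0] * (2 * stride + len(win))
for k in range(3):
    for j, w in enumerate(win):
        acc[k * stride + j] += w
```

Because the fade-out of one grain and the fade-in of the next always sum to one, the blended output in the overlap region is a linear interpolation between the two grains.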

Note that the stride is constant for any amount of overlap and that each grain starts at the same instants independently of overlap; this is the key observation that will allow us to implement granular synthesis in real time.

The grains' content

Causality

By contrast, when we raise the pitch we are using subsampling, that is, samples are being discarded to create an output grain and so, to fill the grain, we will need to "look ahead" and borrow data from beyond the original grain's end boundary. The algorithm is therefore noncausal but, crucially, we can exactly quantify the amount of lookahead and handle it via buffering.
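As an illustration of how the lookahead can be quantified, the sketch below assumes that each output grain is obtained by reading the input at positions 0, f, 2f, ... from the grain's start, with linear interpolation between neighbors. The function `lookahead_samples` and the symbol `factor` (the resampling factor f) are hypothetical names.

```python
import math


def lookahead_samples(grain_len, factor):
    """Input samples needed beyond the grain's last sample (0 if none)."""
    last_read = (grain_len - 1) * factor   # read position of the last output sample
    last_index = math.ceil(last_read)      # rightmost input sample touched
    return max(0, last_index - (grain_len - 1))


raise_pitch = lookahead_samples(4, 2.0)   # reads up to index 6, grain ends at 3
lower_pitch = lookahead_samples(4, 0.5)   # all reads stay inside the grain
```

Under these assumptions the lookahead depends only on the grain length and on the resampling factor, so the required buffer size is known in advance.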

We will see in the next sections that buffering is required anyway in order to implement overlapping windows, so the extra buffering required by subsampling is just an extension of the general setup.

The tapering window

The output signal

The full output signal can be expressed in closed form by looking at the following picture, which shows the periodic pattern of overlapping grains:

Buffering

We need to compute:

Solutions
