The Formulas
The key technique behind simple pitch shifting is that of playing the audio file either more slowly (to lower the pitch) or faster (to raise the pitch). This is just like spinning an old record at a speed different than the nominal RPM value.
In the digital world we can always simulate the effect of changing an analog playing speed by using fractional resampling; for a refresher on this technique please refer to Lecture 3.3.2 on Coursera. Resampling, however, has two problems:
the pitch of speech is changed but so is the overall speed, which we do not want;
the ratio of output to input samples for the operation is not one, so it cannot be implemented in real time.
To overcome these limitations, we can use granular synthesis: we split the input signal into chunks of a given length (the grains) and we perform resampling on each grain independently to produce a sequence of equal-length output grains.
In order to implement granular synthesis in real time we need to take into account the concepts of grain length and grain stride. A grain should be long enough so that it contains enough pitched speech for resampling to work; but it should also be short enough so that it doesn't straddle too many different sounds in an utterance. Experimentally, the best results for speech are obtained using grains between 20 and 40 ms.
The grain stride indicates the displacement in samples between successive grains and it is a function of grain length and of the overlap between successive grains. With no overlap, the grain stride is equal to the grain length; however, overlap between neighboring grains is essential to reduce the artifacts due to the segmentation. Overlapping output grains are blended together using a tapering window; the window is designed so that it performs linear interpolation between samples from overlapping grains.
Call $\alpha$ the amount of overlap (as a percentage) between neighboring grains. With $\alpha = 0$ there is no overlap, whereas with $\alpha = 1$ all the samples in a grain overlap with another grain. The relationship between grain length $L$ and grain stride $S$ is $L = S(1 + \alpha)$. This is illustrated in the following figure for varying degrees of overlap and a stride of $S$ samples; grains are represented using the shape of the appropriate tapering window:
Note that the stride is constant for any amount of overlap and that each grain starts at the same instants independently of overlap; this is the key observation that will allow us to implement granular synthesis in real time.
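As a quick sanity check of the length–stride relationship, here is a minimal Python sketch (the function names are mine, not from the course code):

```python
def grain_length(stride, overlap):
    """Grain length L = S * (1 + alpha), for stride S and overlap alpha in [0, 1]."""
    return int(round(stride * (1 + overlap)))

def grain_starts(stride, num_grains):
    """Grain k always starts at index k * S, independently of the overlap."""
    return [k * stride for k in range(num_grains)]

S = 100
for alpha in (0.0, 0.25, 0.5, 1.0):
    print(f"overlap={alpha:.2f}  stride={S}  length={grain_length(S, alpha)}")

# The start instants do not depend on the overlap:
print(grain_starts(S, 4))
```

Only the grain length changes with the overlap; the start instants stay on the fixed stride grid, which is what makes the real-time implementation possible.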
When we lower the pitch, each grain is oversampled and only input samples from within the grain itself are needed, so the computation is causal. By contrast, when we raise the pitch we are subsampling: samples are discarded to create an output grain and so, to fill the grain, we need to "look ahead" and borrow data from beyond the original grain's end boundary. The algorithm is therefore noncausal but, crucially, we can exactly quantify the amount of lookahead and handle it via buffering.
We will see in the next sections that buffering is required anyway in order to implement overlapping windows, so that the extra buffering required by subsampling will just be an extension of the general setup.
The full output signal can be expressed in closed form by looking at the following picture, which shows the periodic pattern of overlapping grains:
We need to compute the output signal $y[n]$ resulting from the superposition of the overlapping, windowed grains.
Are you ready to see the answer? :)
We can express the content of the $k$-th output grain as

$$y_k[n] = x_c\big((kS + nf)T\big), \qquad 0 \le n < L,$$

where $x_c(t)$ is the interpolated, continuous-time version of the input signal (with $T$ the sampling period, so that $x[n] = x_c(nT)$) and $f$ is the sampling rate change factor (with $f > 1$ for subsampling, i.e. to raise the pitch, and $f < 1$ for oversampling, i.e. to lower the pitch). Note that the $k$-th grain starts at input index $kS$ and is built using input data up to index $kS + (L - 1)f$ as well.
In practice we will obviously perform local interpolation rather than full interpolation to continuous time, as explained in Lecture 3.3.2 on Coursera. Let $t = nf$ and set $p = \lfloor t \rfloor$ and $\tau = t - p$; with this, the interpolation can be approximated as

$$y_k[n] \approx (1 - \tau)\, x[kS + p] + \tau\, x[kS + p + 1].$$
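The per-grain resampling above can be sketched in Python as follows; this is a minimal illustration of two-point linear interpolation, with my own function and variable names:

```python
def resample_grain(x, start, L, f):
    """Build one output grain of length L by reading x at the fractional
    positions start + n * f, using two-point linear interpolation."""
    grain = []
    for n in range(L):
        t = n * f
        p = int(t)          # integer part (floor, since t >= 0)
        tau = t - p         # fractional part
        grain.append((1 - tau) * x[start + p] + tau * x[start + p + 1])
    return grain

# With a linear ramp as input, reading at rate f returns a ramp of slope f
# exactly, because linear interpolation is exact on linear signals.
x = list(range(200))
g = resample_grain(x, start=0, L=100, f=1.5)
print(g[:4])   # [0.0, 1.5, 3.0, 4.5]
```

Note that with $f = 1.5$ the grain consumes 150 input samples to produce 100 output samples, which is precisely the lookahead problem discussed above.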
Note that when we lower the voice's pitch (i.e. we implement the "Darth Vader" voice transformer), since $f < 1$, the computation of the output grains is strictly causal, that is, at any point in time we only need to access past input samples. Indeed, when we oversample, only a fraction of the grain's data is used to regenerate its content; if a grain's length is, say, 100 samples and we are lowering the frequency by a factor $f = 2/3$, we will only need 2/3 of the grain's original data to build the new grain.
For instance, if we are raising the frequency by a factor $f = 3/2$ and our grain length is, say, 100 samples, we will need a buffer of $(f - 1)L = 50$ "future" samples; this can be accomplished by accepting an additional processing delay of 50 samples. The difference between over- and under-sampling is clear when we look at the illustration in the notebook that shows the input sample index as a function of the output sample index:
The tapering window is as long as the grain and it is shaped so that the overlapping grains are linearly interpolated. The left sloping part of the window is $R$ samples long, with $R = L - S = \alpha S$. The tapering weights are therefore expressed by the formula:

$$w[n] = \begin{cases} n/R & 0 \le n < R \\ 1 & R \le n < S \\ (L - n)/R & S \le n < L \end{cases}$$
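A trapezoidal window of this kind, with linear slopes of length $R = L - S$, can be sketched in Python (names are mine; the sketch assumes some overlap, i.e. $R > 0$):

```python
def tapering_window(L, S):
    """Trapezoidal window: ramp up over R = L - S samples, flat
    middle section, ramp down over the last R samples."""
    R = L - S            # assumes R > 0 (some overlap between grains)
    w = []
    for n in range(L):
        if n < R:
            w.append(n / R)
        elif n < S:
            w.append(1.0)
        else:
            w.append((L - n) / R)
    return w

L, S = 150, 100          # 50% overlap: L = S * (1 + 0.5), R = 50
w = tapering_window(L, S)
# Overlapping parts of neighboring grains sum to one (linear cross-fade):
print(all(abs(w[i] + w[S + i] - 1) < 1e-12 for i in range(L - S)))
```

The key property is that $w[i] + w[S + i] = 1$ over the overlap region, so blending two overlapping grains never changes the overall signal level.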
Any output index $n$ can be written as

$$n = kS + m, \qquad 0 \le m < S,$$

where $k = \lfloor n/S \rfloor$ is the index of the current grain and $m$ is the index of the sample within the current grain. Note that the sample at $n$ is also the sample with index $S + m$ with respect to the previous grain. With this, the output at $n$ is the sum of sample number $m$ from the current grain plus sample number $S + m$ from the previous grain; both samples are weighed by the linear tapering slope:

$$y[n] = w[m]\, y_k[m] + w[S + m]\, y_{k-1}[S + m]$$

(the second term vanishes for $m \ge R$, i.e. outside the overlap region, where $w[S + m] = 0$).
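Putting the pieces together, one stride-worth of output samples can be sketched like this (a minimal illustration with hypothetical helper names, not the course's actual implementation):

```python
def synth_pattern(prev_grain, cur_grain, w, S):
    """Compute S output samples: blend the tail of the previous grain
    with the head of the current grain using the tapering window w."""
    L = len(w)
    R = L - S
    out = []
    for m in range(S):
        y = w[m] * cur_grain[m]
        if m < R:                       # overlap region: add the previous tail
            y += w[S + m] * prev_grain[S + m]
        out.append(y)
    return out

# With two constant grains, the cross-fade reconstructs the constant exactly.
L, S = 150, 100
w = [n / 50 if n < 50 else (1.0 if n < 100 else (150 - n) / 50) for n in range(L)]
out = synth_pattern([1.0] * L, [1.0] * L, w, S)
print(all(abs(v - 1.0) < 1e-12 for v in out))
```

Because the window weights sum to one over the overlap, a constant input comes out unchanged; this is a convenient unit test for any real implementation.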
Consider once again the grain computation pattern, periodic with period $S$; let's use the index $m$ to indicate the current position inside the current pattern; as $m$ goes from zero to $S - 1$ we need to compute:

$w[m]\, y_k[m]$ for all values of $m$;

$w[S + m]\, y_{k-1}[S + m]$ for $m < R$ (the tail of the previous grain).
Which audio samples do we need to have access to at any given time? Without loss of generality, consider the pattern for $k = 0$ as in the following figure:
$y_0[m] = x_c(mfT)$ for $0 \le m < S$;

$y_{-1}[S + m] = x_c\big((-S + (S + m)f)T\big)$ for $0 \le m < R$.
If $f \le 1$ both expressions are causal, so we can use a standard buffer to store past values. The size of the buffer is determined by "how far" in the past we need to reach: at pattern position $m$ the tail of the previous grain reads the input at index $-S + (S + m)f$, which lies $(1 - f)(S + m)$ samples in the past. In the limit, for $f$ close to zero, we need to access $x[n - L]$ when we compute the end of the tapering section, so that, in the worst case, the buffer must be as long as the grain size $L$. The overall processing delay of the voice changer in this case is equal to the size of the DMA transfer.
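The worst-case reach into the past can be checked numerically. This sketch (my own variable names; it assumes $f < 1$) measures, for each pattern position, how far behind the current time the previous grain's tail reads:

```python
L, S = 150, 100
R = L - S

def max_past_reach(f):
    """Largest distance (in samples) between the current output position m
    and the input position -S + (S + m) * f read by the previous grain's
    tail; equals (1 - f) * (S + m), maximized at m = R - 1."""
    return max(m - (-S + (S + m) * f) for m in range(R))

for f in (0.9, 0.5, 0.1, 0.01):
    print(f, max_past_reach(f))
# The reach grows toward the grain size L as f goes to zero,
# but never exceeds it: a buffer of L past samples always suffices.
```

This confirms the claim above: for any $f < 1$ a causal buffer of length $L$ is enough.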
If $f > 1$, on the other hand, we need to also access future samples; this is of course not possible, but we can circumvent the problem by introducing a larger processing delay. This is achieved by moving the input data pointer in the buffer further ahead with respect to the output data pointer. The maximum displacement between the current time and the future sample that we need takes place for $m = R$ (i.e., at the end of the tapering slope), for which:

$$\big(-S + (S + m)f\big) - m \Big|_{m = R} = (f - 1)(S + R) = (f - 1)L.$$
By offsetting the input and output pointers by $(f - 1)L$ samples, we can raise the pitch of the voice by a factor $f$ at the price of a processing delay equal to $(f - 1)L$ samples.
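As a sketch, the required pointer offset for a given pitch-raising factor (the function name is mine; the numbers match the $f = 3/2$, $L = 100$ example above):

```python
import math

def lookahead(f, L):
    """Number of 'future' samples needed when raising the pitch by a
    factor f > 1 with grain size L, i.e. ceil((f - 1) * L)."""
    return math.ceil((f - 1) * L)

print(lookahead(1.5, 100))   # 50, as in the example in the text
```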
TASK 1: Determine the maximum range for $f$ if the size of the audio buffer is equal to the grain size $L$.
We have already seen that for $f < 1$ we need a causal buffer whose maximum length is equal to $L$. For $f > 1$ the needed buffer size is $(f - 1)L$, so if the maximum buffer size is $L$, we must have $f \le 2$.