The MIT Media Lab Phase Vocoder

01  Testfiles for pvanal
    1   4 partials, 1 sec
    2   4 partials, .2 sec
Testing Procedure
A Sample Session at the Terminal
From Phase to Frequency
22  Ktimpnt Variations
    1   LINE pointer, 1:1 ratio, santur1.pv1
    1B  LINSEG pointer, variable ratios, speech1.pv1
    1C  LINE pointer, 1:250 ratio, snap.pv1
    2   EXPON, 1:1 & 1:2 ratio, santur1.pv1
    3   LFO pointer, 1:1.25 ratio, santur1.pv1
23  Kfmod Variations
    1   EXPON transposition, santur1.pv1
    2   LINSEG transposition, speech1.pv1
Analysis

The program pvanal performs an analysis of a soundfile and writes the result to an analysis file. The analysis basically consists of an FFT, where the framesize of the window (a number of consecutive audio samples) and the window overlap factor are optional command-line arguments. The analysis file will be at least twice as big as the soundfile and contains alternating magnitude and frequency values. This data can be interpreted as a sequence of short-time Fourier transforms (per frame) or as the time-varying response of a bank of narrow bandpass filters (per channel).

Modification

Software can modify frequency and/or amplitude data over the whole or a partial range of the analyzed data. Moore gives a thorough exposition of programs to implement these possibilities. (Moore 1990: pp.227-263)

Synthesis

The synthesis is done by Csound's unit generator PVOC. The original sound can be resynthesized with high fidelity. More commonly, time and/or frequency modifications are performed during resynthesis. In this case, the PVOC unit will interpolate between given 'stable' points.
Dolson, Mark 1986.
"The Phase Vocoder: A Tutorial."
Computer Music Journal 10(4):14-27.
Dudley, Homer 1939.
"The Vocoder." Bell Labs. Rec. 18:122-126.
Reprinted in IEEE Transactions on Acoustics, Speech and Signal
Processing ASSP-29(3):347-351.
Flanagan, J.L. and R.M. Golden 1966.
"Phase Vocoder."
Bell System Technical Journal 45:1493-1509.
Reprinted in IEEE Transactions on Acoustics, Speech and Signal
Processing ASSP-29(3):388-404.
Gordon, J.W., and John Strawn 1985.
"An Introduction To The Phase Vocoder."
in J. Strawn, ed. 1985.
Digital Audio Signal Processing: An Anthology.
A-R Editions, pp. 221-270.
Grey, J.M. 1977.
"Multidimensional Perceptual Scaling of Musical Timbres."
Journal of the Acoustical Society of America 61(5):1270-1277.
Grey, J.M., and J.A. Moorer 1977.
"Perceptual Evaluations of Synthesized Musical Instrument Tones."
Journal of the Acoustical Society of America 62(2):454-462.
Grey, J.M., and J.W. Gordon 1978.
"Perceptual Effects of Spectral Modification on Musical Timbres."
Journal of the Acoustical Society of America 63(5):1493-1500.
Griffin, D.W., and J.S. Lim 1984.
"Signal Estimation from Modified Short-Time Fourier Transform."
IEEE Transactions on Acoustics, Speech and Signal Processing
ASSP-32(2):236-243.
Moore, F.R. 1990.
"The Phase Vocoder."
in Elements of Computer Music.
Prentice-Hall, pp. 227-263.
Moorer, J.A. 1978.
"The Use of The Phase Vocoder in Computer Music Applications."
Journal of the Audio Engineering Society 26(1/2):42-45.
Portnoff, M.R. 1976.
"Implementation of the Digital Phase Vocoder Using the Fast
Fourier Transform."
IEEE Transactions on Acoustics, Speech and Signal Processing
ASSP-24:243-248.
Portnoff, M.R. 1980.
"Time-Frequency Representation of Digital Signals and Systems
Based on Short-Time Fourier Analysis."
IEEE Transactions on Acoustics, Speech and Signal Processing
ASSP-28(1):55-69.
Portnoff, M.R. 1981a.
"Short-Time Fourier Analysis of Sampled Speech."
IEEE Transactions on Acoustics, Speech and Signal Processing
ASSP-29(3):364-373.
Portnoff, M.R. 1981b.
"Time-Scale Modification of Speech Based on Short-Time Fourier
Analysis."
IEEE Transactions on Acoustics, Speech and Signal Processing
ASSP-29(3):374-390.
Press, William H. et al. 1988.
"Fast Fourier Transform."
in Numerical Recipes in C: The Art of
Scientific Computing.
Cambridge: Cambridge University Press, pp. 496-536.
Wishart, T. 1990.
"The Phase Vocoder."
Composers' Desktop Project Csound for Atari
Manual. York, England: CDP.
Following this header, the analysis data is stored as float, with magnitudes and frequencies in turn for the first N/2+1 Fourier bins of each frame. We wrote a few programs to investigate the phase vocoder algorithm on its analysis side.
The source code of these programs is not included here, but it is shipped with the Internet version of the catalogue. Written in C and not depending on audio hardware, the programs work on any Csound platform. Here is a short summary of their function.
- channels [analysis file]: reads file, data displayed per channel
- frames [analysis file]: reads file, data displayed per frame
- magnitud [analysis file]: reads file, data displayed per frame, above threshold magnitude only
- wrapped [analysis file]: reads file, displays the value of phase and intermediate variables on the way to the approximated frequency.

Sample runs of two of these programs are shown below. The data flood is large and illustrates the need for specific display programs, adapted to the purpose and nature of the sound material at hand. The ability to skip a number of analysis frames can further reduce the stream of data.
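The size of this data flood follows from the storage layout described above: each frame holds one magnitude and one frequency for each of the first N/2+1 bins. The sketch below estimates the data size under the assumption of 4-byte floats. The function names `pv_bins` and `pv_data_bytes` are hypothetical and are not part of pvanal.

```c
#include <assert.h>

/* Number of independent bins for a frame of frame_size samples: N/2 + 1. */
static long pv_bins(long frame_size)
{
    assert(frame_size > 0 && frame_size % 2 == 0);
    return frame_size / 2 + 1;
}

/* Bytes of analysis data, excluding the header.
 * Assumption: every value is a 4-byte float, and each bin stores
 * one magnitude and one frequency per frame. */
static long pv_data_bytes(long frames, long frame_size)
{
    const long bytes_per_float = 4;
    assert(frames >= 0);
    return frames * pv_bins(frame_size) * 2 * bytes_per_float;
}
```

For 859 frames with a frame size of 1024, `pv_data_bytes` gives 3525336, which agrees with the dataBsize value that the header of speech1.pv1 reports in the sample session.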
The program flow of pvanal gives a crude picture of this FFT-based phase vocoder, avoiding the necessity to go too deeply into the intricate network of source files.
The meaningful part is the FFT loop of pvanal.c, where amplitude/time values are transformed into amplitude/frequency values. The loop functions are found in the source dsputil.c and the buffers below are central in this Fourier transform procedure. We look into the phase/frequency conversion in greater detail later.
The interpolation mechanism used by the PVOC unit generator during resynthesis has not been covered in this catalogue, although it is essential to a fruitful understanding of this synthesis technique.
The second and third commands illustrate the operation of 'magnitud'. First the program displays the information found in the file header. After the number of frames and the threshold magnitude have been specified, analysis data fills the screen. The data resulting from the first 'magnitud' call is reproduced under "Output of 'magnitud', 1st call".
It is the first frame of the analyzed speech file. A hard copy of the complete analysis file occupies 1718 pages!
The output of the second 'magnitud' call (see "Output of 'magnitud' 2nd call") is restricted to channels with magnitudes above 3. This greatly reduces the data flood.
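The thresholding can be pictured with a short sketch. It is not the source of 'magnitud', which is not reproduced here. It assumes a frame stored as interleaved (magnitude, frequency) floats and assumes that a bin passes when its magnitude is greater than or equal to the threshold, so that a threshold of 0 lets everything through. The name `count_above` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count the bins of one frame whose magnitude reaches the threshold.
 * Assumed layout: frame[2*i] = magnitude, frame[2*i + 1] = frequency. */
static size_t count_above(const float *frame, size_t bins, float threshold)
{
    size_t i, n = 0;
    assert(frame != NULL);
    for (i = 0; i < bins; ++i) {
        if (frame[2 * i] >= threshold)
            ++n;
    }
    return n;
}
```

A display program would print the (magnitude, frequency) pair at the point where this sketch increments the counter.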
speech1.SF:
    AIFF, 220500 samples, baseFrq 261.6 (midi 60), sustnLp: mode 0, relesLp: mode 0
audio sr = 44100, monaural
analysing 220500 sample frames (5.0 secs)
1024 infrsize, 256 infrInc
859 output frames estimated
frame: 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520 540 560 580 600 620 640 660 680 700 720 740 760 780 800 820 840 858
859 output frames written

speech1.pv1:
dataBsize: data occupies 3525336 bytes
dataFormat: 36
minFreq: 0.00
maxFreq: 22050.00
freqFormat, log or lin: 1
frameFormat: 7
mono, stereo, quad: 1
samplingRate of original audio: 44100
frameSize: 1024 samps per frame
frameIncr: 256 samps per frame
frameBsize: total of 859 frames
How many frames do you want me to display? 1
What threshold magnitude should I take? 0     <-- ie none

speech1.pv1:
dataBsize: data occupies 3525336 bytes
dataFormat: 36
minFreq: 0.00
maxFreq: 22050.00
freqFormat, log or lin: 1
frameFormat: 7
mono, stereo, quad: 1
samplingRate of original audio: 44100
frameSize: 1024 samps per frame
frameIncr: 256 samps per frame
frameBsize: total of 859 frames
How many frames do you want me to display? 400
What threshold magnitude should I take? 3
Output of 'magnitud', 1st call.

            mag        freq
fr 1:      1.48        0.00
           1.06       86.13
           0.56      129.20
           0.46      172.27
           0.93      215.33
           1.23      258.40
           1.71      301.46
           1.95      344.53
           1.50      387.60
           0.73      430.66
           0.37      473.73
           0.28      516.80
           0.28      559.86
           0.29      602.93
           0.27      646.00
           0.32      689.06
           0.32      732.13
           0.28      775.20
etc. every 40 Hz channel till sr/2
------ snip ----- snip ---- snip
           0.00    21705.47
           0.01    21748.54
           0.01    21791.60
           0.00    21834.67
           0.00    21877.73
           0.01    21920.80
           0.00    21963.87
           0.00    22006.93
           0.00    22050.00
Done: 1 frame of speech1.pv1
Output of 'magnitud' 2nd call.

           mag    freq
fr 1:
fr 2:     3.52  344.53
fr 3:
fr 4:
fr 5:
fr 6:
fr 7:     3.04  344.53
fr 8:
fr 9:
fr 10:
fr 11:    3.75  344.53
fr 12:    3.75  344.53
fr 13:
fr 14:
fr 15:
fr 16:    3.55  344.53
fr 17:    3.62  344.53
fr 18:
fr 19:
fr 20:
fr 21:    3.66  344.53
fr 22:    3.19  344.53
fr 23:
fr 24:    3.06  215.33
fr 25:    3.43  215.33    3.60  344.53
fr 26:    3.78  344.53
fr 27:
fr 28:
fr 29:
fr 30:    3.72  344.53
fr 31:    3.13  344.53
fr 32:
fr 33:
fr 34:    3.16   43.07
fr 35:    4.56    0.00    3.29   43.07    4.02  344.53
fr 36:    3.08   43.07    3.57  344.53
fr 37:    3.45    0.00    3.40   43.07
fr 38:    3.74   43.07
fr 39:    3.71    0.00    4.54   43.07
fr 40:    4.68    0.00    4.15   43.07    3.79  344.53
fr 41:
fr 42:    3.14    0.00
fr 43:
fr 44:    3.11    0.00    3.61   43.07
fr 45:    4.60    0.00    3.86   43.07    3.09  344.53
fr 46:    3.54   43.07
fr 47:    4.31    0.00    3.62   43.07
fr 48:    3.56   43.07
fr 49:    4.15    0.00    3.95   43.07    3.06  344.53
fr 50:    4.36    0.00    3.71   43.07    3.02  344.53
fr 51:    3.35   43.07
fr 52:    3.81    0.00    3.36   43.07
fr 53:
fr 54:    4.29    0.00    3.42   43.07    3.89  344.53
fr 55:    4.27    0.00    3.72   43.07    3.41  344.53
fr 56:    3.20   43.07
fr 57:    4.16    0.00    3.37   43.07
fr 58:    3.34   43.07    3.29  344.53
fr 59:    4.18    0.00    3.65   43.07    3.82  344.53
fr 60:    4.60    0.00    4.18   43.07
fr 61:    3.98   43.07
fr 62:    4.19    0.00    3.71   43.07
fr 63:    3.04   43.07
fr 64:    4.19    0.00    3.16   43.07    3.24  344.53
fr 65:    3.84    0.00    3.71   43.07
fr 66:    3.76   43.07
fr 67:    4.12    0.00    3.82   43.07
fr 68:    3.51   43.07    4.03  344.53
fr 69:    4.69    0.00    3.62   43.07    3.60  344.53
fr 70:    3.45    0.00
fr 71:    3.17   43.07
fr 72:    4.00    0.00    4.00   43.07
fr 73:    3.82   43.07    4.02  344.53
fr 74:    4.91    0.00    3.85   43.07    3.33  344.53
fr 75:    3.24   43.07
fr 76:
fr 77:    3.24    0.00
fr 78:    3.31  344.53
fr 79:    4.51    0.00    3.80   43.07
fr 80:    3.53   43.07
fr 81:
fr 82:    3.11  344.53
fr 83:    3.27  344.53
fr 84:    3.75    0.00
fr 85:
fr 86:
fr 87:    3.74  344.53
fr 88:
fr 89:    3.39    0.00    3.08  215.33
fr 90:
fr 91:
fr 92:
fr 93:    3.18  301.46    3.40  344.53
fr 94:
=============================
etc. all frames till the end
=============================
------------ snip -----------------------
fr 362:   3.54  344.53
fr 363:   3.18  344.53
fr 364:
fr 365:
fr 366:   3.06  344.53
fr 367:   4.13  344.53
fr 368:   3.27  344.53
fr 369:
fr 370:
fr 371:   3.25  344.53
fr 372:   3.13  344.53
fr 373:
fr 374:
fr 375:
fr 376:   3.81  344.53
fr 377:   3.74  344.53
fr 378:
fr 379:
fr 380:
fr 381:   4.11  344.53
fr 382:   3.57  344.53
fr 383:
fr 384:
fr 385:
fr 386:   3.67  344.53
fr 387:   3.01  344.53
fr 388:
fr 389:
fr 390:
fr 391:   3.02  344.53
fr 392:
fr 393:
fr 394:
fr 395:   3.28  344.53
fr 396:   3.10  344.53
fr 397:
fr 398:
fr 399:
fr 400:   3.62  344.53
Done: 400 frames of speech1.pv1
The program 'wrapped' displays the successive values of certain expressions in the source code per channel. In the discussion below, each of these expressions is tagged with the name of the output column that shows it (see phase, see diff-p, and so on).
First the phase value resulting from the rectangular to polar conversion is assigned to the float variable p:

    p = pha[2L*i]                                   see phase

Then the phase change since the last frame is computed:

    p -= oldPh[i]                                   see diff-p

followed by a call of the macro MMmaskPhs, where p now holds the Δ phase:

    MMmaskPhs(p,z,pi,oneOnPi)

The preprocessor replaces this macro call by:

    z = (int)(p*oneOnPi);
    p -= pi*(float)((int)((z+((z>=0)?(z&1):-(z&1)))));

In the first statement, the integer z is assigned the value of float p, scaled by 1/π. Casting float to integer truncates all decimals. The second statement is far more complex and is best approached in a number of small steps. First we look at the expression condit

    z + ( (z>=0) ? (z&1) : -(z&1) )                 see condit

which consists of a test, an evaluation and an addition. The expression

    z&1

discriminates between odd and even numbers. In binary representation, all odd integers have their bit 0 set. Therefore, if z is odd, then z&1 = 1. For even integers z, z&1 evaluates to 0 and condit = z. So the conditional test

    z>=0 ? (z&1) : -(z&1)

only has consequences for odd integers. For these, condit can be simplified to the following if-else statement:

    if z>=0, condit = z+1 else condit = z-1

The net effect is that condit is always even: condit = n*2. Odd values of z are moved one step away from zero.
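The condit step can be isolated in a few lines of C. This is a sketch for experimenting with the expression, not an excerpt from dsputil.c; the function name `condit` simply borrows the column label used by 'wrapped'.

```c
#include <assert.h>

/* Return z if it is even, otherwise the next even integer away from zero. */
static int condit(int z)
{
    int c = z + ((z >= 0) ? (z & 1) : -(z & 1));
    assert((c & 1) == 0);   /* the result is always even */
    return c;
}
```

The truncated values 219, 104, -3 and -145 yield 220, 104, -4 and -146, the same condit values that appear in the output of 'wrapped' for the phase differences 219.34, 104.71, -3.55 and -145.45.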
Then the masking/unwrapping is completed by

    p = p - pi * condit                             see masked

The value of condit is scaled back by π and subtracted from the Δ phase p. All p are thereby brought into the principal branch of the inverse tangent function, the range from -pi to pi.
In the last two statements of UnwrapPhase, the old polar phase and the unwrapped Δ phase p are saved:

    oldPh[i] = pha[2L*i];
    pha[2L*i] = p;

In this way, the loop works through all the phase values of the FFT frame, and then proceeds to PhaseToFreq.
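Put together, the steps described for UnwrapPhase can be sketched as one loop. This is a reconstruction from the description above, not the Csound source. It uses double instead of float, assumes that the buffer holds interleaved (magnitude, phase) pairs with the phase in the odd slots, and uses the hypothetical names `mask_phase`, `unwrap_phase` and `SKETCH_PI`.

```c
#include <assert.h>

#define SKETCH_PI 3.14159265358979323846

/* Subtract the even multiple of pi selected by the condit step. */
static double mask_phase(double p)
{
    int z = (int)(p / SKETCH_PI);              /* truncates toward zero */
    z += (z >= 0) ? (z & 1) : -(z & 1);        /* condit: make z even   */
    return p - SKETCH_PI * (double)z;
}

/* Replace each phase in buf by its masked change since the last frame.
 * old_ph keeps the raw phases of the previous frame, one per bin. */
static void unwrap_phase(double *buf, long bins, double *old_ph)
{
    double *pha = buf + 1;                     /* assumed: phases in odd slots */
    long i;
    assert(buf && old_ph && bins > 0);
    for (i = 0; i < bins; ++i) {
        double p = pha[2 * i] - old_ph[i];     /* diff-p */
        p = mask_phase(p);                     /* masked */
        old_ph[i] = pha[2 * i];                /* keep raw phase for next frame */
        pha[2 * i] = p;                        /* write back the phase change   */
    }
}
```

The order of the last two assignments matters: the raw phase has to be copied to `old_ph` before the buffer slot is overwritten with the phase change.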
Output from 'wrapped'
We show only the first 6 channels out of 17: each channel with successive expression values for 5 frames. The example serves to cross-check one's understanding of the variables in the analyzed source codes.
=== program output ====
The program displays analysis data of pvoc.file
Successful read of: 60_01_2.pv1
Displaying Header Data
dataBsize: data occupies 37264 bytes
dataFormat: 36
samplingRate of original audio: 22050
mono, stereo, quad: 1
frameSize: 32 audio samples per frame
frameIncr: increasing 16 audio samples per frame
frameBsize: 274 frames in this file
frameFormat: 7
minFreq: 0.00
maxFreq: 11025.00
freqFormat, log or lin: 1
How many frames do you want me to display? 5
**************** Settings **************************
All phase related expressions are scaled by factor π.
srOn2pi = 219.34
eDphIncr = 3.14
frqPerBin = ± 689.06

channel 1: -689.06 < 0.00 < 689.06
   phase   diff-p  condit  masked  expDpha    Diff  masked  local-F   glob-F
    0.00     0.00       0    0.00     0.00    0.00    0.00     0.00     0.00
  219.34   219.34     220   -0.66     0.00   -0.66   -0.66  -457.95  -457.95
 -219.34  -438.67    -438   -0.67     0.00   -0.67   -0.67  -462.23  -462.23
    0.00   219.34     220   -0.66     0.00   -0.66   -0.66  -457.95  -457.95
  219.34   219.34     220   -0.66     0.00   -0.66   -0.66  -457.95  -457.95

channel 2: 0.00 < 689.06 < 1378.12
   phase   diff-p  condit  masked  expDpha    Diff  masked  local-F   glob-F
  161.35   161.35     162   -0.65    -1.00    0.35    0.35   244.36   933.42
  266.06   104.71     104    0.71    -1.00    1.71   -0.29  -202.56   486.50
  319.40    53.34      54   -0.66    -1.00    0.34    0.34   234.98   924.04
  315.85    -3.55      -4    0.45    -1.00    1.45   -0.55  -381.35   307.71
  322.95     7.11       8   -0.89    -1.00    0.11    0.11    73.05   762.12

channel 3: 689.06 < 1378.12 < 2067.19
   phase   diff-p  condit  masked  expDpha    Diff  masked  local-F   glob-F
  315.20   315.20     316   -0.80     0.00   -0.80   -0.80  -554.07   824.06
  339.37    24.17      24    0.17     0.00    0.17    0.17   120.58  1498.71
  288.42   -50.95     -50   -0.95     0.00   -0.95   -0.95  -657.27   720.85
  340.08    51.66      52   -0.34     0.00   -0.34   -0.34  -234.41  1143.71
  341.29     1.21       2   -0.79     0.00   -0.79   -0.79  -541.72   836.40

channel 4: 1378.12 < 2067.19 < 2756.25
   phase   diff-p  condit  masked  expDpha    Diff  masked  local-F   glob-F
  716.45   716.45     716    0.45     1.00    1.45   -0.55  -375.74  1691.45
  600.35  -116.10    -116   -0.10    -1.00    0.90    0.90   617.56  2684.75
  680.25    79.90      80   -0.10    -1.00    0.90    0.90   621.59  2688.77
  616.58   -63.68     -64    0.32    -1.00    1.32   -0.68  -466.72  1600.47
  600.84   -15.74     -16    0.26    -1.00    1.26   -0.74  -508.18  1559.01

channel 5: 2067.19 < 2756.25 < 3445.31
   phase   diff-p  condit  masked  expDpha    Diff  masked  local-F   glob-F
  878.03   878.03     878    0.03     0.00    0.03    0.03    19.44  2775.69
 1000.71   122.68     122    0.68     0.00    0.68    0.68   470.24  3226.49
  855.26  -145.45    -146    0.55     0.00    0.55    0.55   377.39  3133.64
  991.26   136.00     136    0.00     0.00    0.00    0.00     0.27  2756.52
 1004.12    12.86      12    0.86     0.00    0.86    0.86   592.27  3348.52

channel 6: 2756.25 < 3445.31 < 4134.37
   phase   diff-p  condit  masked  expDpha    Diff  masked  local-F   glob-F
 1060.97  1060.97    1060    0.97    -1.00    1.97   -0.03   -18.81  3426.50
  944.04  -116.93    -116   -0.93    -1.00    0.07    0.07    45.07  3490.39
 1044.30   100.27     100    0.27    -1.00    1.27   -0.73  -505.84  2939.47
  935.53  -108.77    -108   -0.77    -1.00    0.23    0.23   155.56  3600.87
  905.72   -29.81     -30    0.19    -1.00    1.19   -0.81  -560.77  2884.55

Done: 6 channels and 5 frames of file 60_01_2.pv1
Since the phase difference is measured at regular frame-increment intervals, its expected value depends on the window overlap factor. An increment of a whole frame results in a phase difference of 2π, an increment of 1/2 frame in a phase difference of π, and so on.
The corresponding constant is called eDphIncr. In the expression below, an expected phase difference (see expDpha) is subtracted from the unwrapped phase difference.

    p = pha[2L*i]-expectedDphas;                    see Diff

The emerging difference is masked (as in UnwrapPhase) and saved:

    pha[2L*i] = p;                                  see masked

Next, the difference is converted to frequency by

    pha[2L*i] = pha[2L*i] * srOn2pi;                see local-F
    pha[2L*i] = pha[2L*i] + binMidFrq;              see glob-F

In the latter statement the channel's center frequency is added to the local frequency value.
In the last three statements the values of expectedDphas and
binMidFrq are updated for the next pass. The values for the
expected phase difference depend on eDphIncr, but must lie
within the range from -π to π.
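The conversion from phase difference to frequency can be sketched in the same way. Again this is a reconstruction from the description, not the Csound source: it uses double precision, assumes that frqPerBin equals sampRate divided by the actual frame size (which reproduces the 689.06 Hz of the example), and keeps the expected phase difference in range by reusing the masking step. The names `mask_phase`, `phase_to_freq` and `SKETCH_PI` are hypothetical.

```c
#include <assert.h>

#define SKETCH_PI 3.14159265358979323846

/* Subtract the even multiple of pi selected by the condit step. */
static double mask_phase(double p)
{
    int z = (int)(p / SKETCH_PI);
    z += (z >= 0) ? (z & 1) : -(z & 1);
    return p - SKETCH_PI * (double)z;
}

/* Turn the masked phase differences in buf (odd slots) into frequencies in Hz.
 * size is the number of independent values per frame, incr the frame increment. */
static void phase_to_freq(double *buf, long size, double incr, double samp_rate)
{
    double *pha = buf + 1;
    double frame_len = (double)((size - 1) * 2);            /* actual(size)   */
    double sr_on_2pi = samp_rate / (2.0 * SKETCH_PI * incr);
    double frq_per_bin = samp_rate / frame_len;             /* assumption     */
    double e_dph_incr = 2.0 * SKETCH_PI * incr / frame_len;
    double bin_mid_frq = 0.0, expected_dphas = 0.0;
    long i;
    assert(buf && size > 1 && incr > 0.0 && samp_rate > 0.0);
    for (i = 0; i < size; ++i) {
        double p = pha[2 * i] - expected_dphas;             /* Diff            */
        p = mask_phase(p);                                  /* masked          */
        pha[2 * i] = p * sr_on_2pi + bin_mid_frq;           /* local-F, glob-F */
        expected_dphas = mask_phase(expected_dphas + e_dph_incr);
        bin_mid_frq += frq_per_bin;
    }
}
```

With sampRate 22050, incr 16 and size 17, a masked phase difference of -0.65π in channel 2 comes out at about 930 Hz. The 'wrapped' output shows 933.42 Hz for the same row because the program works with the unrounded phase value.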
The instruments of subgroup 22 explore the manipulation of PVOC's
ktimpnt input. The variable ktimpnt signifies a point in time of
the analysis file.
Instrument 1 shows how to resynthesize the original soundfile (a
santur). For idur = 5 seconds, LINE produces a linear set of
values that will let PVOC progress through the analysis file at
the original speed.
In instrument 2, LINE will have PVOC resynthesize the analysis
file backwards. Choosing durations that differ from the original
soundfile duration results in time-stretching or time-compression
of the analysis file, without altering the pitch.
The pointer into this speech analysis file follows the values
produced by LINSEG. Resynthesis proceeds in forward motion. The
various slopes determine the amount of time-stretching or
compression. With regard to these effects, the slope pattern
results in a subtle modulation of the original speech
inflections.
This variation shows us a giant magnification of a time fragment
of 40 ms: an audio microscope!
The original soundfile featured a finger snap happening at time
0.71 seconds. The beginning and ending time points given to LINE
bracket the original 'snap', so that the instrument captures
just this fragment. Since the note duration is 10 seconds in the
example, the snap has been blown up by a factor of 250.
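The arithmetic behind these LINE pointers is simple enough to sketch. The code below models the pointer as a straight line from a start time to an end time in the analysis file and derives the resulting time-scaling ratio. It does not call Csound, and the names `line_pointer` and `stretch_ratio` are hypothetical.

```c
#include <assert.h>

/* Position in the analysis file (seconds) at time t of a note lasting dur. */
static double line_pointer(double t, double start, double end, double dur)
{
    assert(dur > 0.0);
    return start + (end - start) * (t / dur);
}

/* Output seconds per second of analysed sound.
 * A negative result means the file is read backwards. */
static double stretch_ratio(double start, double end, double dur)
{
    assert(end != start);
    return dur / (end - start);
}
```

A fragment of 0.04 seconds played over a note of 10 seconds gives a ratio of about 250, the magnification quoted above. Reading a 5 second file from its end to its beginning in 5 seconds gives -1, which corresponds to the backwards resynthesis of instrument 2.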
Here EXPON directs the pace of resynthesis of the santur analysis
file.
Instrument 1 moves forward and at growing speed through
santur1.pv1, while instrument 2 resynthesizes backwards and
slowing down.
The variable ifildur is set to the duration of the original audio
file.
Here we find an experimental design where the pointer is made to
oscillate through the santur1.pv1 analysis file.
The oscillator controlling the pointer is set to 1/4 Hz and has
a phase offset of 3/2 π. By adding the constant 2.5, this signal
slowly oscillates between times 2.25 and 2.75, and in total
completes a bit more than one cycle (1.25 cycles) during the note
duration of 5 seconds.
In this subgroup, we focus on the second krate variable of PVOC:
kfmod. The pointer ktimpnt is neutralized by a simple linear
resynthesis control.
The first example, again using the santur analysis file,
demonstrates the effect of an EXPON control signal, whose target
value is variable.
As in the previous example, some care needs to be taken in order
to convert the pitch of the note into a value suitable for
transposition manipulations. Values in the neighbourhood of unity
are required. Specific values will vary with the approximate
fundamental frequency of the sound(s) in the analysis file. Then,
multiplication of EXPON with ifsc will produce the desired pitch
modifications.
Variables, names, meanings, values                           EXAMPLE
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
incr          = frameIncr = samples between frames           16
sampRate      = sampling rate of audio file: sr              22050
srOn2pi       = sampRate/(2π*incr)                           219.34 Hz
binMidFrq     = channel center frequency                     variable
frqPerBin     = maximum deviation from center in Hz          ± 689.06 Hz
size          = indepVals = independent values in a frame    17
macro: actual(size); replaced by: ((size-1L)*2L),
       ((17-1)*2) thus returns the actual framesize.         32
eDphIncr      = 2π*incr/((float)actual(size))                2π*16/32
expectedDphas = expected Δ phase between channels            variable
60_22_1
additional parameters: none
60_22_1B
additional parameters:
60_22_1C
additional parameters:
60_22_2
additional parameters:
60_22_3
additional parameters:
60_23_1
additional parameters:
60_23_2
additional parameters:
jpff@cs.bath.ac.uk
Last modified: Sun Feb 26 13:35:36 GMT 2006