Main Group 60: Phase Vocoder Analysis/Synthesis

		The MIT Media Lab Phase Vocoder
01		Testfiles for pvanal
	1	4 partials, 1 sec
	2	4 partials, .2 sec
		Testing Procedure
		A Sample Session at the Terminal
		From Phase to Frequency
22		Ktimpnt Variations
	1	LINE pointer, 1:1 ratio, santur1.pv1
	1B	LINSEG pointer, variable ratios, speech1.pv1
	1C	LINE pointer, 1:250 ratio, snap.pv1
	2	EXPON, 1:1 & 1:2 ratio, santur1.pv1
	3	LFO pointer, 1:1.25 ratio, santur1.pv1
23		Kfmod Variations
	1	EXPON transposition, santur1.pv1
	2	LINSEG transposition, speech1.pv1

Overview

The phase vocoder analyzes and resynthesizes audio signals. The underlying analysis-synthesis model assumes in this case that the signal is well represented by a sum of sinusoids. Wind, brass, string, speech and a number of percussive sounds are well represented by the phase vocoding technique. Some other percussive sounds like clicks, or certain signal-to-noise sound combinations are not well represented as a sum of sinusoids. (Dolson 1987)

Analysis

The program pvanal performs an analysis of a soundfile and writes the result to an analysis file. The analysis essentially consists of an FFT; the frame size of the window (a number of consecutive audio samples) and the window overlap factor are optional command-line arguments. The analysis file will be at least twice as big as the soundfile and contains alternating magnitude and frequency values. This data can be interpreted as a sequence of short-time Fourier transforms (per frame) or as the time-varying response of a bank of narrow bandpass filters (per channel).
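The size of the data section can be estimated from the frame layout. The sketch below assumes that every frame stores one magnitude and one frequency as 4-byte floats for each of the N/2+1 independent bins, as described for the MIT analysis file later in this section. The function name pv_data_bytes is ours and is not part of pvanal.

```c
#include <stddef.h>

/* Bytes of analysis data following the header, assuming each frame holds
   one (magnitude, frequency) pair of floats for bins 0 .. frame_size/2.
   Hypothetical helper, not taken from the pvanal sources. */
size_t pv_data_bytes(size_t frames, size_t frame_size)
{
    size_t bins = frame_size / 2 + 1;          /* N/2+1 independent bins */
    return frames * bins * 2 * sizeof(float);  /* two floats per bin */
}
```

On a platform with 4-byte floats, 859 frames of size 1024 give 3525336 bytes, and 274 frames of size 32 give 37264 bytes. Both figures agree with the dataBsize values reported in the program output reproduced in this section.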

Modification

Software can modify frequency and/or amplitude data over the whole or a partial range of the analyzed data. Moore gives a thorough exposition of programs that implement these possibilities (Moore 1990: pp. 227-263).

Synthesis

The synthesis is done by Csound's unit generator PVOC. The original sound can be resynthesized with high fidelity. More commonly, time and/or frequency modifications are performed during resynthesis. In this case, the PVOC unit interpolates between given 'stable' points.

The MIT Media Lab Phase Vocoder

This phase vocoder is split into an analysis and a synthesis part. The analysis part is done by the program pvanal. It produces a phase vocoder analysis file with the special file header (see below), including information about the source sound, analysis framesize and the overlap factor.

Following this header, the analysis data is stored as float, with magnitudes and frequencies in turn for the first N/2+1 Fourier bins of each frame. We wrote a few programs to investigate the phase vocoder algorithm on its analysis side.

The source code of these programs is not included here, but it is shipped with the Internet version of the catalogue. Written in C and not depending on audio hardware, the programs work on any Csound platform. Here is a short summary of their function.

  -channels [analysis file] 
     reads file, data displayed per channel         
  -frames   [analysis file]
     reads file, data displayed per frame
  -magnitud [analysis file]
     reads file, data displayed per frame,
     above threshold magnitude only
  -wrapped  [analysis file]
     reads file, displays the value of phase
     and intermediate variables on the way to the
     approximated frequency.

Sample runs of two of these programs are shown on the following pages. The data flood is large and illustrates the need for specific display programs, adapted to purpose and nature of the sound material at hand. The ability to skip a number of analysis frames can further reduce the stream of data.

The program flow of pvanal gives a crude picture of this FFT-based phase vocoder, avoiding the necessity to go too deeply into the intricate network of source files.

The meaningful part is the FFT loop of pvanal.c, where amplitude/time values are transformed into amplitude/frequency values. The loop functions are found in the source file dsputil.c, and the buffers below are central to this Fourier transform procedure. The phase/frequency conversion is examined in greater detail later.

[buffers]

The interpolation mechanism used by the PVOC unit generator during resynthesis is not covered in this catalogue, although it is essential to a thorough understanding of this synthesis technique.

[dependences] [program flow]

Testing Procedure

In order to get a feel for how pvanal operates, we first create a simple audio signal with Csound. For example, the soundfile 60_01_1.SF holds a complex signal with partials at 1000, 2000, 3000 and 4000 Hz. Calling the program pvanal with different windowing factors creates several analysis files. The data can then be studied with display programs like 'magnitud'.

A Sample Session at the Terminal

The user commands are displayed in text boxes. First we show a call of pvanal analysing a speech file of 5 seconds, sampled at 44.1 kHz (2.2 MB). The FFT window has a size of 1024 points and the overlap factor is 4 by default. The resulting analysis file speech1.pv1 already occupies 3.5 MB of storage space.

The second and third command illustrate the operation of 'magnitud'. First the program displays the information found in the file header. After the number of frames and the threshold magnitude have been specified, analysis data fills the screen. The data resulting from the first 'magnitud' call is reproduced below.

It is the _first_ frame of the analyzed speech file. A hard copy of the complete analysis file occupies 1718 pages!

The output of the second 'magnitud' call is restricted to channels with magnitudes above 3. This greatly reduces the data flood.
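The values in the freq column of the listings appear to be multiples of the channel spacing sr/N. A minimal sketch of that relation follows; the function name bin_center_hz is ours, and we assume channels are numbered from 0 (DC) to N/2 (half the sampling rate).

```c
/* Center frequency in Hz of analysis channel `bin`, where bin runs from
   0 (DC) to frame_size/2 (half the sampling rate). Hypothetical helper. */
double bin_center_hz(int bin, double sample_rate, int frame_size)
{
    return bin * sample_rate / frame_size;
}
```

For a sampling rate of 44100 Hz and a frame size of 1024, the spacing is about 43.07 Hz. Channel 8 then lies at 344.53 Hz and channel 512 at 22050 Hz, both of which occur in the listings.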

speech1.SF: AIFF,
            220500 samples,
            baseFrq 261.6 (midi 60),
            sustnLp: mode 0,
            relesLp: mode 0
            audio sr = 44100, monaural
            analysing 220500 sample frames (5.0 secs)
            1024 infrsize, 256 infrInc
            859 output frames estimated

frame: 0 20 40 60 80 100 120 140 160 180 200 220 240 260 280 300
320 340 360 380 400 420
440 460 480 500 520 540 560 580 600 620 640 660 680 700 720 740
760 780 800 820 840 858
859 output frames written
speech1.pv1:dataBsize:            data occupies 3525336 bytes
            dataFormat:                     36
            minFreq:                        0.00
            maxFreq:                        22050.00
            freqFormat, log or lin:         1
            frameFormat:                    7
            mono, stereo, quad:             1
            samplingRate of original audio: 44100
            frameSize:                      1024 samps per frame
            frameIncr:                      256 samps per frame
            frameBsize:                     total of 859 frames

How many frames do you want me to display?         1
What threshold magnitude should I take?            0  <-- ie none
speech1.pv1:dataBsize:               data occupies 3525336 bytes
            dataFormat:                     36
            minFreq:                        0.00
            maxFreq:                        22050.00
            freqFormat, log or lin:         1
            frameFormat:                    7
            mono, stereo, quad:             1
            samplingRate of original audio: 44100
            frameSize:                      1024 samps per frame
            frameIncr:                      256 samps per frame
            frameBsize:                     total of 859 frames

How many frames do you want me to display?       400
What threshold magnitude should I take?            3

Output of 'magnitud',
1st call.

         mag    freq

fr 1:    1.48    0.00           
         1.06   86.13
         0.56   129.20
         0.46   172.27
         0.93   215.33
         1.23   258.40
         1.71   301.46
         1.95   344.53
         1.50   387.60
         0.73   430.66
         0.37   473.73
         0.28   516.80
         0.28   559.86
         0.29   602.93
         0.27   646.00
         0.32   689.06
         0.32   732.13
         0.28   775.20

         etc.   one channel every 43.07 Hz up to sr/2

------ snip ----- snip ---- snip
        

         0.00   21705.47
         0.01   21748.54
         0.01   21791.60
         0.00   21834.67
         0.00   21877.73
         0.01   21920.80
         0.00   21963.87
         0.00   22006.93
         0.00   22050.00

Done: 
1 frame of speech1.pv1

Output of 'magnitud'
2nd call.

         mag    freq

fr 1:
fr 2:    3.52   344.53
fr 3:
fr 4:
fr 5:
fr 6:
fr 7:    3.04   344.53
fr 8:
fr 9:
fr 10:

fr 11:   3.75   344.53
fr 12:   3.75   344.53
fr 13:
fr 14:
fr 15:
fr 16:   3.55   344.53
fr 17:   3.62   344.53
fr 18:
fr 19:
fr 20:

fr 21:   3.66   344.53
fr 22:   3.19   344.53
fr 23:
fr 24:   3.06   215.33
fr 25:   3.43   215.33
         3.60   344.53
fr 26:   3.78   344.53
fr 27:
fr 28:
fr 29:
fr 30:   3.72   344.53

fr 31:   3.13   344.53
fr 32:
fr 33:
fr 34:   3.16   43.07
fr 35:   4.56    0.00
         3.29   43.07
         4.02   344.53
fr 36:   3.08   43.07
         3.57   344.53
fr 37:   3.45    0.00
         3.40   43.07
fr 38:   3.74   43.07
fr 39:   3.71    0.00
         4.54   43.07
fr 40:   4.68    0.00
         4.15   43.07
         3.79   344.53

fr 41:
fr 42:   3.14    0.00
fr 43:
fr 44:   3.11    0.00
         3.61   43.07
fr 45:   4.60    0.00
         3.86   43.07
         3.09   344.53
fr 46:   3.54   43.07
fr 47:   4.31    0.00
         3.62   43.07
fr 48:   3.56   43.07
fr 49:   4.15    0.00
         3.95   43.07
         3.06   344.53
fr 50:   4.36    0.00
         3.71   43.07
         3.02   344.53

fr 51:   3.35   43.07
fr 52:   3.81    0.00
         3.36   43.07
fr 53:
fr 54:   4.29    0.00
         3.42   43.07
         3.89   344.53
fr 55:   4.27    0.00
         3.72   43.07
         3.41   344.53
fr 56:   3.20   43.07
fr 57:   4.16    0.00
         3.37   43.07
fr 58:   3.34   43.07
         3.29   344.53
fr 59:   4.18    0.00
         3.65   43.07
         3.82   344.53
fr 60:   4.60    0.00
         4.18   43.07

fr 61:   3.98   43.07
fr 62:   4.19    0.00
         3.71   43.07
fr 63:   3.04   43.07
fr 64:   4.19    0.00
         3.16   43.07
         3.24   344.53
fr 65:   3.84    0.00
         3.71   43.07
fr 66:   3.76   43.07
fr 67:   4.12    0.00
         3.82   43.07
fr 68:   3.51   43.07
         4.03   344.53
fr 69:   4.69    0.00
         3.62   43.07
         3.60   344.53
fr 70:   3.45    0.00

fr 71:   3.17   43.07
fr 72:   4.00    0.00
         4.00   43.07
fr 73:   3.82   43.07
         4.02   344.53
fr 74:   4.91    0.00
         3.85   43.07
         3.33   344.53
fr 75:   3.24   43.07
fr 76:
fr 77:   3.24    0.00
fr 78:   3.31   344.53
fr 79:   4.51    0.00
         3.80   43.07
fr 80:   3.53   43.07

fr 81:
fr 82:   3.11   344.53
fr 83:   3.27   344.53
fr 84:   3.75    0.00
fr 85:
fr 86:
fr 87:   3.74   344.53
fr 88:
fr 89:   3.39    0.00
         3.08   215.33
fr 90:

fr 91:
fr 92:
fr 93:   3.18   301.46
         3.40   344.53
fr 94:

=============================
etc.  all frames till the end
=============================

------------ snip -----------------------

fr 362:  3.54   344.53
fr 363:  3.18   344.53
fr 364:
fr 365:
fr 366:  3.06   344.53
fr 367:  4.13   344.53
fr 368:  3.27   344.53
fr 369:
fr 370:
fr 371:  3.25   344.53
fr 372:  3.13   344.53
fr 373:
fr 374:
fr 375:
fr 376:  3.81   344.53
fr 377:  3.74   344.53
fr 378:
fr 379:
fr 380:
fr 381:  4.11   344.53
fr 382:  3.57   344.53
fr 383:
fr 384:
fr 385:
fr 386:  3.67   344.53
fr 387:  3.01   344.53
fr 388:
fr 389:
fr 390:
fr 391:  3.02   344.53
fr 392:
fr 393:
fr 394:
fr 395:  3.28   344.53
fr 396:  3.10   344.53
fr 397:
fr 398:
fr 399:
fr 400:  3.62   344.53

Done: 
400 frames of speech1.pv1

From Phase to Frequency

The transformation occurs in the functions UnwrapPhase and PhaseToFrq of the main loop of pvanal. We analyze these two functions step by step.

The program 'wrapped' displays the successive values of certain expressions in the source code per channel. In the discussion below, these expressions will be identified by boldfaced text.

UnwrapPhase

The function loops through one frame of the analysis file. During a pass, the phase difference since the last frame is evaluated and unwrapped. The old phase value is saved in buffer OldPh, then the unwrapped phase is saved in the main buffer tmpbuf. Here is one pass:

First the phase value resulting from the rectangular to polar conversion is assigned to float variable p:

        p = pha[2L*i]                            see phase

Then the phase change since the last frame is computed:

        p -= oldPh[i]                           see diff-p

followed by a call of the macro MMmaskPhs, where p is the Δ phase (the phase difference):

        
        MMmaskPhs(p,z,pi,oneOnPi)

The preprocessor has replaced this macro call by:

        z = (int)(p*oneOnPi);
        p -= pi*(float)((int)((z+((z>=0)?(z&1):-(z&1) ))));

In the first statement, the integer z is assigned the value of float p, scaled by 1/π. By casting float to integer, all decimals are truncated. The second statement is far more complex and can best be approached in a number of small steps. First we look at the expression condit:

        z + ( (z>=0) ? (z&1) : -(z&1) )         see condit

which consists of a test, an evaluation and an addition. The expression

z&1

discriminates between odd and even numbers. In binary representation, all odd integers have their bit 0 set. Therefore, if z is odd, then z&1 = 1. For even integers z, z&1 evaluates to 0 and condit = z. So the conditional test

                   
        z>=0  ?   (z&1): -(z&1)

will only have consequences for odd integers. These can be simplified to the following if-else statement:

        If z>=0, condit = z+1
           else condit = z-1

The net effect is that condit is always even: an odd z is moved away from zero to the next even integer, so that condit = 2*n.
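The behaviour of condit can be checked in isolation. The sketch below wraps the expression in a function of our own naming; it assumes a two's complement machine, where z&1 is 1 for negative odd integers as well.

```c
/* Return z unchanged when it is even; otherwise move it away from zero
   to the next even integer. Mirrors the expression labelled condit. */
int condit(int z)
{
    return z + ((z >= 0) ? (z & 1) : -(z & 1));
}
```

For instance, condit(161) yields 162, condit(-3) yields -4, and condit(-438) stays -438. These pairs can be found in the diff-p and condit columns of the 'wrapped' output, where diff-p is truncated to an integer first.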

Then the masking/unwrapping is completed by

        p = p - pi * condit                     see masked

The value of condit is scaled back by π and added to or subtracted from the Δ phase p. All p are thereby unwrapped into the principal branch of the inverse tangent function, from -π to π.

In the last two statements of UnwrapPhase, the old polar phase and the unwrapped Δ phase p are saved:

        oldPh[i] = pha[2L*i];
        pha[2L*i] = p;

In this way, the loop works through all the phase values of the FFT frame, and then proceeds to PhaseToFrq.
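Putting the steps together, one pass over a frame might look as follows. This is a sketch reconstructed from the description above, not a copy of the dsputil.c source: the names unwrap_phase and mask_phase are ours, the macro is written as a function, and intermediate values are kept in double precision.

```c
#define PI 3.14159265358979323846

/* Fold a phase difference p (in radians) back into the range -PI .. PI. */
static double mask_phase(double p)
{
    int z = (int)(p * (1.0 / PI));                   /* truncate toward zero */
    int even = z + ((z >= 0) ? (z & 1) : -(z & 1));  /* the condit value */
    return p - PI * (double)even;
}

/* pha points at the first phase value of a frame; successive phases are
   two floats apart. oldPh holds one saved phase per channel. */
void unwrap_phase(float *pha, float *oldPh, long size)
{
    long i;
    for (i = 0; i < size; ++i) {
        double p = pha[2L*i];       /* phase from rectangular-to-polar step */
        p -= oldPh[i];              /* phase change since the last frame */
        p = mask_phase(p);          /* unwrap into -PI .. PI */
        oldPh[i] = pha[2L*i];       /* save the old polar phase */
        pha[2L*i] = (float)p;       /* store the unwrapped difference */
    }
}
```

A phase of 2.5 following a saved phase of -2.5 gives a raw difference of 5.0 radians, which the masking step folds to 5.0 - 2π, roughly -1.283.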

Output from 'wrapped'

We show only the first 6 channels out of 17: each channel with successive expression values for 5 frames. The example serves to cross-check one's understanding of the variables in the analyzed source codes.

=== program output ====

The program displays analysis data of pvoc.file
Successful read of: 60_01_2.pv1

Displaying Header Data

dataBsize:                       data occupies 37264 bytes
dataFormat:                      36
samplingRate of original audio:  22050
mono, stereo, quad:              1
frameSize:                       32 audio samples per frame
frameIncr:                       increasing 16 audio samples per frame
frameBsize:                      274 frames in this file
frameFormat:                     7
minFreq:                         0.00
maxFreq:                         11025.00
freqFormat, log or lin:          1

How many frames do you want me to display?             5

**************** Settings **************************
All phase related expressions are scaled by factor π.
srOn2pi = 219.34      eDphIncr =  3.14      frqPerBin = ± 689.06

channel 1:                              -689.06 <  0.00 < 689.06

 phase  diff-p  condit masked expDpha Diff masked local-F glob-F
   0.00    0.00   0    0.00  0.00    0.00   0.00    0.00    0.00
 219.34  219.34  220  -0.66  0.00   -0.66  -0.66 -457.95 -457.95
-219.34 -438.67 -438  -0.67  0.00   -0.67  -0.67 -462.23 -462.23
   0.00  219.34  220  -0.66  0.00   -0.66  -0.66 -457.95 -457.95
 219.34  219.34  220  -0.66  0.00   -0.66  -0.66 -457.95 -457.95

channel 2:                               0.00 < 689.06 < 1378.12

 phase  diff-p  condit masked expDpha Diff masked local-F glob-F
 161.35  161.35  162  -0.65 -1.00    0.35   0.35  244.36  933.42
 266.06  104.71  104   0.71 -1.00    1.71  -0.29 -202.56  486.50
 319.40   53.34   54  -0.66 -1.00    0.34   0.34  234.98  924.04
 315.85   -3.55   -4   0.45 -1.00    1.45  -0.55 -381.35  307.71
 322.95    7.11    8  -0.89 -1.00    0.11   0.11   73.05  762.12

channel 3:                            689.06 < 1378.12 < 2067.19

 phase  diff-p  condit masked expDpha Diff masked local-F glob-F
 315.20  315.20  316  -0.80  0.00   -0.80  -0.80 -554.07  824.06
 339.37   24.17   24   0.17  0.00    0.17   0.17  120.58 1498.71
 288.42  -50.95  -50  -0.95  0.00   -0.95  -0.95 -657.27  720.85
 340.08   51.66   52  -0.34  0.00   -0.34  -0.34 -234.41 1143.71
 341.29    1.21    2  -0.79  0.00   -0.79  -0.79 -541.72  836.40

channel 4:                           1378.12 < 2067.19 < 2756.25

 phase  diff-p  condit masked expDpha Diff masked local-F glob-F
 716.45  716.45  716   0.45  1.00    1.45  -0.55 -375.74 1691.45
 600.35 -116.10 -116  -0.10 -1.00    0.90   0.90  617.56 2684.75 
 680.25   79.90   80  -0.10 -1.00    0.90   0.90  621.59 2688.77
 616.58  -63.68  -64   0.32 -1.00    1.32  -0.68 -466.72 1600.47
 600.84  -15.74  -16   0.26 -1.00    1.26  -0.74 -508.18 1559.01

channel 5:                           2067.19 < 2756.25 < 3445.31

 phase  diff-p  condit masked expDpha Diff masked local-F glob-F
 878.03  878.03  878   0.03  0.00    0.03   0.03   19.44 2775.69
1000.71  122.68  122   0.68  0.00    0.68   0.68  470.24 3226.49
 855.26 -145.45 -146   0.55  0.00    0.55   0.55  377.39 3133.64
 991.26  136.00  136   0.00  0.00    0.00   0.00    0.27 2756.52
1004.12   12.86   12   0.86  0.00    0.86   0.86  592.27 3348.52

channel 6:                           2756.25 < 3445.31 < 4134.37

 phase  diff-p  condit masked expDpha Diff masked local-F glob-F
1060.97 1060.97 1060   0.97 -1.00    1.97  -0.03  -18.81 3426.50
 944.04 -116.93 -116  -0.93 -1.00    0.07   0.07   45.07 3490.39
1044.30  100.27  100   0.27 -1.00    1.27  -0.73 -505.84 2939.47
 935.53 -108.77 -108  -0.77 -1.00    0.23   0.23  155.56 3600.87
 905.72  -29.81  -30   0.19 -1.00    1.19  -0.81 -560.77 2884.55

Done: 6 channels and 5 frames of file 60_01_2.pv1

PhaseToFrq

Like the previous function, PhaseToFrq loops through the number of independent values in one FFT frame.

Since the phase difference is measured at regular frame-increment intervals, its value depends on the window overlap factor. An increment of a whole frame results in a phase difference of 2π, an increment of 1/2 frame in a phase difference of π, and so on.

The corresponding constant is called eDphIncr. In the expression below, an expected phase difference (see expDpha) is subtracted from the unwrapped phase difference.

        p = pha[2L*i]-expectedDphas;                    see Diff

The emerging difference is masked (as in UnwrapPhase) and saved:

                 
        pha[2L*i] = p;                                 see masked

Next, the difference is converted to frequency by

        pha[2L*i] = pha[2L*i] * srOn2pi;              see local-F
        pha[2L*i] = pha[2L*i] + binMidFrq;             see glob-F

In the latter statement the channel's center frequency is added to the local frequency value.
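The conversion can be sketched as a single loop. As before, this is our reconstruction from the text rather than the original source: phase_to_frq and mask_phase are hypothetical names, the three constants are passed in as parameters, and folding expectedDphas with the same masking step at the end of each pass is an assumption of this sketch.

```c
#define PI 3.14159265358979323846

/* Fold a phase value p (in radians) back into the range -PI .. PI. */
static double mask_phase(double p)
{
    int z = (int)(p * (1.0 / PI));
    int even = z + ((z >= 0) ? (z & 1) : -(z & 1));
    return p - PI * (double)even;
}

/* pha points at the first unwrapped phase difference of a frame and is
   overwritten with frequencies in Hz, one per channel. */
void phase_to_frq(float *pha, long size,
                  double srOn2pi, double frqPerBin, double eDphIncr)
{
    double expectedDphas = 0.0;   /* expected phase difference, channel 1 */
    double binMidFrq = 0.0;       /* center frequency, channel 1 */
    long i;
    for (i = 0; i < size; ++i) {
        double p = pha[2L*i] - expectedDphas;   /* Diff */
        p = mask_phase(p);                      /* masked */
        p *= srOn2pi;                           /* local-F */
        pha[2L*i] = (float)(p + binMidFrq);     /* glob-F */
        expectedDphas = mask_phase(expectedDphas + eDphIncr);
        binMidFrq += frqPerBin;                 /* next channel's center */
    }
}
```

With srOn2pi = 22050/(2π*16), frqPerBin = 689.0625 and eDphIncr = π, an unwrapped difference of -0.65π in the second channel comes out near 930 Hz. This is close to the first glob-F entry for channel 2 in the 'wrapped' output, which was computed from the unrounded value.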

In the last three statements the values of expectedDphas and binMidFrq are updated for the next pass. The values for the expected phase difference depend on eDphIncr, but must lie within the range -π to π. In every pass, the variable binMidFrq takes on the next channel's center frequency. The list below facilitates the study of the source code.

Variables, names, meanings, values                     EXAMPLE    
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++    
                                                        
incr = frameIncr = samples between frames                   16   
sampRate = sampling rate of audio file: sr               22050
srOn2pi = sampRate/(2π*incr)                         219.34 Hz
binMidFrq = channel center frequency                  variable
frqPerBin = maximum deviation from center in Hz    ± 689.06 Hz
size = indepVals = independent values in a frame            17
macro: actual(size); replaced by: ((size-1L)*2L),                 
((17-1)*2) thus returns the actual framesize.               32
eDphIncr = 2π*incr/((float)actual(size))              2π*16/32

expectedDphas = expected Δ phase between channels     variable
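The example column can be reproduced with a few lines of C. The struct and function names below are ours; the formula for frqPerBin (sampling rate divided by the actual frame size) is inferred from the example value 689.06 and is not quoted from the source.

```c
#define PI 3.14159265358979323846

typedef struct {
    double srOn2pi;    /* scales a phase difference to Hz */
    double frqPerBin;  /* channel spacing in Hz (assumed: sampRate/actual) */
    double eDphIncr;   /* expected phase increment between channels */
} PvConstants;

/* size is the number of independent values in a frame, incr the number of
   samples between frames. Hypothetical helper, not part of pvanal. */
PvConstants pv_constants(double sampRate, long incr, long size)
{
    long actual = (size - 1L) * 2L;   /* the actual(size) macro */
    PvConstants c;
    c.srOn2pi   = sampRate / (2.0 * PI * (double)incr);
    c.frqPerBin = sampRate / (double)actual;
    c.eDphIncr  = 2.0 * PI * (double)incr / (double)actual;
    return c;
}
```

Calling pv_constants(22050.0, 16, 17) gives values that round to 219.34, 689.06 and 3.14, matching the Settings line printed by 'wrapped'.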

60_22_1

additional parameters: none

The instruments of subgroup 22 explore the manipulation of PVOC's ktimpnt input. The variable ktimpnt signifies a point in time of the analysis file.

Instrument 1 shows how to resynthesize the original soundfile (a santur). For idur = 5 seconds, LINE produces a linear set of values that will let PVOC progress through the analysis file at the original speed.

In instrument 2, LINE will have PVOC resynthesize the analysis file backwards. Choosing durations that differ from the original soundfile duration results in time-stretching or time-compression of the analysis file, without altering the pitch.
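The pointer values that a linear ramp produces can be written out in C to make the forward and backward cases concrete. This sketch only models the arithmetic of a straight line between two time points; the function name and parameters are ours, and it makes no claim about how LINE or PVOC are implemented.

```c
/* Time pointer (in seconds) into the analysis file at note time t, for a
   note of length idur whose pointer moves linearly from `from` to `to`. */
double line_pointer(double t, double idur, double from, double to)
{
    return from + (to - from) * (t / idur);
}
```

With from = 0 and to = 5 over idur = 5, the pointer equals the note time, which corresponds to resynthesis at the original speed. Swapping from and to runs the file backwards, and keeping the same end points over idur = 10 reaches the middle of the file only after 5 seconds, which corresponds to stretching by a factor of two.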

[flowchart]

Orchestra and Score

WAV and mp3

60_22_1B

additional parameters:

The pointer into this speech analysis file follows the values produced by LINSEG. Resynthesis proceeds in forward motion. The various slopes determine the amount of time-stretching or compression. With regard to these effects, the slope pattern results in a subtle modulation of the original speech inflections.

[flowchart]

Orchestra and Score

WAV and mp3

60_22_1C

additional parameters:

This variation shows us a giant magnification of a time fragment of 40 ms: an audio microscope!

The original soundfile featured a finger snap happening at time 0.71 seconds. LINE frames the beginning and ending time points for this instrument so that it captures the original 'snap'. Since the note duration is 10 seconds in the example, the snap has been blown up by a factor of 250.
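The magnification follows directly from the ratio of note duration to the length of the analysed fragment. In the small check below, the function name is ours and the end points 0.70 and 0.74 seconds are hypothetical; the text only states that the fragment lasts 40 ms and contains the snap at 0.71 seconds.

```c
/* Ratio between the note duration and the span of analysis time that the
   pointer covers. Values above 1 mean time-stretching. */
double stretch_factor(double note_dur, double ptr_start, double ptr_end)
{
    double span = ptr_end - ptr_start;
    if (span < 0.0)
        span = -span;             /* backward motion stretches equally */
    return note_dur / span;
}
```

A 10 second note over a 40 ms span gives a factor of about 250, as stated above.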

[flowchart]

Orchestra and Score

WAV and mp3

60_22_2

additional parameters:

Here EXPON directs the pace of resynthesis of the santur analysis file.

Instrument 1 moves forward and at growing speed through santur1.pv1, while instrument 2 resynthesizes backwards and slowing down.

The variable ifildur is set to the duration of the original audio file.

[flowchart]

Orchestra and Score

WAV and mp3

60_22_3

additional parameters:

Here we find an experimental design where the pointer is made to oscillate through the santur1.pv1 analysis file.

The oscillator controlling the pointer is set to 1/4 Hz and has a phase offset of 3/2 π. By adding the constant 2.5, this signal slowly oscillates between times 2.25 and 2.75, and in total completes a bit more than one cycle during the note duration of 5 seconds.

[flowchart]
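The figures quoted for the oscillating pointer can be verified with simple arithmetic. The sketch assumes an oscillator depth of 0.25 seconds, which is inferred from the stated range of 2.25 to 2.75 and not given explicitly in the text; the struct and function names are ours.

```c
typedef struct {
    double lo;      /* earliest time point reached, in seconds */
    double hi;      /* latest time point reached, in seconds */
    double cycles;  /* oscillator cycles completed during the note */
} LfoPointerInfo;

/* center: constant added to the oscillator; depth: oscillator amplitude
   (assumed); rate_hz: oscillator frequency; note_dur: note length. */
LfoPointerInfo lfo_pointer_info(double center, double depth,
                                double rate_hz, double note_dur)
{
    LfoPointerInfo r;
    r.lo = center - depth;
    r.hi = center + depth;
    r.cycles = rate_hz * note_dur;
    return r;
}
```

For center = 2.5, depth = 0.25, rate_hz = 0.25 and note_dur = 5, the pointer stays between 2.25 and 2.75 and completes 1.25 cycles, which is the "bit more than one cycle" mentioned above.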

Orchestra and Score

WAV and mp3

60_23_1

additional parameters:

In this subgroup, we focus on the second k-rate variable of PVOC: kfmod. The pointer ktimpnt is neutralized by a simple linear resynthesis control.

The first example, again using the santur analysis file, demonstrates the effect of an EXPON control signal, whose target value is variable.

[flowchart]

Orchestra and Score

WAV and mp3

60_23_2

additional parameters:

As in the previous example, some care needs to be taken in order to convert the pitch of the note into a value suitable for transposition manipulations. Values in the neighbourhood of unity are required. Specific values will vary with the approximate fundamental frequency of the sound(s) in the analysis file. Then, multiplication of EXPON with ifsc will produce the desired pitch modifications.
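One way to obtain a value near unity is to divide the desired pitch by the approximate fundamental of the analysed sound. The sketch below shows only this ratio. The function name is ours, and treating the result as the scaling factor the text calls ifsc is our assumption, since the orchestra is not reproduced here.

```c
/* Transposition ratio that moves a sound with fundamental source_hz to
   target_hz. A ratio of 1.0 leaves the pitch unchanged. */
double transposition_ratio(double target_hz, double source_hz)
{
    return target_hz / source_hz;
}
```

Purely as an illustration, a source fundamental of 261.6 Hz and a target of 523.2 Hz give a ratio of about 2, an octave up; equal source and target give exactly 1.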

[flowchart]