Home > voicebox > fxrapt.m

fxrapt

PURPOSE ^

FXRAPT RAPT pitch tracker [FX,VUV]=(S,FS)

SYNOPSIS ^

function [fx,tt]=fxrapt(s,fs,mode)

DESCRIPTION ^

FXRAPT RAPT pitch tracker [FX,VUV]=(S,FS)

 Input:   s(ns)      Speech signal
          fs         Sample frequency (Hz)
          mode       'g' will plot a graph [default if no output arguments]
                     'u' will include unvoiced fames (with fx=NaN)

 Outputs: fx(nframe)     Larynx frequency for each fram,e (or NaN for silent/unvoiced)
          tt(nframe,3)  Start and end samples of each frame. tt(*,3)=1 at the start of each talk spurt

 Plots a graph if no outputs are specified showing lag candidates and selected path

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SUBFUNCTIONS ^

SOURCE CODE ^

0001 function [fx,tt]=fxrapt(s,fs,mode)
0002 %FXRAPT RAPT pitch tracker [FX,VUV]=(S,FS)
0003 %
0004 % Input:   s(ns)      Speech signal
0005 %          fs         Sample frequency (Hz)
0006 %          mode       'g' will plot a graph [default if no output arguments]
0007 %                     'u' will include unvoiced fames (with fx=NaN)
0008 %
0009 % Outputs: fx(nframe)     Larynx frequency for each fram,e (or NaN for silent/unvoiced)
0010 %          tt(nframe,3)  Start and end samples of each frame. tt(*,3)=1 at the start of each talk spurt
0011 %
0012 % Plots a graph if no outputs are specified showing lag candidates and selected path
0013 %
0014 
0015 % Bugs/Suggestions:
0016 %   (1) Include backward DP pass and output the true cost for each candidate.
0017 %   (2) Add an extra state to distinguish between voiceless and silent
0018 %   (3) N-best DP to allow longer term penalties (e.g. for frequent pitch doubling/halving)
0019 
0020 % The algorithm is taken from [1] with the following differences:
0021 %
0022 %      (a)  the factor AFACT which in the Talkin algorithm corresponds roughly
0023 %           to the absolute level of harmonic noise in the correlation window. This value
0024 %           is here calculated as the maximum of three figures:
0025 %                   (i) an absolute floor set by PP.rapt_absnoise
0026 %                  (ii) a multiple of the peak signal set by PP.rapt_signoise
0027 %                 (iii) a multiple of the noise floor set by PP.rapt_relnoise
0028 %      (b) The LPC used in calculating the Itakura distance uses a Hamming window rather than
0029 %          a Hanning window.
0030 %
0031 % A C implementation of this algorithm by Derek Lin and David Talkin is included as  "get_f0.c"
0032 % in the esps.zip package available from http://www.speech.kth.se/esps/esps.zip under the BSD
0033 % license.
0034 %
0035 % Refs:
0036 %      [1]   D. Talkin, "A Robust Algorithm for Pitch Tracking (RAPT)"
0037 %            in "Speech Coding & Synthesis", W B Kleijn, K K Paliwal eds,
0038 %            Elsevier ISBN 0444821694, 1995
0039 
0040 %      Copyright (C) Mike Brookes 2006
0041 %      Version: $Id: fxrapt.m 713 2011-10-16 14:45:43Z dmb $
0042 %
0043 %   VOICEBOX is a MATLAB toolbox for speech processing.
0044 %   Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
0045 %
0046 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0047 %   This program is free software; you can redistribute it and/or modify
0048 %   it under the terms of the GNU General Public License as published by
0049 %   the Free Software Foundation; either version 2 of the License, or
0050 %   (at your option) any later version.
0051 %
0052 %   This program is distributed in the hope that it will be useful,
0053 %   but WITHOUT ANY WARRANTY; without even the implied warranty of
0054 %   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
0055 %   GNU General Public License for more details.
0056 %
0057 %   You can obtain a copy of the GNU General Public License from
0058 %   http://www.gnu.org/copyleft/gpl.html or by writing to
0059 %   Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA.
0060 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0061 
0062 s=s(:); % force s to be a column
0063 if nargin<3
0064     mode=' ';
0065 end
0066 doback=0;   % don't do backwards DP for now
0067 
0068 % read in parameters
0069 
0070 PP=voicebox;
0071 f0min=PP.rapt_f0min;            % Min F0 (Hz)                               [50]
0072 f0max=PP.rapt_f0max;            % Max F0 (Hz)                               [500]
0073 tframe=PP.rapt_tframe;          % frame size (s)                            [0.01]
0074 tlpw=PP.rapt_tlpw;              % low pass filter window size (s)           [0.005]
0075 tcorw=PP.rapt_tcorw;            % correlation window size (s)               [0.0075]
0076 candtr=PP.rapt_candtr;          % minimum peak in NCCF                      [0.3]
0077 lagwt=PP.rapt_lagwt;            % linear lag taper factor                   [0.3]
0078 freqwt=PP.rapt_freqwt;          % cost factor for F0 change                 [0.02]
0079 vtranc=PP.rapt_vtranc;          % fixed voice-state transition cost         [0.005]
0080 vtrac=PP.rapt_vtrac;            % delta amplitude modulated transition cost [0.5]
0081 vtrsc=PP.rapt_vtrsc;            % delta spectrum modulated transition cost  [0.5]
0082 vobias=PP.rapt_vobias;          % bias to encourage voiced hypotheses       [0.0]
0083 doublec=PP.rapt_doublec;        % cost of exact doubling or halving         [0.35]
0084 absnoise=PP.rapt_absnoise;      % absolute rms noise level                  [0]
0085 relnoise=PP.rapt_relnoise;      % rms noise level relative to noise floor   [2.0]
0086 signoise=PP.rapt_signoise;      % ratio of peak signal rms to noise floor   [0.001]
0087 ncands=PP.rapt_ncands;          % max hypotheses at each frame              [20]
0088 trms=PP.rapt_trms;              % window length for rms measurement         [0.03]
0089 dtrms=PP.rapt_dtrms;            % window spacing for rms measurement        [0.02]
0090 preemph=PP.rapt_preemph;        % s-plane position of preemphasis zero      [-7000]
0091 nfullag=PP.rapt_nfullag;        % number of full lags to try (must be odd)  [7]
0092 
0093 % derived parameters (mostly dependent on sample rate fs)
0094 
0095 krms=round(trms*fs);            % window length for rms measurement
0096 kdrms=round(dtrms*fs);          % window spacing for rms measurement
0097 rmswin=hanning(krms).^2;
0098 kdsmp=round(0.25*fs/f0max);
0099 hlpw=round(tlpw*fs/2);          % force window to be an odd length
0100 blp=sinc((-hlpw:hlpw)/kdsmp).*hamming(2*hlpw+1).';
0101 fsd=fs/kdsmp;
0102 kframed=round(fsd*tframe);      % downsampled frame length
0103 kframe=kframed*kdsmp;           % frame increment at full rate
0104 rmsix=(1:krms)+floor((kdrms-kframe)/2); % rms index according to Talkin; better=(1:krms)+floor((kdrms-krms+1)/2)
0105 minlag=ceil(fsd/f0max);
0106 maxlag=round(fsd/f0min);        % use round() only because that is what Talkin does
0107 kcorwd=round(fsd*tcorw);        % downsampled correlation window
0108 kcorw=kcorwd*kdsmp;             % full rate correlation window
0109 spoff=max(hlpw-floor(kdsmp/2),1+kdrms-rmsix(1)-kframe);  % offset for first speech frame at full rate
0110 sfoff=spoff-hlpw+floor(kdsmp/2); % offset for downsampling filter
0111 sfi=1:kcorwd;                   % initial decimated correlation window index array
0112 sfhi=1:kcorw;                   % initial correlation window index array
0113 sfj=1:kcorwd+maxlag;
0114 sfmi=repmat((minlag:maxlag)',1,kcorwd)+repmat(sfi,maxlag-minlag+1,1);
0115 lagoff=(minlag-1)*kdsmp;        % lag offset when converting to high sample rate
0116 beta=lagwt*f0min/fs;            % bias towards low lags
0117 log2=log(2);
0118 lpcord=2+round(fs/1000);        % lpc order for itakura distance
0119 hnfullag=floor(nfullag/2);
0120 jumprat=exp((doublec+log2)/2);  % lag ratio at which octave jump cost is lowest
0121 ssq=s.^2;
0122 csssq=cumsum(ssq);
0123 sqrt(min(csssq(kcorw+1:end)-csssq(1:end-kcorw))/kcorw);
0124 afact=max([absnoise^2,max(ssq)*signoise^2,min(csssq(kcorw+1:end)-csssq(1:end-kcorw))*(relnoise/kcorw)^2])^2*kcorw^2;
0125 
0126 % downsample signal to approx 2 kHz to speed up autocorrelation calculation
0127 % kdsmp is the downsample factor
0128 
0129 sf=filter(blp/sum(blp),1,s(sfoff+1:end));
0130 sp=filter([1 exp(preemph/fs)],1,s); % preemphasised speech for LPC calculation
0131 sf(1:length(blp)-1)=[];         % remove startup transient
0132 sf=sf(1:kdsmp:end);             % downsample to =~2kHz
0133 nsf=length(sf);                 % length of downsampled speech
0134 ns=length(s);                   % length of full rate speech
0135 
0136 % Calculate the frame limit to ensure we don't run off the end of the speech or decimated speech:
0137 %   (a) For decimated autocorrelation when calculating sff():  (nframe-1)*kframed+kcorwd+maxlag <= nsf
0138 %   (b) For full rate autocorrelation when calculating sfh():  max(fho)+kcorw+maxlag*kdsamp+hnfllag <= ns
0139 %   (c) For rms ratio window when calculating rr            :  max(fho)+rmsix(end) <= ns
0140 % where max(fho) = (nframe-1)*kframe + spoff
0141 
0142 nframe=floor(1+min((nsf-kcorwd-maxlag)/kframed,(ns-spoff-max(kcorw-maxlag*kdsmp-hnfullag,rmsix(end)))/kframe));
0143 
0144 % now search for autocorrelation peaks in the downsampled signal
0145 
0146 cost=zeros(nframe,ncands);      % cumulative cost
0147 prev=zeros(nframe,ncands);      % traceback pointer
0148 mcands=zeros(nframe,1);         % number of actual candidates excluding voiceless
0149 lagval=repmat(NaN,nframe,ncands-1);    % lag of each voiced candidate
0150 tv=zeros(nframe,3);             % diagnostics: 1=voiceless cost, 2=min voiced cost, 3:cumulative voiceless-min voiced
0151 if doback
0152     costms=cell(nframe,1);
0153 end
0154 
0155 % Main processing loop for each 10 ms frame
0156 
0157 for iframe=1:nframe       % loop for each frame (~10 ms)
0158 
0159     % Find peaks in the normalized autocorrelation of subsampled (2Khz) speech
0160     % only keep peaks that are > 30% of highest peak
0161 
0162     fho=(iframe-1)*kframe+spoff;
0163     sff=sf((iframe-1)*kframed+sfj);
0164     sffdc=mean(sff(sfi));       % mean of initial correlation window length
0165     sff=sff-sffdc;              % subtract off the mean
0166     nccfd=normxcor(sff(1:kcorwd),sff(minlag+1:end));
0167     [ipkd,vpkd]=findpeaks(nccfd,'q');
0168 
0169     % Debugging: execute the line below to plot the autocorrelation peaks.
0170     % findpeaks(nccfd,'q'); xlabel(sprintf('Lag = (x+%d)*%g ms',minlag-1,1000*kdsmp/fs)); ylabel('Normalized Cross Correlation'); title (sprintf('Frame %d/%d',iframe,nframe));
0171 
0172     vipkd=[vpkd ipkd];
0173     vipkd(vpkd<max(vpkd)*candtr,:)=[];          % eliminate peaks that are small
0174     if size(vipkd,1)
0175         if size(vipkd,1)>ncands-1
0176             vipkd=sortrows(vipkd);
0177             vipkd(1:size(vipkd,1)-ncands+1,:)=[];   % eliminate lowest to leave only ncands-1
0178         end
0179         lagcan=round(vipkd(:,2)*kdsmp+lagoff);        % convert the lag candidate values to the full sample rate
0180         nlcan=length(lagcan);
0181     else
0182         nlcan=0;
0183     end
0184 
0185     % If there are any candidate lag values (nlcan>0) then refine their accuracy at the full sample rate
0186 
0187     if nlcan
0188         laglist=reshape(repmat(lagcan(:)',nfullag,1)+repmat((-hnfullag:hnfullag)',1,nlcan),nfullag*nlcan,1);
0189         sfh=s(fho+(1:kcorw+max(lagcan)+hnfullag));
0190         sfhdc=mean(sfh(sfhi));
0191         sfh=sfh-sfhdc;
0192         e0=sum(sfh(sfhi).^2);                     % energy of initial correlation window (only needed to store in tv(:,6)
0193         lagl2=repmat(lagcan(:)',nfullag+kcorw-1,1)+repmat((1-hnfullag:hnfullag+kcorw)',1,nlcan);
0194         nccf=normxcor(sfh(1:kcorw),sfh(lagl2),afact);
0195 
0196         [maxcc,maxcci]=max(nccf,[],1);
0197         vipk=[maxcc(:) lagcan(:)+maxcci(:)-hnfullag-1];
0198         vipk=vipk(:,[1 2 2]);
0199         maxccj=maxcci(:)'+nfullag*(0:nlcan-1);    % vector index into nccf array
0200         msk=mod(maxcci,nfullag-1)~=1 & 2*nccf(maxccj)-nccf(mod(maxccj-2,nfullag*nlcan)+1)-nccf(mod(maxccj,nfullag*nlcan)+1)>0;  % don't do quadratic interpolation for the end ones
0201         if any(msk)
0202             maxccj=maxccj(msk);
0203             vipk(msk,3)=vipk(msk,3)+(nccf(maxccj+1)-nccf(maxccj-1))'./(2*(2*nccf(maxccj)-nccf(maxccj-1)-nccf(maxccj+1)))';
0204         end
0205         vipk(maxcc<max(maxcc)*candtr,:)=[];          % eliminate peaks that are small
0206         if size(vipk,1)>ncands-1
0207             vipk=sortrows(vipk);
0208             vipk(1:size(vipk,1)-ncands+1,:)=[];   % eliminate lowest to leave only ncands-1
0209         end
0210 
0211         % vipk(:,1) has NCCF value, vipk(:,2) has integer peak position, vipk(:,3) has refined peak position
0212 
0213         mc=size(vipk,1);
0214     else
0215         mc=0;
0216     end
0217 
0218     % We now have mc lag candidates at the full sample rate
0219 
0220     mc1=mc+1;               % total number of candidates including "unvoiced" possibility
0221     mcands(iframe)=mc;      % save number of lag candidates (needed for pitch consistency cost calculation)
0222     if mc
0223         lagval(iframe,1:mc)=vipk(:,3)';
0224         cost(iframe,1)=vobias+max(vipk(:,1));   % voiceless cost
0225         cost(iframe,2:mc1)=1-vipk(:,1)'.*(1-beta*vipk(:,3)');   % local voiced costs
0226         tv(iframe,2)=min(cost(iframe,2:mc1));
0227     else
0228         cost(iframe,1)=vobias;          % if no lag candidates (mc=0), then the voiceless case is the only possibility
0229     end
0230     tv(iframe,1)=cost(iframe,1);
0231     if iframe>1                         % if it is not the first frame, then calculate pitch consistency and v/uv transition costs
0232         mcp=mcands(iframe-1);
0233         costm=zeros(mcp+1,mc1);         % cost matrix: rows and cols correspond to candidates in previous and current frames (incl voiceless)
0234 
0235         % if both frames have at least one lag candidate, then calculate a pitch consistency cost
0236 
0237         if mc*mcp
0238             lrat=abs(log(repmat(lagval(iframe,1:mc),mcp,1)./repmat(lagval(iframe-1,1:mcp)',1,mc)));
0239             costm(2:end,2:end)=freqwt*min(lrat,doublec+abs(lrat-log2));  % allow pitch doubling/halving
0240         end
0241 
0242         % if either frame has a lag candidate, then calculate the cost of voiced/voiceless transition and vice versa
0243 
0244         if mc+mcp
0245             rr=sqrt((rmswin'*s(fho+rmsix).^2)/(rmswin'*s(fho+rmsix-kdrms).^2)); % amplitude "gradient"
0246             ss=0.2/(distitar(lpcauto(sp(fho+rmsix),lpcord),lpcauto(sp(fho+rmsix-kdrms),lpcord),'e')-0.8);   % Spectral stationarity: note: Talkin uses Hanning instead of Hamming windows for LPC
0247             costm(1,2:end)= vtranc+vtrsc*ss+vtrac/rr;   % voiceless -> voiced cost
0248             costm(2:end,1)= vtranc+vtrsc*ss+vtrac*rr;
0249             tv(iframe,4:5)=[costm(1,mc1) costm(mcp+1,1)];
0250         end
0251         costm=costm+repmat(cost(iframe-1,1:mcp+1)',1,mc1);  % add in cumulative costs
0252         [costi,previ]=min(costm,[],1);
0253         cost(iframe,1:mc1)=cost(iframe,1:mc1)+costi;
0254         prev(iframe,1:mc1)=previ;
0255     else                            % first ever frame
0256         costm=zeros(1,mc1); % create a cost matrix in case doing a backward recursion
0257     end
0258     if mc
0259         tv(iframe,3)=cost(iframe,1)-min(cost(iframe,2:mc1));
0260         tv(iframe,6)=5*log10(e0*e0/afact);
0261     end
0262     if doback
0263         costms{iframe}=costm; % need to add repmatted cost into this
0264     end
0265 end
0266 
0267 % now do traceback
0268 
0269 best=zeros(nframe,1);
0270 [cbest,best(nframe)]=min(cost(nframe,1:mcands(nframe)+1));
0271 for i=nframe:-1:2
0272     best(i-1)=prev(i,best(i));
0273 end
0274 vix=find(best>1);
0275 fx=repmat(NaN,nframe,1);                        % unvoiced frames will be NaN
0276 fx(vix)=fs*lagval(vix+nframe*(best(vix)-2)).^(-1); % leave as NaN if unvoiced
0277 tt=zeros(nframe,3);
0278 tt(:,1)=(1:nframe)'*kframe+spoff;       % find frame times
0279 tt(:,2)=tt(:,1)+kframe-1;
0280 jratm=(jumprat+1/jumprat)/2;
0281 tt(2:end,3)=abs(fx(2:end)./fx(1:end-1)-jratm)>jumprat-jratm;    % new spurt if frequency ratio is outside (1/jumprat,jumprat)
0282 tt(1,3)=1;           % first frame always starts a spurt
0283 tt(1+find(isnan(fx(1:end-1))),3)=1; % NaN always forces a new spurt
0284 
0285 % plot results if there are no output arguments of if the 'g' mode option is specified
0286 
0287 if ~nargout || any(mode=='g')
0288     tf=spoff+(0:nframe-1)'*kframe;      % one sample before start of each frame
0289     blag=repmat(NaN,nframe,1);                        % unvoiced frames will be NaN
0290     blag(vix)=lagval(vix+nframe*(best(vix)-2)); % leave as NaN if unvoiced
0291     ts=(1:ns)/fs;                       % time scale for speech samples
0292     tsa=[1:tf(1) tf(end)+kframe+1:ns];  % indexes for unprocessed speech [-1 term is an error methinks]
0293     sup=repmat(NaN,ns,1);               % unprocessed speech - plot in black
0294     sup(tsa)=s(tsa);
0295     sv=reshape(s(tf(1)+1:tf(end)+kframe),kframe,nframe);               % processed speech
0296     su=sv;
0297     su(:,best>1)=NaN;                   % delete all voiced samples
0298     sv(:,best==1)=NaN;                  % delete all unvoiced samples
0299     tsuv=(tf(1)+1:tf(end)+kframe)/fs;
0300     su=su(:);
0301     sv=sv(:);
0302     ax=zeros(2,1);
0303     ax(1)=subplot(211);
0304     plot(ts,sup,'-k',tsuv,su,'r-',tsuv,sv,'b-');
0305     title('Speech');
0306     ax(2)=subplot(212);
0307     plot((tf+(kframe+1)/2)/fs,lagval*1000/fs,'xr',(tf+(kframe+1)/2)/fs,blag*1000/fs,'-b')
0308     xlabel('Time (s)');
0309     ylabel('Period (ms)');
0310     title('Lag Candidates');
0311     linkaxes(ax,'x');
0312 end
0313 if ~any(mode=='u')
0314 tt(isnan(fx),:)=[];    % remove NaN spurts
0315 fx(isnan(fx),:)=[];
0316 end
0317 
0318 
0319 
0320 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0321 
0322 function v=normxcor(x,y,d)
0323 % Calculate the normalized cross correlation of column vectors x and y
0324 % we can calculate this in two ways but fft is much faster even for nx small
0325 % We must have nx<=ny and the output length is ny-nx+1
0326 % note that this routine does not do mean subtraction even though this is normally a good idea
0327 % if y is a matrix, we correlate with each column
0328 % d is a constant added onto the normalization factor
0329 % v(j)=x'*yj/sqrt(d + x'*x * yj'*yj) where yj=y(j:j+nx-1) for j=1:ny-nx+1
0330 
0331 if nargin<3
0332     d=0;
0333 end
0334 nx=length(x);
0335 [ny,my]=size(y);
0336 nv=1+ny-nx;
0337 if nx>ny
0338     error('second argument is shorter than the first');
0339 end
0340 
0341 nf=pow2(nextpow2(ny));
0342 w=irfft(repmat(conj(rfft(x,nf,1)),1,my).*rfft(y,nf,1));
0343 s=zeros(ny+1,my);
0344 s(2:end,:)=cumsum(y.^2,1);
0345 v=w(1:nv,:)./sqrt(d+(x'*x).*(s(nx+1:end,:)-s(1:end-nx,:)));

Generated on Wed 19-Jun-2013 18:51:26 by m2html © 2003