


FXRAPT RAPT pitch tracker [FX,VUV]=(S,FS)
Input: s(ns) Speech signal
fs Sample frequency (Hz)
mode 'g' will plot a graph [default if no output arguments]
'u' will include unvoiced fames (with fx=NaN)
Outputs: fx(nframe) Larynx frequency for each fram,e (or NaN for silent/unvoiced)
tt(nframe,3) Start and end samples of each frame. tt(*,3)=1 at the start of each talk spurt
Plots a graph if no outputs are specified showing lag candidates and selected path


0001 function [fx,tt]=fxrapt(s,fs,mode) 0002 %FXRAPT RAPT pitch tracker [FX,VUV]=(S,FS) 0003 % 0004 % Input: s(ns) Speech signal 0005 % fs Sample frequency (Hz) 0006 % mode 'g' will plot a graph [default if no output arguments] 0007 % 'u' will include unvoiced fames (with fx=NaN) 0008 % 0009 % Outputs: fx(nframe) Larynx frequency for each fram,e (or NaN for silent/unvoiced) 0010 % tt(nframe,3) Start and end samples of each frame. tt(*,3)=1 at the start of each talk spurt 0011 % 0012 % Plots a graph if no outputs are specified showing lag candidates and selected path 0013 % 0014 0015 % Bugs/Suggestions: 0016 % (1) Include backward DP pass and output the true cost for each candidate. 0017 % (2) Add an extra state to distinguish between voiceless and silent 0018 % (3) N-best DP to allow longer term penalties (e.g. for frequent pitch doubling/halving) 0019 0020 % The algorithm is taken from [1] with the following differences: 0021 % 0022 % (a) the factor AFACT which in the Talkin algorithm corresponds roughly 0023 % to the absolute level of harmonic noise in the correlation window. This value 0024 % is here calculated as the maximum of three figures: 0025 % (i) an absolute floor set by PP.rapt_absnoise 0026 % (ii) a multiple of the peak signal set by PP.rapt_signoise 0027 % (iii) a multiple of the noise floor set by PP.rapt_relnoise 0028 % (b) The LPC used in calculating the Itakura distance uses a Hamming window rather than 0029 % a Hanning window. 0030 % 0031 % A C implementation of this algorithm by Derek Lin and David Talkin is included as "get_f0.c" 0032 % in the esps.zip package available from http://www.speech.kth.se/esps/esps.zip under the BSD 0033 % license. 0034 % 0035 % Refs: 0036 % [1] D. Talkin, "A Robust Algorithm for Pitch Tracking (RAPT)" 0037 % in "Speech Coding & Synthesis", W B Kleijn, K K Paliwal eds, 0038 % Elsevier ISBN 0444821694, 1995 0039 0040 % Copyright (C) Mike Brookes 2006 0041 % Version: $Id: fxrapt.m 713 2011-10-16 14:45:43Z dmb $ 0042 % 0043 % VOICEBOX is a MATLAB toolbox for speech processing. 0044 % Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html 0045 % 0046 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0047 % This program is free software; you can redistribute it and/or modify 0048 % it under the terms of the GNU General Public License as published by 0049 % the Free Software Foundation; either version 2 of the License, or 0050 % (at your option) any later version. 0051 % 0052 % This program is distributed in the hope that it will be useful, 0053 % but WITHOUT ANY WARRANTY; without even the implied warranty of 0054 % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 0055 % GNU General Public License for more details. 0056 % 0057 % You can obtain a copy of the GNU General Public License from 0058 % http://www.gnu.org/copyleft/gpl.html or by writing to 0059 % Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA. 0060 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0061 0062 s=s(:); % force s to be a column 0063 if nargin<3 0064 mode=' '; 0065 end 0066 doback=0; % don't do backwards DP for now 0067 0068 % read in parameters 0069 0070 PP=voicebox; 0071 f0min=PP.rapt_f0min; % Min F0 (Hz) [50] 0072 f0max=PP.rapt_f0max; % Max F0 (Hz) [500] 0073 tframe=PP.rapt_tframe; % frame size (s) [0.01] 0074 tlpw=PP.rapt_tlpw; % low pass filter window size (s) [0.005] 0075 tcorw=PP.rapt_tcorw; % correlation window size (s) [0.0075] 0076 candtr=PP.rapt_candtr; % minimum peak in NCCF [0.3] 0077 lagwt=PP.rapt_lagwt; % linear lag taper factor [0.3] 0078 freqwt=PP.rapt_freqwt; % cost factor for F0 change [0.02] 0079 vtranc=PP.rapt_vtranc; % fixed voice-state transition cost [0.005] 0080 vtrac=PP.rapt_vtrac; % delta amplitude modulated transition cost [0.5] 0081 vtrsc=PP.rapt_vtrsc; % delta spectrum modulated transition cost [0.5] 0082 vobias=PP.rapt_vobias; % bias to encourage voiced hypotheses [0.0] 0083 doublec=PP.rapt_doublec; % cost of exact doubling or halving [0.35] 0084 absnoise=PP.rapt_absnoise; % absolute rms noise level [0] 0085 relnoise=PP.rapt_relnoise; % rms noise level relative to noise floor [2.0] 0086 signoise=PP.rapt_signoise; % ratio of peak signal rms to noise floor [0.001] 0087 ncands=PP.rapt_ncands; % max hypotheses at each frame [20] 0088 trms=PP.rapt_trms; % window length for rms measurement [0.03] 0089 dtrms=PP.rapt_dtrms; % window spacing for rms measurement [0.02] 0090 preemph=PP.rapt_preemph; % s-plane position of preemphasis zero [-7000] 0091 nfullag=PP.rapt_nfullag; % number of full lags to try (must be odd) [7] 0092 0093 % derived parameters (mostly dependent on sample rate fs) 0094 0095 krms=round(trms*fs); % window length for rms measurement 0096 kdrms=round(dtrms*fs); % window spacing for rms measurement 0097 rmswin=hanning(krms).^2; 0098 kdsmp=round(0.25*fs/f0max); 0099 hlpw=round(tlpw*fs/2); % force window to be an odd length 0100 blp=sinc((-hlpw:hlpw)/kdsmp).*hamming(2*hlpw+1).'; 0101 fsd=fs/kdsmp; 0102 kframed=round(fsd*tframe); % downsampled frame length 0103 kframe=kframed*kdsmp; % frame increment at full rate 0104 rmsix=(1:krms)+floor((kdrms-kframe)/2); % rms index according to Talkin; better=(1:krms)+floor((kdrms-krms+1)/2) 0105 minlag=ceil(fsd/f0max); 0106 maxlag=round(fsd/f0min); % use round() only because that is what Talkin does 0107 kcorwd=round(fsd*tcorw); % downsampled correlation window 0108 kcorw=kcorwd*kdsmp; % full rate correlation window 0109 spoff=max(hlpw-floor(kdsmp/2),1+kdrms-rmsix(1)-kframe); % offset for first speech frame at full rate 0110 sfoff=spoff-hlpw+floor(kdsmp/2); % offset for downsampling filter 0111 sfi=1:kcorwd; % initial decimated correlation window index array 0112 sfhi=1:kcorw; % initial correlation window index array 0113 sfj=1:kcorwd+maxlag; 0114 sfmi=repmat((minlag:maxlag)',1,kcorwd)+repmat(sfi,maxlag-minlag+1,1); 0115 lagoff=(minlag-1)*kdsmp; % lag offset when converting to high sample rate 0116 beta=lagwt*f0min/fs; % bias towards low lags 0117 log2=log(2); 0118 lpcord=2+round(fs/1000); % lpc order for itakura distance 0119 hnfullag=floor(nfullag/2); 0120 jumprat=exp((doublec+log2)/2); % lag ratio at which octave jump cost is lowest 0121 ssq=s.^2; 0122 csssq=cumsum(ssq); 0123 sqrt(min(csssq(kcorw+1:end)-csssq(1:end-kcorw))/kcorw); 0124 afact=max([absnoise^2,max(ssq)*signoise^2,min(csssq(kcorw+1:end)-csssq(1:end-kcorw))*(relnoise/kcorw)^2])^2*kcorw^2; 0125 0126 % downsample signal to approx 2 kHz to speed up autocorrelation calculation 0127 % kdsmp is the downsample factor 0128 0129 sf=filter(blp/sum(blp),1,s(sfoff+1:end)); 0130 sp=filter([1 exp(preemph/fs)],1,s); % preemphasised speech for LPC calculation 0131 sf(1:length(blp)-1)=[]; % remove startup transient 0132 sf=sf(1:kdsmp:end); % downsample to =~2kHz 0133 nsf=length(sf); % length of downsampled speech 0134 ns=length(s); % length of full rate speech 0135 0136 % Calculate the frame limit to ensure we don't run off the end of the speech or decimated speech: 0137 % (a) For decimated autocorrelation when calculating sff(): (nframe-1)*kframed+kcorwd+maxlag <= nsf 0138 % (b) For full rate autocorrelation when calculating sfh(): max(fho)+kcorw+maxlag*kdsamp+hnfllag <= ns 0139 % (c) For rms ratio window when calculating rr : max(fho)+rmsix(end) <= ns 0140 % where max(fho) = (nframe-1)*kframe + spoff 0141 0142 nframe=floor(1+min((nsf-kcorwd-maxlag)/kframed,(ns-spoff-max(kcorw-maxlag*kdsmp-hnfullag,rmsix(end)))/kframe)); 0143 0144 % now search for autocorrelation peaks in the downsampled signal 0145 0146 cost=zeros(nframe,ncands); % cumulative cost 0147 prev=zeros(nframe,ncands); % traceback pointer 0148 mcands=zeros(nframe,1); % number of actual candidates excluding voiceless 0149 lagval=repmat(NaN,nframe,ncands-1); % lag of each voiced candidate 0150 tv=zeros(nframe,3); % diagnostics: 1=voiceless cost, 2=min voiced cost, 3:cumulative voiceless-min voiced 0151 if doback 0152 costms=cell(nframe,1); 0153 end 0154 0155 % Main processing loop for each 10 ms frame 0156 0157 for iframe=1:nframe % loop for each frame (~10 ms) 0158 0159 % Find peaks in the normalized autocorrelation of subsampled (2Khz) speech 0160 % only keep peaks that are > 30% of highest peak 0161 0162 fho=(iframe-1)*kframe+spoff; 0163 sff=sf((iframe-1)*kframed+sfj); 0164 sffdc=mean(sff(sfi)); % mean of initial correlation window length 0165 sff=sff-sffdc; % subtract off the mean 0166 nccfd=normxcor(sff(1:kcorwd),sff(minlag+1:end)); 0167 [ipkd,vpkd]=findpeaks(nccfd,'q'); 0168 0169 % Debugging: execute the line below to plot the autocorrelation peaks. 0170 % findpeaks(nccfd,'q'); xlabel(sprintf('Lag = (x+%d)*%g ms',minlag-1,1000*kdsmp/fs)); ylabel('Normalized Cross Correlation'); title (sprintf('Frame %d/%d',iframe,nframe)); 0171 0172 vipkd=[vpkd ipkd]; 0173 vipkd(vpkd<max(vpkd)*candtr,:)=[]; % eliminate peaks that are small 0174 if size(vipkd,1) 0175 if size(vipkd,1)>ncands-1 0176 vipkd=sortrows(vipkd); 0177 vipkd(1:size(vipkd,1)-ncands+1,:)=[]; % eliminate lowest to leave only ncands-1 0178 end 0179 lagcan=round(vipkd(:,2)*kdsmp+lagoff); % convert the lag candidate values to the full sample rate 0180 nlcan=length(lagcan); 0181 else 0182 nlcan=0; 0183 end 0184 0185 % If there are any candidate lag values (nlcan>0) then refine their accuracy at the full sample rate 0186 0187 if nlcan 0188 laglist=reshape(repmat(lagcan(:)',nfullag,1)+repmat((-hnfullag:hnfullag)',1,nlcan),nfullag*nlcan,1); 0189 sfh=s(fho+(1:kcorw+max(lagcan)+hnfullag)); 0190 sfhdc=mean(sfh(sfhi)); 0191 sfh=sfh-sfhdc; 0192 e0=sum(sfh(sfhi).^2); % energy of initial correlation window (only needed to store in tv(:,6) 0193 lagl2=repmat(lagcan(:)',nfullag+kcorw-1,1)+repmat((1-hnfullag:hnfullag+kcorw)',1,nlcan); 0194 nccf=normxcor(sfh(1:kcorw),sfh(lagl2),afact); 0195 0196 [maxcc,maxcci]=max(nccf,[],1); 0197 vipk=[maxcc(:) lagcan(:)+maxcci(:)-hnfullag-1]; 0198 vipk=vipk(:,[1 2 2]); 0199 maxccj=maxcci(:)'+nfullag*(0:nlcan-1); % vector index into nccf array 0200 msk=mod(maxcci,nfullag-1)~=1 & 2*nccf(maxccj)-nccf(mod(maxccj-2,nfullag*nlcan)+1)-nccf(mod(maxccj,nfullag*nlcan)+1)>0; % don't do quadratic interpolation for the end ones 0201 if any(msk) 0202 maxccj=maxccj(msk); 0203 vipk(msk,3)=vipk(msk,3)+(nccf(maxccj+1)-nccf(maxccj-1))'./(2*(2*nccf(maxccj)-nccf(maxccj-1)-nccf(maxccj+1)))'; 0204 end 0205 vipk(maxcc<max(maxcc)*candtr,:)=[]; % eliminate peaks that are small 0206 if size(vipk,1)>ncands-1 0207 vipk=sortrows(vipk); 0208 vipk(1:size(vipk,1)-ncands+1,:)=[]; % eliminate lowest to leave only ncands-1 0209 end 0210 0211 % vipk(:,1) has NCCF value, vipk(:,2) has integer peak position, vipk(:,3) has refined peak position 0212 0213 mc=size(vipk,1); 0214 else 0215 mc=0; 0216 end 0217 0218 % We now have mc lag candidates at the full sample rate 0219 0220 mc1=mc+1; % total number of candidates including "unvoiced" possibility 0221 mcands(iframe)=mc; % save number of lag candidates (needed for pitch consistency cost calculation) 0222 if mc 0223 lagval(iframe,1:mc)=vipk(:,3)'; 0224 cost(iframe,1)=vobias+max(vipk(:,1)); % voiceless cost 0225 cost(iframe,2:mc1)=1-vipk(:,1)'.*(1-beta*vipk(:,3)'); % local voiced costs 0226 tv(iframe,2)=min(cost(iframe,2:mc1)); 0227 else 0228 cost(iframe,1)=vobias; % if no lag candidates (mc=0), then the voiceless case is the only possibility 0229 end 0230 tv(iframe,1)=cost(iframe,1); 0231 if iframe>1 % if it is not the first frame, then calculate pitch consistency and v/uv transition costs 0232 mcp=mcands(iframe-1); 0233 costm=zeros(mcp+1,mc1); % cost matrix: rows and cols correspond to candidates in previous and current frames (incl voiceless) 0234 0235 % if both frames have at least one lag candidate, then calculate a pitch consistency cost 0236 0237 if mc*mcp 0238 lrat=abs(log(repmat(lagval(iframe,1:mc),mcp,1)./repmat(lagval(iframe-1,1:mcp)',1,mc))); 0239 costm(2:end,2:end)=freqwt*min(lrat,doublec+abs(lrat-log2)); % allow pitch doubling/halving 0240 end 0241 0242 % if either frame has a lag candidate, then calculate the cost of voiced/voiceless transition and vice versa 0243 0244 if mc+mcp 0245 rr=sqrt((rmswin'*s(fho+rmsix).^2)/(rmswin'*s(fho+rmsix-kdrms).^2)); % amplitude "gradient" 0246 ss=0.2/(distitar(lpcauto(sp(fho+rmsix),lpcord),lpcauto(sp(fho+rmsix-kdrms),lpcord),'e')-0.8); % Spectral stationarity: note: Talkin uses Hanning instead of Hamming windows for LPC 0247 costm(1,2:end)= vtranc+vtrsc*ss+vtrac/rr; % voiceless -> voiced cost 0248 costm(2:end,1)= vtranc+vtrsc*ss+vtrac*rr; 0249 tv(iframe,4:5)=[costm(1,mc1) costm(mcp+1,1)]; 0250 end 0251 costm=costm+repmat(cost(iframe-1,1:mcp+1)',1,mc1); % add in cumulative costs 0252 [costi,previ]=min(costm,[],1); 0253 cost(iframe,1:mc1)=cost(iframe,1:mc1)+costi; 0254 prev(iframe,1:mc1)=previ; 0255 else % first ever frame 0256 costm=zeros(1,mc1); % create a cost matrix in case doing a backward recursion 0257 end 0258 if mc 0259 tv(iframe,3)=cost(iframe,1)-min(cost(iframe,2:mc1)); 0260 tv(iframe,6)=5*log10(e0*e0/afact); 0261 end 0262 if doback 0263 costms{iframe}=costm; % need to add repmatted cost into this 0264 end 0265 end 0266 0267 % now do traceback 0268 0269 best=zeros(nframe,1); 0270 [cbest,best(nframe)]=min(cost(nframe,1:mcands(nframe)+1)); 0271 for i=nframe:-1:2 0272 best(i-1)=prev(i,best(i)); 0273 end 0274 vix=find(best>1); 0275 fx=repmat(NaN,nframe,1); % unvoiced frames will be NaN 0276 fx(vix)=fs*lagval(vix+nframe*(best(vix)-2)).^(-1); % leave as NaN if unvoiced 0277 tt=zeros(nframe,3); 0278 tt(:,1)=(1:nframe)'*kframe+spoff; % find frame times 0279 tt(:,2)=tt(:,1)+kframe-1; 0280 jratm=(jumprat+1/jumprat)/2; 0281 tt(2:end,3)=abs(fx(2:end)./fx(1:end-1)-jratm)>jumprat-jratm; % new spurt if frequency ratio is outside (1/jumprat,jumprat) 0282 tt(1,3)=1; % first frame always starts a spurt 0283 tt(1+find(isnan(fx(1:end-1))),3)=1; % NaN always forces a new spurt 0284 0285 % plot results if there are no output arguments of if the 'g' mode option is specified 0286 0287 if ~nargout || any(mode=='g') 0288 tf=spoff+(0:nframe-1)'*kframe; % one sample before start of each frame 0289 blag=repmat(NaN,nframe,1); % unvoiced frames will be NaN 0290 blag(vix)=lagval(vix+nframe*(best(vix)-2)); % leave as NaN if unvoiced 0291 ts=(1:ns)/fs; % time scale for speech samples 0292 tsa=[1:tf(1) tf(end)+kframe+1:ns]; % indexes for unprocessed speech [-1 term is an error methinks] 0293 sup=repmat(NaN,ns,1); % unprocessed speech - plot in black 0294 sup(tsa)=s(tsa); 0295 sv=reshape(s(tf(1)+1:tf(end)+kframe),kframe,nframe); % processed speech 0296 su=sv; 0297 su(:,best>1)=NaN; % delete all voiced samples 0298 sv(:,best==1)=NaN; % delete all unvoiced samples 0299 tsuv=(tf(1)+1:tf(end)+kframe)/fs; 0300 su=su(:); 0301 sv=sv(:); 0302 ax=zeros(2,1); 0303 ax(1)=subplot(211); 0304 plot(ts,sup,'-k',tsuv,su,'r-',tsuv,sv,'b-'); 0305 title('Speech'); 0306 ax(2)=subplot(212); 0307 plot((tf+(kframe+1)/2)/fs,lagval*1000/fs,'xr',(tf+(kframe+1)/2)/fs,blag*1000/fs,'-b') 0308 xlabel('Time (s)'); 0309 ylabel('Period (ms)'); 0310 title('Lag Candidates'); 0311 linkaxes(ax,'x'); 0312 end 0313 if ~any(mode=='u') 0314 tt(isnan(fx),:)=[]; % remove NaN spurts 0315 fx(isnan(fx),:)=[]; 0316 end 0317 0318 0319 0320 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0321 0322 function v=normxcor(x,y,d) 0323 % Calculate the normalized cross correlation of column vectors x and y 0324 % we can calculate this in two ways but fft is much faster even for nx small 0325 % We must have nx<=ny and the output length is ny-nx+1 0326 % note that this routine does not do mean subtraction even though this is normally a good idea 0327 % if y is a matrix, we correlate with each column 0328 % d is a constant added onto the normalization factor 0329 % v(j)=x'*yj/sqrt(d + x'*x * yj'*yj) where yj=y(j:j+nx-1) for j=1:ny-nx+1 0330 0331 if nargin<3 0332 d=0; 0333 end 0334 nx=length(x); 0335 [ny,my]=size(y); 0336 nv=1+ny-nx; 0337 if nx>ny 0338 error('second argument is shorter than the first'); 0339 end 0340 0341 nf=pow2(nextpow2(ny)); 0342 w=irfft(repmat(conj(rfft(x,nf,1)),1,my).*rfft(y,nf,1)); 0343 s=zeros(ny+1,my); 0344 s(2:end,:)=cumsum(y.^2,1); 0345 v=w(1:nv,:)./sqrt(d+(x'*x).*(s(nx+1:end,:)-s(1:end-nx,:)));