V_VADSOHN implements a voice activity detector [VS,ZO]=(S,FSZ,M,P) Inputs: si input speech signal fsz sample frequency in Hz Alternatively, the input state from a previous call (see below) m mode [default = 'a']: 'n' output frame start/end in samples 't' output frame start/end in seconds 'a' output activity decision [default] 'b' output activity likelihood ratio 'p' plot graph [default if no outputs] pp algorithm parameters [optional] Outputs: vs(:,n) outputs in the order [t1,t2,a,b] as selected by m options if 'n' or 't' is specified in m, vs has one row per frame and t1 and t2 give the frame start end end times (in samples or seconds). otherwise, vs has one row per sample. zo output state (used as input for a subsequent call). The algorithm operation is controlled by a small number of parameters: pp.of % overlap factor = (fft length)/(frame increment) [2] pp.ti % desired output frame increment (10 ms) pp.tj % internal frame increment (10 ms) pp.ri % set to 1 to round tj to the nearest power of 2 samples [0] pp.ta % Time const for smoothing SNR estimate [0.396 seconds] pp.gx % maximum posterior SNR as a power ratio [1000 = 30dB] pp.gz % minimum posterior SNR as a power ratio [0.0001 = -40dB] pp.xn % minimum prior SNR [0] pp.pr % Speech probability threshold [0.7] pp.ts % mean talkspurt length (100 ms) pp.tn % mean silence length (50 ms) pp.ne % noise estimation: 0=min statistics, 1=MMSE [0] In addition it is possible to specify parameters for the noise estimation algorithm which implements reference [3] from which equation numbers are given in parentheses. They are as follows: pp.taca % (11): smoothing time constant for alpha_c [0.0449 seconds] pp.tamax % (3): max smoothing time constant [0.392 seconds] pp.taminh % (3): min smoothing time constant (upper limit) [0.0133 seconds] pp.tpfall % (12): time constant for P to fall [0.064 seconds] pp.tbmax % (20): max smoothing time constant [0.0717 seconds] pp.qeqmin % (23): minimum value of Qeq [2] pp.qeqmax % max value of Qeq per frame [14] pp.av % (23)+13 lines: fudge factor for bc calculation [2.12] pp.td % time to take minimum over [1.536 seconds] pp.nu % number of subwindows to use [3] pp.qith % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ] pp.nsmdb % corresponding noise slope thresholds in dB/second [47 31.4 15.7 4.1] If convenient, you can call v_vadsohn in chunks of arbitrary size. Thus the following are equivalent: (a) y=v_vadsohn(s,fs); (b) [y1,z]=v_vadsohn(s(1:1000),fs); [y2,z]=v_vadsohn(s(1001:2000),z); y3=v_vadsohn(s(2001:end),z); y=[y1; y2; y3]; Note that in all cases the number of output samples will be less than the number of input samples if the input length is not an exact number of frames. In addition, if two output arguments are specified, the last partial frame samples will be retained for overlap adding with the output from the next call to v_specsub(). Refs: [1] J. Sohn, N. S. Kim, and W. Sung. A statistical model-based voice activity detection. IEEE Signal Processing Lett., 6 (1): 1-3, 1999. doi: 10.1109/97.736233. [2] Ephraim, Y. & Malah, D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator IEEE Trans Acoustics Speech and Signal Processing, 32(6):1109-1121, Dec 1984 [3] Rainer Martin. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.
0001 function [vs,zo]=v_vadsohn(si,fsz,m,pp) 0002 %V_VADSOHN implements a voice activity detector [VS,ZO]=(S,FSZ,M,P) 0003 % 0004 % Inputs: 0005 % si input speech signal 0006 % fsz sample frequency in Hz 0007 % Alternatively, the input state from a previous call (see below) 0008 % m mode [default = 'a']: 0009 % 'n' output frame start/end in samples 0010 % 't' output frame start/end in seconds 0011 % 'a' output activity decision [default] 0012 % 'b' output activity likelihood ratio 0013 % 'p' plot graph [default if no outputs] 0014 % pp algorithm parameters [optional] 0015 % 0016 % Outputs: 0017 % vs(:,n) outputs in the order [t1,t2,a,b] as selected by m options 0018 % if 'n' or 't' is specified in m, vs has one row per frame and 0019 % t1 and t2 give the frame start end end times (in samples or seconds). 0020 % otherwise, vs has one row per sample. 0021 % zo output state (used as input for a subsequent call). 0022 % 0023 % The algorithm operation is controlled by a small number of parameters: 0024 % 0025 % pp.of % overlap factor = (fft length)/(frame increment) [2] 0026 % pp.ti % desired output frame increment (10 ms) 0027 % pp.tj % internal frame increment (10 ms) 0028 % pp.ri % set to 1 to round tj to the nearest power of 2 samples [0] 0029 % pp.ta % Time const for smoothing SNR estimate [0.396 seconds] 0030 % pp.gx % maximum posterior SNR as a power ratio [1000 = 30dB] 0031 % pp.gz % minimum posterior SNR as a power ratio [0.0001 = -40dB] 0032 % pp.xn % minimum prior SNR [0] 0033 % pp.pr % Speech probability threshold [0.7] 0034 % pp.ts % mean talkspurt length (100 ms) 0035 % pp.tn % mean silence length (50 ms) 0036 % pp.ne % noise estimation: 0=min statistics, 1=MMSE [0] 0037 % 0038 % In addition it is possible to specify parameters for the noise estimation algorithm 0039 % which implements reference [3] from which equation numbers are given in parentheses. 0040 % They are as follows: 0041 % 0042 % pp.taca % (11): smoothing time constant for alpha_c [0.0449 seconds] 0043 % pp.tamax % (3): max smoothing time constant [0.392 seconds] 0044 % pp.taminh % (3): min smoothing time constant (upper limit) [0.0133 seconds] 0045 % pp.tpfall % (12): time constant for P to fall [0.064 seconds] 0046 % pp.tbmax % (20): max smoothing time constant [0.0717 seconds] 0047 % pp.qeqmin % (23): minimum value of Qeq [2] 0048 % pp.qeqmax % max value of Qeq per frame [14] 0049 % pp.av % (23)+13 lines: fudge factor for bc calculation [2.12] 0050 % pp.td % time to take minimum over [1.536 seconds] 0051 % pp.nu % number of subwindows to use [3] 0052 % pp.qith % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ] 0053 % pp.nsmdb % corresponding noise slope thresholds in dB/second [47 31.4 15.7 4.1] 0054 % 0055 % 0056 % If convenient, you can call v_vadsohn in chunks of arbitrary size. Thus the following are equivalent: 0057 % 0058 % (a) y=v_vadsohn(s,fs); 0059 % 0060 % (b) [y1,z]=v_vadsohn(s(1:1000),fs); 0061 % [y2,z]=v_vadsohn(s(1001:2000),z); 0062 % y3=v_vadsohn(s(2001:end),z); 0063 % y=[y1; y2; y3]; 0064 % 0065 % Note that in all cases the number of output samples will be less than the number of input samples if 0066 % the input length is not an exact number of frames. In addition, if two output arguments 0067 % are specified, the last partial frame samples will be retained for overlap adding 0068 % with the output from the next call to v_specsub(). 0069 % 0070 % Refs: 0071 % [1] J. Sohn, N. S. Kim, and W. Sung. 0072 % A statistical model-based voice activity detection. 0073 % IEEE Signal Processing Lett., 6 (1): 1-3, 1999. doi: 10.1109/97.736233. 0074 % [2] Ephraim, Y. & Malah, D. 0075 % Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator 0076 % IEEE Trans Acoustics Speech and Signal Processing, 32(6):1109-1121, Dec 1984 0077 % [3] Rainer Martin. 0078 % Noise power spectral density estimation based on optimal smoothing and minimum statistics. 0079 % IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001. 0080 0081 % Copyright (C) Mike Brookes 2004 0082 % Version: $Id: v_vadsohn.m 10865 2018-09-21 17:22:45Z dmb $ 0083 % 0084 % VOICEBOX is a MATLAB toolbox for speech processing. 0085 % Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html 0086 % 0087 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0088 % This program is free software; you can redistribute it and/or modify 0089 % it under the terms of the GNU General Public License as published by 0090 % the Free Software Foundation; either version 2 of the License, or 0091 % (at your option) any later version. 0092 % 0093 % This program is distributed in the hope that it will be useful, 0094 % but WITHOUT ANY WARRANTY; without even the implied warranty of 0095 % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 0096 % GNU General Public License for more details. 0097 % 0098 % You can obtain a copy of the GNU General Public License from 0099 % http://www.gnu.org/copyleft/gpl.html or by writing to 0100 % Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA. 0101 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0102 if nargin<3 || isempty(m) 0103 m=''; 0104 end 0105 if isstruct(fsz) 0106 fs=fsz.fs; 0107 qq=fsz.qq; 0108 qp=fsz.qp; 0109 ze=fsz.ze; 0110 s=zeros(length(fsz.si)+length(si(:)),1); % allocate space for speech 0111 s(1:length(fsz.si))=fsz.si; 0112 s(length(fsz.si)+1:end)=si(:); 0113 else 0114 fs=fsz; % sample frequency 0115 s=si(:); 0116 % default algorithm constants 0117 0118 qq.of=2; % overlap factor = (fft length)/(frame increment) 0119 qq.pr=0.7; % Speech probability threshold 0120 qq.ts=0.1; % mean talkspurt length (100 ms) 0121 qq.tn=0.05; % mean silence length (50 ms) 0122 qq.ti=10e-3; % desired output frame increment (10 ms) 0123 qq.tj=10e-3; % internal frame increment (10 ms) 0124 qq.ri=0; % round ni to the nearest power of 2 0125 qq.ta=0.396; % Time const for smoothing SNR estimate = -tinc/log(0.98) from [2] 0126 qq.gx=1000; % maximum posterior SNR = 30dB 0127 qq.gz=1e-4; % minimum posterior SNR = -40dB 0128 qq.xn=0; % minimum prior SNR = -Inf dB 0129 qq.ne=0; % noise estimation: 0=min statistics, 1=MMSE [0] 0130 if nargin>=4 && ~isempty(pp) 0131 qp=pp; % save for v_estnoisem call 0132 qqn=fieldnames(qq); 0133 for i=1:length(qqn) 0134 if isfield(pp,qqn{i}) 0135 qq.(qqn{i})=pp.(qqn{i}); 0136 end 0137 end 0138 else 0139 qp=struct; % make an empty structure 0140 end 0141 end 0142 % derived algorithm constants 0143 qq.tj=min([qq.tj 0.5*qq.ts 0.5*qq.tn]); % at least two frames per talk/silence spurt 0144 nj=max(round(qq.ti/qq.tj),1); % number of internal frames per output frame 0145 if qq.ri 0146 ni=pow2(nextpow2(qq.ti*fs*sqrt(0.5)/nj)); % internal frame increment in samples 0147 else 0148 ni=round(qq.ti*fs/nj); % internal frame increment in samples 0149 end 0150 tinc=ni/fs; % true frame increment time 0151 a=exp(-tinc/qq.ta); % SNR smoothing coefficient 0152 gmax=qq.gx; % max posterior SNR = 30 dB 0153 kk=sqrt(2*pi); % sqrt(8)*Gamma(1.5) - required constant 0154 xn=qq.xn; % floor for prior SNR, xi 0155 gz=qq.gz; % floor for posterior SNR 0156 a01=tinc/qq.tn; % a01=P(silence->speech) 0157 a00=1-a01; % a00=P(silence->silence) 0158 a10=tinc/qq.ts; % a10=P(speech->silence) 0159 a11=1-a10; % a11=P(speech->speech) 0160 b11=a11/a10; 0161 b01=a01/a00; 0162 b00=a01-a00*a11/a10; 0163 b10=a11-a10*a01/a00; 0164 prat=a10/a01; % P(silence)/P(speech) 0165 lprat=log(prat); 0166 % calculate power spectrum in frames 0167 0168 no=round(qq.of); % integer overlap factor 0169 nf=ni*no; % fft length 0170 nd=floor(ni*(no-1)/2); % output delay relative to start of frame 0171 w=hamming(nf+1)'; w(end)=[]; % for now always use hamming window 0172 w=w/sqrt(sum(w(1:ni:nf).^2)); % normalize to give overall gain of 1 0173 ns=length(s); 0174 y=v_enframe(s,w,ni); 0175 yf=v_rfft(y,nf,2); 0176 if ~size(yf,1) % no data frames 0177 vs=[]; 0178 nr=0; 0179 nb=0; 0180 else 0181 yp=yf.*conj(yf); % power spectrum of input speech 0182 [nr,nf2]=size(yp); % number of frames 0183 nb=ni*nr; 0184 isz=isstruct(fsz); 0185 if isz 0186 if qq.ne>0 0187 [dp,ze]=v_estnoiseg(yp,ze); % estimate the noise using MMSE 0188 else 0189 [dp,ze]=v_estnoisem(yp,ze); % estimate the noise using minimum statistics 0190 end 0191 xu=fsz.xu; % previously saved unsmoothed SNR 0192 lggami=fsz.gg; 0193 nv=fsz.nv; 0194 else 0195 if qq.ne>0 0196 [dp,ze]=v_estnoiseg(yp,tinc,qp); % estimate the noise using MMSE 0197 else 0198 [dp,ze]=v_estnoisem(yp,tinc,qp); % estimate the noise using minimum statistics 0199 end 0200 xu=1; % dummy unsmoothed SNR from previous frame [2](53)++ 0201 lggami=0; % initial prob ratio 0202 nv=0; % number of previous speech samples 0203 end 0204 gam=max(min(yp./dp,gmax),gz); % gamma = posterior SNR [2](10) 0205 prsp=zeros(nr,1); % create space for prob ratio vector 0206 for i=1:nr % loop for each frame 0207 gami=gam(i,:); % gamma(i) = a posteriori SNR [2](10) 0208 xi=a*xu+(1-a)*max(gami-1,xn); % xi = smoothed a priori SNR [2](53) 0209 xi1=1+xi; 0210 v=0.5*xi.*gami./xi1; % note that this is 0.5*vk in [2] 0211 lamk=2*v-log(xi1); % log likelihood ratio for each frequency bin [1](3) 0212 lami=sum(lamk(2:nf2))/(nf2-1); % mean log LR over frequency omitting DC term [1](4) 0213 if lggami<0 % avoid overflow in calculating [1](11) 0214 lggami=lprat+lami+log(b11+b00/(a00+a10*exp(lggami))); 0215 else 0216 lggami=lprat+lami+log(b01+b10/(a10+a00*exp(-lggami))); 0217 end 0218 if lggami<0 0219 gg=exp(lggami); 0220 prsp(i)=gg/(1+gg); 0221 else 0222 prsp(i)=1/(1+exp(-lggami)); 0223 end 0224 gi=(0.277+2*v)./gami; % accurate to 0.02 dB for v>0.5 0225 mv=v<0.5; 0226 if any(mv) % only calculate Bessel functions for v<0.5 0227 vmv=v(mv); 0228 gi(mv)=kk*sqrt(vmv).*((0.5+vmv).*besseli(0,vmv)+vmv.*besseli(1,vmv))./(gami(mv).*exp(vmv)); % [2](7) 0229 end 0230 xu=gami.*gi.^2; % unsmoothed prior SNR % [2](53) 0231 end 0232 ncol=any(m=='a')+any(m=='b'); % number of output data columns 0233 if ~ncol 0234 m(end+1)='a'; % force at least one output 0235 ncol=1; 0236 end 0237 nw=floor(nr/nj); % number of output or plotted frames 0238 if any(m=='n') || any(m=='t') % frame-based outputs 0239 vs=zeros(nw,2+ncol); % space for outputs 0240 vs(:,1)=nd+1+nv+(0:nw-1)'*ni*nj; % starting sample of each frame 0241 vs(:,2)=ni*nj-1+vs(:,1); % ending sample of each frame 0242 if any(m=='t') 0243 vs(:,1:2)=vs(:,1:2)/fs; % convert to seconds 0244 end 0245 if any(m=='a') 0246 vs(:,3)=any(reshape(prsp(1:nw*nj)>qq.pr,nj,[]),1).'; 0247 end 0248 if any(m=='b') 0249 vs(:,end)=max(reshape(prsp(1:nw*nj),nj,[]),[],1).'; 0250 end 0251 else 0252 na=nd*(1-isz); % include preamble if no input state supplied 0253 nc=(nargout<=1)*(ns-nd-nb); % include tail if no output state desired 0254 vs=zeros(na+nb+nc,ncol); 0255 vs(na+(1:nb),ncol)=reshape(repmat(prsp',ni,1),nb,1); 0256 vs(1:na,ncol)=vs(na+1,ncol); % replicate start 0257 vs(na+nb+1:end,ncol)=vs(na+nb,ncol); % replicate end 0258 if any(m=='a') 0259 vs(:,1)=vs(:,ncol)>qq.pr; 0260 end 0261 end 0262 end 0263 if nargout>1 0264 zo.si=s(nd+nb+1:end); % save the tail end of the input speech signal 0265 zo.fs=fs; % save sample frequency 0266 zo.qq=qq; % save local parameters 0267 zo.qp=qp; % save v_estnoisem parameters 0268 zo.ze=ze; % save state of noise estimation 0269 zo.xu=xu; % unsmoothed prior SNR [2](53) 0270 zo.gg=lggami; % posterior prob ratio: P(speech|s)/P(silence|s) [1](11) 0271 zo.nv=nv+nd+nb; % number of previous speech samples 0272 end 0273 if (~nargout || any(m=='p')) && nr>0 0274 ax=subplot(212); 0275 plot((nv+nd+[0 nr*ni])/fs,[qq.pr qq.pr],'r:',(nv+nd+ni*nj/2+(0:nw-1)*ni*nj)/fs,max(reshape(prsp(1:nw*nj),nj,[]),[],1).','b-' ); 0276 set(gca,'ylim',[-0.05 1.05]); 0277 xlabel('Time (s)'); 0278 ylabel('Pr(speech)'); 0279 ax(2)=subplot(211); 0280 plot((nv+(1:ns))/fs,s); 0281 ylabel('Speech'); 0282 title('Sohn VAD'); 0283 linkaxes(ax,'x'); 0284 end