


VADSOHN implements a voice activity detector [VS,ZO]=(S,FSZ,M,P)
Inputs:
si input speech signal
fsz sample frequency in Hz
Alternatively, the input state from a previous call (see below)
m mode [default = 'a']:
'n' output frame start/end in samples
't' output frame start/end in seconds
'a' output activity decision
'b' output activity likelihood ratio
pp algorithm parameters [optional]
Outputs:
vs(:,n) outputs in the order [t1,t2,a,b] as selected by m options
if 'n' or 't' is specified in m, vs has one row per frame and
t1 and t2 give the frame start end end times (in samples or seconds).
otherwise, vs has one row per sample.
zo output state (used as input for a subsequent call).
The algorithm operation is controlled by a small number of parameters:
pp.of % overlap factor = (fft length)/(frame increment) [2]
pp.ti % desired frame increment [0.010 seconds]
pp.ri % set to 1 to round ti to the nearest power of 2 samples [0]
pp.ta % Time const for smoothing SNR estimate [0.396 seconds]
pp.gx % maximum posterior SNR as a power ratio [1000]
pp.xn % minimum prior SNR [0]
pp.pr % Speech probability threshold [0.7]
pp.ts % mean talkspurt length (100 ms)
pp.tn % mean silence length (50 ms)
In addition it is possible to specify parameters for the noise estimation algorithm
which implements reference [3] from which equation numbers are given in parentheses.
They are as follows:
pp.taca % (11): smoothing time constant for alpha_c [0.0449 seconds]
pp.tamax % (3): max smoothing time constant [0.392 seconds]
pp.taminh % (3): min smoothing time constant (upper limit) [0.0133 seconds]
pp.tpfall % (12): time constant for P to fall [0.064 seconds]
pp.tbmax % (20): max smoothing time constant [0.0717 seconds]
pp.qeqmin % (23): minimum value of Qeq [2]
pp.qeqmax % max value of Qeq per frame [14]
pp.av % (23)+13 lines: fudge factor for bc calculation [2.12]
pp.td % time to take minimum over [1.536 seconds]
pp.nu % number of subwindows to use [3]
pp.qith % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ]
pp.nsmdb % corresponding noise slope thresholds in dB/second [47 31.4 15.7 4.1]
If convenient, you can call vadsohn in chunks of arbitrary size. Thus the following are equivalent:
(a) y=vadsohn(s,fs);
(b) [y1,z]=vadsohn(s(1:1000),fs);
[y2,z]=vadsohn(s(1001:2000),z);
y3=vadsohn(s(2001:end),z);
y=[y1; y2; y3];
Note that in all cases the number of output samples will be less than the number of input samples if
the input length is not an exact number of frames. In addition, if two output arguments
are specified, the last partial frame samples will be retained for overlap adding
with the output from the next call to specsub().
Refs:
[1] J. Sohn, N. S. Kim, and W. Sung.
A statistical model-based voice activity detection.
IEEE Signal Processing Lett., 6 (1): 1–3, 1999. doi: 10.1109/97.736233.
[2] Ephraim, Y. & Malah, D.
Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator
IEEE Trans Acoustics Speech and Signal Processing, 32(6):1109-1121, Dec 1984
[3] Rainer Martin.
Noise power spectral density estimation based on optimal smoothing and minimum statistics.
IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.

0001 function [vs,zo]=vadsohn(si,fsz,m,pp) 0002 %VADSOHN implements a voice activity detector [VS,ZO]=(S,FSZ,M,P) 0003 % 0004 % Inputs: 0005 % si input speech signal 0006 % fsz sample frequency in Hz 0007 % Alternatively, the input state from a previous call (see below) 0008 % m mode [default = 'a']: 0009 % 'n' output frame start/end in samples 0010 % 't' output frame start/end in seconds 0011 % 'a' output activity decision 0012 % 'b' output activity likelihood ratio 0013 % pp algorithm parameters [optional] 0014 % 0015 % Outputs: 0016 % vs(:,n) outputs in the order [t1,t2,a,b] as selected by m options 0017 % if 'n' or 't' is specified in m, vs has one row per frame and 0018 % t1 and t2 give the frame start end end times (in samples or seconds). 0019 % otherwise, vs has one row per sample. 0020 % zo output state (used as input for a subsequent call). 0021 % 0022 % The algorithm operation is controlled by a small number of parameters: 0023 % 0024 % pp.of % overlap factor = (fft length)/(frame increment) [2] 0025 % pp.ti % desired frame increment [0.010 seconds] 0026 % pp.ri % set to 1 to round ti to the nearest power of 2 samples [0] 0027 % pp.ta % Time const for smoothing SNR estimate [0.396 seconds] 0028 % pp.gx % maximum posterior SNR as a power ratio [1000] 0029 % pp.xn % minimum prior SNR [0] 0030 % pp.pr % Speech probability threshold [0.7] 0031 % pp.ts % mean talkspurt length (100 ms) 0032 % pp.tn % mean silence length (50 ms) 0033 % 0034 % In addition it is possible to specify parameters for the noise estimation algorithm 0035 % which implements reference [3] from which equation numbers are given in parentheses. 0036 % They are as follows: 0037 % 0038 % pp.taca % (11): smoothing time constant for alpha_c [0.0449 seconds] 0039 % pp.tamax % (3): max smoothing time constant [0.392 seconds] 0040 % pp.taminh % (3): min smoothing time constant (upper limit) [0.0133 seconds] 0041 % pp.tpfall % (12): time constant for P to fall [0.064 seconds] 0042 % pp.tbmax % (20): max smoothing time constant [0.0717 seconds] 0043 % pp.qeqmin % (23): minimum value of Qeq [2] 0044 % pp.qeqmax % max value of Qeq per frame [14] 0045 % pp.av % (23)+13 lines: fudge factor for bc calculation [2.12] 0046 % pp.td % time to take minimum over [1.536 seconds] 0047 % pp.nu % number of subwindows to use [3] 0048 % pp.qith % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ] 0049 % pp.nsmdb % corresponding noise slope thresholds in dB/second [47 31.4 15.7 4.1] 0050 % 0051 % 0052 % If convenient, you can call vadsohn in chunks of arbitrary size. Thus the following are equivalent: 0053 % 0054 % (a) y=vadsohn(s,fs); 0055 % 0056 % (b) [y1,z]=vadsohn(s(1:1000),fs); 0057 % [y2,z]=vadsohn(s(1001:2000),z); 0058 % y3=vadsohn(s(2001:end),z); 0059 % y=[y1; y2; y3]; 0060 % 0061 % Note that in all cases the number of output samples will be less than the number of input samples if 0062 % the input length is not an exact number of frames. In addition, if two output arguments 0063 % are specified, the last partial frame samples will be retained for overlap adding 0064 % with the output from the next call to specsub(). 0065 % 0066 % Refs: 0067 % [1] J. Sohn, N. S. Kim, and W. Sung. 0068 % A statistical model-based voice activity detection. 0069 % IEEE Signal Processing Lett., 6 (1): 1–3, 1999. doi: 10.1109/97.736233. 0070 % [2] Ephraim, Y. & Malah, D. 0071 % Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator 0072 % IEEE Trans Acoustics Speech and Signal Processing, 32(6):1109-1121, Dec 1984 0073 % [3] Rainer Martin. 0074 % Noise power spectral density estimation based on optimal smoothing and minimum statistics. 0075 % IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001. 0076 0077 % Copyright (C) Mike Brookes 2004 0078 % Version: $Id: vadsohn.m 713 2011-10-16 14:45:43Z dmb $ 0079 % 0080 % VOICEBOX is a MATLAB toolbox for speech processing. 0081 % Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html 0082 % 0083 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0084 % This program is free software; you can redistribute it and/or modify 0085 % it under the terms of the GNU General Public License as published by 0086 % the Free Software Foundation; either version 2 of the License, or 0087 % (at your option) any later version. 0088 % 0089 % This program is distributed in the hope that it will be useful, 0090 % but WITHOUT ANY WARRANTY; without even the implied warranty of 0091 % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 0092 % GNU General Public License for more details. 0093 % 0094 % You can obtain a copy of the GNU General Public License from 0095 % http://www.gnu.org/copyleft/gpl.html or by writing to 0096 % Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA. 0097 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0098 if nargin<3 0099 m=''; 0100 end 0101 if isstruct(fsz) 0102 fs=fsz.fs; 0103 qq=fsz.qq; 0104 qp=fsz.qp; 0105 ze=fsz.ze; 0106 s=zeros(length(fsz.si)+length(si(:)),1); % allocate space for speech 0107 s(1:length(fsz.si))=fsz.si; 0108 s(length(fsz.si)+1:end)=si(:); 0109 else 0110 fs=fsz; % sample frequency 0111 s=si(:); 0112 % default algorithm constants 0113 0114 qq.of=2; % overlap factor = (fft length)/(frame increment) 0115 qq.pr=0.7; % Speech probability threshold 0116 qq.ts=0.1; % mean talkspurt length (100 ms) 0117 qq.tn=0.05; % mean silence length (50 ms) 0118 qq.ti=10e-3; % desired frame increment (10 ms) 0119 qq.ri=0; % round ni to the nearest power of 2 0120 qq.ta=0.396; % Time const for smoothing SNR estimate = -tinc/log(0.98) from [2] 0121 qq.gx=1000; % maximum posterior SNR = 30dB 0122 qq.xn=0; % minimum prior SNR = -Inf dB 0123 if nargin>=4 && ~isempty(pp) 0124 qp=pp; % save for estnoisem call 0125 qqn=fieldnames(qq); 0126 for i=1:length(qqn) 0127 if isfield(pp,qqn{i}) 0128 qq.(qqn{i})=pp.(qqn{i}); 0129 end 0130 end 0131 else 0132 qp=struct; % make an empty structure 0133 end 0134 end 0135 % derived algorithm constants 0136 if qq.ri 0137 ni=pow2(nextpow2(ti*fs*sqrt(0.5))); 0138 else 0139 ni=round(qq.ti*fs); % frame increment in samples 0140 end 0141 tinc=ni/fs; % true frame increment time 0142 a=exp(-tinc/qq.ta); % SNR smoothing coefficient 0143 gmax=qq.gx; % max posterior SNR = 30 dB 0144 kk=sqrt(2*pi); % sqrt(8)*Gamma(1.5) - required constant 0145 xn=qq.xn; % floor for prior SNR, xi 0146 a01=tinc/qq.tn; % a01=P(silence->speech) 0147 a00=1-a01; % a00=P(silence->silence) 0148 a10=tinc/qq.ts; % a10=P(speech->silence) 0149 a11=1-a10; % a11=P(speech->speech) 0150 b11=a11/a10; 0151 b01=a01/a00; 0152 b00=a01-a00*a11/a10; 0153 b10=a11-a10*a01/a00; 0154 prat=a10/a01; % P(silence)/P(speech) 0155 lprat=log(prat); 0156 % calculate power spectrum in frames 0157 0158 no=round(qq.of); % integer overlap factor 0159 nf=ni*no; % fft length 0160 nd=floor(ni*(no-1)/2); % output delay relative to start of frame 0161 w=hamming(nf+1)'; w(end)=[]; % for now always use hamming window 0162 w=w/sqrt(sum(w(1:ni:nf).^2)); % normalize to give overall gain of 1 0163 ns=length(s); 0164 y=enframe(s,w,ni); 0165 yf=rfft(y,nf,2); 0166 if ~size(yf,1) % no data frames 0167 vs=[]; 0168 nr=0; 0169 nb=0; 0170 else 0171 yp=yf.*conj(yf); % power spectrum of input speech 0172 [nr,nf2]=size(yp); % number of frames 0173 nb=ni*nr; 0174 isz=isstruct(fsz); 0175 if isz 0176 [dp,ze]=estnoisem(yp,ze); % estimate the noise using minimum statistics 0177 xu=fsz.xu; % previously saved unsmoothed SNR 0178 lggami=fsz.gg; 0179 nv=fsz.nv; 0180 else 0181 [dp,ze]=estnoisem(yp,tinc,qp); % estimate the noise using minimum statistics 0182 xu=1; % dummy unsmoothed SNR from previous frame [2](53)++ 0183 lggami=0; % initial prob ratio 0184 nv=0; % number of previous speech samples 0185 end 0186 gam=min(yp./dp,gmax); % gamma = posterior SNR [2](10) 0187 prsp=zeros(nr,1); % create space for prob ratio vector 0188 for i=1:nr % loop for each frame 0189 gami=gam(i,:); % gamma(i) = a posteriori SNR [2](10) 0190 xi=a*xu+(1-a)*max(gami-1,xn); % xi = smoothed a priori SNR [2](53) 0191 xi1=1+xi; 0192 v=0.5*xi.*gami./xi1; % note that this is 0.5*vk in [2] 0193 lamk=2*v-log(xi1); % log likelihood ratio [1](3) 0194 lami=sum(lamk(2:nf2))/(nf2-1); % mean log LR omitting DC term [1](4) 0195 if lggami<0 % avoid overflow 0196 lggami=lprat+lami+log(b11+b00/(a00+a10*exp(lggami))); 0197 else 0198 lggami=lprat+lami+log(b01+b10/(a10+a00*exp(-lggami))); 0199 end 0200 if lggami<0 0201 gg=exp(lggami); 0202 prsp(i)=gg/(1+gg); 0203 else 0204 prsp(i)=1/(1+exp(-lggami)); 0205 end 0206 gi=(0.277+2*v)./gami; % accurate to 0.02 dB for v>0.5 0207 mv=v<0.5; 0208 if any(mv) 0209 vmv=v(mv); 0210 gi(mv)=kk*sqrt(vmv).*((0.5+vmv).*besseli(0,vmv)+vmv.*besseli(1,vmv))./(gami(mv).*exp(vmv)); 0211 end 0212 xu=gami.*gi.^2; % unsmoothed prior SNR 0213 end 0214 ncol=any(m=='a')+any(m=='b'); % number of output data columns 0215 if ~ncol 0216 m(end+1)='a'; % force at least one output 0217 ncol=1; 0218 end 0219 if any(m=='n') || any(m=='t') % frame-based outputs 0220 vs=zeros(nr,2+ncol); 0221 vs(:,1)=nd+1+nv+(0:nr-1)'*ni; % starting sample of each frame 0222 vs(:,2)=ni-1+vs(:,1); % ending sample of each frame 0223 if any(m=='t') 0224 vs(:,1:2)=vs(:,1:2)/fs; % convert to seconds 0225 end 0226 if any(m=='a') 0227 vs(:,3)=prsp>qq.pr; 0228 end 0229 if any(m=='b') 0230 vs(:,end)=prsp; 0231 end 0232 else 0233 na=nd*(1-isz); % include preamble if no input state supplied 0234 nc=(nargout<=1)*(ns-nd-nb); % include tail if no output state desired 0235 vs=zeros(na+nb+nc,ncol); 0236 vs(na+(1:nb),ncol)=reshape(repmat(prsp',ni,1),nb,1); 0237 vs(1:na,ncol)=vs(na+1,ncol); % replicate start 0238 vs(na+nb+1:end,ncol)=vs(na+nb,ncol); % replicate end 0239 if any(m=='a') 0240 vs(:,1)=vs(:,ncol)>qq.pr; 0241 end 0242 end 0243 end 0244 if nargout>1 0245 zo.si=s(nd+nb+1:end); % save the tail end of the input speech signal 0246 zo.fs=fs; % save sample frequency 0247 zo.qq=qq; % save local parameters 0248 zo.qp=qp; % save estnoisem parameters 0249 zo.ze=ze; % save state of noise estimation 0250 zo.xu=xu; % unsmoothed prior SNR [2](53) 0251 zo.gg=lggami; % posterior prob ratio: P(speech|s)/P(silence|s) [1](11) 0252 zo.nv=nv+nd+nb; % number of previous speech samples 0253 end 0254 if ~nargout && nr>0 0255 ax=subplot(212); 0256 plot((nv+nd+[0 nr*ni])/fs,[qq.pr qq.pr],'r:',(nv+nd+ni/2+(0:nr-1)*ni)/fs,prsp,'b-' ); 0257 set(gca,'ylim',[-0.05 1.05]); 0258 xlabel('Time (s)'); 0259 ylabel('Pr(speech)'); 0260 ax(2)=subplot(211); 0261 plot((nv+(1:ns))/fs,s); 0262 ylabel('Speech'); 0263 title('Sohn VAD'); 0264 linkaxes(ax,'x'); 0265 end