Description of v

V_VADSOHN implements a voice activity detector [VS,ZO]=(S,FSZ,M,P)

 Inputs:
   si      input speech signal
   fsz     sample frequency in Hz
           Alternatively, the input state from a previous call (see below)
   m       mode [default = 'a']:
               'n'   output frame start/end in samples
               't'   output frame start/end in seconds
               'a'   output activity decision [default]
               'b'   output activity likelihood ratio
               'p'   plot graph [default if no outputs]
   pp      algorithm parameters [optional]

 Outputs:
   vs(:,n)     outputs in the order [t1,t2,a,b] as selected by m options
               if 'n' or 't' is specified in m, vs has one row per frame and
               t1 and t2 give the frame start end end times (in samples or seconds).
               otherwise, vs has one row per sample.
   zo          output state (used as input for a subsequent call).

 The algorithm operation is controlled by a small number of parameters:

        pp.of          % overlap factor = (fft length)/(frame increment) [2]
        pp.ti          % desired output frame increment (10 ms)
        pp.tj          % internal frame increment (10 ms)
        pp.ri          % set to 1 to round tj to the nearest power of 2 samples [0]
        pp.ta          % Time const for smoothing SNR estimate [0.396 seconds]
        pp.gx          % maximum posterior SNR as a power ratio [1000 = 30dB]
        pp.gz          % minimum posterior SNR as a power ratio [0.0001 = -40dB]
        pp.xn          % minimum prior SNR [0]
        pp.pr          % Speech probability threshold [0.7]
        pp.ts          % mean talkspurt length (100 ms)
        pp.tn          % mean silence length (50 ms)
        pp.ne          % noise estimation: 0=min statistics, 1=MMSE [0]

 In addition it is possible to specify parameters for the noise estimation algorithm
 which implements reference [3] from which equation numbers are given in parentheses.
 They are as follows:

        pp.taca      % (11): smoothing time constant for alpha_c [0.0449 seconds]
        pp.tamax     % (3): max smoothing time constant [0.392 seconds]
        pp.taminh    % (3): min smoothing time constant (upper limit) [0.0133 seconds]
        pp.tpfall    % (12): time constant for P to fall [0.064 seconds]
        pp.tbmax     % (20): max smoothing time constant [0.0717 seconds]
        pp.qeqmin    % (23): minimum value of Qeq [2]
        pp.qeqmax    % max value of Qeq per frame [14]
        pp.av        % (23)+13 lines: fudge factor for bc calculation  [2.12]
        pp.td        % time to take minimum over [1.536 seconds]
        pp.nu        % number of subwindows to use [3]
        pp.qith      % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ]
        pp.nsmdb     % corresponding noise slope thresholds in dB/second   [47 31.4 15.7 4.1]


 If convenient, you can call v_vadsohn in chunks of arbitrary size. Thus the following are equivalent:

                   (a) y=v_vadsohn(s,fs);

                   (b) [y1,z]=v_vadsohn(s(1:1000),fs);
                       [y2,z]=v_vadsohn(s(1001:2000),z);
                       y3=v_vadsohn(s(2001:end),z);
                       y=[y1; y2; y3];

 Note that in all cases the number of output samples will be less than the number of input samples if
 the input length is not an exact number of frames. In addition, if two output arguments
 are specified, the last partial frame samples will be retained for overlap adding
 with the output from the next call to v_specsub().

 Refs:
    [1] J. Sohn, N. S. Kim, and W. Sung.
        A statistical model-based voice activity detection.
        IEEE Signal Processing Lett., 6 (1): 1-3, 1999. doi: 10.1109/97.736233.
    [2] Ephraim, Y. & Malah, D.
        Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator
        IEEE Trans Acoustics Speech and Signal Processing, 32(6):1109-1121, Dec 1984
    [3] Rainer Martin.
        Noise power spectral density estimation based on optimal smoothing and minimum statistics.
        IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.

0001 function [vs,zo]=v_vadsohn(si,fsz,m,pp)
0002 %V_VADSOHN implements a voice activity detector [VS,ZO]=(S,FSZ,M,P)
0003 %
0004 % Inputs:
0005 %   si      input speech signal
0006 %   fsz     sample frequency in Hz
0007 %           Alternatively, the input state from a previous call (see below)
0008 %   m       mode [default = 'a']:
0009 %               'n'   output frame start/end in samples
0010 %               't'   output frame start/end in seconds
0011 %               'a'   output activity decision [default]
0012 %               'b'   output activity likelihood ratio
0013 %               'p'   plot graph [default if no outputs]
0014 %   pp      algorithm parameters [optional]
0015 %
0016 % Outputs:
0017 %   vs(:,n)     outputs in the order [t1,t2,a,b] as selected by m options
0018 %               if 'n' or 't' is specified in m, vs has one row per frame and
0019 %               t1 and t2 give the frame start end end times (in samples or seconds).
0020 %               otherwise, vs has one row per sample.
0021 %   zo          output state (used as input for a subsequent call).
0022 %
0023 % The algorithm operation is controlled by a small number of parameters:
0024 %
0025 %        pp.of          % overlap factor = (fft length)/(frame increment) [2]
0026 %        pp.ti          % desired output frame increment (10 ms)
0027 %        pp.tj          % internal frame increment (10 ms)
0028 %        pp.ri          % set to 1 to round tj to the nearest power of 2 samples [0]
0029 %        pp.ta          % Time const for smoothing SNR estimate [0.396 seconds]
0030 %        pp.gx          % maximum posterior SNR as a power ratio [1000 = 30dB]
0031 %        pp.gz          % minimum posterior SNR as a power ratio [0.0001 = -40dB]
0032 %        pp.xn          % minimum prior SNR [0]
0033 %        pp.pr          % Speech probability threshold [0.7]
0034 %        pp.ts          % mean talkspurt length (100 ms)
0035 %        pp.tn          % mean silence length (50 ms)
0036 %        pp.ne          % noise estimation: 0=min statistics, 1=MMSE [0]
0037 %
0038 % In addition it is possible to specify parameters for the noise estimation algorithm
0039 % which implements reference [3] from which equation numbers are given in parentheses.
0040 % They are as follows:
0041 %
0042 %        pp.taca      % (11): smoothing time constant for alpha_c [0.0449 seconds]
0043 %        pp.tamax     % (3): max smoothing time constant [0.392 seconds]
0044 %        pp.taminh    % (3): min smoothing time constant (upper limit) [0.0133 seconds]
0045 %        pp.tpfall    % (12): time constant for P to fall [0.064 seconds]
0046 %        pp.tbmax     % (20): max smoothing time constant [0.0717 seconds]
0047 %        pp.qeqmin    % (23): minimum value of Qeq [2]
0048 %        pp.qeqmax    % max value of Qeq per frame [14]
0049 %        pp.av        % (23)+13 lines: fudge factor for bc calculation  [2.12]
0050 %        pp.td        % time to take minimum over [1.536 seconds]
0051 %        pp.nu        % number of subwindows to use [3]
0052 %        pp.qith      % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ]
0053 %        pp.nsmdb     % corresponding noise slope thresholds in dB/second   [47 31.4 15.7 4.1]
0054 %
0055 %
0056 % If convenient, you can call v_vadsohn in chunks of arbitrary size. Thus the following are equivalent:
0057 %
0058 %                   (a) y=v_vadsohn(s,fs);
0059 %
0060 %                   (b) [y1,z]=v_vadsohn(s(1:1000),fs);
0061 %                       [y2,z]=v_vadsohn(s(1001:2000),z);
0062 %                       y3=v_vadsohn(s(2001:end),z);
0063 %                       y=[y1; y2; y3];
0064 %
0065 % Note that in all cases the number of output samples will be less than the number of input samples if
0066 % the input length is not an exact number of frames. In addition, if two output arguments
0067 % are specified, the last partial frame samples will be retained for overlap adding
0068 % with the output from the next call to v_specsub().
0069 %
0070 % Refs:
0071 %    [1] J. Sohn, N. S. Kim, and W. Sung.
0072 %        A statistical model-based voice activity detection.
0073 %        IEEE Signal Processing Lett., 6 (1): 1-3, 1999. doi: 10.1109/97.736233.
0074 %    [2] Ephraim, Y. & Malah, D.
0075 %        Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator
0076 %        IEEE Trans Acoustics Speech and Signal Processing, 32(6):1109-1121, Dec 1984
0077 %    [3] Rainer Martin.
0078 %        Noise power spectral density estimation based on optimal smoothing and minimum statistics.
0079 %        IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.
0080 
0081 %      Copyright (C) Mike Brookes 2004
0082 %      Version: $Id: v_vadsohn.m 10865 2018-09-21 17:22:45Z dmb $
0083 %
0084 %   VOICEBOX is a MATLAB toolbox for speech processing.
0085 %   Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
0086 %
0087 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0088 %   This program is free software; you can redistribute it and/or modify
0089 %   it under the terms of the GNU General Public License as published by
0090 %   the Free Software Foundation; either version 2 of the License, or
0091 %   (at your option) any later version.
0092 %
0093 %   This program is distributed in the hope that it will be useful,
0094 %   but WITHOUT ANY WARRANTY; without even the implied warranty of
0095 %   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
0096 %   GNU General Public License for more details.
0097 %
0098 %   You can obtain a copy of the GNU General Public License from
0099 %   http://www.gnu.org/copyleft/gpl.html or by writing to
0100 %   Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA.
0101 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0102 if nargin<3 || isempty(m)
0103     m='';
0104 end
0105 if isstruct(fsz)
0106     fs=fsz.fs;
0107     qq=fsz.qq;
0108     qp=fsz.qp;
0109     ze=fsz.ze;
0110     s=zeros(length(fsz.si)+length(si(:)),1); % allocate space for speech
0111     s(1:length(fsz.si))=fsz.si;
0112     s(length(fsz.si)+1:end)=si(:);
0113 else
0114     fs=fsz;     % sample frequency
0115     s=si(:);
0116     % default algorithm constants
0117 
0118     qq.of=2;        % overlap factor = (fft length)/(frame increment)
0119     qq.pr=0.7;      % Speech probability threshold
0120     qq.ts=0.1;      % mean talkspurt length (100 ms)
0121     qq.tn=0.05;     % mean silence length (50 ms)
0122     qq.ti=10e-3;    % desired output frame increment (10 ms)
0123     qq.tj=10e-3;    % internal frame increment (10 ms)
0124     qq.ri=0;        % round ni to the nearest power of 2
0125     qq.ta=0.396;    % Time const for smoothing SNR estimate = -tinc/log(0.98) from [2]
0126     qq.gx=1000;     % maximum posterior SNR = 30dB
0127     qq.gz=1e-4;     % minimum posterior SNR = -40dB
0128     qq.xn=0;        % minimum prior SNR = -Inf dB
0129     qq.ne=0;          % noise estimation: 0=min statistics, 1=MMSE [0]
0130     if nargin>=4 && ~isempty(pp)
0131         qp=pp;      % save for v_estnoisem call
0132         qqn=fieldnames(qq);
0133         for i=1:length(qqn)
0134             if isfield(pp,qqn{i})
0135                 qq.(qqn{i})=pp.(qqn{i});
0136             end
0137         end
0138     else
0139         qp=struct;  % make an empty structure
0140     end
0141 end
0142 % derived algorithm constants
0143 qq.tj=min([qq.tj 0.5*qq.ts 0.5*qq.tn]);     % at least two frames per talk/silence spurt
0144 nj=max(round(qq.ti/qq.tj),1);               % number of internal frames per output frame
0145 if qq.ri
0146     ni=pow2(nextpow2(qq.ti*fs*sqrt(0.5)/nj)); % internal frame increment in samples
0147 else
0148     ni=round(qq.ti*fs/nj);    % internal frame increment in samples
0149 end
0150 tinc=ni/fs;             % true frame increment time
0151 a=exp(-tinc/qq.ta);     % SNR smoothing coefficient
0152 gmax=qq.gx;             % max posterior SNR = 30 dB
0153 kk=sqrt(2*pi);          % sqrt(8)*Gamma(1.5) - required constant
0154 xn=qq.xn;                % floor for prior SNR, xi
0155 gz=qq.gz;               % floor for posterior SNR
0156 a01=tinc/qq.tn;         % a01=P(silence->speech)
0157 a00=1-a01;              % a00=P(silence->silence)
0158 a10=tinc/qq.ts;         % a10=P(speech->silence)
0159 a11=1-a10;              % a11=P(speech->speech)
0160 b11=a11/a10;
0161 b01=a01/a00;
0162 b00=a01-a00*a11/a10;
0163 b10=a11-a10*a01/a00;
0164 prat=a10/a01;       % P(silence)/P(speech)
0165 lprat=log(prat);
0166 % calculate power spectrum in frames
0167 
0168 no=round(qq.of);                                   % integer overlap factor
0169 nf=ni*no;           % fft length
0170 nd=floor(ni*(no-1)/2); % output delay relative to start of frame
0171 w=hamming(nf+1)'; w(end)=[];  % for now always use hamming window
0172 w=w/sqrt(sum(w(1:ni:nf).^2));          % normalize to give overall gain of 1
0173 ns=length(s);
0174 y=v_enframe(s,w,ni);
0175 yf=v_rfft(y,nf,2);
0176 if ~size(yf,1)                                  % no data frames
0177     vs=[];
0178     nr=0;
0179     nb=0;
0180 else
0181     yp=yf.*conj(yf);                % power spectrum of input speech
0182     [nr,nf2]=size(yp);              % number of frames
0183     nb=ni*nr;
0184     isz=isstruct(fsz);
0185     if isz
0186         if qq.ne>0
0187             [dp,ze]=v_estnoiseg(yp,ze);    % estimate the noise using MMSE
0188         else
0189             [dp,ze]=v_estnoisem(yp,ze);    % estimate the noise using minimum statistics
0190         end
0191         xu=fsz.xu;                  % previously saved unsmoothed SNR
0192         lggami=fsz.gg;
0193         nv=fsz.nv;
0194     else
0195         if qq.ne>0
0196             [dp,ze]=v_estnoiseg(yp,tinc,qp);    % estimate the noise using MMSE
0197         else
0198             [dp,ze]=v_estnoisem(yp,tinc,qp);    % estimate the noise using minimum statistics
0199         end
0200         xu=1;                               % dummy unsmoothed SNR from previous frame [2](53)++
0201         lggami=0;                            % initial prob ratio
0202         nv=0;                               % number of previous speech samples
0203     end
0204     gam=max(min(yp./dp,gmax),gz);       % gamma = posterior SNR [2](10)
0205     prsp=zeros(nr,1);                   % create space for prob ratio vector
0206     for i=1:nr                          % loop for each frame
0207         gami=gam(i,:);                  % gamma(i) = a posteriori SNR [2](10)
0208         xi=a*xu+(1-a)*max(gami-1,xn);   % xi = smoothed a priori SNR [2](53)
0209         xi1=1+xi;
0210         v=0.5*xi.*gami./xi1;            % note that this is 0.5*vk in [2]
0211         lamk=2*v-log(xi1);              % log likelihood ratio for each frequency bin [1](3)
0212         lami=sum(lamk(2:nf2))/(nf2-1);  % mean log LR over frequency omitting DC term [1](4)
0213         if lggami<0                     % avoid overflow in calculating [1](11)
0214             lggami=lprat+lami+log(b11+b00/(a00+a10*exp(lggami)));
0215         else
0216             lggami=lprat+lami+log(b01+b10/(a10+a00*exp(-lggami)));
0217         end
0218         if lggami<0
0219             gg=exp(lggami);
0220             prsp(i)=gg/(1+gg);
0221         else
0222             prsp(i)=1/(1+exp(-lggami));
0223         end
0224         gi=(0.277+2*v)./gami;           % accurate to 0.02 dB for v>0.5
0225         mv=v<0.5;
0226         if any(mv)                      % only calculate Bessel functions for v<0.5
0227             vmv=v(mv);
0228             gi(mv)=kk*sqrt(vmv).*((0.5+vmv).*besseli(0,vmv)+vmv.*besseli(1,vmv))./(gami(mv).*exp(vmv)); % [2](7)
0229         end
0230         xu=gami.*gi.^2;                 % unsmoothed prior SNR % [2](53)
0231     end
0232     ncol=any(m=='a')+any(m=='b');       % number of output data columns
0233     if ~ncol
0234         m(end+1)='a';   % force at least one output
0235         ncol=1;
0236     end
0237     nw=floor(nr/nj);                % number of output or plotted frames
0238     if any(m=='n') || any(m=='t')       % frame-based outputs
0239         vs=zeros(nw,2+ncol);            % space for outputs
0240         vs(:,1)=nd+1+nv+(0:nw-1)'*ni*nj;   % starting sample of each frame
0241         vs(:,2)=ni*nj-1+vs(:,1);           % ending sample of each frame
0242         if any(m=='t')
0243             vs(:,1:2)=vs(:,1:2)/fs;     % convert to seconds
0244         end
0245         if any(m=='a')
0246             vs(:,3)=any(reshape(prsp(1:nw*nj)>qq.pr,nj,[]),1).';
0247         end
0248         if any(m=='b')
0249             vs(:,end)=max(reshape(prsp(1:nw*nj),nj,[]),[],1).';
0250         end
0251     else
0252         na=nd*(1-isz);                  % include preamble if no input state supplied
0253         nc=(nargout<=1)*(ns-nd-nb);     % include tail if no output state desired
0254         vs=zeros(na+nb+nc,ncol);
0255         vs(na+(1:nb),ncol)=reshape(repmat(prsp',ni,1),nb,1);
0256         vs(1:na,ncol)=vs(na+1,ncol);        % replicate start
0257         vs(na+nb+1:end,ncol)=vs(na+nb,ncol); % replicate end
0258         if any(m=='a')
0259             vs(:,1)=vs(:,ncol)>qq.pr;
0260         end
0261     end
0262 end
0263 if nargout>1
0264     zo.si=s(nd+nb+1:end);      % save the tail end of the input speech signal
0265     zo.fs=fs;                       % save sample frequency
0266     zo.qq=qq;                       % save local parameters
0267     zo.qp=qp;                       % save v_estnoisem parameters
0268     zo.ze=ze;                       % save state of noise estimation
0269     zo.xu=xu;                       % unsmoothed prior SNR  [2](53)
0270     zo.gg=lggami;                    % posterior prob ratio: P(speech|s)/P(silence|s) [1](11)
0271     zo.nv=nv+nd+nb;                      % number of previous speech samples
0272 end
0273 if (~nargout || any(m=='p')) && nr>0
0274     ax=subplot(212);
0275     plot((nv+nd+[0 nr*ni])/fs,[qq.pr qq.pr],'r:',(nv+nd+ni*nj/2+(0:nw-1)*ni*nj)/fs,max(reshape(prsp(1:nw*nj),nj,[]),[],1).','b-' );
0276     set(gca,'ylim',[-0.05 1.05]);
0277     xlabel('Time (s)');
0278     ylabel('Pr(speech)');
0279     ax(2)=subplot(211);
0280     plot((nv+(1:ns))/fs,s);
0281     ylabel('Speech');
0282     title('Sohn VAD');
0283     linkaxes(ax,'x');
0284 end

v_vadsohn

PURPOSE

SYNOPSIS

DESCRIPTION

CROSS-REFERENCE INFORMATION

SOURCE CODE