Home > voicebox > vadsohn.m

vadsohn

PURPOSE ^

VADSOHN implements a voice activity detector [VS,ZO]=(S,FSZ,M,P)

SYNOPSIS ^

function [vs,zo]=vadsohn(si,fsz,m,pp)

DESCRIPTION ^

VADSOHN implements a voice activity detector [VS,ZO]=(S,FSZ,M,P)

 Inputs:
   si      input speech signal
   fsz     sample frequency in Hz
           Alternatively, the input state from a previous call (see below)
   m       mode [default = 'a']:
               'n'   output frame start/end in samples
               't'   output frame start/end in seconds
               'a'   output activity decision
               'b'   output activity likelihood ratio
   pp      algorithm parameters [optional]

 Outputs:
   vs(:,n)     outputs in the order [t1,t2,a,b] as selected by m options
               if 'n' or 't' is specified in m, vs has one row per frame and
               t1 and t2 give the frame start end end times (in samples or seconds).
               otherwise, vs has one row per sample.
   zo          output state (used as input for a subsequent call).

 The algorithm operation is controlled by a small number of parameters:

        pp.of          % overlap factor = (fft length)/(frame increment) [2]
        pp.ti          % desired frame increment [0.010 seconds]
        pp.ri          % set to 1 to round ti to the nearest power of 2 samples [0]
        pp.ta          % Time const for smoothing SNR estimate [0.396 seconds]
        pp.gx          % maximum posterior SNR as a power ratio [1000]
        pp.xn          % minimum prior SNR [0]
        pp.pr          % Speech probability threshold [0.7]
        pp.ts          % mean talkspurt length (100 ms)
        pp.tn          % mean silence length (50 ms)

 In addition it is possible to specify parameters for the noise estimation algorithm
 which implements reference [3] from which equation numbers are given in parentheses.
 They are as follows:

        pp.taca      % (11): smoothing time constant for alpha_c [0.0449 seconds]
        pp.tamax     % (3): max smoothing time constant [0.392 seconds]
        pp.taminh    % (3): min smoothing time constant (upper limit) [0.0133 seconds]
        pp.tpfall    % (12): time constant for P to fall [0.064 seconds]
        pp.tbmax     % (20): max smoothing time constant [0.0717 seconds]
        pp.qeqmin    % (23): minimum value of Qeq [2]
        pp.qeqmax    % max value of Qeq per frame [14]
        pp.av        % (23)+13 lines: fudge factor for bc calculation  [2.12]
        pp.td        % time to take minimum over [1.536 seconds]
        pp.nu        % number of subwindows to use [3]
        pp.qith      % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ]
        pp.nsmdb     % corresponding noise slope thresholds in dB/second   [47 31.4 15.7 4.1]


 If convenient, you can call vadsohn in chunks of arbitrary size. Thus the following are equivalent:

                   (a) y=vadsohn(s,fs);

                   (b) [y1,z]=vadsohn(s(1:1000),fs);
                       [y2,z]=vadsohn(s(1001:2000),z);
                       y3=vadsohn(s(2001:end),z);
                       y=[y1; y2; y3];

 Note that in all cases the number of output samples will be less than the number of input samples if
 the input length is not an exact number of frames. In addition, if two output arguments
 are specified, the last partial frame samples will be retained for overlap adding
 with the output from the next call to specsub().

 Refs:
    [1] J. Sohn, N. S. Kim, and W. Sung.
        A statistical model-based voice activity detection.
        IEEE Signal Processing Lett., 6 (1): 1–3, 1999. doi: 10.1109/97.736233.
    [2] Ephraim, Y. & Malah, D.
        Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator
        IEEE Trans Acoustics Speech and Signal Processing, 32(6):1109-1121, Dec 1984
    [3] Rainer Martin.
        Noise power spectral density estimation based on optimal smoothing and minimum statistics.
        IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SOURCE CODE ^

0001 function [vs,zo]=vadsohn(si,fsz,m,pp)
0002 %VADSOHN implements a voice activity detector [VS,ZO]=(S,FSZ,M,P)
0003 %
0004 % Inputs:
0005 %   si      input speech signal
0006 %   fsz     sample frequency in Hz
0007 %           Alternatively, the input state from a previous call (see below)
0008 %   m       mode [default = 'a']:
0009 %               'n'   output frame start/end in samples
0010 %               't'   output frame start/end in seconds
0011 %               'a'   output activity decision
0012 %               'b'   output activity likelihood ratio
0013 %   pp      algorithm parameters [optional]
0014 %
0015 % Outputs:
0016 %   vs(:,n)     outputs in the order [t1,t2,a,b] as selected by m options
0017 %               if 'n' or 't' is specified in m, vs has one row per frame and
0018 %               t1 and t2 give the frame start end end times (in samples or seconds).
0019 %               otherwise, vs has one row per sample.
0020 %   zo          output state (used as input for a subsequent call).
0021 %
0022 % The algorithm operation is controlled by a small number of parameters:
0023 %
0024 %        pp.of          % overlap factor = (fft length)/(frame increment) [2]
0025 %        pp.ti          % desired frame increment [0.010 seconds]
0026 %        pp.ri          % set to 1 to round ti to the nearest power of 2 samples [0]
0027 %        pp.ta          % Time const for smoothing SNR estimate [0.396 seconds]
0028 %        pp.gx          % maximum posterior SNR as a power ratio [1000]
0029 %        pp.xn          % minimum prior SNR [0]
0030 %        pp.pr          % Speech probability threshold [0.7]
0031 %        pp.ts          % mean talkspurt length (100 ms)
0032 %        pp.tn          % mean silence length (50 ms)
0033 %
0034 % In addition it is possible to specify parameters for the noise estimation algorithm
0035 % which implements reference [3] from which equation numbers are given in parentheses.
0036 % They are as follows:
0037 %
0038 %        pp.taca      % (11): smoothing time constant for alpha_c [0.0449 seconds]
0039 %        pp.tamax     % (3): max smoothing time constant [0.392 seconds]
0040 %        pp.taminh    % (3): min smoothing time constant (upper limit) [0.0133 seconds]
0041 %        pp.tpfall    % (12): time constant for P to fall [0.064 seconds]
0042 %        pp.tbmax     % (20): max smoothing time constant [0.0717 seconds]
0043 %        pp.qeqmin    % (23): minimum value of Qeq [2]
0044 %        pp.qeqmax    % max value of Qeq per frame [14]
0045 %        pp.av        % (23)+13 lines: fudge factor for bc calculation  [2.12]
0046 %        pp.td        % time to take minimum over [1.536 seconds]
0047 %        pp.nu        % number of subwindows to use [3]
0048 %        pp.qith      % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ]
0049 %        pp.nsmdb     % corresponding noise slope thresholds in dB/second   [47 31.4 15.7 4.1]
0050 %
0051 %
0052 % If convenient, you can call vadsohn in chunks of arbitrary size. Thus the following are equivalent:
0053 %
0054 %                   (a) y=vadsohn(s,fs);
0055 %
0056 %                   (b) [y1,z]=vadsohn(s(1:1000),fs);
0057 %                       [y2,z]=vadsohn(s(1001:2000),z);
0058 %                       y3=vadsohn(s(2001:end),z);
0059 %                       y=[y1; y2; y3];
0060 %
0061 % Note that in all cases the number of output samples will be less than the number of input samples if
0062 % the input length is not an exact number of frames. In addition, if two output arguments
0063 % are specified, the last partial frame samples will be retained for overlap adding
0064 % with the output from the next call to specsub().
0065 %
0066 % Refs:
0067 %    [1] J. Sohn, N. S. Kim, and W. Sung.
0068 %        A statistical model-based voice activity detection.
0069 %        IEEE Signal Processing Lett., 6 (1): 1–3, 1999. doi: 10.1109/97.736233.
0070 %    [2] Ephraim, Y. & Malah, D.
0071 %        Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator
0072 %        IEEE Trans Acoustics Speech and Signal Processing, 32(6):1109-1121, Dec 1984
0073 %    [3] Rainer Martin.
0074 %        Noise power spectral density estimation based on optimal smoothing and minimum statistics.
0075 %        IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.
0076 
0077 %      Copyright (C) Mike Brookes 2004
0078 %      Version: $Id: vadsohn.m 713 2011-10-16 14:45:43Z dmb $
0079 %
0080 %   VOICEBOX is a MATLAB toolbox for speech processing.
0081 %   Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
0082 %
0083 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0084 %   This program is free software; you can redistribute it and/or modify
0085 %   it under the terms of the GNU General Public License as published by
0086 %   the Free Software Foundation; either version 2 of the License, or
0087 %   (at your option) any later version.
0088 %
0089 %   This program is distributed in the hope that it will be useful,
0090 %   but WITHOUT ANY WARRANTY; without even the implied warranty of
0091 %   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
0092 %   GNU General Public License for more details.
0093 %
0094 %   You can obtain a copy of the GNU General Public License from
0095 %   http://www.gnu.org/copyleft/gpl.html or by writing to
0096 %   Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA.
0097 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0098 if nargin<3
0099     m='';
0100 end
0101 if isstruct(fsz)
0102     fs=fsz.fs;
0103     qq=fsz.qq;
0104     qp=fsz.qp;
0105     ze=fsz.ze;
0106     s=zeros(length(fsz.si)+length(si(:)),1); % allocate space for speech
0107     s(1:length(fsz.si))=fsz.si;
0108     s(length(fsz.si)+1:end)=si(:);
0109 else
0110     fs=fsz;     % sample frequency
0111     s=si(:);
0112     % default algorithm constants
0113 
0114     qq.of=2;   % overlap factor = (fft length)/(frame increment)
0115     qq.pr=0.7;    % Speech probability threshold
0116     qq.ts=0.1;  % mean talkspurt length (100 ms)
0117     qq.tn=0.05; % mean silence length (50 ms)
0118     qq.ti=10e-3;   % desired frame increment (10 ms)
0119     qq.ri=0;       % round ni to the nearest power of 2
0120     qq.ta=0.396;    % Time const for smoothing SNR estimate = -tinc/log(0.98) from [2]
0121     qq.gx=1000;     % maximum posterior SNR = 30dB
0122     qq.xn=0;        % minimum prior SNR = -Inf dB
0123     if nargin>=4 && ~isempty(pp)
0124         qp=pp;      % save for estnoisem call
0125         qqn=fieldnames(qq);
0126         for i=1:length(qqn)
0127             if isfield(pp,qqn{i})
0128                 qq.(qqn{i})=pp.(qqn{i});
0129             end
0130         end
0131     else
0132         qp=struct;  % make an empty structure
0133     end
0134 end
0135 % derived algorithm constants
0136 if qq.ri
0137     ni=pow2(nextpow2(ti*fs*sqrt(0.5)));
0138 else
0139     ni=round(qq.ti*fs);    % frame increment in samples
0140 end
0141 tinc=ni/fs;          % true frame increment time
0142 a=exp(-tinc/qq.ta); % SNR smoothing coefficient
0143 gmax=qq.gx;           % max posterior SNR = 30 dB
0144 kk=sqrt(2*pi);   % sqrt(8)*Gamma(1.5) - required constant
0145 xn=qq.xn;           % floor for prior SNR, xi
0146 a01=tinc/qq.tn;     % a01=P(silence->speech)
0147 a00=1-a01;          % a00=P(silence->silence)
0148 a10=tinc/qq.ts;     % a10=P(speech->silence)
0149 a11=1-a10;          % a11=P(speech->speech)
0150 b11=a11/a10;
0151 b01=a01/a00;
0152 b00=a01-a00*a11/a10;
0153 b10=a11-a10*a01/a00;
0154 prat=a10/a01;       % P(silence)/P(speech)
0155 lprat=log(prat);
0156 % calculate power spectrum in frames
0157 
0158 no=round(qq.of);                                   % integer overlap factor
0159 nf=ni*no;           % fft length
0160 nd=floor(ni*(no-1)/2); % output delay relative to start of frame
0161 w=hamming(nf+1)'; w(end)=[];  % for now always use hamming window
0162 w=w/sqrt(sum(w(1:ni:nf).^2));          % normalize to give overall gain of 1
0163 ns=length(s);
0164 y=enframe(s,w,ni);
0165 yf=rfft(y,nf,2);
0166 if ~size(yf,1)                                  % no data frames
0167     vs=[];
0168     nr=0;
0169     nb=0;
0170 else
0171     yp=yf.*conj(yf);                % power spectrum of input speech
0172     [nr,nf2]=size(yp);              % number of frames
0173     nb=ni*nr;
0174     isz=isstruct(fsz);
0175     if isz
0176         [dp,ze]=estnoisem(yp,ze);   % estimate the noise using minimum statistics
0177         xu=fsz.xu;                  % previously saved unsmoothed SNR
0178         lggami=fsz.gg;
0179         nv=fsz.nv;
0180     else
0181         [dp,ze]=estnoisem(yp,tinc,qp);      % estimate the noise using minimum statistics
0182         xu=1;                               % dummy unsmoothed SNR from previous frame [2](53)++
0183         lggami=0;                            % initial prob ratio
0184         nv=0;                               % number of previous speech samples
0185     end
0186     gam=min(yp./dp,gmax);               % gamma = posterior SNR [2](10)
0187     prsp=zeros(nr,1);                    % create space for prob ratio vector
0188     for i=1:nr                          % loop for each frame
0189         gami=gam(i,:);                  % gamma(i) = a posteriori SNR [2](10)
0190         xi=a*xu+(1-a)*max(gami-1,xn);   % xi = smoothed a priori SNR [2](53)
0191         xi1=1+xi;
0192         v=0.5*xi.*gami./xi1;            % note that this is 0.5*vk in [2]
0193         lamk=2*v-log(xi1);              % log likelihood ratio [1](3)
0194         lami=sum(lamk(2:nf2))/(nf2-1);  % mean log LR omitting DC term [1](4)
0195         if lggami<0                     % avoid overflow
0196             lggami=lprat+lami+log(b11+b00/(a00+a10*exp(lggami)));
0197         else
0198             lggami=lprat+lami+log(b01+b10/(a10+a00*exp(-lggami)));
0199         end
0200         if lggami<0
0201             gg=exp(lggami);
0202             prsp(i)=gg/(1+gg);
0203         else
0204             prsp(i)=1/(1+exp(-lggami));
0205         end
0206         gi=(0.277+2*v)./gami;           % accurate to 0.02 dB for v>0.5
0207         mv=v<0.5;
0208         if any(mv)
0209             vmv=v(mv);
0210             gi(mv)=kk*sqrt(vmv).*((0.5+vmv).*besseli(0,vmv)+vmv.*besseli(1,vmv))./(gami(mv).*exp(vmv));
0211         end
0212         xu=gami.*gi.^2;                 % unsmoothed prior SNR
0213     end
0214     ncol=any(m=='a')+any(m=='b');       % number of output data columns
0215     if ~ncol
0216         m(end+1)='a';   % force at least one output
0217         ncol=1;
0218     end
0219     if any(m=='n') || any(m=='t')       % frame-based outputs
0220         vs=zeros(nr,2+ncol);
0221         vs(:,1)=nd+1+nv+(0:nr-1)'*ni;   % starting sample of each frame
0222         vs(:,2)=ni-1+vs(:,1);           % ending sample of each frame
0223         if any(m=='t')
0224             vs(:,1:2)=vs(:,1:2)/fs;     % convert to seconds
0225         end
0226         if any(m=='a')
0227             vs(:,3)=prsp>qq.pr;
0228         end
0229         if any(m=='b')
0230             vs(:,end)=prsp;
0231         end
0232     else
0233         na=nd*(1-isz);                  % include preamble if no input state supplied
0234         nc=(nargout<=1)*(ns-nd-nb);     % include tail if no output state desired
0235         vs=zeros(na+nb+nc,ncol);
0236         vs(na+(1:nb),ncol)=reshape(repmat(prsp',ni,1),nb,1);
0237         vs(1:na,ncol)=vs(na+1,ncol);        % replicate start
0238         vs(na+nb+1:end,ncol)=vs(na+nb,ncol); % replicate end
0239         if any(m=='a')
0240             vs(:,1)=vs(:,ncol)>qq.pr;
0241         end
0242     end
0243 end
0244 if nargout>1
0245     zo.si=s(nd+nb+1:end);      % save the tail end of the input speech signal
0246     zo.fs=fs;                       % save sample frequency
0247     zo.qq=qq;                       % save local parameters
0248     zo.qp=qp;                       % save estnoisem parameters
0249     zo.ze=ze;                       % save state of noise estimation
0250     zo.xu=xu;                       % unsmoothed prior SNR  [2](53)
0251     zo.gg=lggami;                    % posterior prob ratio: P(speech|s)/P(silence|s) [1](11)
0252     zo.nv=nv+nd+nb;                      % number of previous speech samples
0253 end
0254 if ~nargout && nr>0
0255     ax=subplot(212);
0256     plot((nv+nd+[0 nr*ni])/fs,[qq.pr qq.pr],'r:',(nv+nd+ni/2+(0:nr-1)*ni)/fs,prsp,'b-' );
0257     set(gca,'ylim',[-0.05 1.05]);
0258     xlabel('Time (s)');
0259     ylabel('Pr(speech)');
0260     ax(2)=subplot(211);
0261     plot((nv+(1:ns))/fs,s);
0262     ylabel('Speech');
0263     title('Sohn VAD');
0264     linkaxes(ax,'x');
0265 end

Generated on Thu 23-Feb-2012 09:25:48 by m2html © 2003