Home > voicebox > activlev.m

activlev

PURPOSE ^

ACTIVLEV Measure active speech level as in ITU-T P.56 [LEV,AF,FSO]=(sp,FS,MODE)

SYNOPSIS ^

function [lev,af,fso,vad]=activlev(sp,fs,mode)

DESCRIPTION ^

ACTIVLEV Measure active speech level as in ITU-T P.56 [LEV,AF,FSO]=(sp,FS,MODE)

Usage: (1) lev=activlev(s,fs);     % speech level in units of power
       (2) db=activlev(s,fs,'d');  % speech level in dB
       (3) s=activlev(s,fs,'n');   % normalize active level to 0 dB

Inputs: sp     is the speech signal (with better than 20dB SNR)
        FS     is the sample frequency in Hz (see also FSO below)
        MODE   is a combination of the following:
               0 - omit high pass filter completely (i.e. include DC)
               3 - high pass filter at 30 Hz instead of 200 Hz (but allows mains hum to pass)
               4 - high pass filter at 40 Hz instead of 200 Hz (but allows mains hum to pass)
               1 - use cheybyshev 1 filter
               2 - use chebyshev 2 filter (default)
               e - use elliptic filter
               h - omit low pass filter at 5.5, 12 or 18 kHz
               w - use wideband filter frequencies: 70 Hz to 12 kHz
               W - use ultra wideband filter frequencies: 30 Hz to 18 kHz
               d - give outputs in dB rather than power
               n - output a normalized speech signal as the first argument
               N - output a normalized filtered speech signal as the first argument
               l - give both active and long-term power levels
               a - include A-weighting filter
               i - include ITU-R-BS.468/ITU-T-J.16 weighting filter
               z - do NOT zero-pad the signal by 0.35 s

Outputs:
    If the "n" option is specified, a speech signal normalized to 0dB will be given as
    the first output followed by the other outputs.
        LEV    gives the speech level in units of power (or dB if mode='d')
               if mode='l' is specified, LEV is a row vector with the "long term
               level" as its second element (this is just the mean power)
        AF     is the activity factor (or duty cycle) in the range 0 to 1
        FSO    is a column vector of intermediate information that allows
               you to process a speech signal in chunks. Thus:
                       fso=fs;
                       for i=1:inc:nsamp
                           [lev,af,fso]=activlev(sp(i:min(i+inc-1,nsamp)),fso,['z' mode]);
                       end
                       lev=activlev([],fso)
               is equivalent to:
                       lev=activlev(sp(1:nsamp),fs,mode)
               but is much slower. The two methods will not give identical results
               because they will use slightly different thresholds. Note you need
               the 'z' option for all calls except the last.
        VAD    is a boolean vector the same length as sp that acts as an approximate voice activity detector

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SOURCE CODE ^

0001 function [lev,af,fso,vad]=activlev(sp,fs,mode)
0002 %ACTIVLEV Measure active speech level as in ITU-T P.56 [LEV,AF,FSO]=(sp,FS,MODE)
0003 %
0004 %Usage: (1) lev=activlev(s,fs);     % speech level in units of power
0005 %       (2) db=activlev(s,fs,'d');  % speech level in dB
0006 %       (3) s=activlev(s,fs,'n');   % normalize active level to 0 dB
0007 %
0008 %Inputs: sp     is the speech signal (with better than 20dB SNR)
0009 %        FS     is the sample frequency in Hz (see also FSO below)
0010 %        MODE   is a combination of the following:
0011 %               0 - omit high pass filter completely (i.e. include DC)
0012 %               3 - high pass filter at 30 Hz instead of 200 Hz (but allows mains hum to pass)
0013 %               4 - high pass filter at 40 Hz instead of 200 Hz (but allows mains hum to pass)
0014 %               1 - use cheybyshev 1 filter
0015 %               2 - use chebyshev 2 filter (default)
0016 %               e - use elliptic filter
0017 %               h - omit low pass filter at 5.5, 12 or 18 kHz
0018 %               w - use wideband filter frequencies: 70 Hz to 12 kHz
0019 %               W - use ultra wideband filter frequencies: 30 Hz to 18 kHz
0020 %               d - give outputs in dB rather than power
0021 %               n - output a normalized speech signal as the first argument
0022 %               N - output a normalized filtered speech signal as the first argument
0023 %               l - give both active and long-term power levels
0024 %               a - include A-weighting filter
0025 %               i - include ITU-R-BS.468/ITU-T-J.16 weighting filter
0026 %               z - do NOT zero-pad the signal by 0.35 s
0027 %
0028 %Outputs:
0029 %    If the "n" option is specified, a speech signal normalized to 0dB will be given as
0030 %    the first output followed by the other outputs.
0031 %        LEV    gives the speech level in units of power (or dB if mode='d')
0032 %               if mode='l' is specified, LEV is a row vector with the "long term
0033 %               level" as its second element (this is just the mean power)
0034 %        AF     is the activity factor (or duty cycle) in the range 0 to 1
0035 %        FSO    is a column vector of intermediate information that allows
0036 %               you to process a speech signal in chunks. Thus:
0037 %                       fso=fs;
0038 %                       for i=1:inc:nsamp
0039 %                           [lev,af,fso]=activlev(sp(i:min(i+inc-1,nsamp)),fso,['z' mode]);
0040 %                       end
0041 %                       lev=activlev([],fso)
0042 %               is equivalent to:
0043 %                       lev=activlev(sp(1:nsamp),fs,mode)
0044 %               but is much slower. The two methods will not give identical results
0045 %               because they will use slightly different thresholds. Note you need
0046 %               the 'z' option for all calls except the last.
0047 %        VAD    is a boolean vector the same length as sp that acts as an approximate voice activity detector
0048 
0049 %For completeness we list here the contents of the FSO structure:
0050 %
0051 %   ffs : sample frequency
0052 %   fmd : mode string
0053 %    nh : hangover time in samples
0054 %    ae : smoothing filter coefs
0055 %    abl: HP filter numerator and denominator coefficient
0056 %    bh : LP filter numerator coefficient
0057 %    ah : LP filter denominator coefficients
0058 %    ze : smoothing filter state
0059 %    zl : HP filter state
0060 %    zh : LP filter state
0061 %    zx : hangover max filter state
0062 %  emax : maximum envelope exponent + 1
0063 %   ssq : signal sum of squares
0064 %    ns : number of signal samples
0065 %    ss : sum of speech samples (not actually used here)
0066 %    kc : cumulative occupancy counts
0067 %    aw : weighting filter denominator
0068 %    bw : weighting filter numerator
0069 %    zw : weighting filter state
0070 %
0071 % This routine implements "Method B" from [1],[2] to calculate the active
0072 % speech level which is defined to be the speech energy divided by the
0073 % duration of speech activity. Speech is designated as "active" based on an
0074 % adaptive threshold applied to the smoothed rectified speech signal. A
0075 % bandpass filter is first applied to the input speech whose -0.25 dB points
0076 % are at 200 Hz & 5.5 kHz by default but this can be changed to 70 Hz & 5.5 kHz
0077 % or to 30 Hz & 18 kHz by specifying the 'w' or 'W' options; these
0078 % correspond respectively to Annexes B and C in [2].
0079 %
0080 % References:
0081 % [1]    ITU-T. Objective measurement of active speech level. Recommendation P.56, Mar. 1993.
0082 % [2]    ITU-T. Objective measurement of active speech level. Recommendation P.56, Dec. 2011.
0083 
0084 %      Copyright (C) Mike Brookes 2008-2016
0085 %      Version: $Id: activlev.m 9407 2017-02-07 13:25:55Z dmb $
0086 %
0087 %   VOICEBOX is a MATLAB toolbox for speech processing.
0088 %   Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
0089 %
0090 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0091 %   This program is free software; you can redistribute it and/or modify
0092 %   it under the terms of the GNU General Public License as published by
0093 %   the Free Software Foundation; either version 2 of the License, or
0094 %   (at your option) any later version.
0095 %
0096 %   This program is distributed in the hope that it will be useful,
0097 %   but WITHOUT ANY WARRANTY; without even the implied warranty of
0098 %   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
0099 %   GNU General Public License for more details.
0100 %
0101 %   You can obtain a copy of the GNU General Public License from
0102 %   http://www.gnu.org/copyleft/gpl.html or by writing to
0103 %   Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA.
0104 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0105 
0106 persistent nbin thresh c25zp c15zp e5zp
0107 if isempty(nbin)
0108     nbin=20;    % 60 dB range at 3dB per bin
0109     thresh=15.9;    % threshold in dB
0110     % High pass s-domain zeros and poles of filters with passband ripple<0.25dB, stopband<-50dB, w0=1
0111     %    w0=fzero(@ch2,0.5); [c2z,c2p,k]=cheby2(5,50,w0,'high','s');
0112     %    function v=ch2(w); [c2z,c2p,k]=cheby2(5,50,w,'high','s'); v= 20*log10(prod(abs(1i-c2z))/prod(abs(1i-c2p)))+0.25;
0113     c25zp=[0.37843443673309i 0.23388534441447i; -0.20640255179496+0.73942185906851i -0.54036889596392+0.45698784092898i];
0114     c25zp=[[0; -0.66793268833792] c25zp conj(c25zp)];
0115     %       [c1z,c1p,c1k] = cheby1(5,0.25,1,'high','s');
0116     c15zp=[-0.659002835294875+1.195798636925079i -0.123261821596263+0.947463030958881i];
0117     c15zp=[zeros(1,5); -2.288586431066945 c15zp conj(c15zp)];
0118     %      [ez,ep,ek] = ellip(5,0.25,50,1,'high','s')
0119     e5zp=[0.406667680649209i 0.613849362744881i; -0.538736390607201+1.130245082677107i -0.092723126159100+0.958193646330194i];
0120     e5zp=[[0; -1.964538608244084]  e5zp conj(e5zp)];
0121     %    w=linspace(0.2,2,100);
0122     %    figure(1); plot(w,20*log10(abs(freqs(real(poly(c15zp(1,:))),real(poly(c15zp(2,:))),w)))); title('Chebyshev 1');
0123     %    figure(2); plot(w,20*log10(abs(freqs(real(poly(c25zp(1,:))),real(poly(c25zp(2,:))),w)))); title('Chebyshev 2');
0124     %    figure(3); plot(w,20*log10(abs(freqs(real(poly(e5zp(1,:))),real(poly(e5zp(2,:))),w)))); title('Elliptic');
0125 end
0126 
0127 if ~isstruct(fs)                        % no state vector given
0128     if nargin<3
0129         mode=' ';
0130     end
0131     fso.ffs=fs;                           % sample frequency
0132     
0133     ti=1/fs;
0134     g=exp(-ti/0.03);                    % pole position for envelope filter
0135     fso.ae=[1 -2*g g^2]/(1-g)^2;        % envelope filter coefficients (DC gain = 1)
0136     fso.ze=zeros(2,1);
0137     fso.nh=ceil(0.2/ti)+1;              % hangover time in samples
0138     fso.zx=-Inf;                        % initial value for maxfilt()
0139     fso.emax=-Inf;                      % maximum exponent
0140     fso.ns=0;
0141     fso.ssq=0;
0142     fso.ss=0;
0143     fso.kc=zeros(nbin,1);               % cumulative occupancy counts
0144     % s-plane zeros and poles of high pass 5'th order filter -0.25dB at w=1 and -50dB stopband
0145     if any(mode=='1')
0146         szp=c15zp;              % Chebyshev 1
0147     elseif any(mode=='e')
0148         szp=e5zp;               % Elliptic
0149     else
0150         szp=c25zp;              % Chebyshev 2
0151     end
0152     flh=[200 5500];             % default frequency range +- 0.25 dB
0153     if any(mode=='w')
0154         flh=[70 12000];         % super-wideband (Annex B of [2])
0155     elseif any(mode=='W')
0156         flh=[30 18000];         % full band (Annex C of [2])
0157     end
0158     if any(mode=='3')
0159         flh(1)=30;              % force a 30 Hz HPF cutoff
0160     end
0161     if any(mode=='4')
0162         flh(1)=40;              % force a 40 Hz HPF cutoff
0163     end
0164     if any(mode=='r')              % included for backward compatibility
0165         mode=['0h' mode];        % abolish both filters
0166     elseif fs<flh(2)*2.2
0167         mode=['h' mode];           % abolish lowpass filter at low sample rates
0168     end
0169     fso.fmd=mode;                % save mode flags
0170     if all(mode~='0')           % implement the HPF as biquads to avoid rounding errors
0171         zl=2./(1-szp*tan(flh(1)*pi/fs))-1;      % Transform s-domain poles/zeros with bilinear transform
0172         abl=[ones(2,1) -zl(:,1) -2*real(zl(:,2:3))  abs(zl(:,2:3)).^2];     % biquad coefficients
0173         hfg=(abl*[1 -1 0 0 0 0]').*(abl*[1 0 -1 0 1 0]').*(abl*[1 0 0 -1 0 1]');
0174         abl=abl(:,[1 2 1 3 5 1 4 6]);               % reorder into biquads
0175         abl(1,1:2)= abl(1,1:2)*hfg(2)/hfg(1);       % force Nyquist gain to equal 1
0176         fso.abl=abl;
0177         fso.zl=zeros(5,1);                          % space for HPF filter state
0178     end
0179     if all(mode~='h')
0180         zh=2./(szp/tan(flh(2)*pi/fs)-1)+1;     % Transform s-domain poles/zeros with bilinear transform
0181         ah=real(poly(zh(2,:)));
0182         bh=real(poly(zh(1,:)));
0183         fso.bh=bh*sum(ah)/sum(bh);
0184         fso.ah=ah;
0185         fso.zh=zeros(5,1);
0186     end
0187     if any(mode=='a')
0188         [fso.bw,fso.aw]=stdspectrum(2,'z',fs);
0189         fso.zw=zeros(length(fso.aw)-1,1);
0190     elseif any(mode=='i')
0191         [fso.bw,fso.aw]=stdspectrum(8,'z',fs);
0192         fso.zw=zeros(length(fso.aw)-1,1);
0193     end
0194 else
0195     fso=fs;             % use existing structure
0196 end
0197 md=fso.fmd;
0198 if nargin<3
0199     mode=fso.fmd;
0200 end
0201 nsp=length(sp); % original length of speech
0202 if all(mode~='z')
0203     nz=ceil(0.35*fso.ffs); % number of zeros to append
0204     sp=[sp(:);zeros(nz,1)];
0205 else
0206     nz=0;
0207 end
0208 ns=length(sp);
0209 if ns                       % process this speech chunk
0210     % apply the input filters to the speech
0211     if all(md~='0')         % implement the HPF as biquads to avoid rounding errors
0212         [sq,fso.zl(1)]=filter(fso.abl(1,1:2),fso.abl(2,1:2),sp(:),fso.zl(1));       % highpass filter: real pole/zero
0213         [sq,fso.zl(2:3)]=filter(fso.abl(1,3:5),fso.abl(2,3:5),sq(:),fso.zl(2:3));      % highpass filter: biquad 1
0214         [sq,fso.zl(4:5)]=filter(fso.abl(1,6:8),fso.abl(2,6:8),sq(:),fso.zl(4:5));      % highpass filter: biquad 2
0215     else
0216         sq=sp(:);
0217     end
0218     if all(md~='h')
0219         [sq,fso.zh]=filter(fso.bh,fso.ah,sq(:),fso.zh);     % lowpass filter
0220     end
0221     if any(md=='a') || any(md=='i')
0222         [sq,fso.zw]=filter(fso.bw,fso.aw,sq(:),fso.zw);     % weighting filter
0223     end
0224     fso.ns=fso.ns+ns;                               % count the number of speech samples
0225     fso.ss=fso.ss+sum(sq);                          % sum of speech samples
0226     fso.ssq=fso.ssq+sum(sq.*sq);                    % sum of squared speech samples
0227     [s,fso.ze]=filter(1,fso.ae,abs(sq(:)),fso.ze);     % envelope filter
0228     [qf,qe]=log2(s.^2);                             % take efficient log2 function, 2^qe is upper limit of bin
0229     qe(qf==0)=-Inf;                                 % fix zero values
0230     [qe,qk,fso.zx]=maxfilt(qe,1,fso.nh,1,fso.zx);      % apply the 0.2 second hangover
0231     oemax=fso.emax;
0232     fso.emax=max(oemax,max(qe)+1);
0233     if fso.emax==-Inf
0234         fso.kc(1)=fso.kc(1)+ns;
0235     else
0236         qe=min(fso.emax-qe,nbin);   % force in the range 1:nbin. Bin k has 2^(emax-k-1)<=s^2<=2^(emax-k)
0237         wqe=ones(length(qe),1);
0238         % below: could use kc=cumsum(accumarray(qe,wqe,nbin)) but unsure about backwards compatibility
0239         kc=cumsum(full(sparse(qe,wqe,wqe,nbin,1)));     % cumulative occupancy counts
0240         esh=fso.emax-oemax;                             % amount to shift down previous bin counts
0241         if esh<nbin-1                                   % if any of the previous bins are worth keeping
0242             kc(esh+1:nbin-1)=kc(esh+1:nbin-1)+fso.kc(1:nbin-esh-1);
0243             kc(nbin)=kc(nbin)+sum(fso.kc(nbin-esh:nbin));
0244         else
0245             kc(nbin)=kc(nbin)+sum(fso.kc); % otherwise just add all old counts into the last (lowest) bin
0246         end
0247         fso.kc=kc;
0248     end
0249 end
0250 if fso.ns                       % now calculate the output values
0251     if fso.ssq>0
0252         aj=10*log10(fso.ssq*(fso.kc).^(-1));
0253         % equivalent to cj=20*log10(sqrt(2).^(fso.emax-(1:nbin)-1));
0254         cj=10*log10(2)*(fso.emax-(1:nbin)-1);               % lower limit of bin j in dB
0255         mj=aj'-cj-thresh;
0256         %  jj=find(mj*sign(mj(1))<=0); % Find threshold
0257         jj=find(mj(1:end-1)<0 &  mj(2:end)>=0,1);           % find +ve transition through threshold
0258         if isempty(jj)                                      % if we never cross the threshold
0259             if mj(end)<=0                                   % if we end up below if
0260                 jj=length(mj)-1;            % take the threshold to be the bottom of the last (lowest) bin
0261                 jf=1;
0262             else                            % if we are always above it
0263                 jj=1;                       % take the threshold to be the bottom of the first (highest) bin
0264                 jf=0;
0265             end
0266         else
0267             jf=1/(1-mj(jj+1)/mj(jj));       % fractional part of j using linear interpolation
0268         end
0269         lev=aj(jj)+jf*(aj(jj+1)-aj(jj));    % active level in decibels
0270         lp=10.^(lev/10);                    % active level in power
0271         if any(md=='d')                     % 'd' option -> output in dB
0272             lev=[lev 10*log10(fso.ssq/fso.ns)];
0273         else                                % ~'d' option -> output in power
0274             lev=[lp fso.ssq/fso.ns];
0275         end
0276         af=fso.ssq/(fso.ns*lp);
0277     else                        % if all samples are equal to zero
0278         af=0;
0279         if any(md=='d')         % 'd' option -> output in dB
0280             lev=[-Inf -Inf];    % active level is 0 dB
0281         else                    % ~'d' option -> output in power
0282             lev=[0 0];          % active level is 0 power
0283         end
0284     end
0285     if all(md~='l')
0286         lev=lev(1);         % only output the first element of lev unless 'l' option
0287     end
0288 end
0289 if nargout>3
0290     vad=maxfilt(s(1:nsp),1,fso.nh,1);
0291     vad=vad>(sqrt(lp)/10^(thresh/20));
0292 end
0293 if ~nargout
0294     vad=maxfilt(s,1,fso.nh,1);
0295     vad=vad>(sqrt(lp)/10^(thresh/20));
0296     levdb=10*log10(lp);
0297     clf;
0298     subplot(2,2,[1 2]);
0299     tax=(1:ns)/fso.ffs;
0300     plot(tax,sp,'-y',tax,s,'-r',tax,(vad>0)*sqrt(lp),'-b');
0301     xlabel('Time (s)');
0302     title(sprintf('Active Level = %.2g dB, Activity = %.0f%% (ITU-T P.56)',levdb,100*af));
0303     axisenlarge([-1 -1 -1.4 -1.05]);
0304     if nz>0
0305         hold on
0306         ylim=get(gca,'ylim');
0307         plot(tax(end-nz)*[1 1],ylim,':k');
0308         hold off
0309     end
0310     ylabel('Amplitude');
0311     legend('Signal','Smoothed envelope','VAD * Active-Level','Location','SouthEast');
0312     subplot(2,2,4);
0313     plot(cj,repmat(levdb,nbin,1),'k:',cj,aj(:),'-b',cj,cj,'-r',levdb-thresh*ones(1,2),[levdb-thresh levdb],'-r');
0314     xlabel('Threshold (dB)');
0315     ylabel('Active Level (dB)');
0316     legend('Active Level','Speech>Thresh','Threshold','Location','NorthWest');
0317     texthvc(levdb-thresh,levdb-0.5*thresh,sprintf('%.1f dB ',thresh),'rmr');
0318     axisenlarge([-1 -1.05]);
0319     ylim=get(gca,'ylim');
0320     set(gca,'ylim',[levdb-1.2*thresh max(ylim(2),levdb+1.9*thresh)]);
0321     kch=filter([1 -1],1,kc);
0322     subplot(2,2,3);
0323     bar(5*log10(2)+cj(end:-1:1),kch(end:-1:1)*100/kc(end));
0324     set(gca,'xlim',[cj(end) cj(1)+10*log10(2)]);
0325     ylim=get(gca,'ylim');
0326     hold on
0327     plot(lev([1 1]),ylim,'k:',lev([1 1])-thresh,ylim,'r:');
0328     hold off
0329     texthvc(lev(1),ylim(2),sprintf(' Act\n Lev'),'ltk');
0330     texthvc(lev(1)-thresh,ylim(2),sprintf('Threshold '),'rtr');
0331     xlabel('Frame power (dB)')
0332     ylabel('% frames');
0333 elseif any(md=='n') || any(md=='N') % output normalized speech waveform
0334     fsx=fso; % shift along other outputs
0335     fso=af;
0336     af=lev;
0337     if any(md=='n')
0338         sq=sp; % 'n' -> use unfiltered speech
0339     end
0340     if fsx.ns>0 && fsx.ssq>0 % if there has been any non-zero speech
0341         lev=sq(1:nsp)/sqrt(lp);
0342     else
0343         lev=sq(1:nsp);
0344     end
0345 end

Generated on Fri 24-Feb-2017 15:52:35 by m2html © 2003