Description of v

V_SPECSUB performs speech enhancement using spectral subtraction [SS,ZO]=(S,FSZ,P)

 Usage: (1) y=v_specsub(x,fs);   % enhance the speech using default parameters

 Inputs:
   si      input speech signal
   fsz     sample frequency in Hz
           Alternatively, the input state from a previous call (see below)
   pp      algorithm parameters [optional]

 Outputs:
   ss        output enhanced speech
   gg(t,f,i) selected time-frequency values (see pp.tf below)
   tt        centre of frames (in seconds)
   ff        centre of frequency bins (in Hz)
   zo        output state (or the 2nd argument if gg,tt,ff are omitted)

 The algorithm operation is controlled by a small number of parameters:

        pp.of          % overlap factor = (fft length)/(frame increment) [2]
        pp.ti          % desired frame increment [0.016 seconds]
        pp.ri          % set to 1 to round ti to the nearest power of 2 samples [0]
        pp.g           % subtraction domain: 1=magnitude, 2=power [1]
        pp.e           % gain exponent [1]
        pp.am          % max oversubtraction factor [3]
        pp.b           % max noise attenutaion in power domain [0.01]
        pp.al          % SNR for oversubtraction=am (set this to Inf for fixed a) [-5 dB]
        pp.ah          % SNR for oversubtraction=1 [20 dB]
        pp.ne          % noise estimation: 0=min statistics, 1=MMSE [0]
        pp.bt          % threshold for binary gain or -1 for continuous gain [-1]
        pp.mx          % input mixture gain [0]
        pp.gh          % maximum gain for noise floor [1]
        pp.rf          % round output signal to an exact number of frames [0]
        pp.tf          % selects time-frequency planes to output in the gg() variable ['g']
                           'i' = input power spectrum
                           'I' = input complex spectrum
                           'n' = noise power spectrum
                           'g' = gain
                           'o' = output power spectrum
                           'O' = output complex spectrum

 Following [1], the magnitude-domain gain in each time-frequency bin is given by
                          gain=mx+(1-mx)*max((1-(a*N/X)^(g/2))^(e/g),min(gh,(b*N/X)^(e/2)))
 where N and X are the powers of the noise and noisy speech respectively.
 The oversubtraction factor varies linearly between a=am for a frame SNR of al down to
 a=1 for a frame SNR of ah. To obtain a fixed value of a for all values of SNR, set al=Inf.
 Common exponent combinations are:
                      g=1  e=1    Magnitude Domain spectral subtraction
                      g=2  e=1    Power Domain spectral subtraction
                      g=2  e=2    Wiener filtering
 Many authors use the parameters alpha=a^(g/2), beta=b^(g/2) and gamma2=e/g instead of a, b and e
 but this increases interdependence amongst the parameters.
 If bt>=0 then the max(...) expression above is thresholded to become 0 or 1.

 In addition it is possible to specify parameters for the noise estimation algorithm
 which implements reference [2] or [3] according to the setting of pp.ne

 Minimum statistics noise estimate [2]: pp.ne=0 
        pp.taca      % (11): smoothing time constant for alpha_c [0.0449 seconds]
        pp.tamax     % (3): max smoothing time constant [0.392 seconds]
        pp.taminh    % (3): min smoothing time constant (upper limit) [0.0133 seconds]
        pp.tpfall    % (12): time constant for P to fall [0.064 seconds]
        pp.tbmax     % (20): max smoothing time constant [0.0717 seconds]
        pp.qeqmin    % (23): minimum value of Qeq [2]
        pp.qeqmax    % max value of Qeq per frame [14]
        pp.av        % (23)+13 lines: fudge factor for bc calculation  [2.12]
        pp.td        % time to take minimum over [1.536 seconds]
        pp.nu        % number of subwindows to use [3]
        pp.qith      % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ]
        pp.nsmdb     % corresponding noise slope thresholds in dB/second   [47 31.4 15.7 4.1]

 MMSE noise estimate [3]: pp.ne=1 
        pp.tax      % smoothing time constant for noise power estimate [0.0717 seconds](8)
        pp.tap      % smoothing time constant for smoothed speech prob [0.152 seconds](23)
        pp.psthr    % threshold for smoothed speech probability [0.99] (24)
        pp.pnsaf    % noise probability safety value [0.01] (24)
        pp.pspri    % prior speech probability [0.5] (18)
        pp.asnr     % active SNR in dB [15] (18)
        pp.psini    % initial speech probability [0.5] (23)
        pp.tavini   % assumed speech absent time at start [0.064 seconds]

 If convenient, you can call v_specsub in chunks of arbitrary size. Thus the following are equivalent:

                   (a) y=v_specsub(s,fs);

                   (b) [y1,z]=v_specsub(s(1:1000),fs);
                       [y2,z]=v_specsub(s(1001:2000),z);
                       y3=v_specsub(s(2001:end),z);
                       y=[y1; y2; y3];

 If the number of output arguments is either 2 or 5, the last partial frame of samples will
 be retained for overlap adding with the output from the next call to v_specsub().

 See also v_ssubmmse() for an alternative gain function

 Refs:
    [1] M. Berouti, R. Schwartz and J. Makhoul
        Enhancement of speech corrupted by acoustic noise
        Proc IEEE ICASSP, 1979, 4, 208-211
    [2] Rainer Martin.
        Noise power spectral density estimation based on optimal smoothing and minimum statistics.
        IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.
    [3] Gerkmann, T. & Hendriks, R. C.
        Unbiased MMSE-Based Noise Power Estimation With Low Complexity and Low Tracking Delay
        IEEE Trans Audio, Speech, Language Processing, 2012, 20, 1383-1393

0001 function [ss,gg,tt,ff,zo]=v_specsub(si,fsz,pp)
0002 %V_SPECSUB performs speech enhancement using spectral subtraction [SS,ZO]=(S,FSZ,P)
0003 %
0004 % Usage: (1) y=v_specsub(x,fs);   % enhance the speech using default parameters
0005 %
0006 % Inputs:
0007 %   si      input speech signal
0008 %   fsz     sample frequency in Hz
0009 %           Alternatively, the input state from a previous call (see below)
0010 %   pp      algorithm parameters [optional]
0011 %
0012 % Outputs:
0013 %   ss        output enhanced speech
0014 %   gg(t,f,i) selected time-frequency values (see pp.tf below)
0015 %   tt        centre of frames (in seconds)
0016 %   ff        centre of frequency bins (in Hz)
0017 %   zo        output state (or the 2nd argument if gg,tt,ff are omitted)
0018 %
0019 % The algorithm operation is controlled by a small number of parameters:
0020 %
0021 %        pp.of          % overlap factor = (fft length)/(frame increment) [2]
0022 %        pp.ti          % desired frame increment [0.016 seconds]
0023 %        pp.ri          % set to 1 to round ti to the nearest power of 2 samples [0]
0024 %        pp.g           % subtraction domain: 1=magnitude, 2=power [1]
0025 %        pp.e           % gain exponent [1]
0026 %        pp.am          % max oversubtraction factor [3]
0027 %        pp.b           % max noise attenutaion in power domain [0.01]
0028 %        pp.al          % SNR for oversubtraction=am (set this to Inf for fixed a) [-5 dB]
0029 %        pp.ah          % SNR for oversubtraction=1 [20 dB]
0030 %        pp.ne          % noise estimation: 0=min statistics, 1=MMSE [0]
0031 %        pp.bt          % threshold for binary gain or -1 for continuous gain [-1]
0032 %        pp.mx          % input mixture gain [0]
0033 %        pp.gh          % maximum gain for noise floor [1]
0034 %        pp.rf          % round output signal to an exact number of frames [0]
0035 %        pp.tf          % selects time-frequency planes to output in the gg() variable ['g']
0036 %                           'i' = input power spectrum
0037 %                           'I' = input complex spectrum
0038 %                           'n' = noise power spectrum
0039 %                           'g' = gain
0040 %                           'o' = output power spectrum
0041 %                           'O' = output complex spectrum
0042 %
0043 % Following [1], the magnitude-domain gain in each time-frequency bin is given by
0044 %                          gain=mx+(1-mx)*max((1-(a*N/X)^(g/2))^(e/g),min(gh,(b*N/X)^(e/2)))
0045 % where N and X are the powers of the noise and noisy speech respectively.
0046 % The oversubtraction factor varies linearly between a=am for a frame SNR of al down to
0047 % a=1 for a frame SNR of ah. To obtain a fixed value of a for all values of SNR, set al=Inf.
0048 % Common exponent combinations are:
0049 %                      g=1  e=1    Magnitude Domain spectral subtraction
0050 %                      g=2  e=1    Power Domain spectral subtraction
0051 %                      g=2  e=2    Wiener filtering
0052 % Many authors use the parameters alpha=a^(g/2), beta=b^(g/2) and gamma2=e/g instead of a, b and e
0053 % but this increases interdependence amongst the parameters.
0054 % If bt>=0 then the max(...) expression above is thresholded to become 0 or 1.
0055 %
0056 % In addition it is possible to specify parameters for the noise estimation algorithm
0057 % which implements reference [2] or [3] according to the setting of pp.ne
0058 %
0059 % Minimum statistics noise estimate [2]: pp.ne=0
0060 %        pp.taca      % (11): smoothing time constant for alpha_c [0.0449 seconds]
0061 %        pp.tamax     % (3): max smoothing time constant [0.392 seconds]
0062 %        pp.taminh    % (3): min smoothing time constant (upper limit) [0.0133 seconds]
0063 %        pp.tpfall    % (12): time constant for P to fall [0.064 seconds]
0064 %        pp.tbmax     % (20): max smoothing time constant [0.0717 seconds]
0065 %        pp.qeqmin    % (23): minimum value of Qeq [2]
0066 %        pp.qeqmax    % max value of Qeq per frame [14]
0067 %        pp.av        % (23)+13 lines: fudge factor for bc calculation  [2.12]
0068 %        pp.td        % time to take minimum over [1.536 seconds]
0069 %        pp.nu        % number of subwindows to use [3]
0070 %        pp.qith      % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ]
0071 %        pp.nsmdb     % corresponding noise slope thresholds in dB/second   [47 31.4 15.7 4.1]
0072 %
0073 % MMSE noise estimate [3]: pp.ne=1
0074 %        pp.tax      % smoothing time constant for noise power estimate [0.0717 seconds](8)
0075 %        pp.tap      % smoothing time constant for smoothed speech prob [0.152 seconds](23)
0076 %        pp.psthr    % threshold for smoothed speech probability [0.99] (24)
0077 %        pp.pnsaf    % noise probability safety value [0.01] (24)
0078 %        pp.pspri    % prior speech probability [0.5] (18)
0079 %        pp.asnr     % active SNR in dB [15] (18)
0080 %        pp.psini    % initial speech probability [0.5] (23)
0081 %        pp.tavini   % assumed speech absent time at start [0.064 seconds]
0082 %
0083 % If convenient, you can call v_specsub in chunks of arbitrary size. Thus the following are equivalent:
0084 %
0085 %                   (a) y=v_specsub(s,fs);
0086 %
0087 %                   (b) [y1,z]=v_specsub(s(1:1000),fs);
0088 %                       [y2,z]=v_specsub(s(1001:2000),z);
0089 %                       y3=v_specsub(s(2001:end),z);
0090 %                       y=[y1; y2; y3];
0091 %
0092 % If the number of output arguments is either 2 or 5, the last partial frame of samples will
0093 % be retained for overlap adding with the output from the next call to v_specsub().
0094 %
0095 % See also v_ssubmmse() for an alternative gain function
0096 %
0097 % Refs:
0098 %    [1] M. Berouti, R. Schwartz and J. Makhoul
0099 %        Enhancement of speech corrupted by acoustic noise
0100 %        Proc IEEE ICASSP, 1979, 4, 208-211
0101 %    [2] Rainer Martin.
0102 %        Noise power spectral density estimation based on optimal smoothing and minimum statistics.
0103 %        IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.
0104 %    [3] Gerkmann, T. & Hendriks, R. C.
0105 %        Unbiased MMSE-Based Noise Power Estimation With Low Complexity and Low Tracking Delay
0106 %        IEEE Trans Audio, Speech, Language Processing, 2012, 20, 1383-1393
0107 
0108 %      Copyright (C) Mike Brookes 2004
0109 %      Version: $Id: v_specsub.m 10865 2018-09-21 17:22:45Z dmb $
0110 %
0111 %   VOICEBOX is a MATLAB toolbox for speech processing.
0112 %   Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
0113 %
0114 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0115 %   This program is free software; you can redistribute it and/or modify
0116 %   it under the terms of the GNU Lesser General Public License as published by
0117 %   the Free Software Foundation; either version 3 of the License, or
0118 %   (at your option) any later version.
0119 %
0120 %   This program is distributed in the hope that it will be useful,
0121 %   but WITHOUT ANY WARRANTY; without even the implied warranty of
0122 %   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
0123 %   GNU Lesser General Public License for more details.
0124 %
0125 %   You can obtain a copy of the GNU Lesser General Public License from
0126 %   https://www.gnu.org/licenses/ .
0127 %
0128 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0129 if numel(si)>length(si)
0130     error('Input speech signal must be a vector not a matrix');
0131 end
0132 if isstruct(fsz)
0133     fs=fsz.fs;
0134     qq=fsz.qq;
0135     qp=fsz.qp;
0136     ze=fsz.ze;
0137     s=zeros(length(fsz.si)+length(si(:)),1); % allocate space for speech
0138     s(1:length(fsz.si))=fsz.si;
0139     s(length(fsz.si)+1:end)=si(:);
0140 else
0141     fs=fsz;     % sample frequency
0142     s=si(:);
0143     % default algorithm constants
0144 
0145     qq.of=2;   % overlap factor = (fft length)/(frame increment)
0146     qq.ti=16e-3;   % desired frame increment (16 ms)
0147     qq.ri=0;       % round ni to the nearest power of 2
0148     qq.g=1;        % subtraction domain: 1=magnitude, 2=power
0149     qq.e=1;        % gain exponent
0150     qq.am=3;      % max oversubtraction factor
0151     qq.b=0.01;      % noise floor
0152     qq.al=-5;       % SNR for maximum a (set to Inf for fixed a)
0153     qq.ah=20;       % SNR for minimum a
0154     qq.bt=-1;       % suppress binary masking
0155     qq.ne=0;        % noise estimation: 0=min statistics, 1=MMSE [0]
0156     qq.mx=0;        % no input mixing
0157     qq.gh=1;        % maximum gain
0158     qq.tf='g';      % output the gain time-frequency plane by default
0159     qq.rf=0;
0160     if nargin>=3 && ~isempty(pp)
0161         qp=pp;      % save for v_estnoisem call
0162         qqn=fieldnames(qq);
0163         for i=1:length(qqn)
0164             if isfield(pp,qqn{i})
0165                 qq.(qqn{i})=pp.(qqn{i});
0166             end
0167         end
0168     else
0169         qp=struct;  % make an empty structure
0170     end
0171 end
0172 % derived algorithm constants
0173 if qq.ri
0174     ni=pow2(nextpow2(qq.ti*fs*sqrt(0.5)));
0175 else
0176     ni=round(qq.ti*fs);    % frame increment in samples
0177 end
0178 tinc=ni/fs;          % true frame increment time
0179 tf=qq.tf;
0180 rf=qq.rf || nargout==2 || nargout==5;            % round down to an exact number of frames
0181 ne=qq.ne;           % noise estimation: 0=min statistics, 1=MMSE [0]
0182 
0183 % calculate power spectrum in frames
0184 
0185 no=round(qq.of);                                   % integer overlap factor
0186 nf=ni*no;           % fft length
0187 w=sqrt(hamming(nf+1))'; w(end)=[]; % for now always use sqrt hamming window
0188 w=w/sqrt(sum(w(1:ni:nf).^2));       % normalize to give overall gain of 1
0189 if rf>0
0190     rfm='';                         % truncated input to an exact number of frames
0191 else
0192     rfm='r';
0193 end
0194 [y,tt]=v_enframe(s,w,ni,rfm);
0195 tt=tt/fs;                           % frame times
0196 yf=v_rfft(y,nf,2);
0197 yp=yf.*conj(yf);        % power spectrum of input speech
0198 [nr,nf2]=size(yp);              % number of frames
0199 ff=(0:nf2-1)*fs/nf;
0200 if isstruct(fsz)
0201     if ne>0
0202         [dp,ze]=v_estnoiseg(yp,ze);       % estimate the noise using MMSE
0203     else
0204         [dp,ze]=v_estnoisem(yp,ze);       % estimate the noise using minimum statistics
0205     end
0206     ssv=fsz.ssv;
0207 else
0208     if ne>0
0209         [dp,ze]=v_estnoiseg(yp,tinc,qp);    % estimate the noise using MMSE
0210     else
0211         [dp,ze]=v_estnoisem(yp,tinc,qp);    % estimate the noise using minimum statistics
0212     end
0213     ssv=zeros(ni*(no-1),1);             % dummy saved overlap
0214 end
0215 if ~nr                                  % no data frames
0216     ss=[];
0217     gg=[];
0218 else
0219     mz=yp==0;   %  mask for zero power time-frequency bins (unlikely)
0220     if qq.al<Inf
0221         ypf=sum(yp,2);
0222         dpf=sum(dp,2);
0223         mzf=dpf==0;     % zero noise frames = very high SNR
0224         af=1+(qq.am-1)*(min(max(10*log10(ypf./(dpf+mzf)),qq.al),qq.ah)-qq.ah)/(qq.al-qq.ah);
0225         af(mzf)=1;      % fix the zero noise frames
0226     else
0227         af=repmat(qq.am,nr,1);
0228     end
0229     switch qq.g
0230         case 1   % magnitude domain subtraction
0231             v=sqrt(dp./(yp+mz));
0232             af=sqrt(af);
0233             bf=sqrt(qq.b);
0234         case 2   % power domain subtraction
0235             v=dp./(yp+mz);
0236             bf=qq.b;
0237         otherwise % arbitrary subtraction domain
0238             v=(dp./(yp+mz)).^(0.5*qq.g);
0239             af=af.^(0.5*qq.g);
0240             bf=qq.b^(0.5*qq.g);
0241     end
0242     af =repmat(af,1,nf2);       % replicate frame oversubtraction factors for each frequency
0243     mf=v>=(af+bf).^(-1);        % mask for noise floor limiting
0244     g=zeros(size(v));           % reserve space for gain matrix
0245     eg=qq.e/qq.g;               % gain exponent relative to subtraction domain
0246     gh=qq.gh;
0247     switch eg
0248         case 1                          % Normal case
0249             g(mf)=min(bf*v(mf),gh);      % never give a gain > 1
0250             g(~mf)=1-af(~mf).*v(~mf);
0251         case 0.5
0252             g(mf)=min(sqrt(bf*v(mf)),gh);
0253             g(~mf)=sqrt(1-af(~mf).*v(~mf));
0254         otherwise
0255             g(mf)=min((bf*v(mf)).^eg,gh);
0256             g(~mf)=(1-af(~mf).*v(~mf)).^eg;
0257     end
0258     if qq.bt>=0
0259         g=g>qq.bt;
0260     end
0261     g=qq.mx+(1-qq.mx)*g;   % mix in some of the input
0262     se=(v_irfft((yf.*g).',nf).').*repmat(w,nr,1);   % inverse dft and apply output window
0263     ss=zeros(ni*(nr+no-1),no);                      % space for overlapped output speech
0264     ss(1:ni*(no-1),end)=ssv;
0265     for i=1:no
0266         nm=nf*(1+floor((nr-i)/no));  % number of samples in this set
0267         ss(1+(i-1)*ni:nm+(i-1)*ni,i)=reshape(se(i:no:nr,:)',nm,1);
0268     end
0269     ss=sum(ss,2);
0270     if nargout>2 && ~isempty(tf)
0271         gg=zeros(nr,nf2,length(tf));  % make space
0272         for i=1:length(tf)
0273             switch tf(i)
0274                 case 'i'            % 'i' = input power spectrum
0275                     gg(:,:,i)=yp;
0276                 case 'I'            % 'i' = input power spectrum
0277                     gg(:,:,i)=yf;
0278                 case 'n'            % 'n' = noise power spectrum
0279                     gg(:,:,i)=dp;
0280                 case 'g'            % 'g' = gain
0281                     gg(:,:,i)=g;
0282                 case 'o'            % 'o' = output power spectrum
0283                     gg(:,:,i)=yp.*g.^2;
0284                 case 'O'            % 'o' = output power spectrum
0285                     gg(:,:,i)=yf.*g;
0286             end
0287         end
0288     end
0289 end
0290 if nargout==2 || nargout==5
0291     if nr
0292         zo.ssv=ss(end-ni*(no-1)+1:end);    % save the output tail for next time
0293         ss(end-ni*(no-1)+1:end)=[];
0294     else
0295         zo.ssv=ssv;  %
0296     end
0297     zo.si=s(length(ss)+1:end);      % save the tail end of the input speech signal
0298     zo.fs=fs;                       % save sample frequency
0299     zo.qq=qq;                       % save local parameters
0300     zo.qp=qp;                       % save v_estnoisem parameters
0301     zo.ze=ze;                       % save state of noise estimation
0302     if nargout==2
0303         gg=zo;                      % 2nd of two arguments is zo
0304     end
0305 elseif rf==0
0306     ss=ss(1:length(s));             % trim to the correct length if not an exact number of frames
0307 end
0308 if ~nargout && nr>0
0309     ffax=ff/1000;    ax=zeros(4,1);
0310     ax(1)=subplot(223);
0311     imagesc(tt,ffax,20*log10(g)');
0312     colorbar;
0313     axis('xy');
0314     if qq.al==Inf
0315         title(sprintf('Filter Gain (dB): a=%.2g, b=%.3g',qq.am,qq.b));
0316     else
0317         title(sprintf('Filter Gain (dB): a=%.2g (%.0f to %.0fdB), b=%.3g',qq.am,qq.al,qq.ah,qq.b));
0318     end
0319     xlabel('Time (s)');
0320     ylabel('Frequency (kHz)');
0321 
0322     ax(2)=subplot(222);
0323     imagesc(tt,ffax,10*log10(yp)');
0324     colorbar;
0325     axis('xy');
0326     title('Noisy Speech (dB)');
0327     xlabel('Time (s)');
0328     ylabel('Frequency (kHz)');
0329 
0330     ax(3)=subplot(224);
0331     imagesc(tt,ffax,10*log10(yp.*g.^2)');
0332     colorbar;
0333     axis('xy');
0334     title(sprintf('Enhanced Speech (dB): g=%.2g, e=%.3g',qq.g,qq.e));
0335     xlabel('Time (s)');
0336     ylabel('Frequency (kHz)');
0337 
0338     ax(4)=subplot(221);
0339     imagesc(tt,ffax,10*log10(dp)');
0340     colorbar;
0341     axis('xy');
0342     title('Noise Estimate (dB)');
0343     xlabel('Time (s)');
0344     ylabel('Frequency (kHz)');
0345     linkaxes(ax);
0346 end

v_specsub

PURPOSE

SYNOPSIS

DESCRIPTION

CROSS-REFERENCE INFORMATION

SOURCE CODE