Home > voicebox > specsub.m

specsub

PURPOSE ^

SPECSUB performs speech enhancement using spectral subtraction [SS,ZO]=(S,FSZ,P)

SYNOPSIS ^

function [ss,zo]=specsub(si,fsz,pp)

DESCRIPTION ^

SPECSUB performs speech enhancement using spectral subtraction [SS,ZO]=(S,FSZ,P)

 Inputs:
   si      input speech signal
   fsz     sample frequency in Hz
           Alternatively, the input state from a previous call (see below)
   pp      algorithm parameters [optional]

 Outputs:
   ss      output enhanced speech (length is rounded down to the nearest frame boundary)
   zo      output state

 The algorithm operation is controlled by a small number of parameters:

        pp.of          % overlap factor = (fft length)/(frame increment) [2]
        pp.ti          % desired frame increment [0.016 seconds]
        pp.ri          % set to 1 to round ti to the nearest power of 2 samples [0]
        pp.g           % subtraction domain: 1=magnitude, 2=power [1]
        pp.e           % gain exponent [1]
        pp.am          % max oversubtraction factor [3]
        pp.b           % max noise attenutaion in power domain [0.01]
        pp.al          % SNR for oversubtraction=am (set this to Inf for fixed a) [-5 dB]
        pp.ah          % SNR for oversubtraction=1 [20 dB]
        pp.bt          % threshold for binary gain or -1 for continuous gain [-1]
        pp.mx          % input mixture gain [0]
        pp.gh          % maximum gain for noise floor [1]

 Following [1], the magnitude-domain gain in each time-frequency bin is given by
                          gain=mx+(1-mx)*max((1-(a*N/X)^(g/2))^(e/g),min(gh,(b*N/X)^(e/2)))
 where N and X are the powers of the noise and noisy speech respectively.
 The oversubtraction factor varies linearly between a=am for a frame SNR of al down to
 a=1 for a frame SNR of ah. To obtain a fixed value of a for all values of SNR, set al=Inf.
 Common exponent combinations are:
                      g=1  e=1    Magnitude Domain spectral subtraction
                      g=2  e=1    Power Domain spectral subtraction
                      g=2  e=2    Wiener filtering
 Many authors use the parameters alpha=a^(g/2), beta=b^(g/2) and gamma2=e/g instead of a, b and e
 but this increases interdependence amongst the parameters.
 If bt>=0 then the max(...) expression above is thresholded to become 0 or 1.

 In addition it is possible to specify parameters for the noise estimation algorithm
 which implements reference [2] from which equation numbers are given in parentheses.
 They are as follows:

        pp.taca      % (11): smoothing time constant for alpha_c [0.0449 seconds]
        pp.tamax     % (3): max smoothing time constant [0.392 seconds]
        pp.taminh    % (3): min smoothing time constant (upper limit) [0.0133 seconds]
        pp.tpfall    % (12): time constant for P to fall [0.064 seconds]
        pp.tbmax     % (20): max smoothing time constant [0.0717 seconds]
        pp.qeqmin    % (23): minimum value of Qeq [2]
        pp.qeqmax    % max value of Qeq per frame [14]
        pp.av        % (23)+13 lines: fudge factor for bc calculation  [2.12]
        pp.td        % time to take minimum over [1.536 seconds]
        pp.nu        % number of subwindows to use [3]
        pp.qith      % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ]
        pp.nsmdb     % corresponding noise slope thresholds in dB/second   [47 31.4 15.7 4.1]


 If convenient, you can call specsub in chunks of arbitrary size. Thus the following are equivalent:

                   (a) y=specsub(s,fs);

                   (b) [y1,z]=specsub(s(1:1000),fs);
                       [y2,z]=specsub(s(1001:2000),z);
                       y3=specsub(s(2001:end),z);
                       y=[y1; y2; y3];

 Note that the number of output samples will be less than the number of input samples if
 the input length is not an exact number of frames. In addition, if two output arguments
 are specified, the last partial frame samples will be retained for overlap adding
 with the output from the next call to specsub().

 See also ssubmmse() for an alternative gain function

 Refs:
    [1] M. Berouti, R. Schwartz and J. Makhoul
        Enhancement of speech corrupted by acoustic noise
        Proc IEEE ICASSP, 1979, 4, 208-211
    [2] Rainer Martin.
        Noise power spectral density estimation based on optimal smoothing and minimum statistics.
        IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SOURCE CODE ^

0001 function [ss,zo]=specsub(si,fsz,pp)
0002 %SPECSUB performs speech enhancement using spectral subtraction [SS,ZO]=(S,FSZ,P)
0003 %
0004 % Inputs:
0005 %   si      input speech signal
0006 %   fsz     sample frequency in Hz
0007 %           Alternatively, the input state from a previous call (see below)
0008 %   pp      algorithm parameters [optional]
0009 %
0010 % Outputs:
0011 %   ss      output enhanced speech (length is rounded down to the nearest frame boundary)
0012 %   zo      output state
0013 %
0014 % The algorithm operation is controlled by a small number of parameters:
0015 %
0016 %        pp.of          % overlap factor = (fft length)/(frame increment) [2]
0017 %        pp.ti          % desired frame increment [0.016 seconds]
0018 %        pp.ri          % set to 1 to round ti to the nearest power of 2 samples [0]
0019 %        pp.g           % subtraction domain: 1=magnitude, 2=power [1]
0020 %        pp.e           % gain exponent [1]
0021 %        pp.am          % max oversubtraction factor [3]
0022 %        pp.b           % max noise attenutaion in power domain [0.01]
0023 %        pp.al          % SNR for oversubtraction=am (set this to Inf for fixed a) [-5 dB]
0024 %        pp.ah          % SNR for oversubtraction=1 [20 dB]
0025 %        pp.bt          % threshold for binary gain or -1 for continuous gain [-1]
0026 %        pp.mx          % input mixture gain [0]
0027 %        pp.gh          % maximum gain for noise floor [1]
0028 %
0029 % Following [1], the magnitude-domain gain in each time-frequency bin is given by
0030 %                          gain=mx+(1-mx)*max((1-(a*N/X)^(g/2))^(e/g),min(gh,(b*N/X)^(e/2)))
0031 % where N and X are the powers of the noise and noisy speech respectively.
0032 % The oversubtraction factor varies linearly between a=am for a frame SNR of al down to
0033 % a=1 for a frame SNR of ah. To obtain a fixed value of a for all values of SNR, set al=Inf.
0034 % Common exponent combinations are:
0035 %                      g=1  e=1    Magnitude Domain spectral subtraction
0036 %                      g=2  e=1    Power Domain spectral subtraction
0037 %                      g=2  e=2    Wiener filtering
0038 % Many authors use the parameters alpha=a^(g/2), beta=b^(g/2) and gamma2=e/g instead of a, b and e
0039 % but this increases interdependence amongst the parameters.
0040 % If bt>=0 then the max(...) expression above is thresholded to become 0 or 1.
0041 %
0042 % In addition it is possible to specify parameters for the noise estimation algorithm
0043 % which implements reference [2] from which equation numbers are given in parentheses.
0044 % They are as follows:
0045 %
0046 %        pp.taca      % (11): smoothing time constant for alpha_c [0.0449 seconds]
0047 %        pp.tamax     % (3): max smoothing time constant [0.392 seconds]
0048 %        pp.taminh    % (3): min smoothing time constant (upper limit) [0.0133 seconds]
0049 %        pp.tpfall    % (12): time constant for P to fall [0.064 seconds]
0050 %        pp.tbmax     % (20): max smoothing time constant [0.0717 seconds]
0051 %        pp.qeqmin    % (23): minimum value of Qeq [2]
0052 %        pp.qeqmax    % max value of Qeq per frame [14]
0053 %        pp.av        % (23)+13 lines: fudge factor for bc calculation  [2.12]
0054 %        pp.td        % time to take minimum over [1.536 seconds]
0055 %        pp.nu        % number of subwindows to use [3]
0056 %        pp.qith      % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ]
0057 %        pp.nsmdb     % corresponding noise slope thresholds in dB/second   [47 31.4 15.7 4.1]
0058 %
0059 %
0060 % If convenient, you can call specsub in chunks of arbitrary size. Thus the following are equivalent:
0061 %
0062 %                   (a) y=specsub(s,fs);
0063 %
0064 %                   (b) [y1,z]=specsub(s(1:1000),fs);
0065 %                       [y2,z]=specsub(s(1001:2000),z);
0066 %                       y3=specsub(s(2001:end),z);
0067 %                       y=[y1; y2; y3];
0068 %
0069 % Note that the number of output samples will be less than the number of input samples if
0070 % the input length is not an exact number of frames. In addition, if two output arguments
0071 % are specified, the last partial frame samples will be retained for overlap adding
0072 % with the output from the next call to specsub().
0073 %
0074 % See also ssubmmse() for an alternative gain function
0075 %
0076 % Refs:
0077 %    [1] M. Berouti, R. Schwartz and J. Makhoul
0078 %        Enhancement of speech corrupted by acoustic noise
0079 %        Proc IEEE ICASSP, 1979, 4, 208-211
0080 %    [2] Rainer Martin.
0081 %        Noise power spectral density estimation based on optimal smoothing and minimum statistics.
0082 %        IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.
0083 
0084 %      Copyright (C) Mike Brookes 2004
0085 %      Version: $Id: specsub.m 713 2011-10-16 14:45:43Z dmb $
0086 %
0087 %   VOICEBOX is a MATLAB toolbox for speech processing.
0088 %   Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
0089 %
0090 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0091 %   This program is free software; you can redistribute it and/or modify
0092 %   it under the terms of the GNU General Public License as published by
0093 %   the Free Software Foundation; either version 2 of the License, or
0094 %   (at your option) any later version.
0095 %
0096 %   This program is distributed in the hope that it will be useful,
0097 %   but WITHOUT ANY WARRANTY; without even the implied warranty of
0098 %   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
0099 %   GNU General Public License for more details.
0100 %
0101 %   You can obtain a copy of the GNU General Public License from
0102 %   http://www.gnu.org/copyleft/gpl.html or by writing to
0103 %   Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA.
0104 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0105 if isstruct(fsz)
0106     fs=fsz.fs;
0107     qq=fsz.qq;
0108     qp=fsz.qp;
0109     ze=fsz.ze;
0110     s=zeros(length(fsz.si)+length(si(:)),1); % allocate space for speech
0111     s(1:length(fsz.si))=fsz.si;
0112     s(length(fsz.si)+1:end)=si(:);
0113 else
0114     fs=fsz;     % sample frequency
0115     s=si(:);
0116     % default algorithm constants
0117 
0118     qq.of=2;   % overlap factor = (fft length)/(frame increment)
0119     qq.ti=16e-3;   % desired frame increment (16 ms)
0120     qq.ri=0;       % round ni to the nearest power of 2
0121     qq.g=1;        % subtraction domain: 1=magnitude, 2=power
0122     qq.e=1;        % gain exponent
0123     qq.am=3;      % max oversubtraction factor
0124     qq.b=0.01;      % noise floor
0125     qq.al=-5;       % SNR for maximum a (set to Inf for fixed a)
0126     qq.ah=20;       % SNR for minimum a
0127     qq.bt=-1;       % suppress binary masking
0128     qq.mx=0;        % no input mixing
0129     qq.gh=1;        % maximum gain
0130     if nargin>=3 && ~isempty(pp)
0131         qp=pp;      % save for estnoisem call
0132         qqn=fieldnames(qq);
0133         for i=1:length(qqn)
0134             if isfield(pp,qqn{i})
0135                 qq.(qqn{i})=pp.(qqn{i});
0136             end
0137         end
0138     else
0139         qp=struct;  % make an empty structure
0140     end
0141 end
0142 % derived algorithm constants
0143 if qq.ri
0144     ni=pow2(nextpow2(qq.ti*fs*sqrt(0.5)));
0145 else
0146     ni=round(qq.ti*fs);    % frame increment in samples
0147 end
0148 tinc=ni/fs;          % true frame increment time
0149 
0150 % calculate power spectrum in frames
0151 
0152 no=round(qq.of);                                   % integer overlap factor
0153 nf=ni*no;           % fft length
0154 w=sqrt(hamming(nf+1))'; w(end)=[]; % for now always use sqrt hamming window
0155 w=w/sqrt(sum(w(1:ni:nf).^2));       % normalize to give overall gain of 1
0156 y=enframe(s,w,ni);
0157 yf=rfft(y,nf,2);
0158 yp=yf.*conj(yf);        % power spectrum of input speech
0159 [nr,nf2]=size(yp);              % number of frames
0160 if isstruct(fsz)
0161     [dp,ze]=estnoisem(yp,ze);   % estimate the noise using minimum statistics
0162     ssv=fsz.ssv;
0163 else
0164     [dp,ze]=estnoisem(yp,tinc,qp);   % estimate the noise using minimum statistics
0165     ssv=zeros(ni*(no-1),1);             % dummy saved overlap
0166 end
0167 if ~nr                                  % no data frames
0168     ss=[];
0169 else
0170 
0171     mz=yp==0;   %  mask for zero power time-frequency bins (unlikely)
0172     if qq.al<Inf
0173         ypf=sum(yp,2);
0174         dpf=sum(dp,2);
0175         mzf=dpf==0;     % zero noise frames = very high SNR
0176         af=1+(qq.am-1)*(min(max(10*log10(ypf./(dpf+mzf)),qq.al),qq.ah)-qq.ah)/(qq.al-qq.ah);
0177         af(mzf)=1;      % fix the zero noise frames
0178     else
0179         af=repmat(qq.am,nr,1);
0180     end
0181     switch qq.g
0182         case 1   % magnitude domain subtraction
0183             v=sqrt(dp./(yp+mz));
0184             af=sqrt(af);
0185             bf=sqrt(qq.b);
0186         case 2   % power domain subtraction
0187             v=dp./(yp+mz);
0188             bf=qq.b;
0189         otherwise % arbitrary subtraction domain
0190             v=(dp./(yp+mz)).^(0.5*qq.g);
0191             af=af.^(0.5*qq.g);
0192             bf=qq.b^(0.5*qq.g);
0193     end
0194     af =repmat(af,1,nf2);       % replicate frame oversubtraction factors for each frequency
0195     mf=v>=(af+bf).^(-1);        % mask for noise floor limiting
0196     g=zeros(size(v));           % reserve space for gain matrix
0197     eg=qq.e/qq.g;               % gain exponent relative to subtraction domain
0198     gh=qq.gh;
0199     switch eg
0200         case 1                          % Normal case
0201             g(mf)=min(bf*v(mf),gh);      % never give a gain > 1
0202             g(~mf)=1-af(~mf).*v(~mf);
0203         case 0.5                        
0204             g(mf)=min(sqrt(bf*v(mf)),gh);
0205             g(~mf)=sqrt(1-af(~mf).*v(~mf));
0206         otherwise
0207             g(mf)=min((bf*v(mf)).^eg,gh);
0208             g(~mf)=(1-af(~mf).*v(~mf)).^eg;
0209     end
0210     if qq.bt>=0
0211         g=g>qq.bt;
0212     end
0213     g=qq.mx+(1-qq.mx)*g;   % mix in some of the input
0214     se=(irfft((yf.*g).',nf).').*repmat(w,nr,1);   % inverse dft and apply output window
0215     ss=zeros(ni*(nr+no-1),no);                      % space for overlapped output speech
0216     ss(1:ni*(no-1),end)=ssv;
0217     for i=1:no
0218         nm=nf*(1+floor((nr-i)/no));  % number of samples in this set
0219         ss(1+(i-1)*ni:nm+(i-1)*ni,i)=reshape(se(i:no:nr,:)',nm,1);
0220     end
0221     ss=sum(ss,2);
0222 
0223 end
0224 if nargout>1
0225     if nr
0226         zo.ssv=ss(end-ni*(no-1)+1:end);    % save the output tail for next time
0227         ss(end-ni*(no-1)+1:end)=[];
0228     else
0229         zo.ssv=ssv;  %
0230     end
0231     zo.si=s(length(ss)+1:end);      % save the tail end of the input speech signal
0232     zo.fs=fs;                       % save sample frequency
0233     zo.qq=qq;                       % save loval parameters
0234     zo.qp=qp;                       % save estnoisem parameters
0235     zo.ze=ze;                       % save state of noise estimation
0236 end
0237 if ~nargout && nr>0
0238     ttax=(1:nr)*tinc;
0239     ffax=(0:size(g,2)-1)*fs/nf/1000;    ax=zeros(4,1);
0240     ax(1)=subplot(223);
0241     imagesc(ttax,ffax,20*log10(g)');
0242     colorbar;
0243     axis('xy');
0244     if qq.al==Inf
0245         title(sprintf('Filter Gain (dB): a=%.2g, b=%.3g',qq.am,qq.b));
0246     else
0247         title(sprintf('Filter Gain (dB): a=%.2g (%.0f to %.0fdB), b=%.3g',qq.am,qq.al,qq.ah,qq.b));
0248     end
0249     xlabel('Time (s)');
0250     ylabel('Frequency (kHz)');
0251 
0252     ax(2)=subplot(222);
0253     imagesc(ttax,ffax,10*log10(yp)');
0254     colorbar;
0255     axis('xy');
0256     title('Noisy Speech (dB)');
0257     xlabel('Time (s)');
0258     ylabel('Frequency (kHz)');
0259 
0260     ax(3)=subplot(224);
0261     imagesc(ttax,ffax,10*log10(yp.*g.^2)');
0262     colorbar;
0263     axis('xy');
0264     title(sprintf('Enhanced Speech (dB): g=%.2g, e=%.3g',qq.g,qq.e));
0265     xlabel('Time (s)');
0266     ylabel('Frequency (kHz)');
0267 
0268     ax(4)=subplot(221);
0269     imagesc(ttax,ffax,10*log10(dp)');
0270     colorbar;
0271     axis('xy');
0272     title('Noise Estimate (dB)');
0273     xlabel('Time (s)');
0274     ylabel('Frequency (kHz)');
0275     linkaxes(ax);
0276 end

Generated on Thu 02-Feb-2012 09:15:04 by m2html © 2003