Home > voicebox > ssubmmse.m

ssubmmse

PURPOSE ^

SSUBMMSE performs speech enhancement using mmse estimate of spectral amplitude or log amplitude [SS,ZO]=(S,FSZ,P)

SYNOPSIS ^

function [ss,zo]=ssubmmse(si,fsz,pp)

DESCRIPTION ^

SSUBMMSE performs speech enhancement using mmse estimate of spectral amplitude or log amplitude [SS,ZO]=(S,FSZ,P)

 Inputs:
   si      input speech signal
   fsz     sample frequency in Hz
           Alternatively, the input state from a previous call (see below)
   pp      algorithm parameters [optional]

 Outputs:
   ss      output enhanced speech (length is rounded down to the nearest frame boundary)
   zo      output state

 The algorithm operation is controlled by a small number of parameters:

        pp.of          % overlap factor = (fft length)/(frame increment) [2]
        pp.ti          % desired frame increment [0.016 seconds]
        pp.ri          % set to 1 to round ti to the nearest power of 2 samples [0]
        pp.ta          % Time const for smoothing SNR estimate [0.396 seconds]
        pp.gx          % maximum posterior SNR as a power ratio [1000]
        pp.gn          % min posterior SNR as a power ratio when estimating prior SNR [1]
        pp.xn          % minimum prior SNR [0]
        pp.xb          % bias compensation factor for prior SNR [1]
        pp.lg          % optimal estimate for log spectrum rather than spectrum [1]
        pp.bt          % threshold for binary gain or -1 for continuous gain [-1]
        pp.mx          % input mixture gain [0]

 The applied gain is mx+(1-mx)*optgain where optgain is calculated according to [1] or [2].
 If pp.bt>=0 then optgain is first thresholded with pp.bt to produce a binary gain 0 or 1.

 The default parameters implement the original algorithm in [1,2].

 Several parameters relate to the estimation of xi, the so-called "prior SNR",
 
             xi=max(a*pp.xb*xu+(1-a)*max(gami-1,pp.gn-1),pp.xn);

 This is estimated as a smoothed version of gami, the "posterior SNR"
 which is the noisy speech power divided by the noise power. This is
 clipped to (pp.gn-1), smoothed using a factor "a" which corresponds to a
 time-constasnt of pp.ta and then clipped to a minimum of pp.xn. The
 previous value is taken to be pp.xb*xu where xu is the ratio of the
 estimated speech amplitude squared to the noise power.
 
 In addition it is possible to specify parameters for the noise estimation algorithm
 which implements reference [3] from which equation numbers are given in parentheses.
 They are as follows:

        pp.taca      % (11): smoothing time constant for alpha_c [0.0449 seconds]
        pp.tamax     % (3): max smoothing time constant [0.392 seconds]
        pp.taminh    % (3): min smoothing time constant (upper limit) [0.0133 seconds]
        pp.tpfall    % (12): time constant for P to fall [0.064 seconds]
        pp.tbmax     % (20): max smoothing time constant [0.0717 seconds]
        pp.qeqmin    % (23): minimum value of Qeq [2]
        pp.qeqmax    % max value of Qeq per frame [14]
        pp.av        % (23)+13 lines: fudge factor for bc calculation  [2.12]
        pp.td        % time to take minimum over [1.536 seconds]
        pp.nu        % number of subwindows to use [3]
        pp.qith      % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ]
        pp.nsmdb     % corresponding noise slope thresholds in dB/second   [47 31.4 15.7 4.1]


 If convenient, you can call specsub in chunks of arbitrary size. Thus the following are equivalent:

                   (a) y=ssubmmse(s,fs);

                   (b) [y1,z]=ssubmmse(s(1:1000),fs);
                       [y2,z]=ssubmmse(s(1001:2000),z);
                       y3=ssubmmse(s(2001:end),z);
                       y=[y1; y2; y3];

 Note that in all cases the number of output samples will be less than the number of input samples if
 the input length is not an exact number of frames. In addition, if two output arguments
 are specified, the last partial frame samples will be retained for overlap adding
 with the output from the next call to ssubmmse().

 See also specsub() for an alternative gain function

 Refs:
    [1] Ephraim, Y. & Malah, D.
        Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator
        IEEE Trans Acoustics Speech and Signal Processing, 32(6):1109-1121, Dec 1984
    [2] Ephraim, Y. & Malah, D.
        Speech enhancement using a minimum mean-square error log-spectral amplitude estimator
        IEEE Trans Acoustics Speech and Signal Processing, 33(2):443-445, Apr 1985
    [3] Rainer Martin.
        Noise power spectral density estimation based on optimal smoothing and minimum statistics.
        IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.
    [4] O. Cappe.
        Elimination of the musical noise phenomenon with the ephraim and malah noise suppressor.
        IEEE Trans Speech Audio Processing, 2 (2): 345–349, Apr. 1994. doi: 10.1109/89.279283.
    [5] J. Erkelens, J. Jensen, and R. Heusdens.
        A data-driven approach to optimizing spectral speech enhancement methods for various error criteria.
        Speech Communication, 49: 530–541, 2007. doi: 10.1016/j.specom.2006.06.012.
    [6] R. Martin.
        Statistical methods for the enhancement of noisy speech.
        In J. Benesty, S. Makino, and J. Chen, editors,
        Speech Enhancement, chapter 3, pages 43–64. Springer-Verlag, 2005.

CROSS-REFERENCE INFORMATION ^

This function calls: This function is called by:

SOURCE CODE ^

0001 function [ss,zo]=ssubmmse(si,fsz,pp)
0002 %SSUBMMSE performs speech enhancement using mmse estimate of spectral amplitude or log amplitude [SS,ZO]=(S,FSZ,P)
0003 %
0004 % Inputs:
0005 %   si      input speech signal
0006 %   fsz     sample frequency in Hz
0007 %           Alternatively, the input state from a previous call (see below)
0008 %   pp      algorithm parameters [optional]
0009 %
0010 % Outputs:
0011 %   ss      output enhanced speech (length is rounded down to the nearest frame boundary)
0012 %   zo      output state
0013 %
0014 % The algorithm operation is controlled by a small number of parameters:
0015 %
0016 %        pp.of          % overlap factor = (fft length)/(frame increment) [2]
0017 %        pp.ti          % desired frame increment [0.016 seconds]
0018 %        pp.ri          % set to 1 to round ti to the nearest power of 2 samples [0]
0019 %        pp.ta          % Time const for smoothing SNR estimate [0.396 seconds]
0020 %        pp.gx          % maximum posterior SNR as a power ratio [1000]
0021 %        pp.gn          % min posterior SNR as a power ratio when estimating prior SNR [1]
0022 %        pp.xn          % minimum prior SNR [0]
0023 %        pp.xb          % bias compensation factor for prior SNR [1]
0024 %        pp.lg          % optimal estimate for log spectrum rather than spectrum [1]
0025 %        pp.bt          % threshold for binary gain or -1 for continuous gain [-1]
0026 %        pp.mx          % input mixture gain [0]
0027 %
0028 % The applied gain is mx+(1-mx)*optgain where optgain is calculated according to [1] or [2].
0029 % If pp.bt>=0 then optgain is first thresholded with pp.bt to produce a binary gain 0 or 1.
0030 %
0031 % The default parameters implement the original algorithm in [1,2].
0032 %
0033 % Several parameters relate to the estimation of xi, the so-called "prior SNR",
0034 %
0035 %             xi=max(a*pp.xb*xu+(1-a)*max(gami-1,pp.gn-1),pp.xn);
0036 %
0037 % This is estimated as a smoothed version of gami, the "posterior SNR"
0038 % which is the noisy speech power divided by the noise power. This is
0039 % clipped to (pp.gn-1), smoothed using a factor "a" which corresponds to a
0040 % time-constasnt of pp.ta and then clipped to a minimum of pp.xn. The
0041 % previous value is taken to be pp.xb*xu where xu is the ratio of the
0042 % estimated speech amplitude squared to the noise power.
0043 %
0044 % In addition it is possible to specify parameters for the noise estimation algorithm
0045 % which implements reference [3] from which equation numbers are given in parentheses.
0046 % They are as follows:
0047 %
0048 %        pp.taca      % (11): smoothing time constant for alpha_c [0.0449 seconds]
0049 %        pp.tamax     % (3): max smoothing time constant [0.392 seconds]
0050 %        pp.taminh    % (3): min smoothing time constant (upper limit) [0.0133 seconds]
0051 %        pp.tpfall    % (12): time constant for P to fall [0.064 seconds]
0052 %        pp.tbmax     % (20): max smoothing time constant [0.0717 seconds]
0053 %        pp.qeqmin    % (23): minimum value of Qeq [2]
0054 %        pp.qeqmax    % max value of Qeq per frame [14]
0055 %        pp.av        % (23)+13 lines: fudge factor for bc calculation  [2.12]
0056 %        pp.td        % time to take minimum over [1.536 seconds]
0057 %        pp.nu        % number of subwindows to use [3]
0058 %        pp.qith      % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ]
0059 %        pp.nsmdb     % corresponding noise slope thresholds in dB/second   [47 31.4 15.7 4.1]
0060 %
0061 %
0062 % If convenient, you can call specsub in chunks of arbitrary size. Thus the following are equivalent:
0063 %
0064 %                   (a) y=ssubmmse(s,fs);
0065 %
0066 %                   (b) [y1,z]=ssubmmse(s(1:1000),fs);
0067 %                       [y2,z]=ssubmmse(s(1001:2000),z);
0068 %                       y3=ssubmmse(s(2001:end),z);
0069 %                       y=[y1; y2; y3];
0070 %
0071 % Note that in all cases the number of output samples will be less than the number of input samples if
0072 % the input length is not an exact number of frames. In addition, if two output arguments
0073 % are specified, the last partial frame samples will be retained for overlap adding
0074 % with the output from the next call to ssubmmse().
0075 %
0076 % See also specsub() for an alternative gain function
0077 %
0078 % Refs:
0079 %    [1] Ephraim, Y. & Malah, D.
0080 %        Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator
0081 %        IEEE Trans Acoustics Speech and Signal Processing, 32(6):1109-1121, Dec 1984
0082 %    [2] Ephraim, Y. & Malah, D.
0083 %        Speech enhancement using a minimum mean-square error log-spectral amplitude estimator
0084 %        IEEE Trans Acoustics Speech and Signal Processing, 33(2):443-445, Apr 1985
0085 %    [3] Rainer Martin.
0086 %        Noise power spectral density estimation based on optimal smoothing and minimum statistics.
0087 %        IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.
0088 %    [4] O. Cappe.
0089 %        Elimination of the musical noise phenomenon with the ephraim and malah noise suppressor.
0090 %        IEEE Trans Speech Audio Processing, 2 (2): 345–349, Apr. 1994. doi: 10.1109/89.279283.
0091 %    [5] J. Erkelens, J. Jensen, and R. Heusdens.
0092 %        A data-driven approach to optimizing spectral speech enhancement methods for various error criteria.
0093 %        Speech Communication, 49: 530–541, 2007. doi: 10.1016/j.specom.2006.06.012.
0094 %    [6] R. Martin.
0095 %        Statistical methods for the enhancement of noisy speech.
0096 %        In J. Benesty, S. Makino, and J. Chen, editors,
0097 %        Speech Enhancement, chapter 3, pages 43–64. Springer-Verlag, 2005.
0098 
0099 %      Copyright (C) Mike Brookes 2004-2011
0100 %      Version: $Id: ssubmmse.m 713 2011-10-16 14:45:43Z dmb $
0101 %
0102 %   VOICEBOX is a MATLAB toolbox for speech processing.
0103 %   Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
0104 %
0105 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0106 %   This program is free software; you can redistribute it and/or modify
0107 %   it under the terms of the GNU General Public License as published by
0108 %   the Free Software Foundation; either version 2 of the License, or
0109 %   (at your option) any later version.
0110 %
0111 %   This program is distributed in the hope that it will be useful,
0112 %   but WITHOUT ANY WARRANTY; without even the implied warranty of
0113 %   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
0114 %   GNU General Public License for more details.
0115 %
0116 %   You can obtain a copy of the GNU General Public License from
0117 %   http://www.gnu.org/copyleft/gpl.html or by writing to
0118 %   Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA.
0119 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
0120 if isstruct(fsz)
0121     fs=fsz.fs;
0122     qq=fsz.qq;
0123     qp=fsz.qp;
0124     ze=fsz.ze;
0125     s=zeros(length(fsz.si)+length(si(:)),1); % allocate space for speech
0126     s(1:length(fsz.si))=fsz.si;
0127     s(length(fsz.si)+1:end)=si(:);
0128 else
0129     fs=fsz;     % sample frequency
0130     s=si(:);
0131     % default algorithm constants
0132 
0133     qq.of=2;        % overlap factor = (fft length)/(frame increment)
0134     qq.ti=16e-3;    % desired frame increment (16 ms)
0135     qq.ri=0;        % round ni to the nearest power of 2
0136     qq.ta=0.396;    % Time const for smoothing SNR estimate = -tinc/log(0.98) from [1]
0137     qq.gx=1000;     % maximum posterior SNR = 30dB
0138     qq.gn=1;        % min posterior SNR as a power ratio when estimating prior SNR [1]
0139     qq.xn=0;        % minimum prior SNR = -Inf dB
0140     qq.xb=1;        % bias compensation factor for prior SNR [1]
0141     qq.lg=1;        % use log-domain estimator by default
0142     qq.bt=-1;       % suppress binary masking
0143     qq.mx=0;        % no input mixing
0144     if nargin>=3 && ~isempty(pp)
0145         qp=pp;      % save for estnoisem call
0146         qqn=fieldnames(qq);
0147         for i=1:length(qqn)
0148             if isfield(pp,qqn{i})
0149                 qq.(qqn{i})=pp.(qqn{i});
0150             end
0151         end
0152     else
0153         qp=struct;  % make an empty structure
0154     end
0155 end
0156 % derived algorithm constants
0157 if qq.ri
0158     ni=pow2(nextpow2(ti*fs*sqrt(0.5)));
0159 else
0160     ni=round(qq.ti*fs);    % frame increment in samples
0161 end
0162 tinc=ni/fs;          % true frame increment time
0163 a=exp(-tinc/qq.ta); % SNR smoothing coefficient
0164 gx=qq.gx;           % max posterior SNR = 20 dB
0165 kk=sqrt(2*pi);   % sqrt(8)*Gamma(1.5) - required constant
0166 xn=qq.xn;           % floor for prior SNR, xi
0167 gn1=max(qq.gn-1,0);        % floor for posterior SNR when estimating prior SNR
0168 xb=qq.xb;
0169 
0170 % calculate power spectrum in frames
0171 
0172 no=round(qq.of);                                   % integer overlap factor
0173 nf=ni*no;           % fft length
0174 w=sqrt(hamming(nf+1))'; w(end)=[]; % for now always use sqrt hamming window
0175 w=w/sqrt(sum(w(1:ni:nf).^2));       % normalize to give overall gain of 1
0176 y=enframe(s,w,ni);
0177 yf=rfft(y,nf,2);
0178 yp=yf.*conj(yf);        % power spectrum of input speech
0179 [nr,nf2]=size(yp);              % number of frames
0180 if isstruct(fsz)
0181     [dp,ze]=estnoisem(yp,ze);   % estimate the noise using minimum statistics
0182     ssv=fsz.ssv;
0183     xu=fsz.xu;                  % saved unsmoothed SNR
0184 else
0185     [dp,ze]=estnoisem(yp,tinc,qp);   % estimate the noise using minimum statistics
0186     ssv=zeros(ni*(no-1),1);             % dummy saved overlap
0187     xu=1;               % dummy unsmoothed SNR from previous frame
0188 end
0189 if ~nr                                  % no data frames
0190     ss=[];
0191 else
0192     gam=min(yp./dp,gx);               % gamma = posterior SNR
0193     g=zeros(nr,nf2);   % create space for gain matrix
0194     if qq.lg            % use log domain estimator
0195         for i=1:nr
0196             gami=gam(i,:);
0197             xi=max(a*xb*xu+(1-a)*max(gami-1,gn1),xn);
0198             xir=xi./(1+xi);
0199             gi=xir.*exp(0.5*expint(xir.*gami));
0200             g(i,:)=gi;                  % save gain for later
0201             xu=gami.*gi.^2;         % unsmoothed prior SNR
0202         end
0203     else
0204         for i=1:nr
0205             gami=gam(i,:);
0206             xi=max(a*xb*xu+(1-a)*max(gami-1,gn1),xn);
0207             v=0.5*xi.*gami./(1+xi);     % note that this is 0.5*vk in [1]
0208             gi=(0.277+2*v)./gami; % accurate to 0.02 dB for v>0.5
0209             mv=v<0.5;
0210             if any(mv)
0211                 vmv=v(mv);
0212                 gi(mv)=kk*sqrt(vmv).*((0.5+vmv).*besseli(0,vmv)+vmv.*besseli(1,vmv))./(gami(mv).*exp(vmv));
0213             end
0214             g(i,:)=gi;                  % save gain for later
0215             xu=gami.*gi.^2;         % unsmoothed prior SNR
0216         end
0217     end
0218     if qq.bt>=0
0219         g=g>qq.bt;
0220     end
0221     g=qq.mx+(1-qq.mx)*g;   % mix in some of the input
0222     se=(irfft((yf.*g).',nf).').*repmat(w,nr,1);   % inverse dft and apply output window
0223     ss=zeros(ni*(nr+no-1),no);                      % space for overlapped output speech
0224     ss(1:ni*(no-1),end)=ssv;
0225     for i=1:no
0226         nm=nf*(1+floor((nr-i)/no));  % number of samples in this set
0227         ss(1+(i-1)*ni:nm+(i-1)*ni,i)=reshape(se(i:no:nr,:)',nm,1);
0228     end
0229     ss=sum(ss,2);
0230 end
0231 if nargout>1
0232     if nr
0233         zo.ssv=ss(end-ni*(no-1)+1:end);    % save the output tail for next time
0234         ss(end-ni*(no-1)+1:end)=[];
0235     else
0236         zo.ssv=ssv;  %
0237     end
0238     zo.si=s(length(ss)+1:end);      % save the tail end of the input speech signal
0239     zo.fs=fs;                       % save sample frequency
0240     zo.qq=qq;                       % save loval parameters
0241     zo.qp=qp;                       % save estnoisem parameters
0242     zo.ze=ze;                       % save state of noise estimation
0243     zo.xu=xu;
0244 end
0245 if ~nargout && nr>0
0246     ttax=(1:nr)*tinc;
0247     ffax=(0:size(g,2)-1)*fs/nf/1000;    ax=zeros(4,1);
0248     ax(1)=subplot(223);
0249     imagesc(ttax,ffax,20*log10(g)');
0250     colorbar;
0251     axis('xy');
0252     title(sprintf('Filter Gain (dB): ta=%.2g',qq.ta));
0253     xlabel('Time (s)');
0254     ylabel('Frequency (kHz)');
0255 
0256     ax(2)=subplot(222);
0257     imagesc(ttax,ffax,10*log10(yp)');
0258     colorbar;
0259     axis('xy');
0260     title('Noisy Speech (dB)');
0261     xlabel('Time (s)');
0262     ylabel('Frequency (kHz)');
0263 
0264     ax(3)=subplot(224);
0265     imagesc(ttax,ffax,10*log10(yp.*g.^2)');
0266     colorbar;
0267     axis('xy');
0268     title('Enhanced Speech (dB)');
0269     xlabel('Time (s)');
0270     ylabel('Frequency (kHz)');
0271 
0272     ax(4)=subplot(221);
0273     imagesc(ttax,ffax,10*log10(dp)');
0274     colorbar;
0275     axis('xy');
0276     title('Noise Estimate (dB)');
0277     xlabel('Time (s)');
0278     ylabel('Frequency (kHz)');
0279     linkaxes(ax);
0280 end

Generated on Thu 02-Feb-2012 09:15:04 by m2html © 2003