


SSUBMMSE performs speech enhancement using mmse estimate of spectral amplitude or log amplitude [SS,ZO]=(S,FSZ,P)
Inputs:
si input speech signal
fsz sample frequency in Hz
Alternatively, the input state from a previous call (see below)
pp algorithm parameters [optional]
Outputs:
ss output enhanced speech (length is rounded down to the nearest frame boundary)
zo output state
The algorithm operation is controlled by a small number of parameters:
pp.of % overlap factor = (fft length)/(frame increment) [2]
pp.ti % desired frame increment [0.016 seconds]
pp.ri % set to 1 to round ti to the nearest power of 2 samples [0]
pp.ta % Time const for smoothing SNR estimate [0.396 seconds]
pp.gx % maximum posterior SNR as a power ratio [1000]
pp.gn % min posterior SNR as a power ratio when estimating prior SNR [1]
pp.xn % minimum prior SNR [0]
pp.xb % bias compensation factor for prior SNR [1]
pp.lg % optimal estimate for log spectrum rather than spectrum [1]
pp.bt % threshold for binary gain or -1 for continuous gain [-1]
pp.mx % input mixture gain [0]
The applied gain is mx+(1-mx)*optgain where optgain is calculated according to [1] or [2].
If pp.bt>=0 then optgain is first thresholded with pp.bt to produce a binary gain 0 or 1.
The default parameters implement the original algorithm in [1,2].
Several parameters relate to the estimation of xi, the so-called "prior SNR",
xi=max(a*pp.xb*xu+(1-a)*max(gami-1,pp.gn-1),pp.xn);
This is estimated as a smoothed version of gami, the "posterior SNR"
which is the noisy speech power divided by the noise power. This is
clipped to (pp.gn-1), smoothed using a factor "a" which corresponds to a
time-constasnt of pp.ta and then clipped to a minimum of pp.xn. The
previous value is taken to be pp.xb*xu where xu is the ratio of the
estimated speech amplitude squared to the noise power.
In addition it is possible to specify parameters for the noise estimation algorithm
which implements reference [3] from which equation numbers are given in parentheses.
They are as follows:
pp.taca % (11): smoothing time constant for alpha_c [0.0449 seconds]
pp.tamax % (3): max smoothing time constant [0.392 seconds]
pp.taminh % (3): min smoothing time constant (upper limit) [0.0133 seconds]
pp.tpfall % (12): time constant for P to fall [0.064 seconds]
pp.tbmax % (20): max smoothing time constant [0.0717 seconds]
pp.qeqmin % (23): minimum value of Qeq [2]
pp.qeqmax % max value of Qeq per frame [14]
pp.av % (23)+13 lines: fudge factor for bc calculation [2.12]
pp.td % time to take minimum over [1.536 seconds]
pp.nu % number of subwindows to use [3]
pp.qith % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ]
pp.nsmdb % corresponding noise slope thresholds in dB/second [47 31.4 15.7 4.1]
If convenient, you can call specsub in chunks of arbitrary size. Thus the following are equivalent:
(a) y=ssubmmse(s,fs);
(b) [y1,z]=ssubmmse(s(1:1000),fs);
[y2,z]=ssubmmse(s(1001:2000),z);
y3=ssubmmse(s(2001:end),z);
y=[y1; y2; y3];
Note that in all cases the number of output samples will be less than the number of input samples if
the input length is not an exact number of frames. In addition, if two output arguments
are specified, the last partial frame samples will be retained for overlap adding
with the output from the next call to ssubmmse().
See also specsub() for an alternative gain function
Refs:
[1] Ephraim, Y. & Malah, D.
Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator
IEEE Trans Acoustics Speech and Signal Processing, 32(6):1109-1121, Dec 1984
[2] Ephraim, Y. & Malah, D.
Speech enhancement using a minimum mean-square error log-spectral amplitude estimator
IEEE Trans Acoustics Speech and Signal Processing, 33(2):443-445, Apr 1985
[3] Rainer Martin.
Noise power spectral density estimation based on optimal smoothing and minimum statistics.
IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.
[4] O. Cappe.
Elimination of the musical noise phenomenon with the ephraim and malah noise suppressor.
IEEE Trans Speech Audio Processing, 2 (2): 345–349, Apr. 1994. doi: 10.1109/89.279283.
[5] J. Erkelens, J. Jensen, and R. Heusdens.
A data-driven approach to optimizing spectral speech enhancement methods for various error criteria.
Speech Communication, 49: 530–541, 2007. doi: 10.1016/j.specom.2006.06.012.
[6] R. Martin.
Statistical methods for the enhancement of noisy speech.
In J. Benesty, S. Makino, and J. Chen, editors,
Speech Enhancement, chapter 3, pages 43–64. Springer-Verlag, 2005.

0001 function [ss,zo]=ssubmmse(si,fsz,pp) 0002 %SSUBMMSE performs speech enhancement using mmse estimate of spectral amplitude or log amplitude [SS,ZO]=(S,FSZ,P) 0003 % 0004 % Inputs: 0005 % si input speech signal 0006 % fsz sample frequency in Hz 0007 % Alternatively, the input state from a previous call (see below) 0008 % pp algorithm parameters [optional] 0009 % 0010 % Outputs: 0011 % ss output enhanced speech (length is rounded down to the nearest frame boundary) 0012 % zo output state 0013 % 0014 % The algorithm operation is controlled by a small number of parameters: 0015 % 0016 % pp.of % overlap factor = (fft length)/(frame increment) [2] 0017 % pp.ti % desired frame increment [0.016 seconds] 0018 % pp.ri % set to 1 to round ti to the nearest power of 2 samples [0] 0019 % pp.ta % Time const for smoothing SNR estimate [0.396 seconds] 0020 % pp.gx % maximum posterior SNR as a power ratio [1000] 0021 % pp.gn % min posterior SNR as a power ratio when estimating prior SNR [1] 0022 % pp.xn % minimum prior SNR [0] 0023 % pp.xb % bias compensation factor for prior SNR [1] 0024 % pp.lg % optimal estimate for log spectrum rather than spectrum [1] 0025 % pp.bt % threshold for binary gain or -1 for continuous gain [-1] 0026 % pp.mx % input mixture gain [0] 0027 % 0028 % The applied gain is mx+(1-mx)*optgain where optgain is calculated according to [1] or [2]. 0029 % If pp.bt>=0 then optgain is first thresholded with pp.bt to produce a binary gain 0 or 1. 0030 % 0031 % The default parameters implement the original algorithm in [1,2]. 0032 % 0033 % Several parameters relate to the estimation of xi, the so-called "prior SNR", 0034 % 0035 % xi=max(a*pp.xb*xu+(1-a)*max(gami-1,pp.gn-1),pp.xn); 0036 % 0037 % This is estimated as a smoothed version of gami, the "posterior SNR" 0038 % which is the noisy speech power divided by the noise power. This is 0039 % clipped to (pp.gn-1), smoothed using a factor "a" which corresponds to a 0040 % time-constasnt of pp.ta and then clipped to a minimum of pp.xn. The 0041 % previous value is taken to be pp.xb*xu where xu is the ratio of the 0042 % estimated speech amplitude squared to the noise power. 0043 % 0044 % In addition it is possible to specify parameters for the noise estimation algorithm 0045 % which implements reference [3] from which equation numbers are given in parentheses. 0046 % They are as follows: 0047 % 0048 % pp.taca % (11): smoothing time constant for alpha_c [0.0449 seconds] 0049 % pp.tamax % (3): max smoothing time constant [0.392 seconds] 0050 % pp.taminh % (3): min smoothing time constant (upper limit) [0.0133 seconds] 0051 % pp.tpfall % (12): time constant for P to fall [0.064 seconds] 0052 % pp.tbmax % (20): max smoothing time constant [0.0717 seconds] 0053 % pp.qeqmin % (23): minimum value of Qeq [2] 0054 % pp.qeqmax % max value of Qeq per frame [14] 0055 % pp.av % (23)+13 lines: fudge factor for bc calculation [2.12] 0056 % pp.td % time to take minimum over [1.536 seconds] 0057 % pp.nu % number of subwindows to use [3] 0058 % pp.qith % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ] 0059 % pp.nsmdb % corresponding noise slope thresholds in dB/second [47 31.4 15.7 4.1] 0060 % 0061 % 0062 % If convenient, you can call specsub in chunks of arbitrary size. Thus the following are equivalent: 0063 % 0064 % (a) y=ssubmmse(s,fs); 0065 % 0066 % (b) [y1,z]=ssubmmse(s(1:1000),fs); 0067 % [y2,z]=ssubmmse(s(1001:2000),z); 0068 % y3=ssubmmse(s(2001:end),z); 0069 % y=[y1; y2; y3]; 0070 % 0071 % Note that in all cases the number of output samples will be less than the number of input samples if 0072 % the input length is not an exact number of frames. In addition, if two output arguments 0073 % are specified, the last partial frame samples will be retained for overlap adding 0074 % with the output from the next call to ssubmmse(). 0075 % 0076 % See also specsub() for an alternative gain function 0077 % 0078 % Refs: 0079 % [1] Ephraim, Y. & Malah, D. 0080 % Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator 0081 % IEEE Trans Acoustics Speech and Signal Processing, 32(6):1109-1121, Dec 1984 0082 % [2] Ephraim, Y. & Malah, D. 0083 % Speech enhancement using a minimum mean-square error log-spectral amplitude estimator 0084 % IEEE Trans Acoustics Speech and Signal Processing, 33(2):443-445, Apr 1985 0085 % [3] Rainer Martin. 0086 % Noise power spectral density estimation based on optimal smoothing and minimum statistics. 0087 % IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001. 0088 % [4] O. Cappe. 0089 % Elimination of the musical noise phenomenon with the ephraim and malah noise suppressor. 0090 % IEEE Trans Speech Audio Processing, 2 (2): 345–349, Apr. 1994. doi: 10.1109/89.279283. 0091 % [5] J. Erkelens, J. Jensen, and R. Heusdens. 0092 % A data-driven approach to optimizing spectral speech enhancement methods for various error criteria. 0093 % Speech Communication, 49: 530–541, 2007. doi: 10.1016/j.specom.2006.06.012. 0094 % [6] R. Martin. 0095 % Statistical methods for the enhancement of noisy speech. 0096 % In J. Benesty, S. Makino, and J. Chen, editors, 0097 % Speech Enhancement, chapter 3, pages 43–64. Springer-Verlag, 2005. 0098 0099 % Copyright (C) Mike Brookes 2004-2011 0100 % Version: $Id: ssubmmse.m 713 2011-10-16 14:45:43Z dmb $ 0101 % 0102 % VOICEBOX is a MATLAB toolbox for speech processing. 0103 % Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html 0104 % 0105 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0106 % This program is free software; you can redistribute it and/or modify 0107 % it under the terms of the GNU General Public License as published by 0108 % the Free Software Foundation; either version 2 of the License, or 0109 % (at your option) any later version. 0110 % 0111 % This program is distributed in the hope that it will be useful, 0112 % but WITHOUT ANY WARRANTY; without even the implied warranty of 0113 % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 0114 % GNU General Public License for more details. 0115 % 0116 % You can obtain a copy of the GNU General Public License from 0117 % http://www.gnu.org/copyleft/gpl.html or by writing to 0118 % Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA. 0119 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0120 if isstruct(fsz) 0121 fs=fsz.fs; 0122 qq=fsz.qq; 0123 qp=fsz.qp; 0124 ze=fsz.ze; 0125 s=zeros(length(fsz.si)+length(si(:)),1); % allocate space for speech 0126 s(1:length(fsz.si))=fsz.si; 0127 s(length(fsz.si)+1:end)=si(:); 0128 else 0129 fs=fsz; % sample frequency 0130 s=si(:); 0131 % default algorithm constants 0132 0133 qq.of=2; % overlap factor = (fft length)/(frame increment) 0134 qq.ti=16e-3; % desired frame increment (16 ms) 0135 qq.ri=0; % round ni to the nearest power of 2 0136 qq.ta=0.396; % Time const for smoothing SNR estimate = -tinc/log(0.98) from [1] 0137 qq.gx=1000; % maximum posterior SNR = 30dB 0138 qq.gn=1; % min posterior SNR as a power ratio when estimating prior SNR [1] 0139 qq.xn=0; % minimum prior SNR = -Inf dB 0140 qq.xb=1; % bias compensation factor for prior SNR [1] 0141 qq.lg=1; % use log-domain estimator by default 0142 qq.bt=-1; % suppress binary masking 0143 qq.mx=0; % no input mixing 0144 if nargin>=3 && ~isempty(pp) 0145 qp=pp; % save for estnoisem call 0146 qqn=fieldnames(qq); 0147 for i=1:length(qqn) 0148 if isfield(pp,qqn{i}) 0149 qq.(qqn{i})=pp.(qqn{i}); 0150 end 0151 end 0152 else 0153 qp=struct; % make an empty structure 0154 end 0155 end 0156 % derived algorithm constants 0157 if qq.ri 0158 ni=pow2(nextpow2(ti*fs*sqrt(0.5))); 0159 else 0160 ni=round(qq.ti*fs); % frame increment in samples 0161 end 0162 tinc=ni/fs; % true frame increment time 0163 a=exp(-tinc/qq.ta); % SNR smoothing coefficient 0164 gx=qq.gx; % max posterior SNR = 20 dB 0165 kk=sqrt(2*pi); % sqrt(8)*Gamma(1.5) - required constant 0166 xn=qq.xn; % floor for prior SNR, xi 0167 gn1=max(qq.gn-1,0); % floor for posterior SNR when estimating prior SNR 0168 xb=qq.xb; 0169 0170 % calculate power spectrum in frames 0171 0172 no=round(qq.of); % integer overlap factor 0173 nf=ni*no; % fft length 0174 w=sqrt(hamming(nf+1))'; w(end)=[]; % for now always use sqrt hamming window 0175 w=w/sqrt(sum(w(1:ni:nf).^2)); % normalize to give overall gain of 1 0176 y=enframe(s,w,ni); 0177 yf=rfft(y,nf,2); 0178 yp=yf.*conj(yf); % power spectrum of input speech 0179 [nr,nf2]=size(yp); % number of frames 0180 if isstruct(fsz) 0181 [dp,ze]=estnoisem(yp,ze); % estimate the noise using minimum statistics 0182 ssv=fsz.ssv; 0183 xu=fsz.xu; % saved unsmoothed SNR 0184 else 0185 [dp,ze]=estnoisem(yp,tinc,qp); % estimate the noise using minimum statistics 0186 ssv=zeros(ni*(no-1),1); % dummy saved overlap 0187 xu=1; % dummy unsmoothed SNR from previous frame 0188 end 0189 if ~nr % no data frames 0190 ss=[]; 0191 else 0192 gam=min(yp./dp,gx); % gamma = posterior SNR 0193 g=zeros(nr,nf2); % create space for gain matrix 0194 if qq.lg % use log domain estimator 0195 for i=1:nr 0196 gami=gam(i,:); 0197 xi=max(a*xb*xu+(1-a)*max(gami-1,gn1),xn); 0198 xir=xi./(1+xi); 0199 gi=xir.*exp(0.5*expint(xir.*gami)); 0200 g(i,:)=gi; % save gain for later 0201 xu=gami.*gi.^2; % unsmoothed prior SNR 0202 end 0203 else 0204 for i=1:nr 0205 gami=gam(i,:); 0206 xi=max(a*xb*xu+(1-a)*max(gami-1,gn1),xn); 0207 v=0.5*xi.*gami./(1+xi); % note that this is 0.5*vk in [1] 0208 gi=(0.277+2*v)./gami; % accurate to 0.02 dB for v>0.5 0209 mv=v<0.5; 0210 if any(mv) 0211 vmv=v(mv); 0212 gi(mv)=kk*sqrt(vmv).*((0.5+vmv).*besseli(0,vmv)+vmv.*besseli(1,vmv))./(gami(mv).*exp(vmv)); 0213 end 0214 g(i,:)=gi; % save gain for later 0215 xu=gami.*gi.^2; % unsmoothed prior SNR 0216 end 0217 end 0218 if qq.bt>=0 0219 g=g>qq.bt; 0220 end 0221 g=qq.mx+(1-qq.mx)*g; % mix in some of the input 0222 se=(irfft((yf.*g).',nf).').*repmat(w,nr,1); % inverse dft and apply output window 0223 ss=zeros(ni*(nr+no-1),no); % space for overlapped output speech 0224 ss(1:ni*(no-1),end)=ssv; 0225 for i=1:no 0226 nm=nf*(1+floor((nr-i)/no)); % number of samples in this set 0227 ss(1+(i-1)*ni:nm+(i-1)*ni,i)=reshape(se(i:no:nr,:)',nm,1); 0228 end 0229 ss=sum(ss,2); 0230 end 0231 if nargout>1 0232 if nr 0233 zo.ssv=ss(end-ni*(no-1)+1:end); % save the output tail for next time 0234 ss(end-ni*(no-1)+1:end)=[]; 0235 else 0236 zo.ssv=ssv; % 0237 end 0238 zo.si=s(length(ss)+1:end); % save the tail end of the input speech signal 0239 zo.fs=fs; % save sample frequency 0240 zo.qq=qq; % save loval parameters 0241 zo.qp=qp; % save estnoisem parameters 0242 zo.ze=ze; % save state of noise estimation 0243 zo.xu=xu; 0244 end 0245 if ~nargout && nr>0 0246 ttax=(1:nr)*tinc; 0247 ffax=(0:size(g,2)-1)*fs/nf/1000; ax=zeros(4,1); 0248 ax(1)=subplot(223); 0249 imagesc(ttax,ffax,20*log10(g)'); 0250 colorbar; 0251 axis('xy'); 0252 title(sprintf('Filter Gain (dB): ta=%.2g',qq.ta)); 0253 xlabel('Time (s)'); 0254 ylabel('Frequency (kHz)'); 0255 0256 ax(2)=subplot(222); 0257 imagesc(ttax,ffax,10*log10(yp)'); 0258 colorbar; 0259 axis('xy'); 0260 title('Noisy Speech (dB)'); 0261 xlabel('Time (s)'); 0262 ylabel('Frequency (kHz)'); 0263 0264 ax(3)=subplot(224); 0265 imagesc(ttax,ffax,10*log10(yp.*g.^2)'); 0266 colorbar; 0267 axis('xy'); 0268 title('Enhanced Speech (dB)'); 0269 xlabel('Time (s)'); 0270 ylabel('Frequency (kHz)'); 0271 0272 ax(4)=subplot(221); 0273 imagesc(ttax,ffax,10*log10(dp)'); 0274 colorbar; 0275 axis('xy'); 0276 title('Noise Estimate (dB)'); 0277 xlabel('Time (s)'); 0278 ylabel('Frequency (kHz)'); 0279 linkaxes(ax); 0280 end