


SPECSUB performs speech enhancement using spectral subtraction [SS,ZO]=(S,FSZ,P)
Inputs:
si input speech signal
fsz sample frequency in Hz
Alternatively, the input state from a previous call (see below)
pp algorithm parameters [optional]
Outputs:
ss output enhanced speech (length is rounded down to the nearest frame boundary)
zo output state
The algorithm operation is controlled by a small number of parameters:
pp.of % overlap factor = (fft length)/(frame increment) [2]
pp.ti % desired frame increment [0.016 seconds]
pp.ri % set to 1 to round ti to the nearest power of 2 samples [0]
pp.g % subtraction domain: 1=magnitude, 2=power [1]
pp.e % gain exponent [1]
pp.am % max oversubtraction factor [3]
pp.b % max noise attenutaion in power domain [0.01]
pp.al % SNR for oversubtraction=am (set this to Inf for fixed a) [-5 dB]
pp.ah % SNR for oversubtraction=1 [20 dB]
pp.bt % threshold for binary gain or -1 for continuous gain [-1]
pp.mx % input mixture gain [0]
pp.gh % maximum gain for noise floor [1]
Following [1], the magnitude-domain gain in each time-frequency bin is given by
gain=mx+(1-mx)*max((1-(a*N/X)^(g/2))^(e/g),min(gh,(b*N/X)^(e/2)))
where N and X are the powers of the noise and noisy speech respectively.
The oversubtraction factor varies linearly between a=am for a frame SNR of al down to
a=1 for a frame SNR of ah. To obtain a fixed value of a for all values of SNR, set al=Inf.
Common exponent combinations are:
g=1 e=1 Magnitude Domain spectral subtraction
g=2 e=1 Power Domain spectral subtraction
g=2 e=2 Wiener filtering
Many authors use the parameters alpha=a^(g/2), beta=b^(g/2) and gamma2=e/g instead of a, b and e
but this increases interdependence amongst the parameters.
If bt>=0 then the max(...) expression above is thresholded to become 0 or 1.
In addition it is possible to specify parameters for the noise estimation algorithm
which implements reference [2] from which equation numbers are given in parentheses.
They are as follows:
pp.taca % (11): smoothing time constant for alpha_c [0.0449 seconds]
pp.tamax % (3): max smoothing time constant [0.392 seconds]
pp.taminh % (3): min smoothing time constant (upper limit) [0.0133 seconds]
pp.tpfall % (12): time constant for P to fall [0.064 seconds]
pp.tbmax % (20): max smoothing time constant [0.0717 seconds]
pp.qeqmin % (23): minimum value of Qeq [2]
pp.qeqmax % max value of Qeq per frame [14]
pp.av % (23)+13 lines: fudge factor for bc calculation [2.12]
pp.td % time to take minimum over [1.536 seconds]
pp.nu % number of subwindows to use [3]
pp.qith % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ]
pp.nsmdb % corresponding noise slope thresholds in dB/second [47 31.4 15.7 4.1]
If convenient, you can call specsub in chunks of arbitrary size. Thus the following are equivalent:
(a) y=specsub(s,fs);
(b) [y1,z]=specsub(s(1:1000),fs);
[y2,z]=specsub(s(1001:2000),z);
y3=specsub(s(2001:end),z);
y=[y1; y2; y3];
Note that the number of output samples will be less than the number of input samples if
the input length is not an exact number of frames. In addition, if two output arguments
are specified, the last partial frame samples will be retained for overlap adding
with the output from the next call to specsub().
See also ssubmmse() for an alternative gain function
Refs:
[1] M. Berouti, R. Schwartz and J. Makhoul
Enhancement of speech corrupted by acoustic noise
Proc IEEE ICASSP, 1979, 4, 208-211
[2] Rainer Martin.
Noise power spectral density estimation based on optimal smoothing and minimum statistics.
IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.

0001 function [ss,zo]=specsub(si,fsz,pp) 0002 %SPECSUB performs speech enhancement using spectral subtraction [SS,ZO]=(S,FSZ,P) 0003 % 0004 % Inputs: 0005 % si input speech signal 0006 % fsz sample frequency in Hz 0007 % Alternatively, the input state from a previous call (see below) 0008 % pp algorithm parameters [optional] 0009 % 0010 % Outputs: 0011 % ss output enhanced speech (length is rounded down to the nearest frame boundary) 0012 % zo output state 0013 % 0014 % The algorithm operation is controlled by a small number of parameters: 0015 % 0016 % pp.of % overlap factor = (fft length)/(frame increment) [2] 0017 % pp.ti % desired frame increment [0.016 seconds] 0018 % pp.ri % set to 1 to round ti to the nearest power of 2 samples [0] 0019 % pp.g % subtraction domain: 1=magnitude, 2=power [1] 0020 % pp.e % gain exponent [1] 0021 % pp.am % max oversubtraction factor [3] 0022 % pp.b % max noise attenutaion in power domain [0.01] 0023 % pp.al % SNR for oversubtraction=am (set this to Inf for fixed a) [-5 dB] 0024 % pp.ah % SNR for oversubtraction=1 [20 dB] 0025 % pp.bt % threshold for binary gain or -1 for continuous gain [-1] 0026 % pp.mx % input mixture gain [0] 0027 % pp.gh % maximum gain for noise floor [1] 0028 % 0029 % Following [1], the magnitude-domain gain in each time-frequency bin is given by 0030 % gain=mx+(1-mx)*max((1-(a*N/X)^(g/2))^(e/g),min(gh,(b*N/X)^(e/2))) 0031 % where N and X are the powers of the noise and noisy speech respectively. 0032 % The oversubtraction factor varies linearly between a=am for a frame SNR of al down to 0033 % a=1 for a frame SNR of ah. To obtain a fixed value of a for all values of SNR, set al=Inf. 0034 % Common exponent combinations are: 0035 % g=1 e=1 Magnitude Domain spectral subtraction 0036 % g=2 e=1 Power Domain spectral subtraction 0037 % g=2 e=2 Wiener filtering 0038 % Many authors use the parameters alpha=a^(g/2), beta=b^(g/2) and gamma2=e/g instead of a, b and e 0039 % but this increases interdependence amongst the parameters. 0040 % If bt>=0 then the max(...) expression above is thresholded to become 0 or 1. 0041 % 0042 % In addition it is possible to specify parameters for the noise estimation algorithm 0043 % which implements reference [2] from which equation numbers are given in parentheses. 0044 % They are as follows: 0045 % 0046 % pp.taca % (11): smoothing time constant for alpha_c [0.0449 seconds] 0047 % pp.tamax % (3): max smoothing time constant [0.392 seconds] 0048 % pp.taminh % (3): min smoothing time constant (upper limit) [0.0133 seconds] 0049 % pp.tpfall % (12): time constant for P to fall [0.064 seconds] 0050 % pp.tbmax % (20): max smoothing time constant [0.0717 seconds] 0051 % pp.qeqmin % (23): minimum value of Qeq [2] 0052 % pp.qeqmax % max value of Qeq per frame [14] 0053 % pp.av % (23)+13 lines: fudge factor for bc calculation [2.12] 0054 % pp.td % time to take minimum over [1.536 seconds] 0055 % pp.nu % number of subwindows to use [3] 0056 % pp.qith % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ] 0057 % pp.nsmdb % corresponding noise slope thresholds in dB/second [47 31.4 15.7 4.1] 0058 % 0059 % 0060 % If convenient, you can call specsub in chunks of arbitrary size. Thus the following are equivalent: 0061 % 0062 % (a) y=specsub(s,fs); 0063 % 0064 % (b) [y1,z]=specsub(s(1:1000),fs); 0065 % [y2,z]=specsub(s(1001:2000),z); 0066 % y3=specsub(s(2001:end),z); 0067 % y=[y1; y2; y3]; 0068 % 0069 % Note that the number of output samples will be less than the number of input samples if 0070 % the input length is not an exact number of frames. In addition, if two output arguments 0071 % are specified, the last partial frame samples will be retained for overlap adding 0072 % with the output from the next call to specsub(). 0073 % 0074 % See also ssubmmse() for an alternative gain function 0075 % 0076 % Refs: 0077 % [1] M. Berouti, R. Schwartz and J. Makhoul 0078 % Enhancement of speech corrupted by acoustic noise 0079 % Proc IEEE ICASSP, 1979, 4, 208-211 0080 % [2] Rainer Martin. 0081 % Noise power spectral density estimation based on optimal smoothing and minimum statistics. 0082 % IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001. 0083 0084 % Copyright (C) Mike Brookes 2004 0085 % Version: $Id: specsub.m 713 2011-10-16 14:45:43Z dmb $ 0086 % 0087 % VOICEBOX is a MATLAB toolbox for speech processing. 0088 % Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html 0089 % 0090 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0091 % This program is free software; you can redistribute it and/or modify 0092 % it under the terms of the GNU General Public License as published by 0093 % the Free Software Foundation; either version 2 of the License, or 0094 % (at your option) any later version. 0095 % 0096 % This program is distributed in the hope that it will be useful, 0097 % but WITHOUT ANY WARRANTY; without even the implied warranty of 0098 % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 0099 % GNU General Public License for more details. 0100 % 0101 % You can obtain a copy of the GNU General Public License from 0102 % http://www.gnu.org/copyleft/gpl.html or by writing to 0103 % Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA. 0104 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0105 if isstruct(fsz) 0106 fs=fsz.fs; 0107 qq=fsz.qq; 0108 qp=fsz.qp; 0109 ze=fsz.ze; 0110 s=zeros(length(fsz.si)+length(si(:)),1); % allocate space for speech 0111 s(1:length(fsz.si))=fsz.si; 0112 s(length(fsz.si)+1:end)=si(:); 0113 else 0114 fs=fsz; % sample frequency 0115 s=si(:); 0116 % default algorithm constants 0117 0118 qq.of=2; % overlap factor = (fft length)/(frame increment) 0119 qq.ti=16e-3; % desired frame increment (16 ms) 0120 qq.ri=0; % round ni to the nearest power of 2 0121 qq.g=1; % subtraction domain: 1=magnitude, 2=power 0122 qq.e=1; % gain exponent 0123 qq.am=3; % max oversubtraction factor 0124 qq.b=0.01; % noise floor 0125 qq.al=-5; % SNR for maximum a (set to Inf for fixed a) 0126 qq.ah=20; % SNR for minimum a 0127 qq.bt=-1; % suppress binary masking 0128 qq.mx=0; % no input mixing 0129 qq.gh=1; % maximum gain 0130 if nargin>=3 && ~isempty(pp) 0131 qp=pp; % save for estnoisem call 0132 qqn=fieldnames(qq); 0133 for i=1:length(qqn) 0134 if isfield(pp,qqn{i}) 0135 qq.(qqn{i})=pp.(qqn{i}); 0136 end 0137 end 0138 else 0139 qp=struct; % make an empty structure 0140 end 0141 end 0142 % derived algorithm constants 0143 if qq.ri 0144 ni=pow2(nextpow2(qq.ti*fs*sqrt(0.5))); 0145 else 0146 ni=round(qq.ti*fs); % frame increment in samples 0147 end 0148 tinc=ni/fs; % true frame increment time 0149 0150 % calculate power spectrum in frames 0151 0152 no=round(qq.of); % integer overlap factor 0153 nf=ni*no; % fft length 0154 w=sqrt(hamming(nf+1))'; w(end)=[]; % for now always use sqrt hamming window 0155 w=w/sqrt(sum(w(1:ni:nf).^2)); % normalize to give overall gain of 1 0156 y=enframe(s,w,ni); 0157 yf=rfft(y,nf,2); 0158 yp=yf.*conj(yf); % power spectrum of input speech 0159 [nr,nf2]=size(yp); % number of frames 0160 if isstruct(fsz) 0161 [dp,ze]=estnoisem(yp,ze); % estimate the noise using minimum statistics 0162 ssv=fsz.ssv; 0163 else 0164 [dp,ze]=estnoisem(yp,tinc,qp); % estimate the noise using minimum statistics 0165 ssv=zeros(ni*(no-1),1); % dummy saved overlap 0166 end 0167 if ~nr % no data frames 0168 ss=[]; 0169 else 0170 0171 mz=yp==0; % mask for zero power time-frequency bins (unlikely) 0172 if qq.al<Inf 0173 ypf=sum(yp,2); 0174 dpf=sum(dp,2); 0175 mzf=dpf==0; % zero noise frames = very high SNR 0176 af=1+(qq.am-1)*(min(max(10*log10(ypf./(dpf+mzf)),qq.al),qq.ah)-qq.ah)/(qq.al-qq.ah); 0177 af(mzf)=1; % fix the zero noise frames 0178 else 0179 af=repmat(qq.am,nr,1); 0180 end 0181 switch qq.g 0182 case 1 % magnitude domain subtraction 0183 v=sqrt(dp./(yp+mz)); 0184 af=sqrt(af); 0185 bf=sqrt(qq.b); 0186 case 2 % power domain subtraction 0187 v=dp./(yp+mz); 0188 bf=qq.b; 0189 otherwise % arbitrary subtraction domain 0190 v=(dp./(yp+mz)).^(0.5*qq.g); 0191 af=af.^(0.5*qq.g); 0192 bf=qq.b^(0.5*qq.g); 0193 end 0194 af =repmat(af,1,nf2); % replicate frame oversubtraction factors for each frequency 0195 mf=v>=(af+bf).^(-1); % mask for noise floor limiting 0196 g=zeros(size(v)); % reserve space for gain matrix 0197 eg=qq.e/qq.g; % gain exponent relative to subtraction domain 0198 gh=qq.gh; 0199 switch eg 0200 case 1 % Normal case 0201 g(mf)=min(bf*v(mf),gh); % never give a gain > 1 0202 g(~mf)=1-af(~mf).*v(~mf); 0203 case 0.5 0204 g(mf)=min(sqrt(bf*v(mf)),gh); 0205 g(~mf)=sqrt(1-af(~mf).*v(~mf)); 0206 otherwise 0207 g(mf)=min((bf*v(mf)).^eg,gh); 0208 g(~mf)=(1-af(~mf).*v(~mf)).^eg; 0209 end 0210 if qq.bt>=0 0211 g=g>qq.bt; 0212 end 0213 g=qq.mx+(1-qq.mx)*g; % mix in some of the input 0214 se=(irfft((yf.*g).',nf).').*repmat(w,nr,1); % inverse dft and apply output window 0215 ss=zeros(ni*(nr+no-1),no); % space for overlapped output speech 0216 ss(1:ni*(no-1),end)=ssv; 0217 for i=1:no 0218 nm=nf*(1+floor((nr-i)/no)); % number of samples in this set 0219 ss(1+(i-1)*ni:nm+(i-1)*ni,i)=reshape(se(i:no:nr,:)',nm,1); 0220 end 0221 ss=sum(ss,2); 0222 0223 end 0224 if nargout>1 0225 if nr 0226 zo.ssv=ss(end-ni*(no-1)+1:end); % save the output tail for next time 0227 ss(end-ni*(no-1)+1:end)=[]; 0228 else 0229 zo.ssv=ssv; % 0230 end 0231 zo.si=s(length(ss)+1:end); % save the tail end of the input speech signal 0232 zo.fs=fs; % save sample frequency 0233 zo.qq=qq; % save loval parameters 0234 zo.qp=qp; % save estnoisem parameters 0235 zo.ze=ze; % save state of noise estimation 0236 end 0237 if ~nargout && nr>0 0238 ttax=(1:nr)*tinc; 0239 ffax=(0:size(g,2)-1)*fs/nf/1000; ax=zeros(4,1); 0240 ax(1)=subplot(223); 0241 imagesc(ttax,ffax,20*log10(g)'); 0242 colorbar; 0243 axis('xy'); 0244 if qq.al==Inf 0245 title(sprintf('Filter Gain (dB): a=%.2g, b=%.3g',qq.am,qq.b)); 0246 else 0247 title(sprintf('Filter Gain (dB): a=%.2g (%.0f to %.0fdB), b=%.3g',qq.am,qq.al,qq.ah,qq.b)); 0248 end 0249 xlabel('Time (s)'); 0250 ylabel('Frequency (kHz)'); 0251 0252 ax(2)=subplot(222); 0253 imagesc(ttax,ffax,10*log10(yp)'); 0254 colorbar; 0255 axis('xy'); 0256 title('Noisy Speech (dB)'); 0257 xlabel('Time (s)'); 0258 ylabel('Frequency (kHz)'); 0259 0260 ax(3)=subplot(224); 0261 imagesc(ttax,ffax,10*log10(yp.*g.^2)'); 0262 colorbar; 0263 axis('xy'); 0264 title(sprintf('Enhanced Speech (dB): g=%.2g, e=%.3g',qq.g,qq.e)); 0265 xlabel('Time (s)'); 0266 ylabel('Frequency (kHz)'); 0267 0268 ax(4)=subplot(221); 0269 imagesc(ttax,ffax,10*log10(dp)'); 0270 colorbar; 0271 axis('xy'); 0272 title('Noise Estimate (dB)'); 0273 xlabel('Time (s)'); 0274 ylabel('Frequency (kHz)'); 0275 linkaxes(ax); 0276 end