V_SPECSUB performs speech enhancement using spectral subtraction [SS,ZO]=(S,FSZ,P) Usage: (1) y=v_specsub(x,fs); % enhance the speech using default parameters Inputs: si input speech signal fsz sample frequency in Hz Alternatively, the input state from a previous call (see below) pp algorithm parameters [optional] Outputs: ss output enhanced speech gg(t,f,i) selected time-frequency values (see pp.tf below) tt centre of frames (in seconds) ff centre of frequency bins (in Hz) zo output state (or the 2nd argument if gg,tt,ff are omitted) The algorithm operation is controlled by a small number of parameters: pp.of % overlap factor = (fft length)/(frame increment) [2] pp.ti % desired frame increment [0.016 seconds] pp.ri % set to 1 to round ti to the nearest power of 2 samples [0] pp.g % subtraction domain: 1=magnitude, 2=power [1] pp.e % gain exponent [1] pp.am % max oversubtraction factor [3] pp.b % max noise attenutaion in power domain [0.01] pp.al % SNR for oversubtraction=am (set this to Inf for fixed a) [-5 dB] pp.ah % SNR for oversubtraction=1 [20 dB] pp.ne % noise estimation: 0=min statistics, 1=MMSE [0] pp.bt % threshold for binary gain or -1 for continuous gain [-1] pp.mx % input mixture gain [0] pp.gh % maximum gain for noise floor [1] pp.rf % round output signal to an exact number of frames [0] pp.tf % selects time-frequency planes to output in the gg() variable ['g'] 'i' = input power spectrum 'I' = input complex spectrum 'n' = noise power spectrum 'g' = gain 'o' = output power spectrum 'O' = output complex spectrum Following [1], the magnitude-domain gain in each time-frequency bin is given by gain=mx+(1-mx)*max((1-(a*N/X)^(g/2))^(e/g),min(gh,(b*N/X)^(e/2))) where N and X are the powers of the noise and noisy speech respectively. The oversubtraction factor varies linearly between a=am for a frame SNR of al down to a=1 for a frame SNR of ah. To obtain a fixed value of a for all values of SNR, set al=Inf. Common exponent combinations are: g=1 e=1 Magnitude Domain spectral subtraction g=2 e=1 Power Domain spectral subtraction g=2 e=2 Wiener filtering Many authors use the parameters alpha=a^(g/2), beta=b^(g/2) and gamma2=e/g instead of a, b and e but this increases interdependence amongst the parameters. If bt>=0 then the max(...) expression above is thresholded to become 0 or 1. In addition it is possible to specify parameters for the noise estimation algorithm which implements reference [2] or [3] according to the setting of pp.ne Minimum statistics noise estimate [2]: pp.ne=0 pp.taca % (11): smoothing time constant for alpha_c [0.0449 seconds] pp.tamax % (3): max smoothing time constant [0.392 seconds] pp.taminh % (3): min smoothing time constant (upper limit) [0.0133 seconds] pp.tpfall % (12): time constant for P to fall [0.064 seconds] pp.tbmax % (20): max smoothing time constant [0.0717 seconds] pp.qeqmin % (23): minimum value of Qeq [2] pp.qeqmax % max value of Qeq per frame [14] pp.av % (23)+13 lines: fudge factor for bc calculation [2.12] pp.td % time to take minimum over [1.536 seconds] pp.nu % number of subwindows to use [3] pp.qith % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ] pp.nsmdb % corresponding noise slope thresholds in dB/second [47 31.4 15.7 4.1] MMSE noise estimate [3]: pp.ne=1 pp.tax % smoothing time constant for noise power estimate [0.0717 seconds](8) pp.tap % smoothing time constant for smoothed speech prob [0.152 seconds](23) pp.psthr % threshold for smoothed speech probability [0.99] (24) pp.pnsaf % noise probability safety value [0.01] (24) pp.pspri % prior speech probability [0.5] (18) pp.asnr % active SNR in dB [15] (18) pp.psini % initial speech probability [0.5] (23) pp.tavini % assumed speech absent time at start [0.064 seconds] If convenient, you can call v_specsub in chunks of arbitrary size. Thus the following are equivalent: (a) y=v_specsub(s,fs); (b) [y1,z]=v_specsub(s(1:1000),fs); [y2,z]=v_specsub(s(1001:2000),z); y3=v_specsub(s(2001:end),z); y=[y1; y2; y3]; If the number of output arguments is either 2 or 5, the last partial frame of samples will be retained for overlap adding with the output from the next call to v_specsub(). See also v_ssubmmse() for an alternative gain function Refs: [1] M. Berouti, R. Schwartz and J. Makhoul Enhancement of speech corrupted by acoustic noise Proc IEEE ICASSP, 1979, 4, 208-211 [2] Rainer Martin. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001. [3] Gerkmann, T. & Hendriks, R. C. Unbiased MMSE-Based Noise Power Estimation With Low Complexity and Low Tracking Delay IEEE Trans Audio, Speech, Language Processing, 2012, 20, 1383-1393
0001 function [ss,gg,tt,ff,zo]=v_specsub(si,fsz,pp) 0002 %V_SPECSUB performs speech enhancement using spectral subtraction [SS,ZO]=(S,FSZ,P) 0003 % 0004 % Usage: (1) y=v_specsub(x,fs); % enhance the speech using default parameters 0005 % 0006 % Inputs: 0007 % si input speech signal 0008 % fsz sample frequency in Hz 0009 % Alternatively, the input state from a previous call (see below) 0010 % pp algorithm parameters [optional] 0011 % 0012 % Outputs: 0013 % ss output enhanced speech 0014 % gg(t,f,i) selected time-frequency values (see pp.tf below) 0015 % tt centre of frames (in seconds) 0016 % ff centre of frequency bins (in Hz) 0017 % zo output state (or the 2nd argument if gg,tt,ff are omitted) 0018 % 0019 % The algorithm operation is controlled by a small number of parameters: 0020 % 0021 % pp.of % overlap factor = (fft length)/(frame increment) [2] 0022 % pp.ti % desired frame increment [0.016 seconds] 0023 % pp.ri % set to 1 to round ti to the nearest power of 2 samples [0] 0024 % pp.g % subtraction domain: 1=magnitude, 2=power [1] 0025 % pp.e % gain exponent [1] 0026 % pp.am % max oversubtraction factor [3] 0027 % pp.b % max noise attenutaion in power domain [0.01] 0028 % pp.al % SNR for oversubtraction=am (set this to Inf for fixed a) [-5 dB] 0029 % pp.ah % SNR for oversubtraction=1 [20 dB] 0030 % pp.ne % noise estimation: 0=min statistics, 1=MMSE [0] 0031 % pp.bt % threshold for binary gain or -1 for continuous gain [-1] 0032 % pp.mx % input mixture gain [0] 0033 % pp.gh % maximum gain for noise floor [1] 0034 % pp.rf % round output signal to an exact number of frames [0] 0035 % pp.tf % selects time-frequency planes to output in the gg() variable ['g'] 0036 % 'i' = input power spectrum 0037 % 'I' = input complex spectrum 0038 % 'n' = noise power spectrum 0039 % 'g' = gain 0040 % 'o' = output power spectrum 0041 % 'O' = output complex spectrum 0042 % 0043 % Following [1], the magnitude-domain gain in each time-frequency bin is given by 0044 % gain=mx+(1-mx)*max((1-(a*N/X)^(g/2))^(e/g),min(gh,(b*N/X)^(e/2))) 0045 % where N and X are the powers of the noise and noisy speech respectively. 0046 % The oversubtraction factor varies linearly between a=am for a frame SNR of al down to 0047 % a=1 for a frame SNR of ah. To obtain a fixed value of a for all values of SNR, set al=Inf. 0048 % Common exponent combinations are: 0049 % g=1 e=1 Magnitude Domain spectral subtraction 0050 % g=2 e=1 Power Domain spectral subtraction 0051 % g=2 e=2 Wiener filtering 0052 % Many authors use the parameters alpha=a^(g/2), beta=b^(g/2) and gamma2=e/g instead of a, b and e 0053 % but this increases interdependence amongst the parameters. 0054 % If bt>=0 then the max(...) expression above is thresholded to become 0 or 1. 0055 % 0056 % In addition it is possible to specify parameters for the noise estimation algorithm 0057 % which implements reference [2] or [3] according to the setting of pp.ne 0058 % 0059 % Minimum statistics noise estimate [2]: pp.ne=0 0060 % pp.taca % (11): smoothing time constant for alpha_c [0.0449 seconds] 0061 % pp.tamax % (3): max smoothing time constant [0.392 seconds] 0062 % pp.taminh % (3): min smoothing time constant (upper limit) [0.0133 seconds] 0063 % pp.tpfall % (12): time constant for P to fall [0.064 seconds] 0064 % pp.tbmax % (20): max smoothing time constant [0.0717 seconds] 0065 % pp.qeqmin % (23): minimum value of Qeq [2] 0066 % pp.qeqmax % max value of Qeq per frame [14] 0067 % pp.av % (23)+13 lines: fudge factor for bc calculation [2.12] 0068 % pp.td % time to take minimum over [1.536 seconds] 0069 % pp.nu % number of subwindows to use [3] 0070 % pp.qith % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf ] 0071 % pp.nsmdb % corresponding noise slope thresholds in dB/second [47 31.4 15.7 4.1] 0072 % 0073 % MMSE noise estimate [3]: pp.ne=1 0074 % pp.tax % smoothing time constant for noise power estimate [0.0717 seconds](8) 0075 % pp.tap % smoothing time constant for smoothed speech prob [0.152 seconds](23) 0076 % pp.psthr % threshold for smoothed speech probability [0.99] (24) 0077 % pp.pnsaf % noise probability safety value [0.01] (24) 0078 % pp.pspri % prior speech probability [0.5] (18) 0079 % pp.asnr % active SNR in dB [15] (18) 0080 % pp.psini % initial speech probability [0.5] (23) 0081 % pp.tavini % assumed speech absent time at start [0.064 seconds] 0082 % 0083 % If convenient, you can call v_specsub in chunks of arbitrary size. Thus the following are equivalent: 0084 % 0085 % (a) y=v_specsub(s,fs); 0086 % 0087 % (b) [y1,z]=v_specsub(s(1:1000),fs); 0088 % [y2,z]=v_specsub(s(1001:2000),z); 0089 % y3=v_specsub(s(2001:end),z); 0090 % y=[y1; y2; y3]; 0091 % 0092 % If the number of output arguments is either 2 or 5, the last partial frame of samples will 0093 % be retained for overlap adding with the output from the next call to v_specsub(). 0094 % 0095 % See also v_ssubmmse() for an alternative gain function 0096 % 0097 % Refs: 0098 % [1] M. Berouti, R. Schwartz and J. Makhoul 0099 % Enhancement of speech corrupted by acoustic noise 0100 % Proc IEEE ICASSP, 1979, 4, 208-211 0101 % [2] Rainer Martin. 0102 % Noise power spectral density estimation based on optimal smoothing and minimum statistics. 0103 % IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001. 0104 % [3] Gerkmann, T. & Hendriks, R. C. 0105 % Unbiased MMSE-Based Noise Power Estimation With Low Complexity and Low Tracking Delay 0106 % IEEE Trans Audio, Speech, Language Processing, 2012, 20, 1383-1393 0107 0108 % Copyright (C) Mike Brookes 2004 0109 % Version: $Id: v_specsub.m 10865 2018-09-21 17:22:45Z dmb $ 0110 % 0111 % VOICEBOX is a MATLAB toolbox for speech processing. 0112 % Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html 0113 % 0114 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0115 % This program is free software; you can redistribute it and/or modify 0116 % it under the terms of the GNU General Public License as published by 0117 % the Free Software Foundation; either version 2 of the License, or 0118 % (at your option) any later version. 0119 % 0120 % This program is distributed in the hope that it will be useful, 0121 % but WITHOUT ANY WARRANTY; without even the implied warranty of 0122 % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 0123 % GNU General Public License for more details. 0124 % 0125 % You can obtain a copy of the GNU General Public License from 0126 % http://www.gnu.org/copyleft/gpl.html or by writing to 0127 % Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA. 0128 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0129 if numel(si)>length(si) 0130 error('Input speech signal must be a vector not a matrix'); 0131 end 0132 if isstruct(fsz) 0133 fs=fsz.fs; 0134 qq=fsz.qq; 0135 qp=fsz.qp; 0136 ze=fsz.ze; 0137 s=zeros(length(fsz.si)+length(si(:)),1); % allocate space for speech 0138 s(1:length(fsz.si))=fsz.si; 0139 s(length(fsz.si)+1:end)=si(:); 0140 else 0141 fs=fsz; % sample frequency 0142 s=si(:); 0143 % default algorithm constants 0144 0145 qq.of=2; % overlap factor = (fft length)/(frame increment) 0146 qq.ti=16e-3; % desired frame increment (16 ms) 0147 qq.ri=0; % round ni to the nearest power of 2 0148 qq.g=1; % subtraction domain: 1=magnitude, 2=power 0149 qq.e=1; % gain exponent 0150 qq.am=3; % max oversubtraction factor 0151 qq.b=0.01; % noise floor 0152 qq.al=-5; % SNR for maximum a (set to Inf for fixed a) 0153 qq.ah=20; % SNR for minimum a 0154 qq.bt=-1; % suppress binary masking 0155 qq.ne=0; % noise estimation: 0=min statistics, 1=MMSE [0] 0156 qq.mx=0; % no input mixing 0157 qq.gh=1; % maximum gain 0158 qq.tf='g'; % output the gain time-frequency plane by default 0159 qq.rf=0; 0160 if nargin>=3 && ~isempty(pp) 0161 qp=pp; % save for v_estnoisem call 0162 qqn=fieldnames(qq); 0163 for i=1:length(qqn) 0164 if isfield(pp,qqn{i}) 0165 qq.(qqn{i})=pp.(qqn{i}); 0166 end 0167 end 0168 else 0169 qp=struct; % make an empty structure 0170 end 0171 end 0172 % derived algorithm constants 0173 if qq.ri 0174 ni=pow2(nextpow2(qq.ti*fs*sqrt(0.5))); 0175 else 0176 ni=round(qq.ti*fs); % frame increment in samples 0177 end 0178 tinc=ni/fs; % true frame increment time 0179 tf=qq.tf; 0180 rf=qq.rf || nargout==2 || nargout==5; % round down to an exact number of frames 0181 ne=qq.ne; % noise estimation: 0=min statistics, 1=MMSE [0] 0182 0183 % calculate power spectrum in frames 0184 0185 no=round(qq.of); % integer overlap factor 0186 nf=ni*no; % fft length 0187 w=sqrt(hamming(nf+1))'; w(end)=[]; % for now always use sqrt hamming window 0188 w=w/sqrt(sum(w(1:ni:nf).^2)); % normalize to give overall gain of 1 0189 if rf>0 0190 rfm=''; % truncated input to an exact number of frames 0191 else 0192 rfm='r'; 0193 end 0194 [y,tt]=v_enframe(s,w,ni,rfm); 0195 tt=tt/fs; % frame times 0196 yf=v_rfft(y,nf,2); 0197 yp=yf.*conj(yf); % power spectrum of input speech 0198 [nr,nf2]=size(yp); % number of frames 0199 ff=(0:nf2-1)*fs/nf; 0200 if isstruct(fsz) 0201 if ne>0 0202 [dp,ze]=v_estnoiseg(yp,ze); % estimate the noise using MMSE 0203 else 0204 [dp,ze]=v_estnoisem(yp,ze); % estimate the noise using minimum statistics 0205 end 0206 ssv=fsz.ssv; 0207 else 0208 if ne>0 0209 [dp,ze]=v_estnoiseg(yp,tinc,qp); % estimate the noise using MMSE 0210 else 0211 [dp,ze]=v_estnoisem(yp,tinc,qp); % estimate the noise using minimum statistics 0212 end 0213 ssv=zeros(ni*(no-1),1); % dummy saved overlap 0214 end 0215 if ~nr % no data frames 0216 ss=[]; 0217 gg=[]; 0218 else 0219 mz=yp==0; % mask for zero power time-frequency bins (unlikely) 0220 if qq.al<Inf 0221 ypf=sum(yp,2); 0222 dpf=sum(dp,2); 0223 mzf=dpf==0; % zero noise frames = very high SNR 0224 af=1+(qq.am-1)*(min(max(10*log10(ypf./(dpf+mzf)),qq.al),qq.ah)-qq.ah)/(qq.al-qq.ah); 0225 af(mzf)=1; % fix the zero noise frames 0226 else 0227 af=repmat(qq.am,nr,1); 0228 end 0229 switch qq.g 0230 case 1 % magnitude domain subtraction 0231 v=sqrt(dp./(yp+mz)); 0232 af=sqrt(af); 0233 bf=sqrt(qq.b); 0234 case 2 % power domain subtraction 0235 v=dp./(yp+mz); 0236 bf=qq.b; 0237 otherwise % arbitrary subtraction domain 0238 v=(dp./(yp+mz)).^(0.5*qq.g); 0239 af=af.^(0.5*qq.g); 0240 bf=qq.b^(0.5*qq.g); 0241 end 0242 af =repmat(af,1,nf2); % replicate frame oversubtraction factors for each frequency 0243 mf=v>=(af+bf).^(-1); % mask for noise floor limiting 0244 g=zeros(size(v)); % reserve space for gain matrix 0245 eg=qq.e/qq.g; % gain exponent relative to subtraction domain 0246 gh=qq.gh; 0247 switch eg 0248 case 1 % Normal case 0249 g(mf)=min(bf*v(mf),gh); % never give a gain > 1 0250 g(~mf)=1-af(~mf).*v(~mf); 0251 case 0.5 0252 g(mf)=min(sqrt(bf*v(mf)),gh); 0253 g(~mf)=sqrt(1-af(~mf).*v(~mf)); 0254 otherwise 0255 g(mf)=min((bf*v(mf)).^eg,gh); 0256 g(~mf)=(1-af(~mf).*v(~mf)).^eg; 0257 end 0258 if qq.bt>=0 0259 g=g>qq.bt; 0260 end 0261 g=qq.mx+(1-qq.mx)*g; % mix in some of the input 0262 se=(v_irfft((yf.*g).',nf).').*repmat(w,nr,1); % inverse dft and apply output window 0263 ss=zeros(ni*(nr+no-1),no); % space for overlapped output speech 0264 ss(1:ni*(no-1),end)=ssv; 0265 for i=1:no 0266 nm=nf*(1+floor((nr-i)/no)); % number of samples in this set 0267 ss(1+(i-1)*ni:nm+(i-1)*ni,i)=reshape(se(i:no:nr,:)',nm,1); 0268 end 0269 ss=sum(ss,2); 0270 if nargout>2 && ~isempty(tf) 0271 gg=zeros(nr,nf2,length(tf)); % make space 0272 for i=1:length(tf) 0273 switch tf(i) 0274 case 'i' % 'i' = input power spectrum 0275 gg(:,:,i)=yp; 0276 case 'I' % 'i' = input power spectrum 0277 gg(:,:,i)=yf; 0278 case 'n' % 'n' = noise power spectrum 0279 gg(:,:,i)=dp; 0280 case 'g' % 'g' = gain 0281 gg(:,:,i)=g; 0282 case 'o' % 'o' = output power spectrum 0283 gg(:,:,i)=yp.*g.^2; 0284 case 'O' % 'o' = output power spectrum 0285 gg(:,:,i)=yf.*g; 0286 end 0287 end 0288 end 0289 end 0290 if nargout==2 || nargout==5 0291 if nr 0292 zo.ssv=ss(end-ni*(no-1)+1:end); % save the output tail for next time 0293 ss(end-ni*(no-1)+1:end)=[]; 0294 else 0295 zo.ssv=ssv; % 0296 end 0297 zo.si=s(length(ss)+1:end); % save the tail end of the input speech signal 0298 zo.fs=fs; % save sample frequency 0299 zo.qq=qq; % save local parameters 0300 zo.qp=qp; % save v_estnoisem parameters 0301 zo.ze=ze; % save state of noise estimation 0302 if nargout==2 0303 gg=zo; % 2nd of two arguments is zo 0304 end 0305 elseif rf==0 0306 ss=ss(1:length(s)); % trim to the correct length if not an exact number of frames 0307 end 0308 if ~nargout && nr>0 0309 ffax=ff/1000; ax=zeros(4,1); 0310 ax(1)=subplot(223); 0311 imagesc(tt,ffax,20*log10(g)'); 0312 colorbar; 0313 axis('xy'); 0314 if qq.al==Inf 0315 title(sprintf('Filter Gain (dB): a=%.2g, b=%.3g',qq.am,qq.b)); 0316 else 0317 title(sprintf('Filter Gain (dB): a=%.2g (%.0f to %.0fdB), b=%.3g',qq.am,qq.al,qq.ah,qq.b)); 0318 end 0319 xlabel('Time (s)'); 0320 ylabel('Frequency (kHz)'); 0321 0322 ax(2)=subplot(222); 0323 imagesc(tt,ffax,10*log10(yp)'); 0324 colorbar; 0325 axis('xy'); 0326 title('Noisy Speech (dB)'); 0327 xlabel('Time (s)'); 0328 ylabel('Frequency (kHz)'); 0329 0330 ax(3)=subplot(224); 0331 imagesc(tt,ffax,10*log10(yp.*g.^2)'); 0332 colorbar; 0333 axis('xy'); 0334 title(sprintf('Enhanced Speech (dB): g=%.2g, e=%.3g',qq.g,qq.e)); 0335 xlabel('Time (s)'); 0336 ylabel('Frequency (kHz)'); 0337 0338 ax(4)=subplot(221); 0339 imagesc(tt,ffax,10*log10(dp)'); 0340 colorbar; 0341 axis('xy'); 0342 title('Noise Estimate (dB)'); 0343 xlabel('Time (s)'); 0344 ylabel('Frequency (kHz)'); 0345 linkaxes(ax); 0346 end