function [ss,gg,tt,ff,zo]=v_ssubmmse(si,fsz,pp)
%V_SSUBMMSE performs speech enhancement using an MMSE estimate of spectral amplitude or log amplitude [SS,ZO]=(S,FSZ,PP)
%
% Usage:
% (1) %%%%%%%%%%%% Simple Enhancement %%%%%%%%%%%%
%        y=v_ssubmmse(x,fs);                   % enhance the noisy speech using default parameters
%
% (2) %%%%%%%%%%%% Reading input signal in chunks %%%%%%%%%%%%
%        [y1,zo]=v_ssubmmse(x(1:1000),fs);     % enhance the first chunk of x
%        [y2,zo]=v_ssubmmse(x(1001:2000),zo);  % enhance the second chunk of x
%        y3=v_ssubmmse(x(2001:end),zo);        % enhance the remainder of x outputting all samples
%        y=[y1; y2; y3];                       % reassembling the chunks gives the same y as example (1)
%
% (3) %%%%%%%%%%%% Multiple channels using the same gain %%%%%%%%%%%%
%        y=v_ssubmmse([s+n s n],fs);           % enhance the noisy speech, s+n, and
%                                              % also apply the same gain to s and n separately
%
% (4) %%%%%%%%%%%% Multiple channels using maximum of gains %%%%%%%%%%%%
%        pp.gw=[0 0 0 0 1];                    % find the maximum of the calculated time-frequency gains
%        y=v_ssubmmse([x1 x2 x3],fs,pp);       % and apply this maximum gain to each of the signals
%
% (5) %%%%%%%%%%%% Plot gain function %%%%%%%%%%%%
%        [y,g,t,f]=v_ssubmmse(x,fs);           % save the time-frequency gain function
%        imagesc(t,f,20*log10(g'));            % display the time-frequency gain
%        axis 'xy';                            % origin at bottom left
%        xlabel('Time (s)');                   % time axis
%        ylabel('Frequency (Hz)');             % frequency axis
%        colorbar;                             % show a color bar
%        v_cblabel('Gain dB');                 % label the color bar
%        set(gca,'clim',[-30,5]);              % limit the displayed range to [-30,5] dB
%
%
% Inputs:
%   si(n,c,p)   input speech signal with n samples in each of c channels where each channel has p planes.
%               All planes will use the same set of gain functions.
%   fsz         sample frequency in Hz
%               Alternatively, the input state from a previous call (see below)
%   pp          algorithm parameters [optional]
%
% Outputs:
%   ss(n,c,p)   output enhanced speech. Same number of samples as si unless the
%               zo output is given, in which case incomplete frames will be retained
%   gg(t,f,c,p,length(pp.tf))  selected time-frequency values (see pp.tf below);
%               note that multiple planes will only be included if pp.tf includes 'i', 'I', 'o' or 'O'.
%               Note also that the tt output MUST also be specified
%   tt          centre of frames (in seconds)
%   ff          centre of frequency bins (in Hz)
%   zo          output state (or the 2nd argument if gg,tt,ff are omitted)
%
% The algorithm operation is controlled by a small number of parameters:
%
%        pp.of          % overlap factor = (fft length)/(frame increment) [2]
%        pp.ti          % desired frame increment [0.016 seconds]
%        pp.ri          % set to 1 to round ti to the nearest power of 2 samples [0]
%        pp.ta          % time const for smoothing SNR estimate [0.396 seconds]
%        pp.gx          % maximum posterior SNR as a power ratio [1000 = +30dB]
%        pp.gn          % min posterior SNR as a power ratio when estimating prior SNR [1 = 0dB]
%        pp.gz          % min posterior SNR as a power ratio [0.001 = -30dB]
%        pp.xn          % minimum prior SNR [0]
%        pp.xb          % bias compensation factor for prior SNR [1]
%        pp.lg          % MMSE target: 0=amplitude, 1=log amplitude, 2=perceptual Bayes [1]
%        pp.ne          % noise estimation: 0=min statistics, 1=MMSE [1]
%        pp.bt          % threshold for binary gain or -1 for continuous gain [-1]
%        pp.mx          % input mixture gain [0]
%        pp.gc          % maximum amplitude gain [10 = 20 dB]
%        pp.rf          % round output signal to an exact number of frames [0]
%        pp.gw          % multichannel gain weights: [chan-n chan-1 ave min max] default:[0 1 0 0 0]
%        pp.tf          % selects one or more time-frequency planes to output in the gg() variable ['g']
%                           'i' = input power spectrum
%                           'I' = input complex spectrum
%                           'n' = noise power spectrum
%                           'z' = "posterior" SNR (i.e. (S+N)/N )
%                           'x' = "prior" SNR (i.e. S/N )
%                           'G' = raw gain (before clipping and multichannel weighting)
%                           'g' = final gain
%                           'o' = output power spectrum
%                           'O' = output complex spectrum
%
% The applied gain is mx+(1-mx)*optgain where optgain is calculated according to [1] or [2].
% If pp.bt>=0 then optgain is first thresholded with pp.bt to produce a binary gain 0 or 1.
%
% The default parameters implement the original algorithm in [1,2].
%
% Several parameters relate to the estimation of xi, the so-called "prior SNR",
%
%             xi=max(a*pp.xb*xu+(1-a)*max(gami-1,pp.gn-1),pp.xn);
%
% This is estimated as a smoothed version of 1 less than gami, the "posterior SNR",
% which is the noisy speech power divided by the noise power. This is
% clipped to a min of (pp.gn-1), smoothed using a factor "a" which corresponds to a
% time-constant of pp.ta, and then clipped to a minimum of pp.xn. The
% previous value is taken to be pp.xb*xu where xu is the ratio of the
% estimated speech amplitude squared to the noise power.
%
% In addition it is possible to specify parameters for the noise estimation algorithm
% which implements reference [3] or [7] according to the setting of pp.ne
%
% Minimum statistics noise estimate [3]: pp.ne=0
%        pp.taca      % (11): smoothing time constant for alpha_c [0.0449 seconds]
%        pp.tamax     % (3): max smoothing time constant [0.392 seconds]
%        pp.taminh    % (3): min smoothing time constant (upper limit) [0.0133 seconds]
%        pp.tpfall    % (12): time constant for P to fall [0.064 seconds]
%        pp.tbmax     % (20): max smoothing time constant [0.0717 seconds]
%        pp.qeqmin    % (23): minimum value of Qeq [2]
%        pp.qeqmax    % max value of Qeq per frame [14]
%        pp.av        % (23)+13 lines: fudge factor for bc calculation [2.12]
%        pp.td        % time to take minimum over [1.536 seconds]
%        pp.nu        % number of subwindows to use [3]
%        pp.qith      % Q-inverse thresholds to select maximum noise slope [0.03 0.05 0.06 Inf]
%        pp.nsmdb     % corresponding noise slope thresholds in dB/second [47 31.4 15.7 4.1]
%
% MMSE noise estimate [7]: pp.ne=1
%        pp.tax       % smoothing time constant for noise power estimate [0.0717 seconds] (8)
%        pp.tap       % smoothing time constant for smoothed speech prob [0.152 seconds] (23)
%        pp.psthr     % threshold for smoothed speech probability [0.99] (24)
%        pp.pnsaf     % noise probability safety value [0.01] (24)
%        pp.pspri     % prior speech probability [0.5] (18)
%        pp.asnr      % active SNR in dB [15] (18)
%        pp.psini     % initial speech probability [0.5] (23)
%        pp.tavini    % assumed speech absent time at start [0.064 seconds]
%
% If convenient, you can call v_ssubmmse in chunks of arbitrary size. Thus the following are equivalent:
%
%                   (a) y=v_ssubmmse(s,fs);
%
%                   (b) [y1,z]=v_ssubmmse(s(1:1000),fs);
%                       [y2,z]=v_ssubmmse(s(1001:2000),z);
%                       y3=v_ssubmmse(s(2001:end),z);
%                       y=[y1; y2; y3];
%
% If the number of output arguments is either 2 or 5, the last partial frame of samples will
% be retained for overlap adding with the output from the next call to v_ssubmmse().
%
% See also v_specsub() for an alternative gain function
%
% Refs:
%    [1] Ephraim, Y. & Malah, D.
%        Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator
%        IEEE Trans Acoustics Speech and Signal Processing, 32(6):1109-1121, Dec 1984
%    [2] Ephraim, Y. & Malah, D.
%        Speech enhancement using a minimum mean-square error log-spectral amplitude estimator
%        IEEE Trans Acoustics Speech and Signal Processing, 33(2):443-445, Apr 1985
%    [3] Rainer Martin.
%        Noise power spectral density estimation based on optimal smoothing and minimum statistics.
%        IEEE Trans. Speech and Audio Processing, 9(5):504-512, July 2001.
%    [4] O. Cappe.
%        Elimination of the musical noise phenomenon with the Ephraim and Malah noise suppressor.
%        IEEE Trans Speech Audio Processing, 2(2):345-349, Apr 1994. doi: 10.1109/89.279283.
%    [5] J. Erkelens, J. Jensen, and R. Heusdens.
%        A data-driven approach to optimizing spectral speech enhancement methods for various error criteria.
%        Speech Communication, 49:530-541, 2007. doi: 10.1016/j.specom.2006.06.012.
%    [6] R. Martin.
%        Statistical methods for the enhancement of noisy speech.
%        In J. Benesty, S. Makino, and J. Chen, editors,
%        Speech Enhancement, chapter 3, pages 43-64. Springer-Verlag, 2005.
%    [7] Gerkmann, T. & Hendriks, R. C.
%        Unbiased MMSE-Based Noise Power Estimation With Low Complexity and Low Tracking Delay
%        IEEE Trans Audio, Speech, Language Processing, 2012, 20, 1383-1393
%    [8] Loizou, P.
%        Speech enhancement based on perceptually motivated Bayesian estimators of the speech magnitude spectrum.
%        IEEE Trans. Speech and Audio Processing, 13(5):857-869, 2005.

% Bugs/suggestions:
%     (1) sort out behaviour when si() is a matrix rather than a vector
%
%      Copyright (C) Mike Brookes 2004-2017
%      Version: $Id: v_ssubmmse.m 10865 2018-09-21 17:22:45Z dmb $
%
%   VOICEBOX is a MATLAB toolbox for speech processing.
%   Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%   This program is free software; you can redistribute it and/or modify
%   it under the terms of the GNU General Public License as published by
%   the Free Software Foundation; either version 2 of the License, or
%   (at your option) any later version.
%
%   This program is distributed in the hope that it will be useful,
%   but WITHOUT ANY WARRANTY; without even the implied warranty of
%   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
%   GNU General Public License for more details.
%
%   You can obtain a copy of the GNU General Public License from
%   http://www.gnu.org/copyleft/gpl.html or by writing to
%   Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
persistent cc kk
if ~numel(kk)                       % if precomputed constants don't exist yet
    kk=sqrt(2*pi);                  % sqrt(8)*Gamma(1.5)
    cc=sqrt(2/pi);                  % sqrt(2)/Gamma(0.5)
end
szs=size(si);
if szs(1)==1 && szs(2)>1 && (~isstruct(fsz) || ~size(fsz.si,1) || size(fsz.si,2)==1)
    si=si';                         % flip if only one channel and size(si,1)==1
    szs=size(si);
end
nc=szs(2);                          % number of channels
if length(szs)>2
    np=szs(3);                      % number of planes
else
    np=1;
end
ncp=nc*np;
if isstruct(fsz)
    fs=fsz.fs;                      % sample frequency
    qq=fsz.qq;                      % parameter structure
    qp=fsz.qp;
    ze=fsz.ze;                      % saved structure for noise estimation
    nrs=size(fsz.si,1);             % number of samples saved from last time
    s=zeros(nrs+szs(1),nc*np);      % allocate space for speech
    s(1:nrs,:)=fsz.si;              % prepend the saved samples
    s(nrs+1:end,:)=reshape(si,[],ncp);  % append the new samples
else
    fs=fsz;                         % sample frequency
    s=si;
    % default algorithm constants

    qq.of=2;                        % overlap factor = (fft length)/(frame increment)
    qq.ti=16e-3;                    % desired frame increment (16 ms)
    qq.ri=0;                        % round ni to the nearest power of 2
    qq.ta=0.396;                    % Time const for smoothing SNR estimate = -tinc/log(0.98) from [1]
    qq.gx=1000;                     % maximum posterior SNR = 30dB
    qq.gn=1;                        % min posterior SNR as a power ratio when estimating prior SNR [1]
    qq.gz=0.001;                    % min posterior SNR as a power ratio [0.001 = -30dB]
    qq.xn=0;                        % minimum prior SNR = -Inf dB
    qq.xb=1;                        % bias compensation factor for prior SNR [1]
    qq.lg=1;                        % use log-domain estimator by default
    qq.ne=1;                        % noise estimation: 0=min statistics, 1=MMSE [1]
    qq.bt=-1;                       % suppress binary masking
    qq.mx=0;                        % no input mixing
    qq.gc=10;                       % maximum amplitude gain [10 = 20 dB]
    qq.gw=[0 1 0 0 0];              % multichannel gain weights: [chan-n chan-1 ave min max]
    qq.tf='g';                      % output the gain time-frequency plane by default
    qq.rf=0;
    if nargin>=3 && ~isempty(pp)
        qp=pp;                      % save for v_estnoisem call
        qqn=fieldnames(qq);
        for i=1:length(qqn)
            if isfield(pp,qqn{i})
                qq.(qqn{i})=pp.(qqn{i});
            end
        end
        if length(qq.gw)<5
            qq.gw(5)=0;
        end
        qq.gw=qq.gw(1:5)/sum(qq.gw(1:5));   % ensure gain weights are normalized and trim to length 5
    else
        qp=struct;                  % make an empty structure for v_estnoisem
    end
end
% derived algorithm constants
if qq.ri
    ni=pow2(nextpow2(qq.ti*fs*sqrt(0.5)));
else
    ni=round(qq.ti*fs);             % frame increment in samples
end
tinc=ni/fs;                         % true frame increment time
a=exp(-tinc/qq.ta);                 % SNR smoothing coefficient
gx=qq.gx;                           % max posterior SNR as a power ratio
gz=qq.gz;                           % min posterior SNR as a power ratio
xn=qq.xn;                           % floor for prior SNR, xi
ne=qq.ne;                           % noise estimation: 0=min statistics, 1=MMSE [0]
gn1=max(qq.gn-1,0);                 % floor for posterior SNR when estimating prior SNR
xb=qq.xb;
tf=qq.tf;

% calculate power spectrum in frames

no=round(qq.of);                    % integer overlap factor
nf=ni*no;                           % fft length
w=sqrt(hamming(nf+1))'; w(end)=[];  % for now always use sqrt hamming window
w=w/sqrt(sum(w(1:ni:nf).^2));       % normalize to give overall gain of 1
rf=qq.rf || nargout==2 || nargout==5;   % rf=1 if we need to round down to an exact number of frames
if rf>0                             % set flag for call to v_enframe
    rfm='';                         % truncate input to an exact number of frames
else
    rfm='r';                        % reflect final few input samples to fill up final frame
end
[y,tt]=v_enframe(s(:,1),w,ni,rfm);  % v_enframe channel 1
tt=tt/fs;                           % frame times
yf=v_rfft(y,nf,2);                  % complex spectrum of input speech
[nr,nf2]=size(yf);                  % nr = number of frames in this chunk
nf2nc=nf2*nc;                       % number of gains to calculate per frame
nf2ncp=nf2*ncp;                     % number of STFT values per frame
yf(1,1,ncp)=yf(1,1);                % enlarge yf to cope with nc*np channels
for ic=2:ncp                        % loop for each additional channel
    yf(:,:,ic)=v_rfft(v_enframe(s(:,ic),w,ni,rfm),nf,2);    % complex spectrum of channel ic
end
yf=reshape(yf,nr,nf2ncp);           % concatenate all the channels
tfe='nzxG';                         % tf entries that require enhancement on all channels
nzxG=nargout>2 && ~isempty(tf) && any(any(tf(ones(length(tfe),1),:) == tfe(ones(length(tf),1),:)'));    % tf parameter includes 'n','z','x' or 'G'
nce= 1 + (nc-1)*(qq.gw(2)<1 || nzxG);   % channels to enhance
nf2nce=nf2*nce;                     % total number of frequency bins for enhancement
if nc > nce                         % only need the gain from channel 1
    yp=yf(:,1:nf2).*conj(yf(:,1:nf2));      % power spectrum of channel 1 of input speech
else
    yp=yf(:,1:nf2nc).*conj(yf(:,1:nf2nc));  % power spectrum of all channels but only plane 1 of input speech
end
ff=(0:nf2-1)*fs/nf;                 % frequency axis
if isstruct(fsz)                    % check if we are following a previous call
    if ne>0                         % check noise estimation method
        [dp,ze]=v_estnoiseg(yp,ze); % estimate the noise using MMSE in all frequency bins of nce channels
    else
        [dp,ze]=v_estnoisem(yp,ze); % estimate the noise using minimum statistics
    end
    ssv=fsz.ssv;                    % saved output samples from previous call for overlap adding
    xu=fsz.xu;                      % saved unsmoothed SNR
else
    if ne>0                         % check noise estimation method
        [dp,ze]=v_estnoiseg(yp,tinc,qp);    % estimate the noise using MMSE
    else
        [dp,ze]=v_estnoisem(yp,tinc,qp);    % estimate the noise using minimum statistics
    end
    ssv=zeros(ni*(no-1),ncp);       % dummy saved output samples from previous call for overlap adding
    xu=1;                           % dummy unsmoothed SNR from previous frame %%%%%%%%%%% check dimensions
end

ss=[];                              % in case of no data frames
gg=[];                              % in case of no data frames or empty tf
if nr>0
    gam=max(min(yp./dp,gx),gz);     % gamma = posterior SNR
    g=zeros(nr,nf2nce);             % create space for gain matrix
    x=zeros(nr,nf2nce);             % create space for prior SNR
    switch qq.lg
        case 0                      % use amplitude domain estimator from [1]
            for i=1:nr              % loop for each frame
                gami=gam(i,:);
                xi=max(a*xb*xu+(1-a)*max(gami-1,gn1),xn);   % prior SNR
                v=0.5*xi.*gami./(1+xi);     % note that this equals 0.5*vk in [1]
                gi=(0.277+2*v)./gami;       % accurate to 0.02 dB for v>0.5
                mv=v<0.5;
                if any(mv)
                    vmv=v(mv);
                    gi(mv)=kk*sqrt(vmv).*((0.5+vmv).*besseli(0,vmv)+vmv.*besseli(1,vmv))./(gami(mv).*exp(vmv));
                end
                g(i,:)=gi;          % save gain for later
                x(i,:)=xi;          % save prior SNR
                xu=gami.*gi.^2;     % unsmoothed prior SNR
            end
        case 2                      % perceptually motivated estimator from [8]
            for i=1:nr              % loop for each frame
                gami=gam(i,:);
                xi=max(a*xb*xu+(1-a)*max(gami-1,gn1),xn);   % prior SNR
                v=0.5*xi.*gami./(1+xi);     % note that this is 0.5*vk in [8]
                gi=cc*sqrt(v).*exp(v)./(gami.*besseli(0,v));
                g(i,:)=gi;          % save gain for later
                x(i,:)=xi;          % save prior SNR
                xu=gami.*gi.^2;     % unsmoothed prior SNR
            end
        otherwise                   % use log domain estimator from [2]
            for i=1:nr              % loop for each frame
                gami=gam(i,:);
                xi=max(a*xb*xu+(1-a)*max(gami-1,gn1),xn);   % prior SNR
                xir=xi./(1+xi);
                gi=xir.*exp(0.5*expint(xir.*gami));
                g(i,:)=gi;          % save gain for later
                x(i,:)=xi;          % save prior SNR
                xu=gami.*gi.^2;     % unsmoothed prior SNR
            end
    end
    if nc>nce                       % only calculated the gain from channel 1
        g=repmat(g,1,nc);           % replicate channel 1 gain for all channels
    else
        gr=g;                       % save raw gain in case we need to output it
        gwx=find(qq.gw(2:end));     % find which channel-independent gain terms to include
        if nc>1 && ~isempty(gwx)
            g3=reshape(g,[nr,nf2,nc]);
            gx=zeros(nr,nf2);       % space for channel-independent gain component
            for ig=gwx              % loop for each gain component with a non-zero weight
                switch ig
                    case 1          % chan 1
                        gx=g3(:,:,1)*qq.gw(2);
                    case 2          % average
                        gx=gx+mean(g3,3)*qq.gw(3);
                    case 3          % min
                        gx=gx+min(g3,[],3)*qq.gw(4);
                    case 4          % max
                        gx=gx+max(g3,[],3)*qq.gw(5);
                end
            end
            g=g*qq.gw(1)+repmat(gx,1,nc);   % add in the channel-independent gain components
        end
    end
    g=min(qq.mx+(1-qq.mx)*g,qq.gc); % mix in some of the input and clip to qq.gc
    if qq.bt>=0
        g=g>qq.bt;                  % apply binary masking threshold
    end
    se=v_irfft(reshape(yf.*repmat(g,1,np),[nr nf2 nc*np]),nf,2).*repmat(w,[nr 1 nc*np]);    % inverse dft and apply output window
    ss=zeros(ni*(nr+no-1),nc*np,no);    % space for overlapped output speech
    ss(1:ni*(no-1),:,end)=ssv;      % insert saved output speech (already overlap-added)
    for i=1:no                      % insert frames into no columns for overlap-adding
        nm=nf*(1+floor((nr-i)/no)); % number of samples in this column
        ss(1+(i-1)*ni:nm+(i-1)*ni,:,i)=reshape(permute(se(i:no:nr,:,:),[2 1 3]),nm,nc*np);  % concatenate every no'th frame
    end
    ss=reshape(sum(ss,3),[],nc,np); % perform overlap add and split into planes
    if nargout>2 && ~isempty(tf)
        if (any(lower(tf)=='i') || any(lower(tf)=='o'))     % tf entries that require all planes
            npg=np;                 % number of planes in gg array
        else
            npg=1;
        end
        gg=zeros(nr,nf2nc*npg,length(tf));  % make space
        if ncp>nce && (any(tf=='i') || any(tf=='o'))
            yp=yf.*conj(yf);        % calculate power spectrum for all channels/planes
        end
        for i=1:length(tf)
            switch tf(i)
                case 'i'            % 'i' = input power spectrum (nc*np)
                    gg(:,:,i)=yp;
                case 'I'            % 'I' = input complex spectrum (nc*np)
                    gg(:,:,i)=yf;
                case 'n'            % 'n' = noise power spectrum (nc)
                    gg(:,:,i)=repmat(dp,1,npg);
                case 'z'            % 'z' = posterior SNR (i.e. (S+N)/N ) (nc)
                    gg(:,:,i)=repmat(gam,1,npg);
                case 'x'            % 'x' = prior SNR (nc)
                    gg(:,:,i)=repmat(x,1,npg);
                case 'G'            % 'G' = raw gain (nc)
                    gg(:,:,i)=repmat(gr,1,npg);
                case 'g'            % 'g' = final gain (nc)
                    gg(:,:,i)=repmat(g,1,npg);
                case 'o'            % 'o' = output power spectrum (nc*np)
                    gg(:,:,i)=yp.*g.^2;
                case 'O'            % 'O' = output complex spectrum (nc*np)
                    gg(:,:,i)=yf.*g;
            end
        end
        gg=reshape(gg,[nr,nf2,nc,npg,length(tf)]);  % make it 5D
    end
end                                 % if ~nr
if nargout==2 || nargout==5         % we have a zo output argument
    if nr
        zo.ssv=ss(end-ni*(no-1)+1:end,:);   % save the last no-1 hops for overlap-adding next time
        ss(end-ni*(no-1)+1:end,:)=[];       % only output the frames that are completed
    else
        zo.ssv=ssv;                 % if no new frames just keep the old tail
    end
    zo.si=s(length(ss)+1:end,:);    % save the tail end of the input speech signal
    zo.fs=fs;                       % save sample frequency
    zo.qq=qq;                       % save local parameters
    zo.qp=qp;                       % save v_estnoisem parameters
    zo.ze=ze;                       % save state of noise estimation
    zo.xu=xu;
    if nargout==2
        gg=zo;                      % 2nd of two arguments is zo
    end
elseif rf==0
    ss=ss(1:size(s,1),:,:);         % trim to the correct length if not an exact number of frames
end
if ~nargout && nr>0
    ffax=ff/1000;
    ax=zeros(4,1);
    ax(1)=subplot(223);
    imagesc(tt,ffax,20*log10(g(:,1:nf2))');
    colorbar;
    axis('xy');
    title(sprintf('Filter Gain (dB): ta=%.2g',qq.ta));
    xlabel('Time (s)');
    ylabel('Frequency (kHz)');

    ax(2)=subplot(222);
    imagesc(tt,ffax,10*log10(yp(:,1:nf2))');
    colorbar;
    axis('xy');
    title('Noisy Speech (dB)');
    xlabel('Time (s)');
    ylabel('Frequency (kHz)');

    ax(3)=subplot(224);
    imagesc(tt,ffax,10*log10(yp(:,1:nf2).*g(:,1:nf2).^2)');
    colorbar;
    axis('xy');
    title('Enhanced Speech (dB)');
    xlabel('Time (s)');
    ylabel('Frequency (kHz)');

    ax(4)=subplot(221);
    imagesc(tt,ffax,10*log10(dp(:,1:nf2))');
    colorbar;
    axis('xy');
    title('Noise Estimate (dB)');
    xlabel('Time (s)');
    ylabel('Frequency (kHz)');
    linkaxes(ax);
end
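The per-frame gain computation in the function body (decision-directed prior-SNR smoothing followed by the log-spectral amplitude gain of ref [2], the `qq.lg=1` default) can be sketched outside MATLAB as follows. This is an illustrative re-derivation in NumPy/SciPy, not part of the toolbox: the function name `logmmse_gain` is hypothetical, `scipy.special.exp1` plays the role of MATLAB's `expint`, and the variable names (`gam`, `xi`, `xu`, `a`) mirror the comments in the source above.

```python
import numpy as np
from scipy.special import exp1          # exponential integral E1 (MATLAB's expint)

def logmmse_gain(gam, xu, a=0.98, xb=1.0, gn1=0.0, xn=0.0):
    """One frame of the log-spectral amplitude gain of Ephraim & Malah [2].
    gam : posterior SNR per frequency bin (noisy power / noise power)
    xu  : unsmoothed prior SNR carried over from the previous frame
    a   : smoothing factor (0.98 corresponds to the default qq.ta)
    Returns (gain, new_xu)."""
    # decision-directed prior SNR estimate, as in the source comments:
    # xi = max(a*xb*xu + (1-a)*max(gam-1, gn1), xn)
    xi = np.maximum(a*xb*xu + (1 - a)*np.maximum(gam - 1, gn1), xn)
    xir = xi/(1 + xi)
    g = xir*np.exp(0.5*exp1(xir*gam))   # log-MMSE gain, eq. (20) of [2]
    return g, gam*g**2                  # new unsmoothed prior SNR for next frame

gam = np.array([0.5, 1.0, 4.0, 100.0])  # example posterior SNRs for one frame
g, xu_new = logmmse_gain(gam, xu=1.0)   # high-SNR bins get gains near 1
```

Note the recursion: each frame's output gain feeds `gam*g**2` back as `xu`, which is what makes the smoothing "decision-directed" and suppresses the musical noise analysed in ref [4].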