V_SAPISYNTH  text-to-speech synthesis of a text string or matrix [X,FS,TXT]=(T,M)

  Usage:         v_sapisynth('Hello world');           % speak text
                 v_sapisynth([1 2+3i; -1i 4],'j');     % speak a matrix using 'j' for sqrt(-1)
          [x,fs]=v_sapisynth('Hello world','k11');     % save waveform at 11kHz
                 v_sapisynth('Hello world','fc');      % use a female child voice if available

  Inputs: t is either a text string or a matrix
          m is a mode string containing one or more of the
            following options (# denotes an integer):

            'l'   x=a cell array containing a list of talkers
            'l#'  specify talker # (in the range 1:nvoices)
            'r#'  speaking rate -10(slow) to +10(fast) [0]
            'k#'  target sample rate in kHz [22]
            'o'   audio output [default if no output arguments]
            'O'   unblocked audio output (may result in simultaneous overlapping sounds)
            'j'   use 'j' rather than 'i' for complex numbers
            'm','f' 'c','t','a','s' = Male, Female, Child, Teen, Adult, Senior
                      specify any combination in order of priority
            'v'   autoscale volume to a peak value of +-1
            'v#'  set volume (0 to 100) [100]
            'p#'  set pitch -10 to +10 [0]
            'n#'  number of digits precision for numeric values [3]

 Outputs: x is the output waveform unless the 'l' option is chosen in
            which case x is a cell array with one row per available
            voice containing {'Name' 'Gender' 'Age'} where
            Gender={Male,Female} and Age={Unknown,Child,Teen,Adult,Senior}
          fs is the actual sample frequency
          txt gives the actual text string sent to the synthesiser

 The input text string can contain embedded commands which are described
 in full at http://msdn.microsoft.com/en-us/library/ms717077(v=vs.85).aspx
 and summarised here:

 '... <bookmark mark="xyz"/> ...'                  insert a bookmark
 '... <context id="date_mdy"> 03/04/01 </context> ...'  specify order of dates
 '... <emph> ... </emph> ...'                      emphasise
 '... <volume level="50"> ... </volume> ...'       change volume level to 50% of full
 '... <partofsp part="noun"> XXX </partofsp> ...'  specify part of speech of XXX: unknown, noun, verb, modifier, function, interjection
 '... <pitch absmiddle="-5"> ... </pitch> ...'     change pitch to -5 [0 is default pitch]
 '... <pitch middle="5"> ... </pitch> ...'         add 5 onto the pitch
 '... <pron sym="h eh 1 l ow "/> ...'              insert phoneme string
 '... <rate absspeed="-5"> ... </rate> ...'        change rate to -5 [0 is default rate]
 '... <rate speed="5"> ... </rate> ...'            add 5 onto the rate
 '... <silence msec="500"/> ...'                   insert 500 ms of silence
 '... <spell> ... </spell> ...'                    spell out the words
 '... <voice required="Gender=Female;Age!=Child"> ...'  specify target voice attributes to be Female non-child
                                                   Age={Child, Teen, Adult, Senior}, Gender={Male, Female}

 Acknowledgement: This function was originally based on tts.m written by Siyi Deng
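Because the embedded commands are ordinary text, a marked-up string can be assembled with sprintf before it is passed to the function. The following is a minimal sketch rather than part of VOICEBOX: the variable names and the example sentence are illustrative, and the final call is guarded because v_sapisynth needs the Windows SAPI interface and must be on the MATLAB path.

```matlab
% Illustrative only: build a text string containing SAPI embedded commands
pause_ms = 500;                          % length of the inserted silence in ms
rate     = -5;                           % absolute speaking rate, -10 (slow) to +10 (fast)
txt  = sprintf('Hello <silence msec="%d"/> <rate absspeed="%d"> world </rate>', pause_ms, rate);
mode = 'fk16';                           % prefer a female voice, target sample rate 16 kHz
disp(txt);

% Only attempt synthesis on Windows with v_sapisynth available
if ispc && exist('v_sapisynth','file')==2
    [x,fs] = v_sapisynth(txt, mode);     % output arguments given, so the waveform is returned and not played
end
```

Requesting output arguments suppresses playback unless the 'o' or 'O' option is also present in the mode string.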
function [x,fs,txt] = v_sapisynth(t,m)
%V_SAPISYNTH  text-to-speech synthesis of a text string or matrix [X,FS,TXT]=(T,M)
%
%  Usage:         v_sapisynth('Hello world');           % speak text
%                 v_sapisynth([1 2+3i; -1i 4],'j');     % speak a matrix using 'j' for sqrt(-1)
%          [x,fs]=v_sapisynth('Hello world','k11');     % save waveform at 11kHz
%                 v_sapisynth('Hello world','fc');      % use a female child voice if available
%
%  Inputs: t is either a text string or a matrix
%          m is a mode string containing one or more of the
%            following options (# denotes an integer):
%
%            'l'   x=a cell array containing a list of talkers
%            'l#'  specify talker # (in the range 1:nvoices)
%            'r#'  speaking rate -10(slow) to +10(fast) [0]
%            'k#'  target sample rate in kHz [22]
%            'o'   audio output [default if no output arguments]
%            'O'   unblocked audio output (may result in simultaneous overlapping sounds)
%            'j'   use 'j' rather than 'i' for complex numbers
%            'm','f' 'c','t','a','s' = Male, Female, Child, Teen, Adult, Senior
%                      specify any combination in order of priority
%            'v'   autoscale volume to a peak value of +-1
%            'v#'  set volume (0 to 100) [100]
%            'p#'  set pitch -10 to +10 [0]
%            'n#'  number of digits precision for numeric values [3]
%
% Outputs: x is the output waveform unless the 'l' option is chosen in
%            which case x is a cell array with one row per available
%            voice containing {'Name' 'Gender' 'Age'} where
%            Gender={Male,Female} and Age={Unknown,Child,Teen,Adult,Senior}
%          fs is the actual sample frequency
%          txt gives the actual text string sent to the synthesiser
%
% The input text string can contain embedded commands which are described
% in full at http://msdn.microsoft.com/en-us/library/ms717077(v=vs.85).aspx
% and summarised here:
%
% '... <bookmark mark="xyz"/> ...'                  insert a bookmark
% '... <context id="date_mdy"> 03/04/01 </context> ...'  specify order of dates
% '... <emph> ... </emph> ...'                      emphasise
% '... <volume level="50"> ... </volume> ...'       change volume level to 50% of full
% '... <partofsp part="noun"> XXX </partofsp> ...'  specify part of speech of XXX: unknown, noun, verb, modifier, function, interjection
% '... <pitch absmiddle="-5"> ... </pitch> ...'     change pitch to -5 [0 is default pitch]
% '... <pitch middle="5"> ... </pitch> ...'         add 5 onto the pitch
% '... <pron sym="h eh 1 l ow "/> ...'              insert phoneme string
% '... <rate absspeed="-5"> ... </rate> ...'        change rate to -5 [0 is default rate]
% '... <rate speed="5"> ... </rate> ...'            add 5 onto the rate
% '... <silence msec="500"/> ...'                   insert 500 ms of silence
% '... <spell> ... </spell> ...'                    spell out the words
% '... <voice required="Gender=Female;Age!=Child"> ...'  specify target voice attributes to be Female non-child
%                                                   Age={Child, Teen, Adult, Senior}, Gender={Male, Female}
%
% Acknowledgement: This function was originally based on tts.m written by Siyi Deng

% Bugs/Suggestions:
%  (1) Allow the speaking of structures and cells
%  (2) Allow a blocking call to sound output and/or a callback procedure and/or a status call
%  (3) Have pitch and/or volume change to emphasise the first entry in a matrix row.
%  (4) extract true frequency from output stream

%      Copyright (C) Mike Brookes 2011
%      Version: $Id: v_sapisynth.m 10865 2018-09-21 17:22:45Z dmb $
%
%   VOICEBOX is a MATLAB toolbox for speech processing.
%   Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%   This program is free software; you can redistribute it and/or modify
%   it under the terms of the GNU General Public License as published by
%   the Free Software Foundation; either version 2 of the License, or
%   (at your option) any later version.
%
%   This program is distributed in the hope that it will be useful,
%   but WITHOUT ANY WARRANTY; without even the implied warranty of
%   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
%   GNU General Public License for more details.
%
%   You can obtain a copy of the GNU General Public License from
%   http://www.gnu.org/copyleft/gpl.html or by writing to
%   Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
persistent vv vvi vvj tsou lsou

% Check that we are on a PC

if ~ispc, error('only works on a PC'); end

% decode the options: rows 1:26 of opts are 'a':'z', rows 27:52 are 'A':'Z'

if nargin<2
    m='';
end
opts=zeros(52,3); % [exists+number specified, value, position in mode string]
lmode=length(m);
i=1;
while i<=lmode
    if i<lmode  % read a following integer if it exists
        [v,nv,e,ni]=sscanf(m(i+1:end),'%d',1);
    else
        nv=0;
        ni=1;
    end
    k=1+double(lower(m(i)))-'a'+26*(m(i)<'a');
    if k>=1 && k<=52
        opts(k,1)=1+nv;
        if nv
            opts(k,2)=v;
        end
        opts(k,3)=i; % save position in mode string
    end
    i=i+ni;
end

S=actxserver('SAPI.SpVoice');
V=invoke(S,'GetVoices');  % get a list of voices from the registry
nv=V.Count;
if isempty(vv) || size(vvi,1)~=nv
    vv=cell(nv,3);
    vvi=zeros(nv,6);
    ages={'Senior' 'Adult' 'Teen' 'Child'};
    for i=1:nv
        VI=V.Item(i-1);
        vv{i,1}=VI.GetDescription;
        vv{i,2}=VI.GetAttribute('Gender');
        vvi(i,1)=MatchesAttributes(VI,'Gender=Male');
        vvi(i,2)=MatchesAttributes(VI,'Gender=Female');
        vv{i,3}='Unknown';
        for j=1:length(ages)
            if MatchesAttributes(VI,['Age=' ages{j}])
                vv{i,3}=ages{j};
                vvi(i,2+j)=1;
                break
            end
        end
    end
    vvj=vvi;
    % in the matrix below, the rows and columns are in the order Senior,Adult,Teen,Child.
    % Thus the first row gives the cost of selecting a voice with the wrong age when 'Senior'
    % was requested by the user. A voice of unknown age always scores 0 so entries with negative
    % values are preferred over 'unknown' while those with positive values are not.
    % Diagonal elements of the matrix are ignored (hence set to 0) since correct matches are
    % handled earlier with higher priority.
    vvj(:,3:6)=vvj(:,3:6)*[0 0 1 2; 1 0 2 3; 1 0 0 -1; 1 0 -1 0]';   % fuzzy voice attribute matching
end

% deal with the voice selection options

optv=opts([13 6 19 1 20 3],[3 1 2]);    % rows are 'm','f','s','a','t','c'
if opts(12)    % if 'l' option specified - we need to get the voices
    if opts(12)>1
        S.Voice = V.Item(mod(opts(12,2)-1,nv));
    else
        x=vv;
        return
    end
elseif any(optv(:,2))
    optv(:,3)=(1:6)';
    optv=sortrows(optv(optv(:,2)>0,:));  % sort in order of occurrence in mode string
    no=size(optv,1);
    optp=zeros(nv,2*no+1);
    optp(:,end)=(1:nv)';  % lowest priority condition is original rank
    optp(:,1:no)=-vvi(:,optv(:,3));
    optp(:,no+1:2*no)=vvj(:,optv(:,3));
    optp=sortrows(optp);
    S.Voice = V.Item(optp(1,end)-1);
end

% deal with the 'r' option

if opts(18)>1       % 'r' option is specified with a number
    S.Rate=min(max(opts(18,2),-10),10);
end

% deal with the 'v' option

if opts(22)>1       % 'v' option is specified with a number
    S.Volume=min(max(opts(22,2),0),100);
end

% deal with the 'k' option

ff=[11025 12000 16000 22050 24000 32000 44100 48000]; % valid frequencies
if opts(11)>1       % 'k' option is specified with a number
    [v,jf]=min(abs(ff/1000-opts(11,2)));    % choose the nearest valid frequency
else
    jf=4;           % default is 22.05 kHz
end
fs=ff(jf);

% deal with the 'n' option

if opts(14)>1       % 'n' option is specified with a number
    prec=opts(14,2);
else
    prec=3;
end

M=actxserver('SAPI.SpMemoryStream');
M.Format.Type = sprintf('SAFT%dkHz16BitMono',fix(fs/1000));
S.AudioOutputStream = M;
if ischar(t)
    txt=t;
else
    txt='';
    if numel(t)
        sgns={' minus ', '', ' plus '};
        sz=size(t);
        w=permute(t,[2 1 3:numel(sz)]);
        sz(1:2)=sz(1)+sz(2)-sz(1:2);        % Permute the first two dimensions for reading
        szp=cumprod(sz);
        imch='i'+(opts(10)>0);              % character to use for sqrt(-1)
        vsep='';
        for i=1:numel(w)
            wr=real(w(i));
            wi=imag(w(i));
            % bit 0: real part non-zero, bit 1: imaginary part non-zero, bit 2: |imaginary part| is 1
            switch (wr~=0)+2*(wi~=0)+4*(abs(wi)==1)
                case {0,1}
                    txt=[txt sprintf('%s%.*g',vsep,prec,wr)];
                case 2
                    txt=[txt sprintf('%s%.*g%c,',vsep,prec,wi,imch)];
                case 3
                    txt=[txt sprintf('%s%.*g%s%.*g%c,',vsep,prec,wr,sgns{2+sign(wi)},prec,abs(wi),imch)];
                case 6
                    if wi>0
                        txt=[txt vsep imch ','];
                    else
                        txt=[txt vsep 'minus ' imch ','];
                    end
                case 7
                    txt=[txt sprintf('%s%.*g%s%c,',vsep,prec,wr,sgns{2+sign(wi)},imch)];
            end
            % could use a <silence msec="???"/> command here
            vsep=[repmat('; ',1,find([0 mod(i,szp)]==0,1,'last')-1) ' '];
        end
    end
end

% deal with the 'p' option

if opts(16)>1       % 'p' option is specified with a number
    txt=[sprintf('<pitch absmiddle="%d"> ',min(max(opts(16,2),-10),10)) txt];
end

invoke(S,'Speak',txt);
% convert little-endian 16-bit signed samples (byte pairs) to a waveform in the range [-1,1)
x = mod(32768+reshape(double(invoke(M,'GetData')),2,[])'*[1; 256],65536)/32768-1;
delete(M);      % delete output stream
delete(S);      % delete all interfaces

if opts(22)==1 && any(x)    % 'v' option with no argument (skipped if the waveform is empty or silent)
    x=x*(1/max(abs(x)));    % autoscale
end
if opts(15)>0 || opts(41)>0 || ~nargout   % 'o' or 'O' option for audio output
    while opts(41)==0 && ~isempty(tsou) && toc(tsou)<lsou
        % busy-wait until the previous sound has finished (skipped for the 'O' option)
    end
    sound(x,fs);
    tsou=tic;               % save time
    lsou=length(x)/fs;      % duration of this sound in seconds
end
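The waveform conversion in the listing above packs several steps into one expression: the byte stream is grouped into pairs, each pair is combined into an unsigned 16-bit value (low byte first), and the result is mapped from two's complement onto the range [-1,1). The sketch below isolates that step so it can be checked without a synthesiser. The helper name pcm16_to_wave and the test bytes are hypothetical and not part of VOICEBOX.

```matlab
% Hypothetical helper: convert little-endian 16-bit signed PCM bytes to a waveform in [-1,1)
pcm16_to_wave = @(b) mod(32768 + reshape(double(b),2,[])'*[1; 256], 65536)/32768 - 1;

% Four samples written as byte pairs (low byte first): 0, +32767, -32768, -1
bytes = uint8([0 0, 255 127, 0 128, 255 255]);
x = pcm16_to_wave(bytes);       % column vector with one entry per sample
disp(x');
```

Adding 32768 before the mod and subtracting 1 after the division is what turns the unsigned values 32768:65535 into the negative half of the range.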