Description of v

V_SAPISYNTH  text-to-speech synthesis of a text string or matrix [X,FS,TXT]=(T,M)

  Usage:         v_sapisynth('Hello world');          % Speak text
                 v_sapisynth([1 2+3i; -1i 4],'j');    % speak a matrix using 'j' for sqrt(-1)
          [x,fs]=v_sapisynth('Hello world','k11');    % save waveform at 11kHz
                 v_sapisynth('Hello world','fc');     % use a female child voice if available

  Inputs: t  is either a text string or a matrix
          m  is a mode string containing one or more of the
             following options (# denotes an integer):

             'l'   return x as a cell array listing the available talkers
             'l#'  specify talker # (in the range 1:nvoices)
             'r#'  speaking rate -10(slow) to +10(fast) [0]
             'k#'  target sample rate in kHz [22]
             'o'   audio output [default if no output arguments]
             'O'   unblocked audio output (may result in simultaneous overlapping sounds)
             'j'   use 'j' rather than 'i' for complex numbers
             'm','f','c','t','a','s' = Male, Female, Child, Teen, Adult, Senior
                       specify any combination in order of priority
             'v'   autoscale volume to a peak value of +-1
             'v#'  set volume (0 to 100) [100]
             'p#'  set pitch -10 to +10 [0]
             'n#'  number of digits precision for numeric values [3]
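
As a rough illustration of how such a mode string breaks down into option letters with optional integers, the sketch below splits an example string. It uses a regular expression rather than the function's own sscanf loop, and the variable names (m, chunks, letters, values) are made up for this example.

```matlab
% Sketch only: split a mode string into option letters and optional integers.
m = 'k11r-3fc';                        % 11 kHz, rate -3, prefer female then child
chunks = regexp(m, '[A-Za-z](?:-?\d+)?', 'match');   % e.g. 'k11', 'r-3', 'f', 'c'
letters = '';
values = [];
for i = 1:numel(chunks)
    c = chunks{i};
    letters(end+1) = c(1);             % the option letter
    values(end+1) = str2double(c(2:end));  % NaN when no integer follows the letter
end
disp(letters);
disp(values);
```

A letter whose value is NaN corresponds to an option given without a number, such as 'f' or 'c' above.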

 Outputs: x    is the output waveform unless the 'l' option is chosen in
               which case x is a cell array with one row per available
               voice containing {'Name' 'Gender' 'Age'} where
               Gender={Male,Female} and Age={Unknown,Child,Teen,Adult,Senior}
          fs   is the actual sample frequency
         txt   gives the actual text string sent to the synthesiser
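
The returned fs can differ from the requested 'k#' value because the function picks the nearest entry from a fixed list of supported rates. A minimal sketch of that selection, using the same frequency list as the source code but with illustrative variable names (kreq, fsActual):

```matlab
% Sketch: map a requested rate in kHz to the nearest supported sample rate.
ff = [11025 12000 16000 22050 24000 32000 44100 48000];  % supported rates (Hz)
kreq = 11;                               % value that would be given as 'k11'
[~, jf] = min(abs(ff/1000 - kreq));      % index of the closest supported rate
fsActual = ff(jf);
fprintf('requested %d kHz -> actual fs = %d Hz\n', kreq, fsActual);
```

With no 'k#' option the source selects the fourth entry of the list, 22050 Hz.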

 The input text string can contain embedded commands, which are described
 in full at http://msdn.microsoft.com/en-us/library/ms717077(v=vs.85).aspx
 and summarised here:

 '... <bookmark mark="xyz"/> ...'               insert a bookmark
 '... <context id="date_mdy"> 03/04/01 </context> ...' specify order of dates
 '... <emph> ... </emph> ...'                   emphasise
 '... <volume level="50"> ... </volume> ...'    change volume level to 50% of full
 '... <partofsp part="noun"> XXX </partofsp> ...'      specify part of speech of XXX: unknown, noun, verb, modifier, function, interjection
 '... <pitch absmiddle="-5"> ... </pitch> ...'  change pitch to -5 [0 is default pitch]
 '... <pitch middle="5"> ... </pitch> ...'      add 5 onto the pitch
 '... <pron sym="h eh 1 l ow "/> ...'           insert phoneme string
 '... <rate absspeed="-5"> ... </rate> ...'     change rate to -5 [0 is default rate]
 '... <rate speed="5"> ... </rate> ...'         add 5 onto the rate
 '... <silence msec="500"/> ...'                insert 500 ms of silence
 '... <spell> ... </spell> ...'                 spell out the words
 '... <voice required="Gender=Female;Age!=Child"> ...' specify target voice attributes to be Female non-child
                                                         Age={Child, Teen, Adult, Senior}, Gender={Male, Female}
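
As an illustration of how these tags combine, the sketch below builds a text string containing several of the commands listed above. The sentence content is invented for the example, and the call is guarded on the assumption that the function is only usable on a Windows PC with VOICEBOX on the path.

```matlab
% Build a string that mixes plain text with embedded SAPI XML commands.
txt = ['Your total is <emph>forty two</emph> pounds. ' ...
       '<silence msec="500"/> ' ...
       '<rate absspeed="-5">Please check it carefully.</rate>'];

% Speak it only where the function is usable (assumption: VOICEBOX is installed).
if ispc && exist('v_sapisynth', 'file') == 2
    v_sapisynth(txt);
else
    disp(txt);      % elsewhere, just show the string that would be sent
end
```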

 Acknowledgement: This function was originally based on tts.m written by Siyi Deng

function [x,fs,txt] = v_sapisynth(t,m)
%V_SAPISYNTH  text-to-speech synthesis of a text string or matrix [X,FS,TXT]=(T,M)
%
%  Usage:         v_sapisynth('Hello world');          % Speak text
%                 v_sapisynth([1 2+3i; -1i 4],'j');    % speak a matrix using 'j' for sqrt(-1)
%          [x,fs]=v_sapisynth('Hello world','k11');    % save waveform at 11kHz
%                 v_sapisynth('Hello world','fc');     % use a female child voice if available
%
%  Inputs: t  is either a text string or a matrix
%          m  is a mode string containing one or more of the
%             following options (# denotes an integer):
%
%             'l'   return x as a cell array listing the available talkers
%             'l#'  specify talker # (in the range 1:nvoices)
%             'r#'  speaking rate -10(slow) to +10(fast) [0]
%             'k#'  target sample rate in kHz [22]
%             'o'   audio output [default if no output arguments]
%             'O'   unblocked audio output (may result in simultaneous overlapping sounds)
%             'j'   use 'j' rather than 'i' for complex numbers
%             'm','f','c','t','a','s' = Male, Female, Child, Teen, Adult, Senior
%                       specify any combination in order of priority
%             'v'   autoscale volume to a peak value of +-1
%             'v#'  set volume (0 to 100) [100]
%             'p#'  set pitch -10 to +10 [0]
%             'n#'  number of digits precision for numeric values [3]
%
% Outputs: x    is the output waveform unless the 'l' option is chosen in
%               which case x is a cell array with one row per available
%               voice containing {'Name' 'Gender' 'Age'} where
%               Gender={Male,Female} and Age={Unknown,Child,Teen,Adult,Senior}
%          fs   is the actual sample frequency
%         txt   gives the actual text string sent to the synthesiser
%
% The input text string can contain embedded commands, which are described
% in full at http://msdn.microsoft.com/en-us/library/ms717077(v=vs.85).aspx
% and summarised here:
%
% '... <bookmark mark="xyz"/> ...'               insert a bookmark
% '... <context id="date_mdy"> 03/04/01 </context> ...' specify order of dates
% '... <emph> ... </emph> ...'                   emphasise
% '... <volume level="50"> ... </volume> ...'    change volume level to 50% of full
% '... <partofsp part="noun"> XXX </partofsp> ...'      specify part of speech of XXX: unknown, noun, verb, modifier, function, interjection
% '... <pitch absmiddle="-5"> ... </pitch> ...'  change pitch to -5 [0 is default pitch]
% '... <pitch middle="5"> ... </pitch> ...'      add 5 onto the pitch
% '... <pron sym="h eh 1 l ow "/> ...'           insert phoneme string
% '... <rate absspeed="-5"> ... </rate> ...'     change rate to -5 [0 is default rate]
% '... <rate speed="5"> ... </rate> ...'         add 5 onto the rate
% '... <silence msec="500"/> ...'                insert 500 ms of silence
% '... <spell> ... </spell> ...'                 spell out the words
% '... <voice required="Gender=Female;Age!=Child"> ...' specify target voice attributes to be Female non-child
%                                                         Age={Child, Teen, Adult, Senior}, Gender={Male, Female}
%
% Acknowledgement: This function was originally based on tts.m written by Siyi Deng

% Bugs/Suggestions:
%  (1) Allow the speaking of structures and cells
%  (2) Allow a blocking call to sound output and/or a callback procedure and/or a status call
%  (3) Have pitch and/or volume change to emphasise the first entry in a matrix row.
%  (4) extract true frequency from output stream

%      Copyright (C) Mike Brookes 2011
%      Version: $Id: v_sapisynth.m 10865 2018-09-21 17:22:45Z dmb $
%
%   VOICEBOX is a MATLAB toolbox for speech processing.
%   Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%   This program is free software; you can redistribute it and/or modify
%   it under the terms of the GNU Lesser General Public License as published by
%   the Free Software Foundation; either version 3 of the License, or
%   (at your option) any later version.
%
%   This program is distributed in the hope that it will be useful,
%   but WITHOUT ANY WARRANTY; without even the implied warranty of
%   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
%   GNU Lesser General Public License for more details.
%
%   You can obtain a copy of the GNU Lesser General Public License from
%   https://www.gnu.org/licenses/ .
%    See files gpl-3.0.txt and lgpl-3.0.txt included in this distribution.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
persistent vv vvi vvj tsou lsou

% Check that we are on a PC

if ~ispc, error('only works on a PC'); end

% decode the options

if nargin<2
    m='';
end
opts=zeros(52,3); % one row per letter a-z,A-Z: [1+number specified, value, position in mode string]
lmode=length(m);
i=1;
while i<=lmode
    if i<lmode  % read a following integer if it exists
        [v,nv,e,ni]=sscanf(m(i+1:end),'%d',1);
    else
        nv=0;
        ni=1;
    end
    k=1+double(lower(m(i)))-'a'+26*(m(i)<'a');  % a-z map to 1:26, A-Z map to 27:52
    if k>=1 && k<=52
        opts(k,1)=1+nv;
        if nv
            opts(k,2)=v;
        end
        opts(k,3)=i;  % save position in mode string
    end
    i=i+ni;
end

S=actxserver('SAPI.SpVoice');
V=invoke(S,'GetVoices');  % get a list of voices from the registry
nv=V.Count;
if isempty(vv) || size(vvi,1)~=nv
    vv=cell(nv,3);
    vvi=zeros(nv,6);
    ages={'Senior' 'Adult' 'Teen' 'Child'};
    for i=1:nv
        VI=V.Item(i-1);
        vv{i,1}=VI.GetDescription;
        vv{i,2}=VI.GetAttribute('Gender');
        vvi(i,1)=MatchesAttributes(VI,'Gender=Male');
        vvi(i,2)=MatchesAttributes(VI,'Gender=Female');
        vv{i,3}='Unknown';
        for j=1:length(ages)
            if MatchesAttributes(VI,['Age=' ages{j}])
                vv{i,3}=ages{j};
                vvi(i,2+j)=1;
                break
            end
        end
    end
    vvj=vvi;
    % in the matrix below, the rows and columns are in the order Senior,Adult,Teen,Child.
    % Thus the first row gives the cost of selecting a voice with the wrong age when 'Senior'
    % was requested by the user. A voice of unknown age always scores 0 so entries with negative
    % values are preferred over 'unknown' while those with positive values are not.
    % Diagonal elements of the matrix are ignored (hence set to 0) since correct matches are
    % handled earlier with higher priority.
    vvj(:,3:6)=vvj(:,3:6)*[0 0 1 2; 1 0 2 3; 1 0 0 -1; 1 0 -1 0]'; % fuzzy voice attribute matching
end

% deal with the voice selection options

optv=opts([13 6 19 1 20 3],[3 1 2]);  % rows: m,f,s,a,t,c; columns: position, specified, value
if opts(12)   % if 'l' option specified - we need to get the voices
    if opts(12)>1
        S.Voice = V.Item(mod(opts(12,2)-1,nv));
    else
        x=vv;
        return
    end
elseif any(optv(:,2))
    optv(:,3)=(1:6)';
    optv=sortrows(optv(optv(:,2)>0,:));  % sort in order of occurrence in mode string
    no=size(optv,1);
    optp=zeros(nv,2*no+1);
    optp(:,end)=(1:nv)'; % lowest priority condition is original rank
    optp(:,1:no)=-vvi(:,optv(:,3));
    optp(:,no+1:2*no)=vvj(:,optv(:,3));
    optp=sortrows(optp);
    S.Voice = V.Item(optp(1,end)-1);
end

% deal with the 'r' option

if opts(18)>1  % 'r' option is specified with a number
    S.Rate=min(max(opts(18,2),-10),10);
end

% deal with the 'v' option

if opts(22)>1  % 'v' option is specified with a number
    S.Volume=min(max(opts(22,2),0),100);
end

% deal with the 'k' option

ff=[11025 12000 16000 22050 24000 32000 44100 48000]; % valid frequencies
if opts(11)>1  % 'k' option is specified with a number
    [v,jf]=min(abs(ff/1000-opts(11,2)));
else
    jf=4;  % default is 22.05 kHz
end
fs=ff(jf);

% deal with the 'n' option

if opts(14)>1  % 'n' option is specified with a number
    prec=opts(14,2);
else
    prec=3;
end

M=actxserver('SAPI.SpMemoryStream');
M.Format.Type = sprintf('SAFT%dkHz16BitMono',fix(fs/1000));
S.AudioOutputStream = M;
if ischar(t)
    txt=t;
else
    txt='';
    if numel(t)
        sgns={' minus ', '', ' plus '};
        sz=size(t);
        w=permute(t,[2 1 3:numel(sz)]);
        sz(1:2)=sz(1)+sz(2)-sz(1:2); % Permute the first two dimensions for reading
        szp=cumprod(sz);
        imch=char('i'+(opts(10)>0));  % 'j' if the 'j' option was given, else 'i'
        vsep='';
        for i=1:numel(w)
            wr=real(w(i));
            wi=imag(w(i));
            % switch value: +1 if real part nonzero, +2 if imaginary part nonzero, +4 if |imag|==1
            switch((wr~=0)+2*(wi~=0))+4*(abs(wi)==1)
                case {0,1}
                    txt=[txt sprintf('%s%.*g',vsep,prec,wr)];
                case 2
                    txt=[txt sprintf('%s%.*g%c,',vsep,prec,wi,imch)];
                case 3
                    txt=[txt sprintf('%s%.*g%s%.*g%c,',vsep,prec,wr,sgns{2+sign(wi)},prec,abs(wi),imch)];
                case 6
                    if wi>0
                        txt=[txt vsep imch ','];
                    else
                        txt=[txt vsep 'minus ' imch ','];
                    end
                case 7
                    txt=[txt sprintf('%s%.*g%s%c,',vsep,prec,wr,sgns{2+sign(wi)},imch)];
            end
            % could use a <silence msec="???"/> command here
            vsep=[repmat('; ',1,find([0 mod(i,szp)]==0,1,'last')-1) ' '];
        end
    end
end

% deal with the 'p' option

if opts(16)>1  % 'p' option is specified with a number
    txt=[sprintf('<pitch absmiddle="%d"> ',min(max(opts(16,2),-10),10)) txt];
end

invoke(S,'Speak',txt);
% convert little-endian 16-bit byte pairs into samples in the range [-1,1)
x = mod(32768+reshape(double(invoke(M,'GetData')),2,[])'*[1; 256],65536)/32768-1;
delete(M);      % delete output stream
delete(S);      % delete all interfaces

if opts(22)==1 % 'v' option with no argument
    x=x*(1/max(abs(x))); % autoscale
end
if opts(15)>0 || opts(41)>0 || ~nargout % 'o' or 'O' option for audio output
    while opts(41)==0 && ~isempty(tsou) && toc(tsou)<lsou  % wait for the previous sound to finish
    end
    sound(x,fs);
    tsou=tic;   % save time
    lsou=length(x)/fs;
end
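
The line that converts the stream bytes into samples is fairly dense, so here is a small standalone sketch of the same arithmetic on made-up data. It assumes, as the source does, that the stream holds 16-bit little-endian signed mono samples; the byte values and the intermediate names (bytes, pairs, u) are invented for illustration.

```matlab
% Sketch: decode 16-bit little-endian signed PCM bytes into the range [-1,1).
bytes = [0 0, 0 64, 0 128, 255 255];     % samples 0, 16384, -32768 and -1
pairs = reshape(double(bytes), 2, [])';  % one row per sample: [low byte, high byte]
u = pairs * [1; 256];                    % unsigned 16-bit value in 0..65535
x = mod(32768 + u, 65536) / 32768 - 1;   % wrap to signed range and scale
disp(x');
```

Adding 32768 before the modulo and subtracting 1 after scaling is what maps the unsigned values 32768..65535 onto the negative half of the range.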
