Text to Speech (TTS) converts written text into spoken audio.
TTS tools offer several valuable use cases for businesses, enhancing productivity and user experience:
Audiobook Production
TTS technology can automate the conversion of written content into audiobooks, saving time and resources while catering to a broader audience's preferences for audio content.
Accessibility Compliance
Businesses can ensure their digital content is accessible to individuals with visual impairments by using TTS to convert text into spoken words, making websites and documents compliant with accessibility regulations.
Interactive Voice Response (IVR) Systems
TTS is vital for creating natural-sounding voice prompts in IVR systems, enhancing customer service with automated but human-like interactions such as call routing and information retrieval.
Virtual Assistants and Chatbots
Integrating TTS into virtual assistants and chatbots allows businesses to provide personalized and engaging interactions with users, whether on websites or through messaging apps, enhancing customer engagement and support.
Enhanced Product Demonstrations
Sales teams can use TTS to create audio-enhanced product demonstrations or tutorials. This makes it easier for potential customers to understand product features and benefits, leading to more informed purchase decisions.
Capabilities
Synchronous API: Text to Speech supports a synchronous API over HTTPS. You send text input and receive audio in the response.
Multiple Output Formats: Text to Speech can return output in PCM, MP3, OGG, and JSON formats.
Standard and Natural Voices: Text to Speech offers male and female voices in both standard and natural (human-like) variants.
Chunk Streaming Support: Text to Speech supports chunked transfer encoding over HTTPS. You send a request with input text and receive the audio output in chunks, which reduces latency on the client side.
Speech Synthesis Markup Language (SSML): You can send SSML in your Text to Speech request to customize the audio response, for example by specifying pauses and how acronyms, dates, times, and abbreviations are spoken.
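The chunk streaming capability above can be illustrated from the client side. This is a minimal sketch that only shows how audio might be consumed incrementally. The function `write_chunks` and the fake chunk list are hypothetical, and the service's endpoint, headers, and request body are deliberately left out because they are not described here.

```python
import io
from typing import BinaryIO, Iterable


def write_chunks(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write audio chunks to `sink` as they arrive and return the byte count."""
    total = 0
    for chunk in chunks:
        if not chunk:  # ignore empty chunks
            continue
        sink.write(chunk)
        total += len(chunk)
    return total


# Stand-in for the body of a chunked HTTPS response (hypothetical data).
fake_response = [b"\x00\x01", b"", b"\x02\x03\x04"]
buffer = io.BytesIO()
received = write_chunks(fake_response, buffer)
print(received)
```

With a real HTTP client, `chunks` would be the client's streaming iterator, so playback or writing to disk can begin before the full response has arrived.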
The <break> tag adds a pause to your message. It is available for both natural and standard voices.
<break> Attributes
| Attribute | Value | Description |
|---|---|---|
| time | [number]s | The duration of the pause, in seconds. |
| time | [number]ms | The duration of the pause, in milliseconds. |
| strength | none | No pause. Use none to remove a normally occurring pause, such as after a period. Equivalent to "0ms". |
| strength | x-weak | Has the same strength as none, no pause. |
| strength | weak | Sets a pause of the same duration as the pause after a comma. Equivalent to "150ms". |
| strength | medium | Has the same strength as weak. |
| strength | strong | Sets a pause of the same duration as the pause after a sentence. Equivalent to "400ms". |
| strength | x-strong | Sets a pause of the same duration as the pause after a paragraph. Equivalent to "800ms". |
Example 1:
<speak>
Close your eyes, take a deep breath <break time="1s"/>
and let go of all the stress and worries.
Feel the gentle breeze <break time="1500ms"/> as
it caresses your skin, and listen to the
soothing sounds of nature.
</speak>
Example 2:
<speak>
Let me give you a demonstration of the <break strength="x-strong"/> x-strong pause.
Now, let's try a <break strength="strong"/> strong pause.
Finally, we have a <break strength="weak"/> weak pause.
</speak>
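When SSML is generated by a program, a small helper can keep <break> tags consistent with the attribute table. The following is a sketch, not part of the service: `break_tag` and `STRENGTH_MS` are hypothetical names, and the millisecond values are copied from the equivalents listed in the table.

```python
from typing import Optional

# Equivalent durations for each strength, taken from the attribute table.
STRENGTH_MS = {
    "none": 0,
    "x-weak": 0,
    "weak": 150,
    "medium": 150,
    "strong": 400,
    "x-strong": 800,
}


def break_tag(time_ms: Optional[int] = None, strength: Optional[str] = None) -> str:
    """Return a <break/> element using either an explicit time or a strength."""
    if (time_ms is None) == (strength is None):
        raise ValueError("pass exactly one of time_ms or strength")
    if time_ms is not None:
        if time_ms < 0:
            raise ValueError("time_ms must be non-negative")
        return f'<break time="{time_ms}ms"/>'
    if strength not in STRENGTH_MS:
        raise ValueError(f"unknown strength: {strength!r}")
    return f'<break strength="{strength}"/>'


print(break_tag(time_ms=1500))
print(break_tag(strength="x-strong"))
```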
The <s> tag adds a pause between lines or sentences in the text. It has the same effect as ending a sentence with a period or inserting <break strength="strong"/>. It is available for both natural and standard voices.
<speak>
<s>This is the first sentence</s>
<s>This is the second sentence</s>
This is the last sentence.
</speak>
The <p> tag adds a pause at the end of each paragraph in your text. The pause is longer than the one native speakers usually place at a comma or at the end of a sentence. It is available for both natural and standard voices.
<speak>
<p>Good morning, ladies and gentlemen. I would like to take this opportunity to welcome you all to our annual conference on artificial intelligence.</p>
<p>Our keynote speaker for this event is Dr. Samantha Johnson, a renowned expert in machine learning and data analytics.</p>
</speak>
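Sentence and paragraph tags are easy to generate from structured text. The sketch below assumes the input is already split into paragraphs and sentences. `paragraphs_to_ssml` is a hypothetical helper, and it escapes characters such as & so that the result stays well-formed XML.

```python
from xml.sax.saxutils import escape


def paragraphs_to_ssml(paragraphs: list[list[str]]) -> str:
    """Wrap each sentence in <s> and each group of sentences in <p>."""
    parts = ["<speak>"]
    for sentences in paragraphs:
        body = "".join(f"<s>{escape(s)}</s>" for s in sentences)
        parts.append(f"<p>{body}</p>")
    parts.append("</speak>")
    return "\n".join(parts)


ssml = paragraphs_to_ssml([
    ["Good morning, ladies & gentlemen.", "Welcome to the conference."],
    ["Our keynote speaker is Dr. Samantha Johnson."],
])
print(ssml)
```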
The <say-as> tag specifies how certain characters, words, and numbers are spoken. It is available for both natural and standard voices.
| Attribute | Value | Description |
|---|---|---|
| interpret-as | date | Interprets the contained text as a Gregorian calendar date. The format of the date must be specified with the format attribute. The date separator can be a forward slash (/), dash (-), or period (.). No white space is allowed inside a date string. |
| interpret-as | time | Interprets the numerical text as a duration in hours, minutes, and seconds. The text must be in hour:min or hour:min:seconds format. Optionally, it can be followed by "A.M." or "P.M."; A.M. can also be written as AM, a.m., or am. Setting detail="1" instructs the SSML parser to give the output in 24-hour format, and setting detail="2" instructs it to give the output in 12-hour format. |
| interpret-as | fraction | Interprets the numerical text as a fraction. It works for both common and mixed fractions. |
| interpret-as | digits | Spells out each digit individually. For example, 1234 is spoken as 1-2-3-4. |
| interpret-as | cardinal | Interprets the numerical text as a cardinal number. |
| interpret-as | ordinal | Interprets the numerical text as an ordinal number. For example, '1' is interpreted as '1st', '2' as '2nd', and so on. |
| interpret-as | spell-out | Speaks each character of the text enclosed in the say-as tag, including punctuation marks, special symbols, and spaces. |
| interpret-as | unit | Interprets the numerical text as a measurement. The value must be either a number or a fraction, followed by a unit with no spaces. |
Example:
<speak>
<p>Say As tag controls how special types of words are spoken, such as numbers, currencies, units, dates, times and acronyms</p>
For Example:
I can speak acronym <say-as interpret-as="spell-out">IRFC</say-as> for Indian Railway Finance Corporation.
I can speak India currency <say-as interpret-as="currency">₹5200</say-as>.
I can speak US currency <say-as interpret-as="currency">$5200</say-as>.
I can speak dimensions <say-as interpret-as="unit">5cm</say-as> length and <say-as interpret-as="unit">10cm</say-as> width.
I can speak temperature <say-as interpret-as="unit">25°C</say-as>.
I can speak fraction values <say-as interpret-as="fraction">3/4</say-as>.
I can speak ordinals <say-as interpret-as="ordinal">1731</say-as> Rank.
I can speak digits <say-as interpret-as="digits">1234</say-as> and <say-as interpret-as="digits">5678</say-as>.
I can speak date <say-as interpret-as="date" format="ymd">2022-11-13</say-as> and time <say-as interpret-as="time">10:00 AM</say-as>.
</speak>
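A helper that builds <say-as> elements can catch the mistakes the table warns about before a request is sent. This is a sketch under stated assumptions: `say_as` and `INTERPRET_AS` are hypothetical names, the allowed values are those listed in the table plus "currency" (which appears only in the example above), and the only date rules enforced are the two stated in the table (a format is required and white space is not allowed).

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr

# Values from the attribute table; "currency" is taken from the example only.
INTERPRET_AS = {
    "date", "time", "fraction", "digits", "cardinal",
    "ordinal", "spell-out", "unit", "currency",
}


def say_as(text: str, interpret_as: str, fmt: Optional[str] = None) -> str:
    """Return a <say-as> element, validating the rules stated in the table."""
    if interpret_as not in INTERPRET_AS:
        raise ValueError(f"unknown interpret-as value: {interpret_as!r}")
    if interpret_as == "date":
        if fmt is None:
            raise ValueError("a date requires the format attribute")
        if any(ch.isspace() for ch in text):
            raise ValueError("white space is not allowed inside a date string")
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if fmt is not None:
        attrs += f" format={quoteattr(fmt)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"


print(say_as("2022-11-13", "date", fmt="ymd"))
print(say_as("IRFC", "spell-out"))
```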
The <sub> tag is used with the alias attribute to substitute a different word (or pronunciation) for selected text, such as an acronym or abbreviation. It is available for both natural and standard voices.
Example:
<speak>
My favorite chemical element is <sub alias="Mercury">Hg</sub>, because it looks so shiny.
</speak>
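If the same abbreviations recur across many messages, the substitution can be applied from a lookup table. The sketch below is one possible approach: `apply_aliases` is a hypothetical helper that wraps whole-word matches in <sub> tags and escapes the rest of the text.

```python
import re
from xml.sax.saxutils import escape, quoteattr


def apply_aliases(text: str, aliases: dict[str, str]) -> str:
    """Wrap each known abbreviation in <sub alias="...">; escape everything else."""
    if not aliases:
        return escape(text)
    # Match any key as a whole word.
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in aliases) + r")\b")
    out, pos = [], 0
    for m in pattern.finditer(text):
        out.append(escape(text[pos:m.start()]))
        word = m.group(1)
        out.append(f"<sub alias={quoteattr(aliases[word])}>{escape(word)}</sub>")
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)


result = apply_aliases("My favorite chemical element is Hg.", {"Hg": "Mercury"})
print(result)
```

The returned fragment still needs to be placed inside a <speak> element before it is sent.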
The <prosody> tag controls the stress and intonation of speech through its rate, volume, and pitch attributes. It is available for standard voices only.
| Attribute | Value | Description |
|---|---|---|
| rate | "X%" | Controls the speed of speech. X denotes an increase (+X%) or decrease (-X%) relative to the default speaking rate. The percentage value must be less than 100%. |
| rate | default | Default speaking rate. |
| rate | x-slow | Very slow speaking rate. |
| rate | slow | Slow speaking rate. |
| rate | medium | Medium speaking rate. Same as the default speaking rate. |
| rate | fast | Fast speaking rate. |
| rate | x-fast | Very fast speaking rate. |
| volume | "XdB" | Controls the volume of the speech. The value does not set a fixed volume; it changes the volume relative to the current volume. X can be positive or negative, depending on whether you want to increase or decrease the volume. |
| volume | default | Default volume. |
| volume | x-soft | Very low volume, approximately 12 dB lower than the default. |
| volume | soft | Low volume, approximately 6 dB lower than the default. |
| volume | medium | Medium volume. Same as the default volume. |
| volume | loud | Loud volume, approximately 6 dB higher than the default. |
| volume | x-loud | Very loud volume, approximately 12 dB higher than the default. |
| pitch | default | Default pitch. |
| pitch | x-low | Very low pitch. |
| pitch | low | Low pitch. |
| pitch | medium | Medium pitch. Same as the default pitch. |
| pitch | high | High pitch. |
| pitch | x-high | Very high pitch. |
Example 1:
<speak>
<prosody rate="0%">This is the default speaking rate.</prosody>
<prosody rate="-50%">Decrease the speaking rate by half the default rate.</prosody>
<prosody rate="+50%">Increase the speaking rate by fifty percent of the default rate.</prosody>
</speak>
Example 2:
<speak>
<p>
<s>Hi, this is a normal sentence.</s>
<s>
<prosody volume="+10dB">This is a louder sentence!</prosody>
</s>
<s>
<prosody volume="-8dB">This is a quieter sentence.</prosody>
</s>
</p>
</speak>
Example 3:
<speak>
<prosody pitch='default'>This is the default pitch.</prosody>
<prosody pitch='medium'>This is the default pitch.</prosody>
<prosody pitch='x-low'>This is the very low pitch.</prosody>
<prosody pitch='x-high'>This is the very high pitch.</prosody>
</speak>
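The prosody values in the table can be checked before a request is built. The following is a sketch, not the service's own validation: `prosody` is a hypothetical helper, the named values are copied from the table, and the "less than 100%" rule for rate is read here as a limit on the magnitude of X, which is an assumption.

```python
import re
from typing import Optional
from xml.sax.saxutils import escape

# Named values from the attribute table.
RATE_NAMES = {"default", "x-slow", "slow", "medium", "fast", "x-fast"}
VOLUME_NAMES = {"default", "x-soft", "soft", "medium", "loud", "x-loud"}
PITCH_NAMES = {"default", "x-low", "low", "medium", "high", "x-high"}

PERCENT_RE = re.compile(r"^[+-]?\d+%$")   # "0%", "-50%", "+50%"
DECIBEL_RE = re.compile(r"^[+-]?\d+dB$")  # "+10dB", "-8dB"


def _check_rate(rate: str) -> None:
    if rate in RATE_NAMES:
        return
    # Assumption: "less than 100%" limits the magnitude of the change.
    if not PERCENT_RE.match(rate) or abs(int(rate[:-1])) >= 100:
        raise ValueError(f"invalid rate: {rate!r}")


def prosody(text: str, rate: Optional[str] = None,
            volume: Optional[str] = None, pitch: Optional[str] = None) -> str:
    """Return a <prosody> element after validating each attribute that is set."""
    attrs = []
    if rate is not None:
        _check_rate(rate)
        attrs.append(f'rate="{rate}"')
    if volume is not None:
        if volume not in VOLUME_NAMES and not DECIBEL_RE.match(volume):
            raise ValueError(f"invalid volume: {volume!r}")
        attrs.append(f'volume="{volume}"')
    if pitch is not None:
        if pitch not in PITCH_NAMES:
            raise ValueError(f"invalid pitch: {pitch!r}")
        attrs.append(f'pitch="{pitch}"')
    if not attrs:
        raise ValueError("set at least one of rate, volume, or pitch")
    return "<prosody " + " ".join(attrs) + ">" + escape(text) + "</prosody>"


print(prosody("Slower speech.", rate="-50%"))
print(prosody("Louder and higher.", volume="+10dB", pitch="x-high"))
```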
The <voice> tag allows you to use multiple voices in a single SSML request. It is available for both natural and standard voices.
Example:
<speak>
<voice name="Bob">Hello Cindy, how are you doing.</voice>
<voice name="Cindy">Hello Bob, I am good, thank you.</voice>
<voice name="Bob">Hope you enjoyed your stay with us.</voice>
<voice name="Cindy">Yes, it was lovely. I enjoyed the food and the services a lot. Thank you for hosting me. I would love to be back sometime soon.</voice>
</speak>
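A dialogue like the one above can be assembled from a list of speaker and line pairs. This is a minimal sketch: `dialogue_to_ssml` is a hypothetical helper, and the voice names are taken from the example rather than from a list of voices the service is known to offer.

```python
from xml.sax.saxutils import escape, quoteattr


def dialogue_to_ssml(turns: list[tuple[str, str]]) -> str:
    """Build one <voice> element per (voice name, text) turn inside <speak>."""
    lines = ["<speak>"]
    for name, text in turns:
        lines.append(f"<voice name={quoteattr(name)}>{escape(text)}</voice>")
    lines.append("</speak>")
    return "\n".join(lines)


dialogue = dialogue_to_ssml([
    ("Bob", "Hello Cindy, how are you doing?"),
    ("Cindy", "Hello Bob, I am good, thank you."),
])
print(dialogue)
```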