Microsoft TTS XML

Important: Costs vary for prebuilt neural voices (referred to as Neural on the pricing page) and custom neural voices (referred to as Custom Neural on the pricing page).

Note: The above regions are available for neural voice model hosting and real-time synthesis.

Note: If your selected voice and output format have different bit rates, the audio is resampled as necessary.

Request headers: The Authorization header carries an authorization token preceded by the word Bearer. For more information, see Authentication. The Content-Type header specifies the content type for the provided text. The X-Microsoft-OutputFormat header specifies the audio output format.

Response status codes: 400 (Bad Request) means that a required parameter is missing, empty, or null, or that the value passed to either a required or optional parameter is invalid. A common issue is a header that is too long. 401 (Unauthorized) means that the request is not authorized. Check to make sure your subscription key or token is valid and in the correct region.
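As a rough illustration of how the request pieces described above fit together, the following sketch builds (but does not send) a synthesis request with Python's standard library. The header names, endpoint pattern, region, voice name, output format value, and token placeholder are all assumptions chosen for illustration; check them against the service documentation before use.

```python
import urllib.request

# Assumptions: the endpoint pattern, region, voice name, and output format
# below are illustrative values, not taken from the text above.
REGION = "westus"
ENDPOINT = f"https://{REGION}.tts.speech.microsoft.com/cognitiveservices/v1"


def build_tts_request(token: str, ssml: str,
                      output_format: str = "riff-24khz-16bit-mono-pcm",
                      ) -> urllib.request.Request:
    """Build (but do not send) a text-to-speech POST request."""
    headers = {
        # Authorization token preceded by the word "Bearer".
        "Authorization": f"Bearer {token}",
        # Content type of the provided text (SSML here).
        "Content-Type": "application/ssml+xml",
        # Requested audio output format.
        "X-Microsoft-OutputFormat": output_format,
    }
    return urllib.request.Request(
        ENDPOINT, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xml:lang="en-US">'
    '<voice name="en-US-JennyNeural">Hello, world.</voice>'
    "</speak>"
)
request = build_tts_request("YOUR_TOKEN", ssml)  # placeholder token
print(request.get_method(), request.full_url)
for name, value in request.header_items():
    print(f"{name}: {value}")
```

Keeping header values short and generating the token separately makes it easier to diagnose the "bad request" and "not authorized" failures mentioned above.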

Zero represents the default middle pitch for a voice, with positive values being higher and negative values being lower.

The Emph tag instructs the voice to emphasize a word or section of text. The Emph tag cannot be empty. For example: "The following word should be emphasized."

The Spell tag forces the voice to spell out all text, rather than using its default word and sentence breaking rules, normalization rules, and so forth. All characters should be expanded to corresponding words, including punctuation, numbers, and so forth. The Spell tag cannot be empty.

Three supported tags give applications the ability to insert items directly at some level: Silence, Pron, and Bookmark.

The Silence tag inserts a specified number of milliseconds of silence into the output audio stream. This tag must be empty, and must have one attribute, Msec.

The Pron tag inserts a specified pronunciation. The voice will process the sequence of phonemes exactly as they are specified. This tag can be empty, or it can have content. If it does have content, the phonemes are interpreted as the pronunciation for the enclosed text. That is, the enclosed text will not be processed as it normally would be.
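The Emph, Spell, Silence, and Pron tags described above can be sketched as small Python helpers that emit SAPI-style XML. This is a minimal sketch, not the official API: the lowercase tag spellings, the `sym` attribute name on the Pron tag, and the phoneme string are assumptions made for illustration, and the wrapping `<root>` element exists only so the fragment can be checked for well-formedness.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def emph(text: str) -> str:
    """Emphasize text. The Emph tag cannot be empty."""
    if not text.strip():
        raise ValueError("Emph tag cannot be empty")
    return f"<emph>{escape(text)}</emph>"


def spell(text: str) -> str:
    """Spell out text. The Spell tag cannot be empty."""
    if not text.strip():
        raise ValueError("Spell tag cannot be empty")
    return f"<spell>{escape(text)}</spell>"


def silence(msec: int) -> str:
    """Insert silence. The tag is empty and has one attribute, Msec."""
    if msec < 0:
        raise ValueError("msec must be non-negative")
    return f'<silence msec="{msec}"/>'


def pron(sym: str, text: str = "") -> str:
    """Insert a pronunciation; 'sym' is an assumed attribute name."""
    if text:
        return f"<pron sym={quoteattr(sym)}>{escape(text)}</pron>"
    return f"<pron sym={quoteattr(sym)}/>"


fragment = (
    "The following word should be " + emph("emphasized") + "."
    + silence(500)
    + spell("SAPI")
    + " "
    + pron("h eh l ow", "hello")  # hypothetical phoneme string
)

# Wrap in a throwaway root element purely to confirm well-formed XML.
root = ET.fromstring(f"<root>{fragment}</root>")
print(fragment)
```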

The Bookmark tag inserts a bookmark event into the output audio stream. Use this event to signal the application when the audio corresponding to the text at the Bookmark tag has been reached. The Bookmark tag must be empty. It has one attribute, Mark, whose value is a string. This value can be used to differentiate between bookmark events, each of which contains the string value from its corresponding tag.

Two tags provide context to the current voice: PartOfSp and Context. These tags enable the voice to determine how to deal with the text it is processing.
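To make the Bookmark behavior described above more concrete, the sketch below inserts bookmarks into a piece of text and then lists the Mark strings in document order, which is the order in which an application would expect the corresponding events. The lowercase tag and attribute spellings and the helper names are assumptions for illustration; real event delivery depends on the speech engine and is not modeled here.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def bookmark(mark: str) -> str:
    """Return an empty Bookmark tag carrying the given Mark string."""
    return f"<bookmark mark={quoteattr(mark)}/>"


def bookmark_names(sapi_xml: str) -> list:
    """List Mark values in document order (hypothetical helper)."""
    # The throwaway root element only makes the fragment parseable.
    root = ET.fromstring(f"<root>{sapi_xml}</root>")
    return [el.get("mark") for el in root.iter("bookmark")]


text = (
    "First paragraph. " + bookmark("para-2")
    + "Second paragraph. " + bookmark("end")
    + "Done."
)
print(bookmark_names(text))  # ['para-2', 'end']
```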

With both of these tags, the extent to which voices use the context may vary. The PartOfSp tag provides the voice with the part of speech of the enclosed word(s). Use this tag to enable the voice to pronounce a word with multiple pronunciations correctly depending on its part of speech. The PartOfSp tag cannot be empty.

Pause durations are given in seconds or milliseconds. Examples of valid values are 2s and a number of milliseconds followed by ms. The silence type specifies where silence is added: Leading (at the beginning of text), Tailing (at the end of text), or Sentenceboundary (between adjacent sentences).
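The silence placement options above could be validated before being written into SSML, as in the following sketch. The element name `mstts:silence`, its `type` and `value` attribute names, and the idea that the duration examples apply to this element are assumptions for illustration; the accepted duration pattern is deliberately simple.

```python
import re

# Placement values listed in the text above.
SILENCE_TYPES = {"Leading", "Tailing", "Sentenceboundary"}

# A whole number followed by "s" or "ms", for example 2s or 200ms.
DURATION_RE = re.compile(r"^\d+(ms|s)$")


def silence_element(silence_type: str, value: str) -> str:
    """Build a silence element after checking type and duration."""
    if silence_type not in SILENCE_TYPES:
        raise ValueError(f"unknown silence type: {silence_type!r}")
    if not DURATION_RE.match(value):
        raise ValueError(f"invalid duration: {value!r}")
    # Assumed element and attribute names; verify against the SSML docs.
    return f'<mstts:silence type="{silence_type}" value="{value}"/>'


print(silence_element("Sentenceboundary", "200ms"))
```

Rejecting unknown types early is useful because a malformed document is typically refused as a whole rather than partially synthesized.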

The alphabet attribute specifies the phonetic alphabet to use when synthesizing the pronunciation of the string in the ph attribute. The string specifying the alphabet must be in lowercase letters.

The ph attribute is a string containing phones that specify the pronunciation of the word in the phoneme element. If the specified string contains unrecognized phones, the Text-to-Speech (TTS) service rejects the entire SSML document and produces none of the speech output specified in the document.
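A phoneme element with its alphabet and ph attributes might be generated as follows. This is a sketch under assumptions: the alphabet name `ipa` and the IPA transcription of "tomato" are illustrative choices, and the helper only enforces the lowercase rule stated above, not whether the service recognizes every phone.

```python
import xml.etree.ElementTree as ET


def phoneme(text: str, ph: str, alphabet: str = "ipa") -> str:
    """Return a phoneme element; the alphabet name must be lowercase."""
    if alphabet != alphabet.lower():
        raise ValueError("alphabet must be specified in lowercase letters")
    element = ET.Element("phoneme", {"alphabet": alphabet, "ph": ph})
    element.text = text
    # encoding="unicode" returns a str and keeps non-ASCII phones readable.
    return ET.tostring(element, encoding="unicode")


markup = phoneme("tomato", "təˈmeɪtoʊ")  # illustrative IPA string
print(markup)
```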

The pitch attribute indicates the baseline pitch for the text. You can express the pitch as an absolute value, written as a number followed by "Hz" (Hertz); as a relative value, written as a number preceded by "+" or "-" and followed by "Hz" or "st", where "st" indicates that the change unit is a semitone, which is half of a tone (a half step) on the standard diatonic scale; or as a constant value: x-low, low, medium, high, x-high, or default.

Contour is now supported for neural voices. Contour represents changes in pitch. These changes are represented as an array of targets at specified time positions in the speech output. Each target is defined by sets of parameter pairs. The first value specifies the position of the pitch change as a percentage of the duration of the text. The second value specifies the amount to raise or lower the pitch, using a relative value or an enumeration value for pitch (see pitch).
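The contour targets described above can be assembled from (position, change) pairs, as in this sketch. It assumes that the first value of each pair is a percentage of the text's duration and that pairs are written as `(position%,change)` separated by spaces; the sample targets and sentence are illustrative.

```python
def contour(targets) -> str:
    """Format (percent, pitch_change) pairs as a contour attribute value."""
    parts = []
    for position, change in targets:
        # Assumed: positions are percentages of the text's duration.
        if not 0 <= position <= 100:
            raise ValueError("position must be between 0 and 100 percent")
        parts.append(f"({position}%,{change})")
    return " ".join(parts)


value = contour([(0, "+20Hz"), (10, "-2st"), (40, "+10Hz")])
print(value)  # (0%,+20Hz) (10%,-2st) (40%,+10Hz)
print(f'<prosody contour="{value}">Were you the only person in the room?</prosody>')
```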

Range is a value that represents the range of pitch for the text. You can express range using the same absolute values, relative values, or enumeration values used to describe pitch.

Rate indicates the speaking rate of the text. You can express rate as a relative value, written as a number that acts as a multiplier of the default. For example, a value of 1 results in no change in the rate, a value below 1 slows the rate, and a value of 3 results in a tripling of the rate.
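Because a relative rate acts as a multiplier of the default, its effect on duration is simple arithmetic. The sketch below wraps text in a prosody element and estimates the new duration; the helper names and the 6-second baseline are hypothetical, and real durations will vary by voice.

```python
from xml.sax.saxutils import escape


def prosody(text: str, rate: float = 1.0, pitch: str = "default") -> str:
    """Wrap text in a prosody element with a relative rate multiplier."""
    if rate <= 0:
        raise ValueError("rate multiplier must be positive")
    # ":g" drops a trailing ".0", so 3.0 is written as "3".
    return f'<prosody rate="{rate:g}" pitch="{pitch}">{escape(text)}</prosody>'


def estimated_duration(default_seconds: float, rate: float) -> float:
    """Speaking N times faster takes roughly 1/N of the default time."""
    return default_seconds / rate


print(prosody("Hello", rate=3))
print(estimated_duration(6.0, 3))  # 2.0
```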

Rate can also be a constant value: x-slow, slow, medium, fast, x-fast, or default.

Volume indicates the volume level of the speaking voice. You can express the volume as an absolute value, written as a number in a range that starts at 0, or as a constant value: silent, x-soft, soft, medium, loud, x-loud, or default.

The format attribute provides additional information about the precise formatting of the element's text for content types that might have ambiguous formats.

SSML defines formats for the content types that use them. The detail attribute indicates the level of detail to be spoken. For example, this attribute might request that the speech synthesis engine pronounce punctuation marks. There are no standard values defined for detail.

Depending on the content type, the text is spoken as an address, a cardinal number, individual letters spelled out, a date, a sequence of individual digits, a fractional number, an ordinal number, a telephone number, or a time.

For a telephone number, the format attribute can contain digits that represent a country code. For example, "1" for the United States or "39" for Italy. The speech synthesis engine can use this information to guide its pronunciation of a phone number. The phone number might also include the country code, and if so, that code takes precedence over the country code in the format.

For a time, the format attribute specifies whether the time is given using a 12-hour clock (hms12) or a 24-hour clock (hms24). Use a colon to separate numbers representing hours, minutes, and seconds.
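The telephone and time cases above might be expressed with a small helper like the one below. It is a sketch under assumptions: the element name `say-as`, the `interpret-as` attribute, and the content type names `telephone` and `time` are not spelled out in the text above, and the sample phone number and time are made up.

```python
import xml.etree.ElementTree as ET


def say_as(text: str, interpret_as: str, fmt=None, detail=None) -> str:
    """Return a say-as element with optional format and detail attributes."""
    attributes = {"interpret-as": interpret_as}
    if fmt is not None:
        attributes["format"] = fmt
    if detail is not None:
        attributes["detail"] = detail
    element = ET.Element("say-as", attributes)
    element.text = text
    return ET.tostring(element, encoding="unicode")


# Country code "1" (United States) guides pronunciation of the number.
phone_markup = say_as("(888) 555-1212", "telephone", fmt="1")
# "hms12" marks the time as using a 12-hour clock.
time_markup = say_as("4:00am", "time", fmt="hms12")
print(phone_markup)
print(time_markup)
```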

With the name content type, the text is spoken as a person name. In Chinese names, some characters are pronounced differently when they appear in a family name.

For background audio, the volume setting specifies the volume of the background audio file. Accepted values start at 0, and the default value is 1. The fade-in setting specifies the duration of the background audio "fade in" in milliseconds. The default value is 0, which is equivalent to no fade in.
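The background audio settings above could be combined into a single element as sketched here. The element name `mstts:backgroundaudio`, the attribute names `src`, `volume`, and `fadein`, and the example URL are assumptions for illustration; only the defaults (volume 1, fade-in 0 milliseconds) come from the text above, and no upper bound on volume is enforced because it is not stated.

```python
from xml.sax.saxutils import quoteattr


def background_audio(src: str, volume: float = 1.0, fadein_ms: int = 0) -> str:
    """Build a background audio element with volume and fade-in settings."""
    if volume < 0:
        raise ValueError("volume must not be negative")
    if fadein_ms < 0:
        raise ValueError("fade-in duration must not be negative")
    # Assumed element and attribute names; verify against the SSML docs.
    return (
        f"<mstts:backgroundaudio src={quoteattr(src)} "
        f'volume="{volume:g}" fadein="{fadein_ms}"/>'
    )


# Hypothetical audio URL; a fade-in of 3000 ms is three seconds.
print(background_audio("https://example.com/music.wav", volume=0.7, fadein_ms=3000))
```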
