This repository was archived by the owner on Mar 27, 2026. It is now read-only.

Techmo TTS Service API

1. TTS GRPC API

The base API of the service is defined by the proto file. It includes the functions listed below:

service TTS
{
	rpc GetServiceVersion(GetServiceVersionRequest) returns (GetServiceVersionResponse);
	rpc GetResourcesId(GetResourcesIdRequest) returns (GetResourcesIdResponse);

	rpc ListVoices(ListVoicesRequest) returns (ListVoicesResponse);
	rpc ListSoundIcons(ListSoundIconsRequest) returns (ListSoundIconsResponse);
	rpc ListRecordings(ListRecordingsRequest) returns (ListRecordingsResponse);
	rpc ListLexicons(ListLexiconsRequest) returns (ListLexiconsResponse);

	rpc SynthesizeStreaming(SynthesizeRequest) returns (stream SynthesizeResponse);
	rpc Synthesize(SynthesizeRequest) returns (SynthesizeResponse);

	rpc GetChannelsUsage(GetChannelsUsageRequest) returns (GetChannelsUsageResponse);

	rpc PutRecording(PutRecordingRequest) returns (PutRecordingResponse);
	rpc DeleteRecording(DeleteRecordingRequest) returns (DeleteRecordingResponse);
	rpc GetRecording(GetRecordingRequest) returns (GetRecordingResponse);

	rpc PutLexicon(PutLexiconRequest) returns (PutLexiconResponse);
	rpc DeleteLexicon(DeleteLexiconRequest) returns (DeleteLexiconResponse);
	rpc GetLexicon(GetLexiconRequest) returns (GetLexiconResponse);
}

1.1. Functions Definitions

GetServiceVersion

rpc GetServiceVersion(GetServiceVersionRequest) returns (GetServiceVersionResponse)

Returns the version of the service, in SemVer format.

GetResourcesId

rpc GetResourcesId(GetResourcesIdRequest) returns (GetResourcesIdResponse)

Returns an identifier of the resources used by the service.

ListVoices

rpc ListVoices(ListVoicesRequest) returns (ListVoicesResponse)

Lists all available voices which can be used to synthesize speech.

ListSoundIcons

rpc ListSoundIcons(ListSoundIconsRequest) returns (ListSoundIconsResponse)

Lists all sound icons (their keys) for the requested (voice, variant, language) tuple.

ListRecordings

rpc ListRecordings(ListRecordingsRequest) returns (ListRecordingsResponse)

Lists all recordings (their keys) for the requested (voice, variant, language) tuple.

ListLexicons

rpc ListLexicons(ListLexiconsRequest) returns (ListLexiconsResponse)

Lists all currently loaded lexicons that can be referenced by the <lexicon> tag in synthesis requests.

SynthesizeStreaming

rpc SynthesizeStreaming(SynthesizeRequest) returns (stream SynthesizeResponse)

Synthesizes the speech (audio signal) based on the requested phrase and the optional configuration.
Returns audio signal with synthesized speech (streaming version, one or more response packets).

Synthesize

rpc Synthesize(SynthesizeRequest) returns (SynthesizeResponse)

Synthesizes the speech (audio signal) based on the requested phrase and the optional configuration.
Returns audio signal with synthesized speech (non-streaming version, always one response packet).

GetChannelsUsage

rpc GetChannelsUsage(GetChannelsUsageRequest) returns (GetChannelsUsageResponse)

Returns the total number of available channels and the number of channels currently in use.

PutRecording

rpc PutRecording(PutRecordingRequest) returns (PutRecordingResponse)

Adds a new recording with the requested key for the requested voice, or overwrites the existing recording if one with that key is already defined.

Note:
The licence must allow reconfiguration; otherwise, a PERMISSION_DENIED error is returned.

DeleteRecording

rpc DeleteRecording(DeleteRecordingRequest) returns (DeleteRecordingResponse)

Removes the recording with the requested key from the list of recordings of the requested voice.

Note:
The licence must allow reconfiguration; otherwise, a PERMISSION_DENIED error is returned.

GetRecording

rpc GetRecording(GetRecordingRequest) returns (GetRecordingResponse)

Sends back the content of the recording with the requested key for the requested voice. The data is returned in linear PCM16 format.

PutLexicon

rpc PutLexicon(PutLexiconRequest) returns (PutLexiconResponse)

Adds a new lexicon with the requested name, or overwrites the existing one if a lexicon with that name already exists.

Note:
The licence must allow reconfiguration; otherwise, a PERMISSION_DENIED error is returned.

DeleteLexicon

rpc DeleteLexicon(DeleteLexiconRequest) returns (DeleteLexiconResponse)

Removes the lexicon with the requested name.

Note:
The licence must allow reconfiguration; otherwise, a PERMISSION_DENIED error is returned.

GetLexicon

rpc GetLexicon(GetLexiconRequest) returns (GetLexiconResponse)

Sends back the content of the lexicon with the requested name.

1.2. Requests and Responses Definitions

GetServiceVersionRequest

The request message for the GetServiceVersion function. The message is empty.

GetServiceVersionResponse

The version info returned by the GetServiceVersion function.

| Field | Type | Description |
|---|---|---|
| version | string | Version of the service, in SemVer format. |

GetResourcesIdRequest

The request message for the GetResourcesId function. The message is empty.

GetResourcesIdResponse

The identifier returned by the GetResourcesId function.

| Field | Type | Description |
|---|---|---|
| id | string | Identifier of the resource pack the service is started with. |

The identifier is a free-form string that uniquely identifies a resource pack provided with the service.

ListVoicesRequest

The request message for the ListVoices function.

| Field | Type | Description |
|---|---|---|
| language_code | string | ISO 639-1 language code with an optional dialect. Optional. When non-empty, limits the listed voices to those supporting the requested language. |

ListVoicesResponse

The listing of available voices returned by the ListVoices function.

| Field | Type | Description |
|---|---|---|
| sampling_rate_hz | int32 | The sampling rate in Hz of all voices (it is identical for all available voices). |
| voices | VoiceInfo (repeated) | The list of all available voices, or of the voices supporting the requested language. |

ListSoundIconsRequest

The request message for the ListSoundIcons function.

| Field | Type | Description |
|---|---|---|
| voice_profile | VoiceProfile | Profile of the voice to list the sound icons for. |

ListSoundIconsResponse

The result of the ListSoundIcons function.

| Field | Type | Description |
|---|---|---|
| keys | string (repeated) | The list of keys of all available sound icons for the requested voice profile. |

ListRecordingsRequest

The request message for the ListRecordings function.

| Field | Type | Description |
|---|---|---|
| voice_profile | VoiceProfile | Profile of the voice to list the recordings for. |

ListRecordingsResponse

The result of the ListRecordings function.

| Field | Type | Description |
|---|---|---|
| keys | string (repeated) | The list of keys of all available recordings for the requested voice profile. |

ListLexiconsRequest

The request message for the ListLexicons function.

| Field | Type | Description |
|---|---|---|
| language_code | string | ISO 639-1 language code with an optional dialect. Optional. When non-empty, limits the listed lexicons to those supporting the requested language. |

ListLexiconsResponse

The result of the ListLexicons function.

| Field | Type | Description |
|---|---|---|
| lexicons | LexiconInfo (repeated) | The list of all available lexicons. |

SynthesizeRequest

The request message for the SynthesizeStreaming and Synthesize functions.

| Field | Type | Description |
|---|---|---|
| text | string | A phrase to be synthesized. |
| synthesis_config | SynthesisConfig | Optional. Tweaks the default service synthesis configuration. |
| output_config | OutputConfig | Optional. Overrides the default output audio properties. |

The message contains a phrase to be synthesized and optional configurations.
The phrase to synthesize is either plain text in orthographic form or text marked up with a subset of SSML. Consult the service documentation for the full list of supported SSML tags.
The fields of synthesis_config specify the synthesis parameters (language, voice, prosodic properties, etc.), and output_config alters the format of the output (the sampling rate, and either PCM16 or an encoding such as Ogg/Vorbis).

SynthesizeResponse

The result of the SynthesizeStreaming and Synthesize functions.

| Field | Type | Description |
|---|---|---|
| sampling_rate_hz | int32 | Sampling rate of the returned audio in hertz. |
| audio | bytes | Audio data bytes either as Linear PCM (uncompressed 16-bit signed little-endian samples), or encoded if requested by output_config. |
| warnings | string (repeated) | All the warnings generated by the service during processing of the request. |

During SynthesizeStreaming, a series of one or more such messages is streamed back to the caller.
Synthesize, by contrast, returns exactly one response message.
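To illustrate how a client might consume the streamed messages, here is a minimal sketch that concatenates the audio chunks and wraps them in a WAV container. The SynthesizeResponse dataclass is a hypothetical stand-in for the generated message class, the helper name responses_to_wav is invented for this example, and a single (mono) channel is an assumption that the text above does not confirm. It applies only when the output is PCM16.

```python
import io
import wave
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class SynthesizeResponse:
    """Hypothetical stand-in mirroring the fields of the real message."""
    sampling_rate_hz: int
    audio: bytes
    warnings: List[str] = field(default_factory=list)


def responses_to_wav(responses: Iterable[SynthesizeResponse]) -> bytes:
    """Concatenate PCM16 chunks and wrap them in a WAV container."""
    pcm = bytearray()
    rate = None
    for response in responses:
        # Take the sampling rate from the first message that reports one.
        if rate is None and response.sampling_rate_hz > 0:
            rate = response.sampling_rate_hz
        pcm.extend(response.audio)
    if rate is None:
        raise ValueError("no response carried a sampling rate")
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono output
        wav.setsampwidth(2)   # PCM16: two bytes per sample
        wav.setframerate(rate)
        wav.writeframes(bytes(pcm))
    return buffer.getvalue()
```

The same function works for Synthesize if the single response is passed as a one-element list.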

GetChannelsUsageRequest

The request message for the GetChannelsUsage function. The message is empty.

GetChannelsUsageResponse

The result of the GetChannelsUsage function.

| Field | Type | Description |
|---|---|---|
| total_channels_count | int32 | The number of all available channels for the service, set by the licence. INT_MAX means unrestricted access. |
| used_channels_count | int32 | The number of channels currently in use. |
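A client could interpret the two counters along the lines of the sketch below. The function name free_channels is hypothetical, and treating INT_MAX as the largest 32-bit signed value (2147483647) is an assumption based on the int32 field type.

```python
from typing import Optional

# Assumption: INT_MAX is the maximum value of a 32-bit signed integer.
INT32_MAX = 2**31 - 1


def free_channels(total_channels_count: int, used_channels_count: int) -> Optional[int]:
    """Return the number of free channels, or None when access is unrestricted."""
    if total_channels_count == INT32_MAX:
        return None
    # Clamp at zero in case the counters are momentarily inconsistent.
    return max(total_channels_count - used_channels_count, 0)
```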

PutRecordingRequest

The request message for the PutRecording function.

| Field | Type | Description |
|---|---|---|
| voice_profile | VoiceProfile | Profile of the voice to put the recording for. |
| recording_key | string | The key of the new recording. |
| sampling_rate_hz | int32 | Sampling rate of the recording audio data in hertz. |
| content | bytes | The recording audio data, in linear PCM16 format. |

If a recording with this key already exists for the requested voice profile, its content is overwritten.
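As an illustration of preparing the content field, the following minimal sketch converts floating-point samples in the range [-1.0, 1.0] to raw PCM16 bytes. The helper name samples_to_pcm16 is hypothetical, and the little-endian, single-channel layout is an assumption borrowed from the SynthesizeResponse description; the byte order for recordings is not stated here.

```python
import struct
from typing import Sequence


def samples_to_pcm16(samples: Sequence[float]) -> bytes:
    """Pack float samples in [-1.0, 1.0] as 16-bit signed integers."""
    ints = []
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # clip out-of-range input
        ints.append(int(round(clipped * 32767)))
    # "<" selects little-endian byte order (assumption), "h" a 16-bit signed int.
    return struct.pack("<%dh" % len(ints), *ints)
```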

PutRecordingResponse

The result of the PutRecording function. The message is empty; the caller checks the returned GRPC status to verify the outcome.

DeleteRecordingRequest

The request message for the DeleteRecording function.

| Field | Type | Description |
|---|---|---|
| voice_profile | VoiceProfile | Profile of the voice to look for the recording. |
| recording_key | string | The requested key of the recording (unique for any given voice profile). |

DeleteRecordingResponse

The result of the DeleteRecording function. The message is empty; the caller checks the returned GRPC status to verify the outcome.

GetRecordingRequest

The request message for the GetRecording function.

| Field | Type | Description |
|---|---|---|
| voice_profile | VoiceProfile | Profile of the voice to look for the recording. |
| recording_key | string | The requested key of the recording (unique for any given voice profile). |

GetRecordingResponse

The result of the GetRecording function.

| Field | Type | Description |
|---|---|---|
| sampling_rate_hz | int32 | Sampling rate of the recording audio data in hertz. |
| content | bytes | The recording audio data, in linear PCM16 format. |
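Because PCM16 uses two bytes per sample, the playback duration of a returned recording follows directly from the two fields. The sketch below assumes a single channel, which is not stated here, and the function name pcm16_duration_seconds is hypothetical.

```python
def pcm16_duration_seconds(content: bytes, sampling_rate_hz: int) -> float:
    """Duration of raw PCM16 audio, assuming one channel."""
    if sampling_rate_hz <= 0:
        raise ValueError("sampling_rate_hz must be positive")
    if len(content) % 2 != 0:
        raise ValueError("PCM16 data must contain an even number of bytes")
    sample_count = len(content) // 2  # two bytes per 16-bit sample
    return sample_count / sampling_rate_hz
```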

PutLexiconRequest

The request message for the PutLexicon function.

| Field | Type | Description |
|---|---|---|
| uri | string | URI of the lexicon, used as the uri attribute of <lexicon> tags in synthesis requests. |
| outside_lookup_behaviour | OutsideLookupBehaviour | Whether the lexicon can be selected for phrases outside of <lookup> SSML tags. |
| content | string | The content of the lexicon; it must comply with PLS. |

The service supports only a subset of PLS. Consult the service documentation for the full list of supported PLS tags.
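For orientation, the sketch below builds a minimal lexicon document using the element names and namespace defined by the W3C PLS 1.0 recommendation. Whether the service accepts each of these elements and the ipa alphabet is an assumption, since the supported subset is described only in the service documentation. The helper name build_pls and the example entry are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C PLS 1.0 recommendation.
PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_pls(language_code: str, entries: dict) -> str:
    """Return a PLS document mapping graphemes to IPA phonemes."""
    ET.register_namespace("", PLS_NS)  # serialize PLS as the default namespace
    root = ET.Element(
        "{%s}lexicon" % PLS_NS,
        {"version": "1.0", "alphabet": "ipa", XML_LANG: language_code},
    )
    for grapheme, phoneme in entries.items():
        lexeme = ET.SubElement(root, "{%s}lexeme" % PLS_NS)
        ET.SubElement(lexeme, "{%s}grapheme" % PLS_NS).text = grapheme
        ET.SubElement(lexeme, "{%s}phoneme" % PLS_NS).text = phoneme
    return ET.tostring(root, encoding="unicode")


# Illustrative entry only; the resulting string would go into the content field.
content = build_pls("en", {"tomato": "təˈmeɪtoʊ"})
```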

PutLexiconResponse

The result of the PutLexicon function. The message is empty; the caller checks the returned GRPC status to verify the outcome.

DeleteLexiconRequest

The request message for the DeleteLexicon function.

| Field | Type | Description |
|---|---|---|
| uri | string | URI of the lexicon to delete. |

DeleteLexiconResponse

The result of the DeleteLexicon function. The message is empty; the caller checks the returned GRPC status to verify the outcome.

GetLexiconRequest

The request message for the GetLexicon function.

| Field | Type | Description |
|---|---|---|
| uri | string | URI of the lexicon whose content is to be returned. |

GetLexiconResponse

The result of the GetLexicon function.

| Field | Type | Description |
|---|---|---|
| outside_lookup_behaviour | OutsideLookupBehaviour | Whether the lexicon can be selected for phrases outside of <lookup> SSML tags. |
| content | string | If successful, contains the content of the lexicon, in PLS format. |

VoiceProfile

Identifies a voice, its variant, and a language code; together they select a set of sound icons and predefined recordings.

| Field | Type | Description |
|---|---|---|
| voice_name | string | The voice name to look for the recording. |
| voice_variant | int32 | The variant of the voice to look for the recording. |
| language_code | string | ISO 639-1 language code with an optional dialect to look for the recording. |

SynthesisConfig

Provides information to the synthesizer that specifies how to process the request.

| Field | Type | Description |
|---|---|---|
| language_code | string | ISO 639-1 language code with an optional dialect of text to be synthesized. May be overridden by SSML tags in request text. |
| voice | Voice | Requested voice to be used to synthesize the text. May be overridden by SSML tags in request text. |
| prosodic_properties | ProsodicProperties | Optional. Defines the parameters of synthesized speech. |
| silence_duration_between_segments_ms | int32 | Optional. Overrides the configured value for duration of silence between segments, in milliseconds. |

If there is no voice satisfying all the required criteria defined by the voice field, the voice is selected according to name (if defined) first, gender (if defined) second, and age (if defined) third.
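One possible reading of this fallback rule is a lexicographic ranking: a name match outweighs a gender match, which in turn outweighs an age match, and criteria that are not defined are ignored. The sketch below implements that reading only; it is not the service's actual algorithm. The VoiceInfo dataclass with string-valued gender and age and the function select_voice are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class VoiceInfo:
    """Hypothetical stand-in; gender and age are plain strings here."""
    name: str
    gender: str
    age: str


def select_voice(
    voices: Sequence[VoiceInfo],
    name: str = "",
    gender: Optional[str] = None,
    age: Optional[str] = None,
) -> VoiceInfo:
    """Pick the best candidate: name first, gender second, age third."""
    if not voices:
        raise ValueError("no voices available")

    def score(voice: VoiceInfo):
        # Tuples compare element by element, so earlier criteria dominate.
        return (
            bool(name) and voice.name == name,
            gender is not None and voice.gender == gender,
            age is not None and voice.age == age,
        )

    # max() returns the first voice among those with the highest score.
    return max(voices, key=score)
```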

OutputConfig

Defines the parameters of the output audio.

| Field | Type | Description |
|---|---|---|
| audio_encoding | AudioEncoding | Requested format of the output audio stream. |
| sampling_rate_hz | int32 | Desired sampling frequency in hertz of synthesized audio. The value 0 means use the default Synthesizer sampling rate. |
| max_frame_size | int32 | Maximum frame size sent at once to the client to enable RTF Throttling (default=0, throttling disabled). |

When RTF Throttling is enabled, the RTF (Real Time Factor) is throttled to 1.0, with one frame (of max_frame_size samples) sent in advance. The frame size is expressed in samples regardless of the audio_encoding used (a frame size expressed in bytes would likely be far smaller if the output is not PCM16). Enabling RTF Throttling guarantees that, when the connection is interrupted, the respective channel is freed after no longer than the playback time of one frame.

RTF Throttling is effective only for TTS::SynthesizeStreaming calls. It is silently ignored for TTS::Synthesize calls.
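Since max_frame_size is expressed in samples, the playback time of one frame, and therefore the upper bound on how long a channel stays occupied after an interrupted connection, is the frame size divided by the sampling rate. The helper names in this minimal sketch are hypothetical, and the byte count assumes single-channel PCM16 output.

```python
def frame_playback_seconds(max_frame_size: int, sampling_rate_hz: int) -> float:
    """Playback time of one frame of max_frame_size samples."""
    if max_frame_size <= 0:
        raise ValueError("throttling is disabled when max_frame_size is 0")
    if sampling_rate_hz <= 0:
        raise ValueError("sampling_rate_hz must be positive")
    return max_frame_size / sampling_rate_hz


def pcm16_frame_bytes(max_frame_size: int) -> int:
    """Size in bytes of one PCM16 frame (two bytes per sample, one channel)."""
    return max_frame_size * 2
```

For example, a frame of 1600 samples at 16000 Hz corresponds to a tenth of a second of audio and occupies 3200 bytes as PCM16.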

Voice

Voice definition used to describe requested voice in SynthesisConfig.

| Field | Type | Description |
|---|---|---|
| name | string | The name of the voice. If empty, it is not taken into account in voice selection. |
| gender | Gender (optional) | Gender of the voice. If not set, it is not taken into account in voice selection. |
| age | Age (optional) | Age of the voice. If not set, it is not taken into account in voice selection. |
| variant | int32 | Variant of the voice. |

ProsodicProperties

Prosodic properties of the speech to be synthesized.

| Field | Type | Description |
|---|---|---|
| pitch | float | The average speech pitch scaling factor. The value 1.0 is a neutral value. |
| range | float | The pitch range scaling factor. The value 1.0 is a neutral value. |
| rate | float | The speech rate (speed) scaling factor. The value 1.0 is a neutral value. |
| stress | float | The speech stress scaling factor. The value 1.0 is a neutral value. |
| volume | float | The speech volume scaling factor. The value 1.0 is a neutral value. |

VoiceInfo

Information about a voice, returned by the ListVoices function.

| Field | Type | Description |
|---|---|---|
| supported_languages | string (repeated) | The list of ISO 639-1 codes of languages supported by the voice. |
| name | string | The name of the voice. |
| gender | Gender | Gender of the voice. |
| age | Age | Age of the voice. |
| variants_count | int32 | The number of variants of the voice (at least one). |

LexiconInfo

The lexicon URI and its behaviour outside <lookup> tags, as returned by the ListLexicons function.

| Field | Type | Description |
|---|---|---|
| uri | string | URI of the lexicon, used as the uri attribute of <lexicon> tags in synthesis requests. |
| outside_lookup_behaviour | OutsideLookupBehaviour | Whether the lexicon can be selected for phrases outside of <lookup> SSML tags. |

1.3. Enumerations

Gender

Enum type, indicates the gender of the voice.

| Name | Number |
|---|---|
| FEMALE | 0 |
| MALE | 1 |

Age

Enum type, indicates the age of the voice.

| Name | Number | Description |
|---|---|---|
| ADULT | 0 | Selected for SSML age attribute in range (16 - 60]. Default. |
| CHILD | 1 | Selected for SSML age attribute in range [0 - 16]. |
| SENILE | 2 | Selected for SSML age attribute in range (60 - inf). |
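The interval boundaries above can be written out as a small mapping function. This is a sketch of the documented ranges only; the function name age_from_ssml is hypothetical and the Age class is a local stand-in for the generated enum.

```python
from enum import IntEnum


class Age(IntEnum):
    """Local stand-in with the documented numbers."""
    ADULT = 0
    CHILD = 1
    SENILE = 2


def age_from_ssml(age_years: float) -> Age:
    """Map an SSML age attribute value to the Age enum."""
    if age_years < 0:
        raise ValueError("age must not be negative")
    if age_years <= 16:   # [0 - 16]
        return Age.CHILD
    if age_years <= 60:   # (16 - 60]
        return Age.ADULT
    return Age.SENILE     # (60 - inf)
```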

AudioEncoding

Enum type, indicates the requested format of the response audio data.

| Name | Number | Description |
|---|---|---|
| PCM16 | 0 | Uncompressed 16-bit signed integer samples, without any header. |
| OGG_VORBIS | 1 | Ogg/Vorbis encoded data stream. |
| OGG_OPUS | 2 | Ogg/Opus encoded data stream. |
| A_LAW | 3 | ITU-T G.711 A-law encoded stream. |
| MU_LAW | 4 | ITU-T G.711 mu-law encoded stream. |

Note:
When using Ogg/Opus encoding, only 8kHz, 12kHz, 16kHz, 24kHz, and 48kHz sampling rates are allowed.
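A client may want to check this restriction before sending a request. The sketch below is one way to do so; the AudioEncoding class is a local stand-in for the generated enum, check_output_config is a hypothetical helper, and letting the value 0 (the default synthesizer rate) pass is an assumption, because the default rate is known only to the service.

```python
from enum import IntEnum


class AudioEncoding(IntEnum):
    """Local stand-in with the documented numbers."""
    PCM16 = 0
    OGG_VORBIS = 1
    OGG_OPUS = 2
    A_LAW = 3
    MU_LAW = 4


OPUS_SAMPLING_RATES_HZ = frozenset({8000, 12000, 16000, 24000, 48000})


def check_output_config(audio_encoding: int, sampling_rate_hz: int) -> None:
    """Raise ValueError if the sampling rate is not allowed for the encoding."""
    if sampling_rate_hz < 0:
        raise ValueError("sampling_rate_hz must not be negative")
    if sampling_rate_hz == 0:
        return  # assumption: the service falls back to its default rate
    if (AudioEncoding(audio_encoding) == AudioEncoding.OGG_OPUS
            and sampling_rate_hz not in OPUS_SAMPLING_RATES_HZ):
        raise ValueError(
            "Ogg/Opus allows only 8, 12, 16, 24 or 48 kHz, got %d Hz" % sampling_rate_hz
        )
```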

OutsideLookupBehaviour

Enum type, indicates whether the lexicon is allowed to be matched even for phrases outside of <lookup> SSML tags.

| Name | Number |
|---|---|
| ALLOWED | 0 |
| DISALLOWED | 1 |