Code

https://github.com/Thommy96/IMS-Toucan

Audio Samples

Baseline - TTS system without additional conditioning prompt

The baseline TTS system generates speech directly from the input text, without any additional conditioning prompt. It serves as the reference point for comparison.

Proposed - TTS system additionally conditioned on natural language prompts

The proposed TTS system is additionally conditioned on a natural language prompt to control prosody. The goal is more expressive and contextually appropriate speech: the prosody of the generated speech is expected to follow the (emotional) content of the prompt.

Using the Input Text as Prompt

Emotion | Input Sentence | Baseline | Proposed
Anger | You can't be serious, how dare you not tell me you were going to marry her? | (audio) | (audio)
Joy | I really enjoy the beach in the summer. | (audio) | (audio)
Neutral | You can go to the Employment Development Office and pick it up. | (audio) | (audio)
Sadness | Lily broke up with me last week, in fact, she dumped me. | (audio) | (audio)
Surprise | He was astonished when he saw them come alone, and asked what had happened to them. | (audio) | (audio)

Using a Different Prompt

Emotion | Prompt | Input Sentence | Proposed
Anger | You can't be serious, how dare you not tell me you were going to marry her? | Lily broke up with me last week, in fact, she dumped me. | (audio)
Joy | I really enjoy the beach in the summer. | You can go to the Employment Development Office and pick it up. | (audio)
Neutral | You can go to the Employment Development Office and pick it up. | You can't be serious, how dare you not tell me you were going to marry her? | (audio)
Sadness | Lily broke up with me last week, in fact, she dumped me. | He was astonished when he saw them come alone, and asked what had happened to them. | (audio)
Surprise | He was astonished when he saw them come alone, and asked what had happened to them. | I really enjoy the beach in the summer. | (audio)
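
The prompt and input sentence pairs in this table can be written out as data, which makes the mismatch between prompt emotion and sentence content explicit. The following is only an illustrative sketch: the names `SENTENCES`, `INPUT_FOR_PROMPT`, and `build_pairs` are hypothetical and are not taken from the IMS-Toucan repository. The pairing itself is read directly off the rows above.

```python
# Illustrative sketch only: these names are hypothetical, not part of IMS-Toucan.

# The five sentences used on this page, keyed by their emotion label.
SENTENCES = {
    "Anger": "You can't be serious, how dare you not tell me you were going to marry her?",
    "Joy": "I really enjoy the beach in the summer.",
    "Neutral": "You can go to the Employment Development Office and pick it up.",
    "Sadness": "Lily broke up with me last week, in fact, she dumped me.",
    "Surprise": "He was astonished when he saw them come alone, and asked what had happened to them.",
}

# For each prompt emotion, the emotion label of the sentence that is spoken.
# This mapping is copied from the rows of the table above.
INPUT_FOR_PROMPT = {
    "Anger": "Sadness",
    "Joy": "Neutral",
    "Neutral": "Anger",
    "Sadness": "Surprise",
    "Surprise": "Joy",
}


def build_pairs():
    """Return (prompt_emotion, prompt, input_text) rows for the mismatched-prompt setting."""
    return [
        (emotion, SENTENCES[emotion], SENTENCES[INPUT_FOR_PROMPT[emotion]])
        for emotion in SENTENCES
    ]


for emotion, prompt, text in build_pairs():
    print(f"{emotion}: prompt={prompt!r} -> input={text!r}")
```

In every row the prompt differs from the spoken sentence, so any emotional coloring heard in the sample comes from the prompt rather than from the words of the input text.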