Controllable Multi-Speaker Emotional Speech Synthesis With Emotion Representation of High Generalization Capability
1. TTS Samples (English) on Cross-speaker Emotion Transfer on the ESD emotional speech dataset
Emotion | Reference Audio | Target Speaker | The Proposed Model (ours) | Mspk-GST | Mspk-VAE | EDM |
Angry | | | | | | |
Happy | | | | | | |
Surprise | | | | | | |
Neutral | | | | | | |
Sad | | | | | | |
2. TTS Samples (Mandarin) on Cross-speaker Emotion Transfer on the ESD emotional speech dataset
Emotion | Reference Audio | Target Speaker | The Proposed Model (ours) | Mspk-GST | Mspk-VAE | EDM |
Angry | | | | | | |
Happy | | | | | | |
Surprise | | | | | | |
Neutral | | | | | | |
Sad | | | | | | |
3. TTS Samples (Mandarin) on Cross-speaker Emotion Transfer on the DOE emotional speech dataset
Emotion | Reference Audio | Target Speaker | The Proposed Model (ours) | Mspk-GST | Mspk-VAE | EDM |
Angry | | | | | | |
Happy | | | | | | |
Surprise | | | | | | |
Sad | | | | | | |
4. Extra samples on controllable emotional speech synthesis
Emotion | Target Speaker | Weak | Medium | Strong |
Angry | | | | |
Happy | | | | |
Surprise | | | | |
Sad | | | | |