Controllable Multi-Speaker Emotional Speech Synthesis With Emotion Representation of High Generalization Capability
1. TTS Samples (English) on Cross-speaker Emotion Transfer on the ESD emotional speech dataset
| Emotion | Reference Audio | Target Speaker | The Proposed Model (ours) | Mspk-GST | Mspk-VAE | EDM |
| --- | --- | --- | --- | --- | --- | --- |
| Angry | | | | | | |
| Happy | | | | | | |
| Surprise | | | | | | |
| Neutral | | | | | | |
| Sad | | | | | | |
2. TTS Samples (Mandarin) on Cross-speaker Emotion Transfer on the ESD emotional speech dataset
| Emotion | Reference Audio | Target Speaker | The Proposed Model (ours) | Mspk-GST | Mspk-VAE | EDM |
| --- | --- | --- | --- | --- | --- | --- |
| Angry | | | | | | |
| Happy | | | | | | |
| Surprise | | | | | | |
| Neutral | | | | | | |
| Sad | | | | | | |
3. TTS Samples (Mandarin) on Cross-speaker Emotion Transfer on the DOE emotional speech dataset
| Emotion | Reference Audio | Target Speaker | The Proposed Model (ours) | Mspk-GST | Mspk-VAE | EDM |
| --- | --- | --- | --- | --- | --- | --- |
| Angry | | | | | | |
| Happy | | | | | | |
| Surprise | | | | | | |
| Sad | | | | | | |
4. Extra samples on controllable emotional speech synthesis
| Emotion | Target Speaker | Weak | Medium | Strong |
| --- | --- | --- | --- | --- |
| Angry | | | | |
| Happy | | | | |
| Surprise | | | | |
| Sad | | | | |