Robots today have come a long way from their early inception as insentient beings meant primarily for mechanical assistance to humans. Today, they can assist us intellectually and even emotionally, getting ever better at mimicking conscious humans. An integral part of this ability is the use of speech to communicate with the user (smart assistants such as Google Home and Amazon Echo are notable examples). Despite these remarkable developments, they still do not sound very “human.”
This is where voice conversion (VC) comes in. A technology used to change the speaker identity from one to another without altering the linguistic content, VC can make human-machine communication sound more “natural” by changing the non-linguistic information, such as adding emotion to speech. “Besides linguistic information, non-linguistic information is also important for natural (human-to-human) communication. In this regard, VC can actually help people be more sociable since they can get more information from speech,” explains Prof. Masato Akagi from Japan Advanced Institute of Science and Technology (JAIST), who works on speech perception and speech processing.
Speech, however, can occur in a multitude of languages (for example, on a language-learning platform), and often we might need a machine to act as a speech-to-speech translator. In this case, a conventional VC model experiences several drawbacks, as Prof. Akagi and his doctoral student at JAIST, Tuan Vu Ho, discovered when they tried to apply their monolingual VC model to a “cross-lingual” VC (CLVC) task. For one, changing the speaker identity led to an undesirable modification of linguistic information. Moreover, their model did not account for cross-lingual differences in “F0 contour,” which is an important quality for speech perception, with F0 referring to the fundamental frequency at which vocal cords vibrate in voiced sounds. It also did not guarantee the desired speaker identity for the output speech.
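To make the notion of F0 concrete, here is a minimal sketch in Python that estimates the fundamental frequency of one synthetic “voiced” frame by picking the strongest autocorrelation peak. It is not the method used in the study. The function names, the 8 kHz sample rate, the 60 to 400 Hz search range, and the three-harmonic test signal are all assumptions chosen for illustration.

```python
import math


def synth_voiced(f0_hz, sample_rate=8000, duration_s=0.05):
    """Toy 'voiced' frame (hypothetical): a fundamental plus two weaker harmonics."""
    n = int(sample_rate * duration_s)
    return [
        math.sin(2 * math.pi * f0_hz * t / sample_rate)
        + 0.5 * math.sin(2 * math.pi * 2 * f0_hz * t / sample_rate)
        + 0.25 * math.sin(2 * math.pi * 3 * f0_hz * t / sample_rate)
        for t in range(n)
    ]


def estimate_f0(frame, sample_rate=8000, fmin=60.0, fmax=400.0):
    """Estimate F0 as sample_rate / lag, where lag maximizes the autocorrelation.

    Only lags corresponding to the assumed range [fmin, fmax] are searched.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame with itself shifted by `lag`.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


if __name__ == "__main__":
    for true_f0 in (100.0, 200.0):
        print(true_f0, estimate_f0(synth_voiced(true_f0)))
```

Running the estimator frame by frame over an utterance would give a sequence of F0 values over time, which is what the article calls the F0 contour. Real speech would also need voiced/unvoiced detection and a more robust pitch tracker than this sketch.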
Now, in a new study published in IEEE Access, the researchers have proposed a new model suitable for CLVC that allows for both voice mimicking and control of the speaker identity of the generated speech, marking a significant improvement over their previous VC model.
Specifically, the new model applies language embedding (mapping natural language text, such as words and phrases, to mathematical representations) to separate languages from speaker individuality, along with F0 modeling that gives control over the F0 contour. Additionally, it adopts a deep learning-based training model called a star generative adversarial network, or StarGAN, in addition to their previously used variational autoencoder (VAE) model. Roughly put, a VAE model takes in an input, converts it into a smaller and dense representation, and converts it back to the original input, whereas a StarGAN uses two competing networks that push each other to generate improved iterations until the output samples are indistinguishable from natural ones.
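One way to picture keeping language apart from speaker individuality is to give each its own embedding table and hand both vectors to the decoder as separate parts of a conditioning vector. The sketch below is an illustrative assumption, not the architecture from the study. The table names, the language and speaker labels, and the 4-dimensional random vectors are hypothetical, and in a trained model the vectors would be learned rather than drawn at random.

```python
import random


def make_embedding_table(names, dim, seed):
    """Hypothetical lookup table: one fixed random vector per name.

    In a real model these vectors would be trainable parameters.
    """
    rng = random.Random(seed)
    return {name: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for name in names}


# Assumed labels, for illustration only.
LANGUAGE_EMB = make_embedding_table(["en", "ja"], dim=4, seed=0)
SPEAKER_EMB = make_embedding_table(["speaker_a", "speaker_b"], dim=4, seed=1)


def conditioning_vector(language, speaker):
    """Concatenate the language code and the speaker code (list concatenation)."""
    return LANGUAGE_EMB[language] + SPEAKER_EMB[speaker]


if __name__ == "__main__":
    source = conditioning_vector("ja", "speaker_a")
    target = conditioning_vector("ja", "speaker_b")
    # Swapping the speaker leaves the language part (first 4 values) untouched.
    print(source[:4] == target[:4], source[4:] == target[4:])
```

Because the two codes occupy separate slots, the speaker part can be swapped while the language part stays fixed, which is the kind of independent control the article describes.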
The researchers showed that their model could be trained in an end-to-end fashion with direct optimization of the language embedding during training, and that it allowed good control of speaker identity. The F0 conditioning also helped remove the language dependence of speaker individuality, which enhanced this controllability.
The results are exciting, and Prof. Akagi envisions several future prospects for their CLVC model. “Our findings have direct applications in protection of speaker’s privacy by anonymizing one’s identity, adding sense of urgency to speech during an emergency, post-surgery voice restoration, cloning of voices of historical figures, and reducing the production cost of audiobooks by creating different voice characters, to name a few,” he comments. He intends to further improve upon the controllability of speaker identity in future research.
Perhaps the day is not far when smart devices start sounding even more like humans.
Tuan Vu Ho et al, Cross-Lingual Voice Conversion With Controllable Speaker Individuality Using Variational Autoencoder and Star Generative Adversarial Network, IEEE Access (2021). DOI: 10.1109/ACCESS.2021.3063519
Japan Advanced Institute of Science and Technology
Sounds familiar: A speaker identity-controllable framework for machine speech translation (2021, April 26)
retrieved 27 April 2021