The VALL-E model, developed by Microsoft to demonstrate its most recent work in text-to-speech AI, can replicate a person's voice with just a three-second audio sample.

As Ars Technica reports, the generated speech can match the speaker's timbre and emotional tone, and can even reproduce the acoustics of the recording space.

Though, like deepfakes, it carries a risk of abuse, it could one day power specialized or high-end text-to-speech applications.

Microsoft refers to VALL-E as a "neural codec language model." It is based on Meta's EnCodec, an AI-powered audio compression technology announced in October 2022.

Using an AI-powered compression neural-net encoder, VALL-E produces audio from text input combined with brief samples of the target speaker's voice.
