Audio generation is a major branch of generative AI research. Compared with prior works in this area that are commonly task-specific with heavy domain knowledge, this paper advocates building universal audio generation models that can handle various tasks in a unified manner.
As recent research on large language models (LLMs) has demonstrated their strong ability to handle multiple tasks, this work presents UniAudio, an LLM-based audio generation model that supports a wide range of audio generation tasks.
Based on various input conditions, such as phoneme, text description, or audio itself, UniAudio can generate speech, sound, music, and singing voice.
The proposed UniAudio is built with 100k hours of multi-source, openly available audio data and is scaled to 1B parameters. The audio tokenization method and language model architecture are also specifically designed for both performance and efficiency. Experimentally, UniAudio supports 11 audio generation tasks and achieves competitive results on all tasks consistently. We also show that UniAudio can support new tasks seamlessly via simple fine-tuning.
Overview
The following figure gives an overview of UniAudio.
Below, we show some samples generated by our proposed method.
Zero-shot TTS
We first show some cases from the LibriTTS test-clean set.
Content (the transcription of the target audio)
Prompt
Generated Speech
GT Speech
IT IS SIXTEEN YEARS SINCE JOHN BERGSON DIED
IF A LAYMAN IN GIVING BAPTISM POUR THE WATER BEFORE SAYING THE WORDS IS THE CHILD BAPTIZED
THAT IS ONE REASON YOU ARE OJO THE UNLUCKY SAID THE WOMAN IN A SYMPATHETIC TONE
THE DEWS WERE SUFFERED TO EXHALE AND THE SUN HAD DISPERSED THE MISTS AND WAS SHEDDING A STRONG AND CLEAR LIGHT IN THE FOREST WHEN THE TRAVELERS RESUMED THEIR JOURNEY
BY THIS TIME LORD CHELFORD AND WYLDER RETURNED AND DISGUSTED RATHER WITH MYSELF I RUMINATED ON MY WANT OF GENERAL SHIP
Cloning famous people’s voices
Here, we use 3-second prompts from three famous people (Theresa May, Barack Obama, and Taylor Swift) and use their voices to read text content randomly chosen from LibriTTS.
Name
Content (the transcription of the target audio)
Prompt
Generated Speech
Barack Obama
YOUNG HAD BEEN COMMANDED TO HIS MOTHER’S CHAMBER SO SOON AS HE HAD COME OUT FROM HIS CONVERSE WITH THE SQUIRE
Barack Obama
I CANNOT ALLOW THE EXAMINATION TO BE HELD IF ONE OF THE PAPERS HAS BEEN TAMPERED WITH THE SITUATION MUST BE FACED
Barack Obama
MUCH LATER WHEN A FRIEND OF HIS WAS PREPARING AN EDITION OF ALL HIS LATIN WORKS HE REMARKED TO HIS HOME CIRCLE IF I HAD MY WAY ABOUT IT THEY WOULD REPUBLISH ONLY THOSE OF MY BOOKS WHICH HAVE DOCTRINE MY GALATIANS FOR INSTANCE
Barack Obama
GRAM ROUGHLY ONE TWENTY EIGHTH OF AN OUNCE
Theresa May
HE GAVE WAY TO THE OTHERS VERY READILY AND RETREATED UNPERCEIVED BY THE SQUIRE AND MISTRESS FITZOOTH TO THE REAR OF THE TENT
Taylor Swift
YOUNG HAD BEEN COMMANDED TO HIS MOTHER’S CHAMBER SO SOON AS HE HAD COME OUT FROM HIS CONVERSE WITH THE SQUIRE
Taylor Swift
REST AND BE STILL UNTIL I WARN YOU
Taylor Swift
THE COMBINED BANDS OF BOTH THE COUNTRIES PLAYED THE MUSIC AND A FINE SUPPER WAS SERVED
Cloning voices from daily life
In this part, we show some cases that clone our friends’ voices. We recorded the 3-second prompts directly on our smartphones (one on an iPhone, the other on a VIVO phone). The prompts are spoken in Chinese, and we expect the model to transfer the voices into English.
Name
Content
Prompt
Generated Speech
Girl
AND EMIL MOWED HIS WAY SLOWLY DOWN TOWARD THE CHERRY TREES
Girl
THE DEPARTURE WAS NOT AT ALL AGREEABLE
Girl
IT’S NOT PARTICULARLY RARE SHE SAID BUT SOME OF IT WAS MY MOTHER’S
Girl
WHAT I MEAN IS THAT I WANT YOU TO PROMISE NEVER TO SEE ME AGAIN NO MATTER HOW OFTEN I COME NO MATTER HOW HARD I BEG
Girl
DO YOU KNOW LAKE OH I REALLY CAN’T TELL BUT HE’LL SOON TIRE OF COUNTRY LIFE
Boy
THE EARTH IS NOT DEVOID OF RESEMBLANCE TO A JAIL
Boy
INDEED THERE WERE ONLY ONE OR TWO STRANGERS WHO COULD BE ADMITTED AMONG THE SISTERS WITHOUT PRODUCING THE SAME RESULT
Boy
ALSO THERE WAS A STRIPLING PAGE WHO TURNED INTO A MAID
Boy
I HAD A NAME I BELIEVE IN MY YOUNG DAYS BUT I HAVE FORGOTTEN IT SINCE I HAVE BEEN IN SERVICE
Long-sentence TTS
Content
Generated Speech
GT Speech
THE DYNAMO ELECTRIC MACHINE THOUGH SMALL WAS ROBUST FOR UNDER ALL THE VARYING SPEEDS OF WATER POWER AND THE VICISSITUDES OF THE PLANT TO WHICH IT BELONGED IT CONTINUED IN ACTIVE USE UNTIL EIGHTEEN NINETY NINE SEVENTEEN YEARS
EVERY ONE COULD OBSERVE HIS AGITATION AND PROSTRATION A PROSTRATION WHICH WAS INDEED THE MORE REMARKABLE SINCE PEOPLE WERE NOT ACCUSTOMED TO SEE HIM WITH HIS ARMS HANGING LISTLESSLY BY HIS SIDE HIS HEAD BEWILDERED AND HIS EYES WITH ALL THEIR BRIGHT INTELLIGENCE BEDIMMED
Zero-shot VC
Below, we show some cases using VCTK. As in the previous section, we can easily use our own voice prompts for voice conversion.
Source
Prompt
Generated Speech
Ground Truth Speech
Zero-shot Singing Voice Synthesis
Content (lyrics of the song)
Speaker Prompt
Generated Singing
十年之前@我不认识你@你不属于我@我们还是一样@陪在一个陌生人左右
好吧天亮之后总是潦草离场@清醒的人最荒唐
我们这些努力不简单@快乐炼成泪水@是一种勇敢
空空留遗憾多难堪又为难
嗯纵容着任性的随意的放肆的轻易的
你说你好想带我回去你的家
狼狈比失去难受我怀念的是无话不说
我怀念的@是争吵以后还会想要爱你的冲动
我和我最后的倔强握紧双手绝对不放下一站
Zero-shot Speech Enhancement
Noisy Speech
De-noised Speech
Ground Truth
Using UniAudio to denoise speech from a film
Here, we use a movie clip from bilibili (https://www.bilibili.com/video/BV1UA411J7H2/?vd_source=e4a11ee6c459009bfa05833e70cd49c3). The actors’ lines are easy to misunderstand if you have never seen the movie “功夫” (Kung Fu Hustle, 2004), which was directed by and stars Stephen Chow (strongly recommended!), so we show only part of the content.
Noisy Speech
De-noised Speech
Zero-shot Target Speaker Extraction
Mixed Speech
Prompt
Extracted Speech
Ground Truth
Zero-shot Text-to-Sound
Instruction (The text description)
Generated Sound
Someone is running alone on a hardwood floor.
Repeated bursts of fireworks are accompanied by intermittent cheers and whistles of a crowd
A bird frequently vocalizing with a high pitched chirp.
Several musicians warm up on their instruments as people in the audience talk.
Birds chirp and people talk as cars go by.
After a train horn blows, the chugging of the engine Increases.
After a train horn blows, the chugging of the engine Increases.
20s audio generation
Instruction (The text description)
Generated Sound
An ambulance siren wails continue for more than twenty seconds.
A cricket is chirping as the wind is blowing.
Zero-shot Text-to-Music
Instruction (The text description)
Generated Music
The Pop song features harmonizing vocals singing over energetic crash cymbals, groovy bass guitar, wide electric guitar melody, punchy snare and soft kick hits. There is a short drum break at the very beginning of the loop. It sounds happy, nostalgic and euphoric, as it gives off vibes of a Christmas song.
This is a rap music piece played behind a rollerskating video. The sound of the skaters can be heard faintly throughout the recording. There is a male voice rapping at the forefront while other voices can be heard singing melodically in the background and ad-libbing occasionally. There is a mild keyboard playing the tune while a loud electronic drum beat is playing the rhythm. The atmosphere of this piece is groovy and urban.
The low quality recording features a live performance of fruity male vocal singing over funky piano melody and beat played on playback that consists of punchy snare and kick hits, shimmering hi hats and smooth bass. The crowd is also singing along, harmonizing with the lead vocal. It sounds emotional and groovy, even though the recording is noisy.
This is the recording of a gear showcase jam. There is an overdriven electric guitar playing a groovy solo with the amplified reverb effect. There is a crunchy sound. The piece can be used as the jingle of an advertisement. It could also be used for lifting electric guitar samples to be used in a beat.
The R&B song features a passionate male vocalist singing over a wide funky electric guitar melody, smooth bass guitar, punchy kick and snare hits, shimmering hi hats, snappy rimshots and soft crash cymbals. The rimshots are present in the first half of the loop, while a more energetic second part of the loop consists of punchy snare hits. It sounds emotional and heartfelt, as the vocal is slightly distorted.
The low quality recording features a traditional song that consists of wide, harmonizing female vocals singing over acoustic rhythm guitar. It sounds mellow, soft and emotional.
someone is playing a high pitched melody on a steel drum. The file is of poor audio-quality.
The pop rock music features a male voice singing. An electric guitar with a distortion effect on plays plays two chords every two measures. The drums play a strong rhythm and together with a synth bass drive the pulse of the music.
The low quality recording features a live performance of loud steel pans melody over playback instrumental that consists of shimmering hi hats, open hats and “4 on the floor” kick pattern. There are some crowd talking noises in the background. It sounds exotic and energetic.
This is a pop music piece. The words are being sung by two vocals: one male and one female which lead to a duet for the chorus. There is a banjo and an electric guitar playing the melody while a simple electronic drum beat provides the rhythmic background for the song. It is a slightly melodic and emotional song. This piece could be used in the soundtrack of a romantic drama during a flashback scene.
20s music generation
Instruction (The text description)
Generated Music
The music is purely instrumental and so it features no human voice. More gamelans are played but by using different techniques. Other metallic percussion instruments can be heard.
This song is a sweet duet. The tempo is medium with a melodious, intense piano accompaniment , electric guitar rhythm, steady drumming and synthesiser arrangements. This song is melodic, story telling, spirited, emotional, passionate and sweet. The lyrics are simple and so this song could be a Children’s Song.
Instructed TTS
Instruction (Control the generated samples)
Content
Generated Speech
A man speaks loudly and slowly with a bass tone
Degenerated into deadness and formality
The mournful madame says: loud, slow and high pitched
He flushed crimson
A man who speaks very slowly, his tone is very low, shehehe yowled loudly
Her blank gaze chilled you
A man said with a relatively low tone and saying a fast speed and breathe loudly
He sacrificed the vulgar prizes of life
Audio Edit
Instruction
Source Audio
Generated Audio
Add: loud roar of traffic in the background
Add: a helicopters’ engine is running in the background
drop: birds chirping, wings flap
Drop:a motorcycle idles nearby at moderate speed
super resolution: A clock is ringing a bell
super resolution: A clock is chiming
Super resolution: Man speaking with water sounds
Speech dereverberation
Reverberant Speech
De-reverberated Speech
Ground Truth
Speech Edit
Content
Instruction
Generated Speech
Original Audio
“Please wait for me, Marie,” Emil coaxed.
delete the word ‘for’
He is called, as you know, the apostle of the Indies.
replace the word “called” as “named”
It is thou that must tell me!
delete the word ‘tell’
Bartley leaned over her shoulder, without touching her, and whispered in her
replace the word “shoulder” as “feet”
He had three flies on his cast, and, because in these waters there was always
insert a word “large” after “waters”
They could see that a conflict meant serious results