UniAudio: Towards Universal Audio Generation with Large Language Models

Dongchao Yang*¹, Jinchuan Tian*², Xu Tan³, Rongjie Huang⁴, Songxiang Liu, HaoHan Guo¹, Xuankai Chang², Jiatong Shi², Sheng Zhao³, Jiang Bian³, Zhou Zhao⁴, Xixin Wu¹, Helen Meng¹
* Equal contribution
¹ Chinese University of Hong Kong  ² Carnegie Mellon University  ³ Microsoft Research Asia  ⁴ Zhejiang University

Introduction

Audio generation is a major branch of generative AI research. Whereas prior work in this area is commonly task-specific and relies on heavy domain knowledge, this paper advocates building universal audio generation models that can handle various tasks in a unified manner. As recent research on large language models (LLMs) has demonstrated their strong ability to handle multiple tasks, this work presents UniAudio, an LLM-based audio generation model that supports a wide range of audio generation tasks. Based on various input conditions, such as phonemes, text descriptions, or audio itself, UniAudio can generate speech, sound, music, and singing voice. UniAudio is built with 100k hours of multi-source, openly available audio data and is scaled to 1B parameters. The audio tokenization method and language model architecture are also specifically designed for both performance and efficiency. Experimentally, UniAudio supports 11 audio generation tasks and consistently achieves competitive results on all of them. We also show that UniAudio can support new tasks seamlessly via simple fine-tuning.

Overview

An overview of UniAudio is shown in the following picture. The rest of this page presents samples generated by our proposed method.

Zero-shot TTS

We first show some cases from the LibriTTS test-clean set.

Content (transcription of the target audio)	Prompt	Generated Speech	Ground Truth Speech
IT IS SIXTEEN YEARS SINCE JOHN BERGSON DIED
IF A LAYMAN IN GIVING BAPTISM POUR THE WATER BEFORE SAYING THE WORDS IS THE CHILD BAPTIZED
THAT IS ONE REASON YOU ARE OJO THE UNLUCKY SAID THE WOMAN IN A SYMPATHETIC TONE
THE DEWS WERE SUFFERED TO EXHALE AND THE SUN HAD DISPERSED THE MISTS AND WAS SHEDDING A STRONG AND CLEAR LIGHT IN THE FOREST WHEN THE TRAVELERS RESUMED THEIR JOURNEY
BY THIS TIME LORD CHELFORD AND WYLDER RETURNED AND DISGUSTED RATHER WITH MYSELF I RUMINATED ON MY WANT OF GENERAL SHIP

Cloning famous people’s voices

Here we use 3-second prompts from three famous people, Theresa May, Barack Obama, and Taylor Swift, and use their voices to read text randomly chosen from LibriTTS.

Name	Content (transcription of the target audio)	Prompt	Generated Speech
Barack Obama	YOUNG HAD BEEN COMMANDED TO HIS MOTHER’S CHAMBER SO SOON AS HE HAD COME OUT FROM HIS CONVERSE WITH THE SQUIRE
Barack Obama	I CANNOT ALLOW THE EXAMINATION TO BE HELD IF ONE OF THE PAPERS HAS BEEN TAMPERED WITH THE SITUATION MUST BE FACED
Barack Obama	MUCH LATER WHEN A FRIEND OF HIS WAS PREPARING AN EDITION OF ALL HIS LATIN WORKS HE REMARKED TO HIS HOME CIRCLE IF I HAD MY WAY ABOUT IT THEY WOULD REPUBLISH ONLY THOSE OF MY BOOKS WHICH HAVE DOCTRINE MY GALATIANS FOR INSTANCE
Barack Obama	GRAM ROUGHLY ONE TWENTY EIGHTH OF AN OUNCE
Theresa May	HE GAVE WAY TO THE OTHERS VERY READILY AND RETREATED UNPERCEIVED BY THE SQUIRE AND MISTRESS FITZOOTH TO THE REAR OF THE TENT
Taylor Swift	YOUNG HAD BEEN COMMANDED TO HIS MOTHER’S CHAMBER SO SOON AS HE HAD COME OUT FROM HIS CONVERSE WITH THE SQUIRE
Taylor Swift	REST AND BE STILL UNTIL I WARN YOU
Taylor Swift	THE COMBINED BANDS OF BOTH THE COUNTRIES PLAYED THE MUSIC AND A FINE SUPPER WAS SERVED

Cloning voices from daily life

In this part, we show some cases that clone our friends’ voices. We recorded the 3-second prompts directly with our smartphones (one with an iPhone, the other with a VIVO). The prompts are spoken in Chinese, and we expect the model to transfer the voices to English speech.

Name	Content	Prompt	Generated Speech
Girl	AND EMIL MOWED HIS WAY SLOWLY DOWN TOWARD THE CHERRY TREES
Girl	THE DEPARTURE WAS NOT AT ALL AGREEABLE
Girl	IT’S NOT PARTICULARLY RARE SHE SAID BUT SOME OF IT WAS MY MOTHER’S
Girl	WHAT I MEAN IS THAT I WANT YOU TO PROMISE NEVER TO SEE ME AGAIN NO MATTER HOW OFTEN I COME NO MATTER HOW HARD I BEG
Girl	DO YOU KNOW LAKE OH I REALLY CAN’T TELL BUT HE’LL SOON TIRE OF COUNTRY LIFE
Boy	THE EARTH IS NOT DEVOID OF RESEMBLANCE TO A JAIL
Boy	INDEED THERE WERE ONLY ONE OR TWO STRANGERS WHO COULD BE ADMITTED AMONG THE SISTERS WITHOUT PRODUCING THE SAME RESULT
Boy	ALSO THERE WAS A STRIPLING PAGE WHO TURNED INTO A MAID
Boy	I HAD A NAME I BELIEVE IN MY YOUNG DAYS BUT I HAVE FORGOTTEN IT SINCE I HAVE BEEN IN SERVICE

Long-sentence TTS

Content	Generated Speech	Ground Truth Speech
THE DYNAMO ELECTRIC MACHINE THOUGH SMALL WAS ROBUST FOR UNDER ALL THE VARYING SPEEDS OF WATER POWER AND THE VICISSITUDES OF THE PLANT TO WHICH IT BELONGED IT CONTINUED IN ACTIVE USE UNTIL EIGHTEEN NINETY NINE SEVENTEEN YEARS
EVERY ONE COULD OBSERVE HIS AGITATION AND PROSTRATION A PROSTRATION WHICH WAS INDEED THE MORE REMARKABLE SINCE PEOPLE WERE NOT ACCUSTOMED TO SEE HIM WITH HIS ARMS HANGING LISTLESSLY BY HIS SIDE HIS HEAD BEWILDERED AND HIS EYES WITH ALL THEIR BRIGHT INTELLIGENCE BEDIMMED

Zero-shot VC

Here we show some cases from VCTK. As in the previous section, voice conversion can also easily be performed with our own voice prompts.

Source	Prompt	Generated Speech	Ground Truth Speech

Zero-shot Singing Voice Synthesis

Content (lyrics)	Speaker Prompt	Generated Singing
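十年之前@我不认识你@你不属于我@我们还是一样@陪在一个陌生人左右 (Ten years ago, I did not know you, you did not belong to me, we were still the same, staying by a stranger's side)
好吧天亮之后总是潦草离场@清醒的人最荒唐 (All right, after daybreak we always leave in haste; the sober ones are the most absurd)
我们这些努力不简单@快乐炼成泪水@是一种勇敢 (These efforts of ours are not simple; turning joy into tears is a kind of courage)
空空留遗憾多难堪又为难 (Left with only empty regret, how embarrassing and awkward)
嗯纵容着任性的随意的放肆的轻易的 (Mm, indulging the willful, the casual, the unrestrained, the easy)
你说你好想带我回去你的家 (You say you really want to take me back to your home)
狼狈比失去难受我怀念的是无话不说 (Being in a sorry state hurts more than loss; what I miss is talking about everything)
我怀念的@是争吵以后还会想要爱你的冲动 (What I miss is the impulse to still want to love you after an argument)
我和我最后的倔强握紧双手绝对不放下一站 (Me and my last stubbornness, hands held tight, never letting go; next stop)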

Zero-shot Speech Enhancement

Noisy Speech	De-noised Speech	Ground Truth

Using UniAudio to denoise speech from a film

Here we take a movie clip from bilibili (https://www.bilibili.com/video/BV1UA411J7H2/?vd_source=e4a11ee6c459009bfa05833e70cd49c3). The clip is from “功夫” (Kung Fu Hustle), directed by and starring Stephen Chow in 2004 (strongly recommended!). Because the actors’ lines are easy to misunderstand if you have never seen the movie, we only show part of the content.

Noisy Speech	De-noised Speech

Zero-shot Target Speaker Extraction

Mixed Speech	Prompt	Extracted Speech	Ground Truth

Zero-shot Text-to-Sound

Instruction (The text description)	Generated Sound
Someone is running alone on a hardwood floor.
Repeated bursts of fireworks are accompanied by intermittent cheers and whistles of a crowd
A bird frequently vocalizing with a high pitched chirp.
Several musicians warm up on their instruments as people in the audience talk.
Birds chirp and people talk as cars go by.
After a train horn blows, the chugging of the engine increases.

20s audio generation

Instruction (The text description)	Generated Sound
An ambulance siren wails continue for more than twenty seconds.
A cricket is chirping as the wind is blowing.

Zero-shot Text-to-Music

Instruction (The text description)	Generated Music
The Pop song features harmonizing vocals singing over energetic crash cymbals, groovy bass guitar, wide electric guitar melody, punchy snare and soft kick hits. There is a short drum break at the very beginning of the loop. It sounds happy, nostalgic and euphoric, as it gives off vibes of a Christmas song.
This is a rap music piece played behind a rollerskating video. The sound of the skaters can be heard faintly throughout the recording. There is a male voice rapping at the forefront while other voices can be heard singing melodically in the background and ad-libbing occasionally. There is a mild keyboard playing the tune while a loud electronic drum beat is playing the rhythm. The atmosphere of this piece is groovy and urban.
The low quality recording features a live performance of fruity male vocal singing over funky piano melody and beat played on playback that consists of punchy snare and kick hits, shimmering hi hats and smooth bass. The crowd is also singing along, harmonizing with the lead vocal. It sounds emotional and groovy, even though the recording is noisy.
This is the recording of a gear showcase jam. There is an overdriven electric guitar playing a groovy solo with the amplified reverb effect. There is a crunchy sound. The piece can be used as the jingle of an advertisement. It could also be used for lifting electric guitar samples to be used in a beat.
The R&B song features a passionate male vocalist singing over a wide funky electric guitar melody, smooth bass guitar, punchy kick and snare hits, shimmering hi hats, snappy rimshots and soft crash cymbals. The rimshots are present in the first half of the loop, while a more energetic second part of the loop consists of punchy snare hits. It sounds emotional and heartfelt, as the vocal is slightly distorted.
The low quality recording features a traditional song that consists of wide, harmonizing female vocals singing over acoustic rhythm guitar. It sounds mellow, soft and emotional.
someone is playing a high pitched melody on a steel drum. The file is of poor audio-quality.
The pop rock music features a male voice singing. An electric guitar with a distortion effect on plays plays two chords every two measures. The drums play a strong rhythm and together with a synth bass drive the pulse of the music.
The low quality recording features a live performance of loud steel pans melody over playback instrumental that consists of shimmering hi hats, open hats and “4 on the floor” kick pattern. There are some crowd talking noises in the background. It sounds exotic and energetic.
This is a pop music piece. The words are being sung by two vocals: one male and one female which lead to a duet for the chorus. There is a banjo and an electric guitar playing the melody while a simple electronic drum beat provides the rhythmic background for the song. It is a slightly melodic and emotional song. This piece could be used in the soundtrack of a romantic drama during a flashback scene.

20s music generation

Instruction (The text description)	Generated Music
The music is purely instrumental and so it features no human voice. More gamelans are played but by using different techniques. Other metallic percussion instruments can be heard.
This song is a sweet duet. The tempo is medium with a melodious, intense piano accompaniment , electric guitar rhythm, steady drumming and synthesiser arrangements. This song is melodic, story telling, spirited, emotional, passionate and sweet. The lyrics are simple and so this song could be a Children’s Song.

Instructed TTS

Instruction (Control the generated samples)	Content	Generated Speech
A man speaks loudly and slowly with a bass tone	Degenerated into deadness and formality
The mournful madame says: loud, slow and high pitched	He flushed crimson
A man who speaks very slowly, his tone is very low, shehehe yowled loudly	Her blank gaze chilled you
A man said with a relatively low tone and saying a fast speed and breathe loudly	He sacrificed the vulgar prizes of life

Audio Edit

Instruction	Source Audio	Generated Audio
Add: loud roar of traffic in the background
Add: a helicopters’ engine is running in the background
Drop: birds chirping, wings flap
Drop: a motorcycle idles nearby at moderate speed
Super resolution: A clock is ringing a bell
Super resolution: A clock is chiming
Super resolution: Man speaking with water sounds

Speech dereverberation

Reverberant Speech	De-reverberated Speech	Ground Truth

Speech Edit

Content	Instruction	Generated Speech	Original Audio
“Please wait for me, Marie,” Emil coaxed.	delete the word “for”
He is called, as you know, the apostle of the Indies.	replace the word “called” with “named”
It is thou that must tell me!	delete the word “tell”
Bartley leaned over her shoulder, without touching her, and whispered in her	replace the word “shoulder” with “feet”
He had three flies on his cast, and, because in these waters there was always	insert the word “large” after “waters”
They could see that a conflict meant serious results	insert the word “many” after “meant”
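
Each instruction in the table above implies a target transcript for the edited speech. As a rough illustration only, the sketch below applies the three instruction types (delete, replace, insert) to a transcript at the text level. It says nothing about how UniAudio edits the audio itself, and the helper names `delete_word`, `replace_word`, and `insert_after` are hypothetical, introduced here for illustration.

```python
import string

# Punctuation stripped when matching a word (ASCII plus curly quotes).
PUNCT = string.punctuation + "“”‘’"


def _core(token):
    """Return the token without surrounding punctuation."""
    return token.strip(PUNCT)


def _find(tokens, word):
    """Index of the first token whose core matches `word` (case-insensitive)."""
    for i, tok in enumerate(tokens):
        if _core(tok).lower() == word.lower():
            return i
    raise ValueError(f"word {word!r} not found")


def delete_word(text, word):
    """Remove the first occurrence of `word` (with any attached punctuation)."""
    tokens = text.split()
    del tokens[_find(tokens, word)]
    return " ".join(tokens)


def replace_word(text, old, new):
    """Replace the first occurrence of `old` with `new`, keeping punctuation."""
    tokens = text.split()
    i = _find(tokens, old)
    tokens[i] = tokens[i].replace(_core(tokens[i]), new, 1)
    return " ".join(tokens)


def insert_after(text, anchor, new):
    """Insert `new` as a separate word right after the first `anchor`."""
    tokens = text.split()
    tokens.insert(_find(tokens, anchor) + 1, new)
    return " ".join(tokens)


if __name__ == "__main__":
    print(delete_word("It is thou that must tell me!", "tell"))
    print(replace_word("He is called, as you know, the apostle of the Indies.",
                       "called", "named"))
    print(insert_after("They could see that a conflict meant serious results",
                       "meant", "many"))
```

Matching is done on whitespace-separated tokens with punctuation stripped, so “called,” is matched by “called” and its trailing comma is preserved on replacement.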

Chinese TTS

Content	Generated Speech
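你投了票的 (You did vote)
未来在吸引球迷和赞助商以及推广乒乓球运动方面 (In the future, in terms of attracting fans and sponsors and promoting table tennis)
壤塘县的乡镇有什么 (What townships are there in Rangtang County)
但并不是每家公司的所有产品都会被影响 (But not all products of every company will be affected)
恐怖片的电影有什么 (What horror movies are there)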