ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling

Introduction

Recent advancements in audio language models have underscored the pivotal role of audio tokenization, which converts audio signals into discrete tokens and thereby allows language model architectures to be applied to the audio domain. In this study, we introduce ALMTokenizer, a novel low-bitrate and semantically rich audio codec tokenizer for audio language models. Prior methods, such as Encodec, typically encode individual audio frames into discrete tokens without exploiting context information across frames. Unlike these methods, we introduce a novel query-based compression strategy that captures holistic information with a set of learnable query tokens by explicitly modeling the context information across frames. This design not only enables the codec model to capture more semantic information but also encodes the audio signal with fewer tokens. Additionally, to enhance the semantic information in audio codec models, we introduce the following: (1) a masked autoencoder (MAE) loss, (2) vector quantization based on semantic priors, and (3) an autoregressive (AR) prediction loss. As a result, ALMTokenizer achieves competitive reconstruction performance relative to state-of-the-art approaches while operating at a lower bitrate. Within the same audio language model framework, ALMTokenizer outperforms previous tokenizers in audio understanding and generation tasks.
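To make the link between token count and bitrate concrete, the sketch below computes the bitrate of a discrete audio tokenizer as frame rate × number of codebooks × bits per token. The function name `codec_bitrate_bps` and both example configurations are illustrative assumptions, not settings stated on this page; they were chosen only so that the results fall in the same range as the bitrates listed in the comparison tables.

```python
import math


def codec_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Return the bitrate in bits per second of a discrete audio tokenizer.

    Each frame emits `num_codebooks` tokens, and each token carries
    log2(codebook_size) bits.
    """
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token


# Hypothetical frame-level codec: many frames per second (assumed values).
frame_level_bps = codec_bitrate_bps(frame_rate_hz=75, num_codebooks=2, codebook_size=1024)

# Hypothetical compressed tokenizer: far fewer frames per second (assumed values).
compressed_bps = codec_bitrate_bps(frame_rate_hz=12.5, num_codebooks=3, codebook_size=2048)

print(f"frame-level: {frame_level_bps / 1000:.2f} kbps")
print(f"compressed:  {compressed_bps / 1000:.2f} kbps")
```

Under these assumed settings, lowering the frame rate reduces the bitrate even though the compressed configuration uses more codebooks and a larger codebook, which is the general effect that emitting fewer tokens per second has.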

Overview

The figure below gives an overview of ALMTokenizer.
[Figure: overview of ALMTokenizer]
The following sections present samples generated by the proposed method.

Audio Codec Tokenizer Reconstruction Comparison (Speech)

Original Speech	DAC (1.5kbps)	Encodec (1.5kbps)	WavTokenizer (0.48kbps)	StableCodec (0.4kbps)	SpeechTokenizer (1.5kbps)	Mimi (1.1kbps)	Mimi (0.41kbps)	Ours (0.41kbps)

Audio Codec Tokenizer Reconstruction Comparison (Sound)

Original Audio	DAC (1.5kbps)	Encodec (1.5kbps)	WavTokenizer (0.48kbps)	Ours (0.41kbps)

Audio Codec Tokenizer Reconstruction Comparison (Music)

Original Audio	DAC (1.5kbps)	Encodec (1.5kbps)	WavTokenizer (0.48kbps)	Ours (0.41kbps)

Text-to-Speech Generation

The following speech samples were generated by an audio language model using ALMTokenizer.

Content (transcription of the target audio)	Generated
cosette was no longer in rags she was in mourning.
then tom who had been stunned by some falling debris raised himself to a sitting position
this dressing should stand in the ice box four or five hours to become seasoned
he can’t stand the notion of any cruelty
on the general principles of art mister quilter writes with equal lucidity
the other voice snapped with a harsh urgency clearly used to command
when they entered the stage box on the left the first act was well under way the scene being the interior of a cabin in the south of ireland
sir harry towne bowed and said that he had met mister alexander and his wife in tokyo
hilda sat on the arm of it and put her hands lightly on his shoulders
the son of a virgin generated by the ineffable operation of the holy spirit was a creature without example or resemblance superior in every attribute of mind and body to the children of adam
i hear the t v going for a few minutes then pop turns it off and goes in the kitchen to talk to mom
yes my lord we should try in vain
who came next on the scene some people from the lobby
it has not been running since last night or it would be full of curious people all the time hustling to get a glimpse of this place
cautiously placing a hand against the rocks to steady himself tad wisely concluded that hereafter it would not pay to be too curious
tad is an experienced rider
what did he mean asked pyrrha
then she turned towards the quarter indicated and disappeared round the laurel bushes
and some of the birds who were attentive and careful soon saw how it was done and started nice homes for themselves
no battery in the whole four years war lost so many men in so short a time
randal waited a while in london on the chance that bennydeck might pay him a visit
the only true motive for putting poetry into a fresh language must be to endow a fresh nation as far as possible with one more possession of beauty

More Text-to-Audio Generation Results

The following audio samples were generated from text descriptions by an audio language model using ALMTokenizer.

Description	Generated
Someone is typing on a computer keyboard
A frog vocalizes as birds chirp
This catchy tune is a dynamic blend of pop and rock, featuring soaring guitar riffs, driving drums, and an infectious chorus that will have you dancing along in no time.
This alternative indie rock anthem is a fierce and empowering manifesto, blending the confrontational intensity of heavy metal and hard rock with a visceral, funk-infused sound, cathartic R&B-inspired vocal hooks and explosive alternative pop rock arrangements, all driven by a street smart attitude and a cool and cocky swagger that make it a truly revolutionary and volatile statement of artistic rebellion.
This upbeat pop rock anthem will get you moving and singing along in no time!
The song Grit and Grime captures the quintessential sound of the dirty south with its heavy bass, gritty vocals, and raw lyrics.
This contemporary pop rock album is an intense, confrontational and visceral musical journey that takes listeners on a freewheeling exploration of self-awareness, searching for freedom and reflection. With acerbic lyrics that are both cynical and sarcastic, the singer songwriter deals with themes of love and romance while also exploring more sentimental and poignant subjects through bittersweet ballads and fiery rocknroll anthems. The album combines amiable good-natured moments of relaxation and hanging out with heartfelt and earnest reflections on life, all delivered with an honest and passionate energy. From playful and humorous tracks to intimate and literate ballads, this cathartic and self-conscious album is a perfect mix of contemporary pop-rock, album rock, and singer-songwriter styles that showcase the artist’s range and depth.