I am a PhD student at The Chinese University of Hong Kong, working on speech and audio processing under the supervision of Prof. Helen Meng. Before that, I received my Master's degree from Peking University in 2023.
My research focuses on developing agents that can communicate with humans, e.g., understanding human speech and environmental sounds, and then producing feedback in response.
A unified audio foundation model capable of generating speech, music, sound effects, and more within a single framework. One of only three audio papers highlighted in the Stanford AI Index Report 2024, alongside Google's MusicLM and Meta's MusicGen.
The first text-to-audio generation work. Proposes a discrete diffusion model to generate diverse, high-quality sound effects directly from text descriptions, opening up this research direction.
The first non-autoregressive TTS system that supports large-scale speech training without frame-level annotations. Uses scalar latent transformer diffusion models to achieve high-quality speech synthesis with significantly reduced model complexity and inference cost.
The first Chinese TTS system to support speech style generation controlled by natural language prompts. Enables intuitive and flexible control over speaking style, emotion, and prosody through simple text instructions.
Demonstrates that an LLM-driven audio codec can serve as a powerful few-shot learner across diverse audio tasks, unifying audio understanding and generation through a codec-based approach.
Proposes a new codec paradigm that achieves an extremely low bitrate while preserving rich semantic information. It is specifically designed for audio language modeling, enabling more efficient audio-LLM integration.
* denotes equal contributions.