CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents

Adhiraj Banerjee · Vipul Arora

Abstract

Text-guided sound separation enables flexible audio editing and assistive applications, but existing open-domain systems such as AudioSep remain too compute-intensive for low-latency edge or codec-mediated deployment. Neural audio codec (NAC)-based separators such as CodecFormer and SDCodec are more efficient, but they are largely restricted to fixed-class or fixed-stem separation. We introduce \textbf{CodecSep}, a \emph{text-guided universal sound separation} framework that operates directly in neural audio codec latent space. CodecSep combines a frozen DAC backbone with a lightweight Transformer \emph{masker} conditioned by CLAP-derived FiLM parameters, enabling open-vocabulary source extraction while preserving the efficiency advantages of codec-native representations. To our knowledge, this is the first prompt-driven universal sound separation system built directly on NAC latents. Across \textbf{dnr-v2} and five additional open-domain benchmarks under matched training and prompting protocols, CodecSep consistently improves over AudioSep in separation fidelity (\textbf{SI\mbox{-}SDR}) while remaining competitive in perceptual quality (\textbf{ViSQOL}), and also shows gains in human \textbf{MOS--LQS}. Further analyses show that finer-grained semantic supervision improves separation more consistently than coarse prompting, and that \emph{explicit masking} is more effective than decoder-style latent generation for codec-domain source separation. Qualitative and diagnostic analyses further support the central design premise: modern NAC latents preserve meaningful \emph{source-dependent structure}, and the learned masks exploit this structure primarily through \emph{channel-wise modulation}, indicating that source extraction can be performed through masking alone without explicit latent generation. From a systems perspective, CodecSep also provides a concrete \emph{deployment path} for codec-mediated audio processing. 
In deployment-typical \emph{code-stream} settings, where the edge device transmits audio as NAC codes generated by the same codec backbone used by the separator, the server can map the received codes to codec embeddings through codebook lookup and perform separation directly in codec space, avoiding a separate decode--separate--re-encode cycle. In this regime, CodecSep requires only \textbf{1.35~GMACs} end-to-end, about $\mathbf{54\times}$ less compute than AudioSep in the same codec-mediated pipeline (and about $\mathbf{25\times}$ lower separator-only compute), while also substantially reducing latency and memory footprint and remaining fully compatible with \emph{codes-in / codes-out} operation. More broadly, this codes-in / codes-out formulation provides a concrete blueprint for \emph{codec-native downstream audio processing}, suggesting that tasks such as enhancement, denoising, dereverberation, and prompt-guided audio editing can be designed to operate directly on NAC representations rather than repeatedly decoding to waveform and re-encoding after each processing stage.
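To make the code-stream data flow concrete, the following is a minimal, illustrative sketch in plain Python, not the authors' implementation. It assumes a single toy codebook (real NACs such as DAC use residual vector quantization with several codebooks), and it replaces the Transformer masker with a bare FiLM step followed by a sigmoid. The names `codebook_lookup`, `film`, `separate`, and `quantize`, as well as the hand-picked `gamma` and `beta` values standing in for CLAP-derived parameters, are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Latent = List[Vector]  # indexed as [frame][channel]


def codebook_lookup(codes: List[int], codebook: List[Vector]) -> Latent:
    """Map received integer codes to embeddings, one vector per frame."""
    return [list(codebook[c]) for c in codes]


def film(features: Latent, gamma: Vector, beta: Vector) -> Latent:
    """FiLM: per-channel scale (gamma) and shift (beta) of every frame."""
    return [[g * x + b for x, g, b in zip(frame, gamma, beta)]
            for frame in features]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def separate(latent: Latent, gamma: Vector, beta: Vector) -> Latent:
    """Toy masker (assumption): FiLM, then a sigmoid mask applied to the
    mixture latent. The real system uses a Transformer here."""
    modulated = film(latent, gamma, beta)
    mask = [[sigmoid(v) for v in frame] for frame in modulated]
    return [[m * x for m, x in zip(mask_f, lat_f)]
            for mask_f, lat_f in zip(mask, latent)]


def quantize(latent: Latent, codebook: List[Vector]) -> List[int]:
    """Nearest-neighbour re-quantization so the output is again a code stream."""
    def dist2(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: dist2(frame, codebook[i]))
            for frame in latent]


# Usage: codes in, masked latent, codes out (all values are made up).
toy_codebook = [[1.0, -1.0], [0.8, 2.0], [0.0, 0.0], [1.0, 0.0]]
mixture_latent = codebook_lookup([0, 1, 2], toy_codebook)
# gamma/beta chosen by hand so the mask keeps channel 0 and suppresses channel 1.
target_latent = separate(mixture_latent, gamma=[0.0, 0.0], beta=[100.0, -100.0])
target_codes = quantize(target_latent, toy_codebook)
```

The point of the sketch is only the shape of the pipeline: no waveform is decoded or re-encoded at any step, and the prompt influences the result solely through channel-wise modulation of the mask.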