Back to Ear: Perceptually Driven High Fidelity Music Reconstruction

Kangdi Wang¹, Zhiyue Wu², Dinghao Zhou³, Rui Lin, Junyu Dai*, Tao Jiang

ϵar-LAB

Abstract

Variational Autoencoders (VAEs) are essential for large-scale audio tasks like diffusion-based generation. However, existing open-source models often neglect auditory perceptual aspects during training, leading to weaknesses in phase accuracy and stereophonic spatial representation. To address these challenges, we propose ϵar-VAE, an open-source music signal reconstruction model that rethinks and optimizes the VAE training paradigm.

Our contributions are threefold:
(i) A K-weighting perceptual filter applied prior to loss calculation to align the objective with auditory perception.
(ii) Two novel phase losses: a Correlation Loss for stereo coherence, and a Phase Loss that supervises the derivatives of phase (Instantaneous Frequency along time and Group Delay along frequency) for precision.
(iii) A new spectral supervision paradigm where magnitude is supervised by all four MSLR (Mid/Side/Left/Right) components, while phase is supervised only by the LR components.
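
To make contributions (ii) and (iii) concrete, the following is a minimal NumPy sketch of how such terms could be computed. The abstract gives no formulas, so everything here is an assumption made for illustration: the helper names (`mslr`, `stft`, `phase_derivatives`, `stereo_correlation`, `sketch_losses`), the 0.5 Mid/Side scaling, the L1 distances, the STFT settings, and the zero-lag correlation. It is not the authors' implementation, and the K-weighting prefilter of contribution (i) is left out.

```python
import numpy as np

def mslr(left, right):
    """Mid/Side/Left/Right components (assumed 0.5 scaling for M and S)."""
    return {"M": 0.5 * (left + right), "S": 0.5 * (left - right),
            "L": left, "R": right}

def stft(x, n_fft=1024, hop=256):
    """Hann-windowed STFT with shape (frames, bins); hypothetical helper."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def wrap(phi):
    """Wrap phase values into (-pi, pi]."""
    return np.angle(np.exp(1j * phi))

def phase_derivatives(spec):
    """Finite-difference phase derivatives of an STFT."""
    phase = np.angle(spec)
    inst_freq = wrap(np.diff(phase, axis=0))     # along time (frames)
    group_delay = -wrap(np.diff(phase, axis=1))  # along frequency (bins)
    return inst_freq, group_delay

def stereo_correlation(left, right, eps=1e-12):
    """Normalized zero-lag correlation between channels, in [-1, 1]."""
    den = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2)) + eps
    return np.sum(left * right) / den

def sketch_losses(ref_lr, est_lr):
    """Illustrative loss terms for (left, right) reference and estimate."""
    ref, est = mslr(*ref_lr), mslr(*est_lr)

    # Magnitude is compared on all four components: M, S, L, R.
    magnitude = np.mean([
        np.mean(np.abs(np.abs(stft(est[k])) - np.abs(stft(ref[k]))))
        for k in ("M", "S", "L", "R")
    ])

    # Phase derivatives are compared on L and R only.
    phase = 0.0
    for k in ("L", "R"):
        if_ref, gd_ref = phase_derivatives(stft(ref[k]))
        if_est, gd_est = phase_derivatives(stft(est[k]))
        phase += np.mean(np.abs(wrap(if_est - if_ref)))
        phase += np.mean(np.abs(wrap(gd_est - gd_ref)))

    # Stereo coherence: difference of inter-channel correlations.
    correlation = abs(stereo_correlation(*est_lr) - stereo_correlation(*ref_lr))
    return {"magnitude": float(magnitude), "phase": float(phase),
            "correlation": float(correlation)}
```

In this sketch, inverting the polarity of one channel leaves the Left and Right magnitude spectra unchanged but swaps the Mid and Side content, so the Mid/Side magnitude terms and the correlation term both react to an error that Left/Right magnitude alone would miss.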

Experiments show that ϵar-VAE, operating at 44.1 kHz, substantially outperforms leading open-source models across diverse metrics, with particular strength in reconstructing high-frequency harmonics and spatial characteristics.