Back to Ear: Perceptually Driven High Fidelity Music Reconstruction
Abstract
Variational Autoencoders (VAEs) are essential for large-scale audio tasks like diffusion-based generation. However, existing open-source models often neglect auditory perceptual aspects during training, leading to weaknesses in phase accuracy and stereophonic spatial representation. To address these challenges, we propose ϵar-VAE, an open-source music signal reconstruction model that rethinks and optimizes the VAE training paradigm.
Our contributions are threefold:
(i) A K-weighting perceptual filter applied prior to loss calculation to align the objective with auditory perception.
(ii) Two novel phase losses: a Correlation Loss for stereo coherence, and a Phase Loss that supervises the phase derivatives (Instantaneous Frequency and Group Delay) for precision.
(iii) A new spectral supervision paradigm where magnitude is supervised by all four MSLR (Mid/Side/Left/Right) components, while phase is supervised only by the LR components.
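The quantities named in contributions (ii) and (iii) can be illustrated with a minimal NumPy sketch. This is not the ϵar-VAE training code: the 0.5 Mid/Side scaling, the zero-lag correlation measure, the finite-difference definitions of Instantaneous Frequency and Group Delay, and all function names are assumptions chosen for illustration.

```python
import numpy as np


def lr_to_ms(left, right):
    """Mid/Side from Left/Right using a 0.5 scaling (an assumed convention)."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side


def stereo_correlation(left, right, eps=1e-12):
    """Zero-lag normalized correlation between the two channels, in [-1, 1]."""
    num = np.sum(left * right)
    den = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2)) + eps
    return num / den


def wrap_phase(x):
    """Wrap angles to the principal interval [-pi, pi)."""
    return (x + np.pi) % (2.0 * np.pi) - np.pi


def phase_derivatives(phase):
    """Finite-difference phase derivatives of a spectrogram phase array.

    phase has shape [freq_bins, frames].
    Instantaneous Frequency: wrapped phase difference along time.
    Group Delay: negative wrapped phase difference along frequency.
    """
    inst_freq = wrap_phase(np.diff(phase, axis=1))
    group_delay = -wrap_phase(np.diff(phase, axis=0))
    return inst_freq, group_delay
```

Under these assumptions, a Correlation Loss could compare `stereo_correlation` of the reference and the reconstruction, and a Phase Loss could compare the two outputs of `phase_derivatives` computed on the L and R spectrograms, while the magnitude terms would also include the `mid` and `side` signals.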
Experiments show that ϵar-VAE at 44.1 kHz substantially outperforms leading open-source models across diverse metrics, with particular strength in reconstructing high-frequency harmonics and spatial characteristics.

Results on the MuChin and in-house validation splits. Bold marks the best performance and underline marks the second best.

THD+N test based on the AES17 standard for different codec models. Our model shows comparatively more reasonable harmonic patterns on music.

K-weighting vs A-weighting & CQT vs STFT comparison for music signals.
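To make the difference between the two weighting curves concrete, the sketch below evaluates both gains in dB at a given frequency. The K-weighting biquad coefficients are the ones ITU-R BS.1770 tabulates for a 48 kHz sampling rate, and the A-weighting expression is the standard analytic curve from IEC 61672. Using the 48 kHz table here is an assumption for illustration: the model itself runs at 44.1 kHz, where the coefficients would have to be re-derived. The function names are hypothetical.

```python
import numpy as np

FS_K = 48000.0  # BS.1770 tabulates its filter coefficients at 48 kHz

# Stage 1: high-shelf "pre-filter"; Stage 2: RLB high-pass (BS.1770, 48 kHz).
K_SHELF_B = (1.53512485958697, -2.69169618940638, 1.19839281085285)
K_SHELF_A = (1.0, -1.69065929318241, 0.73248077421585)
K_HPF_B = (1.0, -2.0, 1.0)
K_HPF_A = (1.0, -1.99004745483398, 0.99007225036621)


def _biquad_gain_db(b, a, freq_hz, fs):
    """Magnitude response in dB of one biquad, evaluated on the unit circle."""
    w = 2.0 * np.pi * np.asarray(freq_hz, dtype=float) / fs
    z1 = np.exp(-1j * w)  # z^-1
    h = (b[0] + b[1] * z1 + b[2] * z1 ** 2) / (a[0] + a[1] * z1 + a[2] * z1 ** 2)
    return 20.0 * np.log10(np.abs(h))


def k_weighting_db(freq_hz):
    """K-weighting gain in dB: the two cascaded stages add in dB."""
    return (_biquad_gain_db(K_SHELF_B, K_SHELF_A, freq_hz, FS_K)
            + _biquad_gain_db(K_HPF_B, K_HPF_A, freq_hz, FS_K))


def a_weighting_db(freq_hz):
    """Analytic A-weighting gain in dB, normalized to 0 dB at 1 kHz."""
    f2 = np.asarray(freq_hz, dtype=float) ** 2
    ra = (12194.0 ** 2 * f2 ** 2) / (
        (f2 + 20.6 ** 2)
        * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    return 20.0 * np.log10(ra) + 2.0
```

Evaluating both functions over a frequency grid shows the qualitative contrast: A-weighting attenuates low frequencies strongly and also rolls off the top octave, whereas K-weighting applies a gentle high-pass and a shelf boost of roughly 4 dB in the upper range.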
Audio Reconstruction Comparison
Compare the original audio with our ϵar-VAE reconstructions.
Sample 1: C-Pop
Original Audio
ϵar-VAE Reconstruction
Sample 2: Hip-Hop
Original Audio
Click play to hear the original recording
ϵar-VAE Reconstruction
Compare with our high-fidelity reconstruction
Sample 3: Rock
Original Audio
Note the high-frequency harmonics
ϵar-VAE Reconstruction
Excellent preservation of spatial characteristics
For the best comparison of audio quality, listen on a stereo monitoring system.
Back to Ear: Reconstruction Comparison
Compare reconstructions from different popular audio models on the same music sample.
So that loudness differences do not bias the comparison, all files are normalized to the same perceived loudness level (-14 LUFS-I).
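The arithmetic behind this kind of loudness matching is a single static gain per file. The sketch below assumes the integrated loudness of each file has already been measured with a BS.1770-compliant meter; the function names and the returned peak check are hypothetical helpers, not the tooling used to prepare these samples.

```python
import numpy as np

TARGET_LUFS = -14.0  # integrated loudness target stated on this page


def gain_to_target(measured_lufs, target_lufs=TARGET_LUFS):
    """Return (gain in dB, linear gain) that moves measured loudness to the target."""
    gain_db = target_lufs - measured_lufs
    return gain_db, 10.0 ** (gain_db / 20.0)


def normalize(audio, measured_lufs, target_lufs=TARGET_LUFS):
    """Apply the static gain and report the resulting sample peak.

    audio: float array of samples (any shape); measured_lufs: its integrated
    loudness as reported by an external meter (assumed input).
    """
    gain_db, gain = gain_to_target(measured_lufs, target_lufs)
    out = audio * gain
    peak = float(np.max(np.abs(out))) if out.size else 0.0
    return out, gain_db, peak
```

A file measured at -20 LUFS would receive +6 dB of gain. If the reported peak exceeds 1.0, the file would clip at full scale and would need limiting or a lower common target.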
Original Audio (Reference)
Ground truth reference for comparison
BigVGAN V2
Encodec
Stable Audio Open
AudioGen
ϵar-VAE (Ours)
Audio Timeline with Key Perceptual Events
Vocal sibilance "So" and plosive "Ko"
Strings positioning and "Ring-Like" harmonic series
Vocal distortion in the mid-high frequencies and the "airy" feeling above 10 kHz
Attack & transient of Piano Strikes on both sides
Vocal sibilance ("ess" sounds) and the punchy attack of kick and snare as the energy of each part accumulates in the high frequencies
Vocal "metallic" distortion in the mid-high frequencies
The "swooshing" effect sweeping from left to right
Click on the colored markers to jump to specific timestamps across all audio players