Back to Ear: Perceptually Driven High Fidelity Music Reconstruction

ϵar-LAB
ICASSP 2026

Abstract

Variational Autoencoders (VAEs) are essential for large-scale audio tasks like diffusion-based generation. However, existing open-source models often neglect auditory perceptual aspects during training, leading to weaknesses in phase accuracy and stereophonic spatial representation. To address these challenges, we propose ϵar-VAE, an open-source music signal reconstruction model that rethinks and optimizes the VAE training paradigm.

Our contributions are threefold:
(i) A K-weighting perceptual filter applied prior to loss calculation to align the objective with auditory perception.
(ii) Two novel phase losses: a Correlation Loss for stereo coherence, and a Phase Loss that supervises the derivatives of phase (Instantaneous Frequency and Group Delay) for precision.
(iii) A new spectral supervision paradigm where magnitude is supervised by all four MSLR (Mid/Side/Left/Right) components, while phase is supervised only by the LR components.
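
The quantities named in contributions (ii) and (iii) can be illustrated with a small NumPy sketch. This is not the ϵar-VAE training code: the 0.5 mid/side scaling, the Hann-window STFT settings, and all function names are assumptions made for illustration, and the K-weighting filter from (i) is left out. It only shows how mid/side channels, a stereo correlation coefficient, and the two phase derivatives could be computed from a left/right pair.

```python
import numpy as np

def mid_side(left, right):
    """Mid/side decomposition (0.5 scaling is an assumed convention)."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid, side

def stereo_correlation(left, right, eps=1e-12):
    """Normalized inter-channel correlation, roughly in [-1, 1]."""
    num = np.sum(left * right)
    den = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2)) + eps
    return num / den

def stft(x, n_fft=512, hop=128):
    """Plain Hann-window STFT; returns an array of shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)

def wrap(phase):
    """Wrap phase values back into (-pi, pi]."""
    return np.angle(np.exp(1j * phase))

def phase_derivatives(spec):
    """Finite-difference phase derivatives of an STFT.

    Instantaneous frequency: wrapped phase difference along time.
    Group delay: negative wrapped phase difference along frequency.
    """
    phase = np.angle(spec)
    inst_freq = wrap(np.diff(phase, axis=0))
    group_delay = -wrap(np.diff(phase, axis=1))
    return inst_freq, group_delay

# Usage with a synthetic stereo signal (random noise, illustration only).
rng = np.random.default_rng(0)
left = rng.standard_normal(2048)
right = rng.standard_normal(2048)
mid, side = mid_side(left, right)
inst_freq, group_delay = phase_derivatives(stft(left))
```

Under the paradigm described in (iii), magnitude terms would be computed on the STFTs of all four signals (mid, side, left, right), while phase terms such as the two derivatives above would be computed on left and right only. How the paper weights and combines these terms is not specified here.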

Experiments show that ϵar-VAE at 44.1 kHz substantially outperforms leading open-source models across diverse metrics, with particular strength in reconstructing high-frequency harmonics and spatial characteristics.

Audio Reconstruction Comparison

Compare the original audio with our ϵar-VAE reconstructions.

Sample 1: C-Pop

Original Audio
ϵar-VAE Reconstruction

Sample 2: Hip-Hop

Original Audio

ϵar-VAE Reconstruction

Compare with our high-fidelity reconstruction

Sample 3: Rock

Original Audio

Note the high-frequency harmonics

ϵar-VAE Reconstruction

Excellent preservation of spatial characteristics

For the best comparison, listen on a stereo monitoring system.

Back to Ear: Reconstruction Comparison

Compare reconstructions from different popular audio models on the same music sample.

To keep the listening comparison fair, all files are normalized to the same integrated loudness (-14 LUFS-I).
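
Once a file's integrated loudness has been measured, normalizing it to a fixed target comes down to a single gain. A minimal sketch, assuming the measurement comes from an external ITU-R BS.1770 loudness meter; the function name is hypothetical, and this is not necessarily how the files on this page were processed.

```python
def gain_to_target(measured_lufs, target_lufs=-14.0):
    """Gain needed to move a file from its measured integrated loudness
    to the target, returned as (gain in dB, linear amplitude factor)."""
    gain_db = target_lufs - measured_lufs
    gain_linear = 10.0 ** (gain_db / 20.0)
    return gain_db, gain_linear

# A file measured at -20 LUFS needs +6 dB to reach -14 LUFS.
gain_db, gain_linear = gain_to_target(-20.0)
```

Multiplying every sample by `gain_linear` shifts the integrated loudness by `gain_db` loudness units. A positive gain can push peaks past full scale, so a peak check after scaling is advisable.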

Original Audio (Reference)

Ground truth reference for comparison

DAC

Descript Audio Codec, 44.1 kHz

BigVGAN V2

BigVGAN V2 44khz_128band_512x

Encodec

Encodec from Meta, 48 kHz

Stable Audio Open

Stability AI's open VAE model, 44.1 kHz

AudioGen

AudioGen Codec 48khz_continuous

ϵar-VAE (Ours)

Our perceptually driven model, 44.1 kHz

Audio Timeline with Key Perceptual Events
0:03.6 - Vocal sibilance "So" and plosive "Ko"
0:11.2 - String positioning and "ring-like" harmonic series
0:27.0 - Vocal distortion in the mid-high frequencies and "airy" feeling above 10 kHz
0:29.5 - Attack and transients of piano strikes on both sides
0:46.1 - Vocal esser, punchy kick and snare attack as energy from each part accumulates in the high frequencies
1:05.0 - Vocal "metal sound" distortion in the mid-high frequencies
1:30.5 - "Swooshing" effect moving from left to right

Marker categories: Spatial Info, Frequency Coverage, Distortion
