LocalVQE

Local Voice Quality Enhancement — compact neural models for acoustic echo cancellation (AEC), noise suppression (NS), and dereverberation of 16 kHz speech, running on commodity CPUs in real time. Causal and streaming (256-sample hop, 16 ms latency).

Try it: https://huggingface.co/spaces/LocalAI-io/LocalVQE-demo
Source, build system, tests: https://github.com/localai-org/LocalVQE

This page hosts the published weights. Inference runs the GGML C++ engine on the GGUF files directly (build instructions on GitHub).

Authors: Richard Palethorpe (richiejp) and Claude (Anthropic). LocalVQE is a streaming, CPU-tuned derivative of DeepVQE (Indenbom et al., Interspeech 2023).

Models

Speed is per 16 ms hop on a Ryzen 9 7900 (Zen4), 4 threads; RT = realtime factor (above 1× is faster than realtime; higher is faster).

Version	Does	Params	Size (F32)	Speed	Pick it when
v1.3 (current)	AEC + NS + dereverb	4.8 M	~19 MB	3.2 ms · 5.0× RT	best joint quality, CPU budget available
v1.2	AEC + NS + dereverb	1.3 M	~5 MB	1.7 ms · 8.9× RT	tight CPU / low-power devices
v1.4-AEC	echo only (keeps voice, noise, room)	203 K	~3 MB	0.83 ms · 19× RT	NS is handled elsewhere, or you want the room kept
v1.4-AEC 2.7K	echo only, linear filter (no mask)	2.7 K	~17 KB	0.36 ms · 44× RT	lightest echo canceller; echo isn't heavily reverberant
v1.1 / v1	AEC + NS + dereverb	1.3 M	~5 MB	—	superseded by v1.2

Joint models (v1.2 / v1.3) clean echo, noise, and reverb in one pass. v1.3 is wider and filters noise better; v1.2 has ~1/4 the parameters and roughly half the per-hop cost (1.7 ms vs 3.2 ms at 4 threads).
v1.4-AEC removes only the far-end echo and passes voice, room, and background through unchanged. It's a classical adaptive filter followed by a small neural mask. The 2.7K build is that filter alone — cheaper and gentler, but it can't remove heavily reverberant echo the way the mask can.
Every model needs a far-end reference signal (a loopback of what your speakers play) in addition to the mic.
bf16 GGUFs are ~12 % smaller with identical quality and speed; pick f32 unless download size matters.

Files in this repository

File	Size	Model
`localvqe-v1.4-aec-200K-f32.gguf`	3 MB	v1.4-AEC (echo only)
`localvqe-v1.4-aec-200K-bf16.gguf`	2.6 MB	v1.4-AEC, conv weights in BF16
`localvqe-v1.4-aec-2.7K-f32.gguf`	17 KB	v1.4-AEC front-end only (adaptive filter, no mask)
`localvqe-v1.3-4.8M-f32.gguf`	19 MB	v1.3 joint — GGUF the engine loads
`localvqe-v1.3-4.8M.pt`	55 MB	v1.3 joint — PyTorch checkpoint (research)
`localvqe-v1.2-1.3M-f32.gguf`	5 MB	v1.2 joint — GGUF
`localvqe-v1.2-1.3M.pt`	11 MB	v1.2 joint — PyTorch checkpoint
`localvqe-v1.1-1.3M-f32.gguf`, `localvqe-v1-1.3M-f32.gguf`	5 MB	older releases

v1.4-AEC is GGUF-only (no .pt). GGUF integrity is checked at load time against a built-in SHA256 allowlist in the engine.
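If you want to check a downloaded file yourself before loading it, you can compute its SHA-256 digest locally. The following is a minimal sketch using only Python's standard library. The function name `sha256_of_file` is illustrative, and the engine's allowlist values are not reproduced here, so compare the result against a digest obtained from a source you trust.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the first empty read (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of_file("localvqe-v1.3-4.8M-f32.gguf")` returns a 64-character hex string. Reading in chunks keeps memory use flat regardless of file size.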

Performance

Full 800-clip eval on the ICASSP 2022 AEC Challenge blind test set (real recordings). AECMOS echo / deg are 1–5 (higher = more echo removed / cleaner speech); OVRL is the DNSMOS overall score (higher is better); blind ERLE is 10·log10(E[mic²]/E[enh²]), only meaningful on far-end-only clips. Unprocessed-mic echo MOS is 2.67 / 2.56 / 1.90 / 2.13 / 5.00 across the five scenarios, in the row order of the tables below.
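The blind ERLE formula above can be sketched in a few lines of Python. This is an illustration of the stated definition, not the evaluation script used for the tables. The function name `blind_erle_db` and the small `eps` guard are assumptions added here.

```python
import math


def blind_erle_db(mic, enh, eps=1e-12):
    """Blind ERLE in dB: 10*log10(mean(mic^2) / mean(enh^2))."""
    if len(mic) == 0 or len(mic) != len(enh):
        raise ValueError("mic and enh must be non-empty and the same length")
    mic_power = sum(x * x for x in mic) / len(mic)
    enh_power = sum(x * x for x in enh) / len(enh)
    # eps is an assumption (not from the source): it avoids log(0) and
    # division by zero when the enhanced output is digital silence.
    return 10.0 * math.log10((mic_power + eps) / (enh_power + eps))


mic = [0.5, -0.5, 0.5, -0.5]
enh = [0.05, -0.05, 0.05, -0.05]  # 10x lower amplitude, so 100x lower power
print(round(blind_erle_db(mic, enh), 1))  # 20.0
```

Because the measure only compares input and output energy, it rewards any energy removal. That is why it is informative only on far-end-only clips, where everything in the mic signal is echo or background.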

v1.4-AEC — keeps background noise and room by design, so its ERLE and far-end DNSMOS are intentionally lower than the joint models (it isn't deleting the ambience):

Scenario	n	echo ↑	deg ↑	ERLE ↑	OVRL
doubletalk	115	4.20	2.45	—	2.59
doubletalk-with-movement	185	4.19	2.45	—	2.55
farend-singletalk	107	3.80	4.99	14.6 dB	1.37
farend-singletalk-with-movement	193	3.86	4.95	11.1 dB	1.31
nearend-singletalk	200	4.99	3.99	—	3.08

v1.4-AEC 2.7K (front-end only) — matches or beats the full model's perceptual far-end echo at 1/74 the parameters; the mask's extra work shows up as higher ERLE above, not higher echo MOS:

Scenario	n	echo ↑	deg ↑	ERLE ↑	OVRL
doubletalk	115	4.00	2.79	—	2.46
doubletalk-with-movement	185	3.90	2.92	—	2.42
farend-singletalk	107	4.06	5.00	6.5 dB	1.24
farend-singletalk-with-movement	193	4.05	4.97	3.9 dB	1.22
nearend-singletalk	200	4.98	3.77	—	3.03

v1.3 (joint) and v1.2 (joint) — these also delete the background, so their far-end ERLE is much higher and not comparable to v1.4-AEC's:

Scenario	n	v1.3 echo / deg / ERLE / OVRL	v1.2 echo / deg / ERLE / OVRL
doubletalk	115	4.73 / 2.62 / 8.5 dB / 2.89	4.72 / 2.37 / 8.4 dB / 2.83
doubletalk-with-movement	185	4.67 / 2.43 / 8.3 dB / 2.85	4.65 / 2.30 / 8.1 dB / 2.79
farend-singletalk	107	3.69 / 4.83 / 50.9 dB / 1.94	3.78 / 4.91 / 45.7 dB / 1.80
farend-singletalk-with-movement	193	3.88 / 4.98 / 49.9 dB / 1.96	4.12 / 4.96 / 40.6 dB / 1.75
nearend-singletalk	200	5.00 / 4.18 / 2.4 dB / 3.17	5.00 / 4.16 / 2.1 dB / 3.17

Latency

Per-hop p50 / RT factor on a Ryzen 9 7900 (Zen4). 16 kHz, 256-sample hop.

Model	1 thread	4 threads	dGPU (RTX 5070 Ti, Vulkan)
v1.4-AEC (203 K)	1.29 ms · 12.2×	0.83 ms · 18.6×	run on CPU¹
v1.4-AEC 2.7K	0.36 ms · 44× (single-threaded)	—	run on CPU¹
v1.3 (4.8 M)	9.73 ms · 1.58×	3.21 ms · 4.97×	2.57 ms · 6.07×
v1.2 (1.3 M)	4.28 ms · 3.72×	1.65 ms · 8.90×	1.96 ms · 7.85×

¹ v1.4-AEC's adaptive front-end always runs on CPU and the neural stage is too small for GPU offload to pay off.

Four threads is the sweet spot on Zen4 for all models; the library defaults to min(4, available CPUs).
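As a rough guide to reading these numbers, the hop duration and a realtime factor can be worked out as below. This sketch assumes the realtime factor is simply hop duration divided by per-hop processing time. The published factors may be derived from a different timing statistic than the p50 shown, so they need not match this ratio exactly. The helper names are illustrative, and `os.cpu_count()` is only an approximation of "available CPUs".

```python
import os

SAMPLE_RATE_HZ = 16_000
HOP_SAMPLES = 256


def hop_duration_ms(hop_samples=HOP_SAMPLES, sample_rate_hz=SAMPLE_RATE_HZ):
    """Audio time covered by one hop, in milliseconds."""
    return 1000.0 * hop_samples / sample_rate_hz


def realtime_factor(per_hop_ms, hop_ms=None):
    """Hop duration divided by processing time; above 1.0 keeps up with live audio."""
    if hop_ms is None:
        hop_ms = hop_duration_ms()
    return hop_ms / per_hop_ms


def default_threads():
    """Approximation of the stated default, min(4, available CPUs)."""
    # os.cpu_count() can return None, so fall back to a single thread.
    return min(4, os.cpu_count() or 1)


print(hop_duration_ms())                  # 16.0
print(round(realtime_factor(3.2), 2))     # 5.0
```

A factor of 5.0 means one hop is processed in a fifth of the time it takes to record, leaving the rest of the 16 ms budget for other work.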

Memory (CPU)

Working set the model adds on top of the ~7 MiB binary baseline:

Model	Post-load delta	Peak RSS
v1.3 (4.8 M)	+24.4 MiB	34.1 MiB
v1.2 (1.3 M)	+10.0 MiB	19.6 MiB
v1.4-AEC (203 K)	+6.7 MiB	17.0 MiB

Running inference

Download a GGUF (web UI, huggingface-cli, or hf_hub_download) and run the GGML CLI — same command for every model, just swap the file:

./localvqe localvqe-v1.3-4.8M-f32.gguf --in-wav mic.wav ref.wav --out-wav out.wav

Both the mic and the far-end reference must be 16 kHz mono PCM WAV files. Building the engine, the C API (liblocalvqe.so), and the OBS Studio plugin are documented in the GitHub repository.
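Before invoking the CLI from a script, it can help to confirm that both inputs meet the format requirement. The sketch below uses Python's standard `wave` module, which only opens PCM WAV files, and assembles the same argument list as the command shown above. The helper names `check_wav_16k_mono` and `build_command` are illustrative and not part of LocalVQE.

```python
import wave


def check_wav_16k_mono(path):
    """Raise ValueError unless `path` is a 16 kHz mono PCM WAV file."""
    # wave.open raises wave.Error itself for non-PCM or malformed files.
    with wave.open(path, "rb") as w:
        if w.getframerate() != 16_000:
            raise ValueError(f"{path}: expected 16000 Hz, got {w.getframerate()} Hz")
        if w.getnchannels() != 1:
            raise ValueError(f"{path}: expected mono, got {w.getnchannels()} channels")


def build_command(model, mic_wav, ref_wav, out_wav, exe="./localvqe"):
    """Validate both inputs, then return the CLI argument list."""
    check_wav_16k_mono(mic_wav)
    check_wav_16k_mono(ref_wav)
    return [exe, model, "--in-wav", mic_wav, ref_wav, "--out-wav", out_wav]
```

The returned list can be passed to `subprocess.run(..., check=True)`. Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.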

PyTorch reference

localvqe-v1.3-4.8M.pt and localvqe-v1.2-1.3M.pt are the checkpoints used to produce the GGUF exports — for verification, ablation, and research, not end-user inference (use the GGML build). The model definition lives under pytorch/ in the GitHub repo.

Citing

Cite the repository via CITATION.cff at https://github.com/localai-org/LocalVQE (GitHub's "Cite this repository" button produces APA / BibTeX), and the upstream DeepVQE paper:

@inproceedings{indenbom2023deepvqe,
  title     = {DeepVQE: Real Time Deep Voice Quality Enhancement for Joint
               Acoustic Echo Cancellation, Noise Suppression and Dereverberation},
  author    = {Indenbom, Evgenii and Ristea, Nicolae-C{\u{a}}t{\u{a}}lin
               and Saabas, Ando and P{\"a}rnamaa, Tanel
               and Gu{\v{z}}vin, Jegor and Cutler, Ross},
  booktitle = {Interspeech},
  year      = {2023},
  doi       = {10.21437/Interspeech.2023-2176}
}

Dataset attribution

Weights are trained on the ICASSP 2023 DNS Challenge dataset (Microsoft, CC BY 4.0) and fine-tuned on the ICASSP 2022/2023 AEC Challenge datasets.

Safety

Training data was filtered by DNSMOS, which can misclassify distressed speech (screaming, crying) as noise. LocalVQE may attenuate such signals and must not be relied upon for emergency or safety-critical applications.

License

Apache License 2.0.
