VoxCPM

what it is

tokenizer-free text-to-speech engine from OpenBMB. skips the token bottleneck entirely: it works directly with continuous audio representations instead of discrete tokens. multilingual output, and voice cloning from a 3-second reference clip.

why it matters

most TTS systems tokenize audio first, then generate tokens, then decode back to audio. VoxCPM cuts out the middle step, so nothing gets rounded to a discrete token along the way. the result: less quantization noise, better prosody, cleaner multilingual handling.
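a toy sketch of where quantization noise comes from in the tokenize-then-decode pipeline described above. this is not VoxCPM's code or architecture: the random "frames", the random codebook, and the helper names `quantize` and `mse` are all hypothetical, made up for illustration. it only shows that snapping a continuous vector to its nearest codebook entry loses information, while passing the vector through untouched does not.

```python
import random


def quantize(frame, codebook):
    """Return the codebook entry closest to `frame` (squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(frame, c)))


def mse(xs, ys):
    """Mean squared error between two equally shaped lists of vectors."""
    total = sum((a - b) ** 2 for x, y in zip(xs, ys) for a, b in zip(x, y))
    count = sum(len(x) for x in xs)
    return total / count


# Hypothetical stand-ins: random vectors play the role of audio frames,
# and a small random codebook plays the role of an audio tokenizer.
rng = random.Random(0)
DIM, N_FRAMES, CODEBOOK_SIZE = 4, 200, 16
frames = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_FRAMES)]
codebook = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(CODEBOOK_SIZE)]

# Token-based path: every frame is replaced by its nearest codebook entry.
tokenized_path = [quantize(f, codebook) for f in frames]

# Continuous path: frames are carried through without discretization.
continuous_path = [list(f) for f in frames]

quantization_error = mse(frames, tokenized_path)
continuous_error = mse(frames, continuous_path)

print(f"token path MSE:      {quantization_error:.4f}")
print(f"continuous path MSE: {continuous_error:.4f}")
```

real codecs use learned codebooks that are far better than random ones, so the gap is smaller in practice, but a finite codebook still leaves some error. a continuous representation sidesteps that particular source of loss.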

architecturally different from Voxtral (Mistral’s token-based approach). having two competing architectures for open voice synthesis means the space is maturing fast.

self.md relevance

the open-weight voice stack keeps filling gaps. with Voxtral for token-based synthesis and VoxCPM for tokenizer-free synthesis, self-hosted voice is no longer a compromise. it’s a choice between architectures.

→ GitHub