VoxCPM
what it is
tokenizer-free text-to-speech engine from OpenBMB. skips the token bottleneck entirely: it works directly with continuous audio representations instead of discrete audio tokens. multilingual output, voice cloning from a roughly 3-second reference sample.
why it matters
most TTS systems tokenize audio first, then generate tokens, then decode back to audio. VoxCPM cuts out the discrete middle step and stays in continuous space. the payoff: less quantization noise, better prosody, cleaner multilingual handling.
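a toy sketch of where quantization noise comes from, not VoxCPM's actual code. it assumes a plain uniform scalar quantizer standing in for an audio tokenizer (real codecs use learned codebooks), and the names `quantize` and `rms_error` are made up for illustration. the point: any discrete token step snaps a continuous value to the nearest codebook entry, and the gap never comes back at decode time.

```python
import math

def quantize(x, levels):
    """Snap x in [-1, 1] to the nearest of `levels` evenly spaced values.

    Returns the discrete token id and the value a decoder would reconstruct.
    Hypothetical stand-in for an audio tokenizer, for illustration only.
    """
    step = 2.0 / (levels - 1)
    idx = round((x + 1.0) / step)      # the discrete "token"
    return idx, -1.0 + idx * step      # token id, decoded value

def rms_error(signal, levels):
    """Root-mean-square error after a tokenize -> decode round trip."""
    squared = []
    for x in signal:
        _, decoded = quantize(x, levels)
        squared.append((x - decoded) ** 2)
    return math.sqrt(sum(squared) / len(squared))

# A continuous test signal: 5 cycles of a sine wave over 200 samples.
signal = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]

coarse = rms_error(signal, 16)    # small codebook: larger round-trip error
fine = rms_error(signal, 256)     # bigger codebook: smaller, but not zero

print(f"16 levels:  rms error {coarse:.5f}")
print(f"256 levels: rms error {fine:.5f}")
```

a bigger codebook shrinks the error but never removes it. a model that stays in continuous space skips this round trip altogether, which is the intuition behind the tokenizer-free pitch.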
architecturally different from Voxtral (Mistral’s token-based approach). having two competing architectures for open voice synthesis means the space is maturing fast.
self.md relevance
the open-weight voice stack keeps filling gaps. between Voxtral for token-based synthesis and VoxCPM for tokenizer-free, self-hosted voice is no longer a compromise — it’s a choice between architectures.
→ GitHub