Vocal is one small CLI that does speech-to-text, text-to-speech, and voice cloning on your own machine. No API keys, no uploads, no Python or Node runtime to babysit. A single binary, Qwen3 models, GGML under the hood, and GPU acceleration on whatever silicon you’ve got.
Why it exists
Every voice tool I tried either (a) wanted an API key and my audio on someone else’s server, or (b) was a research demo that needed three Python environments and a reboot to turn into anything useful. Vocal is the middle ground — a grown-up version of the demo, without the cloud and without the tooling tax. I built it because I kept writing the same glue code and was tired of pretending that wasn’t the real project.
What’s in the box
- ASR — transcription in 30+ languages via Qwen3-ASR, GPU-accelerated, streaming-friendly.
- TTS — high-quality synthesis with preset voices, plus style/emotion control via Qwen3-TTS CustomVoice.
- Voice cloning — record a reference clip, save a profile, synthesize in that voice locally.
- HTTP server — load the models once, serve many requests; ASR and TTS run on separate backends so they don’t fight.
- One binary, one install — `brew install vocal` and you’re done.
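The HTTP-server bullet describes a load-once, serve-many design: pay the model-loading cost at startup, then answer every request against the resident weights. Here is a minimal Python sketch of that pattern. The route, the JSON payload, and `FakeModel` are illustrative stand-ins, not Vocal’s actual API (Vocal itself is C++ with no Python at runtime).

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class FakeModel:
    """Stands in for an expensive-to-load ASR/TTS model."""
    def __init__(self):
        self.loads = 0  # how many times the weights were loaded

    def load(self):
        self.loads += 1  # real code would map GGUF weights here

    def transcribe(self, audio: bytes) -> str:
        return f"{len(audio)} bytes transcribed"

MODEL = FakeModel()
MODEL.load()  # loaded exactly once, at startup

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        payload = json.dumps(
            {"text": MODEL.transcribe(body), "loads": MODEL.loads}
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep the demo quiet
        pass

def serve_once_demo() -> list:
    """Spin up the server, issue three requests, return the responses."""
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0 = auto-pick
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]
    out = []
    for _ in range(3):  # many requests, still only one model load
        resp = urlopen(f"http://127.0.0.1:{port}", data=b"\x00" * 16000,
                       timeout=5)
        out.append(json.loads(resp.read()))
    server.shutdown()
    return out

if __name__ == "__main__":
    for r in serve_once_demo():
        print(r)
```

Every response reports `"loads": 1`: the point of the server mode is that the load count stays at one no matter how many requests arrive.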
How it’s built
- C17 / C++17, CMake, no interpreters at runtime.
- GGML for tensor inference — same engine family as llama.cpp / whisper.cpp.
- Metal on Apple Silicon, CUDA on Linux/Windows, Accelerate / OpenBLAS as CPU fallback.
- Qwen3-ASR and Qwen3-TTS (0.6B and 1.7B), with F16 / Q8_0 / Q4_K quantization for smaller footprints.
On an Apple Silicon laptop, three seconds of audio transcribes in roughly 200 ms end-to-end.
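The quantization levels above mostly trade accuracy for disk and memory footprint. A rough back-of-envelope sketch, using approximate GGML bytes-per-weight figures (assumed here: F16 = 2 bytes, Q8_0 ≈ 1.06, Q4_K ≈ 0.56; real GGUF files add metadata and keep some tensors at higher precision, so treat these as lower bounds):

```python
# Approximate average bytes per weight for each quantization level.
# Q8_0 packs 32 weights plus a scale into 34 bytes; Q4_K averages
# roughly 4.5 bits per weight.
BYTES_PER_WEIGHT = {"F16": 2.0, "Q8_0": 34 / 32, "Q4_K": 4.5 / 8}

def footprint_gib(params_billions: float, quant: str) -> float:
    """Approximate weight size in GiB for a given parameter count."""
    return params_billions * 1e9 * BYTES_PER_WEIGHT[quant] / 2**30

for size in (0.6, 1.7):
    for quant in BYTES_PER_WEIGHT:
        print(f"{size}B {quant}: ~{footprint_gib(size, quant):.2f} GiB")
```

For the latency claim, 3 s of audio transcribed in about 200 ms is a real-time factor of roughly 0.07, i.e. around 15× faster than real time.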
Design principles
- Local by default. Nothing leaves your machine unless you explicitly ask it to.
- Boring defaults. The first command you’d try should be the right one.
- Single binary, deep CLI. A handful of verbs at the top; everything configurable underneath.
- Fail loud, fail readable. Bad audio, wrong paths, missing models — each reported in one sentence, not a stack trace.
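The last principle can be sketched as a small input-validation step that turns common failure modes into a single actionable sentence. The error cases and wording below are illustrative, not Vocal’s actual output:

```python
from pathlib import Path

def diagnose(path_str: str):
    """Return a one-sentence error message, or None if the input looks usable."""
    path = Path(path_str)
    if not path.exists():
        # One sentence, names the problem and the fix; no traceback.
        return f"error: '{path}' does not exist - check the path and try again."
    if path.suffix.lower() not in {".wav", ".mp3", ".flac"}:
        return (f"error: '{path.name}' is not a supported audio format "
                f"(wav, mp3, flac).")
    return None

print(diagnose("missing.wav"))
```

The design choice is that every branch returns a complete sentence with the offending value inlined, so the user never has to reconstruct what went wrong from a traceback.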