2.6 KiB
2.6 KiB
Project: Cross-Platform Local Audio Transcription Service
Goal
Develop a lightweight, local-only system service that transcribes audio (mic/system) in real-time and injects the text via OS-level keyboard emulation.
Architecture
- Backend: Rust (Core logic, audio capture, Whisper inference).
- Frontend: Tauri (HTML/CSS/JS for UI, communicates with Rust backend via Tauri Commands).
- Communication: Secure, high-speed IPC bridge via Tauri's built-in command system (no manual pipes/sockets needed).
- Configuration:
config.tomlfor runtime settings (model paths, device selection).
Technology Stack
- Language: Rust
- Transcription: Whisper (
whisper.cppviawhisper-rsor similar bindings). - Audio Capture:
cpal(cross-platform). - Input Emulation:
enigo(cross-platform). - Frontend Framework: Tauri (v2).
- Config Handling:
serde+toml.
Key Design Decisions
- Offline-First: All transcription is performed locally; no network/API calls.
- Efficiency: Lean, background-service focus. Tauri + Rust provides a smaller memory/resource footprint than Electron.
- Dynamic Input Selection: Service supports dynamic audio source switching (hot-swapping) via Tauri Command calls from the UI.
- OS Specifics: Uses Rust
cfgattributes for platform-specific input simulation (X11/Wayland/Win32) without diverging the codebase.
Project Status & Evolution
- Dependencies: Switched from
whisper-rstowhisper-cpp-plusdue to compilation incompatibilities with currentwhisper.cppAPIs. This change prioritizes long-term maintainability over minor library versioning. - Audio Infrastructure: System requires
libasound2-devon Linux (PipeWire/ALSA). - Design Philosophy:
- Lean & Modular: Keep the backend (Rust) strictly for heavy lifting (audio, VAD, transcription) and the frontend (Tauri) for control/config.
- Dynamic Configuration: Real-time updates via
notify(config watcher) allow for sensitivity tuning without restarting the service. - VAD-First: Implemented energy-based VAD for efficient, human-centric transcription triggering, minimizing CPU load and irrelevant noise.
Next Steps
- Re-build: Implement the transition to
whisper-cpp-plusinaudio.rsandmain.rs. - Transcription Logic: Integrate the actual inference engine within the VAD-triggered buffer worker.
- Keyboard Emulation: Implement
enigointegration to inject the transcribed text. - Tauri Integration: Build the minimal HTML/JS GUI for sensitivity sliders and model selection.