# Project: Cross-Platform Local Audio Transcription Service

## Goal

Develop a lightweight, local-only system service that transcribes audio (mic/system) in real-time and injects the text via OS-level keyboard emulation.

## Architecture

- **Backend:** Rust (Core logic, audio capture, Whisper inference).
- **Frontend:** Tauri (HTML/CSS/JS for UI, communicates with Rust backend via Tauri Commands).
- **Communication:** Secure, high-speed IPC bridge via Tauri's built-in command system (no manual pipes/sockets needed).
- **Configuration:** `config.toml` for runtime settings (model paths, device selection).

## Technology Stack

- **Language:** Rust
- **Transcription:** Whisper (`whisper.cpp` via `whisper-rs` or similar bindings).
- **Audio Capture:** `cpal` (cross-platform).
- **Input Emulation:** `enigo` (cross-platform).
- **Frontend Framework:** Tauri (v2).
- **Config Handling:** `serde` + `toml`.

## Key Design Decisions

- **Offline-First:** All transcription is performed locally; no network/API calls.
- **Efficiency:** Lean, background-service focus. Tauri + Rust provides a smaller memory/resource footprint than Electron.
- **Dynamic Input Selection:** Service supports dynamic audio source switching (hot-swapping) via Tauri Command calls from the UI.
- **OS Specifics:** Uses Rust `cfg` attributes for platform-specific input simulation (X11/Wayland/Win32) without diverging the codebase.

## Project Status & Evolution

- **Dependencies:** Switched from `whisper-rs` to `whisper-cpp-plus` due to compilation incompatibilities with current `whisper.cpp` APIs. This change prioritizes long-term maintainability over minor library versioning.
- **Audio Infrastructure:** System requires `libasound2-dev` on Linux (PipeWire/ALSA).
- **Design Philosophy:**
  - **Lean & Modular:** Keep the backend (Rust) strictly for heavy lifting (audio, VAD, transcription) and the frontend (Tauri) for control/config.
  - **Dynamic Configuration:** Real-time updates via `notify` (config watcher) allow for sensitivity tuning without restarting the service.
  - **VAD-First:** Implemented energy-based VAD for efficient, human-centric transcription triggering, minimizing CPU load and irrelevant noise.

## Next Steps

1. **Re-build:** Implement the transition to `whisper-cpp-plus` in `audio.rs` and `main.rs`.
2. **Transcription Logic:** Integrate the actual inference engine within the VAD-triggered buffer worker.
3. **Keyboard Emulation:** Implement `enigo` integration to inject the transcribed text.
4. **Tauri Integration:** Build the minimal HTML/JS GUI for sensitivity sliders and model selection.
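
## Appendix: Energy-Based VAD Sketch

The energy-based VAD mentioned under Design Philosophy could look roughly like the sketch below. It is not the project's actual implementation: the names `EnergyVad`, `frame_rms`, `threshold`, and `hangover_frames` are hypothetical, and it assumes mono `f32` samples normalized to `[-1.0, 1.0]`. The idea is to gate transcription on the RMS energy of each frame and keep the gate open for a few quiet frames so short pauses do not split an utterance.

```rust
/// Root-mean-square energy of one frame of mono samples.
/// Assumes samples are normalized to [-1.0, 1.0]; returns 0.0 for an empty frame.
pub fn frame_rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = frame.iter().map(|s| s * s).sum();
    (sum_sq / frame.len() as f32).sqrt()
}

/// Hypothetical energy-based voice activity detector.
pub struct EnergyVad {
    /// RMS level at or above which a frame counts as speech (the "sensitivity").
    threshold: f32,
    /// Number of consecutive quiet frames tolerated before speech is considered over.
    hangover_frames: u32,
    silent_run: u32,
    speaking: bool,
}

impl EnergyVad {
    pub fn new(threshold: f32, hangover_frames: u32) -> Self {
        Self { threshold, hangover_frames, silent_run: 0, speaking: false }
    }

    /// Lets a config watcher adjust sensitivity without rebuilding the detector.
    pub fn set_threshold(&mut self, threshold: f32) {
        self.threshold = threshold;
    }

    /// Feeds one frame and returns whether the detector currently considers
    /// the stream to contain speech.
    pub fn process(&mut self, frame: &[f32]) -> bool {
        if frame_rms(frame) >= self.threshold {
            self.speaking = true;
            self.silent_run = 0;
        } else if self.speaking {
            self.silent_run += 1;
            if self.silent_run > self.hangover_frames {
                self.speaking = false;
                self.silent_run = 0;
            }
        }
        self.speaking
    }
}
```

In a setup like this, the buffer worker would append frames to the transcription buffer while `process` returns `true` and hand the buffer to the inference engine when it flips back to `false`. A fixed threshold is the simplest choice; an adaptive noise floor would be a natural refinement if fixed sensitivity proves brittle.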
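
## Appendix: Platform Selection Sketch

The "OS Specifics" decision (one codebase, `cfg` attributes per platform) might be structured along these lines. This is only an illustration: the `InputBackend` enum and the `detect_backend` and `linux_backend` functions are hypothetical names, and using the `XDG_SESSION_TYPE` environment variable to tell Wayland from X11 is an assumption, not something the project has settled on. The sketch covers only the three targets named above and reports anything else as unsupported.

```rust
/// Hypothetical set of input-simulation backends, mirroring the
/// X11/Wayland/Win32 split named in the design decisions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InputBackend {
    Win32,
    LinuxX11,
    LinuxWayland,
    Unsupported,
}

/// Pure helper so the Linux decision can be tested on any platform.
/// Assumption: a session type of "wayland" means Wayland; anything else
/// (including a missing variable) falls back to X11.
pub fn linux_backend(session_type: Option<&str>) -> InputBackend {
    match session_type {
        Some("wayland") => InputBackend::LinuxWayland,
        _ => InputBackend::LinuxX11,
    }
}

// Exactly one of the following definitions is compiled per target.

#[cfg(target_os = "windows")]
pub fn detect_backend() -> InputBackend {
    InputBackend::Win32
}

#[cfg(target_os = "linux")]
pub fn detect_backend() -> InputBackend {
    // Runtime check: X11 and Wayland share the same compile target.
    linux_backend(std::env::var("XDG_SESSION_TYPE").ok().as_deref())
}

#[cfg(not(any(target_os = "windows", target_os = "linux")))]
pub fn detect_backend() -> InputBackend {
    InputBackend::Unsupported
}
```

Callers would invoke `detect_backend()` once at startup and route text injection accordingly. Keeping the compile-time split (`cfg`) separate from the runtime split (session type) means the Linux logic stays testable on every platform.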