# Project: Cross-Platform Local Audio Transcription Service
## Goal
Develop a lightweight, local-only system service that transcribes audio (microphone or system output) in real time and injects the resulting text through OS-level keyboard emulation.
## Architecture
- **Backend:** Rust (core logic, audio capture, Whisper inference).
- **Frontend:** Tauri (HTML/CSS/JS for the UI; communicates with the Rust backend via Tauri commands).
- **Communication:** Secure, high-speed IPC bridge via Tauri's built-in command system (no manual pipes or sockets needed).
- **Configuration:** `config.toml` for runtime settings (model paths, device selection).
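Behind the command bridge, a UI request still has to reach the thread that owns audio capture. The following is a minimal sketch of one way to do that, assuming a standard-library channel sits between a Tauri command handler and the audio worker. `ControlMsg`, `WorkerState`, `control_channel`, and `drain_control` are hypothetical names for illustration; none of them come from Tauri or from this project.

```rust
use std::sync::mpsc::{self, Receiver, Sender};

/// Hypothetical requests the UI might send (illustrative names only).
#[derive(Debug, Clone, PartialEq)]
pub enum ControlMsg {
    /// Switch capture to the named input device.
    SelectInput(String),
    /// Change the VAD sensitivity threshold.
    SetSensitivity(f32),
}

/// State owned by the audio worker (a stand-in for real capture state).
#[derive(Debug, Clone, PartialEq)]
pub struct WorkerState {
    pub input: String,
    pub sensitivity: f32,
}

/// Creates the channel. In this sketch the sender would be held by the
/// command handler and the receiver by the audio worker.
pub fn control_channel() -> (Sender<ControlMsg>, Receiver<ControlMsg>) {
    mpsc::channel()
}

/// Applies every pending message without blocking the audio loop.
/// Returns how many messages were applied.
pub fn drain_control(rx: &Receiver<ControlMsg>, state: &mut WorkerState) -> usize {
    let mut applied = 0;
    // `try_recv` returns an error as soon as the queue is empty
    // (or the sender is gone), which ends the loop.
    while let Ok(msg) = rx.try_recv() {
        match msg {
            ControlMsg::SelectInput(name) => state.input = name,
            // Assumption: sensitivity is kept in the range 0.0..=1.0.
            ControlMsg::SetSensitivity(v) => state.sensitivity = v.clamp(0.0, 1.0),
        }
        applied += 1;
    }
    applied
}
```

Calling `drain_control` once per audio loop iteration keeps the worker responsive to UI changes without ever waiting on the channel.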
## Technology Stack
- **Language:** Rust
- **Transcription:** Whisper (`whisper.cpp` via `whisper-rs` or similar bindings).
- **Audio Capture:** `cpal` (cross-platform).
- **Input Emulation:** `enigo` (cross-platform).
- **Frontend Framework:** Tauri (v2).
- **Config Handling:** `serde` + `toml`.
## Key Design Decisions
- **Offline-First:** All transcription is performed locally; no network or API calls are made.
- **Efficiency:** Lean, background-service focus. Tauri + Rust typically has a smaller memory and resource footprint than Electron.
- **Dynamic Input Selection:** The service supports switching the audio source at runtime (hot-swapping) via Tauri command calls from the UI.
- **OS Specifics:** Uses Rust `cfg` attributes to select platform-specific input simulation at compile time (for example Linux vs. Win32) within a single codebase. X11 and Wayland are both Linux targets, so choosing between them needs a runtime check or a Cargo feature rather than `cfg(target_os)`.
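The platform split can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: `InputBackend`, `linux_backend`, `select_backend`, and `detect_backend` are hypothetical names, and reading the `XDG_SESSION_TYPE` environment variable is only one possible way to tell Wayland from X11 at runtime.

```rust
/// Hypothetical backend labels for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InputBackend {
    Win32,
    X11,
    Wayland,
    Unsupported,
}

/// Runtime choice between the two Linux display servers.
/// Kept as a pure function so it can be tested on any platform.
pub fn linux_backend(session_type: Option<&str>) -> InputBackend {
    match session_type {
        Some(s) if s.eq_ignore_ascii_case("wayland") => InputBackend::Wayland,
        // Assumption: anything else, including an unset variable, falls back to X11.
        _ => InputBackend::X11,
    }
}

// Compile-time selection: exactly one `select_backend` is built per target.
#[cfg(target_os = "windows")]
pub fn select_backend(_session_type: Option<&str>) -> InputBackend {
    InputBackend::Win32
}

#[cfg(target_os = "linux")]
pub fn select_backend(session_type: Option<&str>) -> InputBackend {
    linux_backend(session_type)
}

#[cfg(not(any(target_os = "windows", target_os = "linux")))]
pub fn select_backend(_session_type: Option<&str>) -> InputBackend {
    InputBackend::Unsupported
}

/// Reads the session type from the environment and picks a backend.
pub fn detect_backend() -> InputBackend {
    let session = std::env::var("XDG_SESSION_TYPE").ok();
    select_backend(session.as_deref())
}
```

Keeping the Linux decision in a plain function means the compile-time gate stays small, and the runtime logic can be unit-tested without a display server.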
## Project Status & Evolution
- **Dependencies:** Decided to switch from `whisper-rs` to `whisper-cpp-plus` because of compilation incompatibilities with current `whisper.cpp` APIs. The change prioritizes long-term maintainability over staying on a particular library version.
- **Audio Infrastructure:** Linux builds require the ALSA development package (`libasound2-dev` on Debian/Ubuntu); this applies to both ALSA and PipeWire systems.
- **Design Philosophy:**
  - **Lean & Modular:** Keep the Rust backend strictly for the heavy lifting (audio, VAD, transcription) and the Tauri frontend for control and configuration.
  - **Dynamic Configuration:** A config watcher built on `notify` applies changes at runtime, so sensitivity can be tuned without restarting the service.
  - **VAD-First:** Implemented energy-based voice activity detection (VAD) so that transcription is triggered only by speech, which reduces CPU load and avoids transcribing irrelevant noise.
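Energy-based VAD with a tunable sensitivity can be sketched as below. This is a minimal illustration, assuming mono `f32` samples in the range -1.0 to 1.0 and a simple RMS threshold with a hangover period; `EnergyVad`, `rms`, and the parameter names are hypothetical and may not match the project's implementation. A config watcher could call `set_threshold` when the sensitivity value changes.

```rust
/// Root-mean-square energy of one frame of samples.
pub fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = frame.iter().map(|s| s * s).sum();
    (sum_sq / frame.len() as f32).sqrt()
}

/// Hypothetical energy-threshold detector with a hangover period.
pub struct EnergyVad {
    threshold: f32,
    hangover_frames: u32,
    silent_run: u32,
    active: bool,
}

impl EnergyVad {
    /// `threshold` is the RMS level treated as speech; `hangover_frames` is
    /// how many quiet frames are tolerated before speech is considered over.
    pub fn new(threshold: f32, hangover_frames: u32) -> Self {
        Self { threshold: threshold.max(0.0), hangover_frames, silent_run: 0, active: false }
    }

    /// Runtime sensitivity update, for example from a config watcher.
    pub fn set_threshold(&mut self, threshold: f32) {
        self.threshold = threshold.max(0.0);
    }

    /// Feeds one frame; returns true while speech is considered active.
    pub fn push_frame(&mut self, frame: &[f32]) -> bool {
        if rms(frame) >= self.threshold {
            self.active = true;
            self.silent_run = 0;
        } else if self.active {
            // Quiet frame during speech: wait out the hangover before stopping.
            self.silent_run += 1;
            if self.silent_run > self.hangover_frames {
                self.active = false;
                self.silent_run = 0;
            }
        }
        self.active
    }
}
```

The hangover keeps short pauses between words from splitting one utterance into several transcription requests.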
## Next Steps
1. **Rebuild:** Complete the transition to `whisper-cpp-plus` in `audio.rs` and `main.rs`.
2. **Transcription Logic:** Integrate the actual inference engine into the VAD-triggered buffer worker.
3. **Keyboard Emulation:** Integrate `enigo` to inject the transcribed text.
4. **Tauri Integration:** Build the minimal HTML/JS GUI with sensitivity sliders and model selection.
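Steps 2 and 3 can be prototyped before the real engine and keyboard backend are wired in. The sketch below assumes two hypothetical traits, `Transcriber` and `TextSink`, as seams for the Whisper binding and the `enigo` integration. It does not call either library, and `FakeTranscriber` and `RecordingSink` are stand-ins that exist only to exercise the worker.

```rust
/// Hypothetical seam for the inference engine.
pub trait Transcriber {
    fn transcribe(&mut self, samples: &[f32]) -> String;
}

/// Hypothetical seam for text injection.
pub trait TextSink {
    fn type_text(&mut self, text: &str);
}

/// Buffers audio while speech is active and flushes it on the
/// speech-to-silence edge.
pub struct UtteranceWorker<T: Transcriber, S: TextSink> {
    transcriber: T,
    sink: S,
    buffer: Vec<f32>,
    was_speech: bool,
}

impl<T: Transcriber, S: TextSink> UtteranceWorker<T, S> {
    pub fn new(transcriber: T, sink: S) -> Self {
        Self { transcriber, sink, buffer: Vec::new(), was_speech: false }
    }

    /// Feeds one frame plus the VAD decision for it.
    /// Returns true when text was injected for a finished utterance.
    pub fn push(&mut self, frame: &[f32], is_speech: bool) -> bool {
        if is_speech {
            self.buffer.extend_from_slice(frame);
            self.was_speech = true;
            return false;
        }
        if !self.was_speech {
            return false; // still silent, nothing buffered
        }
        self.was_speech = false;
        // Take the buffered samples, leaving an empty buffer for the next utterance.
        let samples = std::mem::take(&mut self.buffer);
        let text = self.transcriber.transcribe(&samples);
        let text = text.trim();
        if text.is_empty() {
            return false;
        }
        self.sink.type_text(text);
        true
    }

    pub fn sink(&self) -> &S {
        &self.sink
    }
}

/// Stand-in engine: reports the sample count instead of real text.
pub struct FakeTranscriber;

impl Transcriber for FakeTranscriber {
    fn transcribe(&mut self, samples: &[f32]) -> String {
        format!("[{} samples]", samples.len())
    }
}

/// Stand-in sink: records text instead of emitting key events.
#[derive(Default)]
pub struct RecordingSink {
    pub typed: Vec<String>,
}

impl TextSink for RecordingSink {
    fn type_text(&mut self, text: &str) {
        self.typed.push(text.to_string());
    }
}
```

With the traits in place, the real engine and keyboard backend can replace the stand-ins without changes to the buffering logic.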