# Project: Cross-Platform Local Audio Transcription Service
## Goal
Develop a lightweight, local-only system service that transcribes audio (microphone or system output) in real time and injects the resulting text via OS-level keyboard emulation.
## Architecture
- **Backend:** Rust (Core logic, audio capture, Whisper inference).
- **Frontend:** Tauri (HTML/CSS/JS for UI, communicates with Rust backend via Tauri Commands).
- **Communication:** The UI calls the backend through Tauri's built-in command system, so no manual pipes or sockets are needed.
- **Configuration:** `config.toml` for runtime settings (model paths, device selection).
## Technology Stack
- **Language:** Rust
- **Transcription:** Whisper (`whisper.cpp` via Rust bindings: originally `whisper-rs`, now moving to `whisper-cpp-plus`).
- **Audio Capture:** `cpal` (cross-platform).
- **Input Emulation:** `enigo` (cross-platform).
- **Frontend Framework:** Tauri (v2).
- **Config Handling:** `serde` + `toml`.
## Key Design Decisions
- **Offline-First:** All transcription is performed locally; no network/API calls.
- **Efficiency:** The service is designed to run in the background with a small footprint. Tauri with a Rust backend uses less memory and fewer resources than an equivalent Electron app.
- **Dynamic Input Selection:** Service supports dynamic audio source switching (hot-swapping) via Tauri Command calls from the UI.
- **OS Specifics:** Uses Rust `cfg` attributes to select the platform-specific input simulation (X11/Wayland/Win32) at compile time, so all platforms share one codebase.
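
The `cfg`-based split described above might look roughly like the following. This is a minimal sketch, not the project's actual code: the `platform` module, `backend_name`, and `describe_injection` are hypothetical names, and the real backends would call into `enigo` or OS APIs instead of returning a label.

```rust
// Exactly one `platform` module is compiled, chosen by the target OS.
// All names in this sketch are hypothetical.

#[cfg(target_os = "linux")]
mod platform {
    /// Linux build: a real implementation would target X11 or Wayland here.
    pub fn backend_name() -> &'static str {
        "linux"
    }
}

#[cfg(target_os = "windows")]
mod platform {
    /// Windows build: a real implementation would use Win32 input APIs here.
    pub fn backend_name() -> &'static str {
        "win32"
    }
}

#[cfg(not(any(target_os = "linux", target_os = "windows")))]
mod platform {
    /// Fallback so the crate still compiles on targets without a backend.
    pub fn backend_name() -> &'static str {
        "unsupported"
    }
}

/// Platform-independent entry point: callers never mention the OS.
/// This sketch only reports what it would do instead of typing anything.
pub fn describe_injection(text: &str) -> String {
    format!(
        "[{}] would type {} chars",
        platform::backend_name(),
        text.chars().count()
    )
}
```

Because the selection happens at compile time, the rest of the service only ever calls `describe_injection` (or its real equivalent) and needs no runtime OS checks.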
## Project Status & Evolution
- **Dependencies:** Switching from `whisper-rs` to `whisper-cpp-plus` because `whisper-rs` did not compile against the current `whisper.cpp` APIs. The change favors long-term maintainability over staying on the original binding; the code changes are still pending (see Next Steps).
- **Audio Infrastructure:** On Linux, the build requires the ALSA development package (`libasound2-dev` on Debian/Ubuntu). This applies to both PipeWire and plain ALSA setups.
- **Design Philosophy:**
  - **Lean & Modular:** Keep the backend (Rust) strictly for heavy lifting (audio, VAD, transcription) and the frontend (Tauri) for control and configuration.
  - **Dynamic Configuration:** A config watcher built on `notify` applies changes at runtime, so sensitivity can be tuned without restarting the service.
  - **VAD-First:** An energy-based VAD (voice activity detection) gates transcription, so inference runs only when speech is likely present. This reduces CPU load and avoids transcribing background noise.
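
To make the energy-based VAD idea concrete, here is a minimal sketch. It assumes samples are `f32` values in the range [-1.0, 1.0] and that the sensitivity setting maps directly to an RMS threshold. Both are assumptions, and `rms` and `is_speech` are illustrative names rather than the project's functions.

```rust
/// Root-mean-square energy of one frame of samples.
/// An empty frame is treated as silence.
pub fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = frame.iter().map(|s| s * s).sum();
    (sum_sq / frame.len() as f32).sqrt()
}

/// True when the frame's energy reaches the sensitivity threshold.
/// `threshold` is a hypothetical tuning value, e.g. loaded from `config.toml`.
pub fn is_speech(frame: &[f32], threshold: f32) -> bool {
    rms(frame) >= threshold
}
```

A plain energy threshold is cheap enough to run on every frame, but it cannot tell speech from other loud sounds, so the threshold usually needs tuning per microphone and environment.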
## Next Steps
1. **Rebuild:** Implement the transition to `whisper-cpp-plus` in `audio.rs` and `main.rs`.
2. **Transcription Logic:** Integrate the actual inference engine within the VAD-triggered buffer worker.
3. **Keyboard Emulation:** Implement `enigo` integration to inject the transcribed text.
4. **Tauri Integration:** Build the minimal HTML/JS GUI for sensitivity sliders and model selection.
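
As a starting point for step 2, a VAD-triggered buffer worker could be sketched as below. This is an illustration under stated assumptions, not the planned implementation: `Segmenter`, `run_worker`, `frame_rms`, and the `hangover` parameter are hypothetical, frames arrive over a standard-library channel, and the inference call is replaced by a placeholder string.

```rust
use std::sync::mpsc::{Receiver, Sender};

/// RMS energy of a frame; an empty frame counts as silence.
fn frame_rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    (frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32).sqrt()
}

/// Collects speech frames into utterances and closes an utterance
/// after `hangover` consecutive silent frames.
pub struct Segmenter {
    threshold: f32,
    hangover: usize,
    silent_run: usize,
    buffer: Vec<f32>,
}

impl Segmenter {
    pub fn new(threshold: f32, hangover: usize) -> Self {
        Segmenter { threshold, hangover, silent_run: 0, buffer: Vec::new() }
    }

    /// Feed one frame; returns a finished utterance once trailing silence closes it.
    pub fn push(&mut self, frame: &[f32]) -> Option<Vec<f32>> {
        if frame_rms(frame) >= self.threshold {
            self.silent_run = 0;
            self.buffer.extend_from_slice(frame);
            return None;
        }
        if self.buffer.is_empty() {
            return None; // silence before any speech: nothing to buffer
        }
        self.silent_run += 1;
        if self.silent_run >= self.hangover {
            self.silent_run = 0;
            return Some(std::mem::take(&mut self.buffer));
        }
        // Short pause inside an utterance: keep it so words are not clipped.
        self.buffer.extend_from_slice(frame);
        None
    }
}

/// Worker loop: audio frames in, one placeholder transcript per utterance out.
/// Stops when the frame sender is dropped or the transcript receiver goes away.
pub fn run_worker(frames: Receiver<Vec<f32>>, transcripts: Sender<String>, mut seg: Segmenter) {
    for frame in frames {
        if let Some(utterance) = seg.push(&frame) {
            // Placeholder for the real inference call on `utterance`.
            let text = format!("<{} samples>", utterance.len());
            if transcripts.send(text).is_err() {
                break;
            }
        }
    }
}
```

In this arrangement the capture callback would only send frames into the channel, while `run_worker` runs on its own thread. That keeps slow inference off the audio thread, and the transcript channel gives the keyboard-emulation step (step 3) a single place to read from.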