Knowledge/projects/audio-transcription-service/README.md

# Project: Cross-Platform Local Audio Transcription Service

## Goal
Develop a lightweight, local-only system service that transcribes audio (microphone or system audio) in real time and injects the resulting text via OS-level keyboard emulation.

## Architecture
- **Backend:** Rust (Core logic, audio capture, Whisper inference).
- **Frontend:** Tauri (HTML/CSS/JS for UI, communicates with Rust backend via Tauri Commands).
- **Communication:** Tauri's built-in command system serves as the IPC bridge between frontend and backend, so no manual pipes or sockets are needed.
- **Configuration:** `config.toml` for runtime settings (model paths, device selection).
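
The command-driven control flow above can be sketched with the standard library alone. This is a minimal sketch, not the project's actual code: `Control`, `WorkerState`, and `spawn_worker` are hypothetical names, and it assumes that a Tauri command handler would simply forward a message like these to the backend worker over a channel.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Control messages the UI could send to the audio worker (hypothetical names).
#[derive(Debug, Clone, PartialEq)]
pub enum Control {
    /// Switch capture to the named input device.
    SelectDevice(String),
    /// Update the VAD sensitivity (clamped to 0.0..=1.0 by the worker).
    SetSensitivity(f32),
    /// Stop the worker so it returns its final state.
    Shutdown,
}

/// State owned exclusively by the worker thread.
#[derive(Debug, Clone, PartialEq)]
pub struct WorkerState {
    pub device: String,
    pub sensitivity: f32,
}

/// Spawns the worker and returns a sender for control messages plus a join handle.
pub fn spawn_worker(initial: WorkerState) -> (Sender<Control>, JoinHandle<WorkerState>) {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut state = initial;
        // A real worker would also capture audio; this loop only applies control messages.
        for msg in rx {
            match msg {
                Control::SelectDevice(name) => state.device = name,
                Control::SetSensitivity(value) => state.sensitivity = value.clamp(0.0, 1.0),
                Control::Shutdown => break,
            }
        }
        state
    });
    (tx, handle)
}
```

Because the worker owns its state and only reacts to messages, a device switch or a sensitivity change needs no locks shared with the UI side.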

## Technology Stack
- **Language:** Rust
- **Transcription:** Whisper (`whisper.cpp` via Rust bindings; originally `whisper-rs`, now `whisper-cpp-plus`, see Project Status & Evolution).
- **Audio Capture:** `cpal` (cross-platform).
- **Input Emulation:** `enigo` (cross-platform).
- **Frontend Framework:** Tauri (v2).
- **Config Handling:** `serde` + `toml`.

## Key Design Decisions
- **Offline-First:** All transcription is performed locally; no network/API calls.
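- **Efficiency:** The service is designed to run lean in the background. Tauri with a Rust backend typically has a smaller memory and disk footprint than an Electron app, because Tauri uses the system webview instead of bundling Chromium.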
- **Dynamic Input Selection:** Service supports dynamic audio source switching (hot-swapping) via Tauri Command calls from the UI.
- **OS Specifics:** Uses Rust `cfg` attributes to select platform-specific input simulation (X11/Wayland/Win32) within a single codebase.
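
As an illustration of the `cfg`-based approach, the sketch below selects a backend per platform using only the standard library. The `InputBackend` enum and both function names are hypothetical. Since `cfg` is resolved at compile time, it can separate Windows from Linux but not X11 from Wayland, so this sketch assumes a run-time check of the `WAYLAND_DISPLAY` environment variable, which is a heuristic rather than a guarantee.

```rust
/// Input-injection backends named in this README (the enum itself is hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InputBackend {
    Win32,
    X11,
    Wayland,
    Unsupported,
}

/// Pure helper: picks Wayland when `WAYLAND_DISPLAY` holds a non-empty value, else X11.
pub fn linux_backend_from(wayland_display: Option<&str>) -> InputBackend {
    match wayland_display {
        Some(value) if !value.is_empty() => InputBackend::Wayland,
        _ => InputBackend::X11,
    }
}

/// Compiled only on Windows.
#[cfg(target_os = "windows")]
pub fn detect_backend() -> InputBackend {
    InputBackend::Win32
}

/// Compiled only on Linux. The display server is a property of the running
/// session, so it has to be checked at run time.
#[cfg(target_os = "linux")]
pub fn detect_backend() -> InputBackend {
    let display = std::env::var("WAYLAND_DISPLAY").ok();
    linux_backend_from(display.as_deref())
}

/// Fallback for every other target.
#[cfg(not(any(target_os = "windows", target_os = "linux")))]
pub fn detect_backend() -> InputBackend {
    InputBackend::Unsupported
}
```

Callers use `detect_backend()` on every platform, and exactly one of the three definitions is compiled into a given build.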

## Project Status & Evolution
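- **Dependencies:** Moving from `whisper-rs` to `whisper-cpp-plus` because of compilation incompatibilities with the current `whisper.cpp` APIs. The switch favors long-term maintainability over staying on the original bindings. The code migration itself is still pending (see Next Steps, item 1).
- **Audio Infrastructure:** Building on Linux requires the ALSA development headers (`libasound2-dev` on Debian/Ubuntu). PipeWire systems are reached through the ALSA interface.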
- **Design Philosophy:**
  - **Lean & Modular:** Keep the backend (Rust) strictly for heavy lifting (audio, VAD, transcription) and the frontend (Tauri) for control/config.
  - **Dynamic Configuration:** A config watcher built on `notify` applies changes at run time, so sensitivity can be tuned without restarting the service.
  - **VAD-First:** An energy-based VAD gates transcription, so inference runs only when speech-level audio is detected. This reduces CPU load and avoids transcribing background noise.
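
To make the energy-based VAD concrete, here is a minimal sketch under stated assumptions: samples are `f32` in the range [-1.0, 1.0], energy is measured as per-frame RMS, and a short hangover keeps the detector active across brief pauses. The `EnergyVad` type, its fields, and the hangover logic are illustrative and not taken from the project's `audio.rs`.

```rust
/// Root-mean-square energy of one frame of samples in the range [-1.0, 1.0].
pub fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = frame.iter().map(|s| s * s).sum();
    (sum_sq / frame.len() as f32).sqrt()
}

/// Energy-based voice activity detector with a hangover period (hypothetical type).
pub struct EnergyVad {
    /// RMS level at or above which a frame counts as speech; public so that a
    /// config watcher could adjust it while the service is running.
    pub threshold: f32,
    /// Number of consecutive quiet frames tolerated before speech is considered over.
    pub hangover_frames: u32,
    quiet_frames: u32,
    active: bool,
}

impl EnergyVad {
    pub fn new(threshold: f32, hangover_frames: u32) -> Self {
        Self { threshold, hangover_frames, quiet_frames: 0, active: false }
    }

    /// Feeds one frame and returns whether speech is currently considered active.
    pub fn process(&mut self, frame: &[f32]) -> bool {
        if rms(frame) >= self.threshold {
            self.active = true;
            self.quiet_frames = 0;
        } else if self.active {
            self.quiet_frames += 1;
            if self.quiet_frames > self.hangover_frames {
                self.active = false;
            }
        }
        self.active
    }
}
```

The hangover trades a little extra buffered silence for fewer utterances being cut in the middle of a short pause. Both values would plausibly come from `config.toml`.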

## Next Steps
1.  **Rebuild:** Complete the migration to `whisper-cpp-plus` in `audio.rs` and `main.rs`.
2.  **Transcription Logic:** Run Whisper inference inside the VAD-triggered buffer worker.
3.  **Keyboard Emulation:** Integrate `enigo` to inject the transcribed text.
4.  **Tauri Integration:** Build the minimal HTML/JS GUI for sensitivity sliders and model selection.
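
One possible shape for the VAD-triggered buffer worker in step 2 is sketched below. It is an assumption about the design, not the planned implementation: `UtteranceBuffer` and `run_worker` are hypothetical names, and the `transcribe` closure stands in for the Whisper call so the sketch does not depend on any binding's API.

```rust
/// Collects samples while speech is active and releases them as one utterance.
pub struct UtteranceBuffer {
    samples: Vec<f32>,
    in_speech: bool,
}

impl UtteranceBuffer {
    pub fn new() -> Self {
        Self { samples: Vec::new(), in_speech: false }
    }

    /// Feeds one frame together with the VAD decision for it.
    /// Returns the finished utterance on the first non-speech frame after speech.
    pub fn push(&mut self, frame: &[f32], speech_active: bool) -> Option<Vec<f32>> {
        if speech_active {
            self.in_speech = true;
            self.samples.extend_from_slice(frame);
            None
        } else if self.in_speech {
            self.in_speech = false;
            // Hand the samples to the caller and leave an empty buffer behind.
            Some(std::mem::take(&mut self.samples))
        } else {
            None
        }
    }
}

/// Runs the buffer over (frame, VAD decision) pairs and passes every completed
/// utterance to `transcribe`, a placeholder for the inference engine.
/// Speech still in progress when the input ends is not flushed in this sketch.
pub fn run_worker<F>(frames: &[(Vec<f32>, bool)], mut transcribe: F) -> Vec<String>
where
    F: FnMut(&[f32]) -> String,
{
    let mut buffer = UtteranceBuffer::new();
    let mut texts = Vec::new();
    for (frame, active) in frames {
        if let Some(utterance) = buffer.push(frame, *active) {
            texts.push(transcribe(&utterance));
        }
    }
    texts
}
```

Keeping inference behind a closure means the buffering logic can be tested without a model file, and the text returned per utterance is what step 3 would hand to the keyboard emulation layer.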