Files
2026-05-05 09:40:28 +10:00

2.6 KiB

Project: Cross-Platform Local Audio Transcription Service

Goal

Develop a lightweight, local-only system service that transcribes audio (mic/system) in real-time and injects the text via OS-level keyboard emulation.

Architecture

  • Backend: Rust (Core logic, audio capture, Whisper inference).
  • Frontend: Tauri (HTML/CSS/JS for UI, communicates with Rust backend via Tauri Commands).
  • Communication: Secure, high-speed IPC bridge via Tauri's built-in command system (no manual pipes/sockets needed).
  • Configuration: config.toml for runtime settings (model paths, device selection).

Technology Stack

  • Language: Rust
  • Transcription: Whisper (whisper.cpp via whisper-rs or similar bindings).
  • Audio Capture: cpal (cross-platform).
  • Input Emulation: enigo (cross-platform).
  • Frontend Framework: Tauri (v2).
  • Config Handling: serde + toml.

Key Design Decisions

  • Offline-First: All transcription is performed locally; no network/API calls.
  • Efficiency: Lean, background-service focus. Tauri + Rust provides a smaller memory/resource footprint than Electron.
  • Dynamic Input Selection: Service supports dynamic audio source switching (hot-swapping) via Tauri Command calls from the UI.
  • OS Specifics: Uses Rust cfg attributes for platform-specific input simulation (X11/Wayland/Win32) without diverging the codebase.

Project Status & Evolution

  • Dependencies: Switched from whisper-rs to whisper-cpp-plus due to compilation incompatibilities with current whisper.cpp APIs. This change prioritizes long-term maintainability over minor library versioning.
  • Audio Infrastructure: System requires libasound2-dev on Linux (PipeWire/ALSA).
  • Design Philosophy:
    • Lean & Modular: Keep the backend (Rust) strictly for heavy lifting (audio, VAD, transcription) and the frontend (Tauri) for control/config.
    • Dynamic Configuration: Real-time updates via notify (config watcher) allow for sensitivity tuning without restarting the service.
    • VAD-First: Implemented energy-based VAD for efficient, human-centric transcription triggering, minimizing CPU load and irrelevant noise.

Next Steps

  1. Re-build: Implement the transition to whisper-cpp-plus in audio.rs and main.rs.
  2. Transcription Logic: Integrate the actual inference engine within the VAD-triggered buffer worker.
  3. Keyboard Emulation: Implement enigo integration to inject the transcribed text.
  4. Tauri Integration: Build the minimal HTML/JS GUI for sensitivity sliders and model selection.