Parlor is an on-device, real-time multimodal AI for natural voice and vision conversations. Because it runs locally, it avoids server costs and improves privacy.
Source: README

Parlor is gaining attention for its on-device approach to AI, which addresses both privacy concerns and server costs. It builds on recent models such as Gemma 4 E2B and Kokoro, which makes it well suited to language learning and real-time interaction.
Source: README, project traits

Parlor combines voice and vision capabilities, allowing users to have conversations with an AI using both speech and visual inputs.
Source: README

The AI runs entirely on the user's machine, ensuring privacy and eliminating the need for server infrastructure.
Source: README

Parlor supports real-time processing of voice and vision data, providing immediate responses to user inputs.
Source: README

Integrated voice activity detection allows for hands-free operation without the need for push-to-talk.
Source: README

The text-to-speech feature streams audio before the full response is generated, enhancing the user experience.
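One common way to start speaking before generation finishes is to cut the model's token stream into sentences and hand each sentence to the speech engine as soon as it is complete. The sketch below illustrates that idea only; it is not taken from Parlor's code. The function name `sentences_from_tokens` and the punctuation-based boundary rule are assumptions made for illustration.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: '.', '!' or '?' followed by whitespace.
# Real text (abbreviations, decimals) needs a more careful splitter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as the token stream completes it."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Every part except the last one is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part may still be growing, so keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left once generation ends.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence could be passed to the text-to-speech engine while the language model is still producing the rest of the reply, so the first audio does not have to wait for the last token.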
Source: README

Parlor uses a client-server architecture in which both sides run on the user's machine: a browser-based frontend communicates over WebSocket with a FastAPI server. The server uses Gemma 4 E2B for speech and vision understanding and Kokoro for text-to-speech. The code is modular, with separate components for the server, text-to-speech, and the frontend UI.
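To make the request flow concrete, here is a minimal sketch of one conversational turn: a message arrives from the browser, the server produces a text reply, and audio chunks are sent back as they become available. Everything in it is hypothetical: the JSON message shape, the `understand` and `synthesize` stand-ins, and `handle_turn` are illustrative names, not Parlor's actual protocol or API. The models are replaced by placeholders, and an in-memory list stands in for the WebSocket so the example needs only the standard library.

```python
import asyncio
import base64
import json
from typing import AsyncIterator, Awaitable, Callable, List, Optional, Union

Outgoing = Union[str, bytes]  # text frames carry JSON, binary frames carry audio


async def understand(audio: bytes, image: Optional[bytes]) -> str:
    """Hypothetical stand-in for the speech-and-vision model call."""
    seen = "with a camera frame" if image else "without a camera frame"
    return f"Received {len(audio)} bytes of audio {seen}."


async def synthesize(text: str) -> AsyncIterator[bytes]:
    """Hypothetical stand-in for text-to-speech; yields chunks one by one."""
    for word in text.split():
        yield word.encode("utf-8")  # placeholder bytes, not real audio


async def handle_turn(raw: str, send: Callable[[Outgoing], Awaitable[None]]) -> None:
    """Process one client message and push the reply back through `send`."""
    message = json.loads(raw)
    audio = base64.b64decode(message["audio"])
    image = base64.b64decode(message["image"]) if message.get("image") else None

    reply = await understand(audio, image)
    await send(json.dumps({"type": "text", "text": reply}))
    async for chunk in synthesize(reply):
        await send(chunk)  # forwarded as soon as each chunk exists
    await send(json.dumps({"type": "done"}))


def run_demo() -> List[Outgoing]:
    """Drive one turn, collecting outgoing frames in a list."""
    sent: List[Outgoing] = []

    async def send(item: Outgoing) -> None:
        sent.append(item)

    raw = json.dumps({"audio": base64.b64encode(b"\x00" * 4).decode("ascii")})
    asyncio.run(handle_turn(raw, send))
    return sent
```

In a FastAPI application, a function like `handle_turn` would typically be called from a WebSocket route, with `send` backed by the socket's text and binary send methods instead of a list.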
Source: Code tree + README

infra: Local machine with macOS Apple Silicon or Linux GPU
key_deps: Gemma 4 E2B, Kokoro, LiteRT-LM, Silero VAD
language: Python
framework: FastAPI for the server, HTML for the frontend UI
Source: Code tree, README

Parlor is suitable for language learning, interactive AI applications, and privacy-conscious users who prefer on-device processing over cloud-based solutions.
Source: README

Not enough information.
Source: GitHub Releases

Parlor is a promising project for developers and users interested in on-device AI. It combines privacy, real-time interaction, and multimodal input, which makes it particularly useful for language learning and interactive applications. It is best suited to technically inclined individuals or teams with capable local hardware, such as an Apple Silicon Mac or a Linux machine with a GPU.