Building Voice AI Apps in React Native (Whisper & The Realtime API)

By Malik Chohra

Text-based chatbots are giving way to voice-first interfaces. With the release of OpenAI's Realtime API and Gemini's Live API, full-duplex voice conversation (both sides can speak, and interrupt, at any time) is now practical on mobile. But handling real-time audio streams in React Native means working with native code and WebRTC. Here is how we build low-latency voice AI features at CasaInnov.

The Death of Turn-Based Voice

Before 2025, voice AI was "turn-based." You recorded audio, uploaded it to Whisper for speech-to-text (STT), sent the transcript to GPT-4, passed the response to a text-to-speech (TTS) service, and played back the resulting audio file. Total latency: 3 to 5 seconds. It felt like talking over a walkie-talkie.

OpenAI's Realtime API changed this. It accepts a continuous stream of PCM audio over a WebSocket (sent as base64-encoded chunks inside JSON events) and streams PCM audio back as it is generated. This lets the AI laugh, interrupt, and react to your tone of voice, with response latency that can drop below 300 ms.

React Native Audio Streaming

To send continuous audio to a WebSocket in React Native, we cannot rely on standard recording libraries that create `.wav` files. We need real-time chunking.
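As a rough illustration of real-time chunking, here is a minimal TypeScript sketch. It assumes 16-bit mono PCM at 24 kHz (the format discussed in this article) and a 100 ms chunk size, which is an illustrative choice rather than an API requirement. `PcmChunker` and `bytesForDuration` are hypothetical names, not part of any library.

```typescript
// Assumed audio format: 16-bit mono PCM at 24 kHz.
const SAMPLE_RATE_HZ = 24000;
const BYTES_PER_SAMPLE = 2;

/** Number of bytes needed to hold `ms` milliseconds of audio. */
function bytesForDuration(ms: number): number {
  return Math.floor((SAMPLE_RATE_HZ * ms) / 1000) * BYTES_PER_SAMPLE;
}

/**
 * Accepts arbitrarily sized PCM buffers from a recorder and emits
 * fixed-size chunks, keeping any remainder for the next push.
 */
class PcmChunker {
  private readonly chunkBytes: number;
  private pending: Uint8Array = new Uint8Array(0);

  constructor(chunkBytes: number) {
    this.chunkBytes = chunkBytes;
  }

  push(data: Uint8Array): Uint8Array[] {
    // Prepend whatever was left over from the previous call.
    const merged = new Uint8Array(this.pending.length + data.length);
    merged.set(this.pending, 0);
    merged.set(data, this.pending.length);

    const chunks: Uint8Array[] = [];
    let offset = 0;
    while (merged.length - offset >= this.chunkBytes) {
      chunks.push(merged.slice(offset, offset + this.chunkBytes));
      offset += this.chunkBytes;
    }
    this.pending = merged.slice(offset);
    return chunks;
  }

  /** Bytes buffered but not yet emitted as a full chunk. */
  get pendingBytes(): number {
    return this.pending.length;
  }
}
```

Each chunk returned by `push` can be sent as soon as it is available, so the upload never waits for a finished recording.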

Option 1: WebRTC. We use react-native-webrtc to connect to a WebRTC gateway (like LiveKit), which relays the audio to OpenAI's servers. This approach is very stable, and the WebRTC stack handles echo cancellation automatically.

Option 2: Native Audio Worklets. For tighter control, we write custom Swift and Kotlin modules (using AVAudioEngine on iOS and AudioRecord on Android) to capture 16-bit PCM audio at 24 kHz, base64-encode it on a background thread, and pass it across the React Native bridge to a standard WebSocket.
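To make the wire format concrete, the sketch below wraps one PCM chunk in the Realtime API's `input_audio_buffer.append` client event. In the setup described above the base64 encoding happens in the native module; the TypeScript encoder here is only a dependency-free stand-in so the example runs anywhere. `toBase64` and `makeAppendEvent` are hypothetical helper names.

```typescript
const B64_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/** Dependency-free base64 encoder (stand-in for the native encoder). */
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const hasB1 = i + 1 < bytes.length;
    const hasB2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = hasB1 ? bytes[i + 1] : 0;
    const b2 = hasB2 ? bytes[i + 2] : 0;
    out += B64_ALPHABET[b0 >> 2];
    out += B64_ALPHABET[((b0 & 0x03) << 4) | (b1 >> 4)];
    // Pad with "=" when the final group has fewer than three bytes.
    out += hasB1 ? B64_ALPHABET[((b1 & 0x0f) << 2) | (b2 >> 6)] : "=";
    out += hasB2 ? B64_ALPHABET[b2 & 0x3f] : "=";
  }
  return out;
}

interface AppendEvent {
  type: "input_audio_buffer.append";
  audio: string; // base64-encoded PCM16 bytes
}

/** Serializes one PCM16 chunk as a JSON message for the WebSocket. */
function makeAppendEvent(pcm16: Uint8Array): string {
  const event: AppendEvent = {
    type: "input_audio_buffer.append",
    audio: toBase64(pcm16),
  };
  return JSON.stringify(event);
}
```

With an open socket, each chunk would then be sent with something like `socket.send(makeAppendEvent(chunk))`, where `socket` is assumed to be an already authenticated WebSocket connection.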

Echo Cancellation & Interruptions

When the AI speaks through the phone's speaker, the microphone picks it up. If you don't use hardware Acoustic Echo Cancellation (AEC), the AI will hear its own voice, treat it as user speech, and start responding to itself.

Always request hardware AEC when configuring your audio session in React Native. On iOS, use the `playAndRecord` category with the `defaultToSpeaker` and `allowBluetooth` options, and capture through the Voice-Processing I/O audio unit (VoiceProcessingIO), which is the component that actually performs the echo cancellation.

To handle interruptions (when the user speaks while the AI is talking), the Realtime API requires you to stop local playback immediately and send a `conversation.item.truncate` event over the WebSocket, so the model knows how much of its response the user actually heard before cutting it off.
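The interruption flow can be sketched as follows. The example tracks how much assistant audio has been played and, on interruption, builds a `conversation.item.truncate` event with the elapsed playback time. It assumes 16-bit mono PCM output at 24 kHz and a single audio content part at index 0. `PlaybackTracker` is a hypothetical class, and stopping the actual audio player is left to the caller.

```typescript
// Assumed playback format: 16-bit mono PCM at 24 kHz.
const PLAYBACK_RATE_HZ = 24000;
const PLAYBACK_BYTES_PER_SAMPLE = 2;

/** Milliseconds of audio represented by `bytes` of PCM data. */
function bytesToMs(bytes: number): number {
  const samples = bytes / PLAYBACK_BYTES_PER_SAMPLE;
  return Math.floor((samples * 1000) / PLAYBACK_RATE_HZ);
}

class PlaybackTracker {
  private itemId: string | null = null;
  private bytesPlayed = 0;

  /** Call whenever a chunk of assistant audio has actually been played. */
  onChunkPlayed(itemId: string, byteLength: number): void {
    if (itemId !== this.itemId) {
      // A new assistant item started; restart the counter.
      this.itemId = itemId;
      this.bytesPlayed = 0;
    }
    this.bytesPlayed += byteLength;
  }

  /**
   * Call when the user starts speaking over the assistant.
   * Returns the JSON truncate event to send, or null if nothing was playing.
   */
  interrupt(): string | null {
    if (this.itemId === null) return null;
    const event = {
      type: "conversation.item.truncate",
      item_id: this.itemId,
      content_index: 0, // assumption: one audio part per item
      audio_end_ms: bytesToMs(this.bytesPlayed),
    };
    this.itemId = null;
    this.bytesPlayed = 0;
    return JSON.stringify(event);
  }
}
```

In an app, `interrupt()` would typically be triggered by a speech-start signal (for example, the server's `input_audio_buffer.speech_started` event when server-side voice activity detection is enabled), right after local playback is silenced.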

Conclusion

Building high-quality voice AI in React Native separates the hobbyists from the pros. If you want a voice assistant that feels like "Her", you must leave request/response HTTP behind and embrace WebRTC and native audio buffers.

Want a Voice-First App?

We build custom WebRTC bridges and integrate the OpenAI Realtime API directly into your Expo app.

Explore Our Expo Services
