What is RAG and Why Mobile Apps Need It
Retrieval-Augmented Generation (RAG) lets your AI answer questions using your own data (product docs, user history, knowledge bases) instead of only the LLM's training data. In a React Native app, RAG is the architecture behind "chat with your documents," personalized AI assistants, and intelligent search.
Without RAG, your AI answers from its training data alone, which is frozen at some point in the past and knows nothing about your product. With RAG, it answers from your actual docs, the user's own history, or your private knowledge base. That gap in quality is why RAG has become the default architecture for enterprise mobile AI features.
How RAG Works (Plain English)
- Ingest: Your documents (PDFs, text, markdown) are split into chunks and converted into vector embeddings (numbers that represent meaning)
- Store: Embeddings are stored in a vector database (Pinecone, Supabase pgvector, Chroma, Weaviate)
- Retrieve: When a user asks a question, the question is also embedded and the most semantically similar chunks are retrieved
- Generate: The retrieved chunks are inserted into the LLM prompt as context, and the LLM generates an answer grounded in your data
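The retrieve and generate steps can be sketched without any external service. The snippet below is a toy illustration, not the pipeline built later in this guide: the three-dimensional vectors are written by hand (real embeddings come from an embedding model and have hundreds or thousands of dimensions), and the names `cosineSimilarity`, `retrieve`, and `buildPrompt` are hypothetical.

```typescript
// Toy retrieval sketch: hand-written 3-d vectors stand in for real embeddings.
type Chunk = { text: string; vector: number[] }

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Retrieve: rank stored chunks by similarity to the query vector, keep the top K.
function retrieve(queryVector: number[], store: Chunk[], topK: number): Chunk[] {
  return [...store]
    .sort(
      (x, y) =>
        cosineSimilarity(queryVector, y.vector) - cosineSimilarity(queryVector, x.vector)
    )
    .slice(0, topK)
}

// Generate (prompt side only): place the retrieved text in front of the question.
function buildPrompt(question: string, chunks: Chunk[]): string {
  const context = chunks.map(c => c.text).join('\n---\n')
  return `Context:\n${context}\n\nQuestion: ${question}`
}

const store: Chunk[] = [
  { text: 'Refunds are processed within 5 days.', vector: [0.9, 0.1, 0.0] },
  { text: 'The app supports dark mode.', vector: [0.0, 0.2, 0.9] },
  { text: 'Refund requests need an order ID.', vector: [0.8, 0.3, 0.1] },
]

// A query vector pointing in the "refund" direction retrieves the two refund chunks.
const top = retrieve([1, 0, 0], store, 2)
console.log(buildPrompt('How do refunds work?', top))
```

In a real pipeline the vector database performs the ranking for you; the point of the sketch is only that "most semantically similar" means "highest similarity score between vectors".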
Architecture Options for Mobile RAG
Web RAG is simpler. Mobile adds bandwidth constraints, battery drain from embeddings, and the question of what happens when the user goes offline. Three main architectures handle these differently:
| Architecture | How it Works | Pros | Cons | Best For |
|---|---|---|---|---|
| Server-Side RAG | App sends query → your API does embedding + retrieval + generation → returns answer | Simple app code, secure, works on low-end devices | Requires internet, API latency (300–800ms) | Most production apps |
| Hybrid (Edge RAG) | Small vector store cached on device, embeddings computed locally, generation via cloud | Works offline for cached docs, faster retrieval | Complex, storage overhead | Offline-first apps with known document sets |
| Full On-Device RAG | Local LLM + local embeddings + local vector store (SQLite-vec) | 100% offline, full privacy | Requires a device with 6GB+ RAM, large model files | Medical/legal privacy apps |
We recommend Server-Side RAG for 90% of apps. This guide covers that architecture in full, with notes on where to add on-device caching.
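One place on-device caching can slot into a server-side setup is around the network call itself. The following is a minimal sketch under stated assumptions, not part of the guide's own code: `answerWithCache`, `cacheKey`, `KeyValueStore`, and `createMemoryStore` are hypothetical names, and the store interface is modeled on the promise-based `getItem`/`setItem` shape that AsyncStorage exposes. It tries the server first and falls back to a recent cached answer when the request fails.

```typescript
// Minimal async key-value interface; AsyncStorage's getItem/setItem fit this shape.
interface KeyValueStore {
  getItem(key: string): Promise<string | null>
  setItem(key: string, value: string): Promise<void>
}

type CachedEntry = { answer: string; savedAt: number }

// Normalize so "What is RAG?" and "  what is   rag? " share one cache entry.
function cacheKey(question: string): string {
  return 'rag-cache:' + question.trim().toLowerCase().replace(/\s+/g, ' ')
}

// Try the server first; fall back to a fresh-enough cached answer if the request fails.
async function answerWithCache(
  question: string,
  fetchAnswer: (q: string) => Promise<string>,
  store: KeyValueStore,
  maxAgeMs: number = 24 * 60 * 60 * 1000,
  now: () => number = Date.now
): Promise<{ answer: string; fromCache: boolean }> {
  const key = cacheKey(question)
  try {
    const answer = await fetchAnswer(question)
    const entry: CachedEntry = { answer, savedAt: now() }
    await store.setItem(key, JSON.stringify(entry))
    return { answer, fromCache: false }
  } catch (err) {
    const raw = await store.getItem(key)
    if (raw) {
      const entry = JSON.parse(raw) as CachedEntry
      if (now() - entry.savedAt <= maxAgeMs) {
        return { answer: entry.answer, fromCache: true }
      }
    }
    throw err // nothing usable is cached: surface the original error
  }
}

// In-memory stand-in for AsyncStorage so the sketch runs outside React Native.
function createMemoryStore(): KeyValueStore {
  const map = new Map<string, string>()
  return {
    getItem: async key => map.get(key) ?? null,
    setItem: async (key, value) => {
      map.set(key, value)
    },
  }
}
```

In an app you would pass AsyncStorage (or any store with the same two methods) instead of the in-memory stand-in, and show the user when an answer came from the cache, since it may be out of date.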
The Backend: RAG API with Node.js + Pinecone
Your React Native app calls a RAG API endpoint. Here's the minimal backend you need, deployable as a single Vercel function or Express route:
```typescript
// api/rag-query.ts (Vercel/Next.js API route)
import OpenAI from 'openai'
import { Pinecone } from '@pinecone-database/pinecone'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! })
const index = pinecone.index('your-index-name')

export async function POST(req: Request) {
  // NOTE: in production, derive userId from a verified session or token,
  // not from the request body
  const { query, userId, topK = 5 } = await req.json()
  if (typeof query !== 'string' || !query.trim()) {
    return new Response('Missing query', { status: 400 })
  }

  // Step 1: Embed the user's query
  const embeddingRes = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: query,
  })
  const queryVector = embeddingRes.data[0].embedding

  // Step 2: Retrieve relevant chunks from vector DB
  const results = await index.query({
    vector: queryVector,
    topK,
    // Scope to the user's own documents plus shared ones ingested with userId: 'global'
    filter: { userId: { $in: [userId, 'global'] } },
    includeMetadata: true,
  })

  const context = results.matches
    // Only include high-relevance chunks; tune this threshold against your own data
    .filter(m => (m.score ?? 0) > 0.75)
    .map(m => m.metadata?.text as string)
    .join('\n\n---\n\n')

  // Step 3: Generate answer with retrieved context
  // (openai v4 helper; newer SDK versions expose it as openai.chat.completions.stream)
  const stream = openai.beta.chat.completions.stream({
    model: 'gpt-4o-mini',
    messages: [
      {
        role: 'system',
        content: `You are a helpful assistant. Answer based ONLY on the provided context.
If the answer is not in the context, say "I don't have that information."
Context:
${context}`,
      },
      { role: 'user', content: query },
    ],
    max_tokens: 600,
  })

  // Return as streaming response: one JSON-encoded completion chunk per line
  return new Response(stream.toReadableStream())
}
```

Document Ingestion: Adding Your Data to the Vector DB
Before your RAG API can retrieve anything, you need to ingest your documents. Run this once (or on a schedule for live data):
```typescript
// scripts/ingest-documents.ts
import OpenAI from 'openai'
import { Pinecone } from '@pinecone-database/pinecone'
// Newer LangChain releases publish this class in '@langchain/textsplitters'
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter'

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! })
const index = pinecone.index('your-index-name')

// Split long docs into chunks. chunkSize and chunkOverlap are measured in
// characters by default, not tokens: 512 characters with a 50-character overlap
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 512,
  chunkOverlap: 50,
})

const BATCH_SIZE = 100

async function ingestDocument(text: string, metadata: Record<string, string>) {
  const chunks = await splitter.splitText(text)

  // Embed and upsert in batches of 100 chunks to keep each request small
  for (let start = 0; start < chunks.length; start += BATCH_SIZE) {
    const batch = chunks.slice(start, start + BATCH_SIZE)

    const embeddings = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch,
    })

    // Upsert into Pinecone; chunk IDs stay unique across batches
    await index.upsert(
      batch.map((chunk, i) => ({
        id: `${metadata.docId}-chunk-${start + i}`,
        values: embeddings.data[i].embedding,
        metadata: {
          text: chunk,
          ...metadata,
        },
      }))
    )
  }

  console.log(`Ingested ${chunks.length} chunks for doc ${metadata.docId}`)
}

// Example: ingest a product FAQ (top-level await requires an ES module)
await ingestDocument(
  "Your FAQ content here...",
  { docId: 'faq-v2', source: 'product-docs', userId: 'global' }
)
```

The React Native Client: Streaming RAG Answers
Now the mobile app side: a custom hook that streams RAG answers token by token for a ChatGPT-like experience:
```typescript
import { useState, useCallback } from 'react'

const RAG_API_URL = 'https://your-api.vercel.app/api/rag-query'

export function useRAG() {
  const [answer, setAnswer] = useState('')
  // Not populated yet: the API above streams only the answer, not its source chunks
  const [sources] = useState<string[]>([])
  const [isLoading, setIsLoading] = useState(false)
  const [error, setError] = useState<string | null>(null)

  const query = useCallback(async (question: string, userId: string) => {
    setIsLoading(true)
    setAnswer('')
    setError(null)
    try {
      // NOTE: reading response.body needs a streaming-capable fetch. React Native's
      // built-in fetch may not expose it; expo/fetch is one option in recent Expo SDKs
      const response = await fetch(RAG_API_URL, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ query: question, userId }),
      })
      if (!response.ok) throw new Error(`RAG API error (${response.status})`)
      if (!response.body) throw new Error('No response body')

      // Stream tokens as they arrive
      const reader = response.body.getReader()
      const decoder = new TextDecoder()
      let buffer = ''

      while (true) {
        const { done, value } = await reader.read()
        if (done) break
        buffer += decoder.decode(value, { stream: true })

        // The API's toReadableStream() sends one JSON completion chunk per line
        // (not SSE "data:" lines). Keep any trailing partial line in the buffer
        // until the rest of it arrives
        const lines = buffer.split('\n')
        buffer = lines.pop() ?? ''
        for (const line of lines) {
          if (!line.trim()) continue
          try {
            const parsed = JSON.parse(line)
            const token = parsed.choices?.[0]?.delta?.content
            if (token) setAnswer(prev => prev + token)
          } catch {
            // Skip malformed lines
          }
        }
      }
    } catch (err) {
      setError(err instanceof Error ? err.message : 'Query failed')
    } finally {
      setIsLoading(false)
    }
  }, [])

  return { query, answer, sources, isLoading, error }
}
```

Choosing a Vector Database for Mobile Apps
| Database | Best For | Free Tier | Latency | Verdict |
|---|---|---|---|---|
| Pinecone | Production, large scale | Yes (100k vectors) | 50–100ms | Best for scale |
| Supabase pgvector | Already using Supabase/Postgres | Yes | 30–80ms | Best for simplicity |
| Weaviate | Complex filtering + hybrid search | Yes (cloud sandbox) | 40–90ms | Best for filtering |
| Qdrant | Self-hosted, GDPR strict | Self-host only | 20–60ms | Best for privacy |
| SQLite-vec (on-device) | Offline/hybrid RAG | Free (open source) | <5ms | Best for offline |
Our recommendation: Start with Supabase pgvector if you're already using Supabase. Migrate to Pinecone when you exceed 500k vectors or need sub-50ms retrieval at scale.
Performance Optimization for Mobile RAG
- Cache embeddings client-side: Store recently queried question embeddings in AsyncStorage. If the user asks the same question twice, you can skip re-embedding (saves 50–100ms), provided your API accepts a precomputed query vector
- Pre-warm the vector index: Make a lightweight "ping" query on app launch to warm Pinecone's serverless index before the user's first real query
- Cap chunk size at 512 tokens: Larger chunks degrade retrieval precision significantly. Smaller chunks (256 tokens) with overlap often outperform large ones
- Use text-embedding-3-small: It's roughly 6× cheaper than text-embedding-3-large at OpenAI's list prices at the time of writing, with only a marginal quality difference for retrieval tasks
- Set a relevance threshold: Filter out chunks with cosine similarity below 0.75. Returning low-quality context makes answers worse, not better
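The chunk-size advice above is easier to reason about with a concrete splitter. The sketch below is an illustration under a loud assumption: it approximates tokens with whitespace-separated words, which is not how OpenAI tokenizers count, and `chunkWords` is a hypothetical helper rather than a library function. It shows how overlap makes neighboring chunks share their boundary text.

```typescript
// Split text into windows of `chunkSize` words; each window repeats the last
// `overlap` words of the previous one. Words are a rough stand-in for tokens.
function chunkWords(text: string, chunkSize: number, overlap: number): string[] {
  if (chunkSize <= 0) throw new Error('chunkSize must be positive')
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error('overlap must be at least 0 and smaller than chunkSize')
  }
  const words = text.split(/\s+/).filter(w => w.length > 0)
  const step = chunkSize - overlap
  const chunks: string[] = []
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(' '))
    if (start + chunkSize >= words.length) break // this window reached the end
  }
  return chunks
}

const sample = 'one two three four five six seven eight nine ten'
console.log(chunkWords(sample, 4, 1))
// → [ 'one two three four', 'four five six seven', 'seven eight nine ten' ]
```

With an overlap of one word, "four" and "seven" appear in two chunks each, so a sentence that straddles a boundary is still retrievable from at least one chunk. For real token counts, swap the word split for a tokenizer that matches your embedding model.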
Need a Production RAG Implementation?
CasaInnov builds complete RAG pipelines, from document ingestion to streaming mobile UI. We've shipped RAG features for healthcare, legal, and enterprise clients.