Why Run Llama 3 On-Device in React Native?
Running Llama 3 directly on a user's mobile device (no server, no API key, no internet) is now possible in production React Native apps. With the right quantization and a library like llama.rn, you get time-to-first-token in the 50–150ms range on modern iPhones and high-end Android devices, complete data privacy, and zero per-query API costs.
Meta released Llama 3 in April 2024 and followed it with Llama 3.2 in September 2024, whose 1B and 3B Instruct variants are designed for on-device deployment. Quantized to 4-bit precision, they fit comfortably within the memory constraints of flagship smartphones, making them the go-to choice for privacy-first AI features in mobile apps.
Below: model selection, the llama.rn integration walkthrough, performance tuning, and an honest look at when on-device beats cloud and when it doesn't.
Cloud AI vs. On-Device AI: When to Choose Each
Before diving into code, understand when on-device Llama 3 is the right architectural choice, and when it isn't.
| Dimension | Cloud LLM (GPT-4o, Claude) | On-Device Llama 3 |
|---|---|---|
| Latency (TTFT) | 800ms – 3s | 50 – 150ms |
| Privacy | Data leaves the device | 100% local, never shared |
| Offline support | No | Yes, fully offline |
| Per-query cost | $0.002 – $0.06 / 1K tokens | $0 (one-time model download) |
| Model intelligence | Frontier-level | Good (smaller models) |
| App size impact | None | +800MB – 3GB for model |
| Battery usage | Low (network only) | High during inference |
Choose on-device when: your users handle sensitive data (health records, private notes, legal docs), you need offline functionality, your use case involves short, repeated queries (autocomplete, local summarization, on-device search), or you want to eliminate API costs for high-volume features.
Stick with cloud when: you need frontier-level reasoning (complex code, multi-step analysis), your target audience includes low-end Android devices with limited RAM, or the model download size is a deal-breaker for your distribution strategy.
Choosing the Right Llama 3 Model for Mobile
Meta provides several Llama 3 model sizes. For mobile, you're working in the 1B–8B parameter range. Here's how to pick:
Llama 3.2 1B Instruct (Recommended for most apps)
At Q4_K_M quantization, the 1B model is ~600MB and runs at 20–40 tokens/second on an iPhone 15. Classification, structured extraction, and short Q&A work well. Complex reasoning is where it starts to stumble. For most consumer apps targeting devices beyond just the latest flagships, start here.
Llama 3.2 3B Instruct
The 3B Instruct model at Q4_K_M is ~1.8GB. It runs at 12–20 tokens/second on flagship devices (iPhone 15 Pro, Pixel 8 Pro) and delivers meaningfully better output quality for complex tasks like multi-turn conversation and longer-form generation. Target this if your app requires richer reasoning and you can handle the larger download.
Llama 3.1 8B Instruct
The 8B model at Q4_K_M is ~4.9GB. It only fits on devices with 6GB+ RAM, such as the iPhone 15 Pro Max and a handful of Samsung flagships. The reasoning is noticeably better, but a 4.9GB download will kill your install conversion. Niche professional tools only.
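To make the trade-off concrete, here is a minimal sketch of a model picker. The download sizes are the Q4_K_M figures quoted above; the minimum-RAM thresholds for the 1B and 3B models are illustrative assumptions, and `pickModel` is a hypothetical helper, not part of llama.rn. The device RAM would have to come from a native module (for example a device-info library).

```typescript
interface ModelOption {
  name: string
  downloadGB: number // approximate Q4_K_M file size quoted in this article
  minRamGB: number   // assumed minimum device RAM (illustrative, tune per app)
}

// Ordered from largest to smallest so the first match is the most capable.
const MODEL_OPTIONS: ModelOption[] = [
  { name: 'Llama-3.1-8B-Instruct', downloadGB: 4.9, minRamGB: 6 },
  { name: 'Llama-3.2-3B-Instruct', downloadGB: 1.8, minRamGB: 4 },
  { name: 'Llama-3.2-1B-Instruct', downloadGB: 0.6, minRamGB: 3 },
]

// Returns the largest model that fits both the device RAM and the download
// budget, or null when nothing fits (a signal to fall back to the cloud).
function pickModel(deviceRamGB: number, maxDownloadGB: number): ModelOption | null {
  for (const option of MODEL_OPTIONS) {
    if (deviceRamGB >= option.minRamGB && option.downloadGB <= maxDownloadGB) {
      return option
    }
  }
  return null
}
```

With a 1GB download budget this always lands on the 1B model, which matches the "start here" advice for most consumer apps.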
The Best Library: llama.rn
llama.rn is the React Native library for running GGUF models locally. It wraps llama.cpp, the C++ inference engine that most on-device tooling is built on, and exposes a JavaScript API with streaming, context management, and Metal/GPU acceleration on iOS.
- iOS: Metal GPU acceleration, so models run on the GPU rather than only the CPU
- Android: CPU inference with OpenBLAS; GPU support on Android is improving but not production-stable at time of writing
- Streaming: Built-in token streaming for real-time, ChatGPT-like UX
- Context management: Load once, reuse across multiple requests without reloading the model
- Format: Supports GGUF files (the standard format for quantized models from Hugging Face)
Installation and Setup
1. Install the package
npm install llama.rn
# or
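yarn add llama.rn
2. iOS: Link native code and enable Metal
cd ios && pod install
Downloading the model over HTTPS at runtime needs no extra Info.plist entry. NSLocalNetworkUsageDescription is only required if the app talks to devices on the user's local network, for example to fetch the model from a LAN server. In that case, add it to ios/YourApp/Info.plist:
<key>NSLocalNetworkUsageDescription</key>
<string>Used to download AI model files for on-device processing</string>
3. Android: Configure packaging and large heap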
In android/app/build.gradle, resolve duplicate native library conflicts at packaging time:
android {
    defaultConfig {
        // ...
    }
    packagingOptions {
        // Avoid conflicts with llama.cpp native libs
        pickFirst '**/libllama.so'
    }
}
Also add android:largeHeap="true" to the <application> tag in AndroidManifest.xml for extra headroom during model loading. It raises only the managed (Java/Kotlin) heap limit; the model weights live in native memory, so total device RAM remains the real constraint.
Downloading the Model at Runtime
Never bundle GGUF model files inside your app binary; they're too large for App Store review. Instead, download the model to the device's document directory on first launch. Here's a complete model manager with progress tracking:
import * as FileSystem from 'expo-file-system'
import { useState, useCallback } from 'react'
const MODEL_URL =
'https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf'
const MODEL_FILENAME = 'Llama-3.2-1B-Instruct-Q4_K_M.gguf'
const MODEL_PATH = FileSystem.documentDirectory + MODEL_FILENAME
export function useModelDownload() {
const [progress, setProgress] = useState(0)
const [isDownloading, setIsDownloading] = useState(false)
const [isReady, setIsReady] = useState(false)
const checkModel = useCallback(async () => {
const info = await FileSystem.getInfoAsync(MODEL_PATH)
setIsReady(info.exists)
return info.exists
}, [])
const downloadModel = useCallback(async () => {
setIsDownloading(true)
setProgress(0)
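const callback: FileSystem.DownloadProgressCallback = ({ totalBytesWritten, totalBytesExpectedToWrite }) => {
// The expected size is not positive when the server sends no Content-Length
if (totalBytesExpectedToWrite > 0) {
setProgress(totalBytesWritten / totalBytesExpectedToWrite)
}
}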
const downloadResumable = FileSystem.createDownloadResumable(
MODEL_URL,
MODEL_PATH,
{},
callback
)
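try {
const result = await downloadResumable.downloadAsync()
// Treat anything other than HTTP 200 as a failed download
if (!result || result.status !== 200) {
throw new Error(`Unexpected download status: ${result?.status}`)
}
setIsReady(true)
} catch (error) {
console.error('Model download failed:', error)
// Remove any partial file so checkModel() doesn't report it as ready
await FileSystem.deleteAsync(MODEL_PATH, { idempotent: true })
} finally {
setIsDownloading(false)
}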
}, [])
return { progress, isDownloading, isReady, checkModel, downloadModel, MODEL_PATH }
}
Core Integration: Loading and Running Llama 3
Here's the complete hook for initializing the Llama 3 context and running streaming inference in React Native:
import { initLlama, LlamaContext } from 'llama.rn'
import { useState, useRef, useCallback } from 'react'
import { Platform } from 'react-native'
export function useLlama3(modelPath: string) {
const contextRef = useRef<LlamaContext | null>(null)
const [isLoading, setIsLoading] = useState(false)
const [isGenerating, setIsGenerating] = useState(false)
// Load model into memory, do this once on app start
const loadModel = useCallback(async () => {
if (contextRef.current) return // Already loaded
setIsLoading(true)
try {
contextRef.current = await initLlama({
model: modelPath,
use_mlock: true, // Lock model in RAM (prevents swapping)
n_ctx: 2048, // Context window size
n_threads: 4, // CPU threads (a reasonable default; tune per device)
n_gpu_layers: Platform.OS === 'ios' ? 99 : 0, // Offload all layers to Metal on iOS; CPU-only on Android
})
console.log('Llama 3 loaded successfully')
} catch (err) {
console.error('Failed to load Llama 3:', err)
} finally {
setIsLoading(false)
}
}, [modelPath])
// Stream a completion response token by token
const complete = useCallback(
async (
prompt: string,
onToken: (token: string) => void,
systemPrompt = 'You are a helpful assistant. Be concise.'
) => {
if (!contextRef.current || isGenerating) return
setIsGenerating(true)
const messages = [
{ role: 'system', content: systemPrompt },
{ role: 'user', content: prompt },
]
try {
await contextRef.current.completion(
{
messages,
n_predict: 512, // Max tokens to generate
temperature: 0.7,
top_p: 0.9,
stop: ['<|eot_id|>', '<|end_of_text|>'],
},
data => {
if (data.token && !data.token.includes('<|')) {
onToken(data.token)
}
}
)
} finally {
setIsGenerating(false)
}
},
[isGenerating]
)
// Release model from memory when done
const releaseModel = useCallback(async () => {
await contextRef.current?.release()
contextRef.current = null
}, [])
return { loadModel, complete, releaseModel, isLoading, isGenerating }
}
Building the Chat UI with Streaming
Here's a minimal but shippable chat component that uses the hook above to deliver a real-time streaming experience:
import React, { useState, useEffect } from 'react'
import { View, TextInput, Text, ScrollView, Pressable, ActivityIndicator } from 'react-native'
import { useLlama3 } from './useLlama3'
import { useModelDownload } from './useModelDownload'
export default function AIChat() {
const { MODEL_PATH, isReady, checkModel, downloadModel, isDownloading, progress } = useModelDownload()
const { loadModel, complete, isLoading, isGenerating } = useLlama3(MODEL_PATH)
const [input, setInput] = useState('')
const [response, setResponse] = useState('')
useEffect(() => {
checkModel()
}, [checkModel])
useEffect(() => {
if (isReady) loadModel()
}, [isReady, loadModel])
const handleSend = async () => {
if (!input.trim() || isGenerating) return
const userInput = input.trim()
setInput('')
setResponse('')
await complete(userInput, token => {
setResponse(prev => prev + token)
})
}
if (!isReady) {
return (
<View style={{ flex: 1, alignItems: 'center', justifyContent: 'center', padding: 24 }}>
<Text style={{ fontSize: 18, fontWeight: '600', marginBottom: 12 }}>
Llama 3 Model Required
</Text>
<Text style={{ color: '#666', textAlign: 'center', marginBottom: 24 }}>
Download the AI model (~600MB) to enable on-device AI. Works offline after download.
</Text>
{isDownloading ? (
<View style={{ width: '100%' }}>
<View style={{ height: 8, backgroundColor: '#eee', borderRadius: 4, overflow: 'hidden' }}>
<View style={{ height: '100%', width: `${progress * 100}%`, backgroundColor: '#6366f1' }} />
</View>
<Text style={{ textAlign: 'center', marginTop: 8, color: '#666' }}>
{Math.round(progress * 100)}% downloaded...
</Text>
</View>
) : (
<Pressable
onPress={downloadModel}
style={{ backgroundColor: '#6366f1', paddingHorizontal: 24, paddingVertical: 12, borderRadius: 8 }}
>
<Text style={{ color: '#fff', fontWeight: '600' }}>Download Model</Text>
</Pressable>
)}
</View>
)
}
if (isLoading) {
return (
<View style={{ flex: 1, alignItems: 'center', justifyContent: 'center' }}>
<ActivityIndicator size="large" />
<Text style={{ marginTop: 12, color: '#666' }}>Loading Llama 3...</Text>
</View>
)
}
return (
<View style={{ flex: 1, padding: 16 }}>
<ScrollView style={{ flex: 1, marginBottom: 12 }}>
{response ? (
<View style={{ backgroundColor: '#f5f5f5', padding: 16, borderRadius: 12 }}>
<Text style={{ lineHeight: 22 }}>{response}</Text>
{isGenerating && <ActivityIndicator size="small" style={{ marginTop: 8 }} />}
</View>
) : null}
</ScrollView>
<View style={{ flexDirection: 'row', gap: 8 }}>
<TextInput
value={input}
onChangeText={setInput}
placeholder="Ask Llama 3 anything..."
style={{ flex: 1, borderWidth: 1, borderColor: '#ddd', borderRadius: 8, padding: 12 }}
onSubmitEditing={handleSend}
/>
<Pressable
onPress={handleSend}
disabled={isGenerating}
style={{
backgroundColor: isGenerating ? '#a5b4fc' : '#6366f1',
paddingHorizontal: 16,
borderRadius: 8,
justifyContent: 'center',
}}
>
<Text style={{ color: '#fff', fontWeight: '600' }}>Send</Text>
</Pressable>
</View>
</View>
)
}
Performance Tuning for Production
Getting the model to load is 20% of the work. Here's what actually matters in production:
1. Use the Right Quantization
Not all quantization levels are equal. For mobile, the sweet spot is Q4_K_M, which offers the best quality-to-size ratio. Avoid Q2_K (too lossy for real apps) and Q8_0 (too large for a marginal quality gain).
| Quantization | 1B Model Size | Quality | Speed (iPhone 15) | Recommendation |
|---|---|---|---|---|
| Q2_K | ~380MB | Poor | 45 tok/s | Avoid |
| Q4_K_M | ~600MB | Good | 35 tok/s | Best for most apps |
| Q5_K_M | ~730MB | Very Good | 28 tok/s | Premium quality |
| Q8_0 | ~1.1GB | Near-lossless | 20 tok/s | Overkill for mobile |
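To see what those throughput numbers mean for a user waiting on a response, here is a small back-of-the-envelope sketch. It only divides output length by the tokens-per-second figures from the table above; `estimateGenerationSeconds` is a hypothetical helper, and the estimate ignores model load and prompt prefill time.

```typescript
// Throughput figures copied from the table above (1B model, iPhone 15).
const TOKENS_PER_SECOND: Record<string, number> = {
  Q2_K: 45,
  Q4_K_M: 35,
  Q5_K_M: 28,
  Q8_0: 20,
}

// Rough time to generate `outputTokens` tokens at a given quantization.
// Excludes model loading and prompt prefill.
function estimateGenerationSeconds(quant: string, outputTokens: number): number {
  const tps = TOKENS_PER_SECOND[quant]
  if (tps === undefined) {
    throw new Error(`Unknown quantization: ${quant}`)
  }
  return outputTokens / tps
}
```

For a full 512-token response, that works out to about 14.6 seconds at Q4_K_M versus 25.6 seconds at Q8_0, which is why capping output length matters as much as picking the quantization.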
2. Preload the Model on App Start
Model load time (2–5 seconds) is the biggest UX problem in on-device AI. Load the model eagerly in the background when the app launches rather than waiting for the user's first interaction with the AI feature. The simplest way is to call loadModel() in your root component's useEffect.
3. Keep the Context Alive
initLlama() is expensive. Once you initialize the context, keep the LlamaContext object in a ref or global store (Zustand/Redux). Never reinitialize per request. Only call release() when the user navigates completely away from your AI feature.
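One way to enforce "initialize once" outside of a React hook is a lazy singleton around the loader. The sketch below is framework-agnostic; `createLazySingleton` is a hypothetical helper, and wrapping `initLlama` with it is an assumption about how you might wire it up, not something llama.rn provides.

```typescript
// Wrap an expensive async loader so it runs at most once, even when several
// callers request the resource at the same time.
function createLazySingleton<T>(loader: () => Promise<T>) {
  let pending: Promise<T> | null = null

  return {
    // Returns the shared promise, starting the load on the first call.
    get(): Promise<T> {
      if (pending !== null) return pending
      const started = loader().catch(err => {
        pending = null // allow a retry after a failed load
        throw err
      })
      pending = started
      return started
    },
    // Forget the cached value; releasing the resource is the caller's job.
    reset(): void {
      pending = null
    },
  }
}

// Hypothetical usage with llama.rn:
//   const llama = createLazySingleton(() => initLlama({ model: modelPath }))
//   const context = await llama.get()
```

Because concurrent callers share one promise, a preload on app start and a user tapping "Send" during loading cannot trigger two model loads.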
4. Cap Context Window for Speed
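A smaller n_ctx (context window) directly reduces memory usage, because the KV cache grows with it, and speeds up inference. For single-turn features with short inputs (brief summarization, classification), set n_ctx: 512. Only use n_ctx: 2048+ for true multi-turn chat or long inputs. Keep in mind that the prompt and the generated tokens share this window.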
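The memory effect can be estimated with simple arithmetic. The sketch below computes KV cache size from the model shape; the layer, head, and dimension values are what I believe Llama 3.2 1B uses with an f16 cache, so treat them as assumptions and check the model's config. Both function names are hypothetical.

```typescript
interface KvCacheShape {
  nLayers: number
  nKvHeads: number
  headDim: number
  bytesPerElement: number // 2 for an f16 cache
}

// Assumed to match Llama 3.2 1B; verify against the model's config.
const LLAMA_3_2_1B: KvCacheShape = {
  nLayers: 16,
  nKvHeads: 8,
  headDim: 64,
  bytesPerElement: 2,
}

// One key vector and one value vector per layer, per KV head, per position.
function kvCacheBytes(shape: KvCacheShape, nCtx: number): number {
  return 2 * shape.nLayers * shape.nKvHeads * shape.headDim * shape.bytesPerElement * nCtx
}

// The prompt and the generated tokens have to fit in the same window.
function fitsContext(promptTokens: number, nPredict: number, nCtx: number): boolean {
  return promptTokens + nPredict <= nCtx
}
```

Under those assumptions the cache takes 64 MiB at n_ctx: 2048 and 16 MiB at n_ctx: 512. The second helper is a reminder that a 300-token prompt with n_predict: 512 does not fit in a 512-token window.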
5. Always Use GPU on iOS
Set n_gpu_layers: 99 to offload all model layers to the Metal GPU on iOS. This alone gives 2–3× faster inference vs CPU-only. It's enabled in our example above. On Android, leave it at 0 for now; Vulkan GPU inference in llama.cpp is still maturing.
Practical Use Cases and System Prompt Patterns
Llama 3 Instruct models respond well to clear, structured system prompts. Here are patterns for common mobile AI features:
Smart Text Summarizer
const systemPrompt = `You are a concise summarizer.
When given text, respond with a 2-3 sentence summary only.
Never add commentary or ask questions. Output the summary immediately.`
On-Device Private Journal Coach
const systemPrompt = `You are a supportive journaling coach.
Help users reflect on their thoughts with empathy.
Ask one follow-up question at the end of each response.
Keep responses under 150 words.`
Structured Data Extractor
const systemPrompt = `Extract structured data from the user's input.
Always respond with valid JSON only, no markdown fencing.
Schema: { "name": string, "date": string, "amount": number, "category": string }
If a field is not found, use null.`
Using Llama 3 with Expo
llama.rn requires native code, so it won't work with Expo Go. You'll need a custom development build:
# Create a development build
npx expo install llama.rn
npx expo run:ios # or run:android
# For EAS Build (recommended for production)
eas build --platform ios --profile development
The model file management approach stays the same: download to FileSystem.documentDirectory using expo-file-system and pass the local path to initLlama().
Common Pitfalls and How to Avoid Them
OOM Crashes on Android
The 3B model at Q4 needs ~2.5GB RAM. Android will kill the app if memory pressure is too high. Use the 1B model for broad Android compatibility. Adding android:largeHeap="true" helps with the managed heap, but not with the native memory the model itself occupies.
Slow First Response
Prefill (processing the input prompt) is slow for long inputs. Keep system prompts under 200 tokens and don't send long conversation histories; summarize older turns instead.
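Summarizing older turns needs an extra model call, so a simpler first step is to drop the oldest turns once a token budget is exceeded. The sketch below does only that (it is a swapped-in, cheaper technique, not summarization). The 4-characters-per-token estimate is a rough assumption for English text, and `trimHistory` is a hypothetical helper.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant'
  content: string
}

// Rough heuristic (assumption): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

// Keep every system message plus the most recent turns that fit the budget.
function trimHistory(messages: ChatMessage[], maxTokens: number): ChatMessage[] {
  const system = messages.filter(m => m.role === 'system')
  const turns = messages.filter(m => m.role !== 'system')

  let used = 0
  for (const m of system) {
    used += estimateTokens(m.content)
  }

  // Walk backwards from the newest turn and stop at the first one that overflows.
  const kept: ChatMessage[] = []
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = estimateTokens(turns[i].content)
    if (used + cost > maxTokens) break
    kept.unshift(turns[i])
    used += cost
  }
  return system.concat(kept)
}
```

For an exact budget you would count tokens with the model's own tokenizer instead of the character heuristic.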
Model Not Found Error
Always call FileSystem.getInfoAsync(path) before initLlama(). If the model doesn't exist, trigger the download flow; never assume it's cached.
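An existence check alone can be fooled by a partially downloaded file, so it may be worth comparing the file size as well. This is a minimal sketch: `FileInfoLike` mirrors the `exists` and `size` fields that getInfoAsync returns, while `checkModelFile` and the `expectedBytes` value are hypothetical and would need to come from your own release metadata.

```typescript
// The subset of file-info fields this check needs.
interface FileInfoLike {
  exists: boolean
  size?: number
}

type ModelStatus = 'ready' | 'missing' | 'incomplete'

// Classify the model file before handing its path to initLlama().
function checkModelFile(info: FileInfoLike, expectedBytes: number): ModelStatus {
  if (!info.exists) return 'missing'
  // A missing or too-small size suggests an interrupted download.
  if (info.size === undefined || info.size < expectedBytes) return 'incomplete'
  return 'ready'
}
```

A 'missing' or 'incomplete' result would route the user back to the download flow instead of letting model initialization fail on a truncated file.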
App Store Rejection
Apple is likely to reject apps that bundle large model files in the IPA, and the bloated install size hurts conversion either way. Always download models post-install. Make the download optional or offer a cloud fallback so users aren't forced to download ~600MB before using the app.
Recommended Architecture: Hybrid On-Device + Cloud
The best production pattern isn't a binary choice between cloud and on-device; it's a smart hybrid:
- Short queries, sensitive data, offline mode → Llama 3 on-device
- Complex reasoning, longer outputs, first-time users → Cloud (GPT-4o, Claude)
- Users without the model downloaded → Transparent cloud fallback
- Pro users / power features → Download prompt + on-device premium experience
A strategy pattern that selects the backend based on device RAM, network availability, and model download state handles this cleanly. Users on older hardware always hit the cloud. Power users who opted into the model download get local inference with no API costs.
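The selection logic described above can be sketched as a small pure function. Everything here is illustrative: `selectBackend`, the `DeviceState` shape, and the 3GB RAM threshold are assumptions, and a real app would feed it values from its own device and network checks.

```typescript
type Backend = 'on-device' | 'cloud' | 'unavailable'

interface DeviceState {
  ramGB: number
  isOnline: boolean
  modelDownloaded: boolean
}

// Assumed minimum RAM for local inference; tune for the model you ship.
const MIN_RAM_GB_FOR_LOCAL = 3

function selectBackend(state: DeviceState, needsComplexReasoning: boolean): Backend {
  const canRunLocally = state.modelDownloaded && state.ramGB >= MIN_RAM_GB_FOR_LOCAL

  // Heavy reasoning goes to the cloud whenever a connection exists.
  if (needsComplexReasoning && state.isOnline) return 'cloud'
  // Otherwise prefer local inference: private, offline-capable, no API cost.
  if (canRunLocally) return 'on-device'
  // No usable local model: fall back to the cloud transparently.
  if (state.isOnline) return 'cloud'
  return 'unavailable'
}
```

Keeping the decision in one pure function makes the routing easy to unit test and to adjust as device support changes.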