llama.rn Guide: Run Llama 3 On-Device in React Native + Expo

Why Run Llama 3 On-Device in React Native?

Running Llama 3 directly on a user's mobile device, with no server, no API key, and no internet connection, is now possible in production React Native apps. With the right quantization and a library like llama.rn, you can get a time-to-first-token of roughly 50–150ms on short prompts on modern iPhones and high-end Android devices, complete data privacy, and zero per-query API costs.

Meta released Llama 3 in April 2024 and followed it with Llama 3.2 in September 2024, whose 1B and 3B Instruct variants are designed specifically for on-device deployment. Quantized to 4-bit precision, they fit comfortably within the memory constraints of flagship smartphones, making them the go-to choice for privacy-first AI features in mobile apps.

Below: model selection, the llama.rn integration walkthrough, performance tuning, and an honest look at when on-device beats cloud and when it doesn't.

Cloud AI vs. On-Device AI: When to Choose Each

Before diving into code, understand when on-device Llama 3 is the right architectural choice, and when it isn't.

Dimension	Cloud LLM (GPT-4o, Claude)	On-Device Llama 3
Latency (TTFT)	800ms – 3s	50 – 150ms
Privacy	Data leaves the device	100% local, never shared
Offline support	No	Yes, fully offline
Per-query cost	$0.002 – $0.06 / 1K tokens	$0 (one-time model download)
Model intelligence	Frontier-level	Good (smaller models)
App size impact	None	+800MB – 3GB for model
Battery usage	Low (network only)	High during inference

Choose on-device when: your users handle sensitive data (health records, private notes, legal docs), you need offline functionality, your use case involves short, repeated queries (autocomplete, local summarization, on-device search), or you want to eliminate API costs for high-volume features.

Stick with cloud when: you need frontier-level reasoning (complex code, multi-step analysis), your target audience includes low-end Android devices with limited RAM, or the model download size is a deal-breaker for your distribution strategy.

Choosing the Right Llama 3 Model for Mobile

Meta provides several Llama 3 model sizes. For mobile, you're working in the 1B–8B parameter range. Here's how to pick:

Llama 3.2 1B Instruct (Recommended for most apps)

At Q4_K_M quantization, the 1B model is ~600MB and runs at 20–40 tokens/second on an iPhone 15. Classification, structured extraction, and short Q&A work well. Complex reasoning is where it starts to stumble. For most consumer apps targeting devices beyond just the latest flagships, start here.

Llama 3.2 3B Instruct

The 3B Instruct model at Q4_K_M is ~1.8GB. It runs at 12–20 tokens/second on flagship devices (iPhone 15 Pro, Pixel 8 Pro) and delivers meaningfully better output quality for complex tasks like multi-turn conversation and longer-form generation. Target this if your app requires richer reasoning and you can handle the larger download.

Llama 3.1 8B Instruct

The 8B model at Q4_K_M is ~4.9GB. Once you add the KV cache and the operating system's own memory needs, it realistically only fits on devices with 8GB+ RAM, such as the iPhone 15 Pro and Pro Max and a handful of Android flagships. The reasoning is noticeably better, but a 4.9GB download will kill your install conversion. Niche professional tools only.
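
The size guidance above can be folded into a small selection helper. This is only a sketch: the RAM thresholds and the variant names are illustrative assumptions rather than values from Meta or llama.rn, and the total-RAM figure would have to come from a device-info library of your choice.

```typescript
// Hypothetical helper: choose a Llama 3 variant from total device RAM.
// The thresholds are deliberately conservative assumptions; tune them on real devices.
type ModelVariant = 'llama-3.2-1b' | 'llama-3.2-3b' | 'llama-3.1-8b' | 'cloud-only'

function pickModelVariant(totalRamGb: number, allowLargeDownload = false): ModelVariant {
  // The 8B model is only considered when the user accepts a multi-gigabyte download
  if (totalRamGb >= 8 && allowLargeDownload) return 'llama-3.1-8b'
  if (totalRamGb >= 6) return 'llama-3.2-3b'
  if (totalRamGb >= 4) return 'llama-3.2-1b'
  // Too little memory for comfortable local inference
  return 'cloud-only'
}
```

Defaulting allowLargeDownload to false keeps the 8B model opt-in, which matches the advice to reserve it for niche professional tools.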

The Best Library: llama.rn

llama.rn is the React Native library for running GGUF models locally. It wraps llama.cpp, the C++ inference engine that most on-device tooling is built on, and exposes a JavaScript API with streaming, context management, and Metal/GPU acceleration on iOS.

iOS: Metal GPU acceleration, so models run on the GPU rather than only the CPU (llama.cpp's Metal backend does not use the Neural Engine)
Android: CPU inference by default; GPU support for Android is improving but not production-stable at time of writing
Streaming: Built-in token streaming for real-time, ChatGPT-like UX
Context management: Load once, reuse across multiple requests without reloading the model
Format: Supports GGUF files (the standard format for quantized models from Hugging Face)

Installation and Setup

1. Install the package

bash

npm install llama.rn
# or
yarn add llama.rn

2. iOS: Link native code and enable Metal

bash

cd ios && pod install

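Downloading the model over HTTPS into the app's own documents directory needs no extra Info.plist entry. You only need a local-network usage description in ios/YourApp/Info.plist if you fetch the model from a machine on the local network (for example, a development server on your LAN) at runtime: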

xml

<key>NSLocalNetworkUsageDescription</key>
<string>Used to download AI model files for on-device processing</string>

3. Android: Configure large heap

In android/app/build.gradle, make sure duplicate llama.cpp native libraries don't break packaging:

gradle

android {
  defaultConfig {
    // ...
  }

  packagingOptions {
    // Avoid conflicts with llama.cpp native libs
    pickFirst '**/libllama.so'
  }
}

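Also add android:largeHeap="true" to the <application> tag in your AndroidManifest.xml. Note that this only raises the limit of the managed (Java/Kotlin) heap; llama.cpp allocates the model in native memory, so largeHeap is no guarantee against the OS killing the app during model loading.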

Downloading the Model at Runtime

Don't bundle GGUF model files inside your app binary: at hundreds of megabytes or more, they bloat the install size and complicate store distribution. Instead, download the model to the device's document directory on first launch. Here's a model manager hook with progress tracking:

typescript

import * as FileSystem from 'expo-file-system'
import { useState, useCallback } from 'react'

const MODEL_URL =
  'https://huggingface.co/bartowski/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q4_K_M.gguf'

const MODEL_FILENAME = 'Llama-3.2-1B-Instruct-Q4_K_M.gguf'
const MODEL_PATH = FileSystem.documentDirectory + MODEL_FILENAME

export function useModelDownload() {
  const [progress, setProgress] = useState(0)
  const [isDownloading, setIsDownloading] = useState(false)
  const [isReady, setIsReady] = useState(false)

  const checkModel = useCallback(async () => {
    const info = await FileSystem.getInfoAsync(MODEL_PATH)
    setIsReady(info.exists)
    return info.exists
  }, [])

  const downloadModel = useCallback(async () => {
    setIsDownloading(true)
    setProgress(0)

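    const callback: FileSystem.DownloadProgressCallback = ({ totalBytesWritten, totalBytesExpectedToWrite }) => {
      // The expected size may be unknown (<= 0) when the server sends no Content-Length
      if (totalBytesExpectedToWrite > 0) {
        setProgress(totalBytesWritten / totalBytesExpectedToWrite)
      }
    }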

    const downloadResumable = FileSystem.createDownloadResumable(
      MODEL_URL,
      MODEL_PATH,
      {},
      callback
    )

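    try {
      const result = await downloadResumable.downloadAsync()
      // downloadAsync resolves with undefined if the download was paused or cancelled
      if (!result || result.status !== 200) {
        throw new Error(`Unexpected download result: ${result?.status ?? 'cancelled'}`)
      }
      setIsReady(true)
    } catch (error) {
      console.error('Model download failed:', error)
      // Remove any partial file so checkModel() doesn't mistake it for a complete model
      await FileSystem.deleteAsync(MODEL_PATH, { idempotent: true })
    } finally {
      setIsDownloading(false)
    }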
  }, [])

  return { progress, isDownloading, isReady, checkModel, downloadModel, MODEL_PATH }
}

Core Integration: Loading and Running Llama 3

Here's the complete hook for initializing the Llama 3 context and running streaming inference in React Native:

typescript

import { initLlama, LlamaContext } from 'llama.rn'
import { useState, useRef, useCallback } from 'react'
import { Platform } from 'react-native'

export function useLlama3(modelPath: string) {
  const contextRef = useRef<LlamaContext | null>(null)
  const [isLoading, setIsLoading] = useState(false)
  const [isGenerating, setIsGenerating] = useState(false)

  // Load model into memory, do this once on app start
  const loadModel = useCallback(async () => {
    if (contextRef.current) return // Already loaded

    setIsLoading(true)
    try {
      contextRef.current = await initLlama({
        model: modelPath,
        use_mlock: true,      // Lock model in RAM (prevents swapping)
        n_ctx: 2048,          // Context window size
        n_threads: 4,         // CPU threads (a reasonable default; tune per device)
        // Offload all layers to Metal on iOS; stay on CPU on Android for now
        n_gpu_layers: Platform.OS === 'ios' ? 99 : 0,
      })
      console.log('Llama 3 loaded successfully')
    } catch (err) {
      console.error('Failed to load Llama 3:', err)
    } finally {
      setIsLoading(false)
    }
  }, [modelPath])

  // Stream a completion response token by token
  const complete = useCallback(
    async (
      prompt: string,
      onToken: (token: string) => void,
      systemPrompt = 'You are a helpful assistant. Be concise.'
    ) => {
      if (!contextRef.current || isGenerating) return

      setIsGenerating(true)
      const messages = [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: prompt },
      ]

      try {
        await contextRef.current.completion(
          {
            messages,
            n_predict: 512,     // Max tokens to generate
            temperature: 0.7,
            top_p: 0.9,
            stop: ['<|eot_id|>', '<|end_of_text|>'],
          },
          data => {
            if (data.token && !data.token.includes('<|')) {
              onToken(data.token)
            }
          }
        )
      } finally {
        setIsGenerating(false)
      }
    },
    [isGenerating]
  )

  // Release model from memory when done
  const releaseModel = useCallback(async () => {
    await contextRef.current?.release()
    contextRef.current = null
  }, [])

  return { loadModel, complete, releaseModel, isLoading, isGenerating }
}
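
When you pass messages, the prompt is formatted with the model's chat template before inference. For reference, the basic Llama 3 Instruct layout looks like the string built below, which is why <|eot_id|> appears in the stop list. formatLlama3Prompt is a hypothetical helper for illustration and debugging, not part of llama.rn, and the exact output of the real template depends on what is embedded in the GGUF file (newer templates may add extra system text).

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string }

// Builds the basic Llama 3 Instruct prompt layout by hand (illustration only).
function formatLlama3Prompt(messages: ChatMessage[]): string {
  const turns = messages
    .map(m => `<|start_header_id|>${m.role}<|end_header_id|>\n\n${m.content.trim()}<|eot_id|>`)
    .join('')
  // The trailing assistant header cues the model to start its reply
  return `<|begin_of_text|>${turns}<|start_header_id|>assistant<|end_header_id|>\n\n`
}
```

Logging this string for a sample conversation is a quick way to see how many tokens of overhead each turn adds.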

Building the Chat UI with Streaming

Here's a minimal but shippable chat component that uses the hooks above to deliver a real-time streaming experience:

tsx

import React, { useState, useEffect } from 'react'
import { View, TextInput, Text, ScrollView, Pressable, ActivityIndicator } from 'react-native'
import { useLlama3 } from './useLlama3'
import { useModelDownload } from './useModelDownload'

export default function AIChat() {
  const { MODEL_PATH, isReady, checkModel, downloadModel, isDownloading, progress } = useModelDownload()
  const { loadModel, complete, isLoading, isGenerating } = useLlama3(MODEL_PATH)
  const [input, setInput] = useState('')
  const [response, setResponse] = useState('')

  useEffect(() => {
    checkModel()
  }, [])

  useEffect(() => {
    if (isReady) loadModel()
  }, [isReady])

  const handleSend = async () => {
    if (!input.trim() || isGenerating) return
    const userInput = input.trim()
    setInput('')
    setResponse('')

    await complete(userInput, token => {
      setResponse(prev => prev + token)
    })
  }

  if (!isReady) {
    return (
      <View style={{ flex: 1, alignItems: 'center', justifyContent: 'center', padding: 24 }}>
        <Text style={{ fontSize: 18, fontWeight: '600', marginBottom: 12 }}>
          Llama 3 Model Required
        </Text>
        <Text style={{ color: '#666', textAlign: 'center', marginBottom: 24 }}>
          Download the AI model (~600MB) to enable on-device AI. Works offline after download.
        </Text>
        {isDownloading ? (
          <View style={{ width: '100%' }}>
            <View style={{ height: 8, backgroundColor: '#eee', borderRadius: 4, overflow: 'hidden' }}>
              <View style={{ height: '100%', width: `${progress * 100}%`, backgroundColor: '#6366f1' }} />
            </View>
            <Text style={{ textAlign: 'center', marginTop: 8, color: '#666' }}>
              {Math.round(progress * 100)}% downloaded...
            </Text>
          </View>
        ) : (
          <Pressable
            onPress={downloadModel}
            style={{ backgroundColor: '#6366f1', paddingHorizontal: 24, paddingVertical: 12, borderRadius: 8 }}
          >
            <Text style={{ color: '#fff', fontWeight: '600' }}>Download Model</Text>
          </Pressable>
        )}
      </View>
    )
  }

  if (isLoading) {
    return (
      <View style={{ flex: 1, alignItems: 'center', justifyContent: 'center' }}>
        <ActivityIndicator size="large" />
        <Text style={{ marginTop: 12, color: '#666' }}>Loading Llama 3...</Text>
      </View>
    )
  }

  return (
    <View style={{ flex: 1, padding: 16 }}>
      <ScrollView style={{ flex: 1, marginBottom: 12 }}>
        {response ? (
          <View style={{ backgroundColor: '#f5f5f5', padding: 16, borderRadius: 12 }}>
            <Text style={{ lineHeight: 22 }}>{response}</Text>
            {isGenerating && <ActivityIndicator size="small" style={{ marginTop: 8 }} />}
          </View>
        ) : null}
      </ScrollView>

      <View style={{ flexDirection: 'row', gap: 8 }}>
        <TextInput
          value={input}
          onChangeText={setInput}
          placeholder="Ask Llama 3 anything..."
          style={{ flex: 1, borderWidth: 1, borderColor: '#ddd', borderRadius: 8, padding: 12 }}
          onSubmitEditing={handleSend}
        />
        <Pressable
          onPress={handleSend}
          disabled={isGenerating}
          style={{
            backgroundColor: isGenerating ? '#a5b4fc' : '#6366f1',
            paddingHorizontal: 16,
            borderRadius: 8,
            justifyContent: 'center',
          }}
        >
          <Text style={{ color: '#fff', fontWeight: '600' }}>Send</Text>
        </Pressable>
      </View>
    </View>
  )
}

Performance Tuning for Production

Getting the model to load is 20% of the work. Here's what actually matters in production:

1. Use the Right Quantization

Not all quantization levels are equal. For mobile, the sweet spot is Q4_K_M, which offers the best quality-to-size ratio. Avoid Q2_K (too lossy for real apps) and Q8_0 (too large for a marginal quality gain).

Quantization	1B Model Size	Quality	Speed (iPhone 15)	Recommendation
Q2_K	~380MB	Poor	45 tok/s	Avoid
Q4_K_M	~600MB	Good	35 tok/s	Best for most apps
Q5_K_M	~730MB	Very Good	28 tok/s	Premium quality
Q8_0	~1.1GB	Near-lossless	20 tok/s	Overkill for mobile
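
As a rough cross-check of sizes like those in the table, file size scales with parameter count times bits per weight. The sketch below is a back-of-the-envelope estimate, not a llama.cpp API: the Q4_K_M figure is an approximate average (an assumption), while Q8_0 follows from its block layout of 32 8-bit weights plus one 16-bit scale.

```typescript
// Rough GGUF size estimate: parameters × bits-per-weight / 8.
// Q4_K_M is an approximate average (assumption); Q8_0 is 8.5 bits per weight
// because each block of 32 8-bit weights carries one 16-bit scale.
const APPROX_BITS_PER_WEIGHT = { Q4_K_M: 4.85, Q8_0: 8.5 } as const

function estimateModelMb(params: number, quant: keyof typeof APPROX_BITS_PER_WEIGHT): number {
  const bytes = (params * APPROX_BITS_PER_WEIGHT[quant]) / 8
  return Math.round(bytes / 1e6) // decimal megabytes
}
```

For a nominal one billion parameters this gives about 606MB at Q4_K_M. Real files tend to come out somewhat larger, since the 1B model actually has around 1.2 billion parameters and some tensors are stored at higher precision.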

2. Preload the Model on App Start

The model load time (2–5 seconds) is the biggest UX problem in on-device AI. Load the model eagerly when the app launches instead of waiting for the user's first interaction with the AI feature. The simplest approach is to call loadModel() in your root component's useEffect; initLlama() is asynchronous, so the UI can keep rendering while the model loads.

3. Keep the Context Alive

initLlama() is expensive. Once you initialize the context, keep the LlamaContext object in a ref or global store (Zustand/Redux). Never reinitialize per request. Only call release() when the user navigates completely away from your AI feature.
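
One way to guarantee a single initialization, even when several screens ask for the model at the same time, is to memoize the in-flight promise. The once helper below is a generic sketch and not part of llama.rn; in an app you would wrap your own initLlama() call with it.

```typescript
// Memoize an async initializer so concurrent callers share one in-flight load.
function once<T>(init: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null
  return () => {
    if (!pending) {
      pending = init().catch(err => {
        pending = null // allow a retry after a failed load
        throw err
      })
    }
    return pending
  }
}
```

With something like const getContext = once(() => initLlama({ model: modelPath })), every caller awaits getContext() and the model is loaded at most once until a load fails.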

4. Cap Context Window for Speed

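A smaller n_ctx (context window) directly reduces memory usage, because the KV cache is allocated for the full window, and it caps how much prompt the model ever has to process. For single-turn features with short inputs (classification, summarizing a paragraph), set n_ctx: 512, keeping in mind that the prompt and the generated tokens must both fit inside the window. Only use n_ctx: 2048+ for true multi-turn chat.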
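
The memory side of this trade-off is easy to estimate: the KV cache stores one key and one value vector per layer, per KV head, per token. The sketch below assumes 16-bit cache entries and uses model dimensions taken from the published Llama 3.2 1B configuration as I understand it; treat both as assumptions and check them against your model.

```typescript
// KV-cache size = 2 (keys + values) × layers × KV heads × head dim × bytes per value × n_ctx
type ModelShape = { layers: number; kvHeads: number; headDim: number }

// Assumed shape for Llama 3.2 1B; verify against the model's config before relying on it
const LLAMA_3_2_1B: ModelShape = { layers: 16, kvHeads: 8, headDim: 64 }

function kvCacheBytes(shape: ModelShape, nCtx: number, bytesPerValue = 2): number {
  return 2 * shape.layers * shape.kvHeads * shape.headDim * bytesPerValue * nCtx
}

const toMiB = (bytes: number): number => bytes / (1024 * 1024)
```

Under these assumptions, toMiB(kvCacheBytes(LLAMA_3_2_1B, 2048)) is 64 and toMiB(kvCacheBytes(LLAMA_3_2_1B, 512)) is 16, so shrinking the window from 2048 to 512 saves about 48MiB on top of the model weights.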

5. Always Use GPU on iOS

Set n_gpu_layers: 99 to offload all model layers to the Metal GPU on iOS. This alone gives 2–3× faster inference than CPU-only. It's enabled in our example above. On Android, leave it at 0 for now: Vulkan GPU inference in llama.cpp is still maturing.

Practical Use Cases and System Prompt Patterns

Llama 3 Instruct models respond well to clear, structured system prompts. Here are patterns for common mobile AI features:

Smart Text Summarizer

typescript

const systemPrompt = `You are a concise summarizer.
When given text, respond with a 2-3 sentence summary only.
Never add commentary or ask questions. Output the summary immediately.`

On-Device Private Journal Coach

typescript

const systemPrompt = `You are a supportive journaling coach.
Help users reflect on their thoughts with empathy.
Ask one follow-up question at the end of each response.
Keep responses under 150 words.`

Structured Data Extractor

typescript

const systemPrompt = `Extract structured data from the user's input.
Always respond with valid JSON only, no markdown fencing.
Schema: { "name": string, "date": string, "amount": number, "category": string }
If a field is not found, use null.`
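
Even with a strict prompt, a small model will occasionally wrap the JSON in prose or emit a wrong type, so it is worth validating the output before using it. The parser below is a minimal sketch for the schema above; parseExtraction and the Extracted type are hypothetical names introduced here.

```typescript
type Extracted = {
  name: string | null
  date: string | null
  amount: number | null
  category: string | null
}

// Returns null when no parseable JSON object is found in the model output.
function parseExtraction(raw: string): Extracted | null {
  // Tolerate stray text around the object by taking the outermost braces
  const start = raw.indexOf('{')
  const end = raw.lastIndexOf('}')
  if (start === -1 || end <= start) return null
  try {
    const obj = JSON.parse(raw.slice(start, end + 1)) as Record<string, unknown>
    const str = (v: unknown) => (typeof v === 'string' ? v : null)
    const num = (v: unknown) => (typeof v === 'number' && Number.isFinite(v) ? v : null)
    // Fields with a missing or wrong type are normalized to null
    return { name: str(obj.name), date: str(obj.date), amount: num(obj.amount), category: str(obj.category) }
  } catch {
    return null
  }
}
```

Normalizing bad fields to null mirrors the prompt's own rule for missing values, so downstream code only has one "not available" case to handle.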

Using Llama 3 with Expo

llama.rn requires native code, so it won't work with Expo Go. You'll need a custom development build:

bash

# Create a development build
npx expo install llama.rn
npx expo run:ios   # or run:android

# For EAS Build (recommended for production)
eas build --platform ios --profile development

The model file management approach stays the same: download to FileSystem.documentDirectory using expo-file-system and pass the local path to initLlama().

Common Pitfalls and How to Avoid Them

OOM Crashes on Android

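The 3B model at Q4 needs ~2.5GB RAM. Android will kill the app if memory pressure is too high. Use the 1B model for broad Android compatibility and add android:largeHeap="true", bearing in mind that largeHeap only enlarges the managed heap, not the native memory llama.cpp uses.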

Slow First Response

Prefill (processing the input prompt) is slow for long inputs. Keep system prompts under 200 tokens and don't send long conversation histories; summarize older turns instead.
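
A simple way to bound prefill cost is to cap the history at a token budget. The sketch below just drops the oldest turns; summarizing them, as suggested above, would need an extra model call. The four-characters-per-token estimate is a crude assumption for English text, not a real tokenizer.

```typescript
type Turn = { role: 'user' | 'assistant'; content: string }

// Crude token estimate (assumption): roughly 4 characters per token
const estimateTokens = (text: string): number => Math.ceil(text.length / 4)

// Keep the most recent turns that fit within maxTokens, preserving their order.
function trimHistory(turns: Turn[], maxTokens: number): Turn[] {
  const kept: Turn[] = []
  let used = 0
  for (let i = turns.length - 1; i >= 0; i--) {
    const cost = estimateTokens(turns[i].content)
    if (used + cost > maxTokens) break
    kept.unshift(turns[i])
    used += cost
  }
  return kept
}
```

Walking backwards from the newest turn means the latest user message is the first thing kept, which is usually what the model needs most.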

Model Not Found Error

Always call FileSystem.getInfoAsync(path) before initLlama(). If the model doesn't exist, trigger the download flow; never assume it's cached.
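
An existence check alone can be fooled by a truncated file left over from an interrupted download. A slightly stricter sketch also checks the size reported by getInfoAsync; isModelUsable and the minBytes threshold are hypothetical and should be set for your own model file.

```typescript
// The subset of the getInfoAsync result that this check relies on
type FileInfoLike = { exists: boolean; size?: number }

// minBytes is a lower bound you choose for your model file (assumption)
function isModelUsable(info: FileInfoLike, minBytes: number): boolean {
  return info.exists && typeof info.size === 'number' && info.size >= minBytes
}
```

In the download hook, this could replace the plain info.exists check, for example isModelUsable(info, 500_000_000) for a model of roughly 600MB.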

App Store Rejection

Bundling large model files in the IPA inflates the app far beyond what most users will install and invites trouble in App Store review. Always download models post-install. Make the download optional or offer a cloud fallback so users aren't forced to download ~600MB before using the app.

Recommended Architecture: Hybrid On-Device + Cloud

The best production pattern isn't a binary choice between cloud and on-device; it's a smart hybrid:

Short queries, sensitive data, offline mode → Llama 3 on-device
Complex reasoning, longer outputs, first-time users → Cloud (GPT-4o, Claude)
Users without the model downloaded → Transparent cloud fallback
Pro users / power features → Download prompt + on-device premium experience

A strategy pattern that selects the backend based on device RAM, network availability, and model download state handles this cleanly. Users on older hardware always hit the cloud. Power users who opted into the model download get local inference with no API costs.
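
A minimal sketch of that selection logic might look like the following. The DeviceState fields, the 4GB threshold, and the needsFrontierReasoning flag are illustrative assumptions; in a real app they would be fed by a device-info library, a network-status listener, and the model download state.

```typescript
type Backend = 'on-device' | 'cloud' | 'unavailable'

type DeviceState = {
  totalRamGb: number       // from a device-info library (assumption)
  isOnline: boolean
  modelDownloaded: boolean
}

const MIN_RAM_GB = 4 // illustrative threshold, not a measured value

function selectBackend(state: DeviceState, needsFrontierReasoning = false): Backend {
  const canRunLocally = state.modelDownloaded && state.totalRamGb >= MIN_RAM_GB
  // Complex tasks go to the cloud whenever a connection is available
  if (needsFrontierReasoning && state.isOnline) return 'cloud'
  if (canRunLocally) return 'on-device'
  // No local model or too little RAM: fall back to the cloud if we can reach it
  return state.isOnline ? 'cloud' : 'unavailable'
}
```

Returning 'unavailable' explicitly lets the UI explain why the feature is disabled (offline with no model) instead of failing silently.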

Hands-on help

Need On-Device AI in Your React Native App?

CasaInnov builds ready-to-ship on-device AI features, from model selection and quantization to hybrid architectures that balance privacy, performance, and cost.

Free 30-minute call

A clear plan for your project

No obligation either way


Trusted by 10+ companies | Free first call | Kept confidential

