Llama 3.2 3B Instruct GGUF Launcher

Quick and dirty launcher scripts for running Llama-3.2-3B-Instruct GGUF models with llama-cpp-python.

Description

Simple Python scripts to load and chat with Llama 3.2 3B Instruct GGUF quantized models. Includes a full version with hardware detection and optimizations, and a bare-bones version for minimal setup.

Requirements

  • Python 3.13
  • llama-cpp-python with CUDA support (see installation below)
  • 4-5GB VRAM for Q8_0 quantization
  • GGUF model file (Q8_0 recommended for best quality)

Setup

  1. Create conda environment:
conda env create -f environment.yml
conda activate llama_32_3b_instruct_launcher_gguf
  2. Install llama-cpp-python wheel:
# Option A: Download pre-built wheel from https://github.com/dougeeai/llama-cpp-python-wheels
pip install llama_cpp_python-X.X.X-cpXXX-cpXXX-win_amd64.whl

# Option B: Install via pip (pulls a prebuilt wheel from the index when available, otherwise builds from source)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/13.0
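Either way, you can sanity-check that the package imports before wiring up a model (this one-liner is not part of the repo's scripts):

python -c "import llama_cpp; print(llama_cpp.__version__)"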
  3. IMPORTANT: Update the model path in the script:
MODEL_PATH = r"E:\ai\models\llama_32_3b_instruct_gguf\llama_32_3b_instruct_q8_0.gguf"  # Change to your model location
  4. Run:
python llama_32_3b_instruct_launcher_gguf.py  # Full version
# or
python llama_32_3b_instruct_launcher_gguf_bare.py  # Minimal version

Files

  • llama_32_3b_instruct_launcher_gguf.py - Full launcher with hardware detection, GPU layer optimization, and streaming (the detection idea is sketched after this list)
  • llama_32_3b_instruct_launcher_gguf_bare.py - Bare minimum code to load and chat
  • environment.yml - Conda environment specification
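The full launcher's hardware detection boils down to checking free VRAM and picking a GPU layer count. A minimal sketch of that idea, assuming nvidia-smi is on PATH; the function name free_vram_mb and the 4096 MB threshold are illustrative, not taken from the actual script:

import subprocess

def free_vram_mb():
    # Ask nvidia-smi for free VRAM on the first GPU; None means no usable GPU
    try:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
            text=True,
        )
        return int(out.splitlines()[0])
    except (OSError, subprocess.CalledProcessError, ValueError):
        return None

vram = free_vram_mb()
# Q8_0 weighs ~3.2GB, so ~4GB free comfortably fits all 28 layers; otherwise run on CPU
GPU_LAYERS = 28 if vram is not None and vram >= 4096 else 0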

Configuration

Key settings in the script (wired into the model constructor as sketched after the list):

  • GPU_LAYERS - Number of layers to offload to GPU (28 = all layers for 3B model)
  • CONTEXT_LENGTH - Context window (131072 = full 128K context)
  • THREADS - CPU threads for computation
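A minimal sketch of how these settings plug into llama-cpp-python: n_gpu_layers, n_ctx, and n_threads are real Llama constructor arguments, the constant values mirror the defaults described above, and the model path is a placeholder:

from llama_cpp import Llama

MODEL_PATH = r"path\to\llama_32_3b_instruct_q8_0.gguf"  # placeholder - use your path from Setup step 3
GPU_LAYERS = 28          # all layers of the 3B model on the GPU
CONTEXT_LENGTH = 131072  # full 128K context window
THREADS = 8              # tune to your CPU

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=GPU_LAYERS,
    n_ctx=CONTEXT_LENGTH,
    n_threads=THREADS,
)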

Usage

Type messages at the prompt. Commands:

  • quit - Exit the chat
  • clear - Reset conversation history (full version only)
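For a sense of how these commands fit together, here is a rough sketch of the kind of chat loop the launchers run, reusing the llm object from the Configuration sketch and llama-cpp-python's streaming chat API; the actual scripts may structure this differently:

messages = []
while True:
    user = input("You: ").strip()
    if user.lower() == "quit":
        break
    if user.lower() == "clear":  # full version only
        messages = []
        continue
    messages.append({"role": "user", "content": user})
    reply = ""
    # stream=True yields OpenAI-style chunks; print tokens as they arrive
    for chunk in llm.create_chat_completion(messages=messages, stream=True):
        piece = chunk["choices"][0]["delta"].get("content", "")
        print(piece, end="", flush=True)
        reply += piece
    print()
    messages.append({"role": "assistant", "content": reply})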

Notes

  • GGUF models are more memory-efficient than safetensors
  • Q8_0 provides the best quality/size ratio at ~3.2GB
  • Bare version skips all validation and optimization features
  • Default uses 28 GPU layers (full GPU offload)
