Installing Ollama on AWS EC2

This guide walks through running Ollama with Qwen 2.5 on a single-GPU AWS EC2 instance, exposing the API on port 11434, and verifying access.

Steps to configure Ollama on an AWS EC2 instance

Follow the instructions below to set up Ollama on an AWS EC2 instance:

  1. Pick the right instance & AMI

    Instance     GPU (VRAM)          Why it works for Qwen 2.5-7B                        On-Demand price*
    g5.xlarge    1 × A10G (24 GB)    Fits the full 7B model in VRAM; good $/token-s      ≈ $0.92/hr
    g5.2xlarge   1 × A10G (24 GB)    Same GPU, more CPU/RAM if you run other services    ≈ $1.21/hr

    NOTE: *On-Demand pricing in the US East (N. Virginia) region.

    AMI options:

    • Fastest – use an AWS Deep Learning Base AMI (Ubuntu 22.04); it ships with the matching NVIDIA driver & CUDA libraries pre-installed (Reference). A CLI lookup sketch follows this list.
    • DIY – start from vanilla Ubuntu 22.04 and install the driver/CUDA yourself.
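
    One possible way to locate a current Deep Learning Base AMI ID from the CLI; this is a sketch only, and the name filter below is an assumption — adjust it to the exact AMI family you want:

    # List the three most recent Amazon-owned AMIs matching "Deep Learning Base" on Ubuntu 22.04.
    aws ec2 describe-images --owners amazon \
      --filters 'Name=name,Values=*Deep Learning Base*Ubuntu 22.04*' \
      --query 'reverse(sort_by(Images,&CreationDate))[:3].{Name:Name,ImageId:ImageId}' \
      --output table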
  2. Provision the instance

    # 1. Launch instance
    #    - g5.xlarge, Ubuntu 22.04 (DL Base AMI or vanilla)
    #    - 100 GB gp3 root volume (models + checkpoints)
    #    - SG rules: 22/tcp (SSH), 11434/tcp (Ollama API) or 443 if reverse-proxy
    # 2. Attach/allocate an Elastic IP for a stable endpoint
    ssh -i ~/.ssh/your-key.pem ubuntu@<elastic-ip>
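
    Alternatively, the launch and Elastic IP steps above can be scripted with the AWS CLI; a minimal sketch, where <ami-id>, <key-name>, <sg-id>, <subnet-id>, <instance-id>, and <allocation-id> are placeholders for your own values:

    # Launch a g5.xlarge with a 100 GB gp3 root volume.
    aws ec2 run-instances \
      --image-id <ami-id> \
      --instance-type g5.xlarge \
      --key-name <key-name> \
      --security-group-ids <sg-id> \
      --subnet-id <subnet-id> \
      --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=100,VolumeType=gp3}' \
      --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ollama-qwen25}]'

    # Allocate an Elastic IP and attach it to the new instance for a stable endpoint.
    aws ec2 allocate-address --domain vpc
    aws ec2 associate-address --instance-id <instance-id> --allocation-id <allocation-id>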
    
  3. Install NVIDIA driver & CUDA 12 (optional)

    NOTE: Follow this step only if you are not using a Deep Learning AMI; otherwise, skip it.

    NOTE: For more detail on the NVIDIA driver installation flow on AWS, refer to the AWS Documentation.

    sudo apt update && sudo apt -y upgrade
    sudo ubuntu-drivers autoinstall          # installs the recommended driver (≥535)
    sudo reboot                              # reboot so the new driver loads
    nvidia-smi                               # verify the GPU is visible
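    # Optional: confirm the driver version and usable VRAM in one line (handy for scripted checks).
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv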
    
  4. Install Ollama with GPU support

    curl -fsSL https://ollama.ai/install.sh | sh
    sudo systemctl enable --now ollama
    
    • Expose the API on the network

      Ollama binds to 127.0.0.1 by default. Add a systemd override that points it to all interfaces:

    sudo systemctl edit ollama

    # add the following to the override file that opens
    [Service]
    Environment="OLLAMA_HOST=0.0.0.0:11434"
    
    sudo systemctl daemon-reload
    sudo systemctl restart ollama
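
    For non-interactive provisioning (for example from a user-data or bootstrap script), the same override can be written directly as a systemd drop-in; a minimal sketch of the equivalent steps:

    # Equivalent to `systemctl edit`: create the drop-in file by hand, then reload and restart.
    sudo mkdir -p /etc/systemd/system/ollama.service.d
    printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0:11434"\n' | \
      sudo tee /etc/systemd/system/ollama.service.d/override.conf
    sudo systemctl daemon-reload && sudo systemctl restart ollama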
    

📘

NOTE:

  • If a YAML config is preferred, set listen: 0.0.0.0:11434 in /etc/ollama/ollama.yaml.
  • Multiple GPUs? Pin the fast card so Ollama doesn’t fall back to CPU:
    Environment="CUDA_VISIBLE_DEVICES=0"
    
  5. Pull and cache the Qwen 2.5 model

    # pull the 7B-parameter variant
    ollama pull qwen2.5:7b
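
    Optional sanity checks after the pull (the on-disk size varies with the quantization of the tag you pulled):

    ollama list    # qwen2.5:7b should appear with its on-disk size
    ollama ps      # on recent Ollama releases: shows loaded models and whether they sit in GPU memory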
    
  6. Smoke-test the endpoint

    curl -X POST http://<elastic-ip>:11434/api/generate \
      -d '{"model":"qwen2.5:7b","prompt":"Describe AWS EC2 in one line."}'
    
    You should see a streamed JSON response, and nvidia-smi should briefly show ~85–100% GPU utilization.
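
    If a single JSON object is easier to inspect than the streamed output, the same endpoint also accepts "stream": false:

    curl -s -X POST http://<elastic-ip>:11434/api/generate \
      -d '{"model":"qwen2.5:7b","prompt":"Describe AWS EC2 in one line.","stream":false}'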
  7. Enable HTTPS in front (optional)

    sudo apt install nginx
    # create /etc/nginx/sites-available/ollama.conf with the following server block:
    server {
        listen 443 ssl;
        server_name <your-domain>;
        ssl_certificate /etc/letsencrypt/live/<your-domain>/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/<your-domain>/privkey.pem;
        location / {
            proxy_pass http://127.0.0.1:11434;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
    sudo ln -s /etc/nginx/sites-available/ollama.conf /etc/nginx/sites-enabled/
    sudo nginx -t && sudo systemctl reload nginx
    
    Run certbot --nginx once to obtain a free TLS certificate (install it first with sudo apt install certbot python3-certbot-nginx).
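
    Once the certificate is in place, the proxy can be verified end to end (assuming DNS for <your-domain> already points at the Elastic IP) by listing the installed models over HTTPS:

    curl -s https://<your-domain>/api/tags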
  8. Cost & scaling tips

    • Spot instances can shave 60–80% off the on-demand rate but may be interrupted; see the Spot launch sketch after this list.
    • Autoscale by putting Ollama behind an ALB and using an EC2 Auto Scaling group keyed on GPU utilization.
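
    A minimal sketch of requesting the same g5.xlarge as a Spot instance via the AWS CLI (placeholders as in step 2; Spot capacity is not guaranteed):

    # Same launch call as before, with a Spot market option added.
    aws ec2 run-instances \
      --image-id <ami-id> \
      --instance-type g5.xlarge \
      --key-name <key-name> \
      --security-group-ids <sg-id> \
      --instance-market-options 'MarketType=spot'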

Quick health checklist

# GPU visible
nvidia-smi

# Ollama listening on public address
ss -ltnp | grep 11434

# Model pulled and available locally
ollama list | grep qwen2.5

# Sample chat
ollama run qwen2.5:7b

❗️

Important:

The config file searchai-config.yml needs to be updated so that searchblox-llm points to the new AWS-based Ollama service. The file is located at /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml

searchblox-llm: http://<elastic-ip>:11434/   # or https://<your-domain>/ if fronted by nginx
llm-platform: "ollama"
searchai-agents-server:
num-thread:

models:
  chat: "qwen2.5"
  document-enrichment: "qwen2.5"
  smart-faq: "qwen2.5"
  searchai-assist-text: "qwen2.5"
  searchai-assist-image: "llama3.2-vision"

cache-settings:
  use-cache: true
  fact-score-threshold: 40

prompts:
  standalone-question: |
    Given the conversation history and a follow-up question, rephrase the follow-up question to be a standalone question that includes all necessary context.
...
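
After updating the config, it is worth confirming that the SearchBlox server can reach the AWS-hosted Ollama endpoint; a quick check from that machine (substitute your Elastic IP or domain):

curl -s http://<elastic-ip>:11434/api/tags | grep qwen2.5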