Installing Ollama on AWS EC2

This guide walks you through setting up Ollama with the Qwen3-VL model (qwen3-vl:30b-a3b-instruct) on a single-GPU AWS EC2 instance, configuring it to expose its API on port 11434, and verifying that the API is reachable and working correctly.

Steps to configure Ollama on an AWS EC2 instance

Follow these instructions to set up Ollama on an AWS EC2 instance:

  1. Pick the right instance & AMI

    Instance      GPU (VRAM)        Why it works                                  On-Demand price*
    g6e.2xlarge   1x L40S (46 GB)   Recommended: fits model + 4 parallel slots    ~$2.20/hr
    g6e.4xlarge   1x L40S (46 GB)   More CPU/RAM, same GPU                        ~$3.97/hr

    * NOTE: Prices are for US East (N. Virginia), On-Demand.
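To put the hourly rates in context, here is a rough monthly estimate for an instance that runs around the clock. It is only a sketch: the 730 hours-per-month figure and the helper name monthly_cost are assumptions, and storage and data transfer are not included.

```shell
# Assumption: 730 hours approximates one month of continuous uptime.
HOURS_PER_MONTH=730

# Multiply an hourly rate (USD) by HOURS_PER_MONTH and print two decimals.
monthly_cost() {
  awk -v rate="$1" -v hours="$HOURS_PER_MONTH" \
    'BEGIN { printf "%.2f\n", rate * hours }'
}

monthly_cost 2.20   # g6e.2xlarge
monthly_cost 3.97   # g6e.4xlarge
```

This prints 1606.00 and 2898.10, so roughly $1,606 and $2,898 per month at the On-Demand rates listed above.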

    For the AMI (Amazon Machine Image), choose one of:

    • AWS Deep Learning Base AMI (Ubuntu 22.04 or 24.04) — recommended. This pre-installs the correct NVIDIA drivers and CUDA libraries, saving 10–15 minutes of manual setup.
    • Vanilla Ubuntu 24.04 — install drivers manually in Step 3.
  2. Provision the instance

    Launch a g6e.2xlarge instance with:

    • Ubuntu 24.04 (DL Base AMI recommended)
    • 100 GB gp3 root volume (the model is 19 GB)
    • Security group rules: 22/tcp (SSH) and 11434/tcp (Ollama API), or 443/tcp if you use a reverse proxy

    Attach or allocate an Elastic IP for a stable endpoint, then connect over SSH:

    ssh -i ~/.ssh/your-key.pem ubuntu@<elastic-ip>
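If you prefer the AWS CLI to the console for the security group rules, the sketch below shows one way to add them. The group ID and CIDR are hypothetical placeholders, and the helper name allow_port is an assumption. Restricting the source CIDR matters here because the Ollama API on port 11434 has no authentication of its own.

```shell
# Hypothetical placeholders: replace with your security group ID and the
# address range that should be allowed to reach the instance.
SG_ID="sg-0123456789abcdef0"
CLIENT_CIDR="203.0.113.0/24"

# Allow inbound TCP traffic on one port, from CLIENT_CIDR only.
allow_port() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$SG_ID" --protocol tcp --port "$1" --cidr "$CLIENT_CIDR"
}
```

Nothing runs until you call the function, for example allow_port 22 followed by allow_port 11434.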
    
  3. Install NVIDIA Driver & CUDA (Skip if using DL AMI)

    Ollama needs the NVIDIA CUDA runtime to offload computation to the GPU. The commands below install the recommended driver version automatically.

    NOTE: Follow this step only if you are not using a DL AMI; otherwise, skip to Step 4.

    NOTE: For more information about the driver installation flow on AWS, refer to the AWS documentation.
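A minimal sketch of one common route on vanilla Ubuntu 24.04, assuming the ubuntu-drivers tool is used to select the recommended driver. This is an assumption rather than the exact procedure this guide was written against, and it covers the driver only. The function name install_nvidia_driver is hypothetical; the commands are wrapped in a function so nothing runs until you call it.

```shell
# Assumption: vanilla Ubuntu 24.04 (not the DL AMI), with ubuntu-drivers
# choosing the recommended NVIDIA driver for the detected GPU.
install_nvidia_driver() {
  sudo apt-get update &&
  sudo apt-get install -y ubuntu-drivers-common &&
  sudo ubuntu-drivers install &&
  sudo reboot   # the new kernel module loads after the reboot
}
```

Call install_nvidia_driver once, reconnect over SSH after the reboot, and run nvidia-smi to confirm that the GPU is visible.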

  4. Install Ollama

  • Run the install script and enable the service:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
  • Confirm the service is running:
systemctl status ollama
  5. Configure the System Service

Edit the service file directly:

sudo vi /etc/systemd/system/ollama.service

Replace the entire contents with:

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_GPU=999"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=4"


[Install]
WantedBy=multi-user.target

Apply and restart:

sudo systemctl daemon-reload && sudo systemctl restart ollama
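After the restart, it can help to confirm that the unit file really carries the settings you expect. The helper below is a sketch (the name unit_env is an assumption) that prints the KEY=VALUE pairs from the Environment= lines of a unit file.

```shell
# Print the KEY=VALUE pairs set through Environment= lines in a unit file.
# $1 is the path to the unit file.
unit_env() {
  sed -n 's/^Environment="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p' "$1"
}
```

For example, unit_env /etc/systemd/system/ollama.service should list OLLAMA_HOST=0.0.0.0:11434 among its output. To see what systemd actually loaded, systemctl show ollama --property=Environment gives the runtime view.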
  6. Pull the Model

ollama pull qwen3-vl:30b-a3b-instruct
  7. Create an Optimised Modelfile

cat > /tmp/Modelfile <<EOF
FROM qwen3-vl:30b-a3b-instruct
PARAMETER num_ctx   16000
PARAMETER num_batch 512
PARAMETER num_gpu   999
PARAMETER num_thread 8
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 20
EOF


ollama create qwen3-vl:30b-a3b-instruct -f /tmp/Modelfile

Verify parameters were applied:

ollama show qwen3-vl:30b-a3b-instruct --modelfile

Restart Ollama to load the updated model:

sudo systemctl restart ollama

  8. Configure SearchBlox (searchai-config.yml)

    This file defines which LLM provider to use, the API endpoint, and which model handles each AI task.
    File location: /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml
active-llm-provider: "ollama"

task-providers:
  chat: "ollama"
  document-enrichment: "ollama"
  smart-faq: "ollama"
  searchai-assist-text: "ollama"
  searchai-assist-image: "ollama"
  recommendations: "ollama"
  knowledge-graph: "ollama"
  document-query-decomposition: "ollama"
  product-query-decomposition: "ollama"
  product-kg-extraction: "ollama"
  analytics: "ollama"
  testing: "ollama"
  admin: "ollama"
  analysis: "ollama"
  extraction: "ollama"
  log-analysis: "ollama"
  agent-chat: "ollama"
  agent-analytics: "ollama"
  agent-testing: "ollama"
  agent-analysis: "ollama"
  agent-admin: "ollama"
  agent-extraction: "ollama"

llm-providers:
  ollama:
    platform: "ollama"
    url: "http://localhost:11434/"
    models:
      chat: "qwen3-vl:30b-a3b-instruct"
      document-enrichment: "qwen3-vl:30b-a3b-instruct"
      smart-faq: "qwen3-vl:30b-a3b-instruct"
      searchai-assist-text: "qwen3-vl:30b-a3b-instruct"
      searchai-assist-image: "qwen3-vl:30b-a3b-instruct"
      recommendations: "qwen3-vl:30b-a3b-instruct"
      knowledge-graph: "qwen3-vl:30b-a3b-instruct"
      document-query-decomposition: "qwen3-vl:30b-a3b-instruct"
      product-query-decomposition: "qwen3-vl:30b-a3b-instruct"
      product-kg-extraction: "qwen3-vl:30b-a3b-instruct"
      analytics: "qwen3-vl:30b-a3b-instruct"
      testing: "qwen3-vl:30b-a3b-instruct"
      admin: "qwen3-vl:30b-a3b-instruct"
      analysis: "qwen3-vl:30b-a3b-instruct"
      extraction: "qwen3-vl:30b-a3b-instruct"
      log-analysis: "qwen3-vl:30b-a3b-instruct"
      agent-chat: "qwen3-vl:30b-a3b-instruct"
      agent-analytics: "qwen3-vl:30b-a3b-instruct"
      agent-testing: "qwen3-vl:30b-a3b-instruct"
      agent-analysis: "qwen3-vl:30b-a3b-instruct"
      agent-admin: "qwen3-vl:30b-a3b-instruct"
      agent-extraction: "qwen3-vl:30b-a3b-instruct"
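Because the same model name is repeated for every task, a typo in one entry is easy to miss. The sketch below lists the distinct model names found under the models: key so you can compare them with the model you pulled. The helper name config_models is an assumption, and it relies on models: being the last block in the file, as in the example above.

```shell
# List the distinct model names under "models:" in a searchai-config.yml
# style file. $1 is the path to the file.
# Assumption: everything after the "models:" line belongs to that block.
config_models() {
  awk '/^ *models:/ { in_models = 1; next }
       in_models && /^ *[a-z-]+: *"/ {
         gsub(/^[^"]*"|"[^"]*$/, "")   # keep only the quoted value
         print
       }' "$1" | sort -u
}
```

Running config_models /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml against the configuration above should print a single line, qwen3-vl:30b-a3b-instruct. More than one line means at least one task points at a different model.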

If you serve the Ollama API through an Nginx reverse proxy on port 443, run certbot --nginx once to obtain a free TLS certificate.

Cost and scaling tips:

  • Spot Instances can cut 60-80% off the On-Demand rate but may be interrupted.
  • To autoscale, put Ollama behind an ALB and use an EC2 Auto Scaling group keyed on GPU utilization.

  9. Verify & Smoke Test

Quick health checklist

# GPU is visible
nvidia-smi


# Ollama is listening on the correct address
ss -tlnp | grep 11434


# Model is available locally
ollama list

# Model is loaded in memory (it appears here after the first request)
ollama ps


# Local connectivity
curl http://localhost:11434/
# Expected: Ollama is running


# Models API endpoint (used by SearchBlox)
curl http://localhost:11434/v1/models
# Expected: JSON with qwen3-vl:30b-a3b-instruct
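
The checks above confirm that the service is reachable but do not exercise the model itself. As a sketch of an end-to-end smoke test, the functions below send one short prompt to Ollama's /api/generate endpoint. The helper names generate_body and smoke_test, and the prompt text, are assumptions.

```shell
MODEL="qwen3-vl:30b-a3b-instruct"
OLLAMA_URL="http://localhost:11434"

# Build the JSON body for a non-streaming /api/generate request.
# Note: the prompt is inserted as-is, so it must not contain quotes or backslashes.
generate_body() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# Send one short prompt and print the raw JSON reply.
smoke_test() {
  curl -fsS "$OLLAMA_URL/api/generate" \
    -d "$(generate_body "$MODEL" "Reply with the single word: ready")"
}
```

Call smoke_test on the instance. A JSON reply containing a response field indicates that the model loaded and generated text; the first call can take noticeably longer while the model is loaded onto the GPU.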