Installing Ollama on AWS EC2

This guide runs Ollama with qwen3-vl:30b-a3b-instruct on a single-GPU AWS EC2 instance, exposes the API on port 11434, and verifies access.

Steps to configure Ollama on an AWS EC2 instance

Follow these instructions to set up Ollama on an AWS EC2 instance:

  1. Pick the right instance & AMI

    Instance       GPU (VRAM)       Why it works                                   On-Demand price*
    g6e.2xlarge    1x L40S 46 GB    Recommended: fits model + 4 parallel slots     ~$2.20/hr
    g6e.4xlarge    1x L40S 46 GB    More CPU/RAM, same GPU                         ~$3.97/hr

    * NOTE: Prices are On-Demand rates for US East (N. Virginia).

    For the AMI (Amazon Machine Image), choose one of:

    • AWS Deep Learning Base AMI (Ubuntu 22.04 or 24.04) — recommended. This pre-installs the correct NVIDIA drivers and CUDA libraries, saving 10–15 minutes of manual setup.
    • Vanilla Ubuntu 24.04 — install drivers manually in Step 3.
  2. Provision the instance

    # Launch g6e.2xlarge with:
    #   - Ubuntu 24.04 (DL Base AMI recommended)
    #   - 100 GB gp3 root volume (model is 19 GB)
    #   - SG rules: 22/tcp (SSH), 11434/tcp (Ollama API) or 443 if reverse-proxy
    # Attach/allocate an Elastic IP for a stable endpoint, then connect:
    ssh -i ~/.ssh/your-key.pem ubuntu@<elastic-ip>
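
    The launch settings listed above can also be scripted with the AWS CLI. The following is a minimal sketch, assuming the AWS CLI is installed and configured for your account. The AMI ID, key pair name, and security group ID are placeholders, the function names are hypothetical, and /dev/sda1 is assumed to be the AMI's root device name.

```shell
# Placeholder values (assumptions): replace with your own IDs before running.
AMI_ID="${AMI_ID:-ami-REPLACE_ME}"
KEY_NAME="${KEY_NAME:-your-key}"
SG_ID="${SG_ID:-sg-REPLACE_ME}"

# 100 GB gp3 root volume; /dev/sda1 is assumed to be the AMI's root device.
ROOT_VOLUME='[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":100,"VolumeType":"gp3"}}]'

# Launch one g6e.2xlarge instance with the settings listed above.
launch_instance() {
  aws ec2 run-instances \
    --image-id "$AMI_ID" \
    --instance-type g6e.2xlarge \
    --key-name "$KEY_NAME" \
    --security-group-ids "$SG_ID" \
    --block-device-mappings "$ROOT_VOLUME" \
    --count 1
}

# Allocate an Elastic IP and attach it to the instance ID passed as $1.
attach_elastic_ip() {
  alloc_id=$(aws ec2 allocate-address --domain vpc \
    --query AllocationId --output text)
  aws ec2 associate-address --instance-id "$1" --allocation-id "$alloc_id"
}
```

    Call launch_instance, note the InstanceId in the output, then call attach_elastic_ip with that ID.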
    
  3. Install NVIDIA Driver & CUDA (Skip if using DL AMI)

    Ollama needs the NVIDIA CUDA runtime to offload computation to the GPU. The DL AMI already includes the driver and CUDA libraries, so perform this step only on vanilla Ubuntu, where the recommended driver version must be installed before continuing.

    NOTE: For details on the driver install flow on AWS, refer to the AWS documentation.
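
    One way to install the driver on vanilla Ubuntu is Ubuntu's own ubuntu-drivers tool. This is a sketch under that assumption, not necessarily the flow the AWS documentation describes; the function name is hypothetical, and a reboot is needed before the GPU becomes usable.

```shell
# Assumption: vanilla Ubuntu 22.04/24.04 with ubuntu-drivers available via apt.
install_nvidia_driver() {
  sudo apt-get update
  sudo apt-get install -y ubuntu-drivers-common
  # Detects the GPU and installs the driver Ubuntu recommends for it.
  sudo ubuntu-drivers install
}

# On the instance, install, reboot so the kernel module loads, then check:
#   install_nvidia_driver && sudo reboot
#   nvidia-smi
```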

  4. Install Ollama

  • Run the install script and enable the service:
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
  • Confirm the service is running:
systemctl status ollama
  5. Configure the System Service

Edit the service file directly:

sudo vi /etc/systemd/system/ollama.service

Replace the entire contents with:

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_GPU=999"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=4"


[Install]
WantedBy=multi-user.target

Apply and restart:

sudo systemctl daemon-reload && sudo systemctl restart ollama
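
As an alternative to replacing the unit file, the same environment settings can live in a systemd drop-in, which is layered over the unit and is left alone if the install script rewrites ollama.service on a later run. A minimal sketch, assuming the standard drop-in directory /etc/systemd/system/ollama.service.d; the helper name write_ollama_override is hypothetical.

```shell
# Hypothetical helper: write a drop-in with the Ollama settings into directory $1.
write_ollama_override() {
  mkdir -p "$1" || return 1
  cat > "$1/override.conf" <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_GPU=999"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=4"
EOF
}

# Run as root on the instance, then reload and restart:
#   write_ollama_override /etc/systemd/system/ollama.service.d
#   systemctl daemon-reload && systemctl restart ollama
#   systemctl show ollama --property=Environment   # confirm the values took effect
```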
  6. Pull the Model

ollama pull qwen3-vl:30b-a3b-instruct
  7. Create an Optimised Modelfile

cat > /tmp/Modelfile <<EOF
FROM qwen3-vl:30b-a3b-instruct
PARAMETER num_ctx   16000
PARAMETER num_batch 512
PARAMETER num_gpu   999
PARAMETER num_thread 8
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 20
EOF


ollama create qwen3-vl:30b-a3b-instruct -f /tmp/Modelfile

Verify parameters were applied:

ollama show qwen3-vl:30b-a3b-instruct --modelfile

Restart Ollama to load the updated model:

sudo systemctl restart ollama
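
The context length set here interacts with the parallel-slot setting in the service file: Ollama's documentation notes that parallel request processing multiplies the effective context size by the number of parallel requests. A quick sketch of that arithmetic, assuming the values used in this guide (num_ctx 16000 and OLLAMA_NUM_PARALLEL=4); the variable names are illustrative.

```shell
NUM_CTX=16000       # num_ctx from the Modelfile
NUM_PARALLEL=4      # OLLAMA_NUM_PARALLEL from the service file

# Total tokens of context the server may need to hold across all slots.
TOTAL_CTX=$((NUM_CTX * NUM_PARALLEL))
echo "Total context across slots: $TOTAL_CTX tokens"
```

If GPU memory runs short, lowering either value shrinks this total.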

  8. Configure SearchBlox (searchai-config.yml)

    This file defines which LLM provider to use, the API endpoint, and which model handles each AI task.
    File location: /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml
active-llm-provider: "ollama"

task-providers:
  chat: "ollama"
  document-enrichment: "ollama"
  smart-faq: "ollama"
  searchai-assist-text: "ollama"
  searchai-assist-image: "ollama"
  recommendations: "ollama"
  knowledge-graph: "ollama"
  document-query-decomposition: "ollama"
  product-query-decomposition: "ollama"
  product-kg-extraction: "ollama"
  analytics: "ollama"
  testing: "ollama"
  admin: "ollama"
  analysis: "ollama"
  extraction: "ollama"
  log-analysis: "ollama"
  agent-chat: "ollama"
  agent-analytics: "ollama"
  agent-testing: "ollama"
  agent-analysis: "ollama"
  agent-admin: "ollama"
  agent-extraction: "ollama"

llm-providers:
  ollama:
    platform: "ollama"
    url: "http://localhost:11434/"
    models:
      chat: "qwen3-vl:30b-a3b-instruct"
      document-enrichment: "qwen3-vl:30b-a3b-instruct"
      smart-faq: "qwen3-vl:30b-a3b-instruct"
      searchai-assist-text: "qwen3-vl:30b-a3b-instruct"
      searchai-assist-image: "qwen3-vl:30b-a3b-instruct"
      recommendations: "qwen3-vl:30b-a3b-instruct"
      knowledge-graph: "qwen3-vl:30b-a3b-instruct"
      document-query-decomposition: "qwen3-vl:30b-a3b-instruct"
      product-query-decomposition: "qwen3-vl:30b-a3b-instruct"
      product-kg-extraction: "qwen3-vl:30b-a3b-instruct"
      analytics: "qwen3-vl:30b-a3b-instruct"
      testing: "qwen3-vl:30b-a3b-instruct"
      admin: "qwen3-vl:30b-a3b-instruct"
      analysis: "qwen3-vl:30b-a3b-instruct"
      extraction: "qwen3-vl:30b-a3b-instruct"
      log-analysis: "qwen3-vl:30b-a3b-instruct"
      agent-chat: "qwen3-vl:30b-a3b-instruct"
      agent-analytics: "qwen3-vl:30b-a3b-instruct"
      agent-testing: "qwen3-vl:30b-a3b-instruct"
      agent-analysis: "qwen3-vl:30b-a3b-instruct"
      agent-admin: "qwen3-vl:30b-a3b-instruct"
      agent-extraction: "qwen3-vl:30b-a3b-instruct"
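
Before restarting SearchBlox, it can help to confirm that the url in the config actually reaches Ollama. A minimal sketch; config_url is a hypothetical helper that assumes the url value is double-quoted on its own line, as in the file above.

```shell
# Hypothetical helper: print the first quoted url: value found in YAML file $1.
config_url() {
  sed -n 's/^[[:space:]]*url:[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p' "$1" | head -n 1
}

# On the SearchBlox host (the url in this guide already ends with "/"):
#   CONFIG=/opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml
#   curl -fsS "$(config_url "$CONFIG")v1/models"
```

If SearchBlox runs on a different host than Ollama, the url must point at the instance address rather than localhost for this check to pass.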

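If you front the Ollama API with an Nginx reverse proxy on port 443 (see the security group rules in Step 2), run certbot --nginx once to obtain a free TLS certificate.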

  9. Verify & Smoke Test

  Run the health checklist below to confirm the deployment. Two cost and scaling notes:

  • Spot instances can cut 60–80% off the On-Demand rate but may be interrupted.
  • To autoscale, put Ollama behind an ALB and use an EC2 Auto Scaling group keyed on GPU utilization.

Quick health checklist

# GPU is visible
nvidia-smi


# Ollama is listening on the correct address
ss -tlnp | grep 11434


# Model is downloaded (use ollama ps to see models currently loaded in memory)
ollama list


# Local connectivity
curl http://localhost:11434/
# Expected: Ollama is running


# Models API endpoint (used by SearchBlox)
curl http://localhost:11434/v1/models
# Expected: JSON with qwen3-vl:30b-a3b-instruct
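
The checks above confirm that the server is up, but not that the model generates text. A minimal smoke test against Ollama's /api/generate endpoint could look like the following. The prompt text and helper names are illustrative, and the payload builder does no JSON escaping, so keep the prompt free of quotes and backslashes.

```shell
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
MODEL="qwen3-vl:30b-a3b-instruct"

# Build a non-streaming request body (no JSON escaping: keep $2 simple).
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# Send one prompt; curl -f turns HTTP errors into a non-zero exit code.
smoke_test() {
  curl -fsS "$OLLAMA_URL/api/generate" \
    -H 'Content-Type: application/json' \
    -d "$(build_payload "$MODEL" "Reply with the single word: ready")"
}

# Usage: smoke_test
# A successful call returns JSON whose "response" field holds the model's reply.
```

To test from outside the instance, set OLLAMA_URL to http://<elastic-ip>:11434 before calling smoke_test.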