Installing Ollama on AWS EC2
This guide walks you through setting up Ollama with the Qwen3-VL model (qwen3-vl:30b-a3b-instruct) on a single-GPU AWS EC2 instance, configuring it to expose its API on port 11434, and verifying that it is accessible and working correctly.
Steps to configure Ollama on an AWS EC2 instance
Follow these instructions to set up Ollama on an AWS EC2 instance:
1. Pick the right instance & AMI
| Instance | GPU (VRAM) | Why it works | On-Demand price* |
|---|---|---|---|
| g6e.2xlarge | 1x L40S 46 GB | Recommended: fits model + 4 parallel slots | ~$2.20/hr |
| g6e.4xlarge | 1x L40S 46 GB | More CPU/RAM, same GPU | ~$3.97/hr |

*NOTE: Prices are for US East (N. Virginia), on-demand.
For the AMI (Amazon Machine Image), choose one of:
- AWS Deep Learning Base AMI (Ubuntu 22.04 or 24.04) — recommended. This pre-installs the correct NVIDIA drivers and CUDA libraries, saving 10–15 minutes of manual setup.
- Vanilla Ubuntu 24.04 — install drivers manually in Step 3.
2. Provision the instance
# Launch g6e.2xlarge with:
# - Ubuntu 24.04 (DL Base AMI recommended)
# - 100 GB gp3 root volume (model is 19 GB)
# - SG rules: 22/tcp (SSH), 11434/tcp (Ollama API) or 443 if reverse-proxy
# Attach/allocate an Elastic IP for a stable endpoint
ssh -i ~/.ssh/your-key.pem ubuntu@<elastic-ip>
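The launch settings above can also be scripted with the AWS CLI. The following is a minimal sketch rather than a tested deployment script. It assumes the AWS CLI is installed and configured, and that the AMI's root device is /dev/sda1 (common for Ubuntu AMIs, but check yours). AMI_ID, KEY_NAME, and SG_ID are hypothetical placeholders that you must replace. The run helper and the DRY_RUN variable are illustrative names, not part of any tool.

```shell
#!/usr/bin/env bash
# Hypothetical placeholders: replace with your own AMI, key pair, and security group.
AMI_ID="${AMI_ID:-ami-REPLACE_ME}"
KEY_NAME="${KEY_NAME:-your-key}"
SG_ID="${SG_ID:-sg-REPLACE_ME}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Launch one g6e.2xlarge with a 100 GB gp3 root volume.
# Assumption: the AMI's root device name is /dev/sda1.
launch_ollama_instance() {
  run aws ec2 run-instances \
    --image-id "$AMI_ID" \
    --instance-type g6e.2xlarge \
    --key-name "$KEY_NAME" \
    --security-group-ids "$SG_ID" \
    --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=100,VolumeType=gp3}' \
    --count 1
}
```

Running `DRY_RUN=1 launch_ollama_instance` prints the full command for review before anything is created.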
3. Install NVIDIA Driver & CUDA (Skip if using DL AMI)
Ollama needs the NVIDIA CUDA runtime to offload computation to the GPU. The commands below install the recommended driver version automatically.
NOTE: Follow this step only if you are not using a DL AMI; otherwise, skip it.
NOTE: For details on the driver installation flow on AWS, refer to the AWS documentation.
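As an illustration of this step on vanilla Ubuntu 24.04, the sketch below uses Ubuntu's ubuntu-drivers tool to pick and install the recommended driver. This is an assumption about the intended flow, not the guide's confirmed command list, and the function name install_nvidia_driver is hypothetical. It does nothing if nvidia-smi already works.

```shell
#!/usr/bin/env bash
# Sketch only: assumes Ubuntu with apt and sudo access.
install_nvidia_driver() {
  # Skip the install when a working driver is already present (e.g. on a DL AMI).
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    echo "NVIDIA driver already working; nothing to install."
    return 0
  fi
  sudo apt-get update
  sudo apt-get install -y ubuntu-drivers-common
  # Let Ubuntu select and install the recommended driver for the detected GPU.
  sudo ubuntu-drivers install
  echo "Driver installed. Reboot the instance, then run nvidia-smi."
}
```

Call `install_nvidia_driver` once, reboot, and confirm that nvidia-smi lists the L40S before continuing.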
4. Install Ollama
- Run the install script
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
- Confirm the service is running:
systemctl status ollama
Edit the service file directly so that the API listens on all interfaces on port 11434 and the GPU options are set:
sudo vi /etc/systemd/system/ollama.service
Replace the entire contents with:
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_GPU=999"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=4"
[Install]
WantedBy=multi-user.target
Apply and restart:
sudo systemctl daemon-reload && sudo systemctl restart ollama
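To catch a typo in the unit file before debugging further, a small check like the one below can confirm that the expected Environment= lines are present. It is a sketch under stated assumptions: the function name check_ollama_unit is hypothetical, and it covers only a subset of the variables from the unit file above.

```shell
#!/usr/bin/env bash
# Report expected Environment= settings that are missing from a systemd unit file.
# Returns 0 when all are present, 1 otherwise.
check_ollama_unit() {
  local unit_file="${1:-/etc/systemd/system/ollama.service}"
  local missing=0
  local key
  for key in OLLAMA_HOST OLLAMA_FLASH_ATTENTION OLLAMA_KV_CACHE_TYPE \
             OLLAMA_KEEP_ALIVE OLLAMA_NUM_PARALLEL; do
    # Match lines of the form: Environment="KEY=value"
    if ! grep -q "^Environment=\"$key=" "$unit_file"; then
      echo "missing: $key"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_ollama_unit` with no argument inspects the default service file. To see what systemd actually loaded after the restart, `systemctl show ollama --property=Environment` is a useful complement.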
Pull the model, then re-create it with the runtime parameters set in a Modelfile:
ollama pull qwen3-vl:30b-a3b-instruct
cat > /tmp/Modelfile <<EOF
FROM qwen3-vl:30b-a3b-instruct
PARAMETER num_ctx 16000
PARAMETER num_batch 512
PARAMETER num_gpu 999
PARAMETER num_thread 8
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 20
EOF
ollama create qwen3-vl:30b-a3b-instruct -f /tmp/Modelfile
Verify parameters were applied:
ollama show qwen3-vl:30b-a3b-instruct --modelfile
Restart Ollama to load the updated model:
sudo systemctl restart ollama
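Once the service is back up, a single non-streaming request to Ollama's /api/generate endpoint is a quick way to confirm that the model answers. The sketch below is illustrative: the function names and the OLLAMA_URL and MODEL variables are hypothetical, and the payload builder does no JSON escaping, so keep the prompt free of double quotes and backslashes.

```shell
#!/usr/bin/env bash
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
MODEL="${MODEL:-qwen3-vl:30b-a3b-instruct}"

# Build a non-streaming /api/generate request body.
# Assumption: the prompt contains no double quotes or backslashes.
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$MODEL" "$1"
}

# Send the prompt to Ollama and print the raw JSON response.
smoke_test() {
  curl -fsS "$OLLAMA_URL/api/generate" -d "$(generate_payload "$1")"
}
```

For example, `smoke_test "Reply with the word ready"` should return a JSON object whose response field contains the model's answer.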
5. Configure SearchBlox (searchai-config.yml)
This file defines which LLM provider to use, the API endpoint, and which model handles each AI task.
File location: /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml
active-llm-provider: "ollama"
task-providers:
  chat: "ollama"
  document-enrichment: "ollama"
  smart-faq: "ollama"
  searchai-assist-text: "ollama"
  searchai-assist-image: "ollama"
  recommendations: "ollama"
  knowledge-graph: "ollama"
  document-query-decomposition: "ollama"
  product-query-decomposition: "ollama"
  product-kg-extraction: "ollama"
  analytics: "ollama"
  testing: "ollama"
  admin: "ollama"
  analysis: "ollama"
  extraction: "ollama"
  log-analysis: "ollama"
  agent-chat: "ollama"
  agent-analytics: "ollama"
  agent-testing: "ollama"
  agent-analysis: "ollama"
  agent-admin: "ollama"
  agent-extraction: "ollama"
llm-providers:
  ollama:
    platform: "ollama"
    url: "http://localhost:11434/"
    models:
      chat: "qwen3-vl:30b-a3b-instruct"
      document-enrichment: "qwen3-vl:30b-a3b-instruct"
      smart-faq: "qwen3-vl:30b-a3b-instruct"
      searchai-assist-text: "qwen3-vl:30b-a3b-instruct"
      searchai-assist-image: "qwen3-vl:30b-a3b-instruct"
      recommendations: "qwen3-vl:30b-a3b-instruct"
      knowledge-graph: "qwen3-vl:30b-a3b-instruct"
      document-query-decomposition: "qwen3-vl:30b-a3b-instruct"
      product-query-decomposition: "qwen3-vl:30b-a3b-instruct"
      product-kg-extraction: "qwen3-vl:30b-a3b-instruct"
      analytics: "qwen3-vl:30b-a3b-instruct"
      testing: "qwen3-vl:30b-a3b-instruct"
      admin: "qwen3-vl:30b-a3b-instruct"
      analysis: "qwen3-vl:30b-a3b-instruct"
      extraction: "qwen3-vl:30b-a3b-instruct"
      log-analysis: "qwen3-vl:30b-a3b-instruct"
      agent-chat: "qwen3-vl:30b-a3b-instruct"
      agent-analytics: "qwen3-vl:30b-a3b-instruct"
      agent-testing: "qwen3-vl:30b-a3b-instruct"
      agent-analysis: "qwen3-vl:30b-a3b-instruct"
      agent-admin: "qwen3-vl:30b-a3b-instruct"
      agent-extraction: "qwen3-vl:30b-a3b-instruct"
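A mistyped model name in this file only shows up when a task fails, so it can help to compare the configured names against what Ollama has pulled. The sketch below is a rough helper, not a SearchBlox tool: config_models and check_models_pulled are hypothetical names, and the parsing assumes the models: block is the last block in the file, as in the listing above.

```shell
#!/usr/bin/env bash
# Print the distinct model names found under the "models:" block.
# Assumption: every quoted value after "models:" is a model name.
config_models() {
  awk '
    /^[ \t]*models:[ \t]*$/ { in_models = 1; next }
    in_models && /^[ \t]*[a-z-]+:[ \t]*"/ {
      split($0, parts, "\"")   # parts[2] is the text between the first pair of quotes
      print parts[2]
    }
  ' "$1" | sort -u
}

# Report whether each configured model appears in the first column of "ollama list".
check_models_pulled() {
  local m
  for m in $(config_models "$1"); do
    if ollama list | awk -v m="$m" '$1 == m { found = 1 } END { exit !found }'; then
      echo "ok: $m"
    else
      echo "not pulled: $m"
    fi
  done
}
```

For example, `check_models_pulled /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml` should print a single "ok" line for qwen3-vl:30b-a3b-instruct.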
If you expose Ollama through an Nginx reverse proxy on port 443, run certbot --nginx once to obtain a free TLS certificate.
- Spot instances can cut 60-80% off the on-demand rate but may be interrupted.
- To autoscale, put Ollama behind an ALB and use an EC2 Auto Scaling group keyed on GPU utilization.
Quick health checklist
# GPU is visible
nvidia-smi
# Ollama is listening on the correct address
ss -tlnp | grep 11434
# Model is downloaded (ollama ps shows what is currently loaded in memory)
ollama list
# Local connectivity
curl http://localhost:11434/
# Expected: Ollama is running
# Models API endpoint (used by SearchBlox)
curl http://localhost:11434/v1/models
# Expected: JSON with qwen3-vl:30b-a3b-instruct
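The two curl checks above can be combined into one function that also works from another machine, which is how to confirm the API is reachable through the security group. This is a minimal sketch: ollama_health is a hypothetical name, and the model match is a plain substring search on the /v1/models response rather than real JSON parsing.

```shell
#!/usr/bin/env bash
# Check that an Ollama endpoint responds and lists the expected model.
# Usage: ollama_health [base_url] [model]
ollama_health() {
  local base_url="${1:-http://localhost:11434}"
  local model="${2:-qwen3-vl:30b-a3b-instruct}"
  local banner

  # The root path should answer with the plain-text banner.
  banner=$(curl -fsS --max-time 5 "$base_url/") || {
    echo "FAIL: no response from $base_url"
    return 1
  }
  [ "$banner" = "Ollama is running" ] || {
    echo "FAIL: unexpected banner: $banner"
    return 1
  }

  # The models endpoint should mention the configured model.
  curl -fsS --max-time 5 "$base_url/v1/models" | grep -q "$model" || {
    echo "FAIL: $model not listed at $base_url/v1/models"
    return 1
  }
  echo "OK: $base_url serves $model"
}
```

From a workstation, `ollama_health http://<elastic-ip>:11434` exercises the public path. Run on the instance with no arguments, it checks localhost.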
