Installing Ollama on AWS EC2
This guide sets up Ollama with the qwen3-vl:30b-a3b-instruct model on a single-GPU AWS EC2 instance, exposes the API on port 11434, and verifies access.
Steps to configure Ollama on an AWS EC2 instance
Follow these instructions to set up Ollama on an AWS EC2 instance:
-
Pick the right instance & AMI
| Instance | GPU (VRAM) | Why it works | On-Demand price* |
| --- | --- | --- | --- |
| g6e.2xlarge | 1x L40S 46 GB | Recommended: fits model + 4 parallel slots | ~$2.20/hr |
| g6e.4xlarge | 1x L40S 46 GB | More CPU/RAM, same GPU | ~$3.97/hr |

*NOTE: US-East - N. Virginia, on-demand.
For the AMI (Amazon Machine Image), choose one of:
- AWS Deep Learning Base AMI (Ubuntu 22.04 or 24.04) — recommended. This pre-installs the correct NVIDIA drivers and CUDA libraries, saving 10–15 minutes of manual setup.
- Vanilla Ubuntu 24.04 — install drivers manually in Step 3.
-
Provision the instance
# Launch g6e.2xlarge with:
# - Ubuntu 24.04 (DL Base AMI recommended)
# - 100 GB gp3 root volume (model is 19 GB)
# - SG rules: 22/tcp (SSH), 11434/tcp (Ollama API) or 443 if reverse-proxy
# Attach/allocate an Elastic IP for a stable endpoint, then connect:
ssh -i ~/.ssh/your-key.pem ubuntu@<elastic-ip>
-
Install NVIDIA Driver & CUDA (Skip if using DL AMI)
Ollama needs the NVIDIA CUDA runtime to offload computation to the GPU. The commands below install the recommended driver version automatically.
NOTE: Follow this step only if you are not using a DL AMI; otherwise skip it.
NOTE: For more on the driver installation flow on AWS, refer to the AWS documentation.
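The driver commands themselves are not reproduced in this step. As a rough sketch for a vanilla Ubuntu 24.04 instance, one common route is Ubuntu's ubuntu-drivers tool, which selects the recommended driver for the detected GPU. The function names below are illustrative only, and the AWS documentation may recommend a different driver package for your instance type.

```shell
# Sketch only: assumes a vanilla Ubuntu 24.04 instance with sudo access.
# Skip this entirely on the Deep Learning Base AMI.

# Returns 0 when nvidia-smi is already on PATH (a driver is present).
has_nvidia_driver() {
  command -v nvidia-smi >/dev/null 2>&1
}

# Installs the driver that Ubuntu recommends for the detected GPU.
install_nvidia_driver() {
  sudo apt-get update
  sudo apt-get install -y ubuntu-drivers-common
  sudo ubuntu-drivers install
  echo "Driver installed. Reboot with 'sudo reboot', then run nvidia-smi."
}
```

A typical call would be `has_nvidia_driver || install_nvidia_driver`, followed by a reboot and `nvidia-smi` to confirm the GPU is visible.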
-
Install Ollama
- Run the install script
curl -fsSL https://ollama.com/install.sh | sh
sudo systemctl enable --now ollama
- Confirm the service is running:
systemctl status ollama
Edit the service file directly:
sudo vi /etc/systemd/system/ollama.service
Replace the entire contents with:
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_GPU=999"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_NUM_PARALLEL=4"
[Install]
WantedBy=multi-user.target
Apply and restart:
sudo systemctl daemon-reload && sudo systemctl restart ollama
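To confirm that systemd picked up the settings from the unit file, one option is to read back the resolved environment with `systemctl show`. This is a minimal sketch; the helper names (ollama_unit_env, env_has, check_ollama_env) are made up for illustration.

```shell
# Print the Environment= line that systemd resolved for the Ollama unit.
ollama_unit_env() {
  systemctl show ollama --property=Environment
}

# Return 0 if the space-separated string $1 contains the exact pair $2.
env_has() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report whether a few of the settings from the unit file are active.
check_ollama_env() {
  env_line=$(ollama_unit_env)
  env_line=${env_line#Environment=}
  for pair in "OLLAMA_HOST=0.0.0.0:11434" "OLLAMA_NUM_PARALLEL=4" "OLLAMA_KEEP_ALIVE=-1"; do
    if env_has "$env_line" "$pair"; then
      echo "ok       $pair"
    else
      echo "MISSING  $pair"
    fi
  done
}
```

Run `check_ollama_env` after the restart. A MISSING line suggests the unit file was not saved or `daemon-reload` was skipped.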
Pull the model, then rebuild it under the same name with tuned parameters from a Modelfile:
ollama pull qwen3-vl:30b-a3b-instruct
cat > /tmp/Modelfile <<EOF
FROM qwen3-vl:30b-a3b-instruct
PARAMETER num_ctx 16000
PARAMETER num_batch 512
PARAMETER num_gpu 999
PARAMETER num_thread 8
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 20
EOF
ollama create qwen3-vl:30b-a3b-instruct -f /tmp/Modelfile
Verify parameters were applied:
ollama show qwen3-vl:30b-a3b-instruct --modelfile
Restart Ollama to load the updated model:
sudo systemctl restart ollama
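Once the model has been recreated, a single request against Ollama's /api/generate endpoint is a quick way to check that it loads and answers. The sketch below assumes the default local URL; the OLLAMA_URL and MODEL variables and the function names are illustrative. The prompt is inserted into JSON as-is, so it must not contain double quotes or backslashes.

```shell
OLLAMA_URL="${OLLAMA_URL:-http://localhost:11434}"
MODEL="${MODEL:-qwen3-vl:30b-a3b-instruct}"

# Build the JSON body for a non-streaming /api/generate call.
# $1 = model name, $2 = prompt (no double quotes or backslashes).
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$1" "$2"
}

# Send one short prompt and print the raw JSON reply.
smoke_test() {
  curl -fsS "$OLLAMA_URL/api/generate" \
    -H 'Content-Type: application/json' \
    -d "$(generate_payload "$MODEL" "Reply with the single word: ready")"
}
```

Call `smoke_test` on the instance. A JSON object containing a `response` field indicates the model answered, and `ollama ps` should then list it as loaded.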
-
Configure SearchBlox (searchai-config.yml)
This file defines which LLM provider to use, the API endpoint, and which model handles each AI task.
File location: /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml
active-llm-provider: "ollama"
task-providers:
  chat: "ollama"
  document-enrichment: "ollama"
  smart-faq: "ollama"
  searchai-assist-text: "ollama"
  searchai-assist-image: "ollama"
  recommendations: "ollama"
  knowledge-graph: "ollama"
  document-query-decomposition: "ollama"
  product-query-decomposition: "ollama"
  product-kg-extraction: "ollama"
  analytics: "ollama"
  testing: "ollama"
  admin: "ollama"
  analysis: "ollama"
  extraction: "ollama"
  log-analysis: "ollama"
  agent-chat: "ollama"
  agent-analytics: "ollama"
  agent-testing: "ollama"
  agent-analysis: "ollama"
  agent-admin: "ollama"
  agent-extraction: "ollama"
llm-providers:
  ollama:
    platform: "ollama"
    url: "http://localhost:11434/"
    models:
      chat: "qwen3-vl:30b-a3b-instruct"
      document-enrichment: "qwen3-vl:30b-a3b-instruct"
      smart-faq: "qwen3-vl:30b-a3b-instruct"
      searchai-assist-text: "qwen3-vl:30b-a3b-instruct"
      searchai-assist-image: "qwen3-vl:30b-a3b-instruct"
      recommendations: "qwen3-vl:30b-a3b-instruct"
      knowledge-graph: "qwen3-vl:30b-a3b-instruct"
      document-query-decomposition: "qwen3-vl:30b-a3b-instruct"
      product-query-decomposition: "qwen3-vl:30b-a3b-instruct"
      product-kg-extraction: "qwen3-vl:30b-a3b-instruct"
      analytics: "qwen3-vl:30b-a3b-instruct"
      testing: "qwen3-vl:30b-a3b-instruct"
      admin: "qwen3-vl:30b-a3b-instruct"
      analysis: "qwen3-vl:30b-a3b-instruct"
      extraction: "qwen3-vl:30b-a3b-instruct"
      log-analysis: "qwen3-vl:30b-a3b-instruct"
      agent-chat: "qwen3-vl:30b-a3b-instruct"
      agent-analytics: "qwen3-vl:30b-a3b-instruct"
      agent-testing: "qwen3-vl:30b-a3b-instruct"
      agent-analysis: "qwen3-vl:30b-a3b-instruct"
      agent-admin: "qwen3-vl:30b-a3b-instruct"
      agent-extraction: "qwen3-vl:30b-a3b-instruct"
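Because the task list appears twice in this file (once under task-providers and once under models), a missing entry is easy to overlook. The sketch below compares the two key sets with awk and comm. It assumes each task routed to a provider should also have a model entry and that keys sit one per line as shown above. The function names are illustrative, and this is not a general YAML parser.

```shell
# List the keys found under section header $2 (without the colon) in file $1.
section_keys() {
  awk -v section="$2:" '
    { line = $0; sub(/^[ \t]+/, "", line); sub(/[ \t\r]+$/, "", line) }
    line == section { in_section = 1; next }        # entered the section
    line ~ /^[A-Za-z0-9_-]+:$/ { in_section = 0 }   # another bare header ends it
    in_section && line ~ /^[A-Za-z0-9_-]+:[ \t]/ {
      split(line, parts, ":"); print parts[1]
    }
  ' "$1" | sort
}

# Print tasks that have a provider but no model entry; return 1 if any exist.
check_task_models() {
  tasks_file=$(mktemp) || return 2
  models_file=$(mktemp) || return 2
  section_keys "$1" task-providers > "$tasks_file"
  section_keys "$1" models > "$models_file"
  missing=$(comm -23 "$tasks_file" "$models_file")
  rm -f "$tasks_file" "$models_file"
  if [ -n "$missing" ]; then
    echo "Tasks with no model entry:"
    echo "$missing"
    return 1
  fi
  echo "Every task has a model entry."
}
```

For example: `check_task_models /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml`.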
If you front Ollama with an Nginx reverse proxy on port 443, run certbot --nginx once to obtain the free TLS certificate.
- Spot instances can cut 60–80% off the on-demand rate but may be interrupted.
- Autoscale by putting Ollama behind an ALB and using an EC2 Auto Scaling group keyed on GPU utilization.
Quick health checklist
# GPU is visible
nvidia-smi
# Ollama is listening on the correct address
ss -tlnp | grep 11434
# Model is loaded
ollama list
# Local connectivity
curl http://localhost:11434/
# Expected: Ollama is running
# Models API endpoint (used by SearchBlox)
curl http://localhost:11434/v1/models
# Expected: JSON with qwen3-vl:30b-a3b-instruct
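The checks above can also be wrapped in a small script that prints one PASS or FAIL line per check and reports overall success through its exit status. This is a sketch under the same assumptions as the checklist (local endpoint on port 11434, the model name used in this guide); the check and run_health_checks helpers are illustrative names.

```shell
failures=0

# Run a command quietly; print PASS or FAIL with a label and count failures.
# $1 = label, remaining arguments = command to run.
check() {
  name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    failures=$((failures + 1))
  fi
}

# Run every check from the list above; return 0 only if all pass.
run_health_checks() {
  failures=0
  check "GPU visible"          nvidia-smi
  check "Port 11434 listening" sh -c 'ss -tln | grep -q ":11434"'
  check "Root endpoint"        curl -fsS --max-time 5 http://localhost:11434/
  check "Model in /v1/models"  sh -c 'curl -fsS --max-time 5 http://localhost:11434/v1/models | grep -q "qwen3-vl:30b-a3b-instruct"'
  [ "$failures" -eq 0 ]
}
```

Running `run_health_checks && echo "all good"` on the instance gives a quick yes/no answer, which also makes the script usable from cron or a monitoring agent.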
