Installing Ollama on AWS EC2
Run Ollama + Qwen 2.5 on a single-GPU AWS EC2 instance, expose the API on port 11434, and verify access.
Steps to configure Ollama on an AWS EC2 instance
Follow the instructions below to set up Ollama on an AWS EC2 instance:
1. Pick the right instance & AMI
| Instance | GPU (VRAM) | Why it works for Qwen 2.5-7B | On-Demand price* |
|---|---|---|---|
| g5.xlarge | 1 × A10G (24 GB) | Fits the full 7B model in VRAM; good $/token-s | ≈ $0.92/hr |
| g5.2xlarge | 1 × A10G (24 GB) | Same GPU, more CPU/RAM if you run other services | ≈ $1.21/hr |

*NOTE: On-demand pricing in US-East (N. Virginia).
AMI options:
- Fastest – use an AWS Deep Learning Base AMI (Ubuntu 22.04); it ships with the matching NVIDIA driver and CUDA libraries pre-installed (a CLI lookup sketch follows this list).
- DIY – start from vanilla Ubuntu 22.04 and install the driver/CUDA yourself.
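If you prefer to resolve the latest DL Base AMI ID from the CLI rather than the console, a lookup along these lines works; the name filter below is an assumption, and the exact AMI naming scheme varies by release and region:

```bash
# Find the most recent Deep Learning Base AMI for Ubuntu 22.04
# (the name pattern is an assumption; adjust to match the current AMI naming)
aws ec2 describe-images \
  --owners amazon \
  --filters "Name=name,Values=Deep Learning Base*Ubuntu 22.04*" \
  --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]' \
  --output text
```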
2. Provision the instance
```bash
# 1. Launch instance
#    - g5.xlarge, Ubuntu 22.04 (DL Base AMI or vanilla)
#    - 100 GB gp3 root volume (models + checkpoints)
#    - SG rules: 22/tcp (SSH), 11434/tcp (Ollama API) or 443 if reverse-proxy
# 2. Attach/allocate an Elastic IP for a stable endpoint
ssh -i ~/.ssh/your-key.pem ubuntu@<elastic-ip>
```
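The launch can also be scripted. A minimal sketch with the AWS CLI, assuming a key pair and security group already exist; the AMI, key, security-group, instance, and allocation IDs below are placeholders:

```bash
# Launch a g5.xlarge with a 100 GB gp3 root volume (IDs are placeholders)
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type g5.xlarge \
  --key-name your-key \
  --security-group-ids sg-xxxxxxxxxxxxxxxxx \
  --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=100,VolumeType=gp3}' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ollama-qwen25}]'

# Allocate and attach an Elastic IP (substitute the real instance/allocation IDs)
aws ec2 allocate-address --query AllocationId --output text
aws ec2 associate-address --instance-id i-xxxxxxxxxxxxxxxxx --allocation-id eipalloc-xxxxxxxxxxxxxxxxx
```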
3. Install NVIDIA driver & CUDA 12 (optional)
NOTE: Follow this step only if you are not using a DL AMI; otherwise skip it.
NOTE: For more detail on the driver install flow on AWS, refer to the AWS documentation.
```bash
sudo apt update && sudo apt -y upgrade
sudo ubuntu-drivers autoinstall   # installs the recommended driver (≥535)
sudo reboot
nvidia-smi                        # verify the GPU is visible
```
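To confirm the expected card and driver are active, `nvidia-smi` can also emit a machine-readable summary:

```bash
# Should report an A10G with 24 GB and a driver ≥535 on a g5 instance
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
```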
4. Install Ollama with GPU support
```bash
curl -fsSL https://ollama.ai/install.sh | sh
sudo systemctl enable --now ollama
```
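A quick sanity check that the install succeeded and the service came up:

```bash
ollama --version                             # CLI installed
systemctl status ollama --no-pager           # service active (running)
sudo journalctl -u ollama -n 20 --no-pager   # recent logs; look for GPU detection
```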
5. Expose the API on the network
Ollama binds to `127.0.0.1` by default. Add a systemd override that points it to all interfaces:
```bash
sudo systemctl edit ollama
```

Add the following to the override:

```
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
```

Then reload and restart:

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
NOTE:
- If YAML config is preferred, set `listen: 0.0.0.0:11434` in `/etc/ollama/ollama.yaml`.
- Multiple GPUs? Pin the fast card so Ollama doesn't fall back to CPU: `Environment="CUDA_VISIBLE_DEVICES=0"`
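Before moving on, it is worth verifying the override took effect by checking the service environment and the listening socket:

```bash
systemctl show ollama -p Environment   # should include OLLAMA_HOST=0.0.0.0:11434
ss -ltnp | grep 11434                  # should show a listener on 0.0.0.0:11434
```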
6. Pull and cache the Qwen 2.5 model
```bash
# 7B-parameter variant
ollama pull qwen2.5:7b
```
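To confirm the pull completed and see how much disk the weights consume; the model directory below is the default for the systemd install and is an assumption if you relocated it:

```bash
ollama list                                    # qwen2.5:7b should be listed with its size
sudo du -sh /usr/share/ollama/.ollama/models   # default model dir (assumption; yours may differ)
```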
7. Smoke-test the endpoint
```bash
curl -X POST http://<elastic-ip>:11434/api/generate \
  -d '{"model":"qwen2.5:7b","prompt":"Describe AWS EC2 in one line."}'
```
You should see a streamed JSON response, and `nvidia-smi` should briefly show ~85–100 % GPU utilization.
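If a single JSON object is easier to inspect than a token stream, the generate endpoint also accepts `"stream": false`:

```bash
curl -s -X POST http://<elastic-ip>:11434/api/generate \
  -d '{"model":"qwen2.5:7b","prompt":"Describe AWS EC2 in one line.","stream":false}'
```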
8. Enable HTTPS in front (optional)
```bash
sudo apt install nginx
# create /etc/nginx/sites-available/ollama.conf
```

```nginx
server {
    listen 443 ssl;
    server_name <your-domain>;

    ssl_certificate     /etc/letsencrypt/live/<your-domain>/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/<your-domain>/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
```

```bash
sudo ln -s /etc/nginx/sites-available/ollama.conf /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```

Run `certbot --nginx` once to obtain the free TLS certificate.
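certbot is not installed by default on Ubuntu 22.04; a typical install plus certificate request looks like this, assuming your domain's DNS already points at the Elastic IP:

```bash
sudo apt install -y certbot python3-certbot-nginx
sudo certbot --nginx -d <your-domain>
```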
9. Cost & scaling tips
- Spot instances can shave 60–80 % off the on-demand rate but may be interrupted.
- Autoscale by putting Ollama behind an ALB and using an EC2 Auto Scaling group keyed on GPU utilization (a launch-template user-data sketch follows this list).
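For the Auto Scaling route, new instances must come up ready to serve. A minimal user-data sketch for a launch template, assuming a DL Base AMI so no driver install is needed; the model and port mirror the steps above:

```bash
#!/usr/bin/env bash
# User-data sketch: install Ollama, bind to all interfaces, pre-pull the model
set -euo pipefail

curl -fsSL https://ollama.ai/install.sh | sh

# Same systemd override as step 5, written non-interactively
mkdir -p /etc/systemd/system/ollama.service.d
cat >/etc/systemd/system/ollama.service.d/override.conf <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
EOF

systemctl daemon-reload
systemctl enable --now ollama

# Cache the model so the first request doesn't pay the download cost
ollama pull qwen2.5:7b
```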
Quick health checklist
```bash
# GPU visible
nvidia-smi

# Ollama listening on public address
ss -ltnp | grep 11434

# Model loaded in VRAM
ollama list | grep qwen2.5

# Sample chat
ollama run qwen2.5:7b
```
Important:
The config file `searchai-config.yml` must be updated so that `searchblox-llm` points to the new AWS-based Ollama service. The config file is located at `/opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml`.
```yaml
searchblox-llm: http://localhost:11434/   # point this to the AWS Ollama endpoint, e.g. http://<elastic-ip>:11434/
llm-platform: "ollama"
searchai-agents-server:
num-thread:
models:
  chat: "qwen2.5"
  document-enrichment: "qwen2.5"
  smart-faq: "qwen2.5"
  searchai-assist-text: "qwen2.5"
  searchai-assist-image: "llama3.2-vision"
cache-settings:
  use-cache: true
  fact-score-threshold: 40
prompts:
  standalone-question: |
    Given the conversation history and a follow-up question, rephrase the follow-up question to be a standalone question that includes all necessary context.
...
```
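After updating the config, a quick way to confirm the SearchBlox host can actually reach the new endpoint (substitute the Elastic IP or domain you configured above):

```bash
# Lists the models the Ollama server has cached; qwen2.5 should appear
curl -s http://<elastic-ip>:11434/api/tags
```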