Installing Ollama on AWS EC2

This guide walks through running Ollama with Qwen 2.5 on a single-GPU AWS EC2 instance, exposing the API on port 11434, and verifying access.

Steps to configure Ollama on an AWS EC2 instance

Follow the instructions below to set up Ollama on an AWS EC2 instance:

  1. Pick the right instance & AMI

    Instance     GPU (VRAM)          Why it works for Qwen 2.5-7B                        On-Demand price*
    g5.xlarge    1 × A10G (24 GB)    Fits the full 7B model in VRAM; good $/token-s      ≈ $0.92/hr
    g5.2xlarge   1 × A10G (24 GB)    Same GPU, more CPU/RAM if you run other services    ≈ $1.21/hr

    NOTE: *On-Demand pricing in the US East (N. Virginia) region.

    AMI options:

    • Fastest – use an AWS Deep Learning Base AMI (Ubuntu 22.04); it ships with the matching NVIDIA driver & CUDA libraries pre-installed (Reference). A CLI lookup sketch follows this list.
    • DIY – start from vanilla Ubuntu 22.04 and install the driver/CUDA yourself.
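
    One possible way to locate a current Deep Learning Base AMI ID from the CLI; this is a sketch only, and the name filter below is an assumption — adjust it to the exact AMI family you want:

    # List the three most recent Amazon-owned AMIs matching "Deep Learning Base" on Ubuntu 22.04.
    aws ec2 describe-images --owners amazon \
      --filters 'Name=name,Values=*Deep Learning Base*Ubuntu 22.04*' \
      --query 'reverse(sort_by(Images,&CreationDate))[:3].{Name:Name,ImageId:ImageId}' \
      --output table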
  2. Provision the instance

    # 1. Launch instance
    #    - g5.xlarge, Ubuntu 22.04 (DL Base AMI or vanilla)
    #    - 100 GB gp3 root volume (models + checkpoints)
    #    - SG rules: 22/tcp (SSH), 11434/tcp (Ollama API) or 443 if reverse-proxy
    # 2. Attach/allocate an Elastic IP for a stable endpoint
    ssh -i ~/.ssh/your-key.pem ubuntu@<elastic-ip>
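
    Alternatively, the launch and Elastic IP steps above can be scripted with the AWS CLI; a minimal sketch, where <ami-id>, <key-name>, <sg-id>, <subnet-id>, <instance-id>, and <allocation-id> are placeholders for your own values:

    # Launch a g5.xlarge with a 100 GB gp3 root volume.
    aws ec2 run-instances \
      --image-id <ami-id> \
      --instance-type g5.xlarge \
      --key-name <key-name> \
      --security-group-ids <sg-id> \
      --subnet-id <subnet-id> \
      --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=100,VolumeType=gp3}' \
      --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ollama-qwen25}]'

    # Allocate an Elastic IP and attach it to the new instance for a stable endpoint.
    aws ec2 allocate-address --domain vpc
    aws ec2 associate-address --instance-id <instance-id> --allocation-id <allocation-id>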
    
  3. Install NVIDIA driver & CUDA 12 (optional)

    NOTE: Follow this step only if you are not using a Deep Learning AMI; otherwise, skip it.

    NOTE: For more detail on the NVIDIA driver installation flow on AWS, refer to the AWS Documentation.

    sudo apt update && sudo apt -y upgrade
    sudo ubuntu-drivers autoinstall          # installs the recommended driver (≥535)
    sudo reboot                              # reboot so the new driver loads
    nvidia-smi                               # verify the GPU is visible
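    # Optional: confirm the driver version and usable VRAM in one line (handy for scripted checks).
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv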
    
  4. Install Ollama with GPU support

    curl -fsSL https://ollama.ai/install.sh | sh
    sudo systemctl enable --now ollama
    
    • Expose the API on the network

      Ollama binds to 127.0.0.1 by default. Add a systemd override that points it to all interfaces:

    sudo systemctl edit ollama

    # add the following to the override file that opens
    [Service]
    Environment="OLLAMA_HOST=0.0.0.0:11434"
    
    sudo systemctl daemon-reload
    sudo systemctl restart ollama
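
    For non-interactive provisioning (for example from a user-data or bootstrap script), the same override can be written directly as a systemd drop-in; a minimal sketch of the equivalent steps:

    # Equivalent to `systemctl edit`: create the drop-in file by hand, then reload and restart.
    sudo mkdir -p /etc/systemd/system/ollama.service.d
    printf '[Service]\nEnvironment="OLLAMA_HOST=0.0.0.0:11434"\n' | \
      sudo tee /etc/systemd/system/ollama.service.d/override.conf
    sudo systemctl daemon-reload && sudo systemctl restart ollama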
    

📘

NOTE:

  • If a YAML config is preferred, set listen: 0.0.0.0:11434 in /etc/ollama/ollama.yaml.
  • Multiple GPUs? Pin the fast card so Ollama doesn’t fall back to CPU:
    Environment="CUDA_VISIBLE_DEVICES=0"
    
  5. Pull and cache the Qwen 2.5 model

    # pull the 7B-parameter variant
    ollama pull qwen2.5:7b
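
    Optional sanity checks after the pull (the on-disk size varies with the quantization of the tag you pulled):

    ollama list    # qwen2.5:7b should appear with its on-disk size
    ollama ps      # on recent Ollama releases: shows loaded models and whether they sit in GPU memory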
    
  6. Smoke-test the endpoint

    curl -X POST http://<elastic-ip>:11434/api/generate \
      -d '{"model":"qwen2.5:7b","prompt":"Describe AWS EC2 in one line."}'
    
    You should see a streamed JSON response, and nvidia-smi should briefly show ~85–100% GPU utilization.
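
    If a single JSON object is easier to inspect than the streamed output, the same endpoint also accepts "stream": false:

    curl -s -X POST http://<elastic-ip>:11434/api/generate \
      -d '{"model":"qwen2.5:7b","prompt":"Describe AWS EC2 in one line.","stream":false}'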
  7. Enable HTTPS in front (optional)

    sudo apt install nginx
    # create /etc/nginx/sites-available/ollama.conf with the following server block:
    server {
        listen 443 ssl;
        server_name <your-domain>;
        ssl_certificate /etc/letsencrypt/live/<your-domain>/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/<your-domain>/privkey.pem;
        location / {
            proxy_pass http://127.0.0.1:11434;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
    sudo ln -s /etc/nginx/sites-available/ollama.conf /etc/nginx/sites-enabled/
    sudo nginx -t && sudo systemctl reload nginx
    
    Run certbot --nginx once to obtain a free TLS certificate (install it first with sudo apt install certbot python3-certbot-nginx).
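
    Once the certificate is in place, the proxy can be verified end to end (assuming DNS for <your-domain> already points at the Elastic IP) by listing the installed models over HTTPS:

    curl -s https://<your-domain>/api/tags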
  8. Cost & scaling tips

    • Spot instances can shave 60–80% off the on-demand rate but may be interrupted; see the Spot launch sketch after this list.
    • Autoscale by putting Ollama behind an ALB and using an EC2 Auto Scaling group keyed on GPU utilization.
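
    A minimal sketch of requesting the same g5.xlarge as a Spot instance via the AWS CLI (placeholders as in step 2; Spot capacity is not guaranteed):

    # Same launch call as before, with a Spot market option added.
    aws ec2 run-instances \
      --image-id <ami-id> \
      --instance-type g5.xlarge \
      --key-name <key-name> \
      --security-group-ids <sg-id> \
      --instance-market-options 'MarketType=spot'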

Quick health checklist

# GPU visible
nvidia-smi

# Ollama listening on public address
ss -ltnp | grep 11434

# Model pulled and available locally
ollama list | grep qwen2.5

# Sample chat
ollama run qwen2.5:7b

❗️

Important:

The config file searchai-config.yml needs to be updated so that searchblox-llm points to the new AWS-based Ollama service. The file is located at /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml

searchblox-llm: http://<elastic-ip>:11434/   # or https://<your-domain>/ if fronted by nginx
llm-platform: "ollama"
searchai-agents-server:
num-thread:

models:
  chat: "qwen2.5"
  document-enrichment: "qwen2.5"
  smart-faq: "qwen2.5"
  searchai-assist-text: "qwen2.5"
  searchai-assist-image: "llama3.2-vision"

cache-settings:
  use-cache: true
  fact-score-threshold: 40

prompts:
  standalone-question: |
    Given the conversation history and a follow-up question, rephrase the follow-up question to be a standalone question that includes all necessary context.
...
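
After updating the config, it is worth confirming that the SearchBlox server can reach the AWS-hosted Ollama endpoint; a quick check from that machine (substitute your Elastic IP or domain):

curl -s http://<elastic-ip>:11434/api/tags | grep qwen2.5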