Installing Ollama on Azure VM

Steps to deploy Ollama on Azure VM:

  1. Choose a VM: pick a GPU-enabled N-series size on Azure, such as the NC-series.
  2. Install drivers: install the NVIDIA drivers, CUDA, and NCCL.
  3. Install Ollama: use the native binary or Docker.
  4. Download the model: pull the model into the models directory.
  5. Run Ollama: launch it as a server with the model loaded.
  6. Expose the endpoint: put Nginx in front and configure the ports.
  7. Smoke-test the endpoint
  8. (Recommended) Secure & scale
  9. Update the SearchBlox config to point to the external endpoint for LLM inference.

The goal is to run Ollama 0.21+ with full NVIDIA-GPU acceleration on an Azure Linux VM and expose an HTTP endpoint that serves Qwen 2.5.

Detailed Guide to Installing Ollama on an Azure VM

  1. Azure VM & images


    Scenario                       | Recommended VM sizes                                                    | Notes
    Dev / PoC (<15 tokens/s)       | Standard_NC4as_T4_v3 (T4 16 GB)                                         | Cheapest N-series with CUDA 12 support
    Prod (7B model, ~50 tokens/s)  | Standard_NC6s_v3 (V100 16 GB) or Standard_NC8ads_A100_v4 (A100 40 GB)   | A100 gives the best perf/$$
    Heavy-load / multiple models   | Standard_NC24ads_A100_v4 (4×A100 40 GB)                                 | Use GPU partitions or run 4 Ollama instances

    Image: Ubuntu 22.04 LTS (Canonical marketplace), enable Accelerated Networking.
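
    Before creating the VM, it can help to confirm which GPU sizes are actually available (and not capacity-restricted) in your region. A quick check with the Azure CLI, assuming eastus as the target region:

    az vm list-skus --location eastus --size Standard_NC --output table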

  2. Open the service port

    Add an inbound rule in the VM’s Network Security Group with the following settings:

    Protocol: TCP | Port: 11434 | Source: Your-IP/Load-Balancer
    

    11434 is Ollama’s default; change later if desired.
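
    If you prefer the Azure CLI, the same rule can be created like this (the resource group, NSG name, and source IP below are placeholders for your own values):

    az network nsg rule create \
      --resource-group rg-ollama \
      --nsg-name ollama-dev-nsg \
      --name allow-ollama \
      --priority 1010 \
      --direction Inbound --access Allow --protocol Tcp \
      --destination-port-ranges 11434 \
      --source-address-prefixes <Your-IP>/32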

  3. SSH in & install NVIDIA drivers + CUDA

    # ── prerequisites
    sudo apt-get update && sudo apt-get install -y build-essential git wget curl
    
    # ── NVIDIA driver & CUDA 12 (works with V100 / T4 / A100)
    curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt-get update
    sudo apt-get -y install cuda-drivers-535  # or newer
    
    # ── reboot to load nvidia modules
    sudo reboot
    

    After reboot

    nvidia-smi            # should list your GPU (dkms status also shows the driver module)
    nvcc --version        # confirms the CUDA toolkit install
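
    Note that cuda-drivers-535 installs only the driver. If nvcc is not found, the toolkit can be added from the same repository (package name as published in NVIDIA's Ubuntu 22.04 repo):

    sudo apt-get -y install cuda-toolkit   # provides nvcc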
    
  4. Install Ollama with cuBLAS support

    There are two options for this step:

    • Option A – Native binary (simplest)

      curl -fsSL https://ollama.com/install.sh | sh   # script auto-detects CUDA and enables GPU (cuBLAS) support
      

      If the install logs do not show "GPU support: enabled (cuBLAS)", remove /usr/local/bin/ollama, install the CUDA toolkit, and re-run the script.
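
      To double-check GPU detection after a native install, the service logs can be inspected (the install script registers an ollama systemd unit; the exact log wording varies by version):

      journalctl -u ollama -n 100 --no-pager | grep -iE 'cuda|gpu'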

    • Option B – Docker (handy for multiple instances)

      sudo apt-get -y install docker.io
      sudo systemctl enable --now docker
      
      sudo docker run --gpus all \
        -d -p 11434:11434 \
        -v ollama:/root/.ollama \
        --name ollama \
        ollama/ollama:latest
      

      NOTE: Docker image already contains a cuBLAS-built llama.cpp.
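
      For --gpus all to work, the host also needs the NVIDIA Container Toolkit. A minimal sketch following NVIDIA's published apt instructions (repository URLs as documented by NVIDIA):

      curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
        sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
      curl -sL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
      sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
      sudo nvidia-ctk runtime configure --runtime=docker
      sudo systemctl restart docker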

  5. Fetch the Qwen 2.5 model

    Ollama looks for models in /usr/share/ollama/.ollama/models (native) or the mounted volume (Docker).

    • Check the public registry
      ollama pull qwen2.5   # succeeds if an official GGUF exists
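
    • Verify the download: ollama list and ollama run are standard CLI checks (with Docker, prefix each command with sudo docker exec -it ollama)
      ollama list                  # the model should be listed with its size
      ollama run qwen2.5 "Hello"   # quick local generation test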
      
  6. Run Ollama as a service

    sudo tee /etc/systemd/system/ollama.service <<'EOF'
    [Unit]
    Description=Ollama LLM Server
    After=network.target

    [Service]
    User=ubuntu
    ExecStart=/usr/local/bin/ollama serve
    Restart=always
    Environment=OLLAMA_MODELS=/home/ubuntu/.ollama/models

    [Install]
    WantedBy=multi-user.target
    EOF
    
    sudo systemctl daemon-reload
    sudo systemctl enable --now ollama
    

    (If using Docker instead, set the container's restart policy with --restart unless-stopped, as shown below.)
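
    A quick check that the service is up, plus the Docker equivalent of the restart policy (container name ollama from step 4):

    sudo systemctl status ollama --no-pager    # should show active (running)
    curl http://localhost:11434/api/tags       # lists the models visible to the server

    # Docker variant: keep the container running across reboots
    sudo docker update --restart unless-stopped ollama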

  7. Smoke-test the endpoint

    curl http://<VM-PUBLIC-IP>:11434/api/generate \
         -d '{"model":"qwen2.5","prompt":"Hello from Azure!"}'
    

    The response is a JSON stream containing the generated tokens.
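
    To get a single JSON object instead of a token stream, streaming can be disabled in the request body:

    curl http://<VM-PUBLIC-IP>:11434/api/generate \
         -d '{"model":"qwen2.5","prompt":"Hello from Azure!","stream":false}'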

  8. (Recommended) Secure & scale


    Task        | Quick Pointer
    TLS         | Terminate TLS with Nginx (proxy_pass http://localhost:11434) and use a free cert from Let’s Encrypt
    Auth        | Put an API gateway (Azure API Management or Nginx auth-request) in front; Ollama itself has no auth today
    Autoscale   | Front the VM with an Azure VMSS or use Azure Container Apps + GPU nodepools
    Monitoring  | Enable nvidia-dcgm-exporter + Prometheus for GPU metrics; APM via OpenTelemetry
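
    A minimal Nginx reverse-proxy sketch for the TLS row above; the hostname and certificate paths are placeholders (Certbot normally provisions them):

    server {
        listen 443 ssl;
        server_name ollama.example.com;          # placeholder hostname

        ssl_certificate     /etc/letsencrypt/live/ollama.example.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/ollama.example.com/privkey.pem;

        location / {
            proxy_pass http://localhost:11434;
            proxy_read_timeout 300s;             # generation can stream for a while
        }
    }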

    One-liner cheat-sheet

    # Dev box in one go (native):
    az vm create -g rg-ollama -n ollama-dev \
      --image Canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
      --size Standard_NC4as_T4_v3 --public-ip-sku Standard \
      --admin-username ubuntu --generate-ssh-keys \
      --custom-data cloud_init_ollama.yaml
    

    (Populate cloud_init_ollama.yaml with the script from sections 3–6.)
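
    A minimal cloud_init_ollama.yaml sketch, assuming the driver and Ollama install steps from sections 3–5 (a reboot is still required before the GPU is usable, so the model pull may need to be repeated afterwards):

    #cloud-config
    package_update: true
    packages: [build-essential, curl, wget, git]
    runcmd:
      - curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
      - dpkg -i cuda-keyring_1.1-1_all.deb
      - apt-get update && apt-get -y install cuda-drivers-535
      - curl -fsSL https://ollama.com/install.sh | sh
      - ollama pull qwen2.5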

Performance Tips:

  • Use A100 for the best price-per-token if you have >2 QPS steady load.
  • Pin the process to the GPU with CUDA_VISIBLE_DEVICES=0 on multi-GPU VMs (see the sketch after this list).
  • Enable mmap offload (PARAMETER mmap true) for faster cold starts.
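
  For the GPU-pinning tip above, one way to set the variable for the native service is a systemd override (sketch; systemctl edit opens a drop-in file for the unit):

  sudo systemctl edit ollama
  # add the following lines to the override, then save:
  # [Service]
  # Environment=CUDA_VISIBLE_DEVICES=0
  sudo systemctl restart ollama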

❗️ Important:

The config file searchai-config.yml must be updated so that searchblox-llm points to the new Azure-based Ollama service. The file is located at /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml

searchblox-llm: http://<VM-PUBLIC-IP>:11434/
llm-platform: "ollama"
searchai-agents-server:
num-thread:

models:
  chat: "qwen2.5"
  document-enrichment: "qwen2.5"
  smart-faq: "qwen2.5"
  searchai-assist-text: "qwen2.5"
  searchai-assist-image: "llama3.2-vision"

cache-settings:
  use-cache: true
  fact-score-threshold: 40

prompts:
  standalone-question: |
    Given the conversation history and a follow-up question, rephrase the follow-up question to be a standalone question that includes all necessary context.
...