Installing Ollama on Azure VM
Steps to deploy Ollama on Azure VM:
- Choose a VM: pick an Azure N-series size (e.g. NC-series) with an NVIDIA GPU.
- Install drivers: install the NVIDIA driver and CUDA toolkit (plus NCCL for multi-GPU setups).
- Install Ollama: use the native binary install script or Docker.
- Download the model: pull it into the Ollama models directory.
- Run Ollama: launch it as a server with the model loaded.
- Expose the endpoint: put Nginx in front and open the required ports.
- Smoke-test the endpoint.
- (Recommended) Secure & scale.
- Update the SearchBlox config to point LLM inference at the new external endpoint.
This guide covers Ollama 0.21+ with full NVIDIA GPU acceleration on an Azure Linux VM, exposing an HTTP endpoint that serves Qwen 2.5.
Detailed Guide to Installing Ollama on an Azure VM
1. Azure VM & images

| Scenario | Recommended VM sizes | Notes |
| --- | --- | --- |
| Dev / PoC (<15 tokens/s) | Standard_NC4as_T4_v3 (T4 16 GB) | Cheapest N-series with CUDA 12 support |
| Prod (7B model, ~50 tokens/s) | Standard_NC6s_v3 (V100 16 GB) or Standard_NC8ads_A100_v4 (A100 40 GB) | A100 gives the best perf/$ |
| Heavy-load / multiple models | Standard_NC24ads_A100_v4 (4×A100 40 GB) | Use GPU partitions or run 4 Ollama instances |
Image: Ubuntu 22.04 LTS (Canonical marketplace); enable Accelerated Networking.
2. Open the service port

Add an inbound rule in the VM's Network Security Group for the following:

Protocol: TCP | Port: 11434 | Source: Your-IP/Load-Balancer

11434 is Ollama's default port; change it later if desired.
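If you manage the NSG from the CLI, a rule along these lines opens the port. This is only a sketch; the resource group name, NSG name, and source prefix are placeholders you must replace with your own values.

```bash
# Hypothetical names (rg-ollama, ollama-vmNSG); substitute your resource group, NSG and source IP
az network nsg rule create \
  --resource-group rg-ollama \
  --nsg-name ollama-vmNSG \
  --name allow-ollama \
  --priority 1010 \
  --direction Inbound --access Allow --protocol Tcp \
  --destination-port-ranges 11434 \
  --source-address-prefixes <Your-IP>/32
```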
3. SSH in & install NVIDIA drivers + CUDA

```bash
# ── prerequisites
sudo apt-get update && sudo apt-get install -y build-essential git wget curl

# ── NVIDIA driver & CUDA 12 (works with V100 / T4 / A100)
curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-drivers-535   # or newer

# ── reboot to load nvidia modules
sudo reboot
```
After reboot:

```bash
nvidia-smi       # should list your GPU (or check `dkms status`)
nvcc --version   # confirms CUDA install
```
4. Install Ollama with cuBLAS support

There are two options for this step.

Option A – Native binary (simplest)

```bash
curl https://ollama.ai/install.sh | sh   # script auto-detects CUDA & builds the cuBLAS kernel
```
If you do not see `GPU support: enabled (cuBLAS)` in the install logs, remove `/usr/local/bin/ollama`, install the CUDA toolkit, and re-run the script.
Option B – Docker (handy for multiple instances)

```bash
sudo apt-get -y install docker.io
sudo systemctl enable --now docker
sudo docker run --gpus all \
  -d -p 11434:11434 \
  -v ollama:/root/.ollama \
  --name ollama \
  ollama/ollama:latest
```

NOTE: The Docker image already contains a cuBLAS-built llama.cpp.
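One assumption worth calling out: `--gpus all` only works when the NVIDIA Container Toolkit is installed on the host. A minimal sketch, assuming the NVIDIA container-toolkit apt repository has already been added (see NVIDIA's Ubuntu instructions for the repository setup):

```bash
# Assumes the NVIDIA container-toolkit apt repository is already configured
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker   # registers the nvidia runtime with Docker
sudo systemctl restart docker
```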
5. Fetch the Qwen 2.5 model

Ollama looks for models in `/usr/share/ollama/.ollama/models` (native) or the mounted volume (Docker). Check the public registry:
```bash
ollama pull qwen2.5   # succeeds if an official GGUF exists
```
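To confirm the pull worked, you can list the local models and run a quick prompt (standard Ollama CLI commands):

```bash
ollama list                    # qwen2.5 should appear in the local model list
ollama run qwen2.5 "Say hi"    # loads the model and returns a short generation
```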
6. Run Ollama as a service

```bash
sudo tee /etc/systemd/system/ollama.service <<'EOF'
[Unit]
Description=Ollama LLM Server
After=network.target

[Service]
User=ubuntu
ExecStart=/usr/local/bin/ollama serve
Restart=always
Environment=OLLAMA_MODELS=/home/ubuntu/.ollama/models

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now ollama
```
(If Docker is used, run the container with `--restart unless-stopped` instead.)
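Once the unit is running, verify it and, if the endpoint must be reachable from outside the VM, let Ollama bind to all interfaces via the `OLLAMA_HOST` environment variable (by default it listens only on localhost):

```bash
sudo systemctl status ollama   # should show "active (running)"
journalctl -u ollama -f        # follow the server logs

# To accept requests from outside the VM, add
#   Environment=OLLAMA_HOST=0.0.0.0:11434
# to the [Service] section of the unit above, then:
sudo systemctl daemon-reload && sudo systemctl restart ollama
```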
7. Smoke-test the endpoint

```bash
curl http://<VM-PUBLIC-IP>:11434/api/generate \
  -d '{"model":"qwen2.5","prompt":"Hello from Azure!"}'
```

The response is a JSON stream of generated tokens.
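If you prefer a single JSON object instead of a token stream, the generate API also accepts a `stream` flag:

```bash
curl http://<VM-PUBLIC-IP>:11434/api/generate \
  -d '{"model":"qwen2.5","prompt":"Hello from Azure!","stream":false}'
```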
8. (Recommended) Secure & scale

| Task | Quick pointer |
| --- | --- |
| TLS | Terminate TLS with Nginx (`proxy_pass http://localhost:11434`) and use a free cert from Let's Encrypt |
| Auth | Put an API gateway (Azure API Management or Nginx `auth_request`) in front; Ollama itself has no auth today |
| Autoscale | Front the VM with an Azure VMSS or use Azure Container Apps + GPU node pools |
| Monitoring | Enable `nvidia-dcgm-exporter` + Prometheus for GPU metrics; APM via OpenTelemetry |
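The TLS row above might look like the following in practice. This is only a sketch: the hostname `ollama.example.com` is a placeholder, and it assumes certbot / Let's Encrypt has already issued the certificate at the usual paths.

```bash
# Hypothetical site file; replace the hostname and certificate paths with your own
sudo tee /etc/nginx/sites-available/ollama <<'EOF'
server {
    listen 443 ssl;
    server_name ollama.example.com;

    ssl_certificate     /etc/letsencrypt/live/ollama.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ollama.example.com/privkey.pem;

    location / {
        proxy_pass http://localhost:11434;
        proxy_read_timeout 300s;   # token streaming can keep the connection open for a while
    }
}
EOF
sudo ln -s /etc/nginx/sites-available/ollama /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
```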
One-liner cheat-sheet

```bash
# Dev box in one go (native):
az vm create -g rg-ollama -n ollama-dev \
  --image Canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
  --size Standard_NC4as_T4_v3 --public-ip-sku Standard \
  --admin-username ubuntu --generate-ssh-keys \
  --custom-data cloud_init_ollama.yaml
```
(Populate `cloud_init_ollama.yaml` with the script from sections 3–6.)
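A skeletal `cloud_init_ollama.yaml` might look like the sketch below. It is only an outline: fill `runcmd` with the exact commands from sections 3–6, and note that the NVIDIA driver install from section 3 requires a reboot, which you must handle with a first-boot script or by re-running the remaining steps afterwards.

```yaml
#cloud-config
# Sketch only - adapt runcmd to the exact commands from sections 3-6
package_update: true
packages:
  - build-essential
  - git
  - curl
runcmd:
  - curl https://ollama.ai/install.sh | sh     # section 4, option A
  - systemctl enable --now ollama              # section 6
  - ollama pull qwen2.5                        # section 5 (needs the server running)
```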
Performance Tips:
- Use an A100 for the best price-per-token if you have >2 QPS steady load.
- Pin the process to a single GPU with `CUDA_VISIBLE_DEVICES=0` on multi-GPU VMs (see the sketch after this list).
- Enable `mmap` offload (`PARAMETER mmap true`) for faster cold starts.
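One way to apply the GPU-pinning tip to the systemd setup from section 6; a sketch using a standard systemd override (`CUDA_VISIBLE_DEVICES` is the standard CUDA variable, not an Ollama-specific setting):

```bash
sudo systemctl edit ollama
# In the editor that opens, add:
#   [Service]
#   Environment=CUDA_VISIBLE_DEVICES=0
# then save, exit and restart the service:
sudo systemctl restart ollama
nvidia-smi   # the ollama process should now appear only on GPU 0
```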
Important: The config file `searchai-config.yml` must be updated so that `searchblox-llm` points to the new Azure-based Ollama service. The config file is located at `/opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml`:
```yaml
searchblox-llm: http://localhost:11434/
llm-platform: "ollama"
searchai-agents-server:
num-thread:
models:
  chat: "qwen2.5"
  document-enrichment: "qwen2.5"
  smart-faq: "qwen2.5"
  searchai-assist-text: "qwen2.5"
  searchai-assist-image: "llama3.2-vision"
cache-settings:
  use-cache: true
  fact-score-threshold: 40
prompts:
  standalone-question: |
    Given the conversation history and a follow-up question, rephrase the follow-up question
    to be a standalone question that includes all necessary context.
...
```