Installing Ollama on Azure VM
Steps to deploy Ollama on an Azure VM:
- Choose VM: Pick a GPU-enabled N-series size on Azure, such as the NC-series.
- Install drivers: Install the NVIDIA driver and CUDA.
- Install Ollama: Use the native binary or the Docker image.
- Download: Pull the model into the Ollama model directory.
- Run Ollama: Launch it as a server with the model loaded.
- Expose endpoint: Open the service port and, optionally, put Nginx in front.
- Smoke-test the endpoint
- (Recommended) Secure & scale
- Update SearchBlox config to the external endpoint for LLM inference.
This guide installs Ollama 0.21+ with full NVIDIA GPU acceleration on an Azure Linux VM and exposes an HTTP endpoint that serves Qwen 2.5.
Detailed Guide to Installing Ollama on an Azure VM
1. Azure VM & images

| Scenario | Recommended VM sizes | Notes |
|---|---|---|
| Dev / PoC (<15 tokens/s) | Standard_NC4as_T4_v3 (T4 16 GB) | Cheapest N-series with CUDA 12 support |
| Prod (7B model, ~50 tokens/s) | Standard_NC6s_v3 (V100 16 GB) or Standard_NC24ads_A100_v4 (A100 80 GB) | A100 gives the best performance per dollar |
| Heavy load / multiple models | Standard_NC96ads_A100_v4 (4×A100 80 GB) | Use GPU partitions or run 4 Ollama instances |

Image: Ubuntu 22.04 LTS (Canonical marketplace image). Enable Accelerated Networking.
2. Open the service port
Add an inbound rule to the VM's Network Security Group with the following settings:
Protocol: TCP | Port: 11434 | Source: Your-IP/Load-Balancer
11434 is Ollama's default port; change it later if desired.
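The same inbound rule can be created from the Azure CLI instead of the portal. The sketch below only prints the `az network nsg rule create` command so it can be reviewed first. The resource group `rg-ollama`, the NSG name `ollama-devNSG`, the rule name `allow-ollama`, the priority `1010`, and the source address `203.0.113.10/32` are all illustrative assumptions, not values from this guide.

```shell
#!/bin/sh
# Print (not execute) the az CLI call that opens Ollama's port to one source.
# All argument values are hypothetical placeholders supplied by the caller.
nsg_rule_cmd() {
    rg=$1; nsg=$2; source_cidr=$3; port=${4:-11434}
    printf 'az network nsg rule create --resource-group %s --nsg-name %s --name allow-ollama --priority 1010 --direction Inbound --access Allow --protocol Tcp --destination-port-ranges %s --source-address-prefixes %s\n' \
        "$rg" "$nsg" "$port" "$source_cidr"
}

# Review the printed command, then run it with: nsg_rule_cmd ... | sh
nsg_rule_cmd rg-ollama ollama-devNSG 203.0.113.10/32
```

Restricting the source to a single address or to the load balancer keeps the unauthenticated Ollama port off the open internet.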
3. SSH in & install NVIDIA drivers + CUDA

    # Prerequisites
    sudo apt-get update && sudo apt-get install -y build-essential git wget curl

    # NVIDIA driver & CUDA 12 repository (works with V100 / T4 / A100)
    curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt-get update
    sudo apt-get -y install cuda-drivers-535   # or newer

    # Reboot to load the NVIDIA kernel modules
    sudo reboot

After the reboot, verify the installation:

    nvidia-smi       # should list your GPU
    dkms status      # should show the nvidia module as installed
    nvcc --version   # reports the CUDA version; available only if the CUDA toolkit is installed
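Beyond confirming that a GPU is listed, it can help to check that it has enough memory for the model you plan to serve. The following is a minimal sketch that filters the CSV output of `nvidia-smi --query-gpu`. The function name `gpus_with_memory` and the 8000 MiB threshold are assumptions chosen for illustration.

```shell
#!/bin/sh
# Read "name, memory.total" CSV lines (MiB, no units) on stdin and print
# the GPUs that have at least $1 MiB; exit non-zero if none qualify.
gpus_with_memory() {
    awk -F', ' -v need="$1" '
        $2 + 0 >= need { print $1 " (" $2 " MiB)"; found = 1 }
        END { exit found ? 0 : 1 }'
}

# On the VM, feed it real data from the driver:
#   nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits \
#       | gpus_with_memory 8000
# Offline demonstration with a sample line:
printf 'Tesla T4, 15360\n' | gpus_with_memory 8000
```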
4. Install Ollama with CUDA support
There are two options for this step:

- Option A: Native binary (simplest)

    curl -fsSL https://ollama.com/install.sh | sh   # the script auto-detects the NVIDIA GPU

If you do not see "GPU support: enabled (cuBLAS)" in the install logs, remove /usr/local/bin/ollama, install the CUDA toolkit, and re-run the script.

- Option B: Docker (handy for multiple instances)

    sudo apt-get -y install docker.io
    sudo systemctl enable --now docker
    sudo docker run --gpus all \
      -d -p 11434:11434 \
      -v ollama:/root/.ollama \
      --name ollama \
      ollama/ollama:latest

NOTE: The Docker image already contains a cuBLAS-built llama.cpp. The --gpus all flag requires the NVIDIA Container Toolkit to be installed on the host.
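Since this guide targets Ollama 0.21+, a version check after installation can catch an outdated binary early. The sketch below compares version strings with `sort -V`. The helper name `version_ge` is hypothetical, and the exact wording of the `ollama --version` output is an assumption, which is why only the first dotted number would be extracted from it.

```shell
#!/bin/sh
# Return success when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (GNU coreutils, present on Ubuntu).
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# On the VM, take the first dotted number from the CLI output:
#   installed=$(ollama --version | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1)
installed=0.21.0   # sample value for an offline run
if version_ge "$installed" 0.21; then
    echo "Ollama $installed satisfies the 0.21+ requirement"
else
    echo "Ollama $installed is older than 0.21" >&2
fi
```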
5. Fetch the Qwen 2.5 model
Before pulling the model, note that Ollama stores downloaded model files in different locations depending on how it was installed. For the native binary installation, models are stored at /usr/share/ollama/.ollama/models. For the Docker installation, models are stored in the mounted volume you defined when running the container. Make sure the storage location has sufficient disk space before downloading.

- Check the public registry

    ollama pull qwen2.5   # succeeds if an official GGUF exists
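The disk-space advice above can be turned into a quick pre-flight check. This is a minimal sketch using POSIX `df`. The helper name `has_free_gib` and the 10 GiB threshold are assumptions, so size the threshold to the model you actually pull.

```shell
#!/bin/sh
# Succeed when the filesystem holding directory $1 has at least $2 GiB free.
has_free_gib() {
    # df -P prints POSIX-format output; -k reports 1024-byte blocks.
    free_kib=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$free_kib" -ge $(( $2 * 1024 * 1024 )) ]
}

# MODEL_DIR defaults to the native-install path; 10 GiB is illustrative.
MODEL_DIR=${MODEL_DIR:-/usr/share/ollama/.ollama/models}
if [ -d "$MODEL_DIR" ] && has_free_gib "$MODEL_DIR" 10; then
    echo "enough space in $MODEL_DIR; safe to run: ollama pull qwen2.5"
else
    echo "missing directory or less than 10 GiB free in $MODEL_DIR" >&2
fi
```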
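6. Run Ollama as a service

    sudo tee /etc/systemd/system/ollama.service <<'EOF'
    [Unit]
    Description=Ollama LLM Server
    After=network.target

    [Service]
    User=ubuntu
    ExecStart=/usr/local/bin/ollama serve
    Restart=always
    Environment=OLLAMA_MODELS=/home/ubuntu/.ollama/models
    # Listen on all interfaces; by default Ollama binds to 127.0.0.1 only
    Environment=OLLAMA_HOST=0.0.0.0:11434

    [Install]
    WantedBy=multi-user.target
    EOF
    sudo systemctl daemon-reload
    sudo systemctl enable --now ollama

The server loads models from the OLLAMA_MODELS directory, so either pull the model as the same user or point OLLAMA_MODELS at the directory that already holds it. (If Docker is used, start the container with --restart unless-stopped instead.)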
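After the service starts, the API may take a few seconds to come up. One way to wait for it is to poll the `/api/tags` endpoint, which lists local models. The sketch below assumes `curl` is installed; the function name `wait_for_ollama` and the default timeout are illustrative choices.

```shell
#!/bin/sh
# Poll the Ollama HTTP API until it answers or the timeout (seconds) expires.
wait_for_ollama() {
    base_url=$1; timeout=${2:-60}; waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if curl -fsS --max-time 2 "$base_url/api/tags" >/dev/null 2>&1; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Typical use on the VM right after `systemctl enable --now ollama`:
#   wait_for_ollama http://localhost:11434 60 && echo "Ollama is up"
```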
7. Smoke-test the endpoint

    curl http://<VM-PUBLIC-IP>:11434/api/generate \
      -d '{"model":"qwen2.5","prompt":"Hello from Azure!"}'

The command returns a streamed JSON response containing the generated text tokens.
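For scripted checks, a single JSON object is often easier to handle than a token stream. The `/api/generate` endpoint accepts `"stream": false` for that purpose. The sketch below builds such a request body; the helper names `json_escape` and `generate_payload` and the `OLLAMA_URL` variable are hypothetical, and the escaping covers only backslashes and double quotes.

```shell
#!/bin/sh
# Escape backslashes and double quotes so the value is safe inside a JSON string.
# Control characters such as newlines are NOT handled in this sketch.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build an /api/generate request body that asks for one non-streamed reply.
generate_payload() {
    printf '{"model":"%s","prompt":"%s","stream":false}\n' \
        "$(json_escape "$1")" "$(json_escape "$2")"
}

# On a machine that can reach the VM (OLLAMA_URL is a placeholder):
#   generate_payload qwen2.5 'Hello from Azure!' \
#       | curl -fsS "$OLLAMA_URL/api/generate" -d @-
generate_payload qwen2.5 'Say "hi" in one word'
```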
8. (Recommended) Secure & scale

| Task | Quick pointer |
|---|---|
| TLS | Terminate TLS with Nginx (proxy_pass http://localhost:11434) and use a free certificate from Let's Encrypt |
| Auth | Put an API gateway (Azure API Management or Nginx auth-request) in front; Ollama itself has no auth today |
| Autoscale | Front the VM with an Azure VMSS or use Azure Container Apps + GPU nodepools |
| Monitoring | Enable nvidia-dcgm-exporter + Prometheus for GPU metrics; APM via OpenTelemetry |

One-liner cheat-sheet

    # Dev box in one go (native):
    az vm create -g rg-ollama -n ollama-dev \
      --image Canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
      --size Standard_NC4as_T4_v3 --public-ip-sku Standard \
      --admin-username ubuntu --generate-ssh-keys \
      --custom-data cloud_init_ollama.yaml

(Populate cloud_init_ollama.yaml with the script from sections 3–6.)
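As a starting point for `cloud_init_ollama.yaml`, the sketch below writes a minimal cloud-init file that covers only the driver and Ollama installation, then reboots. The package list, step order, and the decision to leave out the model pull and service unit are assumptions; extend it with the remaining steps from this guide before relying on it.

```shell
#!/bin/sh
# Write a minimal cloud-init file covering the driver and Ollama install steps.
# Contents are a hedged starting point, not a complete provisioning script.
cat > cloud_init_ollama.yaml <<'EOF'
#cloud-config
package_update: true
packages:
  - build-essential
  - curl
runcmd:
  - 'cd /tmp && curl -fsSLO https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && dpkg -i cuda-keyring_1.1-1_all.deb'
  - 'apt-get update'
  - 'DEBIAN_FRONTEND=noninteractive apt-get -y install cuda-drivers-535'
  - 'curl -fsSL https://ollama.com/install.sh | sh'
# Reboot once at the end so the NVIDIA kernel modules load.
power_state:
  mode: reboot
  message: Rebooting to load NVIDIA modules
EOF
echo "wrote cloud_init_ollama.yaml"
```

Commands under `runcmd` run as root on first boot, so `sudo` is not needed there.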
Performance Tips:
- Use A100 for the best price-per-token if you have a steady load above 2 QPS.
- Pin the process to the GPU with CUDA_VISIBLE_DEVICES=0 in multi-GPU VMs.
- Enable mmap offload (PARAMETER use_mmap true) for faster cold starts.
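On a multi-GPU VM, the pinning tip can be combined with one Ollama instance per GPU. The sketch below prints one launch line per GPU, each with its own `CUDA_VISIBLE_DEVICES` and `OLLAMA_HOST` port. The function name `ollama_launch_lines` and the port layout (base port plus GPU index) are assumptions, and a load balancer in front of the ports is left out.

```shell
#!/bin/sh
# Print one launch line per GPU: instance i is pinned to GPU i and listens on
# base_port + i. Printing instead of executing keeps the sketch side-effect free.
ollama_launch_lines() {
    gpu_count=$1; base_port=${2:-11434}; i=0
    while [ "$i" -lt "$gpu_count" ]; do
        printf 'CUDA_VISIBLE_DEVICES=%s OLLAMA_HOST=127.0.0.1:%s ollama serve &\n' \
            "$i" "$((base_port + i))"
        i=$((i + 1))
    done
}

ollama_launch_lines 4
```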
Important:
The config file searchai-config.yml must be updated so that searchblox-llm points to the new Azure-based Ollama service. The config file is located at /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml

    searchblox-llm: http://<VM-PUBLIC-IP>:11434/   # address of the Azure VM running Ollama
    llm-platform: "ollama"
    searchai-agents-server:
    num-thread:
    models:
      chat: "qwen2.5"
      document-enrichment: "qwen2.5"
      smart-faq: "qwen2.5"
      searchai-assist-text: "qwen2.5"
      searchai-assist-image: "llama3.2-vision"
    cache-settings:
      use-cache: true
      fact-score-threshold: 40
    prompts:
      standalone-question: |
        Given the conversation history and a follow-up question, rephrase the follow-up question to be a standalone question that includes all necessary context.
        ...
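The endpoint change can also be scripted. The sketch below rewrites the `searchblox-llm` line with `sed` and keeps a `.bak` copy. It assumes the key starts at the beginning of a line and that the URL contains no `|` or `&` characters; the function name `set_llm_endpoint` and the example address are hypothetical.

```shell
#!/bin/sh
# Point the searchblox-llm key in a config file at a new endpoint.
# Keeps a .bak copy next to the file that is edited.
set_llm_endpoint() {
    config=$1; url=$2
    sed -i.bak "s|^searchblox-llm:.*|searchblox-llm: $url|" "$config"
}

# Offline demonstration on a scratch copy (the real file lives under
# /opt/searchblox/webapps/ROOT/WEB-INF/ and usually needs sudo):
demo=$(mktemp)
printf 'searchblox-llm: http://localhost:11434/\nllm-platform: "ollama"\n' > "$demo"
set_llm_endpoint "$demo" "http://203.0.113.10:11434/"
cat "$demo"
```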
