Installing Ollama on Azure VM

Steps to deploy Ollama on Azure VM:

  1. Choose a VM: pick a GPU-enabled N-series size on Azure, such as the NC-series.
  2. Install drivers: install the NVIDIA drivers, CUDA, and NCCL.
  3. Install Ollama: use the native binary or Docker.
  4. Download the model: pull the model into the models directory.
  5. Run Ollama: launch it as a server with the model loaded.
  6. Expose the endpoint: put Nginx in front and configure the ports.
  7. Smoke-test the endpoint
  8. (Recommended) Secure & scale
  9. Update the SearchBlox config to point to the external endpoint for LLM inference.

The goal is to run Ollama 0.21+ with full NVIDIA-GPU acceleration on an Azure Linux VM and expose an HTTP endpoint that serves Qwen 2.5.

Detailed Guide to Installing Ollama on an Azure VM

  1. Azure VM & images


    Scenario                       | Recommended VM sizes                                                    | Notes
    Dev / PoC (<15 tokens/s)       | Standard_NC4as_T4_v3 (T4 16 GB)                                         | Cheapest N-series with CUDA 12 support
    Prod (7B model, ~50 tokens/s)  | Standard_NC6s_v3 (V100 16 GB) or Standard_NC8ads_A100_v4 (A100 40 GB)   | A100 gives the best perf/$$
    Heavy-load / multiple models   | Standard_NC24ads_A100_v4 (4×A100 40 GB)                                 | Use GPU partitions or run 4 Ollama instances

    Image: Ubuntu 22.04 LTS (Canonical marketplace), enable Accelerated Networking.
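
    Before creating the VM, it can help to confirm which GPU sizes are actually available (and not capacity-restricted) in your region. A quick check with the Azure CLI, assuming eastus as the target region:

    az vm list-skus --location eastus --size Standard_NC --output table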

  2. Open the service port

    Add an inbound rule in the VM’s Network Security Group with the following settings:

    Protocol: TCP | Port: 11434 | Source: Your-IP/Load-Balancer
    

    11434 is Ollama’s default; change later if desired.
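
    If you prefer the Azure CLI, the same rule can be created like this (the resource group, NSG name, and source IP below are placeholders for your own values):

    az network nsg rule create \
      --resource-group rg-ollama \
      --nsg-name ollama-dev-nsg \
      --name allow-ollama \
      --priority 1010 \
      --direction Inbound --access Allow --protocol Tcp \
      --destination-port-ranges 11434 \
      --source-address-prefixes <Your-IP>/32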

  3. SSH in & install NVIDIA drivers + CUDA

    # ── prerequisites
    sudo apt-get update && sudo apt-get install -y build-essential git wget curl
    
    # ── NVIDIA driver & CUDA 12 (works with V100 / T4 / A100)
    curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt-get update
    sudo apt-get -y install cuda-drivers-535  # or newer
    
    # ── reboot to load nvidia modules
    sudo reboot
    

    After reboot

    nvidia-smi            # should list your GPU (dkms status also shows the driver module)
    nvcc --version        # confirms the CUDA toolkit install
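
    Note that cuda-drivers-535 installs only the driver. If nvcc is not found, the toolkit can be added from the same repository (package name as published in NVIDIA's Ubuntu 22.04 repo):

    sudo apt-get -y install cuda-toolkit   # provides nvcc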
    
  4. Install Ollama with cuBLAS support

    There are two options for this step:

    • Option A – Native binary (simplest)

      curl -fsSL https://ollama.com/install.sh | sh   # script auto-detects CUDA and enables GPU (cuBLAS) support
      

      If the install logs do not show "GPU support: enabled (cuBLAS)", remove /usr/local/bin/ollama, install the CUDA toolkit, and re-run the script.
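
      To double-check GPU detection after a native install, the service logs can be inspected (the install script registers an ollama systemd unit; the exact log wording varies by version):

      journalctl -u ollama -n 100 --no-pager | grep -iE 'cuda|gpu'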

    • Option B – Docker (handy for multiple instances)

      sudo apt-get -y install docker.io
      sudo systemctl enable --now docker
      
      sudo docker run --gpus all \
        -d -p 11434:11434 \
        -v ollama:/root/.ollama \
        --name ollama \
        ollama/ollama:latest
      

      NOTE: Docker image already contains a cuBLAS-built llama.cpp.
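
      For --gpus all to work, the host also needs the NVIDIA Container Toolkit. A minimal sketch following NVIDIA's published apt instructions (repository URLs as documented by NVIDIA):

      curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
        sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
      curl -sL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
      sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
      sudo nvidia-ctk runtime configure --runtime=docker
      sudo systemctl restart docker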

  5. Fetch the Qwen 2.5 model

    Ollama looks for models in /usr/share/ollama/.ollama/models (native) or the mounted volume (Docker).

    • Check the public registry
      ollama pull qwen2.5   # succeeds if an official GGUF exists
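
    • Verify the download: ollama list and ollama run are standard CLI checks (with Docker, prefix each command with sudo docker exec -it ollama)
      ollama list                  # the model should be listed with its size
      ollama run qwen2.5 "Hello"   # quick local generation test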
      
  6. Run Ollama as a service

    sudo tee /etc/systemd/system/ollama.service <<'EOF'
    [Unit]
    Description=Ollama LLM Server
    After=network.target

    [Service]
    User=ubuntu
    ExecStart=/usr/local/bin/ollama serve
    Restart=always
    Environment=OLLAMA_MODELS=/home/ubuntu/.ollama/models

    [Install]
    WantedBy=multi-user.target
    EOF
    
    sudo systemctl daemon-reload
    sudo systemctl enable --now ollama
    

    (If using Docker instead, set the container's restart policy with --restart unless-stopped, as shown below.)
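
    A quick check that the service is up, plus the Docker equivalent of the restart policy (container name ollama from step 4):

    sudo systemctl status ollama --no-pager    # should show active (running)
    curl http://localhost:11434/api/tags       # lists the models visible to the server

    # Docker variant: keep the container running across reboots
    sudo docker update --restart unless-stopped ollama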

  7. Smoke-test the endpoint

    curl http://<VM-PUBLIC-IP>:11434/api/generate \
         -d '{"model":"qwen2.5","prompt":"Hello from Azure!"}'
    

    The response is a JSON stream containing the generated tokens.
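
    To get a single JSON object instead of a token stream, streaming can be disabled in the request body:

    curl http://<VM-PUBLIC-IP>:11434/api/generate \
         -d '{"model":"qwen2.5","prompt":"Hello from Azure!","stream":false}'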

  8. (Recommended) Secure & scale


    Task        | Quick Pointer
    TLS         | Terminate TLS with Nginx (proxy_pass http://localhost:11434) and use a free cert from Let’s Encrypt
    Auth        | Put an API gateway (Azure API Management or Nginx auth-request) in front; Ollama itself has no auth today
    Autoscale   | Front the VM with an Azure VMSS or use Azure Container Apps + GPU nodepools
    Monitoring  | Enable nvidia-dcgm-exporter + Prometheus for GPU metrics; APM via OpenTelemetry
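
    A minimal Nginx reverse-proxy sketch for the TLS row above; the hostname and certificate paths are placeholders (Certbot normally provisions them):

    server {
        listen 443 ssl;
        server_name ollama.example.com;          # placeholder hostname

        ssl_certificate     /etc/letsencrypt/live/ollama.example.com/fullchain.pem;
        ssl_certificate_key /etc/letsencrypt/live/ollama.example.com/privkey.pem;

        location / {
            proxy_pass http://localhost:11434;
            proxy_read_timeout 300s;             # generation can stream for a while
        }
    }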

    One-liner cheat-sheet

    # Dev box in one go (native):
    az vm create -g rg-ollama -n ollama-dev \
      --image Canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
      --size Standard_NC4as_T4_v3 --public-ip-sku Standard \
      --admin-username ubuntu --generate-ssh-keys \
      --custom-data cloud_init_ollama.yaml
    

    (Populate cloud_init_ollama.yaml with the script from sections 3–6.)
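
    A minimal cloud_init_ollama.yaml sketch, assuming the driver and Ollama install steps from sections 3–5 (a reboot is still required before the GPU is usable, so the model pull may need to be repeated afterwards):

    #cloud-config
    package_update: true
    packages: [build-essential, curl, wget, git]
    runcmd:
      - curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
      - dpkg -i cuda-keyring_1.1-1_all.deb
      - apt-get update && apt-get -y install cuda-drivers-535
      - curl -fsSL https://ollama.com/install.sh | sh
      - ollama pull qwen2.5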

Performance Tips:

  • Use A100 for the best price-per-token if you have >2 QPS steady load.
  • Pin the process to the GPU with CUDA_VISIBLE_DEVICES=0 on multi-GPU VMs (see the sketch after this list).
  • Enable mmap offload (PARAMETER mmap true) for faster cold starts.
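
  For the GPU-pinning tip above, one way to set the variable for the native service is a systemd override (sketch; systemctl edit opens a drop-in file for the unit):

  sudo systemctl edit ollama
  # add the following lines to the override, then save:
  # [Service]
  # Environment=CUDA_VISIBLE_DEVICES=0
  sudo systemctl restart ollama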

❗️ Important:

The config file searchai-config.yml must be updated so that searchblox-llm points to the new Azure-based Ollama service. The file is located at /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml

searchblox-llm: http://<VM-PUBLIC-IP>:11434/
llm-platform: "ollama"
searchai-agents-server:
num-thread:

models:
  chat: "qwen2.5"
  document-enrichment: "qwen2.5"
  smart-faq: "qwen2.5"
  searchai-assist-text: "qwen2.5"
  searchai-assist-image: "llama3.2-vision"

cache-settings:
  use-cache: true
  fact-score-threshold: 40

prompts:
  standalone-question: |
    Given the conversation history and a follow-up question, rephrase the follow-up question to be a standalone question that includes all necessary context.
...