Installing Ollama on Azure VM

Steps to deploy Ollama on an Azure VM:

  1. Choose VM: Pick N-series on Azure, like NC-series for GPU.
  2. Install drivers: Install NVIDIA drivers, CUDA, and NCCL.
  3. Install Ollama: Use the native binary or the Docker image.
  4. Download the model: Pull the model into Ollama’s model folder.
  5. Run Ollama: Launch it as a server with the model loaded.
  6. Expose endpoint: Open the service port; optionally front it with Nginx.
  7. Smoke-test the endpoint.
  8. (Recommended) Secure & scale.
  9. Update SearchBlox config to the external endpoint for LLM inference.

This guide installs Ollama 0.21+ with full NVIDIA GPU acceleration on an Azure Linux VM and exposes an HTTP endpoint that serves Qwen 2.5.

Detailed Guide to Install Ollama on Azure VM

  1. Azure VM & images


    Scenario | Recommended VM sizes | Notes
    Dev / PoC (<15 tokens/s) | Standard_NC4as_T4_v3 (T4 16 GB) | Cheapest N-series with CUDA 12 support
    Prod (7B model, ~50 tokens/s) | Standard_NC6s_v3 (V100 16 GB) or Standard_NC8ads_A100_v4 (A100 40 GB) | A100 gives the best performance per dollar
    Heavy-load / multiple models | Standard_NC24ads_A100_v4 (4×A100 40 GB) | Use GPU partitions or run 4 Ollama instances

    Image: Ubuntu 22.04 LTS (Canonical marketplace), enable Accelerated Networking.
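Size names, GPU configurations, and availability differ by region and change over time, so it may help to check the table against your own subscription before provisioning. The sketch below wraps `az vm list-skus`; the helper name `list_gpu_sizes` and the default region `eastus` are placeholders that do not come from this guide, and the Azure CLI is assumed to be installed and logged in.

```shell
# Hypothetical helper: list NC-family VM sizes offered in a region.
# Assumes the Azure CLI is installed and `az login` has already been run.
list_gpu_sizes() {
  local region="${1:-eastus}"   # placeholder region
  az vm list-skus \
    --location "$region" \
    --resource-type virtualMachines \
    --size Standard_NC \
    --output table
}

# Usage: list_gpu_sizes westeurope
```

Sizes that show a restriction in the output cannot be deployed in that region or subscription without a quota or access request.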

  2. Open the service port

    Add the following inbound rule to the VM’s Network Security Group:

    Protocol: TCP | Port: 11434 | Source: Your-IP/Load-Balancer
    

    11434 is Ollama’s default; change later if desired.
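The same rule can also be created from the command line. This is a minimal sketch using `az network nsg rule create`; the helper name, the rule name `allow-ollama`, and the priority `1010` are illustrative assumptions, and the resource group, NSG name, and source address must match your own deployment.

```shell
# Hypothetical helper: allow inbound TCP 11434 from a single source address.
# Rule name and priority are placeholders; pick values that fit your NSG.
open_ollama_port() {
  local rg="$1" nsg="$2" source_ip="$3"
  az network nsg rule create \
    --resource-group "$rg" \
    --nsg-name "$nsg" \
    --name allow-ollama \
    --priority 1010 \
    --direction Inbound \
    --access Allow \
    --protocol Tcp \
    --destination-port-ranges 11434 \
    --source-address-prefixes "$source_ip"
}

# Usage: open_ollama_port rg-ollama ollama-devNSG 203.0.113.10
```

Restricting the source to a known address or load balancer matters here because Ollama itself does not authenticate requests.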

  3. SSH in & install NVIDIA drivers + CUDA

    # ── prerequisites
    sudo apt-get update && sudo apt-get install -y build-essential git wget curl
    
    # ── NVIDIA driver & CUDA 12 (works with V100 / T4 / A100)
    curl -O https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt-get update
    sudo apt-get -y install cuda-drivers-535  # or newer
    
    # ── reboot to load nvidia modules
    sudo reboot
    

    After reboot

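    nvidia-smi      # should list your GPU and the driver version
    dkms status     # alternatively, shows the nvidia kernel module if it was built through DKMS
    nvcc --version  # reports the CUDA toolkit version; only present if the toolkit is installed (cuda-drivers alone does not provide nvcc)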
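For unattended setups, the post-reboot check can be scripted rather than read by eye. The following is a sketch, not part of the original guide: the helper names are hypothetical, and the minimum major version of 535 simply mirrors the package installed above.

```shell
# Extract the major version from a driver string such as "535.104.05".
driver_major() {
  printf '%s\n' "${1%%.*}"
}

# Succeed if <version> is non-empty and its major part is >= <min-major>.
driver_at_least() {
  [ -n "$1" ] && [ "$(driver_major "$1")" -ge "$2" ]
}

# Query the first GPU's driver version and compare it with a minimum.
# Assumes nvidia-smi is on PATH, i.e. the driver install succeeded.
check_driver() {
  local version
  version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if driver_at_least "$version" "${1:-535}"; then
    echo "driver $version OK"
  else
    echo "driver '$version' is missing or older than ${1:-535}" >&2
    return 1
  fi
}

# Usage: check_driver 535
```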
    
  4. Install Ollama with CUDA support

    There are two options for this step:

    • Option A – Native binary (simplest)

      curl -fsSL https://ollama.com/install.sh | sh   # -L follows redirects; the script detects the NVIDIA GPU and installs a GPU-enabled build
      

      If you do not see GPU support: enabled (cuBLAS) in the install logs, remove /usr/local/bin/ollama, install the CUDA toolkit, and re-run the script.

    • Option B – Docker (handy for multiple instances)

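      sudo apt-get -y install docker.io
      sudo systemctl enable --now docker
      # --gpus all below also requires the NVIDIA Container Toolkit (nvidia-container-toolkit) to be installed and registered with Docker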
      
      sudo docker run --gpus all \
        -d -p 11434:11434 \
        -v ollama:/root/.ollama \
        --name ollama \
        ollama/ollama:latest
      

      NOTE: The Docker image already contains a cuBLAS-built llama.cpp.

  5. Fetch the Qwen 2.5 model

    Before pulling the model, note that Ollama stores downloaded model files in different locations depending on how it was installed. For the native binary installation, models are stored at /usr/share/ollama/.ollama/models unless OLLAMA_MODELS points elsewhere. For the Docker installation, models are stored in the volume you mounted when running the container. Make sure the storage location has sufficient free disk space before downloading.

    • Check the public registry
      ollama pull qwen2.5   # succeeds if an official GGUF exists
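The disk-space advice above can be turned into a guard that runs before the pull. This is a minimal sketch: `has_free_gib` is a hypothetical helper, and the 10 GiB threshold in the usage line is an assumed safety margin rather than a measured model size.

```shell
# Succeed if the filesystem holding <dir> has at least <need-gib> GiB free.
has_free_gib() {
  local dir="$1" need_gib="$2" avail_kib
  # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is "Available".
  avail_kib="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
  [ "$avail_kib" -ge $(( need_gib * 1024 * 1024 )) ]
}

# Usage (native install path from the paragraph above):
#   has_free_gib /usr/share/ollama/.ollama/models 10 && ollama pull qwen2.5
```

For the Docker option, the same check would target the directory backing the mounted volume on the host.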
      
  6. Run Ollama as a service

    sudo tee /etc/systemd/system/ollama.service <<'EOF'
    [Unit]
    Description=Ollama LLM Server
    After=network.target
    
    [Service]
    User=ubuntu
    ExecStart=/usr/local/bin/ollama serve
    Restart=always
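    Environment=OLLAMA_MODELS=/home/ubuntu/.ollama/models
    # Ollama binds to 127.0.0.1 by default; listen on all interfaces so the port opened in step 2 is reachable
    Environment=OLLAMA_HOST=0.0.0.0:11434

    [Install]
    WantedBy=multi-user.target
    EOF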
    
    sudo systemctl daemon-reload
    sudo systemctl enable --now ollama
    

    (If Docker is used, start the container with --restart unless-stopped so it comes back up after a reboot.)
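Once the service or container has been started, a short readiness check can confirm that the API is answering before moving on. The sketch below polls Ollama's `/api/version` endpoint; the helper name `wait_for_ollama` and the default of 30 one-second attempts are assumptions chosen for illustration.

```shell
# Poll the version endpoint until it answers or the attempts run out.
wait_for_ollama() {
  local base_url="${1:-http://localhost:11434}" attempts="${2:-30}" i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS "$base_url/api/version" >/dev/null 2>&1; then
      echo "Ollama is up at $base_url"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "Ollama did not respond at $base_url" >&2
  return 1
}

# Usage: wait_for_ollama http://localhost:11434 60
```

If the check times out on the native install, `journalctl -u ollama` is usually the first place to look.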

  7. Smoke-test the endpoint

    curl http://<VM-PUBLIC-IP>:11434/api/generate \
         -d '{"model":"qwen2.5","prompt":"Hello from Azure!"}'
    

    The command returns a stream of JSON objects, one per line, each containing a fragment of the generated text.
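When a single JSON object is easier to inspect than a stream, the generate API accepts `"stream": false`. The helpers below are a hypothetical sketch: `generate_payload` builds the request body and `ask_ollama` sends it, with the host argument standing in for the VM's address.

```shell
# Build a non-streaming /api/generate request body.
# Note: this naive builder does not escape quotes or backslashes in the prompt.
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# Send the request; <host> is a placeholder for the VM's public IP or DNS name.
ask_ollama() {
  local host="$1" model="$2" prompt="$3"
  curl -fsS "http://$host:11434/api/generate" \
    -H 'Content-Type: application/json' \
    -d "$(generate_payload "$model" "$prompt")"
}

# Usage: ask_ollama <VM-PUBLIC-IP> qwen2.5 "Hello from Azure!"
```

The reply is one JSON object whose `response` field holds the full generated text.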

  8. (Recommended) Secure & scale


    Task | Quick pointer
    TLS | Terminate TLS with Nginx (proxy_pass http://localhost:11434) and use a free cert from Let’s Encrypt
    Auth | Put an API gateway (Azure API Management or Nginx auth-request) in front; Ollama itself has no auth today
    Autoscale | Front the VM with an Azure VMSS or use Azure Container Apps + GPU nodepools
    Monitoring | Enable nvidia-dcgm-exporter + Prometheus for GPU metrics; APM via OpenTelemetry
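The TLS row can be sketched as a small Nginx server block written from the shell. Everything specific here is an assumption: the helper name `write_nginx_conf`, the example domain, and the certificate paths (which follow certbot's usual `/etc/letsencrypt/live/<domain>/` layout). The `Host` header is set to `localhost:11434` because Ollama may reject other host names when it is bound to the loopback address.

```shell
# Write a minimal TLS reverse-proxy server block to <out-path> for <domain>.
# Certificate paths assume certbot's default layout; adjust if yours differ.
write_nginx_conf() {
  local out="$1" domain="$2"
  cat > "$out" <<EOF
server {
    listen 443 ssl;
    server_name $domain;

    ssl_certificate     /etc/letsencrypt/live/$domain/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/$domain/privkey.pem;

    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host localhost:11434;
        proxy_buffering off;       # pass streamed tokens through as they arrive
        proxy_read_timeout 300s;   # allow long generations
    }
}
EOF
}

# Usage: write_nginx_conf ./ollama.conf llm.example.com
```

After copying the file into Nginx's configuration directory, `sudo nginx -t` validates it before a reload. With the proxy in place, the NSG rule can expose 443 instead of 11434.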

    One-liner cheat-sheet

    # Dev box in one go (native):
    az vm create -g rg-ollama -n ollama-dev \
      --image Canonical:0001-com-ubuntu-server-jammy:22_04-lts:latest \
      --size Standard_NC4as_T4_v3 --public-ip-sku Standard \
      --admin-username ubuntu --generate-ssh-keys \
      --custom-data cloud_init_ollama.yaml
    

    (Populate cloud_init_ollama.yaml with the script from sections 3–6.)
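As a starting point, the file could look roughly like the sketch below, which covers only the driver and native Ollama install and leaves the model pull and service customization to be added. It assumes cloud-init's standard `package_update`, `packages`, `runcmd`, and `power_state` keys and has not been validated on every VM size.

```shell
# Write a partial cloud-init file: driver install, Ollama install, then reboot.
cat > cloud_init_ollama.yaml <<'EOF'
#cloud-config
package_update: true
packages:
  - build-essential
  - curl
runcmd:
  - curl -fsSLO https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
  - dpkg -i cuda-keyring_1.1-1_all.deb
  - apt-get update
  - DEBIAN_FRONTEND=noninteractive apt-get -y install cuda-drivers-535
  - curl -fsSL https://ollama.com/install.sh | sh
power_state:
  mode: reboot
  message: Rebooting to load NVIDIA kernel modules
  condition: true
EOF
```

cloud-init runs these commands as root on first boot, so `sudo` is not needed inside `runcmd`.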

Performance Tips:

  • Use A100 for the best price-per-token if you have >2 QPS steady load.
  • Pin the process to the GPU with CUDA_VISIBLE_DEVICES=0 in multi-GPU VMs.
  • Enable memory-mapped model loading (the use_mmap option) for faster cold starts.
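For the native service, GPU pinning can be applied through a systemd drop-in rather than by editing the unit file. This is a sketch under assumptions: `write_gpu_dropin` and the file name `gpu.conf` are hypothetical, and the target directory for a real system would be /etc/systemd/system/ollama.service.d.

```shell
# Write a systemd drop-in into <dir> that pins the service to one GPU index.
write_gpu_dropin() {
  local dir="$1" gpu_index="${2:-0}"
  mkdir -p "$dir"
  cat > "$dir/gpu.conf" <<EOF
[Service]
Environment=CUDA_VISIBLE_DEVICES=$gpu_index
EOF
}

# Usage (as root): write_gpu_dropin /etc/systemd/system/ollama.service.d 0
```

Run `sudo systemctl daemon-reload && sudo systemctl restart ollama` afterwards so the new environment takes effect.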

Important:

Update the config file searchai-config.yml so that searchblox-llm points to the new Azure-based Ollama service; if SearchBlox runs on a different machine than Ollama, use the VM’s address in place of localhost. The config file is located at /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml

searchblox-llm: http://localhost:11434/
llm-platform: "ollama"
searchai-agents-server:
num-thread:

models:
  chat: "qwen2.5"
  document-enrichment: "qwen2.5"
  smart-faq: "qwen2.5"
  searchai-assist-text: "qwen2.5"
  searchai-assist-image: "llama3.2-vision"

cache-settings:
  use-cache: true
  fact-score-threshold: 40

prompts:
  standalone-question: |
    Given the conversation history and a follow-up question, rephrase the follow-up question to be a standalone question that includes all necessary context.
...
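Changing the endpoint can also be scripted. The helper below is a hypothetical sketch that rewrites only the top-level searchblox-llm line and keeps a .bak copy of the file; it assumes the URL contains no `|` or `&` characters, which sed would treat specially.

```shell
# Replace the value of the top-level searchblox-llm key in <config>.
set_llm_endpoint() {
  local config="$1" url="$2"
  # '|' is the sed delimiter so the slashes in the URL need no escaping.
  sed -i.bak "s|^searchblox-llm:.*|searchblox-llm: $url|" "$config"
}

# Usage:
#   sudo set_llm_endpoint is not possible for shell functions, so run as root or via sudo bash:
#   set_llm_endpoint /opt/searchblox/webapps/ROOT/WEB-INF/searchai-config.yml http://<VM-PUBLIC-IP>:11434/
```

SearchBlox most likely needs a restart to pick up the changed file.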