Running Open Source LLMs on WATcloud

Written by Alex Boden

Running your own LLM inference server is straightforward with vLLM. This guide walks you through setting up a vLLM server in the WATcloud compute cluster.

Start an Interactive Session

First, SSH into a login node. Then submit an interactive job to a compute node. Since LLMs are memory-intensive, we will reserve a full RTX 3090 (24 GiB of VRAM).

srun --cpus-per-task 8 --mem 16G --gres gpu:rtx_3090:1,tmpdisk:51200 --time 1:00:00 --pty bash
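The srun command above also requests tmpdisk:51200, which should provide roughly 50 GB of node-local scratch space; the later steps in this guide put the Python environment and model weights under /tmp. As an optional sanity check (a minimal sketch, assuming the scratch space is surfaced at /tmp on the compute node), you can confirm how much space is available:

# Optional: check available scratch space for the venv and model weights
df -h /tmp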

Verify GPU Access

Once your session starts, confirm the GPU is available:

nvidia-smi

You should see output like this:

Thu May 29 05:04:11 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   23C    P8              20W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Set Up vLLM

Create a temporary working directory, set up a Python virtual environment, and install vLLM:

mkdir /tmp/vllm
cd /tmp/vllm
python3 -m venv .venv
source .venv/bin/activate
pip install vllm

Check that vLLM installed correctly:

vllm --version

Sample output:

0.9.0

Start tmux

We will need to run multiple commands at the same time. You can do this in many ways (e.g. &, nohup, srun --overlap). In this guide, we use tmux to manage multiple terminal panes.

Start a tmux session:

tmux

Split the screen into a top and bottom pane with Ctrl+b then ". Switch between panes with Ctrl+b and the arrow keys.

[Screenshot: tmux with 2 panes]
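If you prefer, you can give the tmux session a name so it is easier to find and reattach to from another shell on the same compute node. This is purely optional, and the session name vllm below is just an example:

# Optional: start a named tmux session instead of a bare `tmux`
tmux new -s vllm

# From another shell on the same node: list sessions and reattach
tmux ls
tmux attach -t vllm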

Launch vLLM Server

In the first pane, start the vLLM server with a small model (we use Qwen3-0.6B as an example):

HF_HOME=/tmp/hf_home vllm serve Qwen/Qwen3-0.6B --distributed-executor-backend ray --port $(($(id -u) * 20 + 5))
  • HF_HOME=/tmp/hf_home stores model weights in /tmp to avoid cluttering your home directory and hitting storage quotas.
  • --distributed-executor-backend ray uses the Ray backend, which is typically much faster than the default.
  • --port $(($(id -u) * 20 + 5)) picks a port unique to each user to prevent conflicts.
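If you don't want to retype the port arithmetic in every command, you can optionally compute it once per shell and reuse it. The variable name VLLM_PORT below is just an example, and each tmux pane is a separate shell, so set it in every pane where you use it:

# Optional: compute the per-user port once (same arithmetic as above) and reuse it
export VLLM_PORT=$(( $(id -u) * 20 + 5 ))
echo "vLLM will listen on port ${VLLM_PORT}"

The remaining commands in this guide spell out the arithmetic inline, so this step is not required.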

When the server is ready, you'll see logs that look like:

INFO 05-29 05:36:42 [api_server.py:1336] Starting vLLM API server on http://0.0.0.0:30145
...
INFO:     Started server process [1043336]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
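Once these logs appear, you can optionally sanity-check the server from the other tmux pane before moving on. vLLM exposes an OpenAI-compatible HTTP API, so listing the available models is a quick way to confirm it is reachable (the port expression matches the one used to start the server):

# List the models served by this vLLM instance (should include Qwen/Qwen3-0.6B)
curl -s http://localhost:$(( $(id -u) * 20 + 5 ))/v1/models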

Test with vLLM Chat Client

In the second pane, run:

vllm chat --url http://localhost:$(($(id -u) * 20 + 5))/v1

You should see:

Using model: Qwen/Qwen3-0.6B
Please enter a message for the chat model:
>

You can now chat with the model. Try asking it a question!
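Because the server speaks the OpenAI-compatible API, you can also query it programmatically instead of (or in addition to) using the interactive chat client. Here is a minimal curl sketch; the prompt is just an example, and most OpenAI client libraries work the same way when pointed at this base URL:

# Send a single chat completion request to the local vLLM server
curl -s http://localhost:$(( $(id -u) * 20 + 5 ))/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": "Give me a one-sentence fun fact about Waterloo."}]
      }'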

When you are done, exit the chat by pressing Ctrl+d. To stop the vLLM server, press Ctrl+c in the first pane.

Other Models to Try

Here are a few other models that work out of the box with vLLM on an RTX 3090:

# Qwen3-14B (AWQ quantized) with full context length (40960 tokens)
# https://huggingface.co/Qwen/Qwen3-14B-AWQ
HF_HOME=/tmp/hf_home vllm serve Qwen/Qwen3-14B-AWQ --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --quantization awq_marlin --gpu-memory-utilization 0.85
 
# Qwen3-14B (AWQ quantized) with reduced context length (8192 tokens), serves more concurrent requests
# https://huggingface.co/Qwen/Qwen3-14B-AWQ
HF_HOME=/tmp/hf_home vllm serve Qwen/Qwen3-14B-AWQ --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --quantization awq_marlin --gpu-memory-utilization 0.85 --max-model-len 8192
 
# Gemma-3-27B trained with quantization-aware training (QAT)
# - Google blog post: https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/
# - Unsloth quantized model (4-bit): https://huggingface.co/unsloth/gemma-3-27b-it-qat-bnb-4bit
# Notes:
# - --enforce-eager is used to disable the CUDA graph, which reduces the VRAM usage but also reduces performance.
# - --gpu-memory-utilization 0.99 is used to increase the VRAM available for the KV cache, which allows for longer context lengths.
pip install bitsandbytes
HF_HOME=/tmp/hf_home vllm serve unsloth/gemma-3-27b-it-qat-bnb-4bit --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --gpu-memory-utilization 0.99 --enforce-eager --max-model-len 16384
 
# Gemma-3-27B (non-QAT, 4-bit quantized)
# https://huggingface.co/ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g
# Notes:
# - For unknown reasons, this model can be run with a context length of 32768 tokens, which is more than the 16384 tokens of unsloth/gemma-3-27b-it-qat-bnb-4bit.
HF_HOME=/tmp/hf_home vllm serve ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --gpu-memory-utilization 0.99 --enforce-eager --max-model-len 32768
 
# Llama 3.1 8B (non-quantized)
# Official repo (gated). Requires HF_TOKEN and agreeing to the license on Hugging Face
# https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your_huggingface_token> HF_HOME=/tmp/hf_home vllm serve meta-llama/Llama-3.1-8B-Instruct --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --max-model-len 32768
# NousResearch repo (non-gated)
# https://huggingface.co/NousResearch/Meta-Llama-3.1-8B-Instruct
HF_HOME=/tmp/hf_home vllm serve NousResearch/Meta-Llama-3.1-8B-Instruct --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --max-model-len 32768
 
# DeepSeek R1-0528 distilled to Qwen3-8B
# https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
HF_HOME=/tmp/hf_home vllm serve deepseek-ai/DeepSeek-R1-0528-Qwen3-8B --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --max-model-len 16384

Notes

  • Always store model weights in /tmp or storage locations meant for large files (e.g. /mnt/wato-drive*). You can learn more about the available storage options in the compute cluster user manual.
  • Quantized models (e.g., Q4, Q5, AWQ) are typically required to fit models above ~10B parameters with context lengths beyond ~4096 tokens on a GPU with 24 GiB of VRAM; a rough back-of-envelope estimate is sketched after this list. To estimate the required VRAM more precisely, you can use the VRAM Calculator.
  • The WATcloud compute cluster uses a job-queuing system (Slurm), so your vLLM inference server will automatically stop when the job ends. This setup is ideal for short interactive sessions or batch inference, but not for persistent, long-running servers. If you need a persistent hosted LLM endpoint, consider services offered by other groups on campus, such as the ECE Nebula cluster or the CS Club.
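For intuition, here is a very rough estimate of weight memory only; it ignores the KV cache, activations, and runtime overhead (which also consume VRAM) and uses a hypothetical 14B-parameter model purely as an example:

# Back-of-envelope: VRAM for weights ≈ parameters × bytes per parameter
awk 'BEGIN {
  params_b = 14                                                 # hypothetical 14B-parameter model
  printf "fp16  (2 bytes/param):   ~%.0f GB\n", params_b * 2    # ~28 GB: does not fit in 24 GiB
  printf "4-bit (0.5 bytes/param): ~%.0f GB\n", params_b * 0.5  # ~7 GB: fits, leaving room for KV cache
}'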
