Running Open Source LLMs on WATcloud
Running your own LLM inference server is straightforward with vLLM. This guide walks you through setting up a vLLM server in the WATcloud compute cluster.
Start an Interactive Session
First, SSH into a login node. Then submit an interactive job to a compute node. Since LLMs are memory-intensive, we will reserve a full RTX 3090 (24 GiB of VRAM).
srun --cpus-per-task 8 --mem 16G --gres gpu:rtx_3090:1,tmpdisk:51200 --time 1:00:00 --pty bash
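Once the resources are allocated, Slurm drops you into a shell on the compute node. If you want to double-check what you received, you can inspect the job from inside the session (a quick sanity check, assuming the standard Slurm client tools are on the PATH):
scontrol show job $SLURM_JOB_ID | grep -E 'TRES|TimeLimit'
The output should reflect the CPUs, memory, GPU, and time limit requested above.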
Verify GPU Access
Once your session starts, confirm the GPU is available:
nvidia-smi
You should see output similar to the following:
Thu May 29 05:04:11 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   23C    P8              20W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
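If you prefer a machine-readable check (handy for scripts), nvidia-smi can also report just the fields of interest:
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv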
Set Up vLLM
Create a temporary working directory, set up a Python virtual environment, and install vLLM:
mkdir /tmp/vllm
cd /tmp/vllm
python3 -m venv .venv
source .venv/bin/activate
pip install vllm
Check that vLLM installed correctly:
vllm --version
Sample output:
0.9.0
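vLLM installs PyTorch as a dependency, so you can also confirm that the GPU is visible from Python before launching the server (run inside the same virtual environment):
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
This should print True followed by the GPU name.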
Start tmux
We will need to run multiple commands at the same time. You can do this in many ways (e.g., backgrounding with &, nohup, or srun --overlap). In this guide, we use tmux to manage multiple terminal panes.
Start a tmux session:
tmux
Split the screen horizontally with Ctrl+b then ". Switch between panes with Ctrl+b and the arrow keys.
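Optionally, you can give the session a name so it is easier to identify and reattach to later (standard tmux usage; the name vllm below is arbitrary):
tmux new -s vllm      # start a named session
# detach with Ctrl+b then d; reattach with:
tmux attach -t vllm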

Launch vLLM Server
In the first pane, start the vLLM server with a small model (we use Qwen3-0.6B as an example):
HF_HOME=/tmp/hf_home vllm serve Qwen/Qwen3-0.6B --distributed-executor-backend ray --port $(($(id -u) * 20 + 5))
- HF_HOME=/tmp/hf_home stores model weights in /tmp to avoid cluttering your home directory and hitting storage quotas.
- --distributed-executor-backend ray uses the Ray backend, which is typically much faster than the default.
- --port $(($(id -u) * 20 + 5)) picks a port unique to each user to prevent conflicts (see the quick check after this list).
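To see which port this formula resolves to for your account, you can evaluate the arithmetic directly; the same expression is reused in the commands below, so the value stays consistent:
echo $(($(id -u) * 20 + 5))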
When the server is ready, you'll see logs that look like:
INFO 05-29 05:36:42 [api_server.py:1336] Starting vLLM API server on http://0.0.0.0:30145
...
INFO: Started server process [1043336]
INFO: Waiting for application startup.
INFO: Application startup complete.
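Before moving on, you can confirm the server is reachable. vLLM exposes an OpenAI-compatible HTTP API, so listing the served models from the second tmux pane is a convenient health check:
curl http://localhost:$(($(id -u) * 20 + 5))/v1/models
The response is a small JSON document that includes Qwen/Qwen3-0.6B.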
Test with vLLM Chat Client
In the second pane, run:
vllm chat --url http://localhost:$(($(id -u) * 20 + 5))/v1
You should see:
Using model: Qwen/Qwen3-0.6B
Please enter a message for the chat model:
>
You can now chat with the model. Try asking it a question!
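You can also script requests instead of using the interactive client, since the server implements the OpenAI-compatible chat completions endpoint. A minimal sketch with curl (the prompt and max_tokens value are arbitrary):
curl http://localhost:$(($(id -u) * 20 + 5))/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is the capital of Canada?"}],
    "max_tokens": 256
  }'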
When you are done, exit the chat by pressing Ctrl+d. To stop the vLLM server, press Ctrl+c in the first pane.
Other Models to Try
Here are a few other vLLM models that work out of the box on an RTX 3090:
# Qwen3-14B (AWQ quantized) with full context length (40960 tokens)
# https://huggingface.co/Qwen/Qwen3-14B-AWQ
HF_HOME=/tmp/hf_home vllm serve Qwen/Qwen3-14B-AWQ --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --quantization awq_marlin --gpu-memory-utilization 0.85
# Qwen3-14B (AWQ quantized) with reduced context length (8192 tokens), serves more concurrent requests
# https://huggingface.co/Qwen/Qwen3-14B-AWQ
HF_HOME=/tmp/hf_home vllm serve Qwen/Qwen3-14B-AWQ --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --quantization awq_marlin --gpu-memory-utilization 0.85 --max-model-len 8192
# Gemma-3-27B trained with quantization-aware training (QAT)
# - Google blog post: https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/
# - Unsloth quantized model (4-bit): https://huggingface.co/unsloth/gemma-3-27b-it-qat-bnb-4bit
# Notes:
# - --enforce-eager is used to disable the CUDA graph, which reduces the VRAM usage but also reduces performance.
# - --gpu-memory-utilization 0.99 is used to increase the VRAM available for the KV cache, which allows for longer context lengths.
pip install bitsandbytes
HF_HOME=/tmp/hf_home vllm serve unsloth/gemma-3-27b-it-qat-bnb-4bit --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --gpu-memory-utilization 0.99 --enforce-eager --max-model-len 16384
# Gemma-3-27B (non-QAT, 4-bit quantized)
# https://huggingface.co/ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g
# Notes:
# - For unknown reasons, this model can be run with a context length of 32768 tokens, which is more than the 16384 tokens of unsloth/gemma-3-27b-it-qat-bnb-4bit.
HF_HOME=/tmp/hf_home vllm serve ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --gpu-memory-utilization 0.99 --enforce-eager --max-model-len 32768
# Llama 3.1 8B (non-quantized)
# Official repo (gated). Requires HF_TOKEN and agreement to license on Hugging Face
# https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your_huggingface_token> HF_HOME=/tmp/hf_home vllm serve meta-llama/Llama-3.1-8B-Instruct --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --max-model-len 32768
# NousResearch repo (non-gated)
# https://huggingface.co/NousResearch/Meta-Llama-3.1-8B-Instruct
HF_HOME=/tmp/hf_home vllm serve NousResearch/Meta-Llama-3.1-8B-Instruct --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --max-model-len 32768
# DeepSeek R1-0528 distilled to Qwen3-8B
# https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
HF_HOME=/tmp/hf_home vllm serve deepseek-ai/DeepSeek-R1-0528-Qwen3-8B --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --max-model-len 16384
Notes
- Always store model weights in /tmp or storage locations meant for large files (e.g. /mnt/wato-drive*). You can learn more about the storage options available in the compute cluster user manual. (See the quick check after this list.)
- Quantized models (e.g., Q4, Q5, AWQ) are typically required for models with more than ~10B parameters and context lengths exceeding ~4096 tokens when using a GPU with 24 GiB of VRAM. To estimate the amount of VRAM needed, you can use the VRAM Calculator.
- The WATcloud compute cluster uses a job-queuing system (Slurm), so your vLLM inference server will automatically stop when the job ends. This setup is ideal for short interactive sessions or batch inference, but not for persistent, long-running servers. If you need a persistent hosted LLM endpoint, consider services offered by other groups on campus, such as the ECE Nebula cluster or the CS Club.
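Downloaded weights can take up a lot of space, so it is worth keeping an eye on the temporary disk while experimenting with larger models (a quick check, assuming the tmpdisk allocation requested in the srun command above is mounted at /tmp, as the use of /tmp in this guide suggests):
df -h /tmp            # free space on the temporary disk
du -sh /tmp/hf_home   # size of the downloaded model weights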