Running Open Source LLMs on WATcloud
Running your own LLM inference server is straightforward with vLLM. This guide walks you through setting up a vLLM server in the WATcloud compute cluster.
Start an Interactive Session
First, SSH into a login node. Then submit an interactive job to a compute node. Since LLMs are memory-intensive, we will reserve a full RTX 3090 (24 GiB of VRAM).
srun --cpus-per-task 8 --mem 16G --gres gpu:rtx_3090:1,tmpdisk:51200 --time 1:00:00 --pty bash
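Once the resources are allocated, Slurm drops you into a shell on the compute node. If you want to double-check what you received, you can inspect the job from inside the session (a quick sanity check, assuming the standard Slurm client tools are on the PATH):
scontrol show job $SLURM_JOB_ID | grep -E 'TRES|TimeLimit'
The output should reflect the CPUs, memory, GPU, and time limit requested above.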
Verify GPU Access
Once your session starts, confirm the GPU is available:
nvidia-smi
You should see output similar to the following:
Thu May 29 05:04:11 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   23C    P8              20W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
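If you prefer a machine-readable check (handy for scripts), nvidia-smi can also report just the fields of interest:
nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv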
Set Up vLLM
Create a temporary working directory, set up a Python virtual environment, and install vLLM:
mkdir /tmp/vllm
cd /tmp/vllm
python3 -m venv .venv
source .venv/bin/activate
pip install vllm
Check that vLLM installed correctly:
vllm --version
Sample output:
0.9.0
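vLLM installs PyTorch as a dependency, so you can also confirm that the GPU is visible from Python before launching the server (run inside the same virtual environment):
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
This should print True followed by the GPU name.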
Start tmux
We will need to run multiple commands at the same time. You can do this in many ways (e.g., backgrounding with &, nohup, or srun --overlap). In this guide, we use tmux to manage multiple terminal panes.
Start a tmux session:
tmux
Split the screen horizontally with Ctrl+b then ". Switch between panes with Ctrl+b and the arrow keys.
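Optionally, you can give the session a name so it is easier to identify and reattach to later (standard tmux usage; the name vllm below is arbitrary):
tmux new -s vllm      # start a named session
# detach with Ctrl+b then d; reattach with:
tmux attach -t vllm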

Launch vLLM Server
In the first pane, start the vLLM server with a small model (we use Qwen3-0.6B as an example):
HF_HOME=/tmp/hf_home vllm serve Qwen/Qwen3-0.6B --distributed-executor-backend ray --port $(($(id -u) * 20 + 5))
- HF_HOME=/tmp/hf_home stores model weights in /tmp to avoid cluttering your home directory and hitting storage quotas.
- --distributed-executor-backend ray uses the Ray backend, which is typically much faster than the default.
- --port $(($(id -u) * 20 + 5)) picks a port unique to each user to prevent conflicts (see the quick check after this list).
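To see which port this formula resolves to for your account, you can evaluate the arithmetic directly; the same expression is reused in the commands below, so the value stays consistent:
echo $(($(id -u) * 20 + 5))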
When the server is ready, you'll see logs that look like:
INFO 05-29 05:36:42 [api_server.py:1336] Starting vLLM API server on http://0.0.0.0:30145
...
INFO: Started server process [1043336]
INFO: Waiting for application startup.
INFO: Application startup complete.
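Before moving on, you can confirm the server is reachable. vLLM exposes an OpenAI-compatible HTTP API, so listing the served models from the second tmux pane is a convenient health check:
curl http://localhost:$(($(id -u) * 20 + 5))/v1/models
The response is a small JSON document that includes Qwen/Qwen3-0.6B.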
Test with vLLM Chat Client
In the second pane, run:
vllm chat --url http://localhost:$(($(id -u) * 20 + 5))/v1
You should see:
Using model: Qwen/Qwen3-0.6B
Please enter a message for the chat model:
>
You can now chat with the model. Try asking it a question!
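You can also script requests instead of using the interactive client, since the server implements the OpenAI-compatible chat completions endpoint. A minimal sketch with curl (the prompt and max_tokens value are arbitrary):
curl http://localhost:$(($(id -u) * 20 + 5))/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "What is the capital of Canada?"}],
    "max_tokens": 256
  }'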
When you are done, exit the chat by pressing Ctrl+d. To stop the vLLM server, press Ctrl+c in the first pane.
Other Models to Try
Here are a few other vLLM models that work out of the box on an RTX 3090:
# Qwen3-14B (AWQ quantized) with full context length (40960 tokens)
# https://huggingface.co/Qwen/Qwen3-14B-AWQ
HF_HOME=/tmp/hf_home vllm serve Qwen/Qwen3-14B-AWQ --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --quantization awq_marlin --gpu-memory-utilization 0.85
# Qwen3-14B (AWQ quantized) with reduced context length (8192 tokens), serves more concurrent requests
# https://huggingface.co/Qwen/Qwen3-14B-AWQ
HF_HOME=/tmp/hf_home vllm serve Qwen/Qwen3-14B-AWQ --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --quantization awq_marlin --gpu-memory-utilization 0.85 --max-model-len 8192
# Gemma-3-27B trained with quantization-aware training (QAT)
# - Google blog post: https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/
# - Unsloth quantized model (4-bit): https://huggingface.co/unsloth/gemma-3-27b-it-qat-bnb-4bit
# Notes:
# - --enforce-eager is used to disable the CUDA graph, which reduces the VRAM usage but also reduces performance.
# - --gpu-memory-utilization 0.99 is used to increase the VRAM available for the KV cache, which allows for longer context lengths.
pip install bitsandbytes
HF_HOME=/tmp/hf_home vllm serve unsloth/gemma-3-27b-it-qat-bnb-4bit --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --gpu-memory-utilization 0.99 --enforce-eager --max-model-len 16384
# Gemma-3-27B (non-QAT, 4-bit quantized)
# https://huggingface.co/ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g
# Notes:
# - For unknown reasons, this model can be run with a context length of 32768 tokens, which is more than the 16384 tokens of unsloth/gemma-3-27b-it-qat-bnb-4bit.
HF_HOME=/tmp/hf_home vllm serve ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --gpu-memory-utilization 0.99 --enforce-eager --max-model-len 32768
# Llama 3.1 8B (non-quantized)
# Official repo (gated). Requires HF_TOKEN and agreement to license on Hugging Face
# https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your_huggingface_token> HF_HOME=/tmp/hf_home vllm serve meta-llama/Llama-3.1-8B-Instruct --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --max-model-len 32768
# NousResearch repo (non-gated)
# https://huggingface.co/NousResearch/Meta-Llama-3.1-8B-Instruct
HF_HOME=/tmp/hf_home vllm serve NousResearch/Meta-Llama-3.1-8B-Instruct --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --max-model-len 32768
# DeepSeek R1-0528 distilled to Qwen3-8B
# https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
HF_HOME=/tmp/hf_home vllm serve deepseek-ai/DeepSeek-R1-0528-Qwen3-8B --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --max-model-len 16384
Notes
- Always store model weights in /tmp or storage locations meant for large files (e.g. /mnt/wato-drive*). You can learn more about the storage options available in the compute cluster user manual. (See the quick check after this list.)
- Quantized models (e.g., Q4, Q5, AWQ) are typically required for models with more than ~10B parameters and context lengths exceeding ~4096 tokens when using a GPU with 24 GiB of VRAM. To estimate the amount of VRAM needed, you can use the VRAM Calculator.
- The WATcloud compute cluster uses a job-queuing system (Slurm), so your vLLM inference server will automatically stop when the job ends. This setup is ideal for short interactive sessions or batch inference, but not for persistent, long-running servers. If you need a persistent hosted LLM endpoint, consider services offered by other groups on campus, such as the ECE Nebula cluster or the CS Club.
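Downloaded weights can take up a lot of space, so it is worth keeping an eye on the temporary disk while experimenting with larger models (a quick check, assuming the tmpdisk allocation requested in the srun command above is mounted at /tmp, as the use of /tmp in this guide suggests):
df -h /tmp            # free space on the temporary disk
du -sh /tmp/hf_home   # size of the downloaded model weights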