Running Open Source LLMs on WATcloud
Running your own LLM inference server is straightforward with vLLM. This guide walks you through setting up a vLLM server in the WATcloud compute cluster.
Start an Interactive Session
First, SSH into a login node. Then submit an interactive job to a compute node. Since LLMs are memory-intensive, we will reserve a full RTX 3090 (24 GiB of VRAM).
```bash
srun --cpus-per-task 8 --mem 16G --gres gpu:rtx_3090:1,tmpdisk:51200 --time 1:00:00 --pty bash
```

Verify GPU Access
Once your session starts, confirm the GPU is available:
```bash
nvidia-smi
```

You should see output like this:
```
Thu May 29 05:04:11 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06             Driver Version: 535.183.06   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:01:00.0 Off |                  N/A |
|  0%   23C    P8              20W / 350W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
```

Set Up vLLM
Create a temporary working directory, set up a Python virtual environment, and install vLLM:
```bash
mkdir /tmp/vllm
cd /tmp/vllm
python3 -m venv .venv
source .venv/bin/activate
pip install vllm
```

Check that vLLM installed correctly:
```bash
vllm --version
```

Sample output:

```
0.9.0
```

Start tmux
We will need to run multiple commands at the same time. There are many ways to do this (e.g. `&`, `nohup`, `srun --overlap`); in this guide, we use tmux to manage multiple terminal panes.
Start a tmux session:
```bash
tmux
```

Split the screen horizontally with Ctrl+b then ". Switch between panes with Ctrl+b and the arrow keys.

Launch vLLM Server
In the first pane, start the vLLM server with a small model (we use Qwen3-0.6B as an example):
```bash
HF_HOME=/tmp/hf_home vllm serve Qwen/Qwen3-0.6B --distributed-executor-backend ray --port $(($(id -u) * 20 + 5))
```

- `HF_HOME=/tmp/hf_home` stores model weights in `/tmp` to avoid cluttering your home directory and hitting storage quotas.
- `--distributed-executor-backend ray` uses the Ray backend, which is typically much faster than the default.
- `--port $(($(id -u) * 20 + 5))` picks a port unique to each user to prevent conflicts.
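The port arithmetic can be sketched in Python (`user_port` is a hypothetical helper for illustration, not part of vLLM) to see why each user gets a distinct port:

```python
import os

def user_port(uid: int) -> int:
    """Mirror the shell expression $(( $(id -u) * 20 + 5 )).

    Each UID maps to a distinct base port, spaced 20 apart, so two
    users on the same compute node will not collide.
    """
    return uid * 20 + 5

print(user_port(1000))         # e.g. UID 1000 -> port 20005
print(user_port(os.getuid()))  # the port the shell commands would pick for you
```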
When the server is ready, you’ll see logs that look like:
```
INFO 05-29 05:36:42 [api_server.py:1336] Starting vLLM API server on http://0.0.0.0:30145
...
INFO:     Started server process [1043336]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
```

Test with vLLM Chat Client
In the second pane, run:
```bash
vllm chat --url http://localhost:$(($(id -u) * 20 + 5))/v1
```

You should see:
```
Using model: Qwen/Qwen3-0.6B
Please enter a message for the chat model:
>
```

You can now chat with the model. Try asking it a question!
When you are done, exit the chat by pressing Ctrl+d. To stop the vLLM server, press Ctrl+c in the first pane.
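The chat client is just one consumer of the server: vLLM also exposes an OpenAI-compatible REST API on the same port. Here is a minimal sketch using only the Python standard library (it assumes the Qwen3-0.6B server from above is still running; the port expression mirrors the shell one):

```python
import json
import os
import urllib.request

def chat_request(model: str, prompt: str) -> bytes:
    """Build the JSON body for a /v1/chat/completions call."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

if __name__ == "__main__":
    port = os.getuid() * 20 + 5  # same per-user port as the shell commands
    req = urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=chat_request("Qwen/Qwen3-0.6B", "What is WATcloud?"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)
        print(reply["choices"][0]["message"]["content"])
```

Any OpenAI-compatible SDK pointed at `http://localhost:<port>/v1` will work the same way.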
Other Models to Try
Here are a few other vLLM models that work out of the box on an RTX 3090:
```bash
# Qwen3-14B (AWQ quantized) with full context length (40960 tokens)
# https://huggingface.co/Qwen/Qwen3-14B-AWQ
HF_HOME=/tmp/hf_home vllm serve Qwen/Qwen3-14B-AWQ --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --quantization awq_marlin --gpu-memory-utilization 0.85

# Qwen3-14B (AWQ quantized) with reduced context length (8192 tokens), serves more concurrent requests
# https://huggingface.co/Qwen/Qwen3-14B-AWQ
HF_HOME=/tmp/hf_home vllm serve Qwen/Qwen3-14B-AWQ --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --quantization awq_marlin --gpu-memory-utilization 0.85 --max-model-len 8192

# Gemma-3-27B trained with quantization-aware training (QAT)
# - Google blog post: https://developers.googleblog.com/en/gemma-3-quantized-aware-trained-state-of-the-art-ai-to-consumer-gpus/
# - Unsloth quantized model (4-bit): https://huggingface.co/unsloth/gemma-3-27b-it-qat-bnb-4bit
# Notes:
# - --enforce-eager disables the CUDA graph, which reduces VRAM usage at the cost of performance.
# - --gpu-memory-utilization 0.99 increases the VRAM available for the KV cache, which allows for longer context lengths.
pip install bitsandbytes
HF_HOME=/tmp/hf_home vllm serve unsloth/gemma-3-27b-it-qat-bnb-4bit --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --gpu-memory-utilization 0.99 --enforce-eager --max-model-len 16384

# Gemma-3-27B (non-QAT, 4-bit quantized)
# https://huggingface.co/ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g
# Notes:
# - For unknown reasons, this model can run with a context length of 32768 tokens, which is more than the 16384 tokens of unsloth/gemma-3-27b-it-qat-bnb-4bit.
HF_HOME=/tmp/hf_home vllm serve ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --gpu-memory-utilization 0.99 --enforce-eager --max-model-len 32768

# Llama 3.1 8B (non-quantized)
# Official repo (gated). Requires HF_TOKEN and agreement to the license on Hugging Face
# https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
HF_TOKEN=<your_huggingface_token> HF_HOME=/tmp/hf_home vllm serve meta-llama/Llama-3.1-8B-Instruct --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --max-model-len 32768

# NousResearch repo (non-gated)
# https://huggingface.co/NousResearch/Meta-Llama-3.1-8B-Instruct
HF_HOME=/tmp/hf_home vllm serve NousResearch/Meta-Llama-3.1-8B-Instruct --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --max-model-len 32768

# DeepSeek R1-0528 distilled to Qwen3-8B
# https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
HF_HOME=/tmp/hf_home vllm serve deepseek-ai/DeepSeek-R1-0528-Qwen3-8B --distributed-executor-backend ray --port $(($(id -u) * 20 + 5)) --max-model-len 16384
```

Notes
- Always store model weights in `/tmp` or storage locations meant for large files (e.g. `/mnt/wato-drive*`). You can learn more about the storage options available in the compute cluster user manual.
- Quantized models (e.g., Q4, Q5, AWQ) are typically required for models over ~10B parameters with context lengths exceeding ~4096 tokens on a GPU with 24 GiB of VRAM. To estimate the amount of VRAM needed, you can use the VRAM Calculator.
- The WATcloud compute cluster uses a job-queuing system (Slurm), so your vLLM inference server will automatically stop when the job ends. This setup is ideal for short interactive sessions or batch inference, but not for persistent, long-running servers. If you need a persistent hosted LLM endpoint, consider services offered by other groups on campus, such as the ECE Nebula cluster or the CS Club.
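As a rough back-of-the-envelope version of the VRAM note above, the following sketch adds quantized weights to an fp16 KV cache. It ignores activation overhead and grouped-query attention (which shrinks the KV cache on most modern models), so it tends to overestimate, and the architecture numbers in the example are illustrative, not any specific model's:

```python
def approx_vram_gib(n_params_billion: float, bits_per_weight: int,
                    n_layers: int, hidden_size: int, ctx_len: int) -> float:
    """Very rough VRAM estimate: quantized weights + fp16 KV cache."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    # K and V, one per layer, 2 bytes (fp16) per element, per token
    kv_bytes = 2 * n_layers * hidden_size * 2 * ctx_len
    return (weight_bytes + kv_bytes) / 2**30

# A 14B model at 4-bit with an 8192-token context, using made-up
# architecture numbers (40 layers, hidden size 5120): roughly 13 GiB,
# comfortably inside an RTX 3090's 24 GiB.
print(round(approx_vram_gib(14, 4, 40, 5120, 8192), 1))
```

For anything more precise, prefer the VRAM Calculator mentioned above.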