GLM-4.7-Flash on DGX Spark with native upstream vLLM

• llm, vllm, nvidia, dgx, gpu, troubleshooting, python, transformers

TL;DR

We got GLM-4.7-Flash serving successfully on a DGX Spark, running natively on the host, using the upstream vLLM nightly for CUDA 13.0, Python 3.12, and Transformers installed from source.

Known-good approach

  • Use a fresh Python 3.12 virtualenv.
  • Install vLLM nightly for cu130.
  • Install Transformers from GitHub source.
  • Run vllm from outside the cloned repo directory unless the repo is built/installed properly.
  • Use --reasoning-parser glm45 for clean separation of visible output vs reasoning.
  • The following minimal serve command works:
vllm serve /models/huggingface/GLM-4.7-Flash/ \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name glm-4.7-flash \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.60 \
  --reasoning-parser glm45 \
  --generation-config vllm

What failed and why

  • The older DGX Spark helper-repo path was useful for learning, but it became a compatibility trap.
  • flashinfer-python==0.4.1 failed because of a bad pyproject.toml license field with newer setuptools.
  • An older vLLM source build had a missing 12.0f CMake arch entry, which caused an undefined symbol in vllm._C.abi3.so.
  • Old 0.11-era vLLM plus newer Transformers caused tokenizer compatibility failures.
  • Installing the default nightly wheel pulled the wrong CUDA runtime expectation; the correct fix was to use the cu130 wheel variant.
  • Running the server from inside the cloned ~/code/vllm repo caused Python to import the source tree instead of the installed package, leading to No module named 'vllm._C'.

Bottom line

For a working homelab daemon, upstream-first won.


Goal

Primary objective:

  • Run GLM-4.7-Flash locally in a stable, reusable daemon on DGX Spark.

Secondary objective:

  • Capture the failed paths and lessons learned so future setup is fast and less error-prone.

Environment assumptions

  • Hardware: NVIDIA DGX Spark / GB10
  • Compute capability seen by software: 12.1
  • Desired user-level stack: native host install, not an aging container image
  • Model path used during successful test:
/models/huggingface/GLM-4.7-Flash/

Final known-good stack

Python

Use Python 3.12.

Reason:

  • This matched the successful run.
  • It aligns better with vLLM’s documented install examples than Python 3.13 does.

vLLM

Use nightly vLLM for CUDA 13.0 (cu130).

Reason:

  • Default nightly is not necessarily the CUDA variant you want.
  • The working install needed the explicit CUDA 13.0 wheel line.

Transformers

Use Transformers from source.

Reason:

  • GLM-4.7 / GLM-4.7-Flash support tracks recent upstream changes, so a source install stays aligned with newer upstream behavior.
  • This matched upstream guidance and the successful path.

Execution style

Run the server from outside the cloned vllm repo unless doing an intentional editable/source install.

Reason:

  • Avoid Python import shadowing between local source tree and installed wheel.

Known-good setup procedure

1) Create a clean environment

cd ~/code/vllm
rm -rf .venv
uv venv --python 3.12 .venv
source .venv/bin/activate
unset LD_LIBRARY_PATH
unset PYTHONPATH
unset VLLM_USE_FLASHINFER_MXFP4_MOE

2) Install vLLM nightly for CUDA 13.0

uv pip install -U pip setuptools wheel
uv pip install -U vllm \
  --torch-backend=cu130 \
  --extra-index-url https://wheels.vllm.ai/nightly/cu130

3) Install Transformers from source

uv pip install -U git+https://github.com/huggingface/transformers.git

4) Verify package resolution

Run this from outside the repo root, e.g. from ~:

cd ~
source ~/code/vllm/.venv/bin/activate

python - <<'PY'
import sys, transformers, tokenizers, vllm
print("python:", sys.version)
print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)
print("vllm:", vllm.__version__)
print("vllm file:", vllm.__file__)
PY

Expected behavior:

  • vllm.__file__ should point into .venv/lib/.../site-packages/vllm/...
  • It should not point into ~/code/vllm/vllm/...

5) Start the server

vllm serve /models/huggingface/GLM-4.7-Flash/ \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name glm-4.7-flash \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.60 \
  --reasoning-parser glm45 \
  --generation-config vllm

6) Smoke test the server

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7-flash",
    "messages": [
      {"role": "user", "content": "Reply with exactly: native upstream works"}
    ],
    "temperature": 0
  }'

Expected good behavior:

  • HTTP 200
  • choices[0].message.content is clean final output
  • reasoning appears in choices[0].message.reasoning, not inside content
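A minimal client-side sketch of that separation, assuming the reasoning field name shown above (the exact key can vary across vLLM versions, and the sample payload here is fabricated for illustration):

```python
# Sketch: split the visible answer from the reasoning field in a
# chat-completions response body. The "reasoning" key name is an
# assumption; check what your vLLM version actually returns.
import json

# Fabricated sample response, shaped like the smoke test above.
sample = json.dumps({
    "choices": [{
        "message": {
            "content": "native upstream works",
            "reasoning": "1. The user wants an exact string reply...",
        }
    }]
})

def split_answer(raw):
    """Return (visible content, reasoning or None) from a response body."""
    msg = json.loads(raw)["choices"][0]["message"]
    return msg.get("content", ""), msg.get("reasoning")

answer, reasoning = split_answer(sample)
print(answer)  # native upstream works
```

A client that only wants the final answer can simply drop the second element of the tuple.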

Full setup script

#!/usr/bin/env bash
set -euo pipefail

VLLM_REPO_DIR="${HOME}/code/vllm"
VENV_DIR="${VLLM_REPO_DIR}/.venv"
MODEL_DIR="/models/huggingface/GLM-4.7-Flash/"
HOST="0.0.0.0"
PORT="8000"
MODEL_NAME="glm-4.7-flash"
GPU_UTIL="0.60"

cd "${VLLM_REPO_DIR}"
rm -rf "${VENV_DIR}"
uv venv --python 3.12 "${VENV_DIR}"
source "${VENV_DIR}/bin/activate"

unset LD_LIBRARY_PATH
unset PYTHONPATH
unset VLLM_USE_FLASHINFER_MXFP4_MOE

uv pip install -U pip setuptools wheel
uv pip install -U vllm \
  --torch-backend=cu130 \
  --extra-index-url https://wheels.vllm.ai/nightly/cu130
uv pip install -U git+https://github.com/huggingface/transformers.git

cd "${HOME}"
python - <<'PY'
import sys, transformers, tokenizers, vllm
print("python:", sys.version)
print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)
print("vllm:", vllm.__version__)
print("vllm file:", vllm.__file__)
PY

exec vllm serve "${MODEL_DIR}" \
  --host "${HOST}" \
  --port "${PORT}" \
  --served-model-name "${MODEL_NAME}" \
  --dtype bfloat16 \
  --gpu-memory-utilization "${GPU_UTIL}" \
  --reasoning-parser glm45 \
  --generation-config vllm

What we learned from the failed paths

1) Older helper repos can be useful, but they freeze old assumptions

The DGX Spark helper repo was still valuable because it exposed several real Spark/Blackwell issues:

  • FlashInfer packaging metadata breakage
  • missing Blackwell-related CMake arch coverage
  • older environment assumptions

But it was not the best final serving solution for GLM-4.7-Flash.

Lesson

Use old setup repos as debug references, not necessarily as the long-term architecture.


2) FlashInfer packaging can fail for boring reasons

One major failure had nothing to do with GB10 itself.

flashinfer-python==0.4.1 failed to build because of a packaging metadata issue in pyproject.toml.

Symptom

  • build failure complaining about project.license
  • newer setuptools rejected the legacy single-string form

Lesson

Not every “Spark build failure” is a GPU problem. Some are just Python packaging drift.
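For reference, the two spellings of the license field look like this (example-pkg is a hypothetical package; which form validates depends on the setuptools version, since the table form is the original PEP 621 spelling and the bare SPDX string is only accepted once setuptools implements PEP 639):

```toml
[project]
name = "example-pkg"
version = "0.1.0"

# Original PEP 621 table form:
# license = { file = "LICENSE" }

# PEP 639 SPDX string form (newer setuptools):
license = "Apache-2.0"
```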


3) Partial architecture patches are worse than obvious failure

The older source build eventually imported only after fixing an incomplete CMake arch entry.

Symptom

  • ImportError on vllm._C.abi3.so
  • unresolved symbol related to cutlass_moe_mm_sm100...

Cause

A cuda_archs_loose_intersection line still missed 12.0f.

Lesson

For Blackwell-family support, a patch that lands in two places but misses a third can create extremely misleading outcomes.


4) Old vLLM + new Transformers is an easy compatibility trap

An older vLLM snapshot combined with newer Transformers led to tokenizer failures.

Symptom

  • tokenizer backend missing all_special_tokens_extended

Lesson

If serving a new model family, it is often safer to follow the model’s current upstream instructions than to fight an older snapshot into shape.


5) The CUDA wheel variant matters

Using a general nightly wheel was not enough.

Symptom

  • ImportError: libcudart.so.12: cannot open shared object file

Cause

Wrong prebuilt runtime expectations for the host stack.

Fix

Install the explicit cu130 vLLM nightly variant.

Lesson

When a project publishes per-CUDA wheel lines, pick the one that matches the host/runtime target instead of trusting the generic default.


6) Running from inside the repo can sabotage a perfectly fine install

This was one of the more subtle but important issues.

Symptom

  • the CLI looked installed correctly
  • a subprocess imported local source files
  • then failed with No module named 'vllm._C'

Cause

Python path shadowing from the cloned source tree.

Lesson

When using prebuilt wheels, launch from outside the repo or explicitly verify vllm.__file__ before troubleshooting anything deeper.
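That verification can be done programmatically; a minimal stdlib sketch (run it for vllm from outside the repo; json serves as a control case, since stdlib modules never resolve to site-packages):

```python
# Sketch: check whether a package would import from site-packages
# rather than being shadowed by a local source tree.
import importlib.util

def imports_from_site_packages(package):
    """True if `package` resolves to a site-packages directory."""
    spec = importlib.util.find_spec(package)
    return bool(spec and spec.origin and "site-packages" in spec.origin)

# Stdlib modules never live in site-packages:
print(imports_from_site_packages("json"))  # False
# From outside ~/code/vllm, expect True for a wheel-installed vllm:
# print(imports_from_site_packages("vllm"))
```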


7) Reasoning parser makes the output much cleaner

Without a reasoning parser, GLM-4.7-Flash placed reasoning markup into the visible response body.

Symptom

  • numbered internal analysis text in content
  • literal </think> artifacts

Fix

Use:

--reasoning-parser glm45

Result

  • visible final answer stayed in content
  • reasoning moved to the structured reasoning field

Lesson

For this model, parser configuration is not cosmetic; it materially improves API cleanliness.


Performance and operational notes

Startup cost

First startup is slow and heavy.

Observed behavior:

  • weights load: about 73 seconds
  • model load total: about 76 seconds
  • engine init / profiling / warmup / graph capture: about 141 seconds total
  • CUDA graph capture alone: about 123 seconds

Implication

This is acceptable for a daemon, but unpleasant for rapid restart loops.


Throughput observations

Observed in this run:

  • prompt throughput around 1.2 tokens/s
  • generation throughput around 16.5 tokens/s

This is not fast, but it is operational.

Lesson

Treat DGX Spark + GLM-4.7-Flash as a usable local serving target, not a speed demon.
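As a back-of-envelope consequence of the rates above:

```python
# At ~16.5 generated tokens/s, a 1,000-token completion takes about
# a minute; budget accordingly for batch or agent workloads.
generation_tok_per_s = 16.5
completion_tokens = 1000
print(round(completion_tokens / generation_tok_per_s))  # 61 seconds
```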


MoE config fallback

vLLM reported:

Using default MoE config. Performance might be sub-optimal!
Config file not found at ... device_name=NVIDIA_GB10.json

Implication

There may be future gains available if vLLM gets a GB10-specific MoE tuning profile.

Recommendation

Record this as an optimization opportunity, not an immediate blocker.


PyTorch capability warning

PyTorch warned that the GB10 is compute capability 12.1, while the build claimed support up to 12.0.

Reality check

The stack still worked.

Recommendation

Do not panic about this if serving is stable. Treat it as a future optimization / support-maturity note.


Suggested next-step experiments

Now that the daemon works, test changes one at a time.

Safe next steps

  1. Keep the current known-good command as baseline.
  2. Try a reduced max context if the full 202k-token window is unnecessary.
  3. Validate behavior from your real client.
  4. Test whether your client can use or ignore the reasoning field cleanly.

Higher-risk later experiments

  • --max-model-len 131072
  • MTP / speculative decoding
  • tool-calling flags
  • long-context tuning
  • source build from current upstream repo if you later want to patch or optimize deeply

Recommended “known good” launch command

Stable baseline

vllm serve /models/huggingface/GLM-4.7-Flash/ \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name glm-4.7-flash \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.60 \
  --reasoning-parser glm45 \
  --generation-config vllm

Expanded command that also worked

This later command was also verified to start successfully on the same stack:

vllm serve /models/huggingface/GLM-4.7-Flash/ \
  --host 0.0.0.0 \
  --port 8000 \
  --served-model-name glm-4.7-flash \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.60 \
  --max-model-len 131072 \
  --generation-config vllm \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1 \
  --tool-call-parser glm47 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice

Notes on the expanded command

Observed behavior:

  • --tool-call-parser glm47 worked on the newer upstream/nightly stack.
  • --enable-auto-tool-choice initialized successfully.
  • MTP speculative decoding loaded a drafter model and reported useful acceptance metrics.
  • --max-model-len 131072 reduced the advertised maximum context compared with the larger default and improved concurrency headroom.

Observed caveats:

  • startup was heavier than the baseline due to drafter load plus graph capture
  • vLLM warned that min_p and logit_bias do not work with speculative decoding
  • compile and graph-capture overhead increased noticeably

Performance notes for the expanded command

Observed in this run:

  • total model load memory around 57.14 GiB
  • initial engine/profile/warmup/init time around 167.60 seconds
  • CUDA graph capture took about 121 seconds
  • available KV cache memory about 12.6 GiB
  • GPU KV cache size about 244,640 tokens
  • max concurrency for 131,072-token requests about 1.87x
  • speculative decoding acceptance rate about 87.4%
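The ~1.87x concurrency figure is consistent with the other numbers reported above:

```python
# Max concurrency is KV cache capacity divided by the per-request
# token budget (--max-model-len).
kv_cache_tokens = 244_640
max_model_len = 131_072
print(round(kv_cache_tokens / max_model_len, 2))  # 1.87
```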

Practical interpretation

The newer upstream stack not only serves GLM-4.7-Flash cleanly; it also supports the more advanced GLM-oriented flags that originally failed on the older setup.

Final conclusion

The most important conclusion from this whole effort is simple:

A clean, upstream-first native install works better for GLM-4.7-Flash on DGX Spark than trying to preserve an older patch-heavy setup.

The old path was still useful because it exposed real issues and taught useful lessons, but the successful operational solution was:

  • fresh environment
  • Python 3.12
  • explicit cu130 nightly vLLM
  • Transformers from source
  • careful path hygiene
  • glm45 reasoning parser

That is the secret sauce.