TL;DR
We got GLM-4.7-Flash serving natively on a DGX Spark, using the upstream vLLM nightly for CUDA 13.0, Python 3.12, and Transformers installed from source.
Known-good approach
- Use a fresh Python 3.12 virtualenv.
- Install vLLM nightly for `cu130`.
- Install Transformers from GitHub source.
- Run `vllm` from outside the cloned repo directory unless the repo is built/installed properly.
- Use `--reasoning-parser glm45` for clean separation of visible output vs reasoning.
- The following minimal serve command works:
vllm serve /models/huggingface/GLM-4.7-Flash/ \
--host 0.0.0.0 \
--port 8000 \
--served-model-name glm-4.7-flash \
--dtype bfloat16 \
--gpu-memory-utilization 0.60 \
--reasoning-parser glm45 \
--generation-config vllm
What failed and why
- Older DGX Spark helper repo path was useful for learning, but became a compatibility trap.
- `flashinfer-python==0.4.1` failed because of a bad `pyproject.toml` license field with newer setuptools.
- An older vLLM source build had a missing `12.0f` CMake arch entry, which caused an undefined symbol in `vllm._C.abi3.so`.
- Old 0.11-era vLLM plus newer Transformers caused tokenizer compatibility failures.
- Installing the default nightly wheel pulled the wrong CUDA runtime expectation; the correct fix was to use the `cu130` wheel variant.
- Running the server from inside the cloned `~/code/vllm` repo caused Python to import the source tree instead of the installed package, leading to `No module named 'vllm._C'`.
Bottom line
For a working homelab daemon, upstream-first won.
Goal
Primary objective:
- Run GLM-4.7-Flash locally in a stable, reusable daemon on DGX Spark.
Secondary objective:
- Capture the failed paths and lessons learned so future setup is fast and less error-prone.
Environment assumptions
- Hardware: NVIDIA DGX Spark / GB10
- Compute capability reported to software: 12.1
- Desired user-level stack: native host install, not an aging container image
- Model path used during successful test:
/models/huggingface/GLM-4.7-Flash/
Final known-good stack
Python
Use Python 3.12.
Reason:
- This matched the successful run.
- It aligns better with vLLM’s documented install examples than Python 3.13 did.
vLLM
Use nightly vLLM for CUDA 13.0 (cu130).
Reason:
- Default nightly is not necessarily the CUDA variant you want.
- The working install needed the explicit CUDA 13.0 wheel line.
Transformers
Use Transformers from source.
Reason:
- GLM-4.7 / GLM-4.7-Flash support is aligned with newer upstream behavior.
- This matched upstream guidance and the successful path.
Execution style
Run the server from outside the cloned vllm repo unless doing an intentional editable/source install.
Reason:
- Avoid Python import shadowing between local source tree and installed wheel.
Known-good setup procedure
1) Create a clean environment
cd ~/code/vllm
rm -rf .venv
uv venv --python 3.12 .venv
source .venv/bin/activate
unset LD_LIBRARY_PATH
unset PYTHONPATH
unset VLLM_USE_FLASHINFER_MXFP4_MOE
2) Install vLLM nightly for CUDA 13.0
uv pip install -U pip setuptools wheel
uv pip install -U vllm \
--torch-backend=cu130 \
--extra-index-url https://wheels.vllm.ai/nightly/cu130
3) Install Transformers from source
uv pip install -U git+https://github.com/huggingface/transformers.git
4) Verify package resolution
Run this from outside the repo root, e.g. from ~:
cd ~
source ~/code/vllm/.venv/bin/activate
python - <<'PY'
import sys, transformers, tokenizers, vllm
print("python:", sys.version)
print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)
print("vllm:", vllm.__version__)
print("vllm file:", vllm.__file__)
PY
Expected behavior:
- `vllm.__file__` should point into `.venv/lib/.../site-packages/vllm/...`
- It should not point into `~/code/vllm/vllm/...`
5) Start the server
vllm serve /models/huggingface/GLM-4.7-Flash/ \
--host 0.0.0.0 \
--port 8000 \
--served-model-name glm-4.7-flash \
--dtype bfloat16 \
--gpu-memory-utilization 0.60 \
--reasoning-parser glm45 \
--generation-config vllm
6) Smoke test the server
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "glm-4.7-flash",
"messages": [
{"role": "user", "content": "Reply with exactly: native upstream works"}
],
"temperature": 0
}'
Expected good behavior:
- HTTP 200
- `choices[0].message.content` is clean final output
- reasoning appears in `choices[0].message.reasoning`, not inside `content`
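To check the split programmatically instead of eyeballing the curl output, a small client-side helper can pull the two fields apart. The sample payload below is illustrative, not a captured server response; only the field names match the expectations above.

```python
# Split a chat-completions response into visible answer vs. reasoning.
import json

def split_answer(raw: str):
    msg = json.loads(raw)["choices"][0]["message"]
    return msg.get("content"), msg.get("reasoning")

# Illustrative payload shaped like the server's response, not a real capture.
sample = json.dumps({
    "choices": [{"message": {
        "content": "native upstream works",
        "reasoning": "the user asked for an exact echo",
    }}]
})

content, reasoning = split_answer(sample)
print(content)    # the clean final answer only
print(reasoning)  # reasoning stays out of content when the parser is enabled
```

With `--reasoning-parser glm45` enabled, `reasoning` is populated and `content` stays clean; without it, everything lands in `content`.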
Full setup script
#!/usr/bin/env bash
set -euo pipefail
VLLM_REPO_DIR="${HOME}/code/vllm"
VENV_DIR="${VLLM_REPO_DIR}/.venv"
MODEL_DIR="/models/huggingface/GLM-4.7-Flash/"
HOST="0.0.0.0"
PORT="8000"
MODEL_NAME="glm-4.7-flash"
GPU_UTIL="0.60"
cd "${VLLM_REPO_DIR}"
rm -rf "${VENV_DIR}"
uv venv --python 3.12 "${VENV_DIR}"
source "${VENV_DIR}/bin/activate"
unset LD_LIBRARY_PATH
unset PYTHONPATH
unset VLLM_USE_FLASHINFER_MXFP4_MOE
uv pip install -U pip setuptools wheel
uv pip install -U vllm \
--torch-backend=cu130 \
--extra-index-url https://wheels.vllm.ai/nightly/cu130
uv pip install -U git+https://github.com/huggingface/transformers.git
cd "${HOME}"
python - <<'PY'
import sys, transformers, tokenizers, vllm
print("python:", sys.version)
print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)
print("vllm:", vllm.__version__)
print("vllm file:", vllm.__file__)
PY
exec vllm serve "${MODEL_DIR}" \
--host "${HOST}" \
--port "${PORT}" \
--served-model-name "${MODEL_NAME}" \
--dtype bfloat16 \
--gpu-memory-utilization "${GPU_UTIL}" \
--reasoning-parser glm45 \
--generation-config vllm
What we learned from the failed paths
1) Older helper repos can be useful, but they freeze old assumptions
The DGX Spark helper repo was still valuable because it exposed several real Spark/Blackwell issues:
- FlashInfer packaging metadata breakage
- missing Blackwell-related CMake arch coverage
- older environment assumptions
But it was not the best final serving solution for GLM-4.7-Flash.
Lesson
Use old setup repos as debug references, not necessarily as the long-term architecture.
2) FlashInfer packaging can fail for boring reasons
One major failure had nothing to do with GB10 itself.
flashinfer-python==0.4.1 failed to build because of a packaging metadata issue in pyproject.toml.
Symptom
- build failure complaining about `project.license`
- newer setuptools rejected the legacy single-string form
Lesson
Not every “Spark build failure” is a GPU problem. Some are just Python packaging drift.
3) Partial architecture patches are worse than obvious failure
The older source build eventually imported only after fixing an incomplete CMake arch entry.
Symptom
- `ImportError` on `vllm._C.abi3.so`
- unresolved symbol related to `cutlass_moe_mm_sm100...`
Cause
A `cuda_archs_loose_intersection` line still missed `12.0f`.
Lesson
For Blackwell-family support, a patch that lands in two places but misses a third can create extremely misleading outcomes.
4) Old vLLM + new Transformers is an easy compatibility trap
An older vLLM snapshot combined with newer Transformers led to tokenizer failures.
Symptom
- tokenizer backend missing `all_special_tokens_extended`
Lesson
If serving a new model family, it is often safer to follow the model’s current upstream instructions than to fight an older snapshot into shape.
5) The CUDA wheel variant matters
Using a general nightly wheel was not enough.
Symptom
ImportError: libcudart.so.12: cannot open shared object file
Cause
Wrong prebuilt runtime expectations for the host stack.
Fix
Install the explicit cu130 vLLM nightly variant.
Lesson
When a project publishes per-CUDA wheel lines, pick the one that matches the host/runtime target instead of trusting the generic default.
6) Running from inside the repo can sabotage a perfectly fine install
This was one of the more subtle but important issues.
Symptom
- the CLI looked installed correctly
- a subprocess imported local source files
- then failed with `No module named 'vllm._C'`
Cause
Python path shadowing from the cloned source tree.
Lesson
When using prebuilt wheels, launch from outside the repo or explicitly verify `vllm.__file__` before troubleshooting anything deeper.
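The failure mode is easy to reproduce with a throwaway package name. `shadowdemo` below is hypothetical, but the mechanics are exactly what bit the `vllm` import here: a pure-Python source tree on `sys.path` wins over the installed wheel, and the compiled extension is nowhere to be found.

```python
# Reproduce import shadowing with a hypothetical package name ("shadowdemo").
import importlib
import os
import sys
import tempfile

workdir = tempfile.mkdtemp()
pkg = os.path.join(workdir, "shadowdemo")
os.makedirs(pkg)
# A pure-Python source tree: it has an __init__.py but no compiled _C extension.
open(os.path.join(pkg, "__init__.py"), "w").close()

# Running from inside a repo effectively does this: the source tree comes first.
sys.path.insert(0, workdir)

import shadowdemo
print("resolves to:", shadowdemo.__file__)  # the source tree, not site-packages

try:
    importlib.import_module("shadowdemo._C")
except ModuleNotFoundError as e:
    print("import failed:", e)  # analogous to: No module named 'vllm._C'
```

The same check in reverse is the diagnostic: if `vllm.__file__` points into the clone instead of `site-packages`, no amount of reinstalling will fix it; changing directories will.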
7) Reasoning parser makes the output much cleaner
Without a reasoning parser, GLM-4.7-Flash placed reasoning markup into the visible response body.
Symptom
- numbered internal analysis text in `content`
- literal `</think>` artifacts
Fix
Use:
--reasoning-parser glm45
Result
- visible final answer stayed in `content`
- reasoning moved to the structured `reasoning` field
Lesson
For this model, parser configuration is not cosmetic; it materially improves API cleanliness.
Performance and operational notes
Startup cost
First startup is slow and heavy.
Observed behavior:
- weights load: about 73 seconds
- model load total: about 76 seconds
- engine init / profiling / warmup / graph capture: about 141 seconds total
- CUDA graph capture alone: about 123 seconds
Implication
This is acceptable for a daemon, but unpleasant for rapid restart loops.
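Summing the figures above (assuming the load and init phases run sequentially, which matches the logs here) gives the rough cost of every restart:

```python
# Rough time-to-ready from the observed startup figures above.
model_load_s = 76    # weights + model load
engine_init_s = 141  # engine init / profiling / warmup / graph capture
total_s = model_load_s + engine_init_s
print(f"~{total_s} s (~{total_s / 60:.1f} min) before the daemon serves traffic")
```

Roughly three and a half minutes per restart is why this setup wants to run as a long-lived daemon rather than being relaunched per experiment.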
Throughput observations
Observed in this run:
- prompt throughput around 1.2 tokens/s
- generation throughput around 16.5 tokens/s
This is not fast, but it is operational.
Lesson
Treat DGX Spark + GLM-4.7-Flash as a usable local serving target, not a speed demon.
MoE config fallback
vLLM reported:
Using default MoE config. Performance might be sub-optimal!
Config file not found at ... device_name=NVIDIA_GB10.json
Implication
There may be future gains available if vLLM gets a GB10-specific MoE tuning profile.
Recommendation
Record this as an optimization opportunity, not an immediate blocker.
PyTorch capability warning
PyTorch warned that the GB10 is compute capability 12.1, while the build claimed support up to 12.0.
Reality check
The stack still worked.
Recommendation
Do not panic about this if serving is stable. Treat it as a future optimization / support-maturity note.
Suggested next-step experiments
Now that the daemon works, test changes one at a time.
Safe next steps
- Keep the current known-good command as baseline.
- Try a reduced max context if the full 202k tokens is unnecessary.
- Validate behavior from your real client.
- Test whether your client can use or ignore the `reasoning` field cleanly.
Higher-risk later experiments
- `--max-model-len 131072`
- MTP / speculative decoding
- tool-calling flags
- long-context tuning
- source build from current upstream repo if you later want to patch or optimize deeply
Recommended “known good” launch command
Stable baseline
vllm serve /models/huggingface/GLM-4.7-Flash/ \
--host 0.0.0.0 \
--port 8000 \
--served-model-name glm-4.7-flash \
--dtype bfloat16 \
--gpu-memory-utilization 0.60 \
--reasoning-parser glm45 \
--generation-config vllm
Expanded command that also worked
This later command was also verified to start successfully on the same stack:
vllm serve /models/huggingface/GLM-4.7-Flash/ \
--host 0.0.0.0 \
--port 8000 \
--served-model-name glm-4.7-flash \
--dtype bfloat16 \
--gpu-memory-utilization 0.60 \
--max-model-len 131072 \
--generation-config vllm \
--speculative-config.method mtp \
--speculative-config.num_speculative_tokens 1 \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--enable-auto-tool-choice
Notes on the expanded command
Observed behavior:
- `--tool-call-parser glm47` worked on the newer upstream/nightly stack.
- `--enable-auto-tool-choice` initialized successfully.
- MTP speculative decoding loaded a drafter model and reported useful acceptance metrics.
- `--max-model-len 131072` reduced the advertised maximum context compared with the larger default and improved concurrency headroom.
Observed caveats:
- startup was heavier than the baseline due to drafter load plus graph capture
- vLLM warned that `min_p` and `logit_bias` do not work with speculative decoding
- compile and graph-capture overhead increased noticeably
Performance notes for the expanded command
Observed in this run:
- total model load memory around 57.14 GiB
- initial engine/profile/warmup/init time around 167.60 seconds
- CUDA graph capture took about 121 seconds
- available KV cache memory about 12.6 GiB
- GPU KV cache size about 244,640 tokens
- max concurrency for 131,072-token requests about 1.87x
- speculative decoding acceptance rate about 87.4%
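Two of those figures can be cross-checked with back-of-envelope arithmetic: the concurrency number is just KV-cache capacity divided by per-request context length, and with one speculative token the expected tokens per engine step is 1 plus the acceptance rate (a standard speculative-decoding estimate, not something vLLM prints directly):

```python
# Cross-check the reported figures against each other.
kv_cache_tokens = 244_640  # GPU KV cache size in tokens
max_model_len = 131_072    # per-request context length
print(f"max concurrency: {kv_cache_tokens / max_model_len:.2f}x")  # matches ~1.87x

# With num_speculative_tokens=1, each step yields the verified token plus the
# draft token when accepted, so expected tokens per step is 1 + acceptance rate.
acceptance = 0.874
print(f"expected tokens/step: {1 + acceptance:.3f}")
```

The two reported numbers are consistent with each other, which is a quick way to confirm the log lines were read correctly.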
Practical interpretation
The newer upstream stack is not only able to serve GLM-4.7-Flash cleanly, it can also support the more advanced GLM-oriented flags that originally failed on the older setup.
Final conclusion
The most important conclusion from this whole effort is simple:
A clean, upstream-first native install works better for GLM-4.7-Flash on DGX Spark than trying to preserve an older patch-heavy setup.
The old path was still useful because it exposed real issues and taught useful lessons, but the successful operational solution was:
- fresh environment
- Python 3.12
- explicit `cu130` nightly vLLM
- Transformers from source
- careful path hygiene
- the `glm45` reasoning parser
That is the secret sauce.