Research
GLM-4.7-Flash on DGX Spark with native upstream vLLM
Field report on getting GLM-4.7-Flash running natively on DGX Spark with upstream vLLM, including dead ends, compatibility traps, and the final known-good stack.
Question
What stack reliably serves GLM-4.7-Flash on DGX Spark using upstream vLLM instead of helper repositories or vendor-specific wrappers?
Context
The goal was not simply to serve a model. The goal was to determine which combination of Python version, CUDA wheel, transformers source, and runtime invocation actually survives real setup conditions on DGX Spark.
Experiment
Multiple installation paths were tested, including helper-repo shortcuts, FlashInfer variants, upstream nightly wheels, and different runtime entrypoints. Each failure was treated as evidence about version coupling and deployment hygiene.
Findings
The stable path used Python 3.12, the upstream vLLM nightly wheel built for CUDA 13.0, and Transformers installed from source. Launching from outside an uninstalled clone of the repository also mattered: inside the clone, the local source tree can shadow the installed wheel on import, which produced misleading failures.
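The shadowing problem can be checked before launch. The following is a minimal sketch, not part of the original setup: the helper name find_shadowing_source is hypothetical, and it only covers the case where Python puts the current directory first on sys.path (as it does for python -m and interactive sessions) and that directory contains a vllm package folder.

```python
from pathlib import Path
from typing import Optional


def find_shadowing_source(package: str, cwd: Path) -> Optional[Path]:
    """Return the local directory that could shadow `package`, if any.

    Hypothetical helper: it only looks for `<cwd>/<package>/__init__.py`,
    which is what an uninstalled repo clone typically contains.
    """
    candidate = cwd / package
    if (candidate / "__init__.py").is_file():
        return candidate
    return None


if __name__ == "__main__":
    hit = find_shadowing_source("vllm", Path.cwd())
    if hit is not None:
        # A local source tree is present; imports may resolve here
        # instead of to the installed wheel.
        print(f"warning: {hit} may shadow the installed vllm wheel")
    else:
        print("no local vllm source directory in the current directory")
```

Running this from the directory you intend to launch from gives a quick signal. A warning suggests either changing directory or installing the clone properly before trusting any import error.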
Record
Known-good outcome:
- Fresh Python 3.12 virtual environment.
- Upstream vllm nightly for cu130.
- Transformers installed from GitHub source.
- vllm launched outside the cloned repo directory unless the repo has been installed properly.
- --reasoning-parser glm45 for cleaner reasoning separation.
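The checklist above can be turned into a small preflight script. This is a rough sketch under stated assumptions: the model identifier zai-org/GLM-4.7-Flash is not given in this report and is assumed here, the helper names are hypothetical, and the script only reports what is installed and prints a launch command rather than starting the server.

```python
import sys
from importlib import metadata
from typing import Dict, List, Optional

EXPECTED_PYTHON = (3, 12)  # from the known-good list
MODEL_ID = "zai-org/GLM-4.7-Flash"  # assumption: not stated in this report


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_serve_command(model: str, reasoning_parser: str = "glm45") -> List[str]:
    """Build the argument list for `vllm serve` with the reasoning parser set."""
    return ["vllm", "serve", model, "--reasoning-parser", reasoning_parser]


if __name__ == "__main__":
    if sys.version_info[:2] != EXPECTED_PYTHON:
        found = ".".join(str(part) for part in sys.version_info[:2])
        print(f"warning: running Python {found}, known-good stack used 3.12")
    for name, version in installed_versions(["vllm", "torch", "transformers"]).items():
        print(f"{name}: {version or 'not installed'}")
    # Printed only; run it from a directory that is not the repo clone.
    print(" ".join(build_serve_command(MODEL_ID)))
```

Recording the printed versions alongside each attempt makes it easier to tell which combination was actually in effect when a failure occurred.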
Most failed attempts were not random. They exposed packaging assumptions, repo shadowing, or version boundaries that were easy to miss until exercised directly.