GLM-4.7-Flash on DGX Spark with native upstream vLLM

Field report on getting GLM-4.7-Flash running natively on DGX Spark with upstream vLLM, including dead ends, compatibility traps, and the final known-good stack.

Question

What stack reliably serves GLM-4.7-Flash on DGX Spark using upstream vLLM instead of helper repositories or vendor-specific wrappers?

Context

The goal was not simply to serve a model, but to determine which combination of Python version, CUDA wheel, Transformers source, and runtime invocation actually survives real setup conditions on DGX Spark.

Experiment

Multiple installation paths were tested, including helper-repo shortcuts, FlashInfer variants, upstream nightly wheels, and different runtime entrypoints. Each failure was treated as evidence about version coupling and deployment hygiene.

Findings

The stable path used Python 3.12, the upstream vLLM nightly wheel for CUDA 13.0 (cu130), and Transformers installed from source. The working directory also mattered: when vLLM was run from inside a repo clone that had not been installed, the local source tree could shadow the installed package and produce misleading failures.
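The shadowing problem can be checked before launch. The following is a minimal sketch, not part of the original setup: the helper names resolve_origin and is_shadowed are hypothetical, and the "site-packages" test is a heuristic that assumes a conventional virtual environment layout. It asks the interpreter where vllm would be loaded from, without importing it, and flags the case where that location sits under the current working directory.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def resolve_origin(module_name: str) -> Optional[Path]:
    """Return the directory a top-level module would load from, without importing it."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None  # not importable from this interpreter
    if spec.origin and spec.origin not in ("built-in", "frozen"):
        return Path(spec.origin).resolve().parent
    # Namespace packages have no origin, only search locations.
    locations = list(spec.submodule_search_locations or [])
    return Path(locations[0]).resolve() if locations else None


def is_shadowed(origin: Optional[Path], cwd: Path) -> bool:
    """True when the module resolves to source under cwd rather than an installed copy."""
    if origin is None:
        return False
    # Heuristic (assumption): a venv created inside cwd is still an installed copy.
    if "site-packages" in origin.parts:
        return False
    return origin.resolve().is_relative_to(cwd.resolve())


if __name__ == "__main__":
    origin = resolve_origin("vllm")
    if is_shadowed(origin, Path.cwd()):
        print(f"vllm resolves to local source: {origin}")
    else:
        print(f"vllm resolves to: {origin}")
```

The result depends on how the interpreter is started, since python -m and python -c put the working directory on the import path while a console script does not. Running the check the same way the server will be launched gives the most representative answer. An editable install of the clone would also be flagged, which is expected and harmless.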

Record

Known-good outcome:

  • Fresh Python 3.12 virtual environment.
  • Upstream vLLM nightly wheel for cu130 (CUDA 13.0).
  • Transformers installed from GitHub source.
  • vllm launched outside the cloned repo directory unless the repo has been installed properly.
  • --reasoning-parser glm45 for cleaner separation of reasoning output from the final answer.
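The checklist above can be sketched as a small preflight script. This is an illustration under stated assumptions, not the exact procedure used in the report: the function names are hypothetical, MODEL is a placeholder to be replaced with the actual model path or hub id, and the script only reports versions and builds the command rather than asserting specific nightly builds. It relies on vllm serve and the --reasoning-parser flag named in the list.

```python
import sys
from importlib import metadata

REQUIRED_PYTHON = (3, 12)            # from the known-good list
PACKAGES = ("vllm", "transformers")  # distributions worth recording
MODEL = "PATH_OR_HUB_ID_OF_GLM_4_7_FLASH"  # placeholder (assumption): fill in locally


def python_ok(version_info=sys.version_info) -> bool:
    """True when the interpreter's major.minor matches the known-good version."""
    return tuple(version_info[:2]) == REQUIRED_PYTHON


def installed_versions(packages=PACKAGES) -> dict:
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_serve_command(model: str, reasoning_parser: str = "glm45") -> list:
    """Build the argv list for launching the server with the GLM reasoning parser."""
    return ["vllm", "serve", model, "--reasoning-parser", reasoning_parser]


if __name__ == "__main__":
    print("python 3.12:", python_ok())
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
    print("command:", " ".join(build_serve_command(MODEL)))
```

The command is returned as a list so it can be handed to subprocess.run without shell quoting. Passing cwd= with a directory outside the repo clone to subprocess.run is one way to honor the launch-location item in the list.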

Most failed attempts were not random. They exposed packaging assumptions, repo shadowing, or version boundaries that were easy to miss until exercised directly.