The Simplest Intro to Hosting Your Own Model
If you download model weights, you still need a serving engine that loads the model and exposes an HTTP API. Two beginner-friendly choices are vLLM and SGLang. Both can present an OpenAI-compatible endpoint, so your app or a curl command can call them just like OpenAI. You can get weights from Hugging Face, ModelScope, or the official vendor portal for models like Llama. ModelScope is a popular hub similar to Hugging Face, and Meta lets you request and download Llama weights from its own site.
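For example, here is a minimal sketch of pulling Transformers-format weights from Hugging Face in Python. It assumes you have already requested access to the gated Llama repo on its model page and have an access token:
# a minimal sketch: download model weights from Hugging Face
# assumes the Llama license was accepted on the model page and a token is available
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",  # gated repo; request access first
    token="hf_...",  # placeholder; or set the HF_TOKEN environment variable and omit this
)
print(path)  # local directory with config, tokenizer, and .safetensors shards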
An important note about file formats: vLLM and SGLang expect Transformers-format weights such as safetensors. GGUF files are for llama.cpp and Ollama and do not load in vLLM or SGLang.
About endpoints: vLLM and SGLang both provide an OpenAI-compatible HTTP server. If you need a different API shape, you can write a small wrapper around their native Python APIs. In practice you normally run one API style per server process.
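Once a server like that is running (full commands appear later in this post), the standard OpenAI Python client can point at it. A minimal sketch, assuming vLLM's default port 8000 and no API key configured on the server:
# a minimal sketch: call a local OpenAI-compatible server with the openai client
# assumes the server runs on localhost:8000 and enforces no API key
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "write a poem"}],
)
print(resp.choices[0].message.content)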
Llama and API Compatibility
Llama is just a model and its weights. It has no built-in HTTP API and is not inherently OpenAI-compatible. Compatibility comes from the serving framework or cloud service you choose.
Meta-hosted option
- Meta's Llama API provides its own native API and also works with OpenAI-style client libraries.
Self-hosting compatibility
- vLLM ships an OpenAI-compatible HTTP server. If you want a different API shape, wrap its Python engine with your own small service. (It does not include Meta's native Llama API.)
- SGLang offers both an OpenAI-compatible endpoint and a native SGLang program model, and the two can coexist in one deployment. (It is not Meta Llama's native API.)
- Other choices include Hugging Face TGI (native HTTP and gRPC with an optional OpenAI-compatible layer) and Ollama (its own REST API plus some OpenAI-style routes; see the sketch after this list).
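To give a taste of that last option, Ollama's OpenAI-style routes work with the same client as before. A minimal sketch, assuming Ollama is running locally on its default port and the model was pulled with ollama pull llama3.1:
# a minimal sketch: call Ollama's OpenAI-style route
# assumes Ollama's default port 11434 and a pulled llama3.1 model
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is a placeholder
resp = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "hello"}],
)
print(resp.choices[0].message.content)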
Can both endpoint styles run at the same time?
- It depends on the framework. SGLang and TGI can serve native and OpenAI-compatible endpoints together.
- vLLM usually exposes only the OpenAI-compatible server. To have both, run a second process on another port or add a lightweight adapter service around its Python API.
Takeaway
- Llama itself is not OpenAI-compatible. Meta's hosted API supports both styles. With self-hosting, tools like vLLM, SGLang, TGI, and Ollama can give you OpenAI-compatible endpoints, and some also offer their own native routes.
Example: serve Llama with vLLM or SGLang
# vLLM install and serve
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--dtype bfloat16 \
--max-model-len 8192 \
--port 8000
# Call it with the OpenAI style Chat Completions route
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages":[{"role":"user","content":"write a poem"}],
"temperature": 0.7
}'
# SGLang install and serve
pip install "sglang[all]"
python -m sglang.launch_server \
--model-path meta-llama/Llama-3.1-8B-Instruct \
--port 30000 \
--tp 1 \
--trust-remote-code
# Call the SGLang server with the same route
curl http://localhost:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages":[{"role":"user","content":"tell me a joke"}]
}'
vLLM documents its OpenAI-compatible server under the serve and api_server entrypoints. SGLang documents an OpenAI-compatible server as well, and it also offers a native programming model for more complex templates.
Summary and common gotchas
- Use Instruct variants for chat. Non-Instruct weights often need a manual chat template (see the second sketch after this list).
- Make sure the tokenizer matches the weights, or loading will fail or the generated text will look wrong.
- Use Transformers-format weights for vLLM and SGLang; GGUF is for llama.cpp or Ollama.
- If memory is tight, lower the max model length or use a quantization scheme your server supports.
- Multi-GPU serving needs tensor-parallel flags in each server and a fast interconnect.
- For production, add authentication and place a reverse proxy in front of the port.
- If you really need non-OpenAI routes, build a small FastAPI or Flask wrapper that calls the vLLM or SGLang Python API (sketched right after this list). Keep it as a separate process if you want it to coexist with the OpenAI-compatible server.
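On that last point, here is a minimal sketch of such a wrapper around vLLM's Python API. The /generate route and its request fields are our own invention for illustration, not a vLLM convention; run it as its own process, for example with uvicorn:
# a minimal sketch: a custom non-OpenAI route wrapping vLLM's Python API
# run separately, e.g. uvicorn wrapper:app --port 9000
from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # loads the model once at startup

class GenerateRequest(BaseModel):  # our own request shape, not a standard
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.post("/generate")  # hypothetical route name
def generate(req: GenerateRequest):
    params = SamplingParams(temperature=req.temperature, max_tokens=req.max_tokens)
    outputs = llm.generate([req.prompt], params)
    return {"text": outputs[0].outputs[0].text}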
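And on the first gotcha, you can see what a chat template actually produces by rendering one manually with the transformers tokenizer. This is roughly what an OpenAI-compatible server does for you behind the chat route:
# a minimal sketch: render a chat template by hand with transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [{"role": "user", "content": "write a poem"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the raw prompt string the model is actually fed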