Instructions to use WizardLMTeam/WizardCoder-Python-34B-V1.0 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use WizardLMTeam/WizardCoder-Python-34B-V1.0 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="WizardLMTeam/WizardCoder-Python-34B-V1.0")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("WizardLMTeam/WizardCoder-Python-34B-V1.0")
model = AutoModelForCausalLM.from_pretrained("WizardLMTeam/WizardCoder-Python-34B-V1.0")
```
- Notebooks
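WizardCoder checkpoints are instruction-tuned, so wrapping a request in an instruction template generally works better than feeding raw text to the pipeline. A minimal sketch, assuming the Alpaca-style template commonly used with WizardCoder (verify the exact wording against the model card; `build_prompt` is an illustrative helper, not a library function):

```python
# Alpaca-style instruction template commonly used with WizardCoder models
# (assumption -- check the model card for the exact format).
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def build_prompt(instruction: str) -> str:
    """Wrap a plain request in the instruction template."""
    return PROMPT_TEMPLATE.format(instruction=instruction)

prompt = build_prompt("Write a Python function that reverses a string.")
print(prompt)
```

The resulting string would then be passed to the pipeline, e.g. `pipe(prompt, max_new_tokens=256)`.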
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use WizardLMTeam/WizardCoder-Python-34B-V1.0 with vLLM:
Install from pip and serve the model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "WizardLMTeam/WizardCoder-Python-34B-V1.0"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "WizardLMTeam/WizardCoder-Python-34B-V1.0",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
```sh
docker model run hf.co/WizardLMTeam/WizardCoder-Python-34B-V1.0
```
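The curl call above can also be reproduced from Python with only the standard library. A sketch, assuming the vLLM server from the previous step is listening on localhost:8000 (`build_completion_payload` and `post_completion` are illustrative helpers):

```python
import json
import urllib.request

def build_completion_payload(model: str, prompt: str,
                             max_tokens: int = 512,
                             temperature: float = 0.5) -> dict:
    """Mirror the JSON body of the curl example above."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def post_completion(base_url: str, payload: dict) -> dict:
    """POST to the OpenAI-compatible /v1/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_completion_payload(
    "WizardLMTeam/WizardCoder-Python-34B-V1.0", "Once upon a time,")
# With a running server, uncomment:
# result = post_completion("http://localhost:8000", payload)
# print(result["choices"][0]["text"])
```

Because the API is OpenAI-compatible, the same request shape works against the SGLang server below (port 30000 instead of 8000).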
- SGLang
How to use WizardLMTeam/WizardCoder-Python-34B-V1.0 with SGLang:
Install from pip and serve the model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "WizardLMTeam/WizardCoder-Python-34B-V1.0" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "WizardLMTeam/WizardCoder-Python-34B-V1.0",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "WizardLMTeam/WizardCoder-Python-34B-V1.0" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "WizardLMTeam/WizardCoder-Python-34B-V1.0",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use WizardLMTeam/WizardCoder-Python-34B-V1.0 with Docker Model Runner:
```sh
docker model run hf.co/WizardLMTeam/WizardCoder-Python-34B-V1.0
```
System requirements?
What are the system requirements to run this model, and how can I find them?
From reading the config, this is a float16 model. Using the Model Memory Estimator (https://huggingface.co/spaces/hf-accelerate/model-memory-usage), it provides the following specs for WizardCoder 34B (as an LLM):
| dtype | Largest Layer or Residual Group | Total Size | Training using Adam |
| --- | --- | --- | --- |
| float32 | 2.59 GB | 125.48 GB | 501.92 GB |
| int8 | 664.02 MB | 31.37 GB | 125.48 GB |
| float16/bfloat16 | 1.3 GB | 62.74 GB | 250.96 GB |
| int4 | 332.01 MB | 15.68 GB | 62.74 GB |
So, if you pull this down, you'll need about 63 GB of RAM just to hold the weights. I would love to quantize this to int8 so it could fit on a 4090 or A6000, but I don't know how right now.
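The table's totals are consistent with simple back-of-envelope math: parameter count times bytes per parameter, with Adam training roughly 4x the weights (gradients plus two optimizer moments). A sketch, assuming roughly 33.7B parameters (the CodeLlama-34B base size; the exact count comes from the model config):

```python
# Back-of-envelope memory estimate: params x bytes per parameter,
# reported in GiB as the Model Memory Estimator does.
PARAMS = 33.7e9  # approximate parameter count for a 34B model (assumption)

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}

def inference_gib(dtype: str, params: float = PARAMS) -> float:
    """Weights-only footprint; runtime adds activations and KV cache."""
    return params * BYTES_PER_PARAM[dtype] / 2**30

for dtype in BYTES_PER_PARAM:
    # Adam training is roughly 4x the same-dtype weights
    # (weights + gradients + two optimizer moments).
    print(f"{dtype}: {inference_gib(dtype):.1f} GiB weights, "
          f"{4 * inference_gib(dtype):.1f} GiB Adam training")
```

This reproduces the table to within a fraction of a GiB (e.g. ~62.8 GiB for float16 vs. 62.74 reported), so the 63 GB figure above is just the float16 weights alone.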
I am able to run it on an M1 Max with 64GB. Not super fast, but it works:
```
llama_print_timings: sample time = 1804.24 ms / 729 runs ( 2.47 ms per token, 404.05 tokens per second)
llama_print_timings: prompt eval time = 3652.04 ms / 144 tokens ( 25.36 ms per token, 39.43 tokens per second)
llama_print_timings: eval time = 94289.78 ms / 728 runs ( 129.52 ms per token, 7.72 tokens per second)
llama_print_timings: total time = 100932.23 ms
Output generated in 101.16 seconds (7.20 tokens/s, 728 tokens, context 144, seed 1690939106)

Llama.generate: prefix-match hit
llama_print_timings: load time = 3652.09 ms
llama_print_timings: sample time = 2548.89 ms / 1024 runs ( 2.49 ms per token, 401.74 tokens per second)
llama_print_timings: prompt eval time = 13158.02 ms / 751 tokens ( 17.52 ms per token, 57.08 tokens per second)
llama_print_timings: eval time = 141916.85 ms / 1023 runs ( 138.73 ms per token, 7.21 tokens per second)
llama_print_timings: total time = 159473.00 ms
Output generated in 159.71 seconds (6.41 tokens/s, 1024 tokens, context 886, seed 1686911609)

Llama.generate: prefix-match hit
llama_print_timings: load time = 3652.09 ms
llama_print_timings: sample time = 694.30 ms / 276 runs ( 2.52 ms per token, 397.52 tokens per second)
llama_print_timings: prompt eval time = 19746.02 ms / 1023 tokens ( 19.30 ms per token, 51.81 tokens per second)
llama_print_timings: eval time = 43975.35 ms / 275 runs ( 159.91 ms per token, 6.25 tokens per second)
llama_print_timings: total time = 64842.96 ms
Output generated in 65.07 seconds (4.23 tokens/s, 275 tokens, context 1909, seed 828516400)
```
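The reported rates can be sanity-checked directly from the timings: tokens divided by eval time gives the pure generation speed, while tokens divided by total wall time gives the lower "Output generated" figure, which also absorbs prompt eval and sampling. A quick check against the first run above:

```python
# Numbers copied from the first llama.cpp run in the logs above.
eval_ms, eval_tokens = 94289.78, 728   # "eval time" line
total_s, total_tokens = 101.16, 728    # "Output generated" line

eval_rate = eval_tokens / (eval_ms / 1000)  # pure generation speed
overall_rate = total_tokens / total_s       # includes prompt eval + sampling

print(f"eval: {eval_rate:.2f} tok/s, overall: {overall_rate:.2f} tok/s")
# Matches the logged 7.72 tok/s eval and 7.20 tok/s overall.
```

The gap widens in the later runs as the context grows, since more prompt tokens must be evaluated before generation starts.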