Instructions to use WizardLMTeam/WizardCoder-Python-34B-V1.0 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use WizardLMTeam/WizardCoder-Python-34B-V1.0 with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="WizardLMTeam/WizardCoder-Python-34B-V1.0")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("WizardLMTeam/WizardCoder-Python-34B-V1.0")
model = AutoModelForCausalLM.from_pretrained("WizardLMTeam/WizardCoder-Python-34B-V1.0")
```
- Notebooks
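WizardCoder checkpoints are instruction-tuned, so wrapping a request in an instruction template generally works better than feeding raw text to the pipeline. A minimal sketch, assuming the Alpaca-style template commonly used with WizardCoder (verify the exact wording against the model card; `build_prompt` is an illustrative helper, not a library function):

```python
# Alpaca-style instruction template commonly used with WizardCoder models
# (assumption -- check the model card for the exact format).
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def build_prompt(instruction: str) -> str:
    """Wrap a plain request in the instruction template."""
    return PROMPT_TEMPLATE.format(instruction=instruction)

prompt = build_prompt("Write a Python function that reverses a string.")
print(prompt)
```

The resulting string would then be passed to the pipeline, e.g. `pipe(prompt, max_new_tokens=256)`.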
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use WizardLMTeam/WizardCoder-Python-34B-V1.0 with vLLM:
Install from pip and serve the model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "WizardLMTeam/WizardCoder-Python-34B-V1.0"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "WizardLMTeam/WizardCoder-Python-34B-V1.0",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker
```sh
docker model run hf.co/WizardLMTeam/WizardCoder-Python-34B-V1.0
```
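The curl call above can also be reproduced from Python with only the standard library. A sketch, assuming the vLLM server from the previous step is listening on localhost:8000 (`build_completion_payload` and `post_completion` are illustrative helpers):

```python
import json
import urllib.request

def build_completion_payload(model: str, prompt: str,
                             max_tokens: int = 512,
                             temperature: float = 0.5) -> dict:
    """Mirror the JSON body of the curl example above."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def post_completion(base_url: str, payload: dict) -> dict:
    """POST to the OpenAI-compatible /v1/completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_completion_payload(
    "WizardLMTeam/WizardCoder-Python-34B-V1.0", "Once upon a time,")
# With a running server, uncomment:
# result = post_completion("http://localhost:8000", payload)
# print(result["choices"][0]["text"])
```

Because the API is OpenAI-compatible, the same request shape works against the SGLang server below (port 30000 instead of 8000).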
- SGLang
How to use WizardLMTeam/WizardCoder-Python-34B-V1.0 with SGLang:
Install from pip and serve the model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "WizardLMTeam/WizardCoder-Python-34B-V1.0" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "WizardLMTeam/WizardCoder-Python-34B-V1.0",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "WizardLMTeam/WizardCoder-Python-34B-V1.0" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "WizardLMTeam/WizardCoder-Python-34B-V1.0",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
- Docker Model Runner
How to use WizardLMTeam/WizardCoder-Python-34B-V1.0 with Docker Model Runner:
```sh
docker model run hf.co/WizardLMTeam/WizardCoder-Python-34B-V1.0
```
System requirements?
What are the system requirements to run this model, and how can I find them?
From reading the config, this is a float16 model. Using the Model Memory Estimator (https://huggingface.co/spaces/hf-accelerate/model-memory-usage), it provides the following specs for WizardCoder 34B (as an LLM):
| dtype | Largest Layer or Residual Group | Total Size | Training using Adam |
| --- | --- | --- | --- |
| float32 | 2.59 GB | 125.48 GB | 501.92 GB |
| int8 | 664.02 MB | 31.37 GB | 125.48 GB |
| float16/bfloat16 | 1.3 GB | 62.74 GB | 250.96 GB |
| int4 | 332.01 MB | 15.68 GB | 62.74 GB |
So, if you pull this down, you'll need about 63 GB of RAM just to hold the weights. I would love to quantize this to int8 so it could fit on a 4090 or A6000, but I don't know how right now.
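The table's totals are consistent with simple back-of-envelope math: parameter count times bytes per parameter, with Adam training roughly 4x the weights (gradients plus two optimizer moments). A sketch, assuming roughly 33.7B parameters (the CodeLlama-34B base size; the exact count comes from the model config):

```python
# Back-of-envelope memory estimate: params x bytes per parameter,
# reported in GiB as the Model Memory Estimator does.
PARAMS = 33.7e9  # approximate parameter count for a 34B model (assumption)

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "int4": 0.5}

def inference_gib(dtype: str, params: float = PARAMS) -> float:
    """Weights-only footprint; runtime adds activations and KV cache."""
    return params * BYTES_PER_PARAM[dtype] / 2**30

for dtype in BYTES_PER_PARAM:
    # Adam training is roughly 4x the same-dtype weights
    # (weights + gradients + two optimizer moments).
    print(f"{dtype}: {inference_gib(dtype):.1f} GiB weights, "
          f"{4 * inference_gib(dtype):.1f} GiB Adam training")
```

This reproduces the table to within a fraction of a GiB (e.g. ~62.8 GiB for float16 vs. 62.74 reported), so the 63 GB figure above is just the float16 weights alone.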
I am able to run it on an M1 Max with 64GB. Not super fast, but it works:
```
llama_print_timings: sample time = 1804.24 ms / 729 runs ( 2.47 ms per token, 404.05 tokens per second)
llama_print_timings: prompt eval time = 3652.04 ms / 144 tokens ( 25.36 ms per token, 39.43 tokens per second)
llama_print_timings: eval time = 94289.78 ms / 728 runs ( 129.52 ms per token, 7.72 tokens per second)
llama_print_timings: total time = 100932.23 ms
Output generated in 101.16 seconds (7.20 tokens/s, 728 tokens, context 144, seed 1690939106)

Llama.generate: prefix-match hit
llama_print_timings: load time = 3652.09 ms
llama_print_timings: sample time = 2548.89 ms / 1024 runs ( 2.49 ms per token, 401.74 tokens per second)
llama_print_timings: prompt eval time = 13158.02 ms / 751 tokens ( 17.52 ms per token, 57.08 tokens per second)
llama_print_timings: eval time = 141916.85 ms / 1023 runs ( 138.73 ms per token, 7.21 tokens per second)
llama_print_timings: total time = 159473.00 ms
Output generated in 159.71 seconds (6.41 tokens/s, 1024 tokens, context 886, seed 1686911609)

Llama.generate: prefix-match hit
llama_print_timings: load time = 3652.09 ms
llama_print_timings: sample time = 694.30 ms / 276 runs ( 2.52 ms per token, 397.52 tokens per second)
llama_print_timings: prompt eval time = 19746.02 ms / 1023 tokens ( 19.30 ms per token, 51.81 tokens per second)
llama_print_timings: eval time = 43975.35 ms / 275 runs ( 159.91 ms per token, 6.25 tokens per second)
llama_print_timings: total time = 64842.96 ms
Output generated in 65.07 seconds (4.23 tokens/s, 275 tokens, context 1909, seed 828516400)
```
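The reported rates can be sanity-checked directly from the timings: tokens divided by eval time gives the pure generation speed, while tokens divided by total wall time gives the lower "Output generated" figure, which also absorbs prompt eval and sampling. A quick check against the first run above:

```python
# Numbers copied from the first llama.cpp run in the logs above.
eval_ms, eval_tokens = 94289.78, 728   # "eval time" line
total_s, total_tokens = 101.16, 728    # "Output generated" line

eval_rate = eval_tokens / (eval_ms / 1000)  # pure generation speed
overall_rate = total_tokens / total_s       # includes prompt eval + sampling

print(f"eval: {eval_rate:.2f} tok/s, overall: {overall_rate:.2f} tok/s")
# Matches the logged 7.72 tok/s eval and 7.20 tok/s overall.
```

The gap widens in the later runs as the context grows, since more prompt tokens must be evaluated before generation starts.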