Instructions for using bigcode/starcoder with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use bigcode/starcoder with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="bigcode/starcoder")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use bigcode/starcoder with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "bigcode/starcoder"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker
docker model run hf.co/bigcode/starcoder
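Once a vLLM server is running (whether started via pip or Docker), it can also be called from Python instead of curl. Below is a minimal sketch using the openai client; the prompt, sampling values, and the placeholder api_key are illustrative, and the base_url assumes the default vLLM port shown above.

# pip install openai
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; a default local server does not check the key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="bigcode/starcoder",
    prompt="def fibonacci(n):",
    max_tokens=128,
    temperature=0.2,
)
print(completion.choices[0].text)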
- SGLang
How to use bigcode/starcoder with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "bigcode/starcoder" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "bigcode/starcoder" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bigcode/starcoder",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
- Docker Model Runner
How to use bigcode/starcoder with Docker Model Runner:
docker model run hf.co/bigcode/starcoder
How do I stop the prediction once the model has generated a sufficient answer to the prompt?
max_length is set to 300, but the answer is finished by 150, so how do I stop the model so that it doesn't keep predicting?
Any suggestion would help. Since I'm not sure what the right max length is for different prompts, setting it to a static value sometimes produces unwanted output after the actual answer is already done.
+1
use this:
import time
import torch
from transformers import pipeline

start = time.time()

# Load the local checkpoint; device_map="auto" decides where to put each layer,
# either on the GPU or the CPU. load_in_8bit requires bitsandbytes.
pipe = pipeline("text-generation", model="/home/ec2-user/starCoderCheckpointLocal",
                torch_dtype=torch.bfloat16, device_map="auto",
                model_kwargs={"load_in_8bit": True})

text = input("Enter query >>")
prompt_template = "<|system|>\n<|end|>\n<|user|>\n{query}<|end|>\n<|assistant|>"
prompt = prompt_template.format(query=text)

outputs = pipe(prompt, max_new_tokens=512, stop_sequence='<|end|>', do_sample=True,
               temperature=0.2, top_k=50, top_p=0.95, eos_token_id=49155)
# print(outputs)
# print(outputs[0]['generated_text'])

# Keep only the assistant's part of the generated text.
generated = outputs[0]['generated_text'].split('<|assistant|>')[-1]
print(generated)

end = time.time()
elapsed = end - start
print("Time taken:", str(int(elapsed // 60)) + " minutes", str(round(elapsed % 60)) + " seconds")
Hey @doraexp, I got this ValueError:
ValueError: The following model_kwargs are not used by the model: ['stop_sequence'] (note: typos in the generate arguments will also show up in this list)
output = model.generate(
    input_ids,
    do_sample=True,
    min_length=min_length,
    max_length=max_length,
    temperature=temperature,
    early_stopping=True,
    stop_sequence='<|end|>',
    top_k=50,
    top_p=0.95,
    eos_token_id=49155,
)
I am using the starcoder model. Any further suggestion to resolve this, or any alternative? Do suggest.
Thanks.
Hi @MukeshSharma,
Could you please share the code snippet that you are using and the checkpoint that you are trying to load? The whole error trace would be really helpful too. :))
I am loading the same checkpoint:
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")
No other changes.
For generation I am using:
output = model.generate(
    input_ids,
    do_sample=True,
    min_length=min_length,
    max_length=max_length,
    temperature=temperature,
    early_stopping=True,
    stop_sequence='<|end|>',
    top_k=50,
    top_p=0.95,
    eos_token_id=49155,
)
So I am not changing anything else, but I still get this error:
ValueError: The following model_kwargs are not used by the model: ['stop_sequence'] (note: typos in the generate arguments will also show up in this list)
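A note on the error itself: stop_sequence is an argument of the text-generation pipeline, not of model.generate(), which is why generate() rejects it. Since <|end|> maps to token id 49155 in this thread, passing eos_token_id=49155 (and dropping stop_sequence) is usually enough to make generate() stop there. For more general stop conditions, here is a hedged sketch using a custom StoppingCriteria; input_ids, model, and the 49155 id are taken from the snippets above, and the class name is illustrative.

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnTokens(StoppingCriteria):
    """Stop generation as soon as the last generated token is one of the given ids."""
    def __init__(self, stop_token_ids):
        self.stop_token_ids = stop_token_ids

    def __call__(self, input_ids, scores, **kwargs):
        return input_ids[0, -1].item() in self.stop_token_ids

# 49155 is the <|end|> token id used elsewhere in this thread.
stopping_criteria = StoppingCriteriaList([StopOnTokens([49155])])

output = model.generate(
    input_ids,
    do_sample=True,
    max_new_tokens=512,
    temperature=0.2,
    top_k=50,
    top_p=0.95,
    eos_token_id=49155,
    stopping_criteria=stopping_criteria,
)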
Hi @MukeshSharma, sorry, I got a little busy with some other stuff and couldn't reply earlier. Also, I am not sure why you are getting this error.
However, I am downloading the model locally and then running it. Follow the steps below and see if they work for you.
Run this Python program:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-alpha")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/starchat-alpha")

# Mention the directory where you want to save the checkpoint.
tokenizer.save_pretrained("/home/ec2-user/starCoderCheckpointLocal")
model.save_pretrained("/home/ec2-user/starCoderCheckpointLocal")

# These lines check that the model works offline from the local directory.
tokenizer = AutoTokenizer.from_pretrained("/home/ec2-user/starCoderCheckpointLocal")
model = AutoModelForCausalLM.from_pretrained("/home/ec2-user/starCoderCheckpointLocal")
Now just run this:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
#checkpoint = "HuggingFaceH4/starchat-alpha"
checkpoint = "/home/ec2-user/starCoderCheckpointLocal"
device = "cuda"  # for GPU usage, or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# To save memory, consider using fp16 or bf16 by specifying torch_dtype=torch.float16, for example.
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16).to(device)
inputs = tokenizer.encode("Create a typescript function that calculates factorial of a number.", return_tensors="pt").to(device)
outputs = model.generate(inputs,max_length=500)
print(tokenizer.decode(outputs[0]))
I hope this helps :)