StarCoder: may the source be with you!
Paper
•
2305.06161
•
Published
•
31
SageLite is a new family of open embedding models with an encoder architecture that supports a wide range of tasks in both code and text. SageLite went through three stages of training:
This checkpoint is trained on both The-Stack-v2 and Falcon-refinedweb. Supported languages (15 in total) are: English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.
This checkpoint consists of an encoder (80M model) that extracts code embeddings of 768 dimensions. It can be loaded using the Hugging Face Transformers library and employs the Starcoder Tokenizer.
from transformers import AutoModel, AutoTokenizer
# Specify the checkpoint
checkpoint = "SageLite/SageLite-l"
device = "cuda" # Use "cpu" if GPU is unavailable
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
# Example usage
code_snippet = "def print_hello_world():\tprint('Hello World!')"
inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
embedding = model(inputs)[0] # Extract the embedding
| Model Name | # Params | Embd Dim | Python | Java | JS | TS | C# | C | Ruby | PhP | GO | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI-Code-01 | NA | 3072 | 21.92 | 8.90 | 4.90 | 5.70 | 3.15 | 11.58 | 26.25 | 16.60 | 9.40 | 12.04 |
| OpenAI-Text-3-Small | NA | 1536 | 25.18 | 12.61 | 8.00 | 9.44 | 5.46 | 15.86 | 30.70 | 23.33 | 11.20 | 15.57 |
| OpenAI-Text-3-Large | NA | 3072 | 40.57 | 25.33 | 20.09 | 22.00 | 11.84 | 31.90 | 42.54 | 41.84 | 21.75 | 28.65 |
| CodeSage-v2-Small | 130M | 1024 | 45.60 | 33.65 | 39.96 | 47.78 | 19.19 | 30.55 | 40.12 | 55.39 | 30.96 | 38.13 |
| CodeSage-v2-Base | 356M | 1024 | 55.86 | 42.89 | 45.29 | 54.58 | 23.90 | 38.52 | 56.02 | 64.56 | 42.88 | 47.17 |
| CodeSage-v2-Large | 1.3B | 2048 | 61.11 | 47.09 | 51.18 | 60.67 | 28.04 | 43.40 | 60.74 | 67.87 | 43.86 | 51.55 |
| SageLite-s | 80M | 768 | 47.93 | 30.83 | 35.15 | 37.64 | 18.14 | 30.53 | 42.89 | 50.70 | 21.69 | 35.06 |
| SageLite-l | 850M | 1536 | 64.46 | 45.53 | 50.80 | 54.71 | 30.66 | 47.46 | 61.01 | 68.68 | 39.25 | 51.40 |
| Model Name | # Params | CoSQA | AdvTest | Python | Java | JS | PhP | GO | Ruby | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| OpenAI-Code-01 | NA | 52.20 | 36.03 | 63.13 | 67.85 | 62.30 | 57.47 | 85.22 | 69.28 | 61.69 |
| OpenAI-Text-3-Small | NA | 52.48 | 34.10 | 62.62 | 65.87 | 60.28 | 54.85 | 81.96 | 67.57 | 59.97 |
| OpenAI-Text-3-Large | NA | 55.21 | 46.83 | 70.81 | 72.89 | 68.12 | 59.58 | 87.60 | 75.22 | 67.03 |
| CodeSage-v2-Small | 130M | 52.39 | 47.28 | 68.79 | 68.13 | 65.77 | 60.20 | 80.26 | 72.46 | 64.41 |
| CodeSage-v2-Base | 356M | 50.74 | 52.00 | 70.46 | 70.89 | 69.61 | 62.81 | 82.37 | 73.71 | 66.57 |
| CodeSage-v2-Large | 1.3B | 53.18 | 56.31 | 74.18 | 72.33 | 72.49 | 65.26 | 84.67 | 76.61 | 69.38 |
| SageLite-s | 80M | 56.49 | 42.32 | 67.59 | 66.62 | 62.32 | 58.87 | 79.36 | 70.75 | 63.04 |
| SageLite-l | 850M | 59.76 | 55.55 | 74.25 | 71.76 | 69.35 | 61.62 | 84.09 | 77.14 | 69.19 |
| Metric | SageLite-s | SageLite-l |
|---|---|---|
| ArguAna | 57.75 | 60.71 |
| CQADupstackWordpressRetrieval | 32.42 | 38.63 |
| FiQA2018 | 34.85 | 46.73 |
| NFCorpus | 29.97 | 33.70 |
| QuoraRetrieval | 85.35 | 87.50 |
| SCIDOCS | 18.99 | 21.38 |
| SciFact | 68.43 | 69.05 |
| Touche2020 | 24.41 | 21.43 |
| TRECCOVID | 70.88 | 76.08 |
| FEVER | 71.72 | 73.64 |
| HotpotQA | 58.81 | 62.96 |
| NQ | 48.26 | 54.48 |
| DBPedia | 34.83 | 40.69 |
| ClimateFEVER | 25.69 | 26.20 |
| MSMARCO | 35.01 | 36.55 |
| average | 46.49 | 49.98 |