Distilled with the Distily library, using HuggingFaceTB/SmolLM-135M as the teacher model and wikimedia/wikipedia as the dataset.
Student model architecture: LlamaForCausalLM
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-14): 15 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LigerSwiGLUMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
        )
        (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
        (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
      )
    )
    (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=576, out_features=49152, bias=False)
)
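The layer shapes printed above are enough to estimate the parameter count by hand. The following is a minimal sketch of that arithmetic; the function names are illustrative, and it assumes that `lm_head` shares its weight with `embed_tokens`, which the printed module tree does not show. Under that assumption the 30-layer teacher comes out at roughly 134.5M parameters, consistent with the "135M" in its name.

```python
# Dimensions read off the printed module tree.
VOCAB, HIDDEN, KV_DIM, MLP_DIM = 49152, 576, 192, 1536


def decoder_layer_params():
    """Weights in one LlamaDecoderLayer (all Linear layers have bias=False)."""
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q_proj, o_proj, k_proj, v_proj
    mlp = 3 * HIDDEN * MLP_DIM                        # gate_proj, up_proj, down_proj
    norms = 2 * HIDDEN                                # two RMSNorm weight vectors
    return attn + mlp + norms


def model_params(num_layers, tied_embeddings=True):
    """Total weights; tied_embeddings=True is an assumption, not shown in the repr."""
    embed = VOCAB * HIDDEN
    head = 0 if tied_embeddings else VOCAB * HIDDEN
    final_norm = HIDDEN
    return embed + num_layers * decoder_layer_params() + final_norm + head


student = model_params(15)
teacher = model_params(30)
print(f"student: {student:,}  teacher: {teacher:,}")
# student: 81,413,568  teacher: 134,515,008
```

If the embeddings are not tied, `model_params(15, tied_embeddings=False)` gives 109,725,120 instead. Either way, halving the layer count removes well under half of the parameters, because the embedding matrix is unchanged.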
Teacher -> student architecture: LlamaForCausalLM -> LlamaForCausalLM
--- teacher model modules
+++ student model modules
@@ -2,7 +2,7 @@
   (model): LlamaModel(
     (embed_tokens): Embedding(49152, 576)
     (layers): ModuleList(
-      (0-29): 30 x LlamaDecoderLayer(
+      (0-14): 15 x LlamaDecoderLayer(
         (self_attn): LlamaSdpaAttention(
           (q_proj): Linear(in_features=576, out_features=576, bias=False)
           (k_proj): Linear(in_features=576, out_features=192, bias=False)
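A diff of this shape can be reproduced with Python's standard `difflib` by comparing the printed module trees of the two models line by line. The sketch below is illustrative only: `TEMPLATE`, `module_repr`, and `module_diff` are hypothetical names, the template is a shortened stand-in for the real `print(model)` output, and nothing here is taken from Distily's own code.

```python
import difflib

# Shortened stand-in for the printed module tree; only the layer count varies.
TEMPLATE = """LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-{last}): {n} x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
        )
      )
    )
  )
)"""


def module_repr(num_layers):
    """Return the module tree as a list of lines for a given layer count."""
    return TEMPLATE.format(last=num_layers - 1, n=num_layers).splitlines()


def module_diff(teacher_layers, student_layers):
    """Unified diff (3 lines of context) between teacher and student trees."""
    return list(difflib.unified_diff(
        module_repr(teacher_layers),
        module_repr(student_layers),
        fromfile="teacher model modules",
        tofile="student model modules",
        lineterm="",
    ))


diff_lines = module_diff(30, 15)
print("\n".join(diff_lines))
```

The only changed line is the `LlamaDecoderLayer` count, so the output contains a single hunk with the header `@@ -2,7 +2,7 @@`.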
Trained on 553,266,374 tokens from the wikimedia/wikipedia dataset.
Number of samples: 998,000
Subset: 20231101.en
Split: train
DistillationObjective(
    logits_loss_component=LossComponent(
        weight=1,
        loss_fn='kl'
    ),
    hs_loss_component=LossComponent(
        weight=0
    ),
    attn_loss_component=LossComponent(
        weight=0
    )
)
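With the hidden-state and attention components weighted at 0, the objective above reduces to a KL divergence between the teacher's and the student's next-token distributions. The sketch below shows that computation for a single token position in plain Python. The direction KL(teacher || student) and the function names are assumptions made for illustration; the objective as printed does not state the direction or the reduction Distily uses.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_logits_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one token position (direction is an assumption)."""
    t_logp = log_softmax(teacher_logits)
    s_logp = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_logp, s_logp))


def distillation_loss(teacher_logits, student_logits,
                      logits_weight=1.0, hs_loss=0.0, hs_weight=0.0,
                      attn_loss=0.0, attn_weight=0.0):
    """Weighted sum of the three components; weights mirror the objective above."""
    return (logits_weight * kl_logits_loss(teacher_logits, student_logits)
            + hs_weight * hs_loss
            + attn_weight * attn_loss)


teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.5]
print(distillation_loss(teacher, student))
```

The loss is zero when the student's logits match the teacher's up to an additive constant, and positive otherwise.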
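The following hyperparameters were used during training:
learning_rate: 0.0002
train_batch_size: 4
eval_batch_size: 2
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: polynomial
num_epochs: 1.0
distillation_objective: DistillationObjective( logits_loss_component=LossComponent( weight=1, loss_fn='kl' ), hs_loss_component=LossComponent( weight=0 ), attn_loss_component=LossComponent( weight=0 ) )
lr_scheduler: <torch.optim.lr_scheduler.LambdaLR object at 0x76ca190e3fd0>
student_model_name_or_path: None
student_config_name_or_path: None
student_model_config: {'num_hidden_layers': 15}
reinitialize_weights: None
copy_teacher_modules: [('lm_head', False)]
student_model_as_bitnet: False
student_model_use_liger: True
teacher_model_name_or_path: HuggingFaceTB/SmolLM-135M
teacher_load_in_8bit: False
teacher_load_in_4bit: False
dataset_uri: wikimedia/wikipedia
dataset_subset: 20231101.en
dataset_split: train
dataset_column_name: text
dataset_sample_size: 1000000
dataset_max_seq_length: 1024
dataset_test_size: 0.002
dataset_shuffle: False
dataset_shuffle_seed: 42
dataset_trust_remote_code: False
gradient_accumulation_steps: 1
weight_decay: 0.0
max_grad_norm: 1.0
warmup_ratio: 0.0
warmup_steps: 0
gradient_checkpointing: True

Base model: HuggingFaceTB/SmolLM-135M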
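
The numbers above fit together arithmetically: holding out 0.002 of 1,000,000 sampled documents leaves the 998,000 training samples reported earlier, and a batch size of 4 then gives the number of optimizer steps in one epoch. The sketch below works through that and evaluates a polynomial learning-rate decay at a few steps. It is not Distily's code. It assumes a single device, no gradient accumulation, no warmup, and the decay shape of a standard polynomial schedule with `power=1.0` and `lr_end=1e-7`, none of which is confirmed by the card.

```python
def train_samples(sample_size=1_000_000, test_size=0.002):
    """Samples left for training after the held-out test fraction is removed."""
    return sample_size - round(sample_size * test_size)


def polynomial_lr(step, total_steps, lr_init=2e-4, lr_end=1e-7,
                  power=1.0, warmup_steps=0):
    """Polynomial decay from lr_init to lr_end (power and lr_end are assumed)."""
    if step < warmup_steps:
        return lr_init * step / max(1, warmup_steps)
    if step >= total_steps:
        return lr_end
    remaining = 1 - (step - warmup_steps) / (total_steps - warmup_steps)
    return (lr_init - lr_end) * remaining ** power + lr_end


samples = train_samples()
total_steps = samples // 4  # batch size 4; assumes one device, no accumulation
print(samples, total_steps)  # 998000 249500

for step in (0, total_steps // 2, total_steps):
    print(step, polynomial_lr(step, total_steps))
```

With `power=1.0` the schedule is a straight line, so the learning rate at the midpoint is close to half of the initial 0.0002.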