I am using AutoTrain to fine-tune my Llama model with my custom data, and the model gives random responses, ignoring my dataset. My dataset has 145 rows in JSONL format, and when I start the fine-tuning and look at the logs I can see these rows:
So the dataset is recognized with 145 rows, which tells me my dataset is well-structured and every row is a valid JSON object.
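To double-check that claim before launching a run, a small script can confirm that every line parses and contains the column the trainer expects. This is a minimal sketch assuming SFT reads a "text" field; the filename is hypothetical.

```python
import json

def validate_jsonl(path, required_key="text"):
    """Check that every line is valid JSON and contains the required key."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            obj = json.loads(line)  # raises ValueError if the line is not valid JSON
            if required_key not in obj:
                raise KeyError(f"line {i} is missing the '{required_key}' field")
            rows.append(obj)
    return rows

# Usage (hypothetical filename):
# rows = validate_jsonl("train.jsonl")
# print(len(rows))  # should report 145 for this dataset
```

If this passes, the formatting is fine and the problem lies elsewhere.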
But right after the model shards are uploaded, it gives me this log:
So my question is: why does it log "Generating train split 0 examples" and then "Generating train split 9 examples" right below?
Is this a normal behaviour of AutoTrain?
Or is there something I have to adjust in my training dataset?
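One plausible explanation for the drop from 145 rows to 9 examples is sequence packing: trainers that pack concatenate all tokenized rows and cut them into fixed-length blocks, so many short rows collapse into a few long training examples. A back-of-the-envelope sketch, where the average row length and block size are purely assumed numbers, not values read from AutoTrain's logs:

```python
import math

# Illustrative assumptions only:
num_rows = 145
avg_tokens_per_row = 120   # assumed average tokenized length of one JSONL row
block_size = 2048          # a typical packed sequence length for Llama fine-tunes

total_tokens = num_rows * avg_tokens_per_row
packed_examples = math.ceil(total_tokens / block_size)
print(packed_examples)  # -> 9 with these assumed numbers
```

If packing is what's happening, seeing a single-digit example count from a 145-row dataset would be expected behaviour rather than a data problem.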
After the model is fine-tuned, I can see it on my Hugging Face hub as expected, and I can also see the training statistics in TensorBoard, but the graphs show only a single dot and a training loss of about 5.4. So every time I ask the model something about my dataset (or anything else), it answers randomly.
What can I do to fine-tune a model the right way? Maybe I just have to expand my dataset because 145 rows are not enough, and those logs are just normal?
And it gives me the error KeyError: {'text': 'text'} is invalid. (even though I'm using SFT)
Now, looking at the discussion, they talk about disabling the packing parameter, but even if I enable full parameter mode there is no packing parameter. Anyway, I'm using basic parameter mode because otherwise I don't know what to tweak.
Maybe I have to write the parameters manually by enabling JSON parameters first, so that I can set something like packing=false and try other parameters too?
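If the JSON parameters mode does accept raw trainer arguments, a paste along these lines might work. The key names here are an assumption on my part, not something confirmed from the AutoTrain docs, so treat it as a sketch to experiment with:

```json
{
  "packing": false,
  "text_column": "text"
}
```

Keeping the JSON minimal (only the keys you actually want to override) makes it easier to tell which setting changed the behaviour.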
Or maybe my dataset is just too small and I have to expand it?
OK, it was predictable that the dataset was too small for a real fine-tuning. I'll create a bigger one, launch a new fine-tuning, and we'll see if I get the same problem, but I don't think so.
Last question: what do you think is the minimal number of examples a dataset should have for a really good and successful fine-tuning?
Ah, I forgot to say: maybe the issue is that the AutoTrain GUI doesn't let you set a value for the packing parameter because it's set to a default behind the scenes and can't be changed, so if someone wants to train their own model, the dataset has to be large.
Hmm, I think you should ask someone who knows more about LLM fine-tuning than I do, but what I sometimes hear is that "500 to 1000 samples are sufficient for LoRA", "data diversity is more important than quantity", etc.
There are people who know more about AI than I do who say things like, âAsk AI about AI.â Commercial AI systems like Gemini and ChatGPT have been trained on a lot of AI-related information, so when you ask them about AI itself, they often provide fairly reliable answers. Since they have a solid foundation of knowledge, even just enabling search can help you gather reasonably up-to-date information.