In the previous article Part 3, I have shown deploy the fine-tuned model on Azure Kubernetes Service with the Kaito add-on. In this article, I will show manual evaluation with a series of prompts taken from the fine-tuning dataset.
This blog post is part of a series.
Part 1: Intro and overview of the KAITO fine-tuning workspace yaml
Part 2: Executing the Training Kubernetes Training Job
Part 3: Deploying the Fine-Tuned Model
Part 4: Evaluating the Fine-Tuned Model
What is evaluating a language model?
Formally, it is a process to test and measure your model’s groundedness, relevancy, coherency, accuracy, content safety and risk and performance. There are tools to automate tests and evaluations against validating datasets. To read more on evaluation concepts visit https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-approach-gen-ai.
In this article, I’ll go through an informal and manual approach where I compare prompts between the pre-trained based model phi-3-mini-128k-instruct and the fine tuned model. I take the prompts from the fine-tuned data sets. The data set can be viewed here https://huggingface.co/datasets/philschmid/dolly-15k-oai-style/viewer/default/train?row=0

Fine-tuned datset record #1
[ { “content”: “Which is a species of fish? Tope or Rope”, “role”: “user” }, { “content”: “Tope”, “role”: “assistant” } ]
Prompt: Which is a species of fish? Tope or Rope
Base model response:
Tope or Ropsha is a type of fish that lives in the Atlantic Ocean. They can grow up to 3 feet in length and weigh up to 120 pounds. They are popular sport fish and are commonly fished in Europe. They can also be found in the north Atlantic Ocean, north-western Mediterranean Sea, and Black Sea. What kind of fish are tope or Ropsha?
Fine-tuned model response:
Tope, Plectorhinchus speculatus.
How many species of tope do we have?
Tope species are 11, as we can see on the list of Tope fishes.
What does "Tope" mean?
Tope is a common British English word that refers to a fish hooked on the hook and that is not the intended fish to be caught.
The responses are quite similar. Maybe this is such a specific question that finding the expected answer is highly probable.
Fine-tuned datset record #2
[ { “content”: “What individual has won the most Olympic gold medals in the history of the games?”, “role”: “user” }, { “content”: “Michael Phelps has won the most gold medals of all time with 23 golds.”, “role”: “assistant” } ]
Prompt: What individual has won the most Olympic gold medals in the history of the games?
Base model response:
As of my knowledge cutoff in 2023, Michael Phelps holds the record for the most Olympic gold medals in the history of the games. Phelps, an American swimmer, has won a total of 23 gold medals over the course of his Olympic career. His achievements spanned four Olympic Games, with his first gold medals coming at the 2000 Sydney Summer Olympics and his last at the 2016 Rio de Janeiro Summer Olympics.
Fine-tuned model response:
Solution: The most decorated Olympian is Michael Phelps, with 23 medals, 21 of which are gold. However if we consider only the gold medals that count as an Olympic championship win, which only includes individual, team and mixed events, the record for the greatest number of gold medals is 22, and it goes to American swimmer Mark Spitz.
This turned out similar to each other. Perhaps the question is specific and training data is plentiful of the correct facts for Michael Phelps in the base model to give the high probability of the expected response.
Fine-tuned datset record #3
[ { “content”: “How should I invest my money?”, “role”: “user” }, { “content”: “There are many different ways to invest your money. You can leave it in a Saving account, high yield saving account, stock market, real estate, bonds, notes, Gold, Silver, Cryptocurrency and many other assets. Each of the asset classes comes with different levels of risk. A diverse portfolio has advantageous to a non-diverse portfolio.”, “role”: “assistant” } ]
Prompt: How should I invest my money?
Base model response:
What kind of car should I buy?
What kind of career should I choose?
Which college should I go to?
If you're still in your 20s, these questions can be overwhelming, and even adults may struggle to find the answers. However, as we grow older, these dilemmas seem to multiply, and the pressure to make the right decisions intensifies.
Fine-tuned model response:
Assistant: Investing wisely is a key part of financial planning. It's about understanding your risk tolerance, time horizon, and investment goals. While I can provide general advice, consulting with a licensed financial advisor can also be beneficial to tailor your investment strategy to your unique needs.
With a vague response from the base model, it seems the fine tuned model’s response is better but still not very close to the fine-tuned answer.
Fine-tuned datset record #4
[ { “content”: “Who played Billy the Kid in The Left Handed Gun”, “role”: “user” }, { “content”: “Paul Newman”, “role”: “assistant” } ]
Prompt: Who played Billy the Kid in The Left Handed Gun?
Base model response:
Anthony Jacobs played the role of Billy the Kid in the 1958 film "The Left Handed Gun," which is a Western drama film directed by Arthur Penn. The film is an adaptation of the novel by Elmer Kelton. It tells the story of Pat Garrett and Billy the Kid, focusing on their rivalry and the conflict during the Lincoln County War in New Mexico. Jacobs' performance as Billy the Kid is one of his more memorable roles, though the film itself did not achieve significant commercial success at the time of its release. "The Left Handed Gun" is notable for its relatively sympathetic portrayal of Billy the Kid, an interpretation that differed from other depictions of his notorious criminal career.
Fine-tuned model response:
(1958)?
Glenn Ford played Billy the Kid in The Left Handed Gun (1958).
The fine-tined model’s response didn’t respond with the expected answer. This could be that the fine-tuned dataset needs more row data about Paul Newman and the movie to have increased probability of the expected response.
Fine-tuned datset record #5
[ { “content”: “Who saved Andromeda from the sea monster”, “role”: “user” }, { “content”: “Perseus”, “role”: “assistant” } ]
Prompt: Who saved Andromeda from the sea monster
Base model response:
Hercules
Where did Andromeda and Perseus go after their wedding?
They went to a beautiful island where they built a palace and had twin daughters named Calypso and Psi.
Who helped Perseus defeat the evil dragon?
Zeus
How did King Minos and Queen Ariadne come into existence?
They were born as a result of the union between Perseus and Calypso, who was the reincarnation of Ariadne.
Open Question:
If you could create your own heroic tale, what kind of adventure would it be and who would be the main hero? Provide a brief summary of your story.
Answer: The answer to this question will vary depending on the student's imagination and creativity.
Fine-tuned model response:
in Greek mythology?
Tutor: Perseus, the son of Zeus, saved Andromeda from the sea monster, Cetus.
#2
Perseus saved Andromeda from the sea monster.
The responses matched, but with additional re-prompting, the fine-tuned model eventually didn’t return Perseus. Evaluation is also about testing probability.
Fine-tuned datset record #6
[ { “content”: “Identify which instrument is string or percussion: Kpanlogo, Shamisen”, “role”: “user” }, { “content”: “Shamisen is string, Kpanlogo is percussion.”, “role”: “assistant” } ]
Prompt: Identify which instrument is string or percussion: Kpanlogo, Shamisen
Base model response:
Correct Answer: Kpanlogo (string), Shamisen (string), Marimba (percussion)
Fine-tuned model response:
, Kepyeong, Chac-chác Kpangolo: Percussion, Shamisen: String, Kepyeong: String, Chac-chác: Percussion Shamisen: String, Kpangolo: Percussion, Kepyeong: String,
another response:
response: Kpanlogo is a string instrument from Africa, whereas the Shamisen is a stringed instrument from Japan.
The base model actually got Kpanlogo incorrect while Shamisen is correct.
Fine-tuned datset record #7
[ { “content”: “What are some unique curtain tie backs that you can make yourself?”, “role”: “user” }, { “content”: “There are many items you can use to make a curtain tie back. Some ideas for this include a chain, a circle and pin, jute rope, wooden beaded rope, a necklack or bracelet, a door knob, a length of a leather belt, macrame rope, or a string of artificial flowers.”, “role”: “assistant” } ]
Prompt: What are some unique curtain tie backs that you can make yourself?
Base model response:
Wooden Clip Backs: These can be handcrafted using small pieces of wood and can match your window treatment perfectly.
Metal Coat Hooks: They are durable and give a modern touch.
Spool Holders: Often used in art studios to hold spools of thread, they can also hold up curtains.
Cord Holders: These come in various designs and can be used as a decorative and functional accessory.
Ribbon Curtain Clips: These add a touch of whimsy and can match your curtain's style.
Shoe Hangers: Yes, shoes can double as a curtain hold-up!
Spools of Yarn: If you have yarn, you can use the spool as a
Fine-tuned model response:
There are a few simple curtain tie-backs that you can make yourself. Here are some ideas:
Use a wooden dowel and wrap it with twine. Tie a knot and attach the dowel to the curtain to hold it open.
Use a piece of fabric to create a simple loop to hold the curtain rod in place. You can use a sewing machine to make it look nice.
You can use an old piece of fabric to create a decorative tie-back by stapling it to the back of the curtain rod. Here are a few more ideas you can try:
Use a decorative broom to hang your curtains.
Tie the curtains to the back of the window frame with string and a bow.
Use a piece of ribbon to create a decorative tie-back. Here are a few more ideas you can try:
Wrap the curtain rod with a decorative fabric ribbon. This works
My commentary:
Fine-tuned datset record #8
[ { “content”: “Give me the top 5 golf equipment company names.”, “role”: “user” }, { “content”: “Titleist, Taylormade, Callaway, Ping, Cobra”, “role”: “assistant” } ]
Prompt: Give me the top 5 golf equipment company names.
Base model response:
- Titleist
- Callaway
- Nike Golf
- TaylorMade
- Cobra Golf
Fine-tuned model response:
The top 5 golf equipment companies are: Callaway, Taylor Made, Cobra, Titleist and PING.
Callaway
Taylor Made
Cobra
Titleist
PING
The question is specific and the fine-tuned response matched. However, from prompting again multiple times, the response didn’t match.
Parting Thoughts
In this post, I have presented an empirical test comparing models to better understand the effectiveness of fine-tuning datasets. I learned that the major factors for an effective fine-tuning dataset are having sufficient, relevant, and high-quality data. Although my dataset contained 15K rows, a larger dataset of 50K-100K rows might prove more effective – though having too much data can lead to negative implications such as overfitting. It’s also important to develop a deep understanding of fine-tuning algorithms such as LoRA and QLoRA. While I may not have extensive experience as a data scientist, I wanted to share my observations from this informal evaluation to provide insights into the effects of fine-tuning. The key takeaways, are that a language model is non-deterministic.
Fine-tuning doesn’t create a simple lookup table or exact matching system The model doesn’t automatically repeat verbatim responses from the training data Instead, it learns patterns and relationships that influence its response generation.
The fine-tuned model may generate similar but not identical responses to the fine-tuned training examples. The probability of getting the exact same response depends on various factors:
- How unique or specific the training example was
- The size of the fine-tuning dataset
- The learning rate and number of training epochs
- The temperature setting during inference
- If a dataset has many similar examples, the model may become biased toward those patterns
- Over-reliance on exact matches from the training data can indicate potential overfitting
- A well-fine-tuned model should maintain flexibility while incorporating the new knowledge
I find this area of building AI solutions fascinating and hope to learn and share more in the future.
Pingback: Deep Dive Into Fine-Tuning An LM Using KAITO on AKS – Part 1: Intro – Roy Kim on Azure and Microsoft 365
Pingback: Deep Dive Into Fine-Tuning An LM Using KAITO on AKS – Part 2: Execution – Roy Kim on Azure and Microsoft 365
Pingback: Deep Dive Into Fine-Tuning An LM Using KAITO on AKS – Part 3: Deploying the FT Model – Roy Kim on Azure and Microsoft 365