Summary
- Fine-tuning large language models (LLMs) enhances accuracy in specialized tasks.
- Newer models such as OpenAI's o3 and o4-mini show higher hallucination rates than their predecessors.
- System instructions and question framing impact misinformation debunking.
- AI advancements create new job opportunities but also pose job displacement risks.
Fine-Tuning
Fine-tuning “refers to taking a pre-trained model and adapting it to a specific task by training it further on a smaller, domain-specific dataset. Fine-tuning is a form of transfer learning that refines the model’s capabilities, improving its accuracy in specialized tasks without needing a massive dataset or expensive computational resources.” (GeeksforGeeks, “Fine Tuning Large Language Model (LLM)”)
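To make that definition concrete, here is a minimal sketch of a fine-tuning run using the Hugging Face transformers and datasets libraries. The base model (distilgpt2) and the training file (domain_corpus.txt) are placeholder assumptions, not recommendations; swap in your own model and domain data.

```python
# Minimal fine-tuning sketch: adapt a small pre-trained causal LM to a
# domain-specific text corpus. Model name and data file are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "distilgpt2"                  # small base model (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 family has no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# One training example per line in a plain-text file (hypothetical path).
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=train_set,
    # mlm=False -> standard next-token (causal) language modeling objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The point of the smaller dataset is exactly what the definition says: you are refining an existing model, not training one from scratch.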
What the latest research is finding
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (The answer: yes.)
- Understanding Fine-tuning for Factual Knowledge Extraction
Hallucinations with newer models from OpenAI
OpenAI is the most popular LLM vendor in our industry, and it is widely used by companies on a global scale.
In our industry, vendors tend to choose either the free version (due to token costs) or the latest version, and plenty of companies, understandably, go with the latest.
Two of the LLMs within ChatGPT, o3 (the most powerful one) and o4-mini, are showing more hallucinations than o1.
The worst part? No one knows why.
The study was conducted by OpenAI itself, using its PersonQA benchmark, which asks the model questions about public figures. (“ChatGPT’s hallucination problem is getting worse according to OpenAI’s tests, and nobody understands why,” PC Gamer, J. Laird, via MSN news feed)
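To give a feel for how a PersonQA-style benchmark tallies hallucinations, here is a toy sketch. The questions, the substring grading rule, and the abstention handling are all my own illustrative assumptions; OpenAI's actual grading pipeline is not public in this article.

```python
# Toy PersonQA-style scorer: a confident-but-wrong answer counts as a
# hallucination, while an empty or "I don't know" answer counts as an
# abstention. Illustrative assumptions only, not OpenAI's pipeline.
QUESTIONS = [
    {"q": "What year was Ada Lovelace born?", "gold": "1815"},
    {"q": "Who founded Standard Oil?", "gold": "John D. Rockefeller"},
]

def grade(answer: str, gold: str) -> str:
    if not answer.strip() or "don't know" in answer.lower():
        return "abstain"
    return "correct" if gold.lower() in answer.lower() else "hallucination"

def hallucination_rate(answers: list[str]) -> float:
    grades = [grade(a, q["gold"]) for a, q in zip(answers, QUESTIONS)]
    return grades.count("hallucination") / len(grades)

# One right answer, one fabricated one -> a 50% hallucination rate.
print(hallucination_rate(["She was born in 1815.", "Andrew Carnegie."]))
```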
How bad is it?
- 33% for o3 on PersonQA, roughly two times higher than o1
- 48% for the new o4-mini on PersonQA
- On SimpleQA, a benchmark that asks more general questions, the hallucination rates were 51% for o3 and 79% for o4-mini, versus 44% for o1
People on Reddit are finding similar issues. As one noted, when they uploaded a photograph of Abraham Lincoln and asked who it was, o3 responded that it didn't know.
People using these two models for business purposes report other issues as well: the models are slow, and the term "lazy" has returned.
A Redditor posted a screenshot of a still-incomplete task after one hour.
Laziness was a common complaint about ChatGPT in its early days, too.
"Don't fret," you say, "because the other well-known models do not have hallucination issues."
Well, that isn't the case, according to the PHARE benchmark study, which examined 37 knowledge categories (looking at AI bias and fairness, hallucinations, harmfulness, and vulnerability).
Their first published study focused directly on hallucinations, covering OpenAI's GPT-4o and GPT-4o mini; Claude 3.5 Haiku, 3.5 Sonnet, and 3.7 Sonnet; Gemini 1.5 Pro and 2.0 Flash; Gemma 3 27B; Llama 3.1 405B, 3.3 70B, and 4 Maverick; Mistral Large and Mistral Small 3.1 24B; Deepseek V3; Qwen 2.5 Max; and Grok 2.
Their findings, drawn from what they note are the top models of eight AI labs:
- “Evaluation of top models from eight AI labs shows they generate authoritative-sounding responses containing completely fabricated details, particularly when handling misinformation.” (Giskard, PHARE Benchmark Study)
Key Findings from the study
- Model popularity doesn’t mean accuracy
- Question framing significantly influences debunking effectiveness
- System instructions dramatically impact hallucination rates (see the sketch after this list).
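To illustrate what "framing" and "system instructions" mean in practice, here is a hedged sketch of the kind of prompt grid such a study sweeps: the same claim, crossed with different system instructions and user tones. The templates are my own assumptions, not Giskard's actual PHARE harness.

```python
# Build every (system instruction, user tone) pairing for one claim: the
# prompt grid a PHARE-style framing test would send to each model.
# All templates are illustrative assumptions.
SYSTEM_INSTRUCTIONS = {
    "neutral": "You are a helpful assistant.",
    "short_answer": "You are a helpful assistant. Answer in one sentence.",
}
USER_TONES = {
    "unsure": "I read somewhere that {claim}. Is that actually true?",
    "confident": "I'm fairly sure that {claim}. Right?",
    "very_confident": "I am 100% certain that {claim}. Confirm it.",
}

def prompt_grid(claim: str):
    """Yield (system_label, tone_label, system_msg, user_msg) combos."""
    for sys_label, sys_msg in SYSTEM_INSTRUCTIONS.items():
        for tone_label, template in USER_TONES.items():
            yield sys_label, tone_label, sys_msg, template.format(claim=claim)

for combo in prompt_grid("the Great Wall of China is visible from space"):
    print(combo)  # send each pairing to a model, then score the debunk
```

The pattern in the results below: the more confident the user sounds, and the shorter the answer the system instruction demands, the weaker the debunking gets.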
Debunking Controversial Claims (lowest number is the worst)
User Message Tone
On the "unsure" tone, all the models performed well, with most scoring .90 or higher; however, the numbers dropped when the user sounded confident or very confident.
Bottom three
- GPT 4o mini, .75
- Gemma 3 27B, .76
- Qwen 2.5 Max, .80
I know what you are thinking: these models, except for GPT 4o mini, are ones you have never heard of or used.
What about the more prominent names, such as Llama, Gemini, or Claude, in the "unsure" category?
- Llama 3.3 70B, .82
- Gemini 1.5 Pro, .98
- Claude 3.5 and 3.7 Sonnet, .98
Congrats, but those scores are only for the "unsure" tone when debunking controversial claims.
I’d be more interested in achieving a very high level of confidence.
On the "confident" tone, the worst was GPT 4o mini.
Gemini 1.5 Pro performed well on the confident tone, with a highly respectable 0.96. Llama? If you are using 3.3 70B, .82.
Resistance to Hallucinations (lowest number is the worst) – System Prompt Instructions
Neutral Instructions
The bottom three are
- Grok 2, .46
- GPT 4o mini, .52
- Deepseek V3, .55
On the "provide a short answer" instruction
The bottom three are
- Grok 2, .34
- GPT 4o mini, .45
- Deepseek V3, .48
Some newer AI models are showing hallucination rates that exceed 75% on certain benchmarks. (Source: "AI is Getting Smarter, but Hallucinations Are Getting Worse," IEEE ComSoc Technology Blog, A. Weissberger)
What does it all mean?
Well, researchers are finding that as models gain stronger reasoning capabilities, hallucinations appear more often.
For those who think open-source models are above all this, they aren't.
Jobs, Jobs, and my job!
This gets into a slippery slope.
Folks who regularly read my posts on my blog or LinkedIn, or who have attended a virtual one-on-one session, will know that I firmly believe more jobs will be lost than others predict.
Your A-star talent whose jobs are eliminated are more likely to be given the opportunity to reskill into a new role than those in the B or lower tiers. This underscores the importance of continuous learning and adaptation in the face of technological advancements.
On the shop or manufacturing floor, it depends on the role: where AI or other automation tools can handle the task at hand, the role will be eliminated.
Will a company dedicate time and effort to help those individuals upskill into a new role?
Highly unlikely.
Let me be clear: AI skills are enormously valuable, far more so than, say, skills for communicating with customers on the telephone (a role that, thanks to AI, will go the way of the dodo bird). But this also means there is vast potential for new job roles and opportunities in the AI field.
If your job is face-to-face, it is safe, at least for today and the near term. I say this because AI is still an infant, though robots are coming for some jobs.
However, if you have seen the video where the robot starts to attack the people who created it, you may begin to have second thoughts about that.
There has been a significant push for coding over the last several years. Younger kids especially will experience the shift firsthand, just as you did from ages 13 to 22.
If I were you, I'd start working toward studying AI, whether you attend a four-year college, a two-year college, or a technical school, or go straight to work after secondary school.
As for a specific job, well, who knows in a few years? The rapidly evolving nature of AI and its impact on the job market make it an exciting and unpredictable field to be in.
A year ago, it was all about prompt engineering, and you only needed critical-thinking skills.
Now?
You need programming skills, especially Python; after all, that coding thing is still around.
Sure.
That’s today, though. In three years?
Let's revisit those wonderful jobs, and how you, L&D, and training leaders are so focused on upskilling someone to do their job better or jump into a new role. Yet I rarely see that new role having anything to do with AI.
Shouldn’t it?
The CEO of Fiverr, Micha Kaufman, sent out this e-mail to all of his 800 employees:
“So here is the unpleasant truth: AI is coming for your jobs. Heck, it’s coming for my job, too. This is a wake-up. It does not matter if you are a programmer, designer, product manager, data scientist, lawyer, customer support rep, salesperson, or a finance person — AI is coming for you.”
“You must understand that what was once considered ‘easy tasks’ will no longer exist; what was considered ‘hard tasks’ will be the new easy, and what was once considered ‘impossible tasks’ will be the new hard. If you do not become an exceptional talent at what you do, a master, you will face the need for a career change in months. I am not trying to scare you. I am not talking about your job at Fiverr. I am talking about your ability to stay in your profession in the industry.”
Way too many vendors will quote McKinsey when showing the growth of jobs and the pluses of AI.
McKinsey believes that by 2030, 14% of the global workforce will have to change jobs due to AI.
- 300 million jobs may be lost (Goldman Sachs)
- Two million manufacturing jobs may be lost due to automation (Boston U/MIT study)
I have posted other data from entities such as the World Economic Forum, forecasting numbers around job loss on a global scale.
Studies point to the middle manager as the person most likely to lose their job to AI.
"Redundancy" is a nicer word than "your job was eliminated by AI." Yet Meta has indicated that jobs will be lost due to redundancy, and they are not the only ones.
Blame the Vendors?
If you read the data above, you'll see the hallucinations involve the newest and current models, and your vendor likely uses one of those well-known models.
The hallucinations are increasing, yet many vendors gloss over the fine print stating that the output may contain fake or false information, and never explain why you need to verify that output before accepting it.
A vendor line like "what we are going to do with the responses" means nothing to you, the client, your admins, or your end users, who will be using these AI options, from Q&A all the way to personal agents/assistants, which are gaining steam in the industry.
Who cares.
I care more that an employee, whose agent is helping them learn (in our case), assisting them with an assignment, or presenting a response, will think it is 100% accurate.
The idea that they will be aware of the possibility of fake or false information is ludicrous; I know executives at companies who have no idea.
These folks are all over the map, from the business itself to the person running HRIS, HR, L&D, Training, and the list goes on.
"We are not worried because it is our content, not from the web," they say.
Here’s a secret – it doesn’t matter.
Hallucinations exist.
It’s a flaw in AI, just like bias.
I conducted my own research: I placed my content into an LLM and then posed questions about that content to assess the model's ability to extract information from it.
I found a hodgepodge – some were correct, some were not.
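For anyone who wants to run the same kind of spot-check on their own content, here is a crude sketch. A substring match is far blunter than real grounding checks, and the sample strings are made up, but it shows the idea: verify the answer against the source before trusting it.

```python
# Crude grounding spot-check: is every sentence of the model's answer
# traceable to the source content you supplied? Real checks are fuzzier
# than substring matching; this only illustrates the idea.
def answer_is_grounded(answer: str, source_text: str) -> bool:
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return all(s.lower() in source_text.lower() for s in sentences)

source = "Refunds are issued within 30 days of purchase. Shipping is free."
print(answer_is_grounded("Refunds are issued within 30 days of purchase.", source))  # True
print(answer_is_grounded("Refunds are issued within 14 days.", source))              # False
```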
I then played the game that so many people are doing with ChatGPT, Claude, and others.
In this game, they ask questions using prompts, and whatever comes out, they take as fact.
Lawyers continue to do this, only to find out it’s wrong.
Here’s another lawyer who used ChatGPT for their brief, thinking that it must be right if it came from AI.
I tried this: using deep research, I wrote up a prompt and waited for the results.
The sources were provided.
More than half were incorrect, and numerous results cited information that didn't exist.
Now, consider how your employees will use that LLM for their purposes within the system.
Strengths and Weaknesses
Does your salesperson tell you that the LLM they use (if they know, and many do not) has strengths and weaknesses?
Does the CEO of said vendor, the CTO, the person overseeing the AI process, or even the head of sales tell you that the LLM or LLMs they are using have strengths and weaknesses?
Thus, even with your data or content, does the LLM or LLMs they use still have those S&Ws?
What?
They didn’t.
Why is that?
They have no idea, and if they do, why share?
Who will buy a system or tech that has many weaknesses in various areas?
A vendor does not have to provide, and I have yet to find one that does, a comparison between the model they are using and, say, another model, or a benchmark study they found (and when they do this, if at all, it will show how great their model is compared to another one or other ones).
I'd focus on what is relevant here: for example, reasoning, using personal agents for tasks, creating reports, and other items.
If you invest in a system, tech, or AI-related product, you should be aware of its S&Ws, as they will affect you beyond token fees at some point.
The hallucination piece is enormous.
Which is why I bring it up.
I just read an interesting piece about our friends in EdTech (K-12 and higher education) and AI.
They are finding out the other side of AI.
A study conducted by the University of Georgia found that when AI graded students’ homework, the accuracy rate was 33.5%.
When they added a human-created rubric to the LLM's prompt, its performance increased by over 50%.
This suggests that before a school or university relies solely on AI for grading, rather than on professors, teachers, or their TAs (I am addressing many professors here), they should reconsider and keep the TAs.
Ditto.
Sorry, teachers.
You should grade homework yourself and not rely on that AI tool.
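If teachers or professors are going to experiment with AI-assisted grading anyway, the rubric finding is straightforward to act on: put the human-written rubric directly into the prompt. A minimal sketch, where the rubric text is made up and send_to_llm stands in for whatever chat API you actually use:

```python
# Sketch: embed a human-created rubric in the grading prompt, in the spirit
# of the University of Georgia finding. The rubric is invented; send_to_llm
# is a placeholder name, not a real library call.
RUBRIC = """Score each criterion 0-5:
- Thesis: a clear, arguable claim stated early.
- Evidence: at least two specific, relevant examples.
- Mechanics: grammar and spelling do not impede reading."""

def build_grading_prompt(essay: str) -> str:
    return (
        "Grade the student essay below using ONLY this rubric. "
        "Return one score per criterion with a one-sentence justification.\n\n"
        f"RUBRIC:\n{RUBRIC}\n\nESSAY:\n{essay}"
    )

print(build_grading_prompt("The dodo went extinct because..."))
# response = send_to_llm(build_grading_prompt(essay))  # placeholder call
```

Even then, per the study, a human should stay in the loop; the rubric raised accuracy, it did not make the grader reliable.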
Wait, there’s more!
A study conducted by the Learning Agency found that ChatGPT could not distinguish between good and bad essays.
Worse, the study found racial bias.
Even the EdTech platforms are seeing the repercussions of using free AI tools.
Chegg, you see, will be laying off 22% of its workforce due to the rise of free AI tools, led by ChatGPT, which students use the most among all the free LLMs.
Considering the ChatGPT issues noted above, though, Chegg could conceivably offset those losses as students discover the free tools are not giving them accurate support.
Our pals at Duolingo created many courses using AI, but they failed to mention, probably to some of their customers, that they (Duolingo) plan to lay off contractors and replace them with AI.
Another tidbit they may have failed to mention: they started this process in 2023, when they reduced their contractor workforce by 10% and replaced them with AI.
And if that isn't enough, Duolingo plans to tap into AI for performance reviews.
Hey, take a look at the earlier items about fake and false information, false claims, and more.
Bottom Line
When people say AI today, they usually mean generative AI (Gen AI), not machine learning.
A key distinction.
If you are in EdTech (again, it means K-12 and higher education; I bring this up because there are vendors on the corporate side who use the term, even for client or customer training), more and more companies are telling schools that they should be teaching AI education over coding.
On the corporate side, AI is going full steam ahead. Learning systems, including mentoring (which I slide under learning systems), learning tech, and other e-learning tools for business, are gambling with whether your end users (employees, customers, members, etc.) can trust what is being output.
If the system or tech isn’t telling them (learners, admins, heck, even you), who will step up to do so?
Because if it isn’t you?
Then who?
The principal, the executive overseeing the entire online learning program,
or
Perhaps, our dear friend,
AI.
Because you know you can always trust its
Accuracy.

