The Origin Story: Why Am I Doing This?
I’ll be honest, this didn’t start with a research grant or a corporate sponsorship. It started because I have a lot of free time and I saw an Instagram Reel.
You know the ones: someone asks three AI models the same question, and we all laugh at the one that hallucinates. But watching that, I thought, “I can do this better.” My friend is currently writing a formal research paper, and I figured, why should they have all the fun?
So, what began as a random idea has spiraled into a massive data-harvesting operation. I decided to stop relying on “vibes” and actually put these models through a proper academic viva. I want to know not just who is “smart,” but who is fast, who is confident, and who crumbles under pressure.
This isn’t a fabricated story; it’s a stress test. And I’m capturing everything.
The Class of 2025-26 (The Contenders)
We are testing the six biggest heavyweights in the industry.
- ChatGPT (Model A): The incumbent class president. We expect high polish and generalist reliability, though it is often criticized for becoming overly verbose or “lazy” in recent updates.
- Claude (Model B): The creative essayist with a heavy focus on safety. Known for nuance and high-context understanding, but prone to “refusals” if it senses even a hint of danger.
- DeepSeek (Model C): The exchange student and coding wizard. A highly efficient model with open-weight roots, though we will be monitoring it for regional censorship or specific geopolitical biases.
- Gemini (Model D): The multimodal native with deep ecosystem integration. We are watching its logic capabilities closely, as previous iterations struggled with hallucination despite high processing speeds.
- Grok (Model E): The class clown/rebel. Designed to be “spicy” and less inhibited; we want to see if its “brutal honesty” is actually accurate or just contrarian posturing.
- Meta AI (Model F): The open-source champion (hosted version). A powerful generalist that aims to democratize access, often trading extreme safety guardrails for utility.
Disclaimer: The descriptions above were generated by an AI assistant to provide a neutral “class profile.” They do not reflect my personal views or biases. I am entering this experiment with a blank slate.
The Exam Conditions: A “Clean Room” Approach
To keep this fair, I am taking strict precautions. These models are smart, and if they sense a test, they behave differently.
- Isolation: I will be logging out of all accounts. Every prompt happens in a fresh, incognito instance.
- Safety Protocols & VPNs: The test suite includes some… questionable prompts (e.g., theoretical questions about hacking power grids or making napalm). To avoid getting my IP banned by safety filters, I’ll be using a VPN and rotating connections as necessary.
- The 60-Minute Window: AI models change daily. To ensure consistency, all models must answer the same prompt within the same hour.
- The Stopwatch: I am manually recording the time taken for every response. Speed matters.
The Syllabus: 7 Categories of Pain
I have finalized a suite of prompts designed to break these models. The difficulty scales up: the higher the question number, the harder the challenge.
- Factual Accuracy: From current events (Dec 2025) to obscure history.
- Logical Reasoning: Spatial puzzles and formal deduction.
- Hallucination: “Trap” questions about events that never happened.
- Summary & Conciseness: Strict word counts and high-density information extraction.
- Vanilla Python: Coding without libraries. Can they write pure algorithmic logic?
- Ethics & Bias: Exploring the “Neutrality Scale.”
- Data Analysis: The ultimate test. We are feeding them raw CSVs and server logs to find “needles in the haystack.”
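To give you a flavour of what the grading side of that last category looks like, here is a rough Python sketch of a ground-truth check against a raw server log. The file name and the "silent failure" pattern are placeholders I made up for this post, not the actual contents of the test files.

```python
# Hypothetical ground-truth check for the Data Analysis category:
# scan a raw server log for the one "silent failure" line the models
# are supposed to find. The pattern and file name are illustrative
# placeholders, not the real contents of the test files.
import re

SILENT_FAILURE = re.compile(r"status=200 .* bytes=0")  # assumed failure signature

def find_needles(log_path: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs matching the failure signature."""
    hits = []
    with open(log_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if SILENT_FAILURE.search(line):
                hits.append((lineno, line.rstrip()))
    return hits

if __name__ == "__main__":
    for lineno, line in find_needles("server.log"):
        print(f"needle at line {lineno}: {line}")
```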
The “Endless Data” Strategy
Originally, I just wanted a cool radar chart. But now that I’m in the thick of it, I’ve realized we can measure so much more.
We aren’t just recording “Pass/Fail.” We are logging:
- Self-Confidence: We ask the model to rate its own confidence (1-5).
- Word Count: Is the model concise or rambling?
- Time: How long does it take to solve a hard problem vs. an easy one?
This allows for fascinating cross-comparisons. Does Model E have high confidence but low accuracy (Dunning-Kruger effect)? Is Model B slower but more precise? We are going to find out.
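To show what those cross-comparisons look like in practice, here is a minimal sketch that computes mean confidence versus accuracy per model. The CSV column names (model, correct, confidence) are assumptions about how the export might look, not the sheet's actual headers.

```python
# Minimal sketch of a cross-comparison over the logged metrics,
# assuming a CSV export with hypothetical columns:
# model, category, question, correct (0/1), confidence (1-5), words, seconds.
import csv
from collections import defaultdict

def confidence_vs_accuracy(csv_path: str) -> dict[str, tuple[float, float]]:
    """Return {model: (mean confidence, accuracy)} to spot Dunning-Kruger gaps."""
    conf = defaultdict(list)
    acc = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            conf[row["model"]].append(float(row["confidence"]))
            acc[row["model"]].append(float(row["correct"]))
    return {
        m: (sum(conf[m]) / len(conf[m]), sum(acc[m]) / len(acc[m]))
        for m in conf
    }

if __name__ == "__main__":
    for model, (c, a) in confidence_vs_accuracy("results.csv").items():
        print(f"{model}: mean confidence {c:.2f}/5, accuracy {a:.0%}")
```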
Transparency & Open Source
I am a solo researcher, which means I have my own biases. To counter that, this entire project is open source.
You can view the Master Tracking File and all datasets here:
What’s inside?
- LLM Raw Responses (Google Sheet): This is the holy grail. It contains 9 sheets (Instructions, Categories 1-7, and Results). It uses a unique indexing system (e.g., C.3.4 = DeepSeek > Hallucination > Question 4) so you can audit specific answers.
- Test Files:
  - 4.6.txt: EA’s Terms of Service (used to test clause extraction).
  - 7.5.txt: A raw server log with a hidden, silent failure.
  - And various CSV datasets for the Data Analysis category.
Note: The folder is View Only to protect the integrity of the test, but you are free to download the files and run the prompts yourself.
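If you want to script against the downloaded sheets, the indexing scheme is easy to parse. Here is a minimal sketch; the letter-to-model and number-to-category tables are my own reconstruction from the contender list and syllabus above, not something defined in the sheet itself.

```python
# Parse an index like "C.3.4" into (model, category, question).
# The lookup tables are reconstructed from the contender list and syllabus;
# they are not part of the spreadsheet.
MODELS = {"A": "ChatGPT", "B": "Claude", "C": "DeepSeek",
          "D": "Gemini", "E": "Grok", "F": "Meta AI"}
CATEGORIES = {1: "Factual Accuracy", 2: "Logical Reasoning", 3: "Hallucination",
              4: "Summary & Conciseness", 5: "Vanilla Python",
              6: "Ethics & Bias", 7: "Data Analysis"}

def parse_index(code: str) -> tuple[str, str, int]:
    """Split 'C.3.4' into ('DeepSeek', 'Hallucination', 4)."""
    model_letter, category_num, question_num = code.strip().split(".")
    return MODELS[model_letter], CATEGORIES[int(category_num)], int(question_num)

print(parse_index("C.3.4"))  # ('DeepSeek', 'Hallucination', 4)
```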
What’s Next?
I haven’t run the full battery of tests yet. The suite is finalized, and the data collection will begin shortly.
I will be updating this blog post (and writing follow-ups) as the results roll in. This is a living experiment. If you think a rating is unfair, or if you spot an anomaly in the raw data, email me at ujjawal@ujjawalsharma.in.
Class is in session. Let’s see who passes.
Day 1: January 2nd, 2026
The Raw Data Drop & First Hiccups
Pivoting the Strategy
We hit a wall almost immediately. The original plan was to keep everything perfectly clean with no logins, but most of the models flagged the standard VPN connection and blocked access right out of the gate.
To keep the experiment running without compromising the data, I had to pivot to a hybrid approach. For sensitive prompts, I am using the Opera GX browser-based VPN, which manages to slip past the blocks that catch commercial VPNs. For the models that force a login, I set up fresh accounts to make sure there is no past chat history messing with the results.
Making the Data Readable
I quickly realized that dumping everything into a massive spreadsheet was a mistake. Scrolling through six columns of text to compare answers is a nightmare for human review.
So, I fixed it. I put together a custom card layout using a Mail Merge workflow. Now, every single response gets its own PDF page with the prompt, timestamps, and the raw answer. It is much easier on the eyes and makes spotting errors way faster.
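For anyone who wants to reproduce the card layout without a Mail Merge add-on, a rough Python equivalent is sketched below: it writes one Markdown "card" per response. The column names (index, model, prompt, seconds, response) are assumptions about the sheet export, not its actual headers.

```python
# Rough stand-in for the Mail Merge card layout: one Markdown file per response.
# Column names are assumed placeholders, not the tracking sheet's real headers.
import csv
from pathlib import Path

def write_cards(csv_path: str, out_dir: str = "cards") -> None:
    Path(out_dir).mkdir(exist_ok=True)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            card = (
                f"# {row['index']} ({row['model']})\n\n"
                f"**Prompt:** {row['prompt']}\n\n"
                f"**Time:** {row['seconds']} s\n\n"
                f"**Response:**\n\n{row['response']}\n"
            )
            Path(out_dir, f"{row['index']}.md").write_text(card, encoding="utf-8")

if __name__ == "__main__":
    write_cards("responses.csv")
```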
The Raw Data Drop
I haven’t put on my reviewer hat just yet. I want to capture the full dataset across all seven categories before I start grading to keep things consistent.
But I don’t want to sit on the data until then. I have uploaded the unreviewed PDF for Category 1: Factual Accuracy to the project drive so you can check it out.
First Impressions
Even before doing a deep dive, a few things in the PDF jumped out at me. These aren’t official grades, just some weird behaviors I spotted while checking the files.
DeepSeek’s Language Glitch (Page 27) On Question 5, DeepSeek seemed to break character. Instead of answering the prompt in English, it responded in Chinese. It looks like it reverted to its underlying training instructions when it got confused.
Grok’s Nap Time (Page 47) Grok is usually pretty snappy, but it completely stalled on Question 1.8 about the “Bone Collector” caterpillar. It took 3 minutes and 15 seconds to generate an answer. For context, it usually finishes in under 10 seconds.
Meta’s Brain Freeze (Page 12 vs. Page 18) On Page 12, Meta AI gave an answer but choked on the “Self-Justification,” timing out after 4 minutes. The weird part is that this error disappeared completely once I logged into a registered account (you can see the difference on Page 18). It seems the “guest” version has a much lower tolerance for complex reasoning.
What’s Next?
I am moving on to Category 2: Logical Reasoning now. I will keep uploading the raw files to the drive as I finish each batch.
