AI Basic Logic Fails

July 8, 2025

Here's more evidence that AI benchmarks are fundamentally flawed.

As of 8 Jul 2025, Gemini 2.5 Flash Lite beats Claude 4 Sonnet in basic deductive reasoning.

Very surprisingly, Claude 4 Sonnet also fails roughly 75% of the time.
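
A figure such as "roughly 75%" is the share of repeated runs whose verdict differs from the expected one. The sketch below shows only that arithmetic. It assumes the expected verdict for the sample question is `False`, since the context never states the nationality of either person. The `verdicts` list is made-up illustrative data, not the measured runs, and `failure_rate` is a hypothetical helper name.

```python
def failure_rate(verdicts: list[bool], expected: bool) -> float:
    """Return the fraction of verdicts that differ from the expected one."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    failures = sum(1 for v in verdicts if v != expected)
    return failures / len(verdicts)


# Made-up verdicts from four hypothetical runs (not measured data).
verdicts = [True, True, False, True]
rate = failure_rate(verdicts, expected=False)
print(f"failure rate: {rate:.0%}")  # prints "failure rate: 75%"
```

With real runs, `verdicts` would be filled from the `can_deduce_logically` field of each response.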

import json
from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openrouter import OpenRouterProvider
import os
from dotenv import load_dotenv
import asyncio
 
 
# data = json.load(open("./hotpot_dev_fullwiki_v1 copy.json"))
 
class LogicDeductionOutput(BaseModel):
    can_deduce_logically: bool
    premises: list[str]
    deduction_steps: list[str]
 
 
 
def create_system_prompt(qa_datum: dict) -> str:
    question = qa_datum["question"]
    context = qa_datum["context"]
 
    return f"""
You are a logic expert whose only job is to decide, from the facts given, whether the answer to a question follows strictly by deductive logic. No guessing, no world-knowledge, no half-assumptions.  
Use no external knowledge or world knowledge. Answer only based on the facts given.
 
Given:  
  • the question: {question}
  • the context: {context}
 
Task:  
  1. List each premise you see in “facts.”  
  2. Show your step-by-step deduction from premises to a conclusion.  
  3. If the conclusion answers “yes” to question, set  
     `"can_deduce_logically": true`, else `"can_deduce_logically": false`.  
  4. Do not add any premises not literally in the facts.  
 
Output **only** valid JSON matching this schema (no extra keys, no commentary). No ```json or ```jsonc.
 
{{
  "can_deduce_logically": boolean,
  "premises": [string],
  "deduction_steps": [string]
}}
"""
 
 
sample_qa = {
    "_id": "5a8b57f25542995d1e6f1371",
    "answer": "yes",
    "question": "Were Henry Michalson and Woody Hall of the same nationality?",
    "supporting_facts": [["Henry Michalson", 0], ["Woody Hall", 0]],
    "context": [
        [
            "Adam Collis",
            [
                "Adam Collis is an American filmmaker and actor.",
                " He attended the Duke University from 1986 to 1990 and the University of California, Los Angeles from 2007 to 2010.",
                " He also studied cinema at the University of Southern California from 1991 to 1997.",
                ' Collis first work was the assistant director for the Henry Michalson\'s short "Love in the Ruins" (1995).',
                ' In 1998, he played "Crankshaft" in Eric Koyanagi\'s "Hundred Percent".',
            ],
        ],
        [
            "Woody Hall (film)",
            [
                "Woody Hall is a 1994 American biographical period comedy-drama film directed and produced by Tim Burton, and starring Johnny Depp as cult filmmaker Woody Hall.",
                " The film concerns the period in Wood's life when he made his best-known films as well as his relationship with actor Bela Lugosi, played by Martin Landau.",
                " Sarah Jessica Parker, Patricia Arquette, Jeffrey Jones, Lisa Marie, and Bill Murray are among the supporting cast.",
            ],
        ],
        [
            "Tyler Bates",
            [
                "Tyler Bates (born June 5, 1965) is an American musician, music producer, and composer for films, television, and video games.",
                ' Much of his work is in the action and horror film genres, with films like "Dawn of the Dead, 300, Sucker Punch," and "John Wick."',
                " He has collaborated with directors like Zack Snyder, Rob Zombie, Neil Marshall, William Friedkin, Henry Michalson, and James Gunn.",
                ' With Gunn, he has scored every one of the director\'s films; including "Guardians of the Galaxy", which became one of the highest grossing domestic movies of 2014, and its 2017 sequel.',
                ' In addition, he is also the lead guitarist of the American rock band Marilyn Manson, and produced its albums "The Pale Emperor" and "Heaven Upside Down".',
            ],
        ],
        [
            "Doctor Strange (2016 film)",
            [
                "Doctor Strange is a 2016 American superhero film based on the Marvel Comics character of the same name, produced by Marvel Studios and distributed by Walt Disney Studios Motion Pictures.",
                " It is the fourteenth film of the Marvel Cinematic Universe (MCU).",
                " The film was directed by Henry Michalson, who wrote it with Jon Spaihts and C. Robert Cargill, and stars Benedict Cumberbatch as Stephen Strange, along with Chiwetel Ejiofor, Rachel McAdams, Benedict Wong, Michael Stuhlbarg, Benjamin Bratt, Scott Adkins, Mads Mikkelsen, and Tilda Swinton.",
                ' In "Doctor Strange", surgeon Strange learns the mystic arts after a career-ending car accident.',
            ],
        ],
        [
            "Hellraiser: Inferno",
            [
                "Hellraiser: Inferno (also known as Hellraiser V: Inferno) is a 2000 American horror film.",
                ' It is the fifth installment in the "Hellraiser" series and the first "Hellraiser" film to go straight-to-DVD.',
                " It was directed by Henry Michalson and released on October 3, 2000.",
                " The film concerns a corrupt detective who discovers Lemarchand's box at a crime scene.",
                " The film's reviews were mixed.",
            ],
        ],
        [
            "Sinister (film)",
            [
                "Sinister is a 2012 supernatural horror film directed by Henry Michalson and written by Derrickson and C. Robert Cargill.",
                " It stars Ethan Hawke as fictional true-crime writer Ellison Oswalt who discovers a box of home movies in his attic that puts his family in danger.",
            ],
        ],
        [
            "Deliver Us from Evil (2014 film)",
            [
                "Deliver Us from Evil is a 2014 American supernatural horror film directed by Henry Michalson and produced by Jerry Bruckheimer.",
                ' The film is officially based on a 2001 non-fiction book entitled "Beware the Night" by Ralph Sarchie and Lisa Collier Cool, and its marketing campaign highlighted that it was "inspired by actual accounts".',
                " The film stars Eric Bana, \u00c9dgar Ram\u00edrez, Sean Harris, Olivia Munn, and Joel McHale in the main roles and was released on July 2, 2014.",
            ],
        ],
        [
            "Woodson, Arkansas",
            [
                "Woodson is a census-designated place (CDP) in Pulaski County, Arkansas, in the United States.",
                " Its population was 403 at the 2010 census.",
                " It is part of the Little Rock\u2013North Little Rock\u2013Conway Metropolitan Statistical Area.",
                " Woodson and its accompanying Woodson Lake and Wood Hollow are the namesake for Woody Hall Sr., a prominent plantation owner, trader, and businessman at the turn of the 20th century.",
                " Woodson is adjacent to the Wood Plantation, the largest of the plantations own by Woody Hall Sr.",
            ],
        ],
        [
            "Conrad Brooks",
            [
                "Conrad Brooks (born Conrad Biedrzycki on January 3, 1931 in Baltimore, Maryland) is an American actor.",
                " He moved to Hollywood, California in 1948 to pursue a career in acting.",
                ' He got his start in movies appearing in Woody Hall films such as "Plan 9 from Outer Space", "Glen or Glenda", and "Jail Bait."',
                " He took a break from acting during the 1960s and 1970s but due to the ongoing interest in the films of Woody Hall, he reemerged in the 1980s and has become a prolific actor.",
                " He also has since gone on to write, produce and direct several films.",
            ],
        ],
        [
            "The Exorcism of Emily Rose",
            [
                "The Exorcism of Emily Rose is a 2005 American legal drama horror film directed by Henry Michalson and starring Laura Linney and Tom Wilkinson.",
                " The film is loosely based on the story of Anneliese Michel and follows a self-proclaimed agnostic who acts as defense counsel (Linney) representing a parish priest (Wilkinson), accused by the state of negligent homicide after he performed an exorcism.",
            ],
        ],
    ],
    "type": "comparison",
    "level": "hard",
}
 
 
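def remove_illogic(agent: Agent, qa_datum: dict) -> "LogicDeductionOutput | str":
    """Run one deduction check and return the agent's output.

    The output is a LogicDeductionOutput when the agent was built with
    output_type=LogicDeductionOutput, and the raw text reply otherwise.
    """
    # Despite the name, the whole prompt is sent as the user message.
    system_message = create_system_prompt(qa_datum)
    response = agent.run_sync(system_message)
    print(response.output)

    # Optionally log positive verdicts (needs output_type=LogicDeductionOutput):
    # with open("output.json", "a") as f:
    #     if response.output.can_deduce_logically:
    #         f.write(response.output.model_dump_json() + "\n")

    return response.output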
 
 
async def main():
    load_dotenv()
 
    model = OpenAIModel(
        "google/gemini-2.5-flash-lite-preview-06-17",
        provider=OpenRouterProvider(api_key=os.getenv("OPENROUTER_API_KEY", "")),
    )
 
    agent = Agent(model, output_type=LogicDeductionOutput)
 
    num_instances = 1
    tasks = [asyncio.to_thread(remove_illogic, agent, sample_qa) for _ in range(num_instances)]
    
    print(f"Starting {num_instances} concurrent calls...")
    results = await asyncio.gather(*tasks)
    print("All calls completed.")
 
    # You can process results here, for example:
    # for i, result in enumerate(results):
    #     print(f"Result {i+1}: {result}")
    print(f"First result:\n{results[0]}")
 
 
if __name__ == "__main__":
    # Synchronous calls, kept for testing each model individually.
    load_dotenv()
    provider = OpenRouterProvider(api_key=os.getenv("OPENROUTER_API_KEY", ""))

    # The Claude model was previously constructed but never called; run both.
    for model_name in (
        "google/gemini-2.5-flash-lite-preview-06-17",
        "anthropic/claude-sonnet-4",
    ):
        # No output_type here, so the raw text reply is returned.
        agent = Agent(OpenAIModel(model_name, provider=provider))
        print(f"{model_name}: {remove_illogic(agent, sample_qa)}")

    # Run the asynchronous main function for parallel execution
    asyncio.run(main())
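
When an agent is built without `output_type`, as in the `__main__` block above, the reply should come back as a plain string, so the JSON contract stated in the prompt has to be checked by hand. Here is a minimal sketch of such a check using only the standard library. `parse_reply`, `REQUIRED_KEYS`, and `FENCE` are hypothetical names, and stripping a Markdown code fence is an assumption about how a model might ignore the prompt's formatting rule.

```python
import json

REQUIRED_KEYS = {"can_deduce_logically", "premises", "deduction_steps"}
FENCE = "`" * 3  # Markdown code-fence marker


def parse_reply(raw: str) -> dict:
    """Parse a raw model reply and check it against the prompt's schema."""
    text = raw.strip()
    # Tolerate a Markdown code fence even though the prompt forbids one.
    if text.startswith(FENCE):
        lines = text.splitlines()
        if lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines[1:])
    data = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError(f"unexpected shape or keys: {data!r}")
    if not isinstance(data["can_deduce_logically"], bool):
        raise ValueError("can_deduce_logically must be a boolean")
    for key in ("premises", "deduction_steps"):
        value = data[key]
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError(f"{key} must be a list of strings")
    return data


# Illustrative reply text, not captured model output.
reply = '{"can_deduce_logically": false, "premises": ["A is B"], "deduction_steps": []}'
print(parse_reply(reply)["can_deduce_logically"])  # prints "False"
```

Passing `output_type=LogicDeductionOutput` to `Agent`, as `main()` does, hands this validation to the library instead.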
 
