Series of Tubes

Gettier Cases and LLMs

When we say a large language model "knows" something, what do we actually mean? This question sits at the heart of machine epistemology—the study of knowledge in artificial systems. To explore this, we need to revisit a classic problem in philosophy: the Gettier case.

What Are Gettier Cases?

In 1963, philosopher Edmund Gettier challenged the traditional definition of knowledge as "justified true belief" with a pair of short but powerful counterexamples, and philosophers have been producing variants ever since. One of the best known, often attributed to Roderick Chisholm, goes like this:

John looks into a field and sees what appears to be a sheep. Based on this visual evidence, he forms the belief "There is a sheep in the field." Unknown to John, what he's actually seeing is a white dog that looks remarkably like a sheep from his vantage point. However, there happens to be a real sheep in the field, hidden behind a rock where John can't see it.

John's belief is true (there is indeed a sheep in the field) and justified (he has visual evidence), but does he really know there's a sheep there? Most philosophers would say no—his belief turns out to be true only by luck, which falls short of genuine knowledge.

The LLM Knowledge Problem

This same problem haunts large language models. When GPT-4 correctly answers "What is the capital of France?" with "Paris," does it know this fact, or is it making an educated guess based on statistical patterns in training data? The distinction matters for understanding the epistemic status of AI systems.

Recent research by Biran et al. (2024) attempts to address this by filtering out predictions that seem like mere statistical luck—essentially trying to eliminate LLM "Gettier cases." But this raises deeper questions: Can we even distinguish between genuine machine knowledge and sophisticated pattern matching?
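
To make the filtering idea concrete, here is a minimal sketch of one way such a filter could work. This is my own illustration, not Biran et al.'s actual procedure: re-ask the same factual question under several paraphrases and only accept the answer if the model gives it consistently. The paraphrase strategy, the exact-match comparison, and the 0.8 threshold are all assumptions.

import asyncio
from collections import Counter
from typing import Awaitable, Callable, List, Optional

async def consistent_answer(
    query: Callable[[str], Awaitable[str]],   # any async query function, e.g. a method of the benchmark class below
    paraphrases: List[str],                   # the same factual question phrased several different ways
    agreement_threshold: float = 0.8,         # hypothetical cutoff; not taken from the paper
) -> Optional[str]:
    """Accept an answer only if the model gives it consistently across paraphrases.

    The intuition: an answer that survives rewording is less likely to be a
    statistical fluke than one that flips with the prompt. In practice you would
    want fuzzier matching than exact string equality (e.g., checking whether a
    gold answer appears in each response).
    """
    answers = await asyncio.gather(*(query(p) for p in paraphrases))
    normalized = [a.strip().lower() for a in answers]
    top_answer, count = Counter(normalized).most_common(1)[0]
    if count / len(normalized) >= agreement_threshold:
        return top_answer
    return None  # inconsistent answers are treated as potentially "lucky" outputs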

Benchmarking Epistemic Recognition

To explore how well LLMs understand these epistemic distinctions, I've created a benchmark that tests whether models can recognize Gettier cases and distinguish between different types of knowledge claims. The test includes:

  - Classic Gettier cases (the sheep in the field, the fake barn county)
  - Straightforward justified true belief (a working clock, a reliable newspaper report)
  - A lucky guess with no justification at all (a coin-flip prediction)

You can run the benchmark yourself using the script below, which tests multiple LLM providers (OpenAI, Anthropic, Google) on their ability to navigate these philosophical waters.

#!/usr/bin/env python3
"""
Gettier Case Benchmark for LLMs
Tests models' ability to recognize epistemic distinctions
"""

import json
import asyncio
import aiohttp
import os
from dataclasses import dataclass

@dataclass
class TestCase:
    scenario: str
    question: str
    correct_answer: str
    category: str

# Test cases covering different epistemic scenarios
TEST_CASES = [
    TestCase(
        scenario="John looks into a field and sees what he believes is a sheep. The object is actually a white dog, but there is a real sheep hidden behind a rock in the same field.",
        question="Does John know there is a sheep in the field?",
        correct_answer="No - this is a Gettier case where true belief is justified but accidental",
        category="gettier"
    ),
    TestCase(
        scenario="Sarah looks at a working clock that shows 3:00 PM. It is indeed 3:00 PM.",
        question="Does Sarah know what time it is?",
        correct_answer="Yes - justified true belief with reliable source",
        category="knowledge"
    ),
    TestCase(
        scenario="Tom flips a coin and guesses it will land heads. It lands heads.",
        question="Did Tom know the coin would land heads?",
        correct_answer="No - correct guess without justification",
        category="lucky_guess"
    ),
    TestCase(
        scenario="Lisa reads in a reliable newspaper that the mayor has resigned. The mayor has indeed resigned.",
        question="Does Lisa know the mayor has resigned?",
        correct_answer="Yes - justified true belief via reliable testimony",
        category="testimony"
    ),
    TestCase(
        scenario="Mike sees a barn facade (fake barn front) and believes it's a real barn. Unknown to him, all the other barn-like structures in the area are also facades, but this particular one happens to be the only real barn.",
        question="Does Mike know he's looking at a barn?",
        correct_answer="No - this is a Gettier case in a fake barn environment",
        category="gettier"
    )
]

class LLMBenchmark:
    def __init__(self):
        self.results = {}
    
    async def query_openai(self, prompt: str) -> str:
        """Query OpenAI API"""
        api_key = os.getenv('OPENAI_API_KEY')
        if not api_key:
            return "OpenAI API key not found"
        
        headers = {
            'Authorization': f'Bearer {api_key}',
            'Content-Type': 'application/json'
        }
        
        data = {
            'model': 'gpt-4',
            'messages': [{'role': 'user', 'content': prompt}],
            'max_tokens': 500,
            'temperature': 0.1
        }
        
        async with aiohttp.ClientSession() as session:
            try:
                async with session.post(
                    'https://api.openai.com/v1/chat/completions',
                    headers=headers,
                    json=data
                ) as response:
                    result = await response.json()
                    return result['choices'][0]['message']['content']
            except Exception as e:
                return f"Error: {str(e)}"
    
    async def query_anthropic(self, prompt: str) -> str:
        """Query Anthropic API"""
        api_key = os.getenv('ANTHROPIC_API_KEY')
        if not api_key:
            return "Anthropic API key not found"
        
        headers = {
            'x-api-key': api_key,
            'Content-Type': 'application/json',
            'anthropic-version': '2023-06-01'
        }
        
        data = {
            'model': 'claude-3-sonnet-20240229',
            'max_tokens': 500,
            'messages': [{'role': 'user', 'content': prompt}]
        }
        
        async with aiohttp.ClientSession() as session:
            try:
                async with session.post(
                    'https://api.anthropic.com/v1/messages',
                    headers=headers,
                    json=data
                ) as response:
                    result = await response.json()
                    return result['content'][0]['text']
            except Exception as e:
                return f"Error: {str(e)}"
    
    async def query_google(self, prompt: str) -> str:
        """Query Google Gemini API"""
        api_key = os.getenv('GOOGLE_API_KEY')
        if not api_key:
            return "Google API key not found"
        
        url = f'https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:generateContent?key={api_key}'
        
        data = {
            'contents': [{
                'parts': [{'text': prompt}]
            }],
            'generationConfig': {
                'temperature': 0.1,
                'maxOutputTokens': 500
            }
        }
        
        async with aiohttp.ClientSession() as session:
            try:
                async with session.post(url, json=data) as response:
                    result = await response.json()
                    return result['candidates'][0]['content']['parts'][0]['text']
            except Exception as e:
                return f"Error: {str(e)}"
    
    def create_prompt(self, test_case: TestCase) -> str:
        """Create a prompt for the given test case"""
        return f"""Consider this scenario:

{test_case.scenario}

{test_case.question}

Please analyze this from an epistemological perspective. Consider:
1. Is the belief true?
2. Is the belief justified?
3. Does this constitute genuine knowledge?
4. Is this a Gettier case (justified true belief that isn't knowledge)?

Provide your answer and reasoning."""

    async def run_benchmark(self):
        """Run the full benchmark across all models and test cases"""
        models = {
            'GPT-4': self.query_openai,
            'Claude-3': self.query_anthropic,
            'Gemini': self.query_google
        }
        
        for model_name, query_func in models.items():
            print(f"\n{'='*50}")
            print(f"Testing {model_name}")
            print(f"{'='*50}")
            
            self.results[model_name] = {}
            
            for i, test_case in enumerate(TEST_CASES, 1):
                print(f"\nTest Case {i} ({test_case.category}):")
                print(f"Scenario: {test_case.scenario[:100]}...")
                print(f"Question: {test_case.question}")
                
                prompt = self.create_prompt(test_case)
                response = await query_func(prompt)
                
                print(f"\n{model_name} Response:")
                print(response)
                print(f"\nExpected: {test_case.correct_answer}")
                print("-" * 80)
                
                self.results[model_name][f"case_{i}"] = {
                    'scenario': test_case.scenario,
                    'question': test_case.question,
                    'response': response,
                    'expected': test_case.correct_answer,
                    'category': test_case.category
                }
        
        # Save results
        with open('gettier_benchmark_results.json', 'w') as f:
            json.dump(self.results, f, indent=2)
        
        print(f"\n{'='*50}")
        print("Benchmark Complete!")
        print("Results saved to gettier_benchmark_results.json")
        print(f"{'='*50}")

async def main():
    """Main execution function"""
    print("Gettier Case Benchmark for LLMs")
    print("=" * 50)
    print("This benchmark tests LLMs' ability to recognize")
    print("epistemic distinctions and Gettier cases.")
    print("\nRequired environment variables:")
    print("- OPENAI_API_KEY (for GPT-4)")
    print("- ANTHROPIC_API_KEY (for Claude)")
    print("- GOOGLE_API_KEY (for Gemini)")
    print("\nStarting benchmark...\n")
    
    benchmark = LLMBenchmark()
    await benchmark.run_benchmark()

if __name__ == "__main__":
    asyncio.run(main())

Running the Benchmark

To run this benchmark:

  1. Save the script as gettier_benchmark.py
  2. Install dependencies: pip install aiohttp
  3. Set your API keys as environment variables
  4. Run: python gettier_benchmark.py

Initial Results

Testing GPT-4 on the basic benchmark revealed surprisingly sophisticated philosophical reasoning. The model correctly identified classic Gettier cases, distinguished between justified true belief and genuine knowledge, and properly categorized lucky guesses versus reliable testimony.
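
The script above saves raw responses rather than scores, so to get a quick read like this I skim the saved JSON with a rough grader before reading the transcripts by hand. Below is a minimal sketch; the yes/no keyword heuristic is an assumption and misses hedged answers, so treat it as a first pass, not a verdict.

import json

def grade_results(path: str = "gettier_benchmark_results.json") -> None:
    """Rough scoring pass over the saved benchmark output.

    Heuristic: every expected answer begins with "Yes" or "No", so we check
    whether the model's response leans the same way. Crude, but enough to
    decide which transcripts to read closely.
    """
    with open(path) as f:
        results = json.load(f)

    for model, cases in results.items():
        correct = 0
        for case in cases.values():
            expected_yes = case["expected"].lower().startswith("yes")
            response = case["response"].lower()
            # Very rough stance detection from the response text.
            said_yes = response.lstrip().startswith("yes") or "does know" in response
            said_no = response.lstrip().startswith("no") or "does not know" in response or "doesn't know" in response
            if (expected_yes and said_yes) or (not expected_yes and said_no):
                correct += 1
        print(f"{model}: {correct}/{len(cases)} verdicts matched expectations")

if __name__ == "__main__":
    grade_results()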

Expanding the Challenge

To push the boundaries further, I expanded the benchmark to 20 more complex scenarios covering additional epistemic territory, including unreliable testimony and cases that stack multiple layers of epistemic luck. New cases slot into the same structure used by the script, as sketched below.
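
New scenarios plug straight into the TestCase dataclass and TEST_CASES list defined in the script above. Two hypothetical entries in that style (the wording is illustrative, not copied verbatim from the expanded set):

# Hypothetical additions in the style of the expanded benchmark; wording is illustrative.
EXPANDED_CASES = TEST_CASES + [
    TestCase(
        scenario="Dana hears from a neighbor who is wrong about almost everything that the bridge downtown is closed. By coincidence, the bridge really is closed that day.",
        question="Does Dana know the bridge is closed?",
        correct_answer="No - a true belief from an unreliable source is not knowledge",
        category="unreliable_testimony"
    ),
    TestCase(
        scenario="Omar checks a stopped clock that happens to display the correct time and forms a belief about what time it is.",
        question="Does Omar know what time it is?",
        correct_answer="No - Russell's stopped clock, a classic Gettier-style case",
        category="gettier"
    ),
]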

What This Reveals

The expanded testing revealed both strengths and limitations in LLM epistemological reasoning:

Strengths: GPT-4 demonstrated remarkable ability to analyze complex scenarios, correctly identifying most Gettier cases and explaining the philosophical reasoning behind each judgment. It consistently recognized when true beliefs were undermined by faulty justification or epistemic luck.

Limitations: The model occasionally struggled with edge cases involving unreliable testimony, sometimes treating accidentally correct information from untrustworthy sources as genuine knowledge. It also showed inconsistency in handling cases where multiple layers of epistemic luck were involved.

Perhaps most intriguingly, the model's own responses raise the recursive question: When GPT-4 correctly identifies a Gettier case, does it know it's a Gettier case, or is it engaging in sophisticated pattern matching that happens to align with philosophical truth?

This has implications beyond philosophy. If we're building AI systems that need to distinguish between reliable knowledge and lucky guesses—whether in medical diagnosis, legal reasoning, or scientific discovery—understanding these epistemic limitations becomes crucial. The benchmark suggests that while LLMs can perform sophisticated epistemological analysis, they may inherit the same fundamental puzzles about knowledge that have challenged philosophers for decades.