
Building Confidence: Making GPT-4V More Honest About What It Can Read


I’ve been working with GPT-4V lately to extract data from industrial equipment nameplates (you know, those metal plates riveted to pumps with all the crucial specs). While the model is impressive, I quickly discovered something interesting: when it can’t read something clearly, it tends to choose hallucination over outputting null values.

The Problem with Blind Trust

GPT-4V with structured outputs seems magical at first. Feed it a beautiful image of a nameplate (ha), define your schema, and boom - structured data comes out. But in the real world, nameplates are routinely scratched, rusted, dirty, or worse.

Initially, I approached this like any other structured data problem - build a Pydantic model and let GPT-4V fill it in:

from typing import Optional

from pydantic import BaseModel

class FlygtNamePlate(BaseModel):
    serial_number: Optional[str] = None
    impeller_no: Optional[int] = None
    power: Optional[float] = None
    # ... you get the idea

Simple, right? Well, here’s the thing: as soon as you feed it a poor-quality example, it starts doing some weird things. Ideally, Optional[...] = None should allow a field to be either missing entirely or None, and I’d expect the model to simply skip data it can’t read. Instead, it tends to hallucinate values where numbers are missing, mix up which number belongs to which field (extracting into the wrong one), and so on.

This led me to introduce an overall confidence_score and filter away poor extractions - but a single score throws out the whole extraction, even when some of the individual data points were perfectly readable. What if we want to keep those?
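For reference, that first pass looked roughly like this - the keep helper and the 0.6 cutoff are illustrative values from my experiments, not anything tuned:

class FlygtNamePlate(BaseModel):
    serial_number: Optional[str] = None
    impeller_no: Optional[int] = None
    power: Optional[float] = None
    # One self-reported score for the extraction as a whole
    confidence_score: float = 0.0

def keep(plate: FlygtNamePlate, threshold: float = 0.6) -> bool:
    # Drop the entire extraction when the overall score is too low
    return plate.confidence_score >= threshold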

Teaching GPT-4V to Express Doubt

What if instead of forcing our model to make binary choices, we let it tell us how confident it is about each piece of data? Think of it like a support engineer saying “I’m pretty sure that’s a 7, but this rust spot is making it hard to tell.”

Here’s what I came up with:

from enum import Enum
from typing import Generic, Optional, TypeVar

from pydantic import BaseModel

T = TypeVar("T")

class ConfidenceLevel(float, Enum):
    VERY_LOW = 0.2    # "Is that a 3 or an 8? Your guess is as good as mine"
    LOW = 0.4         # "I think it's a 3, but don't quote me on that"
    MEDIUM = 0.6      # "Pretty sure it's a 3"
    HIGH = 0.8        # "Yeah, that's definitely a 3"
    VERY_HIGH = 1.0   # "I'd bet my next paycheck that's a 3"

class ConfidentField(BaseModel, Generic[T]):
    value: Optional[T] = None
    confidence: ConfidenceLevel

    def is_reliable(self, threshold: float = 0.6) -> bool:
        return self.confidence.value >= threshold

class FlygtNamePlate(BaseModel):
    # Every field now comes back with a confidence attached
    serial_number: ConfidentField[str]
    impeller_no: ConfidentField[int]
    power: ConfidentField[float]
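For completeness, here’s roughly how I wire this into the extraction call. This is a minimal sketch using the OpenAI SDK’s structured-output parse helper; the model name, prompt wording, and image handling are placeholders from my setup rather than anything prescribed:

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_nameplate(image_path: str) -> FlygtNamePlate:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    completion = client.beta.chat.completions.parse(
        model="gpt-4o",  # placeholder; use whichever vision-capable model you have access to
        messages=[
            {"role": "system", "content": (
                "Extract the nameplate fields. For each field, report a confidence level "
                "and leave the value as null if it is not clearly legible."
            )},
            {"role": "user", "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ]},
        ],
        response_format=FlygtNamePlate,
    )
    return completion.choices[0].message.parsed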

What Does This Look Like in Practice?

Let’s look at three real-world scenarios I encounter all the time:

Field         | Pristine Nameplate | That One From 1995 | Found It In A Swamp
Serial Number | 0.9 (3127.160)     | 0.6 (3127.???)     | 0.2 (???.???)
Impeller No   | 1.0 (437)          | 0.8 (437)          | 0.3 (???)
Power         | 0.9 (5.5 kW)       | 0.4 (None)         | 0.2 (None)

Now we’re getting somewhere! Instead of guessing whether that blurry number is right, we can:

  • Only use high-confidence data for critical operations
  • Flag low-confidence readings for human verification (see the sketch after this list)
  • Track which images have poor quality and analyze trends
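As a concrete example of the second point, here’s a small helper for collecting the fields that need a human look. The 0.6 default mirrors is_reliable, the field iteration leans on Pydantic v2’s model_fields, and extract_nameplate is the helper sketched above (the file name is made up):

def fields_needing_review(plate: FlygtNamePlate, threshold: float = 0.6) -> dict[str, ConfidentField]:
    """Collect every field whose confidence falls below the threshold."""
    flagged = {}
    for name in type(plate).model_fields:  # Pydantic v2: names of the declared fields
        field = getattr(plate, name)
        if isinstance(field, ConfidentField) and not field.is_reliable(threshold):
            flagged[name] = field
    return flagged

# Anything flagged goes to a person instead of straight into our records
plate = extract_nameplate("pump_photo.jpg")  # hypothetical file name
for name, field in fields_needing_review(plate).items():
    print(f"Verify {name}: model read {field.value!r} at confidence {field.confidence.value}")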

Where Do We Go From Here?

This approach has changed how we handle nameplate data extraction in a few key ways:

  1. No More Binary Thinking: Data quality exists on a spectrum, and now our models reflect that
  2. Risk-Appropriate Thresholds: Need to order a $50k pump? Maybe set that confidence threshold to 0.9 (sketched after this list)
  3. Better Feedback Loops: We can tell immediately if our photo quality is good enough
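On the second point, the threshold doesn’t have to be global. Here’s a small sketch of what I mean - the action names and cutoffs are made up for illustration, not values we’ve formally calibrated:

# Hypothetical cutoffs per downstream action - tune these to your own risk tolerance
ACTION_THRESHOLDS = {
    "order_replacement_pump": 0.9,   # expensive mistake, demand near-certainty
    "update_asset_register": 0.6,    # moderate risk, medium confidence is fine
    "log_for_trend_analysis": 0.4,   # low risk, take what we can get
}

def allowed(field: ConfidentField, action: str) -> bool:
    # A field is only usable if it clears the bar for this particular action
    return field.is_reliable(ACTION_THRESHOLDS[action])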

I’m still iterating on this solution, but it’s already saved us from a few potentially expensive mistakes. The next step is probably integrating this more tightly with our maintenance workflows - but that’s a topic for another post.

What do you think? Are you using GPT-4V for industrial applications? I’d love to hear how others are handling these kinds of real-world challenges.