Building Confidence: Making GPT-4V More Honest About What It Can Read
I’ve been working with GPT-4V lately to extract data from industrial equipment nameplates (you know, those metal plates riveted to pumps with all the crucial specs). While the model is impressive, I quickly discovered something interesting: when it can’t read something clearly, it tends to choose hallucination over outputting null values.
The Problem with Blind Trust
GPT-4V with structured outputs seems magical at first. Feed it a beautiful image of a nameplate (ha), define your schema, and boom - structured data comes out. But real-world nameplates are frequently scratched, rusted, or caked in dirt.
Initially, I approached this like any other structured data problem - build a Pydantic model and let GPT-4V fill it in:
from typing import Optional

from pydantic import BaseModel

class FlygtNamePlate(BaseModel):
    serial_number: Optional[str] = None
    impeller_no: Optional[int] = None
    power: Optional[float] = None
    # ... you get the idea
Simple, right? Well, here’s the thing - as soon as you feed it a poor-quality image, it starts doing some weird things. Ideally, Optional[...] = None should allow a field to be missing entirely or come back as None, and I would expect the model to simply skip data it can’t read. Instead, it hallucinates values where numbers are missing, confuses which number belongs to which field (extracting into the wrong place), and so on.
This led me to introduce an overall confidence_score and filter away poor extractions wholesale - but what if we still want to keep the individual data points that weren’t difficult to read?
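For illustration, that interim approach looked roughly like the sketch below. The ScoredNamePlate name and the 0.6 cutoff are placeholders rather than our production values; the point is that a single score gates the whole extraction:

from typing import Optional

from pydantic import BaseModel

class ScoredNamePlate(BaseModel):
    serial_number: Optional[str] = None
    impeller_no: Optional[int] = None
    power: Optional[float] = None
    # One score for the entire nameplate, as reported by the model
    confidence_score: float

def keep_extraction(plate: ScoredNamePlate, cutoff: float = 0.6) -> bool:
    # Coarse filter: one unreadable field can sink an otherwise good nameplate
    return plate.confidence_score >= cutoff

The obvious downside is granularity: a nameplate with one rusted-out digit gets thrown away along with its perfectly legible fields.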
Teaching GPT-4V to Express Doubt
What if instead of forcing our model to make binary choices, we let it tell us how confident it is about each piece of data? Think of it like a support engineer saying “I’m pretty sure that’s a 7, but this rust spot is making it hard to tell.”
Here’s what I came up with:
from enum import Enum
from typing import Generic, Optional, TypeVar

from pydantic import BaseModel

T = TypeVar("T")

class ConfidenceLevel(float, Enum):
    VERY_LOW = 0.2   # "Is that a 3 or an 8? Your guess is as good as mine"
    LOW = 0.4        # "I think it's a 3, but don't quote me on that"
    MEDIUM = 0.6     # "Pretty sure it's a 3"
    HIGH = 0.8       # "Yeah, that's definitely a 3"
    VERY_HIGH = 1.0  # "I'd bet my next paycheck that's a 3"

class ConfidentField(BaseModel, Generic[T]):
    value: Optional[T] = None
    confidence: ConfidenceLevel

    def is_reliable(self, threshold: float = 0.6) -> bool:
        return self.confidence.value >= threshold

class FlygtNamePlate(BaseModel):
    # Every field now carries both a value and a confidence
    serial_number: ConfidentField[str]
    impeller_no: ConfidentField[int]
    power: ConfidentField[float]
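For completeness, here’s roughly how a schema like this gets used as a structured-output target. This is a minimal sketch assuming the official openai Python client and a base64-encoded photo; the model name, prompt wording, and file path are placeholders rather than our actual setup:

import base64

from openai import OpenAI

client = OpenAI()

with open("nameplate.jpg", "rb") as f:
    nameplate_b64 = base64.b64encode(f.read()).decode()

completion = client.beta.chat.completions.parse(
    model="gpt-4o",  # any vision model that supports structured outputs
    messages=[
        {
            "role": "system",
            "content": "Extract the nameplate fields. If a value is unreadable, "
                       "leave it null and report a low confidence.",
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{nameplate_b64}"}},
            ],
        },
    ],
    response_format=FlygtNamePlate,
)

plate = completion.choices[0].message.parsed  # a FlygtNamePlate instance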
What Does This Look Like in Practice?
Let’s look at three real-world scenarios I encounter all the time - each cell below shows the confidence score, with the extracted value in parentheses:
| Field | Pristine Nameplate | That One From 1995 | Found It In A Swamp |
|---|---|---|---|
| Serial Number | 0.9 (3127.160) | 0.6 (3127.???) | 0.2 (???.???) |
| Impeller No | 1.0 (437) | 0.8 (437) | 0.3 (???) |
| Power | 0.9 (5.5 kW) | 0.4 (None) | 0.2 (None) |
Now we’re getting somewhere! Instead of guessing whether that blurry number is right, we can:
- Only use high-confidence data for critical operations
- Flag low-confidence readings for human verification (see the sketch after this list)
- Track which images have poor quality and analyze trends
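Here’s roughly what that triage looks like in code. The split_by_reliability helper is illustrative (it isn’t from any library), but it leans only on the ConfidentField.is_reliable method defined above:

def split_by_reliability(plate: FlygtNamePlate, threshold: float = 0.6):
    """Separate fields we trust from fields a human should double-check."""
    reliable: dict[str, object] = {}
    needs_review: list[str] = []
    for name in FlygtNamePlate.model_fields:  # Pydantic v2: declared field names
        field = getattr(plate, name)
        if field.is_reliable(threshold):
            reliable[name] = field.value
        else:
            needs_review.append(name)
    return reliable, needs_review

reliable, needs_review = split_by_reliability(plate)
if needs_review:
    print(f"Flag for manual check: {', '.join(needs_review)}")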
Where Do We Go From Here?
This approach has changed how we handle nameplate data extraction in a few key ways:
- No More Binary Thinking: Data quality exists on a spectrum, and now our models reflect that
- Risk-Appropriate Thresholds: Need to order a $50k pump? Maybe set that confidence threshold to 0.9 (see the example after this list)
- Better Feedback Loops: We can tell immediately if our photo quality is good enough
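To make the threshold point concrete, here’s one way it could look. The action names and the exact numbers are invented for illustration; the idea is simply that riskier actions demand more confidence:

# Hypothetical risk tiers - riskier downstream actions demand higher confidence
CONFIDENCE_THRESHOLDS = {
    "log_reading": 0.4,          # low stakes: just record the value
    "schedule_inspection": 0.6,
    "order_replacement": 0.9,    # $50k pump on the line: be very sure
}

def can_proceed(field: ConfidentField, action: str) -> bool:
    return field.is_reliable(CONFIDENCE_THRESHOLDS[action])

With a mapping like this, can_proceed(plate.serial_number, "order_replacement") only passes for near-pristine reads, while routine logging tolerates a blurrier photo.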
I’m still iterating on this solution, but it’s already saved us from a few potentially expensive mistakes. The next step is probably integrating this more tightly with our maintenance workflows - but that’s a topic for another post.
What do you think? Are you using GPT-4V for industrial applications? I’d love to hear how others are handling these kinds of real-world challenges.