Vision-Based Robotic Fault Detection (Microsoft Hackathon)
Using a VLM to monitor manufacturing robot arms and autonomously detect and respond to faults
In manufacturing environments, robotic arms execute repetitive pick-and-place or assembly workflows. When something goes wrong — a dropped part, a misaligned grip, an unexpected obstruction — the line stops. Traditional fault detection relies on narrow sensor thresholds that miss novel failure modes. This project replaces that with a vision-language model that watches the robot's video feed, understands what should be happening, and takes corrective action when it doesn't.
System Architecture
End-to-end pipeline from camera feed to corrective action.
A camera streams video of the robot arm during its workflow. Frames are sampled at a fixed rate and sent to a vision-language model along with a reference sequence — a downsampled recording of the correct execution flow. The VLM compares what it sees against what it expects and outputs a structured assessment: whether a fault has occurred, what type, and what action to take.
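A minimal sketch of the sampling and response-parsing side of this loop, assuming the VLM is instructed to reply in JSON. The `Assessment` fields and the fail-safe fallback are illustrative assumptions, not the project's actual schema:

```python
import json
from dataclasses import dataclass

@dataclass
class Assessment:
    fault_detected: bool
    fault_type: str    # e.g. "dropped_part", "none" (illustrative labels)
    action: str        # "continue", "retry", or "stop"
    description: str

def sample_frames(frames, every_n=15):
    """Downsample a frame buffer to a fixed rate before sending to the VLM."""
    return frames[::every_n]

def parse_assessment(raw: str) -> Assessment:
    """Parse the VLM's JSON reply into a structured assessment.

    Fails safe: a malformed reply is treated as a fault with action 'stop',
    so a confused model never silently keeps the line running.
    """
    try:
        data = json.loads(raw)
        return Assessment(
            fault_detected=bool(data["fault_detected"]),
            fault_type=data.get("fault_type", "unknown"),
            action=data.get("action", "stop"),
            description=data.get("description", ""),
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        return Assessment(True, "unparseable_reply", "stop", raw[:200])
```

Parsing defensively matters here because the assessment drives actuation: a reply the system cannot interpret defaults to the safest command.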
Fault Detection Approach
How the VLM identifies when something has gone wrong.
The key insight is providing the VLM with a visual example of correct execution as context. Rather than training a specialized model to recognize specific failure modes, we leverage the VLM's general visual understanding — it can reason about what it sees relative to what it expects. This makes the system robust to novel fault types that weren't anticipated during setup, since the model understands the intent of the workflow, not just a checklist of known errors.
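The reference-as-context idea can be sketched as a single multimodal prompt: reference frames first, then the live frames, with instructions to compare the two. The message structure below mimics common chat-style multimodal APIs and is an assumption, not the project's exact prompt:

```python
def build_prompt(reference_frames, live_frames):
    """Assemble one multimodal prompt giving the VLM a visual example of
    correct execution before showing it the live feed to judge."""
    content = [{"type": "text",
                "text": "Reference: frames of this workflow executing correctly."}]
    # Few-shot visual context: the correct run.
    content += [{"type": "image", "image": f} for f in reference_frames]
    content.append({"type": "text",
                    "text": ("Live: frames of the current run. Compare against the "
                             "reference and reply with JSON keys: fault_detected, "
                             "fault_type, action (continue|retry|stop), description.")})
    # The frames to assess.
    content += [{"type": "image", "image": f} for f in live_frames]
    return [{"role": "user", "content": content}]
```

Because the reference encodes the workflow's intent rather than a list of known failure signatures, the same prompt generalizes to faults nobody enumerated at setup time.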
Automated Response
Decision flow from fault detection to corrective action.
When the VLM detects a fault, it classifies whether the issue is retryable — for example, a slightly misaligned pick that could succeed on a second attempt — or non-retryable, such as a broken part or a safety hazard. For retryable faults, the system issues a retry command to the robot arm. If the retry also fails, or if the fault was non-retryable from the start, the system issues a stop command and raises an alert to a technician.
The alert includes the VLM's description of the fault — what went wrong, when it happened, and what the robot was attempting. This gives the technician immediate context without needing to review footage, reducing diagnosis time and getting the line back up faster.
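The escalation policy above (one retry for retryable faults, otherwise stop and alert) can be sketched as a small decision function; the alert field names are illustrative, not the project's actual alert format:

```python
def respond_to_fault(fault_type, retryable, already_retried,
                     description, timestamp, task):
    """Decide the command after a detected fault.

    Retryable faults get one retry. A failed retry, or a non-retryable
    fault, stops the robot and raises a technician alert carrying the
    VLM's description of what went wrong, when, and what was attempted.
    """
    if retryable and not already_retried:
        return "retry", None
    alert = {
        "fault_type": fault_type,
        "what_happened": description,          # VLM's fault description
        "when": timestamp,                     # time of the fault
        "robot_was_attempting": task,          # step being executed
    }
    return "stop", alert
```

Bundling the VLM's natural-language description into the alert is what spares the technician a trip through the footage: the diagnosis arrives with the stop command.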
Demo
The system in action during the Microsoft Hackathon.