Vision-Based Robotic Fault Detection (Microsoft Hackathon)
Using a VLM to monitor manufacturing robot arms and autonomously detect and respond to faults
In manufacturing environments, robotic arms execute repetitive pick-and-place or assembly workflows. When something goes wrong — a dropped part, a misaligned grip, an unexpected obstruction — the line stops. Traditional fault detection relies on narrow sensor thresholds that miss novel failure modes. This project replaces that with a vision-language model that watches the robot's video feed, understands what should be happening, and takes corrective action when it doesn't.
System Architecture
End-to-end pipeline from camera feed to corrective action.
A camera streams video of the robot arm during its workflow. Frames are sampled at a fixed rate and sent to a vision-language model along with a reference sequence — a downsampled recording of the correct execution flow. The VLM compares what it sees against what it expects and outputs a structured assessment: whether a fault has occurred, what type, and what action to take.
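A minimal sketch of the sampling and response-parsing side of this loop, assuming the VLM is instructed to reply in JSON. The `Assessment` fields and the fail-safe fallback are illustrative assumptions, not the project's actual schema:

```python
import json
from dataclasses import dataclass

@dataclass
class Assessment:
    fault_detected: bool
    fault_type: str    # e.g. "dropped_part", "none" (illustrative labels)
    action: str        # "continue", "retry", or "stop"
    description: str

def sample_frames(frames, every_n=15):
    """Downsample a frame buffer to a fixed rate before sending to the VLM."""
    return frames[::every_n]

def parse_assessment(raw: str) -> Assessment:
    """Parse the VLM's JSON reply into a structured assessment.

    Fails safe: a malformed reply is treated as a fault with action 'stop',
    so a confused model never silently keeps the line running.
    """
    try:
        data = json.loads(raw)
        return Assessment(
            fault_detected=bool(data["fault_detected"]),
            fault_type=data.get("fault_type", "unknown"),
            action=data.get("action", "stop"),
            description=data.get("description", ""),
        )
    except (json.JSONDecodeError, KeyError, TypeError):
        return Assessment(True, "unparseable_reply", "stop", raw[:200])
```

Parsing defensively matters here because the assessment drives actuation: a reply the system cannot interpret defaults to the safest command.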
Fault Detection Approach
How the VLM identifies when something has gone wrong.
The key insight is providing the VLM with a visual example of correct execution as context. Rather than training a specialized model to recognize specific failure modes, we leverage the VLM's general visual understanding — it can reason about what it sees relative to what it expects. This makes the system robust to novel fault types that weren't anticipated during setup, since the model understands the intent of the workflow, not just a checklist of known errors.
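The reference-as-context idea can be sketched as a single multimodal prompt: reference frames first, then the live frames, with instructions to compare the two. The message structure below mimics common chat-style multimodal APIs and is an assumption, not the project's exact prompt:

```python
def build_prompt(reference_frames, live_frames):
    """Assemble one multimodal prompt giving the VLM a visual example of
    correct execution before showing it the live feed to judge."""
    content = [{"type": "text",
                "text": "Reference: frames of this workflow executing correctly."}]
    # Few-shot visual context: the correct run.
    content += [{"type": "image", "image": f} for f in reference_frames]
    content.append({"type": "text",
                    "text": ("Live: frames of the current run. Compare against the "
                             "reference and reply with JSON keys: fault_detected, "
                             "fault_type, action (continue|retry|stop), description.")})
    # The frames to assess.
    content += [{"type": "image", "image": f} for f in live_frames]
    return [{"role": "user", "content": content}]
```

Because the reference encodes the workflow's intent rather than a list of known failure signatures, the same prompt generalizes to faults nobody enumerated at setup time.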
Automated Response
Decision flow from fault detection to corrective action.
When the VLM detects a fault, it classifies whether the issue is retryable — for example, a slightly misaligned pick that could succeed on a second attempt — or non-retryable, such as a broken part or a safety hazard. For retryable faults, the system issues a retry command to the robot arm. If the retry also fails, or if the fault was non-retryable from the start, the system issues a stop command and raises an alert to a technician.
The alert includes the VLM's description of the fault — what went wrong, when it happened, and what the robot was attempting. This gives the technician immediate context without needing to review footage, reducing diagnosis time and getting the line back up faster.
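The escalation policy above (one retry for retryable faults, otherwise stop and alert) can be sketched as a small decision function; the alert field names are illustrative, not the project's actual alert format:

```python
def respond_to_fault(fault_type, retryable, already_retried,
                     description, timestamp, task):
    """Decide the command after a detected fault.

    Retryable faults get one retry. A failed retry, or a non-retryable
    fault, stops the robot and raises a technician alert carrying the
    VLM's description of what went wrong, when, and what was attempted.
    """
    if retryable and not already_retried:
        return "retry", None
    alert = {
        "fault_type": fault_type,
        "what_happened": description,          # VLM's fault description
        "when": timestamp,                     # time of the fault
        "robot_was_attempting": task,          # step being executed
    }
    return "stop", alert
```

Bundling the VLM's natural-language description into the alert is what spares the technician a trip through the footage: the diagnosis arrives with the stop command.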
Demo
The system in action during the Microsoft Hackathon.