Abstract
Context-aware systems can support humans at work by automatically performing quality control, providing assistance, or generating instructions and documentation for later use. However, adapting such intelligent systems to custom use cases demands training data, expertise, and effort. With the dissemination of Vision Language Models (VLMs), recognition capabilities are becoming more accessible. We explore the use of readily available VLMs for understanding egocentric video footage of common manual tasks in production environments. Our results demonstrate the feasibility of using VLMs in such contexts.

This work is licensed under a Creative Commons Attribution 4.0 International License.
Copyright (c) 2025 Valentin Knoben, Julia Kramme, Christian Wurll
