Exploring Vision Language Models for Egocentric Action Localization

Keywords

Vision language model
action recognition
temporal localization

Abstract

Context-aware systems can support humans at work by automatically performing quality control, providing assistance, or generating instructions and documentation for later use. However, adapting such intelligent systems to custom use cases demands training data, expertise, and effort. With the proliferation of Vision Language Models (VLMs), recognition capabilities are becoming more accessible. We explore the use of readily available VLMs for understanding egocentric video footage of common manual tasks in production environments. Our results demonstrate the feasibility of using VLMs in such contexts.

https://doi.org/10.60643/urai.v2025p23

This work is licensed under a Creative Commons Attribution 4.0 International License.

Copyright (c) 2025 Valentin Knoben, Julia Kramme, Christian Wurll