OmniParser V2

Turn any LLM into a Computer Use Agent

Website: microsoft.com

What it is

OmniParser "tokenizes" UI screenshots, converting them from pixel space into structured elements that LLMs can interpret. Given this set of parsed interactable elements, an LLM can perform retrieval-based next-action prediction instead of reasoning over raw pixels.
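As a rough illustration of what "structured elements" could look like to an LLM, the sketch below represents each parsed element as a small record and serializes the interactable ones into a numbered text list. The `ParsedElement` class, its fields, and `to_prompt` are hypothetical names chosen for this example; they are not OmniParser's actual output schema or API.

```python
from dataclasses import dataclass


@dataclass
class ParsedElement:
    """Hypothetical record for one parsed UI element (not OmniParser's real schema)."""
    element_id: int
    kind: str       # e.g. "icon" or "text"
    caption: str    # short description of what the element is or does
    bbox: tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to 0..1
    interactable: bool


def to_prompt(elements: list[ParsedElement]) -> str:
    """Serialize the interactable elements into a numbered list an LLM can read."""
    lines = [
        f"[{e.element_id}] {e.kind}: {e.caption}"
        for e in elements
        if e.interactable
    ]
    return "\n".join(lines)


# Made-up elements standing in for the result of parsing one screenshot.
elements = [
    ParsedElement(0, "icon", "Settings gear", (0.90, 0.02, 0.95, 0.07), True),
    ParsedElement(1, "text", "Welcome back", (0.10, 0.10, 0.40, 0.15), False),
    ParsedElement(2, "icon", "Search", (0.80, 0.02, 0.85, 0.07), True),
]
print(to_prompt(elements))
```

The LLM then only has to pick an element ID from a short list, which is the "retrieval" framing of next-action prediction described above.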

Intent

I need it when

Build GUI automation systems that respect responsible AI principles and avoid inferring sensitive user attributes

OmniParser V2's icon caption model is trained with Responsible AI data to avoid inferring sensitive attributes (race, religion, etc.) from images. The system includes threat modeling, sandbox containers, and safety guidance, enabling organizations to deploy GUI agents aligned with Microsoft AI principles and ethical AI practices.

Rapidly prototype and test different LLM-based GUI automation agent configurations

OmniTool, the dockerized Windows system bundled with OmniParser V2, provides out-of-the-box integration with multiple state-of-the-art LLMs and includes essential tools for screen understanding, grounding, action planning, and execution—allowing developers to experiment with different agent settings without building infrastructure from scratch.

Improve accuracy and speed of GUI agents when detecting small UI elements and icons

Paired with GPT-4o, OmniParser V2 reaches 39.6% average accuracy on the ScreenSpot Pro grounding benchmark, compared with 0.8% for GPT-4o alone. The gain comes from training on larger datasets for interactive element detection and icon captioning. V2 also reduces latency by 60% versus V1, enabling faster, more reliable detection of the tiny interactable elements that general-purpose LLMs struggle to locate.

Enable LLMs to understand and interact with graphical user interfaces for automated screen navigation

OmniParser V2 converts UI screenshots into structured, LLM-interpretable elements by tokenizing pixels into interactable components. This allows any LLM (for example GPT-4o, or models from DeepSeek, Qwen, and Anthropic) to reliably identify buttons, icons, and UI regions, then predict and execute the next action. It addresses the core challenge of GUI automation without requiring the model to process raw pixel data.
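The final step, turning the LLM's choice into an executable action, can be sketched as follows. This assumes, hypothetically, that the LLM replies with the ID of a parsed element and that each element carries a bounding box normalized to the 0..1 range. `Element`, `click_point`, `resolve_action`, and the reply format are illustrative names, not part of OmniParser or OmniTool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical parsed element; bbox is (x1, y1, x2, y2) normalized to 0..1."""
    element_id: int
    caption: str
    bbox: tuple[float, float, float, float]


def click_point(element: Element, screen_w: int, screen_h: int) -> tuple[int, int]:
    """Return the pixel coordinates of the center of the element's bounding box."""
    x1, y1, x2, y2 = element.bbox
    cx = (x1 + x2) / 2 * screen_w
    cy = (y1 + y2) / 2 * screen_h
    return round(cx), round(cy)


def resolve_action(reply: dict, elements: list[Element],
                   screen_w: int, screen_h: int) -> dict:
    """Map an assumed LLM reply like {"action": "click", "element_id": 0}
    to a concrete action with pixel coordinates."""
    by_id = {e.element_id: e for e in elements}
    target = by_id.get(reply["element_id"])
    if target is None:
        raise ValueError(f"unknown element_id: {reply['element_id']}")
    x, y = click_point(target, screen_w, screen_h)
    return {"action": reply["action"], "x": x, "y": y}


elements = [Element(0, "Submit button", (0.25, 0.5, 0.75, 0.75))]
action = resolve_action({"action": "click", "element_id": 0}, elements, 1920, 1080)
print(action)  # {'action': 'click', 'x': 960, 'y': 675}
```

Because the model selects an element ID rather than emitting coordinates, the pixel arithmetic stays in ordinary code where it can be validated before anything is clicked.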

Drop

Not a fit when

Users need commercial support or SLAs for production GUI automation systems
Organizations require proprietary licensing or cannot use open-source research code
Users lack the technical expertise to integrate LLMs and parse UI screenshots programmatically
Real-time GUI automation is needed on non-standard or proprietary UI frameworks not covered in the training data
Users need a fully managed, no-code GUI automation platform that requires no model integration

Commercials

Pricing

Pricing not specified