skyDuanXianBing/Open-Interface

Open Interface

Control Your Computer Using LLMs

Open Interface

  • Runs a screenshot-driven desktop agent loop powered by GPT-5, GPT-4o, GPT-4V, Gemini, Claude, Qwen, and compatible OpenAI-style endpoints.
  • Asks the model for at most one next UI action at a time, then executes it with local keyboard and mouse control.
  • Re-observes the screen after each step, optionally verifies visual change locally, and course-corrects until the task is done or safely stopped.

Full Autopilot for All Computers Using LLMs

macOS Linux Windows

Demo 💻

"Solve Today's Wordle"
clipped, 2x

More Demos
  • "Make me a meal plan in Google Docs"
  • "Write a Web App"

Install 💽

macOS
  • Download the macOS binary from the latest release.
  • Unzip the file and move Open Interface to the Applications folder.

Apple Silicon M-Series Macs
  • Open Interface will ask you for Accessibility access to operate your keyboard and mouse for you, and Screen Recording access to take screenshots to assess its progress.
  • If it doesn't, add these permissions manually via System Settings -> Privacy and Security.

Intel Macs
  • Launch the app from the Applications folder.
    You might face the standard Mac "Open Interface cannot be opened" error.


    In that case, press "Cancel".
    Then go to System Preferences -> Security and Privacy -> Open Anyway.


  • Open Interface will also need Accessibility access to operate your keyboard and mouse for you, and Screen Recording access to take screenshots to assess its progress.


  • Lastly, check out the Setup section to connect Open Interface to your preferred LLM provider.

Linux
  • Linux binary has been tested on Ubuntu 20.04 so far.
  • Download the Linux zip file from the latest release.
  • Extract the executable and check out the Setup section to connect Open Interface to LLMs such as GPT-5, GPT-4o, Gemini, Claude, or Qwen.

Windows
  • Windows binary has been tested on Windows 10.
  • Download the Windows zip file from the latest release.
  • Unzip the folder, move the exe to the desired location, double click to open, and voila.
  • Check out the Setup section to connect Open Interface to your preferred LLM provider.

Run as a Script
  • Clone the repo: git clone https://github.com/AmberSahdev/Open-Interface.git
  • Enter the directory: cd Open-Interface
  • Optionally, use a Python virtual environment:
    • Note: pyenv handles the tkinter installation unusually, so you may need to debug it on your own system.
    • pyenv local 3.12.2
    • python -m venv .venv
    • source .venv/bin/activate
  • Install dependencies: pip install -r requirements.txt
  • Run the app: python app/app.py

Setup 🛠️

Set up the OpenAI API key
  • Get your OpenAI API key

    • Open Interface can use OpenAI-compatible models including GPT-4o, GPT-4V, GPT-5, and computer-use-preview, depending on your configuration. API keys can be created in your OpenAI account at platform.openai.com/settings/organization/api-keys.
    • Follow the steps here to add balance to your OpenAI account. Some higher-tier models may require prepaid billing or additional account access.
    • More info
  • Save the API key in Open Interface settings

    • In Open Interface, go to the Settings menu on the top right and enter the key you received from OpenAI into the text field like so:

    Set API key in settings

  • After setting the API key for the first time you'll need to restart the app.

Set up the Google Gemini API key
  • Go to Settings -> Advanced Settings and select the Gemini model you wish to use.
  • Get your Google Gemini API key from https://aistudio.google.com/app/apikey.
  • Save the API key in Open Interface settings.
  • Save the settings and restart the app.

Optional: Set Up a Custom LLM
  • Open Interface supports using other OpenAI API style LLMs (such as Llava) as a backend and can be configured easily in the Advanced Settings window.
  • Enter the custom base url and model name in the Advanced Settings window and the API key in the Settings window as needed.
  • NOTE - If you're using Llama:
    • You may need to enter a random string like "xxx" in the API key input box.
    • You may need to append /v1/ to the base URL.
      Set API key in settings

  • If your LLM does not support an OpenAI style API, you can use a library like this to convert it to one.
  • You will need to restart the app after these changes.
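
The two notes above (a placeholder API key and a trailing /v1/ on the base URL) can be illustrated with a small sketch. Both helper names are hypothetical and are not part of Open Interface, and the localhost URL is only an example of a locally hosted OpenAI-style server.

```python
def normalize_base_url(base_url: str) -> str:
    """Return base_url ending in exactly one '/v1/' segment."""
    trimmed = base_url.strip().rstrip("/")
    if not trimmed.endswith("/v1"):
        trimmed += "/v1"
    return trimmed + "/"


def placeholder_api_key(api_key: str) -> str:
    """Fall back to a dummy key, since the key field may not be left empty."""
    return api_key.strip() or "xxx"


# Example values only; substitute whatever your local server uses.
print(normalize_base_url("http://localhost:11434"))
print(placeholder_api_key(""))
```

The values these helpers return are what you would type into the Advanced Settings and Settings windows by hand; the app itself may or may not normalize them for you.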

Current Architecture 🧠

Open Interface now uses a structured request pipeline and Prompt System v1. The current app is still a strict single-step visual agent loop, but the prompt, history, and verification layers are more explicit and consistent than the earlier context.txt + request_data JSON approach.

  • The runtime is a single-step closed loop, not a multi-step batch planner.
  • Each model round may return multiple steps, but the runtime executes only the first executable one.
  • Core creates a structured request_context for every request and persists messages plus execution logs through SessionStore.
  • Most providers now share one prompt semantics source; provider adapters differ mainly in API message formatting.
  • computer-use-preview is still a separate tool-driven path rather than the standard JSON step-output flow.
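
The "first executable step" rule above might look roughly like the following sketch. The field names function and parameters, the tool names in the allowlist, and the helper itself are assumptions for illustration; the README only confirms that the reply is JSON with steps and done.

```python
import json
from typing import Optional

# Hypothetical allowlist; in the app, tool names come from ToolRegistry.
ALLOWED_TOOLS = {"click", "type_text", "press_key"}


def first_executable_step(raw_response: str) -> tuple[Optional[dict], bool]:
    """Parse a model reply and keep at most one step the runtime may run."""
    payload = json.loads(raw_response)
    done = bool(payload.get("done"))
    for step in payload.get("steps") or []:
        # Skip anything that is not a dict or names a tool outside the allowlist.
        if isinstance(step, dict) and step.get("function") in ALLOWED_TOOLS:
            return step, done
    return None, done


reply = (
    '{"steps": [{"function": "scroll_sideways"},'
    ' {"function": "click", "parameters": {"x": 50, "y": 50}},'
    ' {"function": "type_text"}], "done": null}'
)
step, done = first_executable_step(reply)
print(step, done)
```

Here the unknown scroll_sideways step is skipped, the click step is kept, and the later type_text step is dropped so the model has to re-plan it after the next screenshot.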

Request Flow

  1. The UI sends the user's natural-language goal into a queue.
  2. App forwards it to Core.execute_user_request(...) on a worker thread.
  3. Core stops any previous request, snapshots the active session history, stores the new user message, and builds request_context.
  4. LLM and the selected provider capture the latest screenshot and build a unified prompt package.
  5. The model returns JSON with steps and done; the runtime keeps at most one step.
  6. Interpreter executes the step locally and writes an execution log.
  7. If local verification is enabled, StepVerifier compares before/after screenshots and feeds the result back into the next round.
  8. The loop repeats until the model returns done, the request is interrupted, or the runtime stops after repeated failure.
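
The steps above can be sketched as a small loop with its three exit conditions. This is a simplified illustration, not the code in Core: the callables are stand-ins for the provider call and the Interpreter, and the failure limit of 3 is an assumed value, since the README does not state one.

```python
from typing import Callable


def run_single_step_loop(
    ask_model: Callable[[list], dict],
    execute_step: Callable[[dict], bool],
    is_interrupted: Callable[[], bool] = lambda: False,
    max_consecutive_failures: int = 3,  # assumed limit
) -> str:
    """Drive the observe -> decide -> act loop and report why it ended."""
    step_history: list = []
    failures = 0
    while True:
        if is_interrupted():
            return "interrupted"
        reply = ask_model(step_history)  # a fresh observation every round
        steps = reply.get("steps") or []
        if steps:
            step = steps[0]  # keep at most one step per round
            succeeded = execute_step(step)
            step_history.append({"step": step, "succeeded": succeeded})
            failures = 0 if succeeded else failures + 1
        elif not reply.get("done"):
            failures += 1  # no step and not done counts as a failed round
        if reply.get("done"):
            return "done"
        if failures >= max_consecutive_failures:
            return "stopped_after_repeated_failure"
```

Passing the model and the executor in as callables keeps the sketch runnable with scripted replies, which is also a convenient way to test stop conditions without touching the real keyboard and mouse.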

Prompt System v1

Prompt assembly now lives under app/prompting/ and is built through app/prompting/builder.py.

  • Stable prompt layers: PromptSystemContext and registry-generated PromptToolSchema
  • Dynamic prompt layers: PromptTaskContext, PromptExecutionTimeline, PromptRecentDetails, PromptVisualContext, and PromptOutputContract
  • context.txt now stores stable rules only; dynamic runtime state is assembled from request_context
  • Tool definitions come from ToolRegistry, so models see an explicit allowlist of tool names, parameters, and usage rules
  • Coordinate actions use the same 0-100 ruler values shown on the screenshot grid; the runtime converts them locally to pixels
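
The local conversion from 0-100 ruler values to pixels could be as simple as the sketch below. The function name, the clamping, and the rounding choice are assumptions; the README only states that the runtime performs the conversion locally.

```python
def ruler_to_pixels(
    x_ruler: float, y_ruler: float, screen_width: int, screen_height: int
) -> tuple[int, int]:
    """Map 0-100 ruler values to pixel coordinates clamped to the screen."""

    def scale(value: float, size: int) -> int:
        clamped = min(max(value, 0.0), 100.0)
        # Cap at size - 1 so a ruler value of 100 still lands on the screen.
        return min(int(round(clamped / 100.0 * size)), size - 1)

    return scale(x_ruler, screen_width), scale(y_ruler, screen_height)


print(ruler_to_pixels(50, 50, 1920, 1080))  # (960, 540)
```

Working in ruler units keeps the model's output independent of the display resolution; only the local runtime needs to know the real screen size.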

State, Memory, and Verification

  • session_history_snapshot captures narrative session history at request start
  • step_history records authoritative per-request execution progress and verification results
  • agent_memory keeps compact loop memory such as recent failures, recent actions, and unreliable anchors
  • Local step verification can be toggled with runtime.disable_local_step_verification
  • When local verification is disabled, successful steps are still re-observed and recorded as verification_status = skipped
  • Prompt text dumps can be enabled with advanced.save_prompt_text_dumps, which writes final prompt text to promptdump/


Stuff It’s Error-Prone At, For Now 😬

  • Accurate spatial reasoning, and hence clicking buttons reliably.
  • Keeping track of itself in tabular contexts, like Excel and Google Sheets, for the same reason.
  • Navigating complex GUI-rich applications like Counter-Strike, Spotify, Garage Band, etc., due to their heavy reliance on cursor actions.

The Future 🔮

(with better models trained on video walkthroughs like YouTube tutorials)

  • "Create a couple of bass samples for me in Garage Band for my latest project."
  • "Read this design document for a new feature, edit the code on GitHub, and submit it for review."
  • "Find my friends' music taste from Spotify and create a party playlist for tonight's event."
  • "Take the pictures from my Tahoe trip and make a White Lotus type montage in iMovie."

Notes 📝

  • Cost estimation: $0.0005 - $0.002 per LLM request, depending on the model used.
    (A single user request can take anywhere from two to a few dozen LLM backend calls, depending on its complexity.)
  • You can interrupt the app anytime by pressing the Stop button, or by dragging your cursor to any of the screen corners.
  • Open Interface can only see your primary display when using multiple monitors. Therefore, if the cursor/focus is on a secondary screen, it might keep retrying the same actions as it is unable to see its progress.
  • Most providers now share one prompt contract, but computer-use-preview still follows a separate real-tool execution path.
  • Prompt text dumps are available for debugging through advanced.save_prompt_text_dumps; they exclude API credentials and image binaries.
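
To put the cost note into per-task terms, the arithmetic can be worked through as below. The per-call prices come from the note above; reading "a few dozen" as 36 calls is an assumption made only so the example has an upper bound.

```python
def estimate_request_cost(llm_calls: int, cost_per_call_usd: float) -> float:
    """Total cost of one user request given its number of LLM calls."""
    return llm_calls * cost_per_call_usd


LOW_PER_CALL, HIGH_PER_CALL = 0.0005, 0.002  # per-call range from the note above
MIN_CALLS, MAX_CALLS = 2, 36  # "a few dozen" taken as 36 (assumption)

cheapest = estimate_request_cost(MIN_CALLS, LOW_PER_CALL)
priciest = estimate_request_cost(MAX_CALLS, HIGH_PER_CALL)
print(f"${cheapest:.4f} to ${priciest:.4f} per user request")
```

Under these assumptions a single task lands somewhere between about a tenth of a cent and a few cents, with the number of loop rounds mattering more than the per-call price.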

System Diagram 🖼️

+-------------------------------------------------------------------+
| App / UI                                                          |
|                                                                   |
|  user goal -> Core -> SessionStore                                |
|                  |                                                 |
|                  v                                                 |
|             request_context                                        |
|                  |                                                 |
|                  v                                                 |
|          LLM / Provider Adapter                                    |
|                  |                                                 |
|                  v                                                 |
|     PromptBuilder + ToolRegistry + Screenshot                      |
|                  |                                                 |
|                  v                                                 |
|        Model returns JSON { steps, done }                          |
|                  |                                                 |
|                  v                                                 |
|            Interpreter executes one step                           |
|                  |                                                 |
|                  v                                                 |
|      StepVerifier observes before/after screen change              |
|                  |                                                 |
|                  +-----> step_history / agent_memory ----+         |
|                                                          |         |
|  <---------------- repeat until done / stop / failure ---+         |
+-------------------------------------------------------------------+

Star History ⭐️

Star History

Links 🔗

  • Check out more of my projects at AmberSah.dev.
  • Other demos and press kit can be found at MEDIA.md.
