Nubian (WIP)

Autonomous computer-use agent

Most agents read the page; Nubian watches the screen. It skips accessibility trees (a11y), the Chrome DevTools Protocol (CDP), and CodeAct, and instead uses vision and a virtual keyboard, the same channel a person uses. Today it drives LibreOffice Calc, LibreOffice Writer, Chrome, terminals, and the file manager. The goal is to do the same in GIMP, Photoshop, and Blender, where no DOM or accessibility tree exists.

nubian-demo.mp4

nubian-gimp-demo.mp4

Nubian completing a job application end to end

Above: Nubian finishing a real online job application end to end. Left pane is the live reasoning trace; right pane is the agent's Ubuntu desktop. Along the way it wrote the cover letter in LibreOffice Writer, downloaded the required files, filled the form fields, and reached the "successfully applied" confirmation.

Full breakdown: docs/ARCHITECTURE.md

⚠ Heads-up: accurate clicks need a GPU running UI-TARS-1.7B.

1. Rent any GPU VM (~$0.20/hr on vast.ai, or RunPod / GCP / Hetzner).
2. Serve UI-TARS via vLLM on a reachable port (see deployment/GROUNDER_DEPLOY.md).
3. Point nubian.uground.base-url at it.

Alternatively, set nubian.uground.enabled=false to run without grounded clicks (degraded mode).
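
Before pointing nubian.uground.base-url at a freshly served grounder, it can help to confirm the endpoint answers at all. The sketch below is not part of this repo: the class name GrounderProbe, the example address, and the assumption that the base URL does not already end in /v1 are all hypothetical. It relies only on the model-listing route (/v1/models) that vLLM's OpenAI-compatible server exposes.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper, not part of the Nubian codebase.
class GrounderProbe {

    /** Builds a GET for the OpenAI-compatible model listing under the given base URL. */
    static HttpRequest modelsRequest(String baseUrl) {
        // Strip trailing slashes so "http://host:8000/" and "http://host:8000" behave alike.
        String trimmed = baseUrl.replaceAll("/+$", "");
        return HttpRequest.newBuilder(URI.create(trimmed + "/v1/models"))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
    }

    /** Returns true only if the server answers HTTP 200; any error counts as unreachable. */
    static boolean isReachable(String baseUrl) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        try {
            HttpResponse<Void> resp =
                    client.send(modelsRequest(baseUrl), HttpResponse.BodyHandlers.discarding());
            return resp.statusCode() == 200;
        } catch (Exception e) {
            // Connection refused, timeout, DNS failure, or interruption.
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            return false;
        }
    }

    public static void main(String[] args) {
        // Example address only; pass the real grounder URL as the first argument.
        String baseUrl = args.length > 0 ? args[0] : "http://localhost:8000";
        System.out.println(baseUrl + " reachable: " + isReachable(baseUrl));
    }
}
```

If the probe reports false, check the VM's firewall and the port vLLM was started on before debugging the agent itself.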

Architecture

        ┌─────────────────────────────────────────────┐
        │  Agent (this repo, Java 17 + Spring Boot)   │
        │                                             │
        │  seeact loop = per-turn planner call:       │
        │    raw screenshot ──> Gemini Pro planner    │
        │    target description ──> UI-TARS grounder  │
        │    {action, x, y, ...} ──> Tools.invoke     │
        │                                             │
        │  Supervisor (every N turns OR on doom-loop):│
        │    consults Gemini Pro with screenshot      │
        │    can MODIFY plan / ADVANCE / EXECUTE keys │
        └──────────────┬──────────────────────────────┘
                       │  HTTP
        ┌──────────────▼──────────────┐
        │  Sandbox controller          │  /hands/action, /eyes/screenshot
        │  (Docker container w/        │  /eyes/evidence (windows, fs, ...)
        │   X11 + pyautogui)           │
        └──────────────────────────────┘
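
The control flow in the diagram can be sketched roughly as below. This is an illustration under assumptions, not the code in Agent.java: the Planner, Grounder, Hands, and Supervisor interfaces and the Plan and Point records are hypothetical stand-ins, the screenshot is a placeholder byte array, and the doom-loop trigger for the supervisor is left out so only the "every N turns" path is shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a plan -> ground -> act loop with a periodic supervisor.
class SeeActLoopSketch {
    record Plan(String action, String target, boolean done) {}
    record Point(int x, int y) {}

    interface Planner { Plan next(byte[] screenshot, int turn); }
    interface Grounder { Point locate(byte[] screenshot, String target); }
    interface Hands { void invoke(String action, Map<String, Object> args); }
    interface Supervisor { void review(byte[] screenshot, int turn); }

    /** Runs until the planner reports done or maxSteps is hit; returns the last turn number. */
    static int run(Planner planner, Grounder grounder, Hands hands, Supervisor supervisor,
                   int maxSteps, int supervisorInterval) {
        byte[] screenshot = new byte[0]; // placeholder; a real loop would re-capture each turn
        for (int turn = 1; turn <= maxSteps; turn++) {
            Plan plan = planner.next(screenshot, turn);
            if (plan.done()) {
                return turn;
            }
            if (plan.target() != null) {
                // The planner describes the target in words; the grounder turns it into pixels.
                Point p = grounder.locate(screenshot, plan.target());
                hands.invoke(plan.action(), Map.of("x", p.x(), "y", p.y()));
            } else {
                hands.invoke(plan.action(), Map.of());
            }
            if (turn % supervisorInterval == 0) {
                supervisor.review(screenshot, turn);
            }
        }
        return maxSteps;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        int turns = run(
                (shot, turn) -> turn < 4
                        ? new Plan("click", "Save button", false)
                        : new Plan("done", null, true),
                (shot, target) -> new Point(100, 200),
                (action, a) -> log.add(action),
                (shot, turn) -> log.add("supervisor@" + turn),
                400, 2);
        System.out.println(turns + " turns, log=" + log);
    }
}
```

Keeping the planner and grounder behind separate interfaces mirrors the split in the diagram: the planner never emits coordinates, so either model can be swapped without touching the other.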

Quick start

# 1. Build (-o is Maven's offline mode; drop it on a first build so dependencies can download)
mvn -o package -DskipTests

# 2. Fill in your API key
cp config/application-dev.properties.example config/application-dev.properties
$EDITOR config/application-dev.properties   # set GEMINI_API_KEY or OPENROUTER_API_KEY

# 3. Bring up the sandbox (Ubuntu desktop + Nubian controller on port 6090)
#    Either run locally:
docker compose up -d sandbox
#    Or point the agent at a remote desktop in config/application-dev.properties:
#      nubian.sandbox.computer-agent.host=<remote-ip>
#      nubian.sandbox.computer-agent.agent-port=<port>     # default 6090
#      nubian.sandbox.computer-agent.base-path=/agent      # if behind nginx

# 4. Deploy the UI-TARS grounder on a GPU VM (required for grounded clicks)
#    See deployment/GROUNDER_DEPLOY.md.

# 5. Start the agent
./start.sh
# UI on http://localhost:8080
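
To make the remote-desktop settings in step 3 concrete, here is one way the host, port, and base path might combine into controller URLs. It is a guess at the composition, not the logic in Sandbox.java: the class SandboxEndpoints, the plain http scheme, and the rule that base-path is prefixed to every route are assumptions.

```java
import java.net.URI;

// Hypothetical helper showing how host, agent-port, and base-path could form a URL.
class SandboxEndpoints {

    /** Joins host, port, optional base path, and a controller route into one URI. */
    static URI endpoint(String host, int port, String basePath, String route) {
        // Normalise the base path: no trailing slash, exactly one leading slash if present.
        String base = basePath == null ? "" : basePath.replaceAll("/+$", "");
        if (!base.isEmpty() && !base.startsWith("/")) {
            base = "/" + base;
        }
        String normalisedRoute = route.startsWith("/") ? route : "/" + route;
        return URI.create("http://" + host + ":" + port + base + normalisedRoute);
    }

    public static void main(String[] args) {
        // Local sandbox on the default port, no base path.
        System.out.println(endpoint("localhost", 6090, "", "/eyes/screenshot"));
        // Remote desktop behind nginx with base-path=/agent (example address only).
        System.out.println(endpoint("10.0.0.5", 6090, "/agent", "/hands/action"));
    }
}
```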

Tool contract

The agent dispatches actions to the sandbox via Tools.invoke(name, args). Most-used:

| Action | Args | Notes |
| --- | --- | --- |
| `click` | `{x, y, button?}` | pixel coords from the grounder |
| `type_text` | `{text, mode?: "append"\|"replace"}` | replace = Ctrl+A → Delete → type |
| `hotkey` | `{combo: "ctrl+l"\|"enter"\|...}` | |
| `scroll` | `{direction: "up"\|"down"\|"left"\|"right", amount?: <int>}` | `dy`/`dx` accepted as legacy args; default to "down"/"right" |
| `activate_window` | `{name}` | exact title from `/eyes/evidence` |
| `close_window` | `{name}` | window manager close (more reliable than clicking the X) |
| `write_file` | `{path, content}` | absolute path in sandbox FS |
| `wait` | `{ms}` | |
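
As an illustration of the argument shapes in the table, the sketch below builds the maps a caller might pass as args. The ToolArgsSketch class and its helper methods are hypothetical and do not exist in Tools.java; only the argument names and allowed values come from the table.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical builders for the args maps listed in the tool contract.
class ToolArgsSketch {
    static final Set<String> SCROLL_DIRECTIONS = Set.of("up", "down", "left", "right");

    /** Args for `click`; the optional `button` is omitted. */
    static Map<String, Object> click(int x, int y) {
        Map<String, Object> args = new LinkedHashMap<>();
        args.put("x", x);
        args.put("y", y);
        return args;
    }

    /** Args for `type_text`; mode is "replace" or "append". */
    static Map<String, Object> typeText(String text, boolean replace) {
        Map<String, Object> args = new LinkedHashMap<>();
        args.put("text", text);
        args.put("mode", replace ? "replace" : "append");
        return args;
    }

    /** Args for `scroll`; rejects directions outside the four listed values. */
    static Map<String, Object> scroll(String direction, Integer amount) {
        if (!SCROLL_DIRECTIONS.contains(direction)) {
            throw new IllegalArgumentException("unknown scroll direction: " + direction);
        }
        Map<String, Object> args = new LinkedHashMap<>();
        args.put("direction", direction);
        if (amount != null) {
            args.put("amount", amount); // optional, so only set when provided
        }
        return args;
    }

    public static void main(String[] args) {
        System.out.println("click     " + click(640, 360));
        System.out.println("type_text " + typeText("hello", true));
        System.out.println("scroll    " + scroll("down", 3));
    }
}
```

Validating the direction before dispatch keeps a malformed planner output from reaching the sandbox as a no-op or an error.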

Configuration knobs

All in config/application-dev.properties:

# Only supported flow.
nubian.agent.flow=seeact
# Hard cap on steps per task.
nubian.agent.max-steps=400
# Wait (ms) before the post-action screenshot.
nubian.agent.observation-settle-ms=600
nubian.agent.supervisor.enabled=true
# Supervisor fires every 10 planner calls.
nubian.agent.supervisor.interval=10
# Set to false for screenshot-only mode (no grounded clicks).
nubian.uground.enabled=true
# Values above 1 sample N candidates per turn.
nubian.agent.best-of-n=1
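
For a quick sanity check of these keys outside Spring, a plain java.util.Properties read is enough. The sketch below is hypothetical (AgentConfigSketch is not a class in this repo) and uses the values shown above as fallbacks, which is an assumption about the real defaults. One detail worth knowing: java.util.Properties treats # as a comment only at the start of a line, so a trailing "# ..." on the same line as a value becomes part of that value.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical typed view over the configuration keys listed above.
class AgentConfigSketch {
    record AgentConfig(String flow, int maxSteps, int settleMs, boolean supervisorEnabled,
                       int supervisorInterval, boolean grounderEnabled, int bestOfN) {}

    /** Parses .properties text; missing keys fall back to the values shown in the README. */
    static AgentConfig parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new AgentConfig(
                p.getProperty("nubian.agent.flow", "seeact").trim(),
                intOf(p, "nubian.agent.max-steps", 400),
                intOf(p, "nubian.agent.observation-settle-ms", 600),
                boolOf(p, "nubian.agent.supervisor.enabled", true),
                intOf(p, "nubian.agent.supervisor.interval", 10),
                boolOf(p, "nubian.uground.enabled", true),
                intOf(p, "nubian.agent.best-of-n", 1));
    }

    // Throws NumberFormatException if the value is not a clean integer.
    static int intOf(Properties p, String key, int fallback) {
        return Integer.parseInt(p.getProperty(key, String.valueOf(fallback)).trim());
    }

    static boolean boolOf(Properties p, String key, boolean fallback) {
        return Boolean.parseBoolean(p.getProperty(key, String.valueOf(fallback)).trim());
    }

    public static void main(String[] args) {
        String text = "# comment on its own line\n"
                + "nubian.agent.max-steps=50\n"
                + "nubian.uground.enabled=false\n";
        System.out.println(parse(text));
    }
}
```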

Repo layout

.
├── nubian-app/                # Spring Boot app (the agent itself)
│   └── src/main/java/com/nubian/ai/app/
│       ├── Agent.java         # main loop + flow routing
│       ├── SeeActPlanner.java # per-turn planner LLM call
│       ├── SupervisorAdvisor.java
│       ├── Tools.java         # the action dispatch layer
│       ├── UGroundClient.java # pixel grounder client
│       └── Sandbox.java       # HTTP client for the sandbox controller
├── omniparser-server/         # optional UI-element parser sidecar
├── config/                    # local config (gitignored real, .example checked in)
├── docs/                      # design notes + handoff docs
├── scripts/                   # diff viewers, jitter probes
├── docker-compose.yml         # sandbox + sidecars
├── pom.xml
└── start.sh

License

MIT.
