Commit 1a6294f4 authored by Lisa (AI Assistant)

feat: align Android node capabilities with gateway

parent ced218d3
Pipeline #323 canceled with stages
......@@ -2,62 +2,107 @@
Android node agent for the Hermes Node Protocol.
This is the Android sibling of the Linux `hermes-node-agent`: an outbound-only node that connects to the Hermes Node Gateway over WSS, registers Android-safe capabilities, and executes supported actions without requiring inbound firewall rules, SSH, or rooted devices.
This project is the Android sibling of the Linux `hermes-node-agent`. The goal is **functional parity where Android permits it**, not a toy “status-only” agent.
The gateway should not need a new “phone_control” tool just because the node is Android. Where the existing capability names already describe the intent, Android should reuse them and implement Android-native semantics behind the same wire contract.
## Current status
Project scaffold. No remote Git repository is configured yet; the GitLab remote will be added later when provided.
Early Android/Kotlin scaffold with protocol/capability classes. The project already has a GitLab remote:
```text
git@git.nexlab.net:lisa/hermes-node-android.git
```
## Capability naming decision
Use existing gateway capability names where practical:
- `computer_control` = interactive device control.
- On Linux: mouse, keyboard, X11 desktop actions.
- On Android: touch/input abstraction, launch apps, open intents, press Back/Home/Recents, type text, tap/swipe, optionally accessibility-backed UI actions.
- `desktop_observe` = structured screen/device observation.
- On Linux: active window, cursor, screen geometry, screenshots.
- On Android: foreground app/activity where available, display metrics, orientation, screen state, screenshots where permission/API allows.
- `audio_control` = device/media audio operations.
- On Android: media session controls, volume, audio route/status, playback of provided audio, mic capture only with permission and visible UX.
- `camera_control` = camera read/write paths where supported.
- On Android: list cameras, capture frame/video with CameraX/Camera2 and explicit permission/visible UX. Virtual camera injection is not an Android baseline feature.
- `browser_control` = browser automation if implemented later via Chrome DevTools/WebView/intent flows. For first pass, browser launching/open URL can live under `computer_control` actions.
- `exec` = Android shell execution if available, disabled by default and honest about non-root limitations.
This avoids gateway-side churn while still letting the gateway inspect `capability_info.platform == "android"` when it needs platform-specific hints.
## Design goals
- **Protocol-compatible** with the existing Hermes Node Gateway JSON/WebSocket protocol.
- **Protocol-compatible** with existing Hermes Node Gateway JSON/WebSocket protocol.
- **Outbound-only** connection from Android to gateway: `wss://gateway:8765`.
- **No root requirement** for baseline Android operation.
- **Capability-first architecture** so Android can expose what is safe/available:
- `android_status` / device info
- notifications bridge, if explicitly enabled
- media/audio controls where Android APIs permit it
- camera capture only with explicit permission and visible UX
- file operations limited to app-accessible storage / SAF grants
- exec only for non-root shell where available, and disabled by default
- **User-owned configuration** inside app storage, never hardcoded tokens.
- **Visible foreground service** for persistent connectivity, because Android kills hidden background agents.
- **Functional parity with Linux capabilities where Android permits it.**
- **No root requirement** for baseline operation.
- **Foreground service** for persistent connectivity.
- **Explicit permissions** for sensitive functions: notifications, accessibility, camera, mic, screen capture.
- **No hidden capture or spyware behavior.** If Android requires a visible permission prompt or foreground notification, we respect that.
- **No hardcoded tokens.** Config is app-local and user-supplied.
## Non-goals for the first milestone
## First implementation targets
- Root-only device control.
- Silent spyware-style capture.
- Full Linux `exec` parity.
- Bypassing Android permission prompts or background limits.
### `computer_control`
Android is a different security model, not just Linux with a touchscreen. We'll keep this boring and explicit rather than building a brittle hack pile.
Android-native actions:
## Proposed architecture
- `launch_app` by package name.
- `open_url` via browser intent.
- `open_intent` for explicit/implicit intents.
- `key_press` for Android navigation keys where permitted: Back, Home, Recents, Enter, Volume Up/Down.
- `type_text` through accessibility/IME path where available.
- `tap`, `swipe`, `long_press` through accessibility gesture dispatch where enabled.
- `get_active_window` mapped to foreground app/activity when available.
```text
Android App
├─ MainActivity: setup/status UI
├─ Foreground HermesNodeService
│ ├─ GatewayClient: OkHttp WebSocket connection + reconnect
│ ├─ ProtocolHandler: register, heartbeat, command dispatch
│ ├─ ConfigStore: encrypted/shared preferences
│ └─ CapabilityRegistry
│ ├─ StatusCapability
│ ├─ NotificationCapability
│ ├─ MediaCapability
│ ├─ CameraCapability
│ └─ ShellCapability (disabled by default)
└─ Android permissions declared per capability
```
### `desktop_observe`
Android-native observations:
- `screen_info`: display size, density, orientation, interactive/locked state where available.
- `active_window`/`get_active_window`: foreground app/activity via usage stats or accessibility, depending on grants.
- `screenshot`: MediaProjection-based capture with explicit user consent.
- `clipboard_get`: only when Android version/API permits it for the foreground/default IME constraints.
### `audio_control`
- `list_audio_devices`
- `get_audio_status`
- `play_audio`
- `capture_input` with runtime microphone permission and foreground UX
- `capture_output` only if Android API/app constraints allow; otherwise explicit unsupported result.
### `camera_control`
- `list_cameras`
- `get_camera_status`
- `capture_frame`
- `capture_video`
- `inject_video`: return explicit unsupported on normal Android unless a later rooted/vendor-specific backend exists.
### `exec`
- Optional Android shell command execution using app-accessible process execution.
- Disabled by default.
- Must return honest permission/SELinux/root limitations.
## Repository layout
```text
.
├── app/ Android application module
├── app/
│ └── src/main/
│ ├── AndroidManifest.xml
│ ├── java/net/nexlab/hermesnodeandroid/
│ │ ├── MainActivity.kt
│ │ ├── HermesNodeService.kt
│ │ ├── GatewayClient.kt
│ │ ├── NodeAgent.kt
│ │ ├── ProtocolModels.kt
│ │ └── capabilities/
│ └── res/
├── docs/
│ ├── ANDROID_CAPABILITIES.md
......
......@@ -14,6 +14,10 @@ android {
versionCode = 1
versionName = "0.1.0"
}
kotlinOptions {
jvmTarget = "17"
}
}
dependencies {
......
......@@ -3,6 +3,9 @@
<uses-permission android:name="android.permission.INTERNET" />
<uses-permission android:name="android.permission.FOREGROUND_SERVICE" />
<uses-permission android:name="android.permission.POST_NOTIFICATIONS" />
<uses-permission android:name="android.permission.CAMERA" />
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.QUERY_ALL_PACKAGES" />
<application
android:allowBackup="false"
......
/*
* Hermes Node Android
* Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
* SPDX-License-Identifier: GPL-3.0-or-later
*/
package net.nexlab.hermesnodeandroid
import android.content.Context
import net.nexlab.hermesnodeandroid.capabilities.AndroidAudioControl
import net.nexlab.hermesnodeandroid.capabilities.AndroidCameraControl
import net.nexlab.hermesnodeandroid.capabilities.AndroidComputerControl
import net.nexlab.hermesnodeandroid.capabilities.AndroidDesktopObserve
import net.nexlab.hermesnodeandroid.capabilities.AndroidExecControl
import okhttp3.Response
import okhttp3.WebSocket
import okhttp3.WebSocketListener
import org.json.JSONObject
class NodeAgent(
private val context: Context,
private val config: NodeConfig,
) : WebSocketListener() {
private val computer = AndroidComputerControl(context)
private val observe = AndroidDesktopObserve(context)
private val audio = AndroidAudioControl(context)
private val camera = AndroidCameraControl(context)
private val exec = AndroidExecControl(context)
private var gatewayClient: GatewayClient? = null
fun start() {
gatewayClient = GatewayClient(config.gatewayUrl, config.token, config.nodeName)
gatewayClient?.connect(this)
}
fun stop() {
gatewayClient?.close()
gatewayClient = null
}
override fun onOpen(webSocket: WebSocket, response: Response) {
webSocket.send(registrationFrame())
}
override fun onMessage(webSocket: WebSocket, text: String) {
val command = runCatching { GatewayCommand.fromJson(text) }.getOrElse { return }
val action = command.action ?: command.command
val result = when (command.type) {
"computer_control" -> computer.handle(action, command.params)
"desktop_observe" -> observe.handle(action, command.params)
"audio_control" -> audio.handle(action, command.params)
"camera_control" -> camera.handle(action, command.params)
"exec" -> exec.handle(command)
"heartbeat_ack", "register_ack" -> return
else -> CapabilityResult.unsupported("Unknown command type: ${command.type}")
}
val resultType = when (command.type) {
"computer_control" -> "computer_control_result"
"desktop_observe" -> "desktop_observe_result"
"audio_control" -> "audio_control_result"
"camera_control" -> "camera_control_result"
"exec" -> "exec_complete"
else -> "error"
}
webSocket.send(result.toWireResult(resultType, command.id, action))
}
fun registrationFrame(): String {
val tools = mutableListOf<String>()
if (config.enableExec) tools += "exec"
if (config.enableComputerControl) tools += "computer_control"
if (config.enableDesktopObserve) tools += "desktop_observe"
if (config.enableAudioControl) tools += "audio_control"
if (config.enableCameraControl) tools += "camera_control"
if (config.enableBrowserControl) tools += "browser_control"
return JSONObject()
.put("type", "register")
.put("node_name", config.nodeName)
.put("version", "0.1.0-android")
.put("tools", jsonArrayOf(tools))
.put("capabilities", capabilityInfo())
.toString()
}
private fun capabilityInfo(): JSONObject = JSONObject()
.put("platform", "android")
.put("enable_exec", config.enableExec)
.put("enable_computer_control", config.enableComputerControl)
.put("enable_desktop_observe", config.enableDesktopObserve)
.put("enable_audio_control", config.enableAudioControl)
.put("enable_camera_control", config.enableCameraControl)
.put("enable_browser", config.enableBrowserControl)
.put("computer_control", computer.capabilityInfo())
.put("desktop_observe", observe.capabilityInfo())
.put("audio_control", audio.capabilityInfo())
.put("camera_control", camera.capabilityInfo())
}
/*
* Hermes Node Android
* Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
* SPDX-License-Identifier: GPL-3.0-or-later
*/
package net.nexlab.hermesnodeandroid
import org.json.JSONArray
import org.json.JSONObject
data class NodeConfig(
val gatewayUrl: String,
val token: String,
val nodeName: String,
val enableExec: Boolean = false,
val enableComputerControl: Boolean = true,
val enableDesktopObserve: Boolean = true,
val enableAudioControl: Boolean = true,
val enableCameraControl: Boolean = true,
val enableBrowserControl: Boolean = false,
)
data class GatewayCommand(
val id: String,
val type: String,
val action: String?,
val command: String?,
val params: JSONObject,
) {
companion object {
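fun fromJson(raw: String): GatewayCommand {
val obj = JSONObject(raw)
// Android's org.json optString() yields the literal string "null" for a JSON null value,
// so absent or null keys are mapped to a Kotlin null explicitly.
fun optNullable(key: String): String? = if (obj.isNull(key)) null else obj.optString(key)
return GatewayCommand(
id = obj.optString("id"),
type = obj.optString("type"),
action = optNullable("action"),
command = optNullable("command"),
params = obj.optJSONObject("params") ?: JSONObject(),
)
}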
}
}
data class CapabilityResult(
val success: Boolean,
val data: JSONObject = JSONObject(),
val error: String? = null,
) {
fun toWireResult(resultType: String, id: String, action: String? = null): String {
val obj = JSONObject()
.put("type", resultType)
.put("id", id)
.put("success", success)
if (action != null) obj.put("action", action)
if (error != null) obj.put("error", error)
for (key in data.keys()) obj.put(key, data.get(key))
return obj.toString()
}
companion object {
fun unsupported(message: String) = CapabilityResult(false, error = message)
fun ok(vararg pairs: Pair<String, Any?>): CapabilityResult {
val obj = JSONObject()
pairs.forEach { (k, v) -> obj.put(k, v) }
return CapabilityResult(true, obj)
}
}
}
fun jsonArrayOf(items: Iterable<String>): JSONArray {
val arr = JSONArray()
items.forEach { arr.put(it) }
return arr
}
/*
* Hermes Node Android
* Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
* SPDX-License-Identifier: GPL-3.0-or-later
*/
package net.nexlab.hermesnodeandroid.capabilities
import android.content.Context
import android.media.AudioManager
import net.nexlab.hermesnodeandroid.CapabilityResult
import org.json.JSONObject
class AndroidAudioControl(private val context: Context) {
private val audioManager = context.getSystemService(Context.AUDIO_SERVICE) as AudioManager
fun capabilityInfo(): JSONObject = JSONObject()
.put("available", true)
.put("can_get_status", true)
.put("can_play_audio", true)
.put("capture_input_requires_microphone_permission", true)
.put("capture_output_ready", false)
fun handle(action: String?, params: JSONObject): CapabilityResult = when (action) {
"list_audio_devices" -> CapabilityResult.unsupported("list_audio_devices requires API-specific AudioDeviceInfo wiring")
"get_audio_status" -> getAudioStatus()
"play_audio" -> CapabilityResult.unsupported("play_audio scaffolded; needs download/cache + MediaPlayer/ExoPlayer backend")
"capture_input" -> CapabilityResult.unsupported("capture_input requires RECORD_AUDIO permission and foreground UX")
"capture_output" -> CapabilityResult.unsupported("capture_output is not generally available to normal Android apps")
else -> CapabilityResult.unsupported("Unknown audio_control action: $action")
}
private fun getAudioStatus(): CapabilityResult = CapabilityResult.ok(
"music_volume" to audioManager.getStreamVolume(AudioManager.STREAM_MUSIC),
"music_volume_max" to audioManager.getStreamMaxVolume(AudioManager.STREAM_MUSIC),
"ringer_mode" to audioManager.ringerMode,
"speakerphone_on" to audioManager.isSpeakerphoneOn,
"microphone_mute" to audioManager.isMicrophoneMute,
)
}
/*
* Hermes Node Android
* Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
* SPDX-License-Identifier: GPL-3.0-or-later
*/
package net.nexlab.hermesnodeandroid.capabilities
import android.content.Context
import android.hardware.camera2.CameraCharacteristics
import android.hardware.camera2.CameraManager
import net.nexlab.hermesnodeandroid.CapabilityResult
import org.json.JSONArray
import org.json.JSONObject
class AndroidCameraControl(private val context: Context) {
fun capabilityInfo(): JSONObject = JSONObject()
.put("available", true)
.put("can_capture", true)
.put("can_inject", false)
.put("capture_requires_camera_permission", true)
.put("backend", "Camera2/CameraX planned")
fun handle(action: String?, params: JSONObject): CapabilityResult = when (action) {
"list_cameras" -> listCameras()
"get_camera_status" -> CapabilityResult.ok("available" to true, "backend" to "Camera2")
"capture_frame" -> CapabilityResult.unsupported("capture_frame requires CAMERA permission and CameraX/Camera2 capture session")
"capture_video" -> CapabilityResult.unsupported("capture_video requires CAMERA/RECORD_AUDIO permissions and visible UX")
"inject_video" -> CapabilityResult.unsupported("inject_video is not a normal Android capability without rooted/vendor-specific backend")
else -> CapabilityResult.unsupported("Unknown camera_control action: $action")
}
private fun listCameras(): CapabilityResult {
val manager = context.getSystemService(Context.CAMERA_SERVICE) as CameraManager
val arr = JSONArray()
manager.cameraIdList.forEach { id ->
val chars = manager.getCameraCharacteristics(id)
val facing = when (chars.get(CameraCharacteristics.LENS_FACING)) {
CameraCharacteristics.LENS_FACING_FRONT -> "front"
CameraCharacteristics.LENS_FACING_BACK -> "back"
CameraCharacteristics.LENS_FACING_EXTERNAL -> "external"
else -> "unknown"
}
arr.put(JSONObject().put("id", id).put("facing", facing))
}
return CapabilityResult.ok("cameras" to arr, "count" to arr.length())
}
}
/*
* Hermes Node Android
* Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
* SPDX-License-Identifier: GPL-3.0-or-later
*/
package net.nexlab.hermesnodeandroid.capabilities
import android.content.Context
import android.content.Intent
import android.net.Uri
import android.view.KeyEvent
import net.nexlab.hermesnodeandroid.CapabilityResult
import org.json.JSONObject
class AndroidComputerControl(private val context: Context) {
fun capabilityInfo(): JSONObject = JSONObject()
.put("available", true)
.put("platform_semantics", "android_phone_control")
.put("actions", listOf(
"launch_app", "open_url", "open_intent", "key_press",
"tap", "swipe", "long_press", "type_text", "get_active_window",
"screenshot", "mouse_move", "mouse_click", "mouse_position"
).joinToString(","))
.put("requires_accessibility_for_input", true)
.put("notes", "Reuses gateway computer_control name for Android-native phone control")
fun handle(action: String?, params: JSONObject): CapabilityResult = when (action) {
"launch_app" -> launchApp(params.optString("package"))
"open_url" -> openUrl(params.optString("url"))
"open_intent" -> openIntent(params)
"key_press" -> keyPress(params.optString("key"))
"tap", "mouse_click", "mouse_move", "swipe", "long_press", "type_text" ->
CapabilityResult.unsupported("$action requires AccessibilityService gesture/input backend; scaffolded but not wired yet")
"get_active_window", "screenshot", "mouse_position" ->
CapabilityResult.unsupported("$action belongs to AndroidDesktopObserve backend; route via desktop_observe or wire delegation later")
else -> CapabilityResult.unsupported("Unknown computer_control action: $action")
}
private fun launchApp(packageName: String): CapabilityResult {
if (packageName.isBlank()) return CapabilityResult.unsupported("launch_app requires params.package")
val intent = context.packageManager.getLaunchIntentForPackage(packageName)
?: return CapabilityResult.unsupported("No launch intent for package: $packageName")
intent.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)
context.startActivity(intent)
return CapabilityResult.ok("launched" to packageName)
}
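private fun openUrl(url: String): CapabilityResult {
if (url.isBlank()) return CapabilityResult.unsupported("open_url requires params.url")
val intent = Intent(Intent.ACTION_VIEW, Uri.parse(url)).addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)
return startOrUnsupported(intent, "open_url") { CapabilityResult.ok("url" to url) }
}
private fun openIntent(params: JSONObject): CapabilityResult {
val action = params.optString("intent_action", Intent.ACTION_VIEW)
val data = params.optString("data", "")
val intent = Intent(action).addFlags(Intent.FLAG_ACTIVITY_NEW_TASK)
if (data.isNotBlank()) intent.data = Uri.parse(data)
return startOrUnsupported(intent, "open_intent") { CapabilityResult.ok("intent_action" to action, "data" to data) }
}
// startActivity() can throw (for example ActivityNotFoundException when no app handles the intent).
// Report the failure to the gateway instead of letting it escape the WebSocket listener.
private fun startOrUnsupported(intent: Intent, label: String, onStarted: () -> CapabilityResult): CapabilityResult =
runCatching { context.startActivity(intent) }.fold(
onSuccess = { onStarted() },
onFailure = { CapabilityResult.unsupported("$label failed: ${it.message}") },
)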
private fun keyPress(key: String): CapabilityResult {
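val code = when (key.lowercase()) {
"back" -> KeyEvent.KEYCODE_BACK
"home" -> KeyEvent.KEYCODE_HOME
// Recents is listed among the supported navigation keys in the README; KEYCODE_APP_SWITCH is its key code.
"recents" -> KeyEvent.KEYCODE_APP_SWITCH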
"enter" -> KeyEvent.KEYCODE_ENTER
"volume_up" -> KeyEvent.KEYCODE_VOLUME_UP
"volume_down" -> KeyEvent.KEYCODE_VOLUME_DOWN
else -> return CapabilityResult.unsupported("Unsupported Android key: $key")
}
return CapabilityResult.unsupported("key_press($code) requires AccessibilityService or privileged input injection")
}
}
/*
* Hermes Node Android
* Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
* SPDX-License-Identifier: GPL-3.0-or-later
*/
package net.nexlab.hermesnodeandroid.capabilities
import android.content.Context
import android.content.res.Configuration
import android.os.PowerManager
import android.util.DisplayMetrics
import android.view.WindowManager
import net.nexlab.hermesnodeandroid.CapabilityResult
import org.json.JSONObject
class AndroidDesktopObserve(private val context: Context) {
fun capabilityInfo(): JSONObject = JSONObject()
.put("available", true)
.put("screen_info", true)
.put("active_window_requires_usage_stats_or_accessibility", true)
.put("screenshot_requires_media_projection", true)
fun handle(action: String?, params: JSONObject): CapabilityResult = when (action) {
"screen_info" -> screenInfo()
"active_window", "get_active_window" -> CapabilityResult.unsupported("active_window requires UsageStats or AccessibilityService grant")
"screenshot", "region_screenshot" -> CapabilityResult.unsupported("screenshot requires MediaProjection consent flow")
"clipboard_get" -> CapabilityResult.unsupported("clipboard_get is Android-version/foreground constrained; not wired yet")
"cursor_position", "mouse_position" -> CapabilityResult.ok("x" to 0, "y" to 0, "note" to "Android has touch focus, not a persistent mouse cursor")
else -> CapabilityResult.unsupported("Unknown desktop_observe action: $action")
}
private fun screenInfo(): CapabilityResult {
val metrics: DisplayMetrics = context.resources.displayMetrics
val orientation = when (context.resources.configuration.orientation) {
Configuration.ORIENTATION_LANDSCAPE -> "landscape"
Configuration.ORIENTATION_PORTRAIT -> "portrait"
else -> "unknown"
}
val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager
return CapabilityResult.ok(
"width" to metrics.widthPixels,
"height" to metrics.heightPixels,
"density" to metrics.density,
"density_dpi" to metrics.densityDpi,
"orientation" to orientation,
"interactive" to pm.isInteractive,
)
}
}
/*
* Hermes Node Android
* Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
* SPDX-License-Identifier: GPL-3.0-or-later
*/
package net.nexlab.hermesnodeandroid.capabilities
import android.content.Context
import net.nexlab.hermesnodeandroid.CapabilityResult
import net.nexlab.hermesnodeandroid.GatewayCommand
import org.json.JSONObject
class AndroidExecControl(private val context: Context) {
fun capabilityInfo(): JSONObject = JSONObject()
.put("available", false)
.put("enabled_by_default", false)
.put("reason", "Android app shell execution is limited and should be explicit")
fun handle(command: GatewayCommand): CapabilityResult =
CapabilityResult.unsupported("exec is disabled for Android until permission model and command allow/deny/ask rules are implemented")
}
# Android capabilities
Android capabilities must map to Android's permission and lifecycle model. Linux parity is not the goal; protocol compatibility with honest capability advertisement is.
The Android agent should expose **the same gateway capability names** as Linux wherever the gateway semantics are already good enough. The implementation behind those names is Android-native.
## Baseline capability set
## Capability mapping
- `android_status`: app version, Android version, manufacturer/model, battery/network summary.
- `android_notification`: optional notification listener bridge; requires explicit Android settings grant.
- `android_media`: media session controls where permitted.
- `android_camera`: explicit capture flow only; requires camera permission and visible UX.
- `android_files`: limited app-private files first; SAF-backed access later.
- `exec`: disabled by default; Android shell is limited and not comparable to Linux node exec.
### `computer_control`
## Security posture
Meaning on Android: phone/device control.
- No hidden background capture.
- No hardcoded tokens.
- WSS by default.
- Foreground service notification while connected.
- Per-capability enable/disable switches.
Actions to support:
- `launch_app`: start app by package name.
- `open_url`: open URL via Android intent.
- `open_intent`: explicit/implicit Android intent bridge.
- `key_press`: Back/Home/Recents/Enter/volume controls where available.
- `tap`, `swipe`, `long_press`: accessibility gesture dispatch.
- `type_text`: accessibility/IME-backed text entry.
- `get_active_window`: foreground app/activity when permission grants exist.
Keep the gateway name `computer_control` unless gateway-side UX later needs a friendly alias. Compatibility beats naming purity here.
### `desktop_observe`
Meaning on Android: screen/device observation.
Actions to support:
- `screen_info`: display metrics, density, orientation, interactive state.
- `active_window` / `get_active_window`: foreground app/activity using UsageStats or AccessibilityService.
- `screenshot` / `region_screenshot`: MediaProjection with explicit consent.
- `clipboard_get`: only where Android version and foreground restrictions permit it.
- `cursor_position` / `mouse_position`: mostly compatibility shim; Android has touch focus, not a persistent cursor.
### `audio_control`
Actions to support:
- `get_audio_status`: stream volume, route, ringer mode, mute state.
- `list_audio_devices`: AudioDeviceInfo enumeration.
- `play_audio`: play gateway-provided audio through MediaPlayer/ExoPlayer.
- `capture_input`: microphone recording with runtime permission and foreground UX.
- `capture_output`: return unsupported unless Android API/app role permits it.
### `camera_control`
Actions to support:
- `list_cameras`: Camera2/CameraX camera inventory.
- `get_camera_status`: backend/permission state.
- `capture_frame`: explicit camera capture.
- `capture_video`: explicit video capture.
- `inject_video`: unsupported on normal Android; no fake parity.
### `browser_control`
Not first milestone. Basic browser launching/open URL belongs in `computer_control.open_url`. Full browser automation can be added later if there is a clean Android backend.
### `exec`
Disabled by default. Android shell execution is SELinux/app-sandbox constrained and not Linux-equivalent. If enabled later, it must implement the same allow/deny/ask model as Linux and clearly report limitations.
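As an illustration of what an allow/deny/ask model could look like on the Android side, here is a minimal sketch in plain Kotlin. The rule format of the Linux agent is not shown in this commit, so `ExecDecision`, `ExecRule`, and `ExecPolicy` are hypothetical names, and prefix matching is a simplifying assumption (a real policy would probably need stricter matching, since the prefix `rm` also matches `rmdir`).

```kotlin
// Hypothetical policy types; the real Linux agent's rule format is not shown in this commit.
enum class ExecDecision { ALLOW, DENY, ASK }

data class ExecRule(val prefix: String, val decision: ExecDecision)

class ExecPolicy(
    private val enabled: Boolean = false, // exec stays off unless the user opts in
    private val rules: List<ExecRule> = emptyList(),
) {
    // First matching prefix wins; anything unmatched asks the user rather than running silently.
    fun decide(commandLine: String): ExecDecision {
        if (!enabled) return ExecDecision.DENY
        val trimmed = commandLine.trim()
        return rules.firstOrNull { trimmed.startsWith(it.prefix) }?.decision ?: ExecDecision.ASK
    }
}
```

With the default constructor every command is denied, which matches the "disabled by default" stance. Falling back to `ASK` for unmatched commands keeps the decision visible to the user.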
## Permission model
Android requires explicit grants for the interesting bits:
- AccessibilityService for gestures, key actions, UI inspection.
- UsageStats or Accessibility for foreground app/activity.
- MediaProjection for screenshots/screen recording.
- Camera permission for capture.
- Microphone permission for input capture.
- Notification listener permission for notification bridge if added.
The app must surface these as setup/status checks, not pretend they are available.
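One way to surface these grants as status checks is a small lookup from action to required grants. The sketch below is plain Kotlin with no Android dependencies. The `Grant` enum, the `requiredGrants` table, and `missingGrants` are hypothetical names chosen for illustration, and the table only covers a few of the actions listed in this document.

```kotlin
// Hypothetical grant names mirroring the list above; these are not Android API constants.
enum class Grant { ACCESSIBILITY, USAGE_STATS, MEDIA_PROJECTION, CAMERA, MICROPHONE, NOTIFICATION_LISTENER }

// Each action lists alternative grant sets; any one fully granted set is enough.
val requiredGrants: Map<String, List<Set<Grant>>> = mapOf(
    "tap" to listOf(setOf(Grant.ACCESSIBILITY)),
    "get_active_window" to listOf(setOf(Grant.USAGE_STATS), setOf(Grant.ACCESSIBILITY)),
    "screenshot" to listOf(setOf(Grant.MEDIA_PROJECTION)),
    "capture_frame" to listOf(setOf(Grant.CAMERA)),
    "capture_video" to listOf(setOf(Grant.CAMERA, Grant.MICROPHONE)),
    "capture_input" to listOf(setOf(Grant.MICROPHONE)),
)

// Returns the smallest set of grants still missing, or an empty set when the action is ready.
// Actions without an entry in the table are treated as needing no special grant.
fun missingGrants(action: String, granted: Set<Grant>): Set<Grant> {
    val alternatives = requiredGrants[action] ?: return emptySet()
    return alternatives.map { it - granted }.minByOrNull { it.size } ?: emptySet()
}
```

A setup screen could call `missingGrants` for each advertised action and show the result, so the user sees which grant is still outstanding.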
# Protocol compatibility
Hermes Node Android should speak the same JSON/WebSocket protocol as the Linux node agent where the semantics match.
Hermes Node Android should speak the same JSON/WebSocket protocol as the Linux node agent.
## Shared messages
## Registration
- `register`
- `register_ack`
- `heartbeat`
- `heartbeat_ack`
- command messages routed by capability
## Android-specific principle
The Android agent should advertise Android-specific capabilities rather than pretending to support Linux capabilities. Gateway-side tools can route to these capabilities once implemented.
## Initial registration example
The Android node registers existing tool names:
```json
{
"type": "register",
"node_name": "phone",
"version": "0.1.0-android",
"tools": [
"computer_control",
"desktop_observe",
"audio_control",
"camera_control"
],
"capabilities": {
"platform": "android",
"enable_exec": false,
"enable_camera_control": true,
"enable_audio_control": true,
"computer_control": {
"available": true,
"platform_semantics": "android_phone_control"
}
}
}
```
## Wire compatibility rule
Use existing message/result types:
- `computer_control` → `computer_control_result`
- `desktop_observe` → `desktop_observe_result`
- `audio_control` → `audio_control_result`
- `camera_control` → `camera_control_result`
- `exec` → existing exec output/complete flow, if later enabled
Do not introduce `phone_control` unless the gateway needs a user-facing alias. The Android node can expose Android-only actions inside `computer_control` while preserving the gateway-side tool name.
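The mapping above can be kept in a single table on the node side. This is a minimal sketch, not the agent's actual code: `resultTypes` and `resultTypeFor` are hypothetical names, while the `exec_complete` and `error` values follow what `NodeAgent` in this commit sends.

```kotlin
// Hypothetical helper: one table for the command-type to result-type pairs listed above.
val resultTypes: Map<String, String> = mapOf(
    "computer_control" to "computer_control_result",
    "desktop_observe" to "desktop_observe_result",
    "audio_control" to "audio_control_result",
    "camera_control" to "camera_control_result",
    "exec" to "exec_complete",
)

// Unknown command types fall back to a generic "error" result type.
fun resultTypeFor(commandType: String): String = resultTypes[commandType] ?: "error"
```

Keeping the pairs in one map means a new capability only needs one new entry, and a type such as `phone_control` that the gateway never defined resolves to `error`.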
## Android-specific action examples
### Launch an app
```json
{
"type": "computer_control",
"id": "cmd-1",
"action": "launch_app",
"params": {
"package": "org.telegram.messenger"
}
}
```
### Open a URL
```json
{
"type": "computer_control",
"id": "cmd-2",
"action": "open_url",
"params": {
"url": "https://lisa.nexlab.net/"
}
}
```
### Get screen info
```json
{
"type": "desktop_observe",
"id": "cmd-3",
"action": "screen_info",
"params": {}
}
```
## Compatibility principle
Same capability names, honest platform metadata, explicit unsupported errors for Android-impossible actions. That keeps the gateway stable while giving Android real functionality.