docs: add multimodal test client design

aea94bc8 · Stefy Lanza (nextime / spora ) · a7f4abdb · aea94bc8
Commit aea94bc8 authored May 06, 2026 by Stefy Lanza (nextime / spora )
Hide whitespace changes
Inline Side-by-side

Showing with 351 additions and 0 deletions

2026-05-06-manual-multimodal-test-client-design.md .../specs/2026-05-06-manual-multimodal-test-client-design.md +351 -0

No files found.
--- a/docs/superpowers/specs/2026-05-06-manual-multimodal-test-client-design.md
+++ b/docs/superpowers/specs/2026-05-06-manual-multimodal-test-client-design.md
+# Manual Multimodal Test Client Design
+
+## Overview
+
+Add a standalone Python test client script for manual, one-request-at-a-time smoke testing of key CoderAI API flows.
+
+The script should support these test modes:
+
+- normal LLM prompt
+- transcription
+- audio generation
+- video generation
+- video doubt
+- music audio doubt
+
+The user should be able to run exactly one mode per invocation, either by passing the mode directly on the command line or by launching the script without a mode and choosing from an interactive menu.
+
+The script must be usable with minimal input by providing meaningful defaults for prompt, model, endpoint URL, and media files where needed, while still allowing explicit overrides.
+
+## Goals
+
+- Provide one convenient manual smoke-test client for multiple API capabilities.
+- Support one test at a time, not batch orchestration.
+- Allow both scripted CLI use and interactive ad-hoc use.
+- Allow overriding endpoint URL, bearer token, model name, prompt text, and media file inputs.
+- Save generated artifacts locally for generation modes and print any text reply to stdout.
+- Keep the implementation standalone and easy to run without touching the server code paths.
+
+## Non-Goals
+
+- Build an automated benchmark suite.
+- Add server endpoints or alter server request/response contracts.
+- Guarantee that every configured model supports every test mode.
+- Replace unit/integration tests with this script.
+- Support running multiple modes in a single invocation.
+
+## Current Context
+
+The current repository already exposes the following relevant endpoints:
+
+- `POST /v1/chat/completions` in `codai/api/text.py`
+- `POST /v1/audio/transcriptions` in `codai/api/transcriptions.py`
+- `POST /v1/audio/generate` in `codai/api/audio_gen.py`
+- `POST /v1/video/generations` in `codai/api/video.py`
+
+The Pydantic request models show:
+
+- `AudioGenerationRequest` accepts prompt-based generation and returns audio-oriented response data.
+- `VideoGenerationRequest` accepts prompt-based generation plus optional media inputs and returns video-oriented response data.
+- `ChatCompletionRequest` is permissive on extra fields but `ChatMessage.content` normalization currently appears text-centric, converting multipart content arrays into flattened text or placeholders.
+
+This means the client can confidently target the transcription, audio generation, video generation, and plain LLM paths directly. The two “doubt” modes must be designed around the request shapes the backend currently accepts rather than assuming a richer multimodal chat transport that is not yet proven here.
+
+## Recommended Approach
+
+Create a single standalone Python CLI script with:
+
+- one positional `mode` argument for direct execution
+- an interactive menu fallback when `mode` is omitted
+- shared global options for connection/auth/model overrides
+- mode-specific request builders
+- shared response handling utilities
+- local artifact saving for generation outputs
+
+This keeps the user experience simple while avoiding duplicated logic across multiple small scripts.
+
+## Script Behavior
+
+### Invocation styles
+
+The script should support both of these patterns:
+
+- direct mode invocation, for example `python <script> llm`
+- interactive menu invocation, for example `python <script>` then select one mode
+
+Exactly one mode should execute per run.
+
+### Supported modes
+
+The script should expose these user-facing mode names:
+
+- `llm`
+- `transcription`
+- `audio-generation`
+- `video-generation`
+- `video-doubt`
+- `music-audio-doubt`
+
+Aliases are optional but not required in the first version.
+
+### Global options
+
+Every mode should support these shared overrides:
+
+- `--url` for the base server URL
+- `--token` for bearer authentication
+- `--model` for model name or alias override
+- `--prompt` for prompt override when the mode uses prompt text
+- `--output-dir` for local artifact save location
+
+The script may also expose mode-specific file flags such as `--file`, `--audio-file`, `--video-file`, or similarly explicit names, but the shared defaults must make a basic smoke test possible without forcing the user to provide every input.
+
+## Mode Design
+
+### LLM mode
+
+Purpose:
+- send a normal text-only chat request
+
+Endpoint:
+- `POST /v1/chat/completions`
+
+Default inputs:
+- default URL
+- default text-capable model alias/name
+- default prompt such as a short deterministic question
+
+Success behavior:
+- print assistant text reply to stdout
+
+### Transcription mode
+
+Purpose:
+- upload one audio file for speech-to-text
+
+Endpoint:
+- `POST /v1/audio/transcriptions`
+
+Default inputs:
+- default URL
+- default audio transcription model alias/name
+- default sample audio file path
+- optional default language/prompt values if useful
+
+Request format:
+- multipart form with `model` and uploaded `file`
+
+Success behavior:
+- print returned transcription text to stdout
+
+### Audio generation mode
+
+Purpose:
+- generate an audio artifact from a text prompt
+
+Endpoint:
+- `POST /v1/audio/generate`
+
+Default inputs:
+- default URL
+- default audio-generation model alias/name
+- default audio-generation prompt
+- default response format chosen so the client can resolve an artifact reliably
+
+Success behavior:
+- save the resulting audio artifact locally
+- print the local output path to stdout
+- print any returned text metadata or summary if present
+
+### Video generation mode
+
+Purpose:
+- generate a video artifact from a text prompt
+
+Endpoint:
+- `POST /v1/video/generations`
+
+Default inputs:
+- default URL
+- default video-generation model alias/name
+- default prompt
+- conservative default generation parameters suitable for smoke testing
+
+Success behavior:
+- save the resulting video artifact locally
+- print the local output path to stdout
+- print any returned text metadata or summary if present
+
+### Video doubt mode
+
+Purpose:
+- ask a textual question about a supplied video input
+
+Endpoint strategy:
+- use the currently supported server request shape rather than inventing a new dedicated endpoint
+
+Design constraint:
+- the current visible chat request normalization in `codai/pydantic/textrequest.py` suggests text-centric handling of multipart content arrays, so this mode must be implemented only according to request formats actually supported by the current codebase.
+
+First-version design:
+- treat this as a smoke-test mode that combines:
+  - a user prompt describing the question about the video
+  - a required or default video input reference
+  - the best-supported request format discovered in the implementation phase
+
+Success behavior:
+- print the resulting text answer to stdout
+- if the server returns an artifact in addition to text, save it and print the path as well
+
+### Music audio doubt mode
+
+Purpose:
+- ask a textual question about a supplied audio/music input
+
+Endpoint strategy:
+- same constraint and approach as `video-doubt`
+
+First-version design:
+- combine:
+  - a user prompt describing the question about the audio/music input
+  - a required or default audio input reference
+  - the best-supported request format discovered in the implementation phase
+
+Success behavior:
+- print the resulting text answer to stdout
+- if the server returns an artifact in addition to text, save it and print the path as well
+
+## Defaults Strategy
+
+The script should be useful without a long flag list.
+
+Each mode should define a default bundle containing:
+
+- default base URL
+- default model
+- default prompt if applicable
+- default sample file path if required
+- default output extension or expected artifact type
+
+These defaults should be easy to inspect and adjust inside the script.
+
+For file-backed defaults:
+- use predictable local sample asset paths
+- do not embed binary sample files into the script itself
+- if a default sample file is missing, fail with a clear message that tells the user which override flag to supply
+
+## Output Handling
+
+### Stdout behavior
+
+The script should always print the text reply/result to stdout when one exists.
+
+Examples:
+- LLM reply text
+- transcription text
+- text answer from video/audio doubt mode
+- metadata/summary text accompanying a generation response
+
+### Artifact behavior
+
+For generation-oriented responses, the script should save the generated artifact locally by default.
+
+This includes:
+- audio file for `audio-generation`
+- video file for `video-generation`
+- any returned artifact from other modes, if present
+
+The script should then print the resolved saved path to stdout.
+
+### Output directory
+
+The script should save files to a predictable output directory, configurable through `--output-dir`.
+
+File naming should avoid collisions by including mode and a timestamp or similarly unique suffix.
+
+### URL and base64 handling
+
+If an endpoint returns:
+- a downloadable URL, the client should fetch it and save the binary locally
+- base64 content, the client should decode and save it locally
+
+The script should hide those mechanics from the user and present a final local artifact path.
+
+## Error Handling
+
+The script should fail clearly when:
+
+- a required input file is missing and no default sample exists
+- the server returns a non-2xx response
+- a generation mode succeeds structurally but does not actually contain a usable artifact
+- the response shape is incompatible with the selected mode
+- a URL-based artifact cannot be downloaded
+- base64 artifact decoding fails
+
+Error messages should explain:
+- which mode failed
+- which endpoint was called
+- which file/model/prompt inputs were resolved
+- what the server or client-side parsing error was
+
+## File and Code Organization
+
+A focused standalone script is preferred over scattering logic across multiple files unless the script becomes too large.
+
+Internally, the script should still be organized into clear units such as:
+
+- argument parsing and interactive selection
+- mode default resolution
+- request builders per mode
+- HTTP execution helpers
+- response parsing and artifact saving helpers
+
+If helper extraction becomes necessary, it should remain narrowly scoped to this client tooling rather than affecting server modules.
+
+## Testing Strategy
+
+Add automated tests for the client script logic without depending on live model inference.
+
+Recommended test coverage:
+
+- mode selection logic for CLI vs interactive fallback
+- default resolution per mode
+- override precedence for URL/token/model/prompt/file inputs
+- request construction for each supported endpoint type
+- response parsing for:
+  - plain text responses
+  - generation responses returning URLs
+  - generation responses returning base64 payloads
+- error handling for missing default files and malformed responses
+
+Use mocked HTTP responses rather than live network calls.
+
+## Open Implementation Constraint
+
+The exact implementation of `video-doubt` and `music-audio-doubt` must be finalized by confirming which request shape the current backend really supports for video/audio-question-style prompts.
+
+The implementation should not invent a new server contract. It should instead:
+
+- inspect current supported request structures in the repo
+- choose the best existing format
+- document any limitations explicitly in the client behavior and tests
+
+## Files Likely to Change
+
+Likely additions:
+- one standalone test client script in a project-appropriate scripts/tools location
+- one or more tests for that script
+- optional sample asset directory path references if the repo already has or will accept such test assets
+
+Likely no server changes are required for the first version.
+
+## Design Decisions Finalized
+
+- One standalone Python script handles all supported manual smoke-test modes.
+- The script runs exactly one mode per invocation.
+- It supports both direct mode invocation and interactive selection fallback.
+- Users can override endpoint URL, bearer token, model name, prompt, and file inputs.
+- Defaults are built in so the script remains usable with minimal arguments.
+- Generation modes save artifacts locally and print the saved path.
+- Textual replies are printed to stdout whenever present.
+- `video-doubt` and `music-audio-doubt` must be implemented against proven existing backend request formats, not assumed new multimodal contracts.