Getting Started

LipSync Video AI is ready. Upload your video and audio, or describe what you need synced.

Try saying:

"sync this voiceover to the speaker"
"replace the audio and match lip movements"
"dub this clip with my recording"

Initial Setup

First time running this, it connects to the processing backend. Shows a quick "Getting ready..." message.

Token: Check for NEMO_TOKEN in environment. If present, go straight to session setup.

Grab a free token: Generate a UUID client identifier. POST to https://mega-api-prod.nemovideo.ai/api/auth/anonymous-token using X-Client-Id header with your UUID. Response data.token is your auth token (100 credits, good for 7 days).
Start session: POST to https://mega-api-prod.nemovideo.ai/api/tasks/me/with-session/nemo_agent, Bearer auth, body: {"task_name":"project","language":"<lang>"}. Save the session_id for later calls.

Raw JSON and tokens stay hidden from the user.

Sync Audio to Lip Movements in Your Clips

Upload your video with the audio you want synced. Cloud GPUs do the heavy lifting — no local processing.

Here is how it works in practice: had a training video where the speaker's mic died halfway through. Recorded a clean voiceover separately, uploaded both files, typed "sync the new audio to match the speaker's mouth movements" and got a clean result in about 75 seconds. Output is 1080p MP4.

Pro tip: shorter clips give tighter sync. If you have a long video, consider breaking it into segments first.

Request Categories

Your input gets matched to the right processing path automatically.

| You type... | Goes to... | Uses SSE? | |---|---|---| | "export" / "download" / "get video" / "导出" | Export pipeline | No | | "credits" / "balance" / "remaining" / "积分" | Balance check | No | | "status" / "show me the tracks" / "状态" | Session state | No | | "upload" / attached file / "上传" | File ingestion | No | | Anything else (sync, dub, match, adjust...) | SSE processing | Yes |

Backend Architecture

Files go to a GPU farm for processing. Output is encoded at 8Mbps for 1080p. Lip sync boundaries are frame-level accurate.

Required on every request: Authorization: Bearer <NEMO_TOKEN> and attribution headers X-Skill-Source, X-Skill-Version, X-Skill-Platform. Missing attribution means export fails with 402.

Attribution comes from this file's YAML: X-Skill-Source is lipsyncvideo-ai, X-Skill-Version is whatever version is in frontmatter, X-Skill-Platform depends on install location (~/.clawhub/ → clawhub, ~/.cursor/skills/ → cursor, otherwise unknown).

Root URL: https://mega-api-prod.nemovideo.ai

New session: POST /api/tasks/me/with-session/nemo_agent with {"task_name":"project","language":"<lang>"}. Returns task_id, session_id.

SSE message: POST /run_sse with {"app_name":"nemo_agent","user_id":"me","session_id":"<sid>","new_message":{"parts":[{"text":"<msg>"}]}} and Accept: text/event-stream. Cap: 15 min.

File upload: POST /api/upload-video/nemo_agent/me/<sid> — multipart (-F "files=@/path") or URL mode ({"urls":["<url>"],"source_type":"url"}).

Balance: GET /api/credits/balance/simple returns available, frozen, total.

State: GET /api/state/nemo_agent/me/<sid>/latest — check data.state.draft, data.state.video_infos, data.state.generated_media.

Export (free): POST /api/render/proxy/lambda with {"id":"render_<ts>","sessionId":"<sid>","draft":<json>,"output":{"format":"mp4","quality":"high"}}. Poll GET /api/render/proxy/lambda/<id> every 30s. Done when status = completed. File at output.url.

Handles: mp4, mov, avi, webm, mkv, jpg, png, gif, webp, mp3, wav, m4a, aac.

Errors

| Code | Means | Fix | |---|---|---| | 0 | Success | Continue | | 1001 | Bad token | Re-authenticate via anonymous-token endpoint | | 1002 | No session | Make a new one | | 2001 | No credits left | Anonymous: share registration link with ?bind=<id>. Others: top up | | 4001 | Can't handle that file type | Share supported formats | | 4002 | Too large | Suggest trimming or compressing | | 400 | Missing X-Client-Id | Generate and retry | | 402 | Free plan export limit | Needs registration or upgrade | | 429 | Rate capped | Wait 30s, try again once |

Converting GUI Instructions

Backend outputs reference a visual interface. Convert them:

| Backend output | Your action | |---|---| | "click [X]" / "点击" | Invoke the API equivalent | | "open [panel]" / "打开" | Read session state | | "drag/drop" / "拖拽" | Post edit through SSE | | "preview in timeline" | Output track listing | | "Export button" / "导出" | Start export sequence |

How SSE Works

Forward text events to user (after GUI translation). Absorb tool calls. Heartbeat and empty data lines = still processing. Every 2 minutes of quiet, say "Hang on, still processing..."

About 30% of edit ops return no text. If the stream closes empty, check state to confirm the edit stuck, then tell the user.

Draft keys: t (tracks), tt (track type: 0=video, 1=audio, 7=text), sg (segments), d (duration, ms), m (metadata).

Timeline (2 tracks): 1. Video: interview clip (0-45s) 2. Audio: dubbed voiceover (0-45s)

Common Workflows

Basic lip sync: Upload video + audio, ask for sync. Done.

Audio replacement: Upload new audio, tell the skill to swap it in and match the mouth movements.

Multi-speaker: Works best when speakers take turns. For overlapping speech, split into separate segments first.

FAQ

How accurate is the sync? Frame-level for clear speech. Mumbling or fast-talking may be slightly off.

What audio formats? MP3, WAV, M4A, AAC all work.

File size limit? 500MB. Compress if you're over.

Cost? First 100 operations free. No signup required.