Cinematic Cultural Video Pipeline

Overview

This skill captures a full workflow for making a high-quality AI short video about cultural, historical, folk, or festival topics. It is designed for cases where the user explicitly wants to go step by step: understand → plan → script → design assets → storyboard → confirm → generate video → compose final MP4.

Core principle: do not jump straight to video. Build a coherent production system first, especially when the video needs recurring characters, historical figures, fixed clothing, props, and cinematic continuity.

The workflow was validated on a Dragon Boat Festival customs explainer: a modern adult female protagonist travels back to ancient times, observes Qu Yuan, dragon boats, zongzi, calamus, and mugwort, and returns to the present understanding the meaning of the greeting “端午安康” ("peace and good health at Duanwu").

When to Use

Use this when the user asks for:

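explainer videos on festival customs, folklore, history, mythology, or traditional culture (节日习俗 / 民俗 / 历史 / 神话 / 传统文化科普视频)
a staged workflow: "understand first, then plan, then script, then produce images, and generate video only after everything is confirmed" (“先理解，再规划，再脚本，再出图，全确认后再生成视频”)
cinematic realism, Eastern fantasy (东方玄幻), a xianxia/cultivation feel (修仙感), and large-scale spectacle (大场面), rather than ordinary PPT-style slides, children’s illustration, or generic ink-painting explainers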
recurring characters that must stay consistent across many shots
a full AI video pipeline using storyboards as image-to-video first frames
final delivery with narration, subtitles, music, compressed preview, and high-quality original

Do not use this for quick one-off image generation, ordinary static posters, or videos where the user does not care about staged approval and continuity.

Mandatory Production Gates

Never skip gates unless the user explicitly says to skip them.

1. Understand source material
- Read uploaded PDFs/images/notes.
- Extract factual points, customs, characters, places, legends, and symbolic meanings.
- Do not write a script yet.
- Ask for confirmation of understanding.
2. Plan the video
- Define audience, duration, style, narrative hook, visual direction, structure, and risk points.
- Decide whether to use a narrator, viewpoint character, or first-person story.
- Confirm before scripting.
3. Write/rewrite the script
- Produce narration, short subtitles, scene beats, and tone notes.
- Keep subtitles short; do not fill the screen.
- For cultural videos, prefer story-driven explanation over lecture-style delivery.
- Confirm before visuals.
4. Analyze visual references
- Summarize realism level, color palette, lighting, lens style, clothing, and world tone.
- Convert user phrases like “普通没人看了” ("nobody watches ordinary ones anymore") into a visual strategy: cinematic realism, fantasy atmosphere, dramatic scene scale.
5. Pre-production bible
- Define world rules, style rules, protagonist, supporting characters, historical figures, scenes, props, costumes, palette, and camera rules.
6. Character design first
- Generate/confirm main character three-view sheet.
- Generate/confirm expression sheet and action sheet if the character recurs.
- Generate/confirm historical figures and supporting characters.
- Do not create final storyboards before key characters are approved.
7. Scene/prop design
- Generate/confirm recurring environments and cultural props.
- For festival videos, props are often the factual anchors: zongzi, mugwort, calamus, five-color thread, dragon boat, realgar wine, old book, ritual symbols.
8. Storyboard frames
- Generate one image per shot, embedding character, costume, scene, prop, camera, lighting, and narration constraints.
- Confirm all storyboards before video generation.
9. Image-to-video generation
- Use confirmed storyboards as first frames.
- Generate short clips per shot.
- Keep each shot independently rerunnable.
10. Post-production composition
- Align clip durations to narration.
- Add voiceover, subtitles, background ambience/music.
- Export high-quality original and a compressed messaging-platform version.

Story Pattern: Modern Viewpoint Character

For folk-history topics, a strong default pattern is a modern protagonist as audience surrogate:

1. Modern misconception or simple habit: "I used to think the Dragon Boat Festival was just about eating zongzi." (“以前我以为端午就是吃粽子。”)
2. Trigger object: old book, herb scent, festival object, family item.
3. Portal or sensory transition.
4. Arrival at the ancient/historical site.
5. Observe a key figure or key event.
6. Participate in one small symbolic action.
7. Connect individual customs into one meaning.
8. Return to modern life with a changed understanding.

Tone rule:

Most narration can be happy, curious, first-person, travel/vlog-like.
Heavy history, death, exile, war, patriotic grief, or real historical suffering must become slower and respectful.
Avoid documentary announcer tone unless the user asks for it.

Character and Asset Rules

Main protagonist

For a recurring modern character, lock:

adult age range and identity; avoid minor-coded wording
exact hairstyle
exact clothing and outerwear state
shoes and accessories
body type and emotional energy
travel/action wear: sweat, dust, wrinkles, mud, wind
recurring object: book, herb, bracelet, etc.

For three-view sheets, hard-code:

same person, same face, same hairstyle
same exact outfit front/side/back
same outerwear state in all views
same shoes/accessories
no inconsistent garment changes

Historical figures

Do not render canonical historical figures as generic fantasy NPCs. Identify their well-known visual anchors first:

period and social identity
classic painting/statue language
signature headwear, clothing, props, posture
temperament
explicit negative constraints

Example for Qu Yuan:

Warring States Chu scholar-official
thin, middle-aged literatus
high crown evoking the 切云冠 ("cloud-cutting" crown) imagery
dark blue or deep cyan wide-sleeved Chu robe
bamboo scroll, herbs, riverbank exile mood
sorrowful, dignified, patriotic grief
not a young idol, Daoist immortal, wuxia swordsman, armored general, or white-robed fantasy god

Props and customs

Every custom should have both:

surface action: eating zongzi, racing dragon boats, hanging calamus, placing mugwort
meaning layer: remembrance, protection, disease avoidance, seasonal cleansing, family safety, folk exorcism

Make the visual metaphor serve the factual meaning. Do not add random combat or fantasy effects that distort the custom.

Prompting Structure

Use a repeated prompt skeleton for every asset and storyboard:

[Format] vertical 9:16, cinematic realism, high detail, no text, no watermark
[World] grounded historical/cultural setting + controlled eastern fantasy atmosphere
[Characters] exact character identity, outfit, hairstyle, emotional state
[Action] what happens in this shot
[Scene/Props] confirmed scene and props
[Camera] shot size, movement, lens feel
[Lighting/Color] palette and atmosphere
[Continuity] maintain fixed outfit, fixed face, fixed prop design
[Negative] no cartoon, no childish style, no random clothing changes, no extra limbs, no subtitles/logos/watermarks

For video prompts, add motion instructions:

Keep the first-frame character identity, exact outfit, historical accuracy, natural motion, cinematic vertical video, no subtitles, no logos, no watermarks, no distorted hands or faces.
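
A minimal sketch of how the skeleton above could be assembled in code, so every asset and storyboard prompt has the same sections in the same order. The names `SECTION_ORDER`, `VIDEO_SUFFIX`, and `build_prompt` are illustrative assumptions, not part of any image or video API.

```python
# Hypothetical helper: assembles the bracketed prompt skeleton in a fixed order.
SECTION_ORDER = [
    "Format", "World", "Characters", "Action", "Scene/Props",
    "Camera", "Lighting/Color", "Continuity", "Negative",
]

# Motion instructions appended only for image-to-video prompts.
VIDEO_SUFFIX = (
    "Keep the first-frame character identity, exact outfit, historical accuracy, "
    "natural motion, cinematic vertical video, no subtitles, no logos, "
    "no watermarks, no distorted hands or faces."
)


def build_prompt(fields: dict, for_video: bool = False) -> str:
    """Return one prompt string; raise if any skeleton section is missing."""
    missing = [name for name in SECTION_ORDER if not fields.get(name)]
    if missing:
        raise ValueError(f"missing prompt sections: {missing}")
    lines = [f"[{name}] {fields[name]}" for name in SECTION_ORDER]
    if for_video:
        lines.append(VIDEO_SUFFIX)
    return "\n".join(lines)
```

Failing loudly on a missing section is a deliberate choice here: a silently dropped [Continuity] or [Negative] line is a likely source of outfit drift between shots.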

Batch Storyboard-to-Video Workflow

Directory convention

Use a self-contained artifact tree:

~/.hermes/artifacts/<project>/
  source/                     # optional copied source materials
  characters/                 # character prompts and sheets
  scenes/                     # scene prompts and images
  props/                      # prop prompts and images
  storyboard/                 # 01_...png through NN_...png
  video_clips/                # generated per-shot mp4 files + task metadata
  compose_work/               # intermediate processed clips/audio
  final_script_vX.md
  final_narration_vX.txt
  narration_vX.mp3
  final_vX.mp4
  final_vX_feishu.mp4         # compressed messaging-platform copy

Public image URL requirement

Image-to-video APIs such as Seedance usually require a public image_url; local paths and base64 images may fail.

Preferred options:

Stable object storage: COS / TOS / S3 / R2.
Temporary public hosts only as fallback, then submit tasks immediately.

Temporary fallback pattern:

import mimetypes, requests
from pathlib import Path

def upload_public(path: Path) -> str:
    mime = mimetypes.guess_type(str(path))[0] or 'application/octet-stream'

    # Uguu returns JSON: {success: true, files: [{url: ...}]}
    try:
        with path.open('rb') as f:
            r = requests.post(
                'https://uguu.se/upload',
                files={'files[]': (path.name, f, mime)},
                timeout=180,
            )
        if r.status_code < 400:
            url = r.json().get('files', [{}])[0].get('url')
            if url and url.startswith('http'):
                return url.replace('\\/', '/')
        last = f'Uguu {r.status_code}: {r.text[:300]}'
    except Exception as e:
        last = f'Uguu error: {e}'

    # Litterbox returns a plain-text URL
    try:
        with path.open('rb') as f:
            r = requests.post(
                'https://litterbox.catbox.moe/resources/internals/api.php',
                data={'reqtype': 'fileupload', 'time': '1h'},
                files={'fileToUpload': (path.name, f, mime)},
                timeout=180,
            )
        if r.status_code < 400:
            url = r.text.strip()
            if url.startswith('http'):
                return url
        last += f' | Litterbox {r.status_code}: {r.text[:300]}'
    except Exception as e:
        last += f' | Litterbox error: {e}'

    raise RuntimeError(f'All temporary upload hosts failed: {last}')

Pitfall: 0x0.st may reject uploads with an HTTP 503 and a message along the lines of "uploads disabled ... AI botnet spam". Do not block on it; fail over to another host.

Submit image-to-video task

For Volcengine Agent Plan / Seedance, use environment variables and never hard-code keys:

import os, requests

API_KEY = os.environ.get('CUSTOM_AGENTPLAN_API_KEY')
BASE_URL = os.getenv('CUSTOM_AGENTPLAN_BASE_URL', 'https://ark.cn-beijing.volces.com/api/plan/v3').rstrip('/')
MODEL = 'doubao-seedance-1.5-pro'
if not API_KEY:
    raise RuntimeError('CUSTOM_AGENTPLAN_API_KEY missing')
HEADERS = {'Authorization': f'Bearer {API_KEY}', 'Content-Type': 'application/json'}

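# video_prompt: motion prompt text for this shot
# image_url: public URL of the confirmed storyboard frame (see upload_public above)
payload = {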
    'model': MODEL,
    'content': [
        {'type': 'text', 'text': video_prompt},
        {'type': 'image_url', 'image_url': {'url': image_url}},
    ],
    'resolution': '720p',
    'ratio': '9:16',
    'duration': 5,
}

r = requests.post(f'{BASE_URL}/contents/generations/tasks', headers=HEADERS, json=payload, timeout=90)
r.raise_for_status()
task_id = r.json()['id']

Important API details:

Use type: image_url, not base64 image content.
Do not add role fields to image content.
Poll asynchronously; tasks may be queued, running, then succeeded.
Video URL is usually in content.video_url, but implement recursive fallback search if needed.
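
A minimal sketch of the polling loop and the recursive fallback search described above. It assumes the task response carries a top-level `status` field with the values named in this section, and it treats any other status as a failure; both are assumptions to check against the real API. `fetch` is a caller-supplied function (hypothetical) that performs the actual GET for one task and returns the decoded JSON.

```python
import time
from typing import Any, Callable, Optional


def find_video_url(obj: Any) -> Optional[str]:
    """Depth-first search of a decoded JSON response for a 'video_url' string."""
    if isinstance(obj, dict):
        # Check the usual location first: content.video_url
        content = obj.get("content")
        if isinstance(content, dict) and isinstance(content.get("video_url"), str):
            return content["video_url"]
        for key, value in obj.items():
            if key == "video_url" and isinstance(value, str):
                return value
            found = find_video_url(value)
            if found:
                return found
    elif isinstance(obj, list):
        for item in obj:
            found = find_video_url(item)
            if found:
                return found
    return None


def poll_task(fetch: Callable[[], dict], interval: float = 10.0,
              max_wait: float = 900.0,
              sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch() until the task succeeds, then return its video URL."""
    waited = 0.0
    while True:
        data = fetch()
        status = data.get("status")  # assumed field name
        if status == "succeeded":
            url = find_video_url(data)
            if not url:
                raise RuntimeError("task succeeded but no video URL was found")
            return url
        if status not in ("queued", "running"):
            raise RuntimeError(f"unexpected task status: {status!r}")
        if waited >= max_wait:
            raise TimeoutError(f"task still {status} after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

Injecting `fetch` and `sleep` keeps the loop testable without network access and leaves the exact task-status endpoint to the caller.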

Resume-safe metadata

Write a JSON metadata file after every upload/submission/poll so the batch can resume:

{
  "clips": [
    {
      "idx": 1,
      "file": "01_scene.png",
      "prompt": "...",
      "image_url": "https://...",
      "task_id": "...",
      "status": "downloaded",
      "video_url": "https://...",
      "local_video": "/path/to/01_scene.mp4"
    }
  ]
}

Submit all missing tasks first, then poll pending tasks until all are downloaded. This avoids restarting from scratch after a timeout.
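
A minimal sketch of resume-safe metadata handling, assuming the JSON shape shown above. The function names and the example file name `tasks.json` are hypothetical. The write goes to a temporary file first and is then moved into place, so an interrupted run should not leave a half-written metadata file.

```python
import json
import os
import tempfile
from pathlib import Path


def load_meta(path: Path) -> dict:
    """Load the batch metadata, or start an empty batch if none exists yet."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"clips": []}


def save_meta(path: Path, meta: dict) -> None:
    """Write atomically: dump to a temp file, then replace the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(meta, f, ensure_ascii=False, indent=2)
    os.replace(tmp, path)


def needs_submit(clip: dict) -> bool:
    """A clip without a task_id has not been submitted yet."""
    return not clip.get("task_id")


def needs_poll(clip: dict) -> bool:
    """A submitted clip is pending until its status is 'downloaded'."""
    return bool(clip.get("task_id")) and clip.get("status") != "downloaded"
```

On restart, load the file, submit every clip where `needs_submit` is true, then poll every clip where `needs_poll` is true, calling `save_meta` after each change.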

Final Composition Workflow

Generate narration

Use a TTS provider suited to the language. For Chinese, edge-tts works well:

edge-tts \
  --voice zh-CN-XiaoyiNeural \
  --rate +8% \
  --pitch +6Hz \
  --file final_narration.txt \
  --write-media narration.mp3

Use a warmer voice for documentary; use a livelier voice for first-person travel/vlog energy.

Align clip durations to narration

Use ffprobe to get narration duration.
Estimate per-shot duration from paragraph length.
Scale all shot durations to match exact narration length.
If generated clips are shorter than the needed shot duration, loop them; if longer, trim them.

Example duration estimate:

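# narration_paragraphs: list of str, one paragraph per shot
# narration_duration: narration length in seconds, as reported by ffprobe
weights = []
for paragraph in narration_paragraphs:
    chinese = sum(1 for ch in paragraph if '\u4e00' <= ch <= '\u9fff')
    punctuation = sum(1 for ch in paragraph if ch in '，。！？——：、')
    weights.append(chinese + 0.8 * punctuation)  # punctuation counts as a short pause

total_weight = sum(weights)  # computed once; zero means no Chinese text was found
durations = [w / total_weight * narration_duration for w in weights]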
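
A minimal sketch of the two remaining steps: reading the narration length with ffprobe and scaling the weights so the shot durations add up to exactly that length. The function names are hypothetical. Working in whole milliseconds and giving the rounding remainder to the last shot is one possible design, chosen so the stitched video does not drift away from the narration.

```python
import subprocess


def probe_duration(path: str) -> float:
    """Return the media duration in seconds as reported by ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return float(out.strip())


def fit_durations(weights: list, total_seconds: float) -> list:
    """Scale weights to durations (millisecond precision) summing to the total."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    total_ms = round(total_seconds * 1000)
    ms = [int(total_ms * w / total_weight) for w in weights]
    ms[-1] += total_ms - sum(ms)  # rounding remainder goes to the last shot
    return [m / 1000 for m in ms]
```

Usage would look like `fit_durations(weights, probe_duration("narration.mp3"))`, with each result passed to ffmpeg as the `-t` value for that shot.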

Normalize clips and burn short subtitles

ffmpeg -y -stream_loop -1 -i input_clip.mp4 \
  -t 6.500 \
  -vf "scale=720:1280:force_original_aspect_ratio=increase,crop=720:1280,setsar=1,fps=30,drawtext=fontfile=/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc:text='短字幕':x=(w-text_w)/2:y=h-185:fontsize=38:fontcolor=white:borderw=4:bordercolor=black@0.72:box=1:boxcolor=black@0.22:boxborderw=18" \
  -an -c:v libx264 -preset veryfast -crf 18 -pix_fmt yuv420p \
  processed_01.mp4

Subtitle rules:

Keep subtitles short; use meaning subtitles, not full narration.
Place near bottom but not over important faces/hands.
Use strong border for readability.
Avoid long multi-line subtitles in cinematic videos.
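
A minimal sketch of building the normalization command per shot from Python. It passes an argument list instead of a shell string, and reads the subtitle from a UTF-8 text file through drawtext's `textfile` option, which avoids escaping quotes and colons inside the subtitle text. The function name is hypothetical, and the sketch assumes the font and subtitle paths contain no characters that are special in an ffmpeg filter string (such as `:`, `,`, or quotes).

```python
# Same font path as in the ffmpeg command above.
FONT = "/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc"


def normalize_cmd(src: str, dst: str, seconds: float, subtitle_file: str) -> list:
    """Build the ffmpeg argument list for one shot (no shell quoting involved)."""
    vf = (
        "scale=720:1280:force_original_aspect_ratio=increase,"
        "crop=720:1280,setsar=1,fps=30,"
        f"drawtext=fontfile={FONT}:textfile={subtitle_file}:"
        "x=(w-text_w)/2:y=h-185:fontsize=38:fontcolor=white:"
        "borderw=4:bordercolor=black@0.72:"
        "box=1:boxcolor=black@0.22:boxborderw=18"
    )
    return [
        "ffmpeg", "-y", "-stream_loop", "-1", "-i", src,
        "-t", f"{seconds:.3f}", "-vf", vf,
        "-an", "-c:v", "libx264", "-preset", "veryfast",
        "-crf", "18", "-pix_fmt", "yuv420p", dst,
    ]
```

Write each short subtitle to its own file first, then run the list with `subprocess.run(cmd, check=True)`.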

Concatenate clips

printf "file '%s'\n" processed_*.mp4 > concat.txt
ffmpeg -y -f concat -safe 0 -i concat.txt -c copy stitched.mp4
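
A minimal Python sketch of the same list-writing step, for batches driven from a script. The function name is hypothetical. It sorts file names lexicographically, so it assumes zero-padded indices (`processed_01`, `processed_02`, ...); without padding, `processed_10` would sort before `processed_2`. It writes absolute paths, which is why `-safe 0` is still needed on the ffmpeg side.

```python
from pathlib import Path


def write_concat_list(clip_dir: Path, list_path: Path) -> list:
    """Write an ffmpeg concat list of processed_*.mp4 files in sorted order."""
    clips = sorted(clip_dir.glob("processed_*.mp4"))
    if not clips:
        raise FileNotFoundError(f"no processed_*.mp4 files in {clip_dir}")
    lines = []
    for clip in clips:
        # Inside single quotes, a literal ' is written as '\'' in the list file.
        escaped = str(clip.resolve()).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    list_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return clips
```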

Add background audio and mix

A simple generated ambience bed can be enough for drafts:

ffmpeg -y \
  -f lavfi -i "anoisesrc=color=pink:amplitude=0.018:duration=123" \
  -f lavfi -i "sine=frequency=196:duration=123:sample_rate=44100" \
  -filter_complex "[1:a]volume=0.018[a1];[0:a][a1]amix=inputs=2:duration=first,afade=t=in:st=0:d=3,afade=t=out:st=119:d=4[a]" \
  -map "[a]" -c:a libmp3lame -q:a 4 music_bed.mp3
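
The command above hard-codes a 123-second bed with the fade-out starting at 119 seconds. A minimal sketch, with a hypothetical function name, that derives both numbers from the actual narration duration while keeping the same filters and levels:

```python
def ambience_cmd(duration: float, out_path: str,
                 fade_in: float = 3.0, fade_out: float = 4.0) -> list:
    """Build the ffmpeg argument list for a generated ambience bed."""
    if duration <= fade_in + fade_out:
        raise ValueError("duration is too short for the requested fades")
    fade_out_start = duration - fade_out  # fade-out ends exactly at the end
    graph = (
        "[1:a]volume=0.018[a1];"
        "[0:a][a1]amix=inputs=2:duration=first,"
        f"afade=t=in:st=0:d={fade_in:g},"
        f"afade=t=out:st={fade_out_start:.3f}:d={fade_out:g}[a]"
    )
    return [
        "ffmpeg", "-y",
        "-f", "lavfi", "-i",
        f"anoisesrc=color=pink:amplitude=0.018:duration={duration:.3f}",
        "-f", "lavfi", "-i",
        f"sine=frequency=196:duration={duration:.3f}:sample_rate=44100",
        "-filter_complex", graph,
        "-map", "[a]", "-c:a", "libmp3lame", "-q:a", "4", out_path,
    ]
```

Calling `ambience_cmd(123, "music_bed.mp3")` reproduces the fade timing of the shell command above.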

Mix with narration:

ffmpeg -y -i narration.mp3 -i music_bed.mp3 \
  -filter_complex "[0:a]volume=1.0[a0];[1:a]volume=0.38[a1];[a0][a1]amix=inputs=2:duration=first:dropout_transition=0,alimiter=limit=0.95[a]" \
  -map "[a]" -c:a aac -b:a 192k mixed_audio.m4a

Mux final video:

ffmpeg -y -i stitched.mp4 -i mixed_audio.m4a \
  -map 0:v:0 -map 1:a:0 \
  -c:v copy -c:a aac -shortest -movflags +faststart final_v1.mp4

Messaging-platform compressed copy

For Feishu/Lark or other messaging apps, create a smaller copy if the original is too large:

ffmpeg -y -i final_v1.mp4 \
  -vf "scale=540:960" \
  -c:v libx264 -preset medium -crf 28 \
  -c:a aac -b:a 96k \
  -movflags +faststart final_v1_feishu.mp4

Send the compressed copy in chat if the original fails, but always provide the original path too.

Quality Control Checklist

Before telling the user the video is done:

[ ] All storyboard clips generated and downloaded.
[ ] Metadata JSON records image URLs, task IDs, video URLs, local paths.
[ ] Final MP4 opens and has expected duration.
[ ] Aspect ratio is vertical 9:16.
[ ] Narration and subtitles roughly align.
[ ] Short subtitles do not cover key faces or action.
[ ] Main character keeps same face, hairstyle, outfit, and outerwear state.
[ ] Historical figures do not drift into generic fantasy characters.
[ ] Props remain recognizable and culturally accurate.
[ ] No visible watermarks, random text, mangled hands/faces, or severe flicker.
[ ] Provide both final original path and per-shot clip directory.
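
A minimal sketch that automates three of the checks above (stream presence, 9:16 aspect ratio, expected duration) from the JSON that `ffprobe -v error -print_format json -show_format -show_streams final_v1.mp4` prints. The function name and the 0.5-second tolerance are assumptions, and the sketch ignores rotation metadata. The visual checks still need a human review.

```python
import json


def check_video(ffprobe_json: str, expected_seconds: float,
                tolerance: float = 0.5) -> list:
    """Return a list of problems found in ffprobe's JSON output (empty if none)."""
    info = json.loads(ffprobe_json)
    streams = info.get("streams", [])
    video = [s for s in streams if s.get("codec_type") == "video"]
    audio = [s for s in streams if s.get("codec_type") == "audio"]
    problems = []
    if not video:
        problems.append("no video stream")
    else:
        w, h = video[0]["width"], video[0]["height"]
        if w * 16 != h * 9:  # exact 9:16 test without floating point
            problems.append(f"aspect ratio is not 9:16: {w}x{h}")
    if not audio:
        problems.append("no audio stream")
    duration = float(info.get("format", {}).get("duration", 0))
    if abs(duration - expected_seconds) > tolerance:
        problems.append(
            f"duration {duration:.2f}s differs from expected {expected_seconds:.2f}s"
        )
    return problems
```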

Common Pitfalls

Skipping staged approvals. This creates random visuals and user distrust. Confirm understanding, plan, script, assets, and storyboards before video.
Treating “出图” ("produce images") as standalone illustrations. For cinematic shorts, “出图” should usually mean character sheets, scenes, and props first, then storyboard frames.
Forgetting public image URLs. Image-to-video APIs cannot read local files; upload storyboards first.
Relying on one temporary file host. Temporary hosts fail or rate-limit. Prefer object storage; otherwise implement Uguu + Litterbox fallback.
No resume metadata. Long batches time out. Save metadata after every step.
Generating one giant video directly. Use per-shot clips. Bad shots can be rerun individually.
Overlong subtitles. Cinematic videos need short, readable captions, not full narration on screen.
Ignoring historical visual anchors. Real cultural figures need canonical period/style constraints.
Not providing paths. Users often want original files and clip files. Provide final MP4 path, compressed path, and video_clips/ directory.

Deliverables Template

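When finished, reply in the user's language, following this template:

The final video is complete:

High-quality original: `/path/to/final_v1.mp4`
Feishu compressed version: `/path/to/final_v1_feishu.mp4`
Per-shot clip directory: `/path/to/video_clips/`
Narration file: `/path/to/final_narration.txt`
Project directory: `/path/to/project/`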

MEDIA:/path/to/final_v1_feishu.mp4

If the user asks for the original, send:

MEDIA:/path/to/final_v1.mp4