Audit video understanding and narrative story generation #8
Co-authored-by: namprice227 <84830495+namprice227@users.noreply.github.com>
No changes were made to the codebase in this submission.
TripStory AI Audit Report
Based on a review of the video understanding (`media_intelligence.py` & `scene_memory.py`) and narrative story generation (`trip_story.py`, `llm_provider.py` & `PROMPTS.md`) systems, here are the core findings and recommendations for improvement:
1. Video Understanding (Frame Extraction & Visual Summary)
Current Implementation:
- `media_intelligence.py` computes heuristic clip information using `ffprobe` (duration, resolutions) and audio levels (`volumedetect`).
- [...] (`TRIPSTORY_VISION_MAX_FRAMES` = 3) scaled down to a specific width (640px) using OpenCV.
- [...] `_vision_semantics` function, which uses a VLM (like Gemini) that takes the sample frames and returns descriptions mapped to specific windows.
- `scene_memory.py` maps these findings (heuristic + VLM analysis) into a coherent, standardized scene object capturing actions, locations, tone, and confidence.
Areas for Improvement:
- [...] `scenedetect` or analyzing inter-frame histograms/motion vectors) would yield much more semantically accurate timestamps for frame extraction, preventing sampling mid-transition or blur.
- [...] `locations_or_scenes` much more robust and cheaper than passing all frames to an LLM.
- [...] `volumedetect`). Integrating `librosa` (already in `requirements.txt`) to classify rhythm/beats and audio events (laughter, waves, traffic) would vastly improve the B-roll/ambient_audio matching logic.
- `TRIPSTORY_VISION_MAX_FRAMES=3` is extremely restrictive for longer clips. Dynamically scaling the frame budget based on `duration_seconds` and detected scene changes would provide the LLM with sufficient context without exceeding token limits.
2. Narrative Story Generation
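As a footnote to the frame-budget recommendation that closes section 1, the sketch below shows one way such scaling could look. It is a minimal illustration, not the project's code: the function names `frame_budget` and `sample_timestamps`, the clamp bounds, and the "one frame per 10 seconds" rate are all assumptions chosen for the example. Only `duration_seconds` and the current value of 3 come from the report.

```python
import math
from typing import List


def frame_budget(duration_seconds: float, scene_changes: int = 0,
                 min_frames: int = 3, max_frames: int = 12,
                 seconds_per_frame: float = 10.0) -> int:
    """Hypothetical heuristic for how many frames to send to the VLM.

    Takes the larger of (a) one frame per `seconds_per_frame` of footage and
    (b) one frame per detected scene (scene_changes + 1), then clamps the
    result to [min_frames, max_frames] so token usage stays bounded.
    """
    if duration_seconds <= 0:
        return 0
    by_duration = math.ceil(duration_seconds / seconds_per_frame)
    by_scenes = scene_changes + 1
    wanted = max(by_duration, by_scenes)
    return max(min_frames, min(max_frames, wanted))


def sample_timestamps(duration_seconds: float, n_frames: int) -> List[float]:
    """Evenly spaced timestamps at the midpoint of each of n_frames segments."""
    if n_frames <= 0 or duration_seconds <= 0:
        return []
    step = duration_seconds / n_frames
    return [round(step * (i + 0.5), 3) for i in range(n_frames)]
```

With these assumed defaults, a 5-second clip keeps the current budget of 3 frames, while a 60-second clip with 7 detected scene changes would get 8. The upper clamp is the knob that trades context for token cost.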
Current Implementation:
- `trip_story.py` prompts the LLM (`llm_provider.py`) to generate a monolithic JSON containing `story_beats`, `narration_lines`, `voiceover_segments`, and `edit_decisions`.
- `PROMPTS.md` strongly guides the LLM to ground the story in factual `scene_memories` instead of fabricating events, ensuring strict alignment of the voiceover with the chosen frames.
Areas for Improvement:
- [...] `PROMPTS.md`, asking for beats, scripts, and edit decisions in a single JSON payload risks context exhaustion and brittle output, especially with reasoning models like DeepSeek. Decoupling this into a multi-agent pipeline (1. Planner -> 2. Writer -> 3. Editor) would significantly improve narrative coherence and fault tolerance.
- [...] `hook`, `setup`, `progression`, `payoff`. However, without a multi-shot validation step, the LLM often collapses the arc in shorter videos. Adding an LLM validation/evaluation loop before finalizing the script to rate "emotional pacing" would help ensure the B-roll isn't clustered together.
- `_fallback_story` generates highly rigid arrays. If the LLM goes offline, the fallback creates a generic "Travel Recap" loop. Enriching the fallback to use basic decision trees based on the OpenCV-derived scene energy (e.g., placing the highest motion/highest audio clip at the climax) would make offline generation far more useful.
PR created automatically by Jules for task 3115292696931845029 started by @namprice227
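The fallback enrichment suggested in section 2 (placing the highest-energy clip at the climax) can be sketched roughly as follows. This is an illustrative assumption, not the existing `_fallback_story` logic: the `Clip` type, its `motion` and `audio` fields, the equal weighting, and the choice to treat `payoff` as the climax are all hypothetical. Only the four arc names come from the report.

```python
from dataclasses import dataclass
from typing import Dict, List

# Arc stage names as listed in the audit report.
ARC = ["hook", "setup", "progression", "payoff"]


@dataclass
class Clip:
    clip_id: str
    motion: float  # hypothetical normalized motion score in [0, 1]
    audio: float   # hypothetical normalized loudness in [0, 1]

    @property
    def energy(self) -> float:
        # Assumed equal weighting of motion and audio.
        return 0.5 * self.motion + 0.5 * self.audio


def fallback_arc(clips: List[Clip]) -> Dict[str, List[str]]:
    """Assign clips to arc stages without an LLM.

    The highest-energy clip becomes the payoff, the runner-up opens the
    video as the hook, and the remaining clips are ordered by rising
    energy and split between setup and progression.
    """
    beats: Dict[str, List[str]] = {name: [] for name in ARC}
    if not clips:
        return beats
    ranked = sorted(clips, key=lambda c: c.energy, reverse=True)
    beats["payoff"].append(ranked[0].clip_id)
    if len(ranked) > 1:
        beats["hook"].append(ranked[1].clip_id)
    rest = sorted(ranked[2:], key=lambda c: c.energy)  # ascending energy
    half = (len(rest) + 1) // 2
    beats["setup"] = [c.clip_id for c in rest[:half]]
    beats["progression"] = [c.clip_id for c in rest[half:]]
    return beats
```

A rule-based ordering like this stays deterministic when the LLM is unavailable, and it still produces an arc that builds toward the strongest clip rather than a fixed generic sequence.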