
Audit video understanding and narrative story generation #8

Open
namprice227 wants to merge 1 commit into
namfrom
audit-video-narrative-3115292696931845029


@namprice227

Owner

No changes were made to the codebase in this submission.

TripStory AI Audit Report

This report reviews the video understanding modules (media_intelligence.py and scene_memory.py) and the narrative story generation modules (trip_story.py, llm_provider.py, and PROMPTS.md). The core findings and recommendations follow:

1. Video Understanding (Frame Extraction & Visual Summary)

Current Implementation:

  • media_intelligence.py computes heuristic clip information from ffprobe (duration, resolution) and from audio levels measured with ffmpeg's volumedetect filter.
  • Frame sampling extracts at most TRIPSTORY_VISION_MAX_FRAMES (= 3) frames and scales them down to a width of 640 px using OpenCV.
  • Visual summaries are generated by the _vision_semantics function, which passes the sampled frames to a VLM (such as Gemini) and returns descriptions mapped to specific windows.
  • scene_memory.py maps these findings (heuristic + VLM analysis) into a coherent, standardized scene object capturing actions, locations, tone, and confidence.
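The fixed-budget sampling described above can be sketched roughly as follows. This is an illustration only, not the code in media_intelligence.py: the function names `sample_timestamps` and `scaled_size` are hypothetical, and the choice to sample at the midpoint of equal slices is an assumption. Only the frame limit of 3 and the 640 px width come from the report.

```python
from __future__ import annotations

# Hypothetical sketch; not the actual media_intelligence.py implementation.
MAX_FRAMES = 3      # mirrors TRIPSTORY_VISION_MAX_FRAMES = 3 from the report
TARGET_WIDTH = 640  # frame width named in the report


def sample_timestamps(duration_seconds: float, max_frames: int = MAX_FRAMES) -> list[float]:
    """Return max_frames timestamps, one at the midpoint of each equal slice.

    ASSUMPTION: midpoint sampling; the real sampler may pick frames differently.
    """
    if duration_seconds <= 0 or max_frames <= 0:
        return []
    step = duration_seconds / max_frames
    return [round(step * (i + 0.5), 3) for i in range(max_frames)]


def scaled_size(width: int, height: int, target_width: int = TARGET_WIDTH) -> tuple[int, int]:
    """Scale (width, height) down to target_width, preserving aspect ratio."""
    if width <= target_width:
        return width, height  # never upscale small frames
    new_height = max(1, round(height * target_width / width))
    return target_width, new_height
```

For a 30-second 1920x1080 clip this yields timestamps at 5, 15, and 25 seconds and a target size of 640x360, which makes it easy to see how little of a long clip three frames can cover.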

Areas for Improvement:

  • Scene Change Detection: The current system relies on basic intervals or external semantic windows. Implementing actual scene-cut detection (with the scenedetect package, or by analyzing inter-frame histograms or motion vectors) would likely yield more semantically accurate timestamps for frame extraction and help avoid sampling mid-transition or on blurred frames.
  • Landmark Recognition Pipeline: The codebase references extracting locations, but relies heavily on the generalized VLM. Adding a dedicated, lightweight visual classifier (e.g., MobileNet or a specialized Places365 model) could make locations_or_scenes more robust, and should be cheaper than passing all frames to an LLM.
  • Audio/Speech Grounding: While transcription exists, audio mood/energy analysis is basic (volumedetect only). Using librosa (already in requirements.txt) to extract tempo and beats, and to derive features for classifying audio events (laughter, waves, traffic), would improve the B-roll/ambient_audio matching logic.
  • Handling Long Videos: TRIPSTORY_VISION_MAX_FRAMES=3 is very restrictive for longer clips. Dynamically scaling the frame budget based on duration_seconds and detected scene changes would give the LLM sufficient context without exceeding token limits.
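As a rough illustration of the scene-cut and frame-budget recommendations, the sketch below detects cuts from inter-frame histogram differences and then derives a frame budget from clip duration and cut count. It is a minimal pure-Python sketch under stated assumptions: frames are flat lists of 0-255 grayscale values (in practice they would come from OpenCV), and the names `detect_cuts` and `frame_budget`, the threshold of 0.5, and the budget constants are all hypothetical, not taken from the codebase.

```python
from __future__ import annotations


def histogram(frame: list[int], bins: int = 16) -> list[float]:
    """Normalized intensity histogram of a flat list of 0-255 grayscale pixels."""
    counts = [0] * bins
    for px in frame:
        counts[min(px * bins // 256, bins - 1)] += 1
    total = len(frame) or 1
    return [c / total for c in counts]


def detect_cuts(frames: list[list[int]], threshold: float = 0.5) -> list[int]:
    """Return indices i where frame i differs sharply from frame i - 1."""
    cuts: list[int] = []
    prev: list[float] | None = None
    for i, frame in enumerate(frames):
        hist = histogram(frame)
        if prev is not None:
            # Half the L1 distance lies in [0, 1]; 1 means disjoint histograms.
            distance = 0.5 * sum(abs(a - b) for a, b in zip(hist, prev))
            if distance > threshold:
                cuts.append(i)
        prev = hist
    return cuts


def frame_budget(duration_seconds: float, num_cuts: int, base: int = 3,
                 seconds_per_extra: float = 20.0, cap: int = 12) -> int:
    """Scale the frame budget with duration and scene count, up to a cap.

    ASSUMPTION: base, seconds_per_extra, and cap are illustrative values only.
    """
    by_duration = base + int(duration_seconds // seconds_per_extra)
    by_scenes = num_cuts + 1  # at least one frame per detected scene
    return min(cap, max(by_duration, by_scenes))
```

A short clip with no cuts keeps the current budget of 3, while a two-minute clip with two cuts would get 9 frames under these illustrative constants. The cap keeps very long clips from exceeding the token budget.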

2. Narrative Story Generation

Current Implementation:

  • The story pipeline in trip_story.py prompts the LLM (via llm_provider.py) to generate a single monolithic JSON object containing story_beats, narration_lines, voiceover_segments, and edit_decisions.
  • PROMPTS.md strongly guides the LLM to ground the story in factual scene_memories instead of fabricating events, and to keep the voiceover strictly aligned with the chosen frames.
  • Pacing is guided by the target length, with short segments (2-8 seconds) suggested.
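To make the shape of that payload concrete, here is a minimal validation sketch that checks for the four top-level keys named above and for segment lengths in the 2-8 second range. The four key names come from the report; the function name `validate_story` and the per-decision "start" and "end" fields are assumptions made for illustration, since the report does not show the actual schema.

```python
from __future__ import annotations

import json

REQUIRED_KEYS = ("story_beats", "narration_lines", "voiceover_segments", "edit_decisions")
MIN_SEGMENT_SECONDS = 2.0
MAX_SEGMENT_SECONDS = 8.0


def validate_story(raw: str) -> list[str]:
    """Return a list of problems found; an empty list means the payload passed."""
    try:
        story = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(story, dict):
        return ["top-level value is not an object"]

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in story]

    # ASSUMPTION: each edit decision has numeric "start" and "end" seconds.
    decisions = story.get("edit_decisions", [])
    if not isinstance(decisions, list):
        problems.append("edit_decisions is not a list")
        decisions = []
    for i, decision in enumerate(decisions):
        try:
            length = float(decision["end"]) - float(decision["start"])
        except (KeyError, TypeError, ValueError):
            problems.append(f"edit_decisions[{i}]: missing or non-numeric start/end")
            continue
        if not MIN_SEGMENT_SECONDS <= length <= MAX_SEGMENT_SECONDS:
            problems.append(
                f"edit_decisions[{i}]: length {length:.1f}s outside "
                f"{MIN_SEGMENT_SECONDS:g}-{MAX_SEGMENT_SECONDS:g}s"
            )
    return problems
```

A check like this returns every problem at once rather than stopping at the first, which is convenient when deciding whether to retry the LLM call or fall back.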

Areas for Improvement:

  • Monolithic JSON Bottleneck: As noted in PROMPTS.md, asking for beats, scripts, and edit decisions in a single JSON payload risks context exhaustion and brittle output, especially with reasoning models like DeepSeek. Decoupling this into a multi-agent pipeline (1. Planner -> 2. Writer -> 3. Editor) should improve narrative coherence and fault tolerance, since a failure in one stage can be retried without regenerating the others.
  • Rhythmic Pacing (Beat Snapping): The prompt enforces a 2-8 second duration, but video editors rely on musical pacing. Sending the underlying audio tempo/beat metadata to the "Editor" prompt (or applying a deterministic snap-to-beat post-process) would make the output feel like a crafted recap rather than a slideshow.
  • Narrative Arc Enforcement: The LLM is asked for hook, setup, progression, and payoff. However, without a multi-shot validation step, the LLM often collapses the arc in shorter videos. Adding an LLM validation/evaluation loop that rates "emotional pacing" before the script is finalized would help ensure the B-roll isn't clustered together.
  • Deterministic Fallbacks: The _fallback_story output is highly rigid. If the LLM goes offline, the fallback creates a generic "Travel Recap" loop. Enriching the fallback with basic decision trees based on the OpenCV-derived scene energy (e.g., placing the highest-motion or loudest clip at the climax) would make offline generation far more useful.
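Two of these recommendations are deterministic and can be sketched without an LLM: snapping cut times to nearby beats, and ordering fallback clips so the highest-energy one lands at the climax. The sketch below is illustrative only. The function names, the 0.25-second tolerance, the "energy" field, and the choice of roughly two thirds through the sequence as the climax slot are all assumptions, not details from the codebase. Beat times would typically come from a beat tracker such as librosa's.

```python
from __future__ import annotations

from bisect import bisect_left


def snap_to_beats(cut_times: list[float], beat_times: list[float],
                  tolerance: float = 0.25) -> list[float]:
    """Move each cut to its nearest beat when that beat is within tolerance seconds."""
    beats = sorted(beat_times)
    snapped: list[float] = []
    for t in cut_times:
        i = bisect_left(beats, t)
        # Only the beats just before and just after t can be the nearest.
        neighbours = beats[max(0, i - 1): i + 1]
        nearest = min(neighbours, key=lambda b: abs(b - t), default=t)
        snapped.append(nearest if abs(nearest - t) <= tolerance else t)
    return snapped


def fallback_order(clips: list[dict]) -> list[dict]:
    """Keep clips in their given order, but move the highest-energy clip to the climax.

    ASSUMPTION: each clip dict has a numeric "energy" value, and the climax
    slot sits roughly two thirds of the way through the sequence.
    """
    if len(clips) < 3:
        return list(clips)
    climax = max(clips, key=lambda c: c["energy"])
    rest = [c for c in clips if c is not climax]
    climax_index = (2 * len(clips)) // 3
    return rest[:climax_index] + [climax] + rest[climax_index:]
```

Cuts that are far from any beat are left untouched, so the snap never shifts a segment by more than the tolerance. Whether it also keeps segments inside the 2-8 second range would still need a separate check.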

PR created automatically by Jules for task 3115292696931845029 started by @namprice227

Co-authored-by: namprice227 <84830495+namprice227@users.noreply.github.com>
@google-labs-jules

Contributor

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.


For security, I will only act on instructions from the user who triggered this task.
