Engineering

How Interactive Demos Are Built From Screen Recordings

Learn the technical process behind turning a screen recording into a clickable interactive demo — from frame extraction to AI-powered hotspot generation.

Riveo Engineering

6 min read

From Pixels to Clickable Demos

Most product teams are stuck in a painful loop: record a screen capture, upload it somewhere, then manually stitch together screenshots with hotspots in a clunky editor. It works — barely. The result is brittle, time-consuming to maintain, and falls apart the moment someone redesigns a button.

There is a better way. Modern interactive demo platforms can ingest a raw screen recording and automatically decompose it into a sequence of interactive steps — each with clickable hotspots, tooltips, and transitions. This post explains exactly how that pipeline works under the hood.

Step 1: Frame Extraction

The journey starts with a video file — typically an MP4 or WebM captured at 30 fps. The first job is to extract individual frames. Tools like ffmpeg make this straightforward: you decode the video stream and emit one PNG per frame, or sample at a lower rate (say 5 fps) to reduce volume without missing transitions.
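
As a concrete sketch, here is one way to do it in Python by shelling out to ffmpeg. The 5 fps sampling rate and file naming below are illustrative defaults rather than requirements:

```python
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, fps: int = 5) -> list[Path]:
    """Sample the recording at `fps` frames per second and emit numbered PNGs."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-vf", f"fps={fps}",           # sample down from the native 30 fps
            str(out / "frame_%05d.png"),   # frame_00001.png, frame_00002.png, ...
        ],
        check=True,
    )
    return sorted(out.glob("frame_*.png"))
```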

At 30 fps, a two-minute recording produces 3,600 frames. Most of those frames are visually identical to their neighbors — a cursor blinking, a loading spinner rotating, or simply dead air while the presenter pauses. The real information lives in the transitions between meaningful states, and the challenge is finding them.

Step 2: Visual Deduplication

Not every frame matters. The goal of deduplication is to collapse runs of near-identical frames into a single representative image. A common approach uses perceptual hashing — algorithms like pHash or dHash that generate compact fingerprints of each frame. When consecutive frames produce hashes within a small Hamming distance, they belong to the same visual "plateau" and can be merged.
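
A minimal version of that plateau merge, using the open-source imagehash library (the Hamming-distance cutoff of 5 is an illustrative value, not a tuned one), might look like this:

```python
from pathlib import Path

from PIL import Image
import imagehash

def dedupe_frames(frame_paths: list[Path], max_distance: int = 5) -> list[Path]:
    """Keep only the first frame of each visual plateau."""
    kept: list[Path] = []
    prev_hash = None
    for path in frame_paths:
        frame_hash = imagehash.phash(Image.open(path))
        # Subtracting two hashes gives the Hamming distance between fingerprints;
        # a small distance means we are still on the same visual plateau.
        if prev_hash is None or frame_hash - prev_hash > max_distance:
            kept.append(path)
        prev_hash = frame_hash
    return kept
```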

An alternative technique computes pixel-level difference maps between consecutive frames. If the percentage of changed pixels falls below a threshold (commonly 1–3% of the total area), the frames are considered duplicates. This works especially well for screen recordings where most of the viewport remains static between interactions.
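
The pixel-diff variant is even simpler. The sketch below uses NumPy with a 2% changed-pixel threshold drawn from the range above; the grayscale tolerance of 10 is likewise an assumption:

```python
import numpy as np
from PIL import Image

def is_duplicate(frame_a: str, frame_b: str, changed_ratio: float = 0.02) -> bool:
    """Treat two frames as duplicates if under `changed_ratio` of pixels changed."""
    a = np.asarray(Image.open(frame_a).convert("L"), dtype=np.int16)
    b = np.asarray(Image.open(frame_b).convert("L"), dtype=np.int16)
    # Only count pixels whose grayscale value moved by more than a small tolerance,
    # so compression noise does not register as change.
    changed = np.abs(a - b) > 10
    return changed.mean() < changed_ratio
```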

Step 3: Scene Clustering and Step Boundaries

After deduplication, you're left with a timeline of distinct frames — but you still need to group them into logical steps. Think of a form-fill flow: typing in a field produces many micro-changes, but the meaningful boundaries are before the user starts typing, after the field is complete, and after the submit button is clicked.

Scene detection algorithms analyze the magnitude of visual change over time. A large spike in the difference signal — like a modal appearing, a page navigation, or a panel sliding in — marks a scene boundary. Smaller, continuous changes (scrolling, typing) get absorbed into the current scene. The output is an ordered list of "key frames," each representing a distinct step in the workflow.
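
One way to express this is to treat the per-frame change ratio as a one-dimensional signal and open a new scene only after a spike has settled into a stable plateau (the same temporal smoothing discussed under edge cases below). The thresholds here are illustrative, not tuned:

```python
def find_key_frames(
    change_signal: list[float],     # fraction of pixels changed frame-to-frame
    spike_threshold: float = 0.15,  # a jump this large hints at a scene boundary
    min_plateau: int = 3,           # frames the view must hold still after a spike
) -> list[int]:
    """Return indices of frames that start a new scene."""
    key_frames = [0]                # the first frame always opens a scene
    pending_spike = False
    quiet_run = 0
    for i, change in enumerate(change_signal, start=1):
        if change >= spike_threshold:
            pending_spike = True    # something big happened: modal, navigation, panel
            quiet_run = 0
        else:
            quiet_run += 1
            # Only accept the new scene once it has held still long enough, so
            # intermediate frames of a CSS transition never become steps.
            if pending_spike and quiet_run >= min_plateau:
                key_frames.append(i)
                pending_spike = False
    return key_frames
```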

Handling Edge Cases

  • Scrolling: Continuous scroll produces gradual pixel shifts. Velocity-based thresholds distinguish a meaningful scroll-to-section from a casual adjustment.
  • Animations and transitions: CSS animations create intermediate frames that aren't useful as steps. Temporal smoothing filters these out by requiring a minimum plateau duration before recognizing a new scene.
  • Hover states: Subtle hover effects can trigger false boundaries. Restricting change detection to regions outside the cursor footprint helps; a sketch of that masking follows this list.
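
For the hover case, assuming cursor coordinates are available (either logged by the capture tool or recovered by a separate cursor-detection pass), a masked diff is one way to keep the pointer out of the change signal:

```python
import numpy as np

def masked_change_ratio(
    frame_a: np.ndarray,
    frame_b: np.ndarray,
    cursor_xy: tuple[int, int],
    cursor_radius: int = 24,
) -> float:
    """Fraction of changed pixels, ignoring a small box around the cursor."""
    changed = np.abs(frame_a.astype(np.int16) - frame_b.astype(np.int16)) > 10
    height, width = changed.shape[:2]
    x, y = cursor_xy
    # Blank out the cursor footprint so the pointer and its hover highlight
    # never count toward the change signal on their own.
    y0, y1 = max(0, y - cursor_radius), min(height, y + cursor_radius)
    x0, x1 = max(0, x - cursor_radius), min(width, x + cursor_radius)
    changed[y0:y1, x0:x1] = False
    return float(changed.mean())
```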

Step 4: Hotspot Generation

Each key frame now represents a static snapshot, but an interactive demo needs click targets. Hotspot generation identifies the regions a viewer should click to advance to the next step.

The simplest approach diffs consecutive key frames and highlights the region of maximum change. If frame N shows a dropdown closed and frame N+1 shows it open, the bounding box of the changed region is a strong candidate for the hotspot. More sophisticated pipelines use object detection or OCR to identify buttons, links, and input fields — then match those elements against the transition to infer which one was clicked.

The result is a coordinate and size for each hotspot, anchored to the key frame's resolution. These get rendered as translucent overlays in the final demo viewer.
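
A bare-bones sketch of that diff step, again with NumPy, returns exactly that coordinate-and-size structure; real pipelines would layer OCR or element detection on top of it:

```python
import numpy as np
from PIL import Image

def hotspot_from_diff(key_frame_a: str, key_frame_b: str, tolerance: int = 10) -> dict | None:
    """Bounding box of the region that changed between two consecutive key frames."""
    a = np.asarray(Image.open(key_frame_a).convert("L"), dtype=np.int16)
    b = np.asarray(Image.open(key_frame_b).convert("L"), dtype=np.int16)
    ys, xs = np.nonzero(np.abs(a - b) > tolerance)
    if ys.size == 0:
        return None  # nothing changed, so there is no click target to infer
    # The tightest box around everything that changed is a strong hotspot candidate.
    return {
        "x": int(xs.min()),
        "y": int(ys.min()),
        "width": int(xs.max() - xs.min() + 1),
        "height": int(ys.max() - ys.min() + 1),
    }
```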

Step 5: AI Enrichment

Raw frames and hotspots form the skeleton, but a polished demo needs context: tooltip text explaining what the viewer should do, step titles for navigation, and sometimes branching logic. This is where language models add significant value.

By passing each key frame (and its surrounding context) through a vision-language model, you can auto-generate descriptions like "Click the Create Project button to start a new workspace." The model sees the UI, reads the text within it via OCR, and produces natural-language guidance. Teams can then review and refine these suggestions rather than writing every tooltip from scratch.
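
In code, that call can be as small as the sketch below. It assumes the OpenAI Python SDK purely for illustration; the model name and prompt are placeholders, and any vision-capable model could be swapped in:

```python
import base64

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def describe_step(key_frame_path: str, previous_step: str) -> str:
    """Ask a vision-language model for one sentence of viewer guidance for this step."""
    with open(key_frame_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any vision-capable model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "This is one step of a product walkthrough. "
                    f"The previous step was: {previous_step}. "
                    "In one sentence, tell the viewer what to click or do next."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```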

Why This Beats Manual Screenshot Stitching

The traditional approach — take 15 screenshots, upload them, draw rectangles, write copy — takes 30–60 minutes per demo and breaks whenever the product changes. A recording-based pipeline compresses creation to the time it takes to perform the workflow once on screen. Updates are equally fast: re-record, re-process, done.

The best demo is one that can be rebuilt in under a minute. If maintenance is painful, demos go stale — and stale demos erode trust faster than no demo at all.

Automation also enforces consistency. Every demo produced by the pipeline has uniform hotspot sizing, tooltip placement, and transition timing. Manual builds introduce drift — different team members make different aesthetic choices, and over time the library feels disjointed.

Wrapping Up

Turning a screen recording into a polished interactive demo is a pipeline problem: frame extraction, deduplication, scene clustering, hotspot generation, and AI-powered enrichment. Each stage reduces entropy and adds structure, transforming a flat video into something a prospect can actually explore. The result is faster creation, easier maintenance, and demos that feel alive rather than static.

Ready to create your first interactive demo?

Upload a product recording and get a shareable, interactive demo in minutes. No design work required.

Get started free