Video · 2026-03-10 · 8 min read

How we built the Country Heart music video with generated music, audio annotation, reviewed reference frames, and audio-synced LTX 2.3 clips

This piece was built as a structured pipeline rather than a single giant render. The song was generated first, the finished audio was annotated against the lyrics, the timing JSON was split into clean segment windows, and every segment received its own reviewed reference-frame pass before the final audio-synced video jobs were rendered.

Final master: 1920x1088
Segment count: 33 clips
Review pass: 4 ref images each
Motion cadence: 25 fps
[Preview art: Country Heart music-video pipeline]

Final video

The finished cut below is assembled from individually generated segment renders that all reference the same master song. Each approved clip was normalized, selected, and stitched back into a single edit.

The pipeline stayed modular

Music came first, so every downstream decision inherited one stable audio source.
Annotation JSON converted the song into usable structure instead of relying on manual timing guesses.
Reference-image approval separated visual art direction from motion rendering, which improved the odds of strong clip generation.
Audio-synced LTX 2.3 jobs could focus on one task per segment: animate a known frame with a known lyric window and a known mood.

Five stages from song to finished master

Step 01

Generate the song

The project started as a full generated country-tech track. We developed the lyrics and music together, then exported the final edit that would become the timing source for the entire video.

Step 02

Annotate the audio

We paired the finished song with the transcript and ran audio annotation to produce phrase-level timing JSON. That timing map gave us clean anchors for every lyrical section instead of guessing where lines began and ended.
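The phrase-level timing map can be pictured as a list of entries, each pairing a sung line with its start and end time. A minimal sketch of reading such a map, assuming a hypothetical JSON layout (the field names and sample lyrics here are illustrative, not the actual annotation schema or song text):

```python
import json

# Hypothetical phrase-level timing JSON (illustrative schema and sample lines,
# not the real annotation output for the track).
annotation = json.loads("""
[
  {"phrase": "sample lyric line one", "start": 12.40, "end": 15.10},
  {"phrase": "sample lyric line two", "start": 15.30, "end": 17.80}
]
""")

# Each entry gives a clean anchor: where a sung phrase begins, ends,
# and how long it runs.
for entry in annotation:
    duration = entry["end"] - entry["start"]
    print(f'{entry["start"]:7.2f}s  {duration:5.2f}s  {entry["phrase"]}')
```

With anchors like these, segment boundaries can be chosen against measured times instead of eyeballed waveforms.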

Step 03

Plan segments

The annotation data was snapped into practical video windows, with start and stop points chosen to preserve complete lyric phrases and give each section enough runway for believable motion.
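One way to sketch that snapping step is a greedy merge: walk the annotated phrases in order and group consecutive ones into a window until the next phrase would push the window past a duration ceiling, so no lyric phrase is ever split. The schema, function name, and 8-second ceiling below are all illustrative assumptions, not the production planner:

```python
def plan_segments(phrases, max_len=8.0):
    """Greedily merge consecutive phrase timings into segment windows
    no longer than max_len seconds, never splitting a phrase."""
    segments = []
    current = None
    for p in phrases:
        if current is None:
            current = {"start": p["start"], "end": p["end"], "phrases": [p["phrase"]]}
        elif p["end"] - current["start"] <= max_len:
            # Phrase still fits: extend the current window.
            current["end"] = p["end"]
            current["phrases"].append(p["phrase"])
        else:
            # Window full: close it and start a new one at this phrase.
            segments.append(current)
            current = {"start": p["start"], "end": p["end"], "phrases": [p["phrase"]]}
    if current is not None:
        segments.append(current)
    return segments

# Illustrative phrase timings (not the real song data).
phrases = [
    {"phrase": "line one",   "start": 0.0, "end": 3.0},
    {"phrase": "line two",   "start": 3.2, "end": 6.5},
    {"phrase": "line three", "start": 6.8, "end": 10.4},
]
print(plan_segments(phrases))
```

Because windows always close on phrase boundaries, every rendered clip inherits a complete lyric section, which is the property the step above is after.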

Step 04

Create reference frames

Every segment got a batch of high-resolution reference images designed for its exact lyrical or instrumental moment. The visual ideas were intentionally varied so the full piece would feel like a directed video, not one scene repeated over and over.

Step 05

Render audio-synced clips

The approved reference frames became start frames for audio-conditioned LTX 2.3 image-to-video jobs. Each clip was rendered as a self-contained performance shot and then assembled into the final master with the original song laid back over the edit.

The song established the whole editorial spine

The audio was not a finishing touch. It was the organizing structure. Once the track existed as a complete edited piece, every other asset in the workflow could key off one stable rhythm, one vocal line, and one fixed lyrical order.

Prompt-driven song generation established the melody, arrangement, and lyric framing.
The final audio edit became the single source of truth for sync timing across the whole build.
The transcript was used as a guide for annotation so the JSON aligned with the sung phrases, not just raw beats.

Timing was planned from annotation, not instinct

Timing note 1

Annotation JSON defined the phrase boundaries.

Timing note 2

Start and stop points were selected to encompass full lyric sections cleanly.

Timing note 3

Clip lengths were kept inside the LTX frame budget while still leaving room for motion to begin immediately.
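At the stated 25 fps cadence, the frame budget converts directly between seconds and frames. A quick arithmetic sketch (the helper name is hypothetical, and the 6.5 s example window is illustrative, not a documented LTX limit):

```python
import math

FPS = 25  # the motion cadence used for this build

def seconds_to_frames(seconds, fps=FPS):
    """Round a lyric-window duration up to whole frames."""
    return math.ceil(seconds * fps)

# e.g. a 6.5 s lyric window needs 163 frames at 25 fps
print(seconds_to_frames(6.5))  # -> 163
```

Checking each planned window this way keeps every clip inside the frame budget before any render job is submitted.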

Reference-image generation made the clip pass sharper

Before any LTX 2.3 clip was rendered, each segment went through a reference-image stage. That kept the final clip generation grounded in approved visuals instead of hoping a single text prompt would solve identity, composition, mood, and motion all at once.

Visual pass 1

Lyric-heavy sections leaned on readable faces, stable mouth visibility, and direct performance framing.

Visual pass 2

Non-lyrical bridges were treated as visual contrast moments with more atmosphere, mood, and scene variation.

Visual pass 3

Reference-image review happened before clip generation, so the animation jobs inherited cleaner framing and stronger identity continuity.

Approved frames became LTX 2.3 start frames

Once the reference-image pass was approved, each selected frame became the anchor for an audio-synced image-to-video render. That let the final motion jobs start from a strong composition while still following the phrasing and breath timing of the original track.
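That hand-off can be sketched as assembling a per-segment job spec that ties together the approved start frame, the master audio, and the segment's timing window. The function, file paths, and manifest keys below are hypothetical stand-ins, not a real LTX API payload:

```python
def build_job(segment_id, start_frame_path, audio_path, start_s, end_s, fps=25):
    # Hypothetical job manifest for an audio-conditioned image-to-video render;
    # the keys are illustrative, not the actual render-service schema.
    return {
        "segment": segment_id,
        "start_frame": start_frame_path,  # approved reference image
        "audio": audio_path,              # master song, windowed by start/end
        "window_s": [start_s, end_s],
        "num_frames": round((end_s - start_s) * fps),
        "fps": fps,
    }

# Illustrative segment: a 6.4 s lyric window starting at 41.2 s into the song.
job = build_job(7, "refs/seg07_approved.png", "audio/master.wav", 41.2, 47.6)
print(job["num_frames"])  # -> 160
```

One manifest per segment keeps each render job self-contained: one known frame, one known lyric window, one known duration.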

Reviewed start frames · Reference audio · LTX 2.3

The result is a video that reads as one continuous musical piece even though it was built from many separately rendered shots.

The stack in one sentence

Generate the song, annotate it with the lyrics, carve that timing into segment windows, generate and review purpose-built reference images for every section, then render audio-synced LTX 2.3 clips from those approved start frames and assemble them back into a final master.