Turn words into video: getting started with text-to-video

No footage, no camera skills, no editing — but you want a short clip. That is exactly what text-to-video is for: describe a shot in words and AI conjures a moving image from nothing.

Set expectations first: text-to-video produces only a few seconds per run, which suits B-roll, mood shots and intro flourishes, not a full short film in one go. Treat it as an on-demand footage library: the more specific your words, the closer the result comes to the frame in your head.

When to use it

When you need a B-roll shot, a moving intro, or want to turn a line of copy into a visual — but have no footage on hand.

How to do it

Open Jimeng / Kling / Hailuo and pick text-to-video
Write the shot in one line: subject + motion + setting + camera
Generate a few takes and keep the one with the most natural motion
Tweak a word or two and re-run; don’t pile several changes into one request
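
The steps above can be sketched in a few lines of Python, for readers who like to keep their prompts in a script. This is only an illustration: build_prompt and one_change_variants are hypothetical helper names, not part of Jimeng, Kling or Hailuo, and the resulting text is still pasted into the tool by hand. The point it shows is the last step: each variant differs from the base shot in exactly one part, so you can tell which change helped.

```python
def build_prompt(shot):
    """Join the four parts (subject, motion, setting, camera) into one line."""
    return "Generate a video: {subject}, {motion}, {setting}, {camera}.".format(**shot)


def one_change_variants(base, alternatives):
    """Yield (changed_part, prompt) pairs that differ from base in one part only."""
    for part, options in alternatives.items():
        for option in options:
            variant = dict(base)      # copy, so the base shot stays untouched
            variant[part] = option    # change exactly one part
            yield part, build_prompt(variant)


# Hypothetical base shot, reusing the city example from this article.
base = {
    "subject": "a city skyline at dusk",
    "motion": "distant traffic lights drifting slowly",
    "setting": "glass towers reflecting an orange sunset",
    "camera": "camera tilting up from low",
}
alternatives = {"camera": ["static camera", "camera slowly pushing in"]}

for part, prompt in one_change_variants(base, alternatives):
    print(part, "->", prompt)
```

Run each printed line as its own take and compare the results side by side.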

Weak vs strong

❌ How most people write it

Generate a video of a city.

✅ Do this instead

Generate a video: a city skyline at dusk, glass towers reflecting an orange sunset, distant traffic lights drifting slowly, camera tilting up from low, calm and soothing mood.

The first prompt is too vague, so the AI has to guess; the second names the subject, light, motion and camera, so the result is far more reliable.

Copy-paste prompt

Generate a video: [subject, e.g. “an orange cat”] is [action, e.g. “slowly grooming on a windowsill”], in [setting, e.g. “a sunlit afternoon room”], camera [slowly pushing in / static / gentle orbit], realistic and softly lit.
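
If you reuse this template often, a small script can fill the slots for you. The sketch below is a hypothetical helper (the names TEMPLATE and fill_template are made up for illustration); it only builds the text and refuses to run when a slot is left empty, and it does not talk to any video tool.

```python
TEMPLATE = (
    "Generate a video: {subject} is {action}, in {setting}, "
    "camera {camera}, realistic and softly lit."
)


def fill_template(subject, action, setting, camera="static"):
    """Fill the four slots of the copy-paste prompt; reject empty slots."""
    parts = {
        "subject": subject,
        "action": action,
        "setting": setting,
        "camera": camera,
    }
    missing = [name for name, value in parts.items() if not value.strip()]
    if missing:
        raise ValueError("empty slot(s): " + ", ".join(missing))
    return TEMPLATE.format(**parts)


print(
    fill_template(
        subject="an orange cat",
        action="slowly grooming on a windowsill",
        setting="a sunlit afternoon room",
        camera="slowly pushing in",
    )
)
```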

Worked examples

Example · A line of copy into a B-roll shot

Generate a video: a hot coffee on a wooden table, white steam curling up, an open book beside it, a rainy morning outside the window, static camera with slight depth of field, warm and cozy.

You get: a few seconds of cozy B-roll, ready as a video intro or copy backdrop, no shooting needed.

Level up

Build a film: generate several B-roll shots and stitch them in an editor with captions and music
Try across tools: feed the same description to Jimeng / Kling / Hailuo and keep the best output
Image first: generate a still you like, then use image-to-video to animate it — more controllable

Common mistakes

Cramming many actions into one line: clips are only seconds long, so keep to one simple action or the motion warps
Expecting a finished film: it outputs clips, not full videos; long pieces need stitching
Asking for big on-screen text: AI-rendered words are often garbled; add captions in post

FAQ

The result looks distorted or the motion is weird — what now?

Generate several takes and pick the best, or simplify the description and ask for gentler motion. Text-to-video is inherently random; simple, stable shots succeed more often.

How long a video can it make?

Usually just a few seconds per run. For longer content, generate in segments and stitch them rather than expecting one long take.
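
Any video editor can do the stitching. For readers comfortable with the command line, one possible route is ffmpeg's concat demuxer, sketched below. The assumptions: ffmpeg is installed, the clip file names are placeholders that contain no single quotes, and all clips share the same codec and resolution (stream copy with -c copy does not re-encode, so mismatched clips, for example from different tools, may need re-encoding instead). The helper names concat_list and ffmpeg_command are hypothetical, and the script only prints the list file and the command rather than running anything.

```python
def concat_list(clip_paths):
    """Return the text of a list file for ffmpeg's concat demuxer."""
    return "".join(f"file '{path}'\n" for path in clip_paths)


def ffmpeg_command(list_file, output):
    """Return the ffmpeg arguments that join the listed clips without re-encoding."""
    return [
        "ffmpeg",
        "-f", "concat",   # read the input as a concat list file
        "-safe", "0",     # allow paths outside the list file's folder
        "-i", list_file,
        "-c", "copy",     # copy streams as-is; clips must share a format
        output,
    ]


# Placeholder file names for three generated segments.
clips = ["shot1.mp4", "shot2.mp4", "shot3.mp4"]
print(concat_list(clips), end="")
print(" ".join(ffmpeg_command("clips.txt", "stitched.mp4")))
```

Save the printed list as clips.txt next to the clips, then run the printed command in that folder.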

Pro tip: Keep “subject + action + setting + camera” as a fixed four-part checklist and fill it in each time for steady results.
