Text-to-Video
Built on an All-in-One product framework, the Kling 3.0 model series supports full multimodal input and output spanning text, images, audio, and video, bringing video understanding, generation, and editing together in one streamlined AI workflow. The models integrate multiple tasks, including text-to-video, image-to-video, reference-to-video, and in-video editing, into a single native multimodal architecture, which enables them to follow complex narrative logic, deliver precise shot control, and maintain strong prompt adherence.
1,219
Feb 2026
video
$0.168/s
Catalog
Category rows come directly from Artificial Analysis when the endpoint exposes category-level Elo scores.
| Category | Elo | 95% CI | Appearances |
|---|---|---|---|
| Fantasy | 1,332 | -32/32 | 464 |
| Action | 1,327 | -25/25 | 761 |
| Screens | 1,293 | -33/33 | 393 |
| Text | 1,268 | -29/29 | 573 |
| Cartoon and anime | 1,267 | -24/24 | 727 |
| People | 1,265 | -11/11 | 3,347 |
| Sports | 1,265 | -25/25 | 731 |
| Indoor | 1,260 | -18/18 | 1,452 |
| Fashion | 1,253 | -28/28 | 631 |
| 3D animation | 1,249 | -43/43 | 209 |
| Buildings | 1,239 | -14/14 | 2,304 |
| Long prompt | 1,238 | -23/23 | 929 |