Text-to-Video
Seedance 2.0 is built on a unified architecture for joint audio-video generation and accepts four input modalities: text, image, audio, and video. Compared with version 1.5, it delivers a substantial improvement in generation quality, with a higher usability rate in complex interaction and motion scenes and significant improvements in physical accuracy, visual realism, and controllability. This makes it well suited to high-quality creation scenarios.
1,273
Mar 2026
video
$7/1M
Catalog
Category rows come directly from Artificial Analysis when the endpoint exposes category-level Elo scores.
| Category | Elo | 95% CI | Appearances |
|---|---|---|---|
| Fantasy | 1,398 | -30/30 | 739 |
| Multi-scene | 1,371 | -31/31 | 758 |
| Action | 1,367 | -23/23 | 1,211 |
| Specific location or era | 1,333 | -21/21 | 1,480 |
| Long prompt | 1,331 | -22/22 | 1,363 |
| Cartoon and anime | 1,328 | -23/23 | 1,112 |
| Sports | 1,327 | -23/23 | 1,093 |
| People | 1,312 | -11/11 | 5,152 |
| Transport | 1,312 | -15/15 | 2,812 |
| Text | 1,310 | -27/27 | 874 |
| Buildings | 1,310 | -13/13 | 3,643 |
| Fashion | 1,308 | -26/26 | 982 |