by Soontaek Lim

The world of AI content creation is changing fast. One fascinating area, turning still images into video, is taking off now that companies like Midjourney offer image-to-video (I2V) tools. This article explains how advanced models like Stable Diffusion 3.5, ControlNet, and WAN 2.1 VACE can be used to animate static images.

The Rise of Image-to-Video Generation

Generating video from still images isn’t just a niche idea anymore; it’s a growing trend. Midjourney, known for its high-quality image generation, has expanded into motion, letting users animate their creations. This makes video creation more accessible, allowing more people to turn their visual ideas into moving stories. These tools are versatile, animating everything from realistic portraits to landscapes and architectural designs.

How It Works: A Technical Workflow

At its core, this process combines advanced diffusion models with specialized video generation architectures. Our exploration, detailed in a confidential “Ultra Tendency” project from January 8, 2025, outlines a method that uses Stable Diffusion 3.5 for initial image creation, enhanced by the precise control of ControlNet. The generated images are then passed to WAN 2.1 VACE for the crucial video creation step.

Guiding the AI: The Power of ControlNet

ControlNet is a key innovation in AI image generation. It acts as a sophisticated guide, letting users influence the output with structured inputs. This goes beyond simple text prompts, offering detailed control over the image’s composition, pose, and depth. The main types of ControlNet inputs include:

  • Canny Edge Maps: These maps preserve the outlines and structure of a reference image. By converting an image into a black-and-white sketch of its edges, ControlNet ensures the AI follows the original layout and composition. This is useful when you want to keep an image’s structure but change its style or colors.
  • Depth Maps: These maps describe a scene’s 3D structure, showing how far objects are from the camera. White pixels represent nearby objects, and black pixels represent distant ones. Depth maps improve realism by guiding perspective, lighting, shadows, and scaling, leading to more immersive visuals.
  • OpenPose: This technique focuses on human poses. It detects key body points (such as the head, shoulders, and elbows) and creates a skeleton-like pose map. That map then guides the AI to accurately replicate specific body positions and movements, ensuring consistency in character animation or action sequences.

The typical process involves:

  1. Loading a pre-trained diffusion model and encoding a text prompt using CLIP.
  2. Loading a ControlNet model trained on Canny edges.
  3. Upscaling the input image and processing it with a Canny edge detector.
  4. Using the extracted edges, along with the text prompt, to guide generation via ControlNet.
  5. Finally, using latent sampling (such as KSampler) and VAE decoding to produce the final image.

Example Workflow Snippet (Conceptual):
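The project’s actual workflow is built from ComfyUI nodes around Stable Diffusion 3.5; as a rough stand-in, the sketch below walks through the same numbered steps with the Hugging Face diffusers library, using an SD 1.5 base model and a public Canny ControlNet checkpoint. The model IDs, prompt, and thresholds are illustrative assumptions, not the project’s configuration.

    import cv2
    import numpy as np
    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # Step 3: turn the reference image into a Canny edge map.
    reference = np.array(Image.open("reference.png").convert("RGB"))
    edges = cv2.Canny(reference, 100, 200)
    canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 3-channel edge map

    # Steps 1-2: load a Canny-trained ControlNet and a diffusion base model.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # Steps 4-5: the prompt is encoded with CLIP inside the pipeline, the edge
    # map guides composition, and the sampler plus VAE decode yield the image.
    result = pipe(
        "a watercolor repainting of the same scene",
        image=canny_image,
        controlnet_conditioning_scale=1.0,
        num_inference_steps=30,
    ).images[0]
    result.save("controlled_output.png")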

You can also combine these ControlNet inputs. For example, using both Canny edges and a depth map provides a more complete structural and spatial guide to the AI, allowing for highly controlled image generation. You can also adjust the strength of each ControlNet input for finer tuning.
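For example, the diffusers MultiControlNet path accepts a list of ControlNets with one conditioning image and one strength per input. A minimal sketch, assuming the Canny and depth maps have already been prepared (and with illustrative model IDs and strengths):

    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # Conditioning inputs prepared beforehand (e.g. cv2.Canny for edges and a
    # monocular depth estimator for the depth map); assumed to exist on disk.
    canny_image = Image.open("canny_map.png")
    depth_image = Image.open("depth_map.png")

    controlnets = [
        ControlNetModel.from_pretrained(
            "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
        ),
        ControlNetModel.from_pretrained(
            "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
        ),
    ]
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnets,
        torch_dtype=torch.float16,
    ).to("cuda")

    # One conditioning image and one strength per ControlNet: the edges
    # dominate the layout, while the depth map adds a weaker spatial hint.
    result = pipe(
        "a snow-covered version of the same scene",
        image=[canny_image, depth_image],
        controlnet_conditioning_scale=[0.8, 0.5],
        num_inference_steps=30,
    ).images[0]
    result.save("multi_controlnet_output.png")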

Bringing Images to Life: The Role of WAN 2.1 VACE

Once you have high-quality, structurally guided images, the next step is to turn them into video. This is where WAN 2.1 VACE (All-in-One Video Creation and Editing) comes in. VACE is a versatile model designed for various video tasks, including:

  • I2V (Image-to-Video): The main function is to convert static images into dynamic video sequences.
  • R2V (Reference-to-Video): Generating videos based on reference inputs, which could include style references or motion capture data.
  • V2V (Video Editing): Modifying existing video content.
  • MV2V (Masked Video Editing): Performing selective edits on videos using masks for precise control.

VACE’s power comes from its composability. Users can integrate different tasks into a single pipeline, creating flexible and robust video generation workflows. This is especially useful with pre-built templates in platforms like ComfyUI, which often include ready-to-use VACE workflows for video generation.

A typical VACE setup, as shown in a ComfyUI diagram (dated 01/08/2025), involves loading diffusion models, LoRAs, CLIP, and VAE components. Users configure text prompts (both positive and negative) and sampling parameters. Reference images can also be included to guide the output. This modular approach allows for extensive customization by adjusting models, prompts, and other parameters.
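The workflow described above lives in ComfyUI rather than Python, but as a hedged sketch of the same stage in code, here is roughly what the Wan 2.1 image-to-video integration in diffusers looks like. The class name, model ID, and parameters below are assumptions about that integration rather than the project’s ComfyUI graph, and the VACE-specific reference and masking inputs are omitted.

    import torch
    from diffusers import WanImageToVideoPipeline
    from diffusers.utils import export_to_video, load_image

    # Assumed Hugging Face model ID for the Wan 2.1 I2V (14B) checkpoint.
    pipe = WanImageToVideoPipeline.from_pretrained(
        "Wan-AI/Wan2.1-I2V-14B-720P-Diffusers", torch_dtype=torch.bfloat16
    ).to("cuda")

    # The guided still image from the ControlNet stage seeds the video;
    # positive and negative prompts steer motion and quality.
    image = load_image("controlled_output.png")
    frames = pipe(
        image=image,
        prompt="gentle camera pan, drifting clouds, natural motion",
        negative_prompt="blurry, distorted, static, watermark",
        height=720,
        width=1280,
        num_frames=81,        # roughly 5 seconds at 16 fps
        guidance_scale=5.0,
    ).frames[0]
    export_to_video(frames, "animated.mp4", fps=16)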

Technical Specification Example:

  • Model: WAN 2.1 VACE (14B)
  • Output: 720p, 5-second videos
  • VRAM Usage: ~37GB (on a Colab A100 GPU)

The results are impressive, with the AI generating coherent video sequences that maintain the structural essence of the input images while introducing dynamic motion.

The Future of AI-Generated Video

Advancements in image-to-video generation, powered by models like WAN 2.1 and guided by techniques like ControlNet, open up exciting possibilities. While the current focus is on creating compelling visuals for marketing and creative projects, the potential applications extend further. Imagine using these tools to:

  • Enhance Project Documentation: Visualize project progress, architectural designs, or complex processes with dynamic video narratives instead of static diagrams.
  • Create Engaging Tutorials: Illustrate step-by-step instructions with animated visuals, making them more accessible and impactful.
  • Improve Customer Demos: Present solutions and their impact in a visually captivating video format, leaving a stronger impression on clients.

Integrating voice cloning, flagged as a direction for future research, would amplify this potential further, allowing for rich, narrated video content that effectively communicates the value and impact of our work. While challenges remain in applying these technologies to precise data visualizations or text-heavy diagrams, the rapid progress suggests that creative and informative content will increasingly be built around this kind of AI-driven visual storytelling.

This exploration into image-to-video generation highlights a significant leap in AI capabilities. By understanding and using these powerful tools, we can unlock new avenues for creativity and communication, bringing our ideas to life in ways never before possible.
