Table of Contents
- Introduction: A seismic shift in video editing for businesses and creators
- Why this matters right now
- What is Ditto? The essentials
- ComfyUI: the interface that makes the magic practical
- System requirements and practical hardware considerations
- Installation overview: get Ditto running with ComfyUI
- Understanding the workflow: what each section does
- Key runtime parameters explained
- Examples that demonstrate Ditto’s capabilities
- Performance tips and best practices
- Limitations and important caveats
- Legal and ethical considerations for Canadian organisations
- How Canadian industries can capitalise on Ditto
- Operationalizing Ditto: a practical rollout plan for IT leaders
- Troubleshooting common errors and practical fixes
- Real examples to test yourself
- What to expect as the ecosystem matures
- Comparing cloud vs offline workflows
- Conclusion: Where Ditto fits in your tech stack
- Call to action
- FAQ
- Final thoughts
Introduction: A seismic shift in video editing for businesses and creators
Imagine editing a finished video as simply as writing a sentence. Change a character’s outfit, insert a vintage street lamp in the background, convert anime into photorealistic footage, or shift the entire style to Pixar or Rick and Morty — all by typing a prompt. That capability is no longer fantasy. A new open source AI system called Ditto, built on the WAN video synthesis foundation, makes text-driven video editing accessible, fast, and capable of running offline.
This guide walks through what Ditto is, how it integrates with ComfyUI, how to install and run it locally, step-by-step usage advice, real-world examples, limitations you need to know, and what this means for Canadian businesses — from Toronto advertising agencies and VFX houses in the GTA to startups across the Prairies. I wrote this as a hands-on, nuts-and-bolts walkthrough while also mapping out strategic implications so IT leaders and creative directors can make an informed decision about adoption.
Why this matters right now
We are at an inflection point where generative AI is moving from textual and still-image creativity into fully editable video. For enterprises, the practical benefits are immediate:
- Lower production costs: rapid iteration on visual assets without full reshoots.
- Offline capability: run high-quality edits without exposing footage to cloud vendors — essential for privacy-sensitive industries and regulated sectors.
- Creative agility: marketing teams can test multiple visual styles or scenarios quickly.
- Democratization of film production: smaller studios and Canadian indie filmmakers can produce high-value effects without expensive pipelines.
For Canadian organizations, the offline nature aligns with privacy and data residency concerns under PIPEDA and emerging provincial rules. Running models locally reduces third-party data processing risk and gives IT departments control over sensitive assets.
What is Ditto? The essentials
Ditto is a text-driven video editing model released by Ant Group. It builds on WAN, a strong open source video model, and extends its capabilities to permit edits through textual prompts. Ditto comes in several flavours optimized for different use cases:
- Ditto global — general-purpose edits across scene, character, or lighting; it is the broadest model.
- Ditto style — focuses specifically on style transformations like cartoon, Pixar, origami, or other stylistic domains.
- Ditto sim2real — converts animated footage into a realistic visual domain; especially interesting for turning anime into live-action-style renders.
Ditto is available as downloadable models, and when paired with supporting components such as a VAE and a text encoder, it can run entirely on a local workstation using ComfyUI.
ComfyUI: the interface that makes the magic practical
ComfyUI is the go-to open source interface for running image and video diffusion workflows locally. It exposes nodes, workflows, and a manager for community custom nodes. For enterprises and power users, ComfyUI lets you avoid writing low-level glue code — instead you build or import workflows visually, change nodes, and run experiments from a GUI. That visual approach is why many hobbyist and professional users prefer it over raw script-based alternatives.
In the Ditto workflow, ComfyUI handles:
- Model selection (selecting your Ditto model, text encoder, VAE, and optional LoRA)
- Uploading input footage and setting frame counts, fps, width and height
- Routing the video through WAN VAE and WAN video sampler to produce the final edited frames
- Plugging in acceleration LoRAs like CosVid to speed up inference
System requirements and practical hardware considerations
Before you begin, know the hardware reality. The official Ditto workflow expects at least 11 gigabytes of VRAM. That is the minimum; in practice, the workflow runs far more comfortably on GPUs with 16 GB or more. I ran tests on a laptop GPU with 16 GB VRAM and found the tool practical — but there are trade-offs.
Key hardware and software considerations:
- GPU VRAM: Aim for 16 GB to comfortably run workflows with acceleration LoRAs and reasonable step counts. 11 GB may work for trimmed settings but expect longer runtimes and potential memory issues on larger projects.
- Disk storage: Models are large. Expect multiple gigabytes per model — some text encoders are over 11 GB alone. Plan for tens to hundreds of gigabytes for a multi-model setup.
- Offline operation: All components can be downloaded once and then used offline; this is excellent for security-conscious teams.
- Quantized binaries and GGUF: Expect optimized quantized versions to arrive. These will reduce VRAM requirements and expand deployment to modest GPUs.
Installation overview: get Ditto running with ComfyUI
Below is a step-by-step outline to install and run Ditto locally with ComfyUI. This assumes basic familiarity with installing ComfyUI. Adapt paths and exact filenames to your environment.
Step 1. Prepare ComfyUI and ComfyUI Manager
- Open ComfyUI and go to the Custom Nodes area.
- Open the Custom Nodes command prompt (CMD) to clone repositories and install custom nodes. Use ComfyUI Manager to handle node dependencies.
- If you do not have ComfyUI Manager installed, clone the repository into the Custom Nodes folder using the provided git clone command in the Ditto repo instructions.
ComfyUI Manager simplifies installing and updating community nodes and ensures the environment has everything Ditto needs.
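If you prefer to script this step, the minimal sketch below clones the Manager into the custom_nodes folder. The install path is an assumption (a default home-directory install), so adjust COMFYUI_ROOT to your environment; it is equivalent to running git clone from the Custom Nodes command prompt.

```python
# Minimal setup sketch: clone ComfyUI-Manager into custom_nodes.
# COMFYUI_ROOT is an assumption -- point it at your actual install.
import subprocess
from pathlib import Path

COMFYUI_ROOT = Path.home() / "ComfyUI"
CUSTOM_NODES = COMFYUI_ROOT / "custom_nodes"

target = CUSTOM_NODES / "ComfyUI-Manager"
if not target.exists():
    subprocess.run(
        ["git", "clone", "https://github.com/ltdrdata/ComfyUI-Manager.git", str(target)],
        check=True,  # raise if the clone fails
    )
    print(f"Cloned ComfyUI-Manager into {target}")
else:
    print("ComfyUI-Manager already present")
```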
Step 2. Download the models and supporting files
Ditto requires several large files. Download the official model files to the correct ComfyUI model folders:
- Ditto diffusion model files (three files around 6.1 GB each). Place them into ComfyUI/models/diffusion_models.
- WAN text encoder (this can be more than 11 GB). Place it into ComfyUI/models/text_encoders.
- WAN VAE file. Create a subfolder named WAN (or WAN_VAE, depending on the recommended naming) inside ComfyUI/models/VAE and place the VAE file there so ComfyUI finds it.
- CosVid LoRA (optional but highly recommended). This is roughly 200 MB; place it inside ComfyUI/models/loras.
Downloading all files will require significant bandwidth and disk space. Keep an eye on where files land and maintain the folder structure ComfyUI expects.
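Because misplaced files are the most common setup failure, a quick pre-flight check is worthwhile. The sketch below verifies the folder layout described above; the paths mirror this guide's instructions (note that stock installs use a lowercase vae folder), and it only counts files, so the exact filenames are up to you.

```python
# Pre-flight check: confirm each model folder exists and contains files.
# Paths follow the layout described above; adjust COMFYUI_ROOT to your install.
from pathlib import Path

COMFYUI_ROOT = Path.home() / "ComfyUI"
MODELS = COMFYUI_ROOT / "models"

expected_dirs = [
    MODELS / "diffusion_models",  # three Ditto files, ~6.1 GB each
    MODELS / "text_encoders",     # WAN text encoder, 11+ GB
    MODELS / "vae" / "WAN",       # WAN VAE inside its subfolder
    MODELS / "loras",             # optional CosVid acceleration LoRA
]

for d in expected_dirs:
    files = [p for p in d.iterdir() if p.is_file()] if d.is_dir() else []
    status = f"{len(files)} file(s)" if files else "MISSING or empty"
    print(f"{d}: {status}")
```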
Step 3. Download workflow.json and import into ComfyUI
The Ditto GitHub includes a ready-made workflow.json. Download this workflow and then drag and drop it into ComfyUI. ComfyUI will populate the node graph automatically. If any nodes are flagged missing in red, go to Manager and click Install Missing Custom Nodes.
Updating ComfyUI before running for the first time is a good practice — it ensures compatibility with community nodes and features the workflow may rely on.
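Before importing, you can also inspect the workflow file to see which node types it expects. This sketch assumes ComfyUI's UI workflow format, where each entry in the "nodes" array carries a "type" field; any type not provided by core ComfyUI or an installed custom node is what renders red in the graph.

```python
# Inspect workflow.json: list node types so missing custom nodes are no surprise.
import json
from collections import Counter

with open("workflow.json", encoding="utf-8") as f:
    workflow = json.load(f)

node_types = Counter(node["type"] for node in workflow.get("nodes", []))
print(f"{sum(node_types.values())} nodes, {len(node_types)} distinct types:")
for node_type, count in sorted(node_types.items()):
    print(f"  {count}x {node_type}")
```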
Understanding the workflow: what each section does
The ComfyUI workflow may look intimidating initially. Here is a breakdown of the essential nodes and their roles so you can confidently modify parameters and troubleshoot.
- Model selection nodes: Choose your Ditto model (global, style, or sim2real), the WAN text-video model, the VAE, and the text encoder from dropdown menus in the workflow.
- Prompt nodes: Enter your positive prompt (what you want) and negative prompt (what to avoid). The negative prompt is particularly helpful to prevent unwanted artifacts.
- Input video node: Upload the footage you want to edit. You also set frames per second and the output clip length here. The workflow truncates or restricts frames to the length you specify.
- Resize and aspect nodes: Set final resolution and choose the resizing strategy: crop, stretch, or preserve with padding. For vertical or nonstandard aspect ratios, update width and height explicitly.
- WAN VAE and WAN video sampler: WAN VAE converts video frames into latent space and back, while the WAN video sampler synthesizes the new frames according to the model and prompt.
- Scheduler and sampling parameters: Choose the sampling algorithm (UniPC is a sensible default), step count, CFG, and seed.
Key runtime parameters explained
When you press run, a handful of parameters determine the trade-off between fidelity, speed, and creative control (two illustrative presets follow the list):
- Steps: Number of diffusion steps. WAN generally performs well in the 20 to 30 step range for high quality, but with CosVid LoRA you can reduce steps to 4 for speed at the cost of some fidelity.
- CFG: Classifier-Free Guidance scale. Higher values make the model follow your prompt more literally. Lower values allow greater randomness and creative variation.
- Seed: Determines deterministic output. A fixed seed with identical settings produces the same output; randomizing seed gives varied outputs.
- Scheduler: The sampler algorithm used in diffusion. UniPC is a sensible default but experiment if you need different dynamics.
- Width/Height/FPS/Frames: Controls output resolution, aspect ratio, and final clip duration. If your original clip length is longer than the specified output frames, it will be truncated.
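To make these trade-offs concrete, here are two illustrative presets: a draft configuration for fast iteration and a final-render configuration. The key names are descriptive stand-ins rather than the workflow's exact widget names, and the resolutions are common WAN working sizes, not requirements.

```python
# Illustrative parameter presets; names and values are stand-ins, not the
# workflow's exact widget labels.
DRAFT = {                 # fast iteration with the CosVid LoRA enabled
    "steps": 4,           # acceleration LoRA permits very low step counts
    "cfg": 1.0,           # distilled acceleration LoRAs typically run at low CFG
    "sampler": "uni_pc",  # UniPC, the sensible default noted above
    "seed": 42,           # fixed seed -> reproducible comparisons
    "width": 832, "height": 480,
    "fps": 16,
    "frames": 16 * 5,     # 5-second clip: frames = fps * seconds
}

FINAL = {                 # higher-fidelity pass with the LoRA disabled
    **DRAFT,
    "steps": 25,          # WAN generally does well at 20-30 steps
    "cfg": 5.0,           # stronger prompt adherence for the final render
    "width": 1280, "height": 720,
}
```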
Examples that demonstrate Ditto’s capabilities
Here are practical, concrete examples to illustrate what Ditto can do — and where it still needs improvement.
Full character redesign
Use case: You have footage of a presenter and want to test different wardrobe and character concepts without reshooting.
Prompt example: Turn this into a dark mystical sorceress with glowing green eyes.
Result summary: The system can change clothing, hair, and overall character appearance convincingly. However, facial expressions and micro-movements are not always preserved perfectly. If maintaining exact speech-driven expressions matters, consider using a complementary tool like WanAnimate to transfer expressions explicitly.
Micro-editing items and microdetails
Use case: Change a shirt from blue to white, make a bird iridescent, or recolour a cat.
Prompt examples: Turn her top white. Turn the bird’s brown plumage to iridescent blue and green. Make this cat black.
Result summary: When using Ditto global, small edits occasionally trigger larger scene shifts. The current global model sometimes changes background or lighting unintentionally. The developers have indicated local editing models are planned to address micro-edits more precisely, but they were not released at the time of this writing.
Inserting objects
Use case: Add a vintage street lamp behind a subject, or a faint aurora in the sky.
Prompt examples: Add a vintage street lamp in the background. Add a faint aurora overhead.
Result summary: Ditto performs well at inserting objects; results look cohesive in many examples. Object insertion works best when the prompt includes spatial context and object descriptors. Lighting consistency is usually acceptable but can vary depending on source footage complexity.
Style transfer and creative remixes
Use case: Transform a live action scene into origami-style animation, Pixar-like 3D rendering, or a Rick and Morty aesthetic.
Prompt examples: Turn into 3D Pixar style. Turn into origami style. Recreate in Rick and Morty style.
Result summary: Style transfers are among Ditto’s strengths when using the style model. Stylized outputs are visually compelling and appropriate for creative marketing and concept exploration.
Sim2real — turning anime into realistic footage
Use case: Studios and VFX houses that want to prototype live-action interpretations of animated scenes.
Prompt example: Turn it into the real domain.
Result summary: The sim2real Ditto model is one of the most impressive features. In tests, it converts Studio Ghibli-style footage into realistic renders that are visually coherent. This opens intriguing possibilities for previsualization: anime directors could quickly experiment with realistic look development, or content owners can prototype live-action adaptations without full-scale production.
Performance tips and best practices
To get the best results and iterate efficiently, apply these practices:
- Start small: Use short clips (a few seconds) for experimentation. Tune your prompt, CFG, and step count before scaling to longer sequences; a sweep harness for this is sketched after this list.
- Use CosVid LoRA for speed: CosVid LoRA accelerates generation dramatically, allowing you to drop steps to as low as 4. Expect some fidelity loss, so only use it for rapid iteration and then re-render final shots without LoRA or with higher steps.
- Lock the seed for reproducibility: If you want to generate consistent renders across team members, set an explicit seed.
- Use the negative prompt: Actively include elements to avoid. Negative prompts reduce hallucinations like extraneous objects or visual artifacts.
- Watch aspect ratio and resize strategy: If you upload vertical footage but render a horizontal output, choose crop or stretch intentionally. Unexpected distortions usually come from automatic resizing defaults.
- Split complex edits into passes: If you want to change both character clothing and background, consider doing separate passes or masking workflows once local editing models arrive. This reduces unwanted changes in unrelated areas.
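The sweep harness below makes the start-small advice systematic: it enumerates a few CFG and step combinations against one short clip with the seed locked, so any visual differences come from the settings rather than the noise. How each job is actually submitted is left abstract here; see the headless API sketch later in this guide.

```python
# Hypothetical sweep harness: build a grid of runs with a locked seed.
from itertools import product

SEED = 42                   # locked so runs are directly comparable
cfg_values = [3.0, 5.0, 7.0]
step_values = [4, 12, 25]

jobs = [
    {
        "seed": SEED,
        "cfg": cfg,
        "steps": steps,
        "clip": "test_clip_3s.mp4",  # placeholder: a short clip keeps iteration cheap
        "prompt": "turn her top white",
        "negative": "background changes, lighting changes",
    }
    for cfg, steps in product(cfg_values, step_values)
]

for job in jobs:
    print(job)  # replace with your actual submission call
```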
Limitations and important caveats
Ditto is powerful but not omnipotent. Here are the constraints you need to be aware of when planning production or adopting it in a commercial workflow.
- Facial expressions and lip sync: The current models may not faithfully preserve nuanced facial expressions or micro movements. WanAnimate is better suited for transferring expressions if you need precise emotion or speech alignment.
- Micro-editing precision: The global model can result in broad scene changes when you intend a localized edit. Local editing models were planned but not yet released at the time of this writing.
- Quality vs speed trade-off: CosVid LoRA lets you reduce steps dramatically for speed, but expect artifacting and color shifts. For final renders, use higher step counts.
- Model bias and artifacts: Like all generative models, Ditto can create unexpected artifacts, hallucinated objects, or style drift depending on prompts and input footage complexity.
- Large models and resource constraints: Model files are large and require planning for storage and GPU VRAM. Quantized files and GGUFs may reduce these costs in the future.
Legal and ethical considerations for Canadian organisations
Adopting text-driven video editing in a commercial context raises important legal and ethical questions, especially for regulated industries and large brands. Canadian businesses need to consider:
- Privacy compliance: Ensure use of personal data in footage complies with PIPEDA and provincial privacy laws. Offline processing helps, but policies and consent obligations remain.
- Deepfake risks: The ability to modify people’s appearance or fabricate realistic footage raises reputational and legal risks. Companies should define governance processes and provenance metadata for edited media.
- Copyright and IP ownership: When turning anime or branded IP into realistic footage, ensure you have rights to modify or adapt the source content.
- Disclosure and transparency: For marketing and advertising, consider disclosing that footage has been AI generated if the edits could mislead audiences.
- Employee training and policies: Create internal controls and training so creative teams use the tools responsibly and understand regulatory exposure.
How Canadian industries can capitalise on Ditto
Ditto unlocks applied value across many sectors in Canada. Here are the top practical applications and strategic actions for leaders.
Advertising and marketing teams
Marketing teams can iterate ad concepts faster, test multiple visual approaches on the same footage, and localize campaigns for regional audiences without costly reshoots. Agencies in Toronto and Vancouver can prototype dozens of creative variants on the day a shoot wraps, accelerating time-to-campaign.
Film and VFX houses
Small VFX studios and post houses can prototype live-action looks for animated IP, speed up look development, and generate crowd or background variations quickly. For Canadian independent filmmakers, the barrier to high-quality visuals drops dramatically. Ditto is an especially compelling tool for previsualization and concept exploration.
Education and training
Universities and colleges teaching media production can adopt Ditto to teach contemporary workflows. Because it runs offline, labs can run workshops without cloud costs and with better control over student data.
Media and newsrooms
News organizations can use Ditto cautiously for creative storytelling or graphics but must be careful about authenticity. Transparency policies and provenance tracking must be enforced to avoid eroding trust.
Retail and e-commerce
Retailers can generate product-focused short clips showing variations of garments or staged scenes without photographing every prototype — speeding catalogue production and A/B testing of marketing variants.
Operationalizing Ditto: a practical rollout plan for IT leaders
Here’s a pragmatic plan to evaluate and integrate Ditto into a studio or enterprise environment while managing risk.
- Proof of concept: Run pilot projects on non-sensitive footage to understand quality, runtime, and failure modes. Compare outputs with current manual VFX workflows.
- Hardware assessment: Audit available GPUs. Start with a 16 GB workstation for small teams. Plan for centralized GPU servers for larger workloads, and consider GPU virtualization options for team access.
- Governance and policy: Draft policies for usage, approvals, and disclosure for AI-edited media. Ensure legal reviews for rights and privacy compliance.
- Training and enablement: Create playbooks for creative teams on prompt design, seeds, negative prompts, and best practices to avoid artifacts and ensure repeatable quality.
- Scaling and automation: When ready, automate workflows for batch processing of multiple clips and integrate with existing DAM or post pipelines. Use ComfyUI Manager and headless ComfyUI instances for scheduled jobs, as sketched below.
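For that scaling step, ComfyUI can be driven headlessly over its local HTTP API. The sketch below queues a job against an instance running on the default port; note that it assumes a workflow saved in ComfyUI's API export format, which differs from the drag-and-drop workflow.json, and the filename ditto_api.json is a placeholder.

```python
# Queue a workflow on a headless ComfyUI instance via its HTTP API.
# Assumes ComfyUI is running locally on the default port and that
# ditto_api.json was exported in ComfyUI's API format.
import json
import urllib.request

COMFYUI_URL = "http://127.0.0.1:8188"

def queue_workflow(api_graph: dict) -> str:
    """POST the graph to /prompt and return the queued prompt id."""
    payload = json.dumps({"prompt": api_graph}).encode("utf-8")
    req = urllib.request.Request(
        f"{COMFYUI_URL}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["prompt_id"]

with open("ditto_api.json", encoding="utf-8") as f:
    graph = json.load(f)

print("queued:", queue_workflow(graph))
```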
Troubleshooting common errors and practical fixes
When working with large models and community workflows, you will encounter issues. Here are common problems and remedies:
- Missing custom nodes: If nodes are outlined in red when loading the workflow, open ComfyUI Manager and click Install Missing Custom Nodes. Update ComfyUI prior to installing nodes to avoid API mismatches.
- Out of memory errors: Lower resolution, reduce batch sizes, or use quantized models. If possible, swap out LoRAs that are memory heavy. Consider running smaller step counts for experimentation; a quick VRAM check is sketched after this list.
- Unexpected scene changes: Add negative prompts to prevent background shifts. If you need micro-edits, wait for local editing models, or explore masking strategies in separate passes.
- Quality degradation with CosVid LoRA: Use CosVid for quick iteration, but render final assets with the LoRA disabled or with higher steps for higher fidelity.
- Slow inference: Experiment with step count, scheduler, and enabling acceleration LoRAs. Ensure GPU drivers and CUDA versions are current.
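For the out-of-memory case, it helps to know how much VRAM is actually free before a run. The sketch below uses PyTorch, which is already present in ComfyUI's environment, to report headroom against the roughly 11 GB minimum discussed earlier; it assumes a CUDA GPU.

```python
# Quick VRAM diagnostic before launching a large job (CUDA GPUs).
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes
    gib = 1024 ** 3
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {free / gib:.1f} GiB free of {total / gib:.1f} GiB")
    if free / gib < 11:
        print("Below the ~11 GB Ditto minimum: lower resolution or free up the GPU.")
else:
    print("No CUDA device visible: check drivers and CUDA installation.")
```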
Real examples to test yourself
Here are a few test prompts and scenarios to try while evaluating Ditto. Use short 2-5 second clips for these experiments to save time while tuning; a matching test matrix follows the list.
- Turn her top white and keep all other scene elements unchanged. Use negative prompts to avoid background modifications.
- Insert a vintage street lamp in the background with soft evening lighting. Specify location relative to the subject in the prompt.
- Convert a short Studio Ghibli style clip into the real domain using Ditto sim2real. Observe facial realism and background texture fidelity.
- Stylize a live-action clip into a 3D Pixar-like world and compare different CFG and step settings to find the best balance.
- Change animal fur color precisely (cat to black) and measure whether details like whiskers and fur patterns remain intact.
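To run these scenarios repeatably, capture them as a small test matrix and feed it to whatever submission helper you use, such as the queue_workflow sketch shown earlier. Every filename and the optional model field below are illustrative placeholders.

```python
# Evaluation scenarios from the list above as a reusable test matrix.
# All clip names and model identifiers are placeholders.
TEST_CASES = [
    {"clip": "presenter_3s.mp4",
     "prompt": "turn her top white",
     "negative": "background changes, lighting changes",
     "check": "rest of the scene unchanged"},
    {"clip": "street_3s.mp4",
     "prompt": "add a vintage street lamp in the background, "
               "behind and to the left of the subject, soft evening lighting",
     "check": "lighting consistency with the source footage"},
    {"clip": "ghibli_3s.mp4",
     "prompt": "turn it into the real domain",
     "model": "ditto_sim2real",
     "check": "facial realism and background texture fidelity"},
    {"clip": "liveaction_3s.mp4",
     "prompt": "turn into 3D Pixar style",
     "model": "ditto_style",
     "check": "compare CFG 3/5/7 and steps 4/25 for the best balance"},
    {"clip": "cat_3s.mp4",
     "prompt": "make this cat black",
     "check": "whiskers and fur patterns remain intact"},
]
```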
What to expect as the ecosystem matures
Expect rapid evolution. The open source community moves quickly on performance optimizations (quantizations into GGUF), smaller footprint variants, and better local editing models. This will mean:
- Lower VRAM versions for wider accessibility across Canadian SMEs.
- Improved micro-editing precision as local editing models get released.
- More user-friendly GUIs and presets targeted at creative teams.
- Proliferation of specialized LoRAs to accelerate domain-specific outputs like cinematic lighting, corporate headshots, or product rendering.
Comparing cloud vs offline workflows
There are clear trade-offs between cloud-hosted generative services and running Ditto locally:
- Cloud services: Offer scalability and simplified setup. Many cloud providers provide low-latency GPU instances and preinstalled stacks. But they raise concerns about data residency, security, and long-term costs.
- Local offline: Offers control, lower marginal cost for repeated runs, and better privacy posture. However, it requires an initial investment in hardware, storage, and local DevOps skills.
For Canadian enterprises that handle sensitive customer footage or have legal compliance and provenance needs, offline local workflows will often be the prudent choice. For rapid scaling or burst rendering needs, hybrid models can work: perform sensitive work locally and use cloud bursts for large batch jobs with appropriate contractual protections.
Conclusion: Where Ditto fits in your tech stack
Ditto is a watershed moment in the practical adoption of generative video AI. It turns many formerly expensive, time-consuming video tasks into prompt-driven operations that anyone with a capable workstation can run. For Canadian businesses, the implications are strategic: faster creative cycles, cost savings, new creative possibilities, and an opportunity to centralize governance over sensitive assets by running entirely offline.
Adoption is not without challenges: current limits on expression transfer, precision micro-edits, and large model sizes mean Ditto is an augmentation to existing pipelines rather than a wholesale replacement. But the trajectory is clear: as quantized models and local-editing variants land, the value proposition will only increase.
If you lead creative operations, post-production, or IT strategy, now is the time to pilot the technology. Run a few short experiments, establish policies, and map out a roadmap for integration. The creative and operational upside is significant. The technology will continue to improve; firms that move early will gain a practical advantage in agility and visual innovation.
Call to action
Is your organization ready to experiment with text-driven video editing? Start with a brief proof-of-concept, protect sensitive content with an offline workflow, and involve legal and creative stakeholders early. Try a few quick tests, then convene a review to determine the next steps. Share your experiences and lessons learned — the Canadian tech community benefits when practitioners exchange findings.
Text-driven video editing will reshape how media is created. For Canadian companies, this is an opportunity to gain creative and operational advantages while keeping control over data and IP.
FAQ
What is Ditto and how does it differ from other video AI models?
Ditto is an open source text-driven video editing model built on WAN. It allows you to edit videos using textual prompts, offering three main variants: global (general edits), style (style-focused transformations), and sim2real (converts animated footage into realistic renders). Compared to earlier video diffusion models, Ditto focuses on direct, prompt-based editing and integrates into workflows such as ComfyUI for local, offline operation.
Do I need a powerful GPU to run Ditto locally?
Yes. The official workflow suggests at least 11 GB of VRAM as a minimum, with 16 GB recommended for comfortable experimentation. Expect to need significant disk space for model files. Future quantized versions and GGUF builds may reduce VRAM requirements, broadening accessibility.
Can Ditto perform precise micro-edits like swapping clothing without altering the rest of the scene?
Not reliably with the current global model. The global model sometimes causes broader scene changes even when the prompt targets a localized change. Developers plan to release local editing models that will enable more precise micro-edits. For now, you can mitigate unwanted changes with careful negative prompts, separate passes, or masking strategies.
How fast is Ditto and what settings affect runtime?
Runtime depends on GPU power, step count, and whether you use acceleration LoRAs like CosVid. With CosVid LoRA and reduced steps, generation can be relatively fast (examples show minutes for a few-second clip on a 16 GB GPU). Higher step counts and higher resolutions increase runtime. Use the CosVid LoRA for rapid iteration and re-render higher-fidelity outputs without LoRA for final assets.
Will using Ditto violate privacy laws or cause compliance problems in Canada?
Using Ditto itself is not inherently illegal, but processing personal data requires compliance with PIPEDA and any applicable provincial privacy laws. Running models offline gives you stronger control over data residency and reduces third-party processing risk. Ensure you have consent for footage involving individuals and implement governance for deepfake and IP use cases.
Can Ditto replace traditional VFX pipelines?
Not entirely in its current form. Ditto is excellent for rapid iteration, previsualization, and creative exploration. However, traditional VFX pipelines remain important for precise compositing, high-resolution final frames, and control over expression and motion fidelity. Over time, Ditto and similar tools will become integrated components of modern VFX workflows rather than outright replacements.
How do I get started with ComfyUI and Ditto?
Install ComfyUI and ComfyUI Manager. Download the Ditto model files, WAN text encoder, and WAN VAE into the appropriate ComfyUI model folders. Import the provided workflow.json into ComfyUI by dragging and dropping it into the interface. Update ComfyUI and install any missing custom nodes via ComfyUI Manager. Configure the model dropdowns, upload an input clip, set the prompt and parameters, and run. Expect to experiment with step counts, CFG, and seed settings to find the right balance for your use case.
Are there enterprise-level uses for Ditto in Canada?
Absolutely. Advertising agencies, boutique post houses, e-commerce teams, education institutions, and media organizations can use Ditto for fast content iteration, look development, product variations, and training workflows. The offline capability is particularly attractive for regulated industries and companies concerned about IP and data residency.
When will micro-editing and better expression transfer be available?
Community and developer activity is fast-moving, and local editing models were planned for release after the global releases. Check the Ditto repository and follow the WAN and community repositories for updates. For expression transfer today, WanAnimate is a complementary tool that can help preserve facial expressions more accurately.
How can teams reduce the risk of misuse of this technology?
Implement governance policies, role-based access controls for editing workflows, and mandatory provenance metadata that labels edited content. Train teams on ethical use, require legal reviews for content that could mislead audiences, and maintain clear disclosure policies for any consumer-facing materials that include AI-generated edits.
Final thoughts
Ditto is a major step toward democratizing high-quality video editing with generative AI. For Canadian businesses, the combination of offline capability, strong visual results, and open source accessibility means this technology can be adopted pragmatically, responsibly, and with a tangible business case. Start with a controlled pilot, involve legal and creative leaders early, and build governance that enables innovation without exposing your brand or clients to unnecessary risk.
Is your team experimenting with text-driven video editing? What business problems would you solve first with this toolset? Share strategies, successes, and problems you encounter — the Canadian tech community grows stronger when practitioners collaborate.