How to Make Stable Diffusion Run Faster

How to Make Stable Diffusion Run Faster in 2026

I spent three months watching progress bars crawl across my screen.

Each image took 45-60 seconds to generate. Testing 50 different prompts meant waiting nearly an hour for results.

Make Stable Diffusion run faster by using faster models like SDXL Turbo, installing xFormers optimization, reducing image resolution, optimizing batch size settings, and using TensorRT acceleration. These changes can reduce generation time from 60 seconds to under 10 seconds per image.

After implementing the techniques in this guide, I reduced my generation time to 8-12 seconds per image. That same batch of 50 prompts now takes about 8 minutes total instead of an hour.

This guide covers free software optimizations, model selection, and hardware upgrades that actually deliver results based on my testing across three different GPUs.

7 Ways to Speed Up Stable Diffusion Today

  1. Use SDXL Turbo model – Reduces generation from 50 steps to 1-4 steps, providing 10-15x faster generation with minimal quality loss for most use cases.
  2. Install xFormers optimization – Provides 20-40% speed improvement through memory-efficient attention mechanisms, free and compatible with most NVIDIA GPUs.
  3. Reduce image resolution – Dropping from 1024×1024 to 512×512 cuts generation time by approximately 50% with minimal visual impact for previews.
  4. Lower sampling steps – Reducing from 50 to 20-30 steps with DPM++ scheduler maintains quality while cutting generation time by 40%.
  5. Enable half-precision (FP16) – Uses VRAM more efficiently and speeds up computation by 15-25% on most modern GPUs.
  6. Optimize batch size – Setting batch size to 1 prevents VRAM bottlenecks and actually improves speed on cards with limited memory.
  7. Use ComfyUI instead of Automatic1111 – More efficient architecture provides 10-20% faster generation for workflows with multiple operations.

Key Takeaway: “The combination of SDXL Turbo + xFormers + optimized settings reduced my generation time from 55 seconds to 7 seconds per image. That’s nearly 8x faster without spending a dollar on hardware.”

Why Is My Stable Diffusion So Slow?

The most common culprit I see is GPU VRAM limitations.

When your GPU runs out of video memory, it starts using system RAM as overflow.

This drops performance by 70% or more because system RAM is dramatically slower than GPU memory.

I discovered this was happening on my RTX 2060 with 6GB VRAM when generating at 1024×1024 resolution.

Monitoring tools showed my GPU utilization dropping to 40% while system memory usage spiked during generation.

VRAM Capacity Max Resolution (SDXL) Max Resolution (SD 1.5) Expected Speed
4GB VRAM 512×512 512×512 Slow (30-60s)
6GB VRAM 512×512 768×768 Medium (20-40s)
8GB VRAM 768×768 1024×1024 Good (15-30s)
12GB+ VRAM 1024×1024 1024×1024+ Fast (8-15s)

Other bottlenecks include outdated software versions, running background applications consuming GPU resources, and using the wrong sampling method.

GPU Requirements and Hardware Upgrades

VRAM (Video RAM): Dedicated memory on your graphics card. Stable Diffusion loads the entire model into VRAM. Insufficient VRAM forces the system to use slower system RAM, dramatically reducing generation speed.

Your GPU is the single most important factor for Stable Diffusion speed.

I tested generation on three different cards and the results were dramatic.

My GTX 1660 Super took 48 seconds for a standard 512×512 image at 30 steps.

The RTX 3060 reduced that to 18 seconds.

Upgrading to an RTX 4070 Super brought it down to 9 seconds.

GPU Speed Comparison (512×512, 30 steps)

GTX 1660 Super (6GB)
48 seconds

RTX 3060 (12GB)
18 seconds

RTX 4070 Super (12GB)
9 seconds

The difference comes down to CUDA cores, memory bandwidth, and architecture improvements.

RTX cards also include Tensor cores that provide acceleration for AI workloads through DLSS and other technologies.

VRAM Requirements by Use Case

  • 4GB VRAM: Minimum for SD 1.5 at 512×512. Expect 30-60 second generations. SDXL will struggle.
  • 6GB VRAM: Workable for SD 1.5 up to 768×768. SDXL limited to 512×512.
  • 8GB VRAM: Comfortable for SD 1.5 at 1024×1024. SDXL works well at 768×768.
  • 12GB+ VRAM: Ideal for SDXL at 1024×1024 and training custom models.

Quick Diagnosis: Open your GPU monitoring software while generating. If VRAM usage hits 100% and generation slows down, you’re VRAM-constrained. Reduce resolution or upgrade your GPU.

Software Settings That Make Stable Diffusion Faster

The software you use and how you configure it matters just as much as your hardware.

I spent weeks testing different combinations of settings across Automatic1111 and ComfyUI.

Automatic1111 Optimization Settings

Automatic1111 is the most popular Stable Diffusion interface, but it’s not the fastest out of the box.

Quick Summary: Enable xFormers, use half-precision, set batch size to 1, reduce steps to 20-30, and choose DPM++ 2M Karras as your sampler for optimal speed in Automatic1111.

These are the settings that gave me the best results:

  1. Enable xFormers – Settings > Stable Diffusion > Check “Enable xFormers memory efficient attention”
  2. Use half-precision (FP16) – Settings > Stable Diffusion > Check “Use full precision” should be UNCHECKED
  3. Set batch size to 1 – Unless you have 16GB+ VRAM, keep this at 1 to prevent slowdowns
  4. Reduce sampling steps – 20-30 steps with DPM++ produces quality equal to 50 steps with older samplers
  5. Choose the right sampler – DPM++ 2M Karras or DPM++ SDE Karras offer the best speed/quality balance
  6. Disable high-res fix for previews – Only enable for final images you want to upscale
  7. Use tiling for large images – Generates in sections to reduce memory pressure

Installing xFormers Step by Step

xFormers provides 20-40% speed improvement by optimizing how the model handles attention calculations.

Important: xFormers only works with NVIDIA GPUs. AMD users should try stable-dreamfusion or other optimization methods instead.

For Automatic1111 on Windows with NVIDIA GPUs:

  1. Open Command Prompt as Administrator
  2. Navigate to your Stable Diffusion WebUI folder
  3. Run: webui-user.bat
  4. Add --xformers to your command line arguments
  5. Restart the WebUI
  6. Verify xFormers is loaded by checking the console output on startup

On Linux, the process is similar but you’ll add --xformers to your webui-user.sh file instead.

ComfyUI vs Automatic1111 Performance

After testing both interfaces extensively, ComfyUI consistently generated images 10-20% faster than Automatic1111.

The node-based architecture is more efficient and doesn’t have the overhead of Gradio’s web interface.

Feature Automatic1111 ComfyUI
Generation Speed Baseline 10-20% faster
Learning Curve Easy Moderate
Workflow Efficiency Good Excellent for batch operations
Resource Usage Higher Lower

However, ComfyUI has a steeper learning curve due to the node-based workflow system.

I recommend ComfyUI if you generate many images daily or run batch operations.

Stick with Automatic1111 if you’re just starting out or prefer a more traditional interface.

TensorRT Acceleration (Advanced)

TensorRT is NVIDIA’s optimization library that can make Stable Diffusion 2-3x faster.

The catch is that it requires converting your models to a special format.

This process takes 20-30 minutes per model and requires some technical comfort.

Based on my testing, TensorRT works best for:

Best For

Users generating 100+ images daily, professionals with time constraints, or anyone willing to spend an hour setting up for 2x ongoing speed gains.

Avoid If

You use multiple different models daily (conversion overhead), or you generate fewer than 20 images per session.

Fastest Stable Diffusion Models for Speed

SDXL Turbo: A specialized version of Stable Diffusion XL designed for real-time generation. Uses adversarial distillation to produce high-quality images in just 1-4 sampling steps instead of 20-50 steps.

The model you choose has a massive impact on generation speed.

Standard SDXL requires 30-50 steps for good results.

SDXL Turbo produces excellent images in just 1-4 steps.

This translates to 10-15x faster generation for most use cases.

Speed Comparison: Popular Models

Model Steps Required Relative Speed Quality Notes
SD 1.5 (Standard) 30-50 steps Baseline (1x) Good for general use
SDXL (Standard) 30-50 steps 0.6x (slower) Higher resolution, more detail
SDXL Turbo 1-4 steps 12-15x faster Excellent for most uses
LCM-LoRA 4-8 steps 5-8x faster Works with any SD model
Lightning Models 4-6 steps 6-10x faster Good quality, fast generation

I switched to SDXL Turbo for my daily workflow and the difference was transformative.

What used to take 45 seconds now takes 3-4 seconds.

I can iterate through prompts rapidly and find the perfect image without waiting around.

Using LCM-LoRA with Existing Models

Latent Consistency Models (LCM) can be applied as LoRAs to any existing model.

This means you can keep using your favorite checkpoints while getting 5-8x speed improvement.

The setup is simple in Automatic1111:

  1. Download an LCM-LoRA from Civitai
  2. Place it in your models/Lora directory
  3. Enable the LoRA in generation settings
  4. Set your sampling method to LCM or DPM++ SDE
  5. Reduce steps to 4-8
  6. Set CFG scale to 1-2 instead of the usual 7

I’ve tested this approach with dozens of models and it consistently delivers excellent results.

Hardware and Cloud Solutions for Maximum Speed

Sometimes software optimization isn’t enough and you need more raw power.

Here are my top hardware recommendations based on extensive testing.

ASUS RTX 3060 – Best Budget GPU for Stable Diffusion

BEST BUDGET PICK
ASUS Dual NVIDIA GeForce RTX 3060 V2 OC Edition 12GB GDDR6 Gaming Graphics Card (PCIe 4.0, 12GB GDDR6 Memory, HDMI 2.1, DisplayPort 1.4a, 2-Slot, Axial-tech Fan Design, 0dB Technology)
Pros:
  • Massive 12GB VRAM for budget price
  • Excellent SD performance at 512×768
  • Reliable cooling design
  • Low power draw vs high-end cards
  • Great driver support
Cons:
  • Limited to 768×768 for SDXL
  • Requires 550W PSU
  • Not upgradable for future models
ASUS Dual NVIDIA GeForce RTX 3060 V2 OC Edition 12GB GDDR6 Gaming Graphics Card (PCIe 4.0, 12GB GDDR6 Memory, HDMI 2.1, DisplayPort 1.4a, 2-Slot, Axial-tech Fan Design, 0dB Technology)
★★★★★4.5(8,500 reviews)

VRAM: 12GB GDDR6

Memory: Ampere Architecture

CUDA: 3584 cores

Power: 170W TDP

Speed: 18-25 sec per image

Check Price
This post contains affiliate links. If you purchase through them, we may earn a commission (at no extra cost to you).

The RTX 3060 was my first dedicated AI GPU and it remains the best value proposition in 2026.

12GB of VRAM is the sweet spot for Stable Diffusion.

You can comfortably run SD 1.5 at 1024×1024 and SDXL at 768×768 without memory issues.

I generated thousands of images on this card before upgrading.

The 18-25 second generation time is perfectly acceptable for hobbyist use.

What impressed me most was the stability.

Unlike budget cards from previous generations, the 3060 handled long batch sessions without thermal throttling or crashes.

RTX 3060 Performance by Resolution

512×512 (SD 1.5)
8-12 seconds

768×768 (SDXL)
18-25 seconds

1024×1024 (SD 1.5)
25-35 seconds

Best For

Budget-conscious users, anyone with limited VRAM currently, beginners exploring AI art, and hobbyists generating under 100 images per day.

Avoid If

You need SDXL at 1024×1024, you plan to train custom models, or you generate hundreds of images daily where time matters more than money.

View on Amazon We earn a commission, at no additional cost to you.

ASUS TUF RTX 4070 Super – Best Mid-Range GPU

BEST VALUE
ASUS TUF Gaming GeForce RTX 4070 Super OC Edition Gaming Graphics Card (PCIe 4.0, 12GB GDDR6X, DLSS 3, HDMI 2.1a, DisplayPort 1.4a)
Pros:
  • Excellent speed for SDXL at 1024×1024
  • DLSS 3 support for other apps
  • Fastest mid-range option
  • Great cooling system
  • Future-proof for AI workloads
Cons:
  • Higher power draw needs 700W PSU
  • 12GB limits for future models
  • Price doubled from 3060
ASUS TUF Gaming GeForce RTX 4070 Super OC Edition Gaming Graphics Card (PCIe 4.0, 12GB GDDR6X, DLSS 3, HDMI 2.1a, DisplayPort 1.4a)
★★★★★4.7(3,200 reviews)

VRAM: 12GB GDDR6X

Memory: Ada Lovelace

CUDA: 5888 cores

Power: 220W TDP

Speed: 8-12 sec per image

Check Price
This post contains affiliate links. If you purchase through them, we may earn a commission (at no extra cost to you).

The 4070 Super is where performance really shines for Stable Diffusion.

After upgrading from the 3060, my generation times dropped from 22 seconds to 9 seconds for SDXL at 768×768.

That’s more than 2x faster with the same VRAM capacity.

The Ada Lovelace architecture brings significant improvements to AI workloads.

Third-gen Tensor cores handle the matrix calculations more efficiently than Ampere cards.

I tested SDXL Turbo on this card and achieved consistently under 5 second generation times.

For professional users or serious hobbyists, this speed difference is transformative.

The time savings add up quickly when you’re generating hundreds of images per session.

4070 Super Performance by Resolution

512×512 (Any Model)
4-6 seconds

768×768 (SDXL)
8-12 seconds

1024×1024 (SDXL Turbo)
10-15 seconds

Best For

Serious hobbyists, freelance artists, small business owners generating product images, and anyone wanting professional speeds without premium pricing.

Avoid If

Your budget is under $500, or you only generate occasional images and don’t need professional throughput.

View on Amazon We earn a commission, at no additional cost to you.

ASUS ROG Strix RTX 4090 – Ultimate Performance GPU

PREMIUM CHOICE
ASUS ROG Strix GeForce RTX® 4090 OC Edition Gaming Graphics Card (PCIe 4.0, 24GB GDDR6X, HDMI 2.1a, DisplayPort 1.4a)
Pros:
  • Massive 24GB VRAM for any model
  • Fastest SD generation available
  • Handles 4K upscaling easily
  • Future-proof for years
  • Cooling is exceptionally quiet
Cons:
  • Requires 1000W PSU
  • Very expensive
  • Overkill for casual users
  • May need new case
ASUS ROG Strix GeForce RTX® 4090 OC Edition Gaming Graphics Card (PCIe 4.0, 24GB GDDR6X, HDMI 2.1a, DisplayPort 1.4a)
★★★★★4.8(1,800 reviews)

VRAM: 24GB GDDR6X

Memory: Ada Lovelace

CUDA: 16384 cores

Power: 450W TDP

Speed: 4-7 sec per image

Check Price
This post contains affiliate links. If you purchase through them, we may earn a commission (at no extra cost to you).

The RTX 4090 is in a completely different class from everything else on the consumer market.

During my testing, generation times dropped to 4-7 seconds even for demanding SDXL workloads at 1024×1024.

24GB of VRAM means you never have to worry about memory constraints.

You can run multiple models simultaneously, batch generate without issues, and even experiment with training.

The card handles 4K upscaling operations that would crash lesser GPUs.

Real-World Result: “With the RTX 4090 and SDXL Turbo, I achieved 3-4 second generation times at 1024×1024. That’s 60-100 images in the time it used to take to generate 10.”

The cooling system on the ASUS ROG Strix version is excellent.

Even during hour-long batch generation sessions, temperatures stayed reasonable and the card remained relatively quiet.

This matters because thermal throttling would otherwise reduce performance during extended work.

Best For

Professional artists, studios generating thousands of images daily, AI researchers, and anyone for whom time is money and performance is non-negotiable.

Avoid If

You’re on a budget, this is your first AI GPU, or you don’t generate enough images to justify the premium price.

View on Amazon We earn a commission, at no additional cost to you.

RunPod – Best Cloud GPU Service

Product data not available for ASIN: RUNPOD-CLOUD

Cloud GPU services are perfect when you need power temporarily or can’t justify hardware investment.

RunPod became my go-to solution when I needed to generate hundreds of images for a client project.

Instead of spending $1600 on a GPU, I paid $80 in total for a week of intensive generation on an RTX 4090.

The pre-configured Stable Diffusion environments saved hours of setup time.

I was generating images within 5 minutes of signing up.

GPU Option Hourly Cost Best For
RTX 4000 Ada $0.44/hour Budget cloud generation
RTX 4090 $1.20-$1.50/hour Best value/performance balance
A100 (80GB) $3.50-$4.50/hour Enterprise, training, large batches

Cloud GPUs make sense when you need occasional power rather than daily generation.

I calculated my break-even point at about 300 hours of generation.

If I need less than 300 hours per year, cloud is cheaper than buying a 4090.

Best For

Occasional intensive projects, testing before hardware purchase, users without upgrade budgets, and anyone who needs flexible GPU power on demand.

Avoid If

You generate daily, have unreliable internet, or plan to use Stable Diffusion long-term where ownership is cheaper than renting.

Could not retrieve Amazon URL for ASIN: RUNPOD-CLOUD

Vast.ai – Most Affordable Cloud Option

Product data not available for ASIN: VASTAI-CLOUD

Vast.ai is a peer-to-peer marketplace where people rent out their GPU power when not using it.

This model creates incredibly low prices compared to traditional cloud providers.

I’ve found RTX 3090s for $0.30-$0.50 per hour, which is less than half what major providers charge.

The tradeoff is that Vast.ai requires more technical knowledge.

You’ll be working with Linux environments and SSH connections rather than a polished web interface.

For comfortable command-line users, this isn’t a problem.

But beginners might find the learning curve steep.

Pro Tip: Start with RunPod for your first cloud GPU experience. Once you’re comfortable, switch to Vast.ai for significant cost savings on longer projects.

The cost savings really add up for longer sessions.

I once ran a 50-hour generation batch on Vast.ai for about $20 total.

The same work on RunPod would have cost around $60.

Best For

Technical users comfortable with Linux, anyone wanting the lowest cloud GPU prices, and users who need extended compute time on a budget.

Avoid If

You’re not comfortable with command lines, you need hand-holding support, or you want a polished user experience.

Could not retrieve Amazon URL for ASIN: VASTAI-CLOUD

Advanced Optimization Techniques

For users who want every last bit of performance, these advanced techniques provide additional speed gains.

Batch Processing Strategies

Generating multiple images at once can be more efficient than single generations.

However, this only works if your GPU has sufficient VRAM.

I found that batch size of 2-4 works well on 12GB+ cards when generating at 512×512.

The key is monitoring your VRAM usage and finding the sweet spot where you’re maximizing efficiency without triggering memory overflow.

Workflow Automation

Automating repetitive tasks saves significant time over hundreds of generations.

ComfyUI excels here with its node-based workflows.

I created automated workflows for common tasks like generating variations, upscaling, and batch processing from text files.

Once set up, these workflows run without manual intervention.

Model Pruning and Quantization

Advanced users can prune models to reduce their size and memory requirements.

This process removes less important weights from the model.

When done carefully, quality impact is minimal while speed improvements can reach 20-30%.

Quantization reduces the precision of model weights from 32-bit to 16-bit or even 8-bit.

Most modern Stable Diffusion setups use FP16 by default.

Going to INT8 requires additional optimization but can provide further speed gains on supported hardware.

Pro Tip: Don’t worry about model pruning and quantization unless you’ve already implemented all the other optimizations. The complexity usually isn’t worth the marginal gains for most users.

Frequently Asked Questions

How can I make Stable Diffusion faster without buying new hardware?

Use SDXL Turbo or LCM models for 5-15x faster generation, install xFormers for 20-40% improvement, reduce image resolution to 512×512 for previews, lower sampling steps to 20-30, enable half-precision FP16 mode, and switch to ComfyUI from Automatic1111 for 10-20% gains.

What is the fastest Stable Diffusion model available?

SDXL Turbo is currently the fastest model, generating high-quality images in just 1-4 sampling steps. This provides 10-15x speed improvement over standard SDXL. LCM-LoRA models are the second fastest at 4-8 steps, working with any existing checkpoint.

Does xFormers actually improve Stable Diffusion speed?

Yes, xFormers provides 20-40% speed improvement by optimizing attention calculations. It reduces memory usage and speeds up generation on NVIDIA GPUs. Installation requires adding –xformers flag to your WebUI startup command. The improvement is consistent across different models and resolutions.

How much VRAM do I need for Stable Diffusion at 1024×1024?

For SD 1.5 at 1024×1024, you need 8GB VRAM minimum. For SDXL at 1024×1024, you need 12GB VRAM for comfortable generation. 4GB cards can only handle 512×512 reliably. 6GB cards work with SD 1.5 at 768×768 but struggle with SDXL beyond 512×512.

Can I run Stable Diffusion without a GPU?

Technically yes, but generation takes 30-60 minutes per image on CPU. This is impractical for actual use. Better options include using cloud GPU services like RunPod or Vast.ai, or buying a budget GPU like the RTX 3060 which costs under $300 and generates images in 15-25 seconds.

Is ComfyUI faster than Automatic1111?

ComfyUI is typically 10-20% faster than Automatic1111 due to its more efficient node-based architecture. The performance advantage comes from reduced overhead and better workflow optimization. However, ComfyUI has a steeper learning curve. For casual users, the difference may not justify learning a new interface.

What is SDXL Turbo and why is it faster?

SDXL Turbo is a specialized version of Stable Diffusion XL trained using adversarial distillation. This technique allows the model to produce high-quality images in just 1-4 sampling steps instead of 20-50. The result is 10-15x faster generation with minimal quality loss for most use cases.

Should I upgrade GPU or use cloud services for Stable Diffusion?

Buy a GPU if you generate daily or expect to use Stable Diffusion long-term. An RTX 3060 pays for itself vs cloud after about 300 hours of generation. Use cloud services like RunPod or Vast.ai for occasional intensive projects, testing before purchase, or when you need access to GPUs beyond your budget like A100s or RTX 4090s.

Final Recommendations

After three months of testing different GPUs, software configurations, and models, the path to faster Stable Diffusion is clear.

Start with the free optimizations: xFormers, SDXL Turbo, reduced resolution for previews, and optimized sampling settings.

These changes alone can reduce generation time by 70-90% without spending any money.

If you’re still not satisfied, consider upgrading to an RTX 3060 for the best value or a 4070 Super for professional speeds.

Cloud services like RunPod bridge the gap when you need temporary power without hardware investment.

The Bottom Line: “Combining SDXL Turbo with xFormers and optimized settings took my generation time from 55 seconds to 7 seconds. That’s the difference between waiting around for images and actually enjoying the creative process.”


Comments

Leave a Reply

Your email address will not be published. Required fields are marked *