Why Terraform plans take 10 minutes (and don't have to)
You changed one security group rule. One line. The diff is three characters wide.
Terraform takes eleven minutes to tell you what it thinks about that.
This isn’t a rant. It’s a walk through the machinery. Once you see why terraform plan does what it does, the slowness stops feeling like a bug and starts feeling like an architectural decision made in 2014 that nobody revisited.
What actually happens during a plan
When you run terraform plan, terraform does three things in sequence:
1. Parse the config. Read your .tf files, build an in-memory graph of every resource and its dependencies. This is fast — sub-second for most configurations.

2. Refresh state. For every resource in state, make an API call to the cloud provider to get its current attributes. This is the slow part. Every resource. Every time. Regardless of what you changed.

3. Diff. Compare the refreshed state against the desired config. Generate the plan. Also fast.
Step 2 is where your eleven minutes went. If you have 800 resources in state, terraform makes 800 API calls to AWS (or GCP, or Azure) before it can tell you anything about your three-character change.
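The three phases can be sketched as a toy cost model. The numbers here are invented assumptions (per-call latency, terraform's default parallelism of 10), not measurements, but they show why phase 2 swamps the other two:

```python
RESOURCES_IN_STATE = 800
API_ROUND_TRIP_SECONDS = 0.8   # assumed average, including retries
CONCURRENCY = 10               # terraform's default -parallelism

def plan_seconds(resources: int) -> dict:
    """Illustrative wall-clock cost of each plan phase."""
    parse = 0.5                                                  # phase 1: config -> graph
    refresh = resources * API_ROUND_TRIP_SECONDS / CONCURRENCY   # phase 2: one call per resource
    diff = 0.3                                                   # phase 3: in-memory compare
    return {"parse": parse, "refresh": refresh, "diff": diff}

phases = plan_seconds(RESOURCES_IN_STATE)
print(phases)  # refresh dominates; the other phases are noise
```

Whatever the real constants are in your account, the shape is the same: two phases that cost effectively nothing, bracketing one phase that scales linearly with state size.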
The refresh exists for a reason
Terraform doesn’t trust its own state file. That’s a deliberate design choice, and it’s not wrong — someone could have changed a resource through the console, another tool, or a different terraform workspace. The state file might be stale.
So terraform checks. Every resource. It calls ReadResource on the provider for each one, gets back the current attributes, and updates its in-memory state before computing the diff.
When terraform was designed for 50-resource deployments, this was fine. A refresh across 50 resources takes a few seconds. The full-state refresh was a reasonable tradeoff: correctness over speed, and the speed cost was negligible.
At 800 resources, the tradeoff isn’t negligible anymore. At 2,000, it’s hostile.
API rate limiting makes it worse
Cloud providers rate-limit their APIs. AWS will throttle DescribeInstances calls if you hit them too fast. When terraform refreshes 800 resources, it’s making hundreds of concurrent API calls across multiple resource types. Some of those calls get throttled. Terraform retries with exponential backoff.
This is where the variance comes from. A plan that takes 8 minutes on Tuesday takes 14 minutes on Thursday because AWS was feeling protective about its DescribeSecurityGroups endpoint. You didn’t change anything. Your provider didn’t change anything. The cloud’s rate limiter just had a different mood.
And it compounds. If your team has three engineers running plans against the same AWS account simultaneously, you’re tripling the API pressure. Everyone’s plans get slower, not just the person who triggered the throttle. Monday mornings — when everyone opens their laptops and runs terraform plan on their branches — are the worst. You can practically watch the plan times drift upward between 9:00 and 9:30 AM.
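The retry behavior looks roughly like full-jitter exponential backoff: each throttled attempt waits a random delay drawn from a window that doubles per attempt, up to a cap. This is a sketch of that general scheme, not the AWS provider's exact policy:

```python
import random

def backoff_delays(max_retries: int = 5, base: float = 0.5, cap: float = 30.0) -> list:
    """Delays a provider might sleep between throttled API calls.

    Full jitter: each delay is uniform over [0, min(cap, base * 2**attempt)].
    The constants here are illustrative, not the AWS provider's actual values.
    """
    delays = []
    for attempt in range(max_retries):
        delays.append(random.uniform(0, min(cap, base * 2 ** attempt)))
    return delays

# One throttled resource can quietly add several seconds to a plan,
# and under contention many resources get throttled at once:
print(sum(backoff_delays()))
```

The randomness is the point: it is why the same plan takes 8 minutes one day and 14 the next with zero changes on your side.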
-refresh=false is a trap
You’ve probably tried terraform plan -refresh=false. It skips step 2 entirely. Plans finish in seconds.
The problem: you’re now diffing against stale state. If someone changed something through the console, or another workspace applied changes since your last refresh, your plan is lying to you. It’ll show a clean diff when there should be drift. Or worse, it’ll produce an apply that silently reverts someone else’s work because terraform thinks the old state values are the truth.
-refresh=false trades correctness for speed. That’s the opposite of what terraform was trying to do with the full refresh. You’ve flipped the tradeoff completely.
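A toy diff makes the failure mode concrete. The resource and attribute names are invented; the mechanic is not:

```python
def diff(desired: dict, believed: dict) -> dict:
    """Attributes terraform would change, given what it believes
    the current state to be."""
    return {k: v for k, v in desired.items() if believed.get(k) != v}

desired     = {"ingress_port": 443}
state_file  = {"ingress_port": 443}    # last refreshed a week ago
actual_cloud = {"ingress_port": 8443}  # a teammate changed it in the console

# With -refresh=false, terraform diffs against the state file:
print(diff(desired, state_file))    # empty -> "no changes", which is a lie

# With a refresh, it diffs against reality and surfaces the drift:
print(diff(desired, actual_cloud))  # the console change gets reconciled
```

Same config, same state file, opposite answers, and only the refreshed one matches the cloud.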
-target narrows the blast radius, mostly
terraform plan -target=aws_security_group.web tells terraform to only refresh and plan for that specific resource and its dependencies. This should be fast, and sometimes it is.
But -target has sharp edges. Terraform still evaluates upstream dependencies of the targeted resource. If your security group references a VPC, terraform refreshes the VPC too. If that VPC is referenced by 200 other resources through data sources and terraform_remote_state, you might pull in more of the graph than you expected.
And -target requires you to know the full resource address. If you changed a module that contains twelve resources, you need twelve -target flags. If you added a new resource that doesn’t exist in state yet, -target doesn’t help — terraform needs to evaluate the config to know about it.
Most teams don’t use -target for day-to-day work. It’s a break-glass option, not a workflow.
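The "pulls in more than you expected" behavior is just transitive closure over upstream edges. A sketch, with a hypothetical three-resource graph:

```python
# resource -> resources it depends on (upstream edges); addresses are made up
deps = {
    "aws_instance.app":       ["aws_security_group.web"],
    "aws_security_group.web": ["aws_vpc.main"],
    "aws_vpc.main":           [],
}

def upstream_closure(target: str) -> set:
    """Everything terraform must evaluate when you -target one address."""
    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(deps.get(node, []))
    return seen

print(upstream_closure("aws_security_group.web"))
# the VPC rides along even though you only targeted the security group
```

In a real graph with remote state and data sources in the chain, that closure can get large fast.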
State splitting is the standard advice
When plans get slow, the universal recommendation is: split your state. Move networking into one state, compute into another, IAM into a third, and so on.
This works. Smaller states mean fewer resources to refresh. Your 800-resource monolith becomes four 200-resource states, and each plan takes a quarter of the time.
Until you need to coordinate across them.
The VPC lives in the networking state. The security groups reference it via terraform_remote_state. The EC2 instances in the compute state reference the security groups the same way. Now a networking change requires a three-step deploy: apply networking, wait for its state write to land, apply security groups, wait again, apply compute.
You traded slow plans for deployment choreography. The aggregate plan time across all three states is roughly the same — you’ve just distributed it. And if two engineers need the same smaller state, you’re back to lock contention, just with a shorter queue.
State splitting also doesn’t help with the rate limiting problem. Three engineers planning against three different states in the same AWS account still hit the same API rate limits. The throttling doesn’t know about your state boundaries.
The graph is right there
This is the part that should bother you.
Terraform already builds a dependency graph. Step 1 — the fast part — constructs a complete graph of every resource, every dependency edge, every module boundary. Terraform knows that your security group change only affects the security group and the three instances that reference it. It computed that relationship graph before it started the refresh.
Then it ignores that information and refreshes everything anyway.
The dependency graph is a solved problem inside terraform. It’s used for ordering creates and destroys. It’s used for parallelizing applies. The graph exists, it’s correct, and it gets thrown away before the refresh starts.
The state backend — S3, GCS, Azure Blob, Terraform Cloud — stores state as a single JSON blob. It has no concept of individual resources, no understanding of dependencies, no ability to tell terraform “only these four resources are relevant to your change.” The backend is a key-value store with a lock on it. It stores the blob, returns the blob, and gets out of the way.
That architectural boundary — smart client, dumb storage — is why the refresh is all-or-nothing. Terraform can’t ask the backend “what changed since my last plan?” because the backend doesn’t know what a resource is. It just knows there’s a JSON file and someone has the lock.
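You can see the mismatch in the shapes involved. The "dependencies" field below is real state-file metadata (state format version 4); the resource values are invented, and the backend "API" is reduced to its conceptual essence:

```python
import json

# A trimmed sketch of a state file: the graph is inside the blob.
state_blob = json.dumps({
    "version": 4,
    "resources": [
        {"type": "aws_vpc", "name": "main",
         "instances": [{"attributes": {"id": "vpc-123"}}]},
        {"type": "aws_security_group", "name": "web",
         "instances": [{"attributes": {"id": "sg-456"},
                        "dependencies": ["aws_vpc.main"]}]},
    ],
})

# The backend's entire conceptual API: whole blob in, whole blob out.
store = {}
def put(key: str, blob: str): store[key] = blob
def get(key: str) -> str: return store[key]

put("env/prod/terraform.tfstate", state_blob)
roundtrip = json.loads(get("env/prod/terraform.tfstate"))
print(roundtrip["resources"][1]["instances"][0]["dependencies"])
```

The dependency edges survive the round trip, but only the client ever parses them. The backend can't answer any question smaller than "give me everything."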
What a graph-aware backend would look like
If the state backend understood the dependency graph — if it stored resources individually with their relationships intact — the math changes completely.
A plan that changes one security group could ask the backend: “give me this security group and everything downstream of it.” The backend returns four resources instead of 800. Terraform refreshes four resources. The plan finishes in seconds.
This isn’t theoretical. The dependency graph already exists in two places: terraform computes it from config on every run, and it’s embedded in the state file’s resource dependency metadata. A backend that parsed state into its component resources could use that graph to scope refreshes to only what matters.
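The scoped query is a small graph traversal. A sketch of what such a backend could compute, with a hypothetical graph matching the security-group example:

```python
# resource -> resources that depend on it (downstream edges); addresses are made up
downstream = {
    "aws_vpc.main":            ["aws_security_group.web"],
    "aws_security_group.web":  ["aws_instance.a", "aws_instance.b", "aws_instance.c"],
    "aws_instance.a": [], "aws_instance.b": [], "aws_instance.c": [],
}

def refresh_scope(changed: str) -> set:
    """The changed resource plus everything downstream of it:
    the only resources whose drift can affect this plan's diff."""
    scope, stack = set(), [changed]
    while stack:
        node = stack.pop()
        if node not in scope:
            scope.add(node)
            stack.extend(downstream.get(node, []))
    return scope

scope = refresh_scope("aws_security_group.web")
print(sorted(scope))  # four resources to refresh, not the whole state
```

Against an 800-resource state, that is the difference between four ReadResource calls and 800.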
Nobody built backends this way because terraform was designed when 50 resources was a big deployment. The blob-and-lock model was fine for that era. But the era changed, the deployments grew, and the backend architecture didn’t move with them.
Your eleven-minute plan isn’t terraform being slow. It’s terraform being thorough across an architecture that doesn’t give it any way to be selective. The information to be selective is right there. It’s just on the wrong side of the storage boundary.