When to Rewrite, When to Refactor

Every CTO will face this decision at least once: do we rewrite this system from scratch, or do we refactor it incrementally? It's one of the highest-stakes technical decisions you can make, and the industry's track record on getting it right is, frankly, terrible.

Joel Spolsky famously called rewrites "the single worst strategic mistake that any software company can make." Netscape's rewrite took so long that it effectively handed the browser market to Internet Explorer. But there are also celebrated rewrites — Slack rebuilt their desktop client from scratch and shipped a dramatically better product. Shopify has executed multiple large-scale rewrites that accelerated their platform's evolution.

So which is it? Is rewriting always a mistake, or is it sometimes the right call? After leading both successful and failed rewrites throughout my career, I've developed a framework for making this decision. It's not a formula — the variables are too complex for that — but it provides structure for what is otherwise an emotionally charged debate.

Why We Get This Decision Wrong

The rewrite-versus-refactor decision is especially prone to cognitive biases. Engineers tend to overestimate the cleanliness of a greenfield system and underestimate the accumulated wisdom embedded in legacy code. That ugly conditional block with seventeen branches? It probably handles seventeen real edge cases discovered in production over three years. A rewrite that doesn't account for all of them will ship bugs that the old system solved long ago.

On the flip side, organizations tend to underestimate the compounding cost of incremental complexity. When a system has been patched, extended, and worked around for years, the cognitive overhead of understanding it becomes a tax on every future change. Refactoring can address this incrementally, but sometimes the foundation itself is the problem — and you can't refactor a foundation while the building stands on it.

The decision also gets distorted by politics. Engineers advocate for rewrites because building new things is more fun than maintaining old things. Managers resist rewrites because they're expensive and risky. Neither motivation is a good basis for a technical decision.

The Framework: Five Questions

When a team comes to me proposing a rewrite, I walk through five questions. The answers don't produce a binary yes/no, but they clarify the trade-offs and expose assumptions that might otherwise go unexamined.

Question One: What is the actual problem we're solving?

This sounds obvious, but it's remarkable how often teams propose a rewrite without clearly articulating the problem. "The code is messy" is not a problem statement. "We cannot add new payment methods without a six-week development cycle due to tight coupling in the payment processing module" is a problem statement.

Precise problem definition often reveals that the pain is localized. If the problem is in the payment module, you don't need to rewrite the entire system. You might need to rewrite the payment module, or you might be able to refactor it by extracting interfaces and replacing the implementation incrementally.
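To make "extracting interfaces" concrete, here is a minimal sketch in Python. Every name in it (PaymentMethod, LegacyCardPayments, WalletPayments, checkout) is hypothetical and invented for illustration; it is not taken from any real payment system. The idea is only that the calling code depends on a narrow interface, so the existing implementation can sit behind it unchanged while new implementations are added next to it.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Charge:
    """Hypothetical value object describing what should be charged."""
    amount_cents: int
    currency: str


class PaymentMethod(Protocol):
    """Hypothetical seam: the only thing checkout needs from a payment method."""

    def charge(self, charge: Charge) -> str:
        ...


class LegacyCardPayments:
    """Adapter around the existing, tightly coupled code path (stubbed here)."""

    def charge(self, charge: Charge) -> str:
        # A real adapter would delegate to the legacy module without changing it.
        return f"legacy-card:{charge.amount_cents}:{charge.currency}"


class WalletPayments:
    """A new payment method added behind the same interface."""

    def charge(self, charge: Charge) -> str:
        return f"wallet:{charge.amount_cents}:{charge.currency}"


def checkout(method: PaymentMethod, charge: Charge) -> str:
    # Checkout depends on the interface, not on any concrete implementation,
    # so implementations can be swapped or replaced one at a time.
    return method.charge(charge)
```

Under these assumptions, adding a payment method means writing one new class rather than touching checkout, and the legacy adapter can be replaced later without callers noticing.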

Question Two: Have we genuinely tried refactoring?

I require teams to demonstrate that they've made a serious attempt at refactoring before I'll approve a rewrite. Not a token effort — a genuine, well-planned refactoring initiative with clear milestones and dedicated engineering time.

At a previous company, a team wanted to rewrite our notification service. It was a monolithic Rails application that had grown organically over four years. Delivery was unreliable, adding new channels was difficult, and the codebase was hard to understand. The rewrite proposal estimated eight months and three engineers.

I asked the team to spend four weeks on a focused refactoring effort first. They extracted the delivery logic into a separate module with a clean interface, added comprehensive tests around the existing behavior, and replaced the synchronous delivery path with an async queue. After four weeks, the notification service was still a Rails app with plenty of rough edges, but adding new channels went from a multi-week effort to a two-day effort. The primary business problem was solved. The rewrite was shelved.
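The original service was a Rails app, and I am not reproducing its code here. As a rough Python sketch of the same shape, assume a hypothetical NotificationDispatcher in which each channel is a plain callable registered by name and delivery happens on a background worker fed by a queue. Retries, persistence, and backoff are deliberately left out.

```python
import queue
import threading
from typing import Callable, Dict, List, Tuple

# A channel is any callable taking (recipient, message). Hypothetical signature.
Sender = Callable[[str, str], None]


class NotificationDispatcher:
    """Illustrative only: a delivery module with a small, explicit interface."""

    def __init__(self) -> None:
        self._channels: Dict[str, Sender] = {}
        self._queue: queue.Queue = queue.Queue()
        self.failed: List[Tuple[str, str, str]] = []

    def register_channel(self, name: str, sender: Sender) -> None:
        # Adding a new channel is one registration call, not a change to the core.
        self._channels[name] = sender

    def enqueue(self, channel: str, recipient: str, message: str) -> None:
        if channel not in self._channels:
            raise KeyError(f"unknown channel: {channel}")
        # The caller returns immediately; delivery happens on the worker thread.
        self._queue.put((channel, recipient, message))

    def start(self) -> threading.Thread:
        worker = threading.Thread(target=self._run_worker, daemon=True)
        worker.start()
        return worker

    def stop(self) -> None:
        # A None item tells the worker to exit after draining earlier items.
        self._queue.put(None)

    def _run_worker(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:
                break
            channel, recipient, message = item
            try:
                self._channels[channel](recipient, message)
            except Exception:
                # A real system would retry; this sketch only records the failure.
                self.failed.append(item)
```

The design choice worth noticing is that a new channel is a single register_channel call, which is the kind of change that turns a multi-week effort into a short one.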

Sometimes refactoring genuinely doesn't work. If the data model is fundamentally wrong, if the technology platform has been abandoned by its maintainers, or if the architecture cannot support a critical business requirement regardless of how it's restructured — then refactoring is putting lipstick on a structural problem. But you need to prove that, not assume it.

Question Three: Can we do a strangler fig migration?

The strangler fig pattern — incrementally replacing a legacy system by building new functionality alongside it and gradually routing traffic to the new implementation — is the most reliable migration strategy I've encountered. It's slower than a big-bang rewrite, but it's dramatically safer.

One of my most successful "rewrites" was actually a two-year strangler fig migration. We had a monolithic e-commerce platform that needed to become a set of microservices. Instead of rewriting everything at once, we identified the service boundaries, built new services for new features, and gradually migrated existing functionality service by service. At any point during those two years, the system was fully functional. There was no "rewrite switchover day" with its associated risk. The old system shrank organically as the new services absorbed its responsibilities.

If a strangler fig approach is feasible for your situation, it's almost always the right choice. The question is whether the old and new systems can coexist during the transition. If they can share a data layer, or if you can introduce an API gateway to route between them, strangler fig is on the table.
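The routing half of that idea can be sketched in a few lines. This is a simplified illustration, not a real gateway: StranglerRouter, its rollout table, and the "new" and "legacy" backend labels are all hypothetical. It assumes requests can be routed by path prefix, and that each user should consistently land on the same side while a prefix is partially migrated.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StranglerRouter:
    """Illustrative router: decides per request which system should serve it."""

    # Maps a path prefix to the percentage (0-100) of users sent to the new service.
    rollout: Dict[str, int] = field(default_factory=dict)

    def backend_for(self, path: str, user_id: str) -> str:
        # First matching prefix wins; insertion order of the dict is preserved.
        for prefix, percent in self.rollout.items():
            if path.startswith(prefix):
                return "new" if self._bucket(user_id) < percent else "legacy"
        # Anything not yet migrated keeps going to the old system.
        return "legacy"

    @staticmethod
    def _bucket(user_id: str) -> int:
        # Stable hash so a given user always falls in the same bucket (0-99).
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % 100
```

Raising a prefix's percentage moves more users to the new service, and setting it back to 0 is the rollback. The old system shrinks as prefixes reach 100.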

Question Four: Do we understand why the old system is the way it is?

This is the question that separates successful rewrites from failed ones. Before you rebuild, you need to deeply understand the existing system — not just its architecture, but its history. Why were certain decisions made? What business constraints shaped the design? What edge cases does it handle that aren't documented anywhere?

I led a rewrite early in my career that failed precisely because we didn't respect this question. We were replacing a billing system that had been in production for seven years. The old system was a mess — spaghetti code, no tests, a database schema that looked like it had been designed by committee during an earthquake. We were confident we could do better.

Six months into the rewrite, we were still discovering edge cases. Tax calculation rules for specific jurisdictions. Proration logic for mid-cycle plan changes. Currency rounding rules that differed by payment processor. Grace period handling that varied by customer tier. Every one of these was implemented in the old system, buried in the code we'd dismissed as "messy." Each one represented a customer-facing behavior that someone, somewhere, depended on.

The rewrite eventually shipped, but it took 18 months instead of the estimated 6, and the first three months in production were a continuous stream of bug reports from customers whose billing behaved differently than it used to. The "ugly" old code wasn't ugly for no reason — it was ugly because billing is genuinely complex, and that complexity has to live somewhere.

Question Five: Can the team execute a rewrite successfully?

A rewrite is one of the most demanding projects an engineering team can undertake. It requires maintaining the old system while building the new one, migrating data and traffic without disruption, and keeping the business running throughout. It requires deep understanding of the problem domain, strong architectural judgment, and the organizational patience to see a multi-month project through to completion.

Be honest about your team's capacity. If your team is already stretched thin maintaining the existing system, adding a parallel rewrite effort will make both worse. If you've never shipped a major system migration before, a big-bang rewrite is a particularly risky place to learn.

When Rewrites Succeed

The successful rewrites I've been part of share common characteristics. The problem was well-defined and the legacy system's behavior was thoroughly documented before the rewrite began. The team had deep domain expertise and understood the business logic embedded in the old system. The migration was incremental, with a clear rollback strategy at each phase. And critically, the organization committed sufficient resources and a realistic timeline. Successful rewrites always take longer than estimated, and leadership that pulls the plug at month six of a twelve-month rewrite creates the worst possible outcome: the cost has been paid, and nothing has shipped.

One rewrite I'm particularly proud of was a data ingestion platform. The old system was built on a technology that was approaching end-of-life: the vendor had announced it would stop supporting it within 18 months. Refactoring wasn't an option because the core technology was the problem. We ran both systems in parallel for four months, comparing outputs to verify correctness, before we cut over. The investment in parallel running caught 23 behavioral differences that our test suite had missed.
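For readers who haven't done a parallel run, the comparison step looks roughly like the sketch below. It is not the tooling from that project; compare_in_parallel and Mismatch are hypothetical names, and it assumes both systems can be called as functions on the same input record and that their outputs can be compared with a plain equality check.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass(frozen=True)
class Mismatch:
    """One input for which the two systems disagreed."""
    record: Any
    legacy_output: Any
    new_output: Any


def compare_in_parallel(
    records: Iterable[Any],
    legacy: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
) -> List[Mismatch]:
    """Feed every record to both systems and collect the disagreements."""
    mismatches: List[Mismatch] = []
    for record in records:
        # The legacy output is treated as the source of truth during the run.
        expected = legacy(record)
        try:
            actual = candidate(record)
        except Exception as exc:
            # A crash in the new system is also a behavioral difference.
            actual = f"error: {exc!r}"
        if actual != expected:
            mismatches.append(Mismatch(record, expected, actual))
    return mismatches
```

Each mismatch then needs a human decision: either the new system has a bug, or the old behavior was itself wrong and the difference is intentional.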

When Refactoring Wins

Refactoring wins more often than our industry acknowledges. The unglamorous work of adding tests to legacy code, extracting modules, defining interfaces, and improving naming — this work compounds over time. A system that's been thoughtfully refactored over two years is often better than a rewrite could produce, because it retains all the production-hardened edge case handling while gaining structural clarity.
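One common way to add tests to legacy code is the characterization test: record what the code does today, then check that a refactored version does the same. The sketch below assumes the code under test is a pure function. The shipping-fee functions and their rules are invented purely for illustration and do not come from any system described in this article.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def record_behavior(func: Callable[..., Any], inputs: Iterable[Tuple]) -> Dict[Tuple, Any]:
    """Capture the current output for each input, whatever it happens to be."""
    return {args: func(*args) for args in inputs}


def behavior_changes(func: Callable[..., Any], baseline: Dict[Tuple, Any]) -> Dict[Tuple, Tuple[Any, Any]]:
    """Return {input: (recorded, current)} for every input whose output changed."""
    changes = {}
    for args, recorded in baseline.items():
        current = func(*args)
        if current != recorded:
            changes[args] = (recorded, current)
    return changes


def legacy_shipping_fee(weight_kg: int, express: bool) -> int:
    # Hypothetical legacy code with an undocumented rule buried in a branch.
    fee = 500
    if weight_kg > 10:
        fee = fee + (weight_kg - 10) * 40
    if express:
        fee = fee * 2
        if weight_kg > 10:
            fee = fee - 100
    return fee


def refactored_shipping_fee(weight_kg: int, express: bool) -> int:
    # Hypothetical refactoring: same rules, with the hidden discount named.
    overweight_kg = max(weight_kg - 10, 0)
    fee = 500 + overweight_kg * 40
    if not express:
        return fee
    heavy_express_discount = 100 if overweight_kg else 0
    return fee * 2 - heavy_express_discount


# Inputs chosen to sit on both sides of each branch in the legacy code.
CASES = [(w, e) for w in (0, 5, 10, 11, 12, 30) for e in (False, True)]
BASELINE = record_behavior(legacy_shipping_fee, CASES)
```

An empty result from behavior_changes(refactored_shipping_fee, BASELINE) says the restructuring preserved every recorded case. It says nothing about inputs that were never recorded, so the choice of cases matters.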

The key to successful refactoring is making it continuous rather than episodic. Teams that say "we'll refactor when we have time" never have time. Teams that treat refactoring as a normal part of every feature development cycle — the Boy Scout rule of leaving code better than you found it — keep their systems healthy without needing the heroic intervention of a rewrite.

The Decision Is Never Purely Technical

Ultimately, the rewrite-versus-refactor decision isn't just a technical question. It's a business question, a team question, and a risk management question. The right choice depends on your competitive situation, your team's capabilities, your organization's risk tolerance, and the specific nature of the technical problems you're facing.

What I can tell you with confidence is this: the worst outcomes come from making this decision emotionally. The engineer who's frustrated with legacy code and the manager who's afraid of risk are both letting emotion drive a decision that deserves rigorous analysis. Use the framework. Ask the hard questions. And whatever you decide, commit fully — a half-hearted rewrite and a half-hearted refactoring effort are both worse than either approach executed with conviction.