The AI modernization gold rush — and what it’s missing
The past twelve months have seen an explosion of interest in using LLMs to modernize IBM mainframe applications. Anthropic’s recent post on Claude Code and COBOL modernization captured headlines and rattled IBM’s share price. Thoughtworks published a thoughtful response grounding the conversation in delivery reality. AWS, Google, and a growing number of startups are all positioning AI as the answer to the mainframe application modernization problem.
At Heirloom Computing, we’ve spent over a decade building a deterministic COBOL-to-Java transpiler that compiles — not translates — mainframe applications to run on the JVM. We’ve migrated some of the largest mainframe workloads on record, including PHEAA’s student loan servicing platform, the biggest mainframe-to-AWS migration ever completed. So when the industry frames modernization as “let AI translate your COBOL,” we have a perspective on what that framing gets right, what it gets wrong, and where the real opportunity lies.
The short version:
LLMs are genuinely transformative for parts of the modernization journey. They are also structurally unsuited to the part that matters most.
Understanding why requires being precise about what “modernization” actually means.
What are we actually trying to do here?
This is where most conversations go wrong. They jump straight to solutions — “use AI to translate COBOL to Java” — without pausing on the problem.
No enterprise wakes up wanting their COBOL to be Java. They wake up with a different set of problems: they can’t hire people who understand the technology; their deployment cycle is measured in months; they’re paying mainframe license fees that could fund an entire cloud infrastructure; their regulators are asking questions about operational resilience they can’t confidently answer.
These are platform problems, not translation problems.
The COBOL isn’t the disease — it’s a symptom of an operational model that can’t adapt at the speed the business requires.
This distinction matters because it determines what success looks like. If modernization is a translation problem, success is code that looks like modern Java. If it’s a platform problem, success is operational capability: CI/CD, cloud deployment, horizontal scaling, observable systems, and — above everything else — the certainty that the migrated system produces exactly the same results as the one it replaced.
That last point is the one the AI narrative tends to glide past. And it’s the one that matters most.
What does “correct” actually mean at enterprise scale?
A bank processing trillions of dollars in annual transactions doesn’t want 99% behavioral fidelity in its migrated system. It needs 100%. Not because the bank is being unreasonable, but because the difference between 99% and 100% across millions of daily transactions is the difference between a functioning financial system and regulatory intervention.
This is worth dwelling on because it’s where the category error happens. “How accurate is the AI?” sounds like a rigorous question. But it treats a formal correctness problem as if it were a quality-on-a-spectrum problem. It’s a bit like asking “how much does trust cost?” — grammatically sound, but it puts trust in the wrong category. Trust isn’t a commodity you buy in quantities. It’s a relationship property. You either trust the output or you don’t. You either run it in production or you don’t.
100% correct is not ten times better than 90% correct. It’s a categorically different thing. It’s the difference between a bridge that bears weight and a bridge that probably bears weight. No one crosses the second bridge.
Deterministic correctness isn’t expensive. It’s priceless — not because it costs infinity, but because it exists outside the category of things that have a price. You can’t buy it by spending more on LLM inference. You can only get it by choosing an approach that produces it as a structural property.
What does a deterministic transpiler actually do?
A deterministic transpiler is a compiler. It takes COBOL source and produces JVM-compatible Java that does exactly the same thing — byte for byte, transaction for transaction. It doesn’t interpret. It doesn’t infer. It doesn’t approximate. The relationship between input and output is as fixed and verifiable as the relationship between a C program and its assembly output.
The output isn’t pretty. The Java looks like COBOL expressed in Java syntax. Paragraphs become methods, sections become classes, but the deep structure — control flow, data layouts, procedural patterns — reflects the mainframe application’s architectural assumptions.
Is this idiomatic? No. Is it provably correct, immediately deployable, and fully testable? Yes.
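To make the shape of that output concrete, here is a hypothetical sketch — not Heirloom’s actual emitted code — of how a simple COBOL paragraph might compile. Working-storage fields become typed fields with their implied decimal places preserved, and the paragraph becomes a method; the field names and class name are illustrative only. Real transpilers emit calls into a runtime library for COBOL data types rather than plain Java longs.

```java
// Illustrative sketch of transpiled output (not actual Heirloom output).
// COBOL PIC clauses with V (implied decimal point) are held as scaled
// integers so arithmetic stays exact, mirroring mainframe semantics.
public class PayCalc {
    // 05 WS-HOURS  PIC 9(3).      -> whole number
    // 05 WS-RATE   PIC 9(3)V99.   -> two implied decimals (stored as cents)
    // 05 WS-GROSS  PIC 9(5)V99.   -> two implied decimals (stored as cents)
    long wsHours;
    long wsRateCents;
    long wsGrossCents;

    // COMPUTE-GROSS-PAY paragraph -> method; COMPUTE -> exact integer math
    void computeGrossPay() {
        // COMPUTE WS-GROSS = WS-HOURS * WS-RATE
        wsGrossCents = wsHours * wsRateCents;
    }
}
```

The point of the sketch is the one made above: the control flow and data layout of the original survive intact, which is exactly what makes the output verifiable.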
This matters because it makes the testing problem tractable. You can validate the output against the same test cases that validated the original COBOL — same JCL, same transactions, same batch files, byte-for-byte comparison. The testing strategy isn’t “hope we caught all the edge cases.” It’s “verify that the compiler works.”
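The comparison step itself is mechanically simple. A minimal sketch, assuming the baseline and migrated batch outputs have been captured as files (file names here are hypothetical), using the JDK’s built-in byte-level file comparison:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OutputDiff {
    /**
     * Returns -1 if the two files are byte-identical, otherwise the
     * offset of the first differing byte. Uses Files.mismatch (JDK 12+).
     */
    static long firstMismatch(Path original, Path migrated) throws IOException {
        return Files.mismatch(original, migrated);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical captured outputs from the same batch run on both platforms.
        Path baseline = Files.writeString(Files.createTempFile("mainframe", ".out"), "REC00001|100.00\n");
        Path migrated = Files.writeString(Files.createTempFile("jvm", ".out"), "REC00001|100.00\n");
        System.out.println(firstMismatch(baseline, migrated)); // -1 means byte-identical
    }
}
```

The hard part of the testing strategy is not this diff — it’s having an output that is structurally guaranteed to pass it, which is what the compiler provides.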
The runtime layer handles the rest. CICS transactions, batch processing, JCL execution, BMS screen maps, IMS messaging, VSAM file access, RACF security — all reimplemented as Java libraries running on the JVM. The application doesn’t know it’s been migrated. It just runs — on premises, in any cloud, or even on the mainframe — with modern operational tooling wrapped around it.
Why can’t LLMs do this?
As Thoughtworks rightly pointed out, the core challenge of large COBOL systems isn’t that the code is unreadable — COBOL was designed to be readable. It’s scale and cognitive load. But the problem with LLMs goes deeper than that. It’s architectural.
Hallucination isn’t a bug, it’s the design. LLMs generate output token by token, each selection based on a probability distribution. This is extraordinarily powerful for tasks where creative variation is an asset — summarization, documentation, code suggestion, test generation. It’s structurally unsuited to tasks where any deviation from the correct output is a defect. When a packed decimal field is moved to a display numeric with an implicit decimal point, the byte-level behavior must be preserved exactly.
An LLM that “usually” gets this right is unusable at scale, because the failures are silent, contextual, and scattered across millions of lines of code.
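To make the packed-decimal point concrete: a COMP-3 field stores two BCD digits per byte, with the sign carried in the final nibble (0xD negative, 0xC or 0xF positive). A sketch of decoding one such field exactly — the class and method names are illustrative, not any vendor’s API:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class PackedDecimal {
    /**
     * Decode an IBM COMP-3 packed-decimal field into a BigDecimal with a
     * fixed scale. Two BCD digits per byte; the low nibble of the last
     * byte is the sign (0xD = negative, anything else = positive).
     */
    static BigDecimal decode(byte[] field, int scale) {
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < field.length; i++) {
            digits.append((field[i] >> 4) & 0x0F);      // high nibble: always a digit
            if (i < field.length - 1) {
                digits.append(field[i] & 0x0F);         // low nibble: digit, except in last byte
            }
        }
        int sign = (field[field.length - 1] & 0x0F) == 0x0D ? -1 : 1;
        return new BigDecimal(new BigInteger(digits.toString()), scale)
                .multiply(BigDecimal.valueOf(sign));
    }

    public static void main(String[] args) {
        // PIC S9(3)V99 COMP-3 holding +123.45 is stored as bytes 0x12 0x34 0x5C
        byte[] raw = { 0x12, 0x34, 0x5C };
        System.out.println(decode(raw, 2)); // prints 123.45
    }
}
```

Get the sign nibble or the implied scale wrong and the result is a plausible-looking number that is silently incorrect — precisely the class of defect a probabilistic translator scatters through a codebase and a compiler rules out by construction.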
Context windows can’t hold a system. A mainframe application isn’t a collection of independent programs. It’s a system: schedulers orchestrate workflows, implemented as JCL batch jobs that read VSAM files written by CICS transactions that call IMS databases backed by DB2 tables, with the results consumed by a plethora of interconnected services and third-party systems. A single business process may span dozens of programs, hundreds of copybooks, and thousands of interdependencies. Even at a million tokens, an LLM can’t hold the full dependency graph. It processes each program in partial context, missing the systemic interdependencies that define actual production behavior. A compiler resolves these dependencies at compile time, because that’s what compilers do.
The testing paradox. Proponents of LLM-based modernization argue that comprehensive testing mitigates the hallucination risk. But this creates a paradox: if you have comprehensive tests, you can refactor with confidence using any tool — you don’t need the LLM. If you don’t have comprehensive tests — and most mainframe installations don’t — you have no way to validate the LLM’s output. Deterministic transpilation sidesteps this entirely. The output is structurally equivalent to the input, so the same tests that validated the COBOL validate the Java.
Scale is the killer. Individual programs can be translated by LLMs with reasonable fidelity, especially with human review. This is the seductive demo. But mainframe modernization isn’t a demo. It’s a program of work spanning millions of lines of code, thousands of programs, and years of execution, all within an operational framework that defines business processes. At this scale, even a 1% error rate produces tens of thousands of defects, each requiring human investigation. Deterministic transpilation scales linearly. The compiler processes its millionth line with the same fidelity as its first. No fatigue, no context degradation, no accumulating probability of error.
So where does AI actually belong?
This is where the conversation gets genuinely interesting — and where the either/or framing the market has adopted shows its limitations.
The Thoughtworks piece describes a five-stage modernization workflow: pre-processing, AI-assisted discovery, AI-assisted planning, AI-assisted reverse engineering, and AI-assisted forward engineering. That’s a solid framework. What we’d add is a clearer separation of what needs to be deterministic and what benefits from being probabilistic.
Deterministic transpilation handles stage 1 and the mechanical part of stage 5. It compiles the code, preserves the behavior, liberates the platform. This isn’t AI-assisted — it’s compiler-executed. The output is mathematically equivalent to the input. Full stop.
AI handles stages 2, 3, 4, and the intelligent part of stage 5. Once you have a deterministic Java baseline that’s provably correct, AI becomes genuinely transformative:
Knowledge extraction: LLMs are exceptional at reading code and explaining what it does in plain language. The layer of meaning beyond the code itself — the institutional knowledge, the “why” behind the “what” — is where AI shines. Approximate understanding of intent is genuinely useful in documentation. A human reading AI-generated documentation can evaluate it, correct it, enrich it. A computer executing AI-generated code can’t.
Test generation: With a provably correct baseline, AI can generate comprehensive test suites that capture current behavior, creating a safety net for everything that follows. The deterministic transpilation guarantees the baseline is trustworthy. The AI accelerates test creation.
Idiomatic refactoring: With tests in place, AI can propose refactoring strategies — extracting services, introducing design patterns, modernizing data access — with confidence that regressions will be caught. Incremental, reversible, validated at every step.
Service decomposition: AI can analyze call graphs, data dependencies, and transaction boundaries to propose how to break a monolith into services. This is analytical, advisory work — exactly the kind of nuanced, context-sensitive reasoning LLMs do well.
The architecture isn’t “deterministic transpilation instead of AI.” It’s “deterministic transpilation as the foundation for AI-driven modernization.”
The compiler gives you the provably correct foundation. AI gives you the architect, the interior designer, and the landscape gardener. Neither does the other’s job.
What about the “just translate it” approach?
Thoughtworks made this point well: a direct translation would, in the best case, faithfully reproduce existing architectural constraints, accumulated technical debt, and outdated design decisions. It wouldn’t address weaknesses — it would restate them in a different language.
We’d go further. There are three possible outcomes of LLM-based code translation at scale:
It works perfectly: This hasn’t happened in any production migration we’re aware of, but if it did, you’d have idiomatic Java that faithfully reproduces the original system. You’d then need to prove it’s correct — and because idiomatic translation destroys the structural correspondence to the original code, there’s no direct way to compare the two. That proof is extremely difficult.
It works mostly: This is the realistic case. You’d have Java that’s correct for most paths and incorrect for some. Finding the incorrect paths requires testing every execution path in the original system, which is precisely the problem you were trying to avoid. The economics of this at scale are ruinous.
It works for the demo: A small, well-scoped program gets translated impressively. The results are extrapolated to the entire portfolio. The program begins. The defect rate becomes apparent at month six. By then, significant budget has been committed, and sunk-cost dynamics push invested stakeholders to double down on an increasingly unlikely outcome.
Deterministic transpilation produces a fourth outcome:
Java that is correct by construction, immediately comparable against the original, and deployable to modern infrastructure.
It’s not elegant. It’s not idiomatic. It’s not the kind of output that demos well. But it works — provably, repeatedly, and at any scale.
The real opportunity
The mainframe modernization market has historically offered enterprises a difficult choice: stay on the mainframe and accept the operational constraints, or undertake a multi-year rewrite with all its attendant risk. AI hasn’t changed this binary. What it’s done is made the rewrite option faster and cheaper — but no less risky.
Deterministic transpilation offers a third path: compile your way off the mainframe with mathematical certainty, then use AI to progressively modernize the architecture at whatever pace the business requires. Phase 1 is deterministic — compile, validate, deploy. Phase 2 is AI-powered — document, test, refactor, decompose. Each phase plays to the strengths of its technology. Neither asks a tool to do something it’s structurally incapable of doing.
We agree with Thoughtworks that solid mainframe skills remain essential and that modernization hasn’t been automated. We agree that AI tools are having a significant impact on discovery, planning, and delivery. Where we’d push the conversation further is on the specific question of behavioral preservation: this is not a problem that admits of probabilistic solutions. It’s a problem that demands a compiler.
The enterprises that understand this distinction will modernize successfully. The enterprises that chase the elegance of AI-generated code without first establishing a deterministic foundation will discover, expensively, that you can’t build on a foundation of probability.
100% correct isn’t a number on a scale. It’s a different kind of thing entirely.