Programmers are spending more time reviewing and integrating AI-written code than typing it themselves. In some startups, tools powered by large language models are already producing most of the code. One founder recently shared that nearly all the code in one batch of startups came from automated tools.
To build on this, Microsoft Research created debug-gym — a tool that gives coding agents the same basic tools that human developers use when fixing mistakes. This includes setting breakpoints, printing variable values, navigating files and writing simple test cases.
Instead of guessing what might fix a bug based only on error messages, debug-gym lets AI agents run small tests, gather clues and use that to adjust their answers. The tool is meant to get machines to act more like real engineers, especially when something doesn’t work the first time.
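To make that loop concrete, here is a rough sketch of what an agent-driven debugging cycle could look like. The environment and tool names are illustrative assumptions, not debug-gym's actual interface.

```python
# Minimal sketch of an agent-style debug loop. The environment and tool
# names are hypothetical, for illustration only -- they are not
# debug-gym's actual API.

def debug_loop(env, agent, max_steps=10):
    """Let an agent investigate a failing test instead of guessing a fix."""
    observation = env.run_tests()          # start from the failing test output
    for _ in range(max_steps):
        # The agent picks a debugger-style action, e.g. "break file.py:42",
        # "print my_variable", "edit file.py", or "run tests".
        action = agent.next_action(observation)
        observation = env.execute(action)  # result of the breakpoint, print, etc.
        if observation.get("tests_passed"):
            return True                    # fix confirmed by re-running the tests
    return False                           # give up after the step budget
```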
Why Does Debugging Matter So Much?
Developers spend much of their time checking whether something works and then tracing what went wrong when it doesn’t. That work can be repetitive, but it’s where most of the real thinking happens.
AI tools are good at writing first drafts, but when the output breaks, they often don’t know what to do next. That’s because they’re not trained to stop, test and try again; they simply continue the text based on what they’ve seen before.
With human programmers, it’s different. They test, pause, check what the program is doing, and then adjust their code. This helps them understand the root of a problem instead of guessing at the surface level. Until machines can copy this behaviour, they’ll always fall short in this part of the job.
Debug-gym opens the door to that kind of behaviour. It lets coding agents go through those steps, run code, check what happened, and decide what to try next. This makes them less reliant on lucky guesses and more grounded in the actual task in front of them.
How Might These Methods Be Refined?
The first tests show that coding agents using debug-gym tools did better than those without them. In one set of 300 test tasks, the success rate nearly doubled. Still, the best result topped out below 50%. That’s not enough to hand over full responsibility.
One reason is that the language models being used haven’t been trained on enough real debugging examples. They don’t yet know how to ask useful questions or track information across different checks. Their habits come from finishing sentences, not from fixing broken software.
Microsoft plans to train smaller helper models that gather useful details. These small models would feed that context into larger models that generate the code. That way, the heavy thinking is shared. It also makes the system cheaper to run, since only one part needs to be large.
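A rough sketch of how that split could look is below. The two model objects and their `generate` method are hypothetical placeholders, not a published API.

```python
# Illustrative sketch of the proposed split: a small "context gatherer" model
# distils debugger output, and a larger model proposes the actual code change.
# Both model calls are hypothetical placeholders, not a real Microsoft API.

def propose_fix(debugger_trace: str, failing_test: str,
                small_model, large_model) -> str:
    # Small, cheap model: turn noisy stack traces and variable dumps into a
    # short summary of what seems to be wrong.
    summary = small_model.generate(
        f"Summarise the likely root cause:\n{debugger_trace}\n{failing_test}"
    )
    # Large, expensive model: only sees the distilled context, which keeps
    # prompts short and the overall system cheaper to run.
    return large_model.generate(
        f"Root-cause summary:\n{summary}\n\nWrite a patch that fixes the bug."
    )
```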
The team behind debug-gym is also working on collecting more real debugging data. Every time a human or agent uses the tool and tests a fix, that step can become part of the next training round. This could slowly help AI agents develop better habits over time.
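As an illustration, each debugging step could be appended to a simple JSONL file for later training rounds. The record format here is an assumption, using only Python's standard library.

```python
# Sketch of how one debugging session could be captured as training data.
# The record fields are assumed for illustration.
import json

def log_step(path: str, state: str, action: str, result: str, fixed: bool) -> None:
    """Append one debugging step (observation, action, outcome) as a JSON line."""
    record = {"state": state, "action": action, "result": result, "fixed": fixed}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Example: one step from a session where the fix had not yet been found.
log_step("debug_traces.jsonl",
         state="AssertionError in test_total()",
         action="print(order.items)",
         result="items list is empty after cancel()",
         fixed=False)
```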
Right now, AI can help speed things up. It can suggest fixes and write boilerplate code, especially for tasks it has seen before. But when it comes to tracking down bugs in unfamiliar systems, it still needs help.