
Build a Multi-File Refactor Agent with GPT-5.5
GPT-5.5 is OpenAI's strongest model yet for long agentic coding. Here's how to build an agent that does safe multi-file refactors across a real codebase.
Key takeaways
- Build the refactor agent with GPT-5.5 on the Responses API, which takes an input list and returns tool calls as items in response.output.
- Keep two phases separate: the model proposes file writes into a staging dict, and a human approves a real diff before anything hits disk.
- Resolve every file path against a fixed repo root in the executor so the model cannot edit files outside the project.
- Work inside a committed git tree and run the test suite after applying, since git checkout is the undo button and tests catch silent breakage.
OpenAI shipped GPT-5.5 (model id gpt-5.5, snapshot gpt-5.5-2026-04-23) and it is now their strongest public model for the kind of coding work that runs for a while and touches a lot of files. Not the "write me a quicksort" demo, but the messy stuff: renaming a function that is imported in nineteen places, threading a new argument through three layers, fixing a bug whose cause is two files away from the symptom. That is where earlier models tended to lose the plot around file seven.
The model alone does not refactor your codebase, though. It produces text. To actually read files, plan a change, and write it back, you need a loop around it: tools the model can call, an executor that runs those tools, and a human checkpoint before anything hits disk. That loop is the agent, and that is what we are building here.
I want to be honest up front about the risk, because the whole point of this post is doing it safely. An agent that can write files can also overwrite the wrong file, delete a function it misread, or confidently apply a "fix" that breaks twelve tests. We are going to design around that, not pretend it away.
What you'll need
- Python 3.11 or newer.
- The OpenAI Python SDK (pip install openai), version recent enough to support the Responses API.
- An API key in OPENAI_API_KEY.
- A git repo you do not mind experimenting on. Commit everything first so you have a clean baseline to diff against and revert to.
- A test command for that repo (pytest, npm test, whatever you use). The agent leans on this as its safety net later.
A quick note on the API choice. GPT-5.5 works best with the Responses API (client.responses.create), and OpenAI recommends it for anything involving reasoning, tool calls, or multiple turns, which is exactly our situation. So that is what we use throughout. The Responses API takes an input list rather than the messages list you may remember from Chat Completions, and tool calls come back as items in response.output. Those two differences trip people up, so keep them in mind.
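To make those two differences concrete, here is a small offline sketch. It builds an input list in the Responses API style and picks the function calls out of a stand-in output list. The SimpleNamespace objects are placeholders I made up to imitate the SDK's output items, and no request is sent.

```python
import json
from types import SimpleNamespace

# Responses API style: a flat `input` list of role/content items.
input_list = [
    {"role": "system", "content": "You are a careful refactoring agent."},
    {"role": "user", "content": "Rename fetch_user to get_user."},
]

# Stand-ins for items in response.output (assumed shape, not real SDK objects).
fake_output = [
    SimpleNamespace(type="reasoning"),
    SimpleNamespace(
        type="function_call",
        name="read_file",
        arguments=json.dumps({"path": "src/users.py"}),
        call_id="call_1",
    ),
]


def function_calls(output):
    """Return only the function_call items from an output list."""
    return [item for item in output if item.type == "function_call"]


calls = function_calls(fake_output)
parsed_args = json.loads(calls[0].arguments)  # arguments arrive as a JSON string
print(calls[0].name, parsed_args, calls[0].call_id)
```

The point is only the shape: other item types sit alongside function calls in the same list, so you filter by type rather than assuming every item is a tool call.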
Step 1: Define the tools
The agent needs to do three things to a codebase: list files, read a file, and write a file. Each one becomes a function tool. In the Responses API, a tool is a plain dict with type: "function", a name, a description, and a JSON Schema for its parameters.
The descriptions matter more than people expect. The model decides when to call a tool based mostly on what you wrote in the description, so spell out what it does and when to use it.
import os
import json
import subprocess
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# The directory the agent is allowed to touch. Everything is resolved
# against this so the model cannot wander outside the project.
ROOT = Path("/path/to/your/repo").resolve()

tools = [
    {
        "type": "function",
        "name": "list_files",
        "description": (
            "List source files in the project, relative to the repo root. "
            "Use this first to understand the layout before reading anything."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "subdir": {
                    "type": "string",
                    "description": "Optional subdirectory to limit the listing, e.g. 'src/api'.",
                }
            },
            "required": [],
            "additionalProperties": False,
        },
    },
    {
        "type": "function",
        "name": "read_file",
        "description": "Read the full contents of one file, given its path relative to the repo root.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Path relative to repo root."}
            },
            "required": ["path"],
            "additionalProperties": False,
        },
    },
    {
        "type": "function",
        "name": "write_file",
        "description": (
            "Propose new full contents for a file. This does NOT write to disk immediately. "
            "The change is staged for human review and applied only after approval. "
            "Always read a file before proposing a rewrite of it."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Path relative to repo root."},
                "content": {"type": "string", "description": "The complete new file contents."},
            },
            "required": ["path", "content"],
            "additionalProperties": False,
        },
    },
]
Notice the write_file description says plainly that nothing hits disk yet. That is deliberate. We want the model to think of writes as proposals, not as final actions, and the staging behavior we build in step 3 backs that up.
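Since so much rides on those definitions, a tiny lint pass over them can catch slips before the model ever sees them. This is a sketch of my own, not part of the SDK: check_tool is a hypothetical helper, and the 20-character minimum for a description is an arbitrary threshold I picked for illustration.

```python
def check_tool(tool: dict) -> list[str]:
    """Return a list of problems found in one function-tool definition."""
    problems = []
    if tool.get("type") != "function":
        problems.append("type must be 'function'")
    if not tool.get("name"):
        problems.append("missing name")
    # Arbitrary threshold: a very short description gives the model little to go on.
    if len(tool.get("description", "")) < 20:
        problems.append("description is missing or too short")
    params = tool.get("parameters", {})
    if params.get("type") != "object":
        problems.append("parameters.type must be 'object'")
    missing = set(params.get("required", [])) - set(params.get("properties", {}))
    if missing:
        problems.append(f"required names not in properties: {sorted(missing)}")
    return problems


good = {
    "type": "function",
    "name": "read_file",
    "description": "Read the full contents of one file, given its path relative to the repo root.",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
        "additionalProperties": False,
    },
}
bad = {
    "type": "function",
    "name": "read_file",
    "description": "Reads.",
    "parameters": {"type": "object", "properties": {}, "required": ["path"]},
}

print(check_tool(good))
print(check_tool(bad))
```

Running it over every entry in your tools list at startup is cheap, and a typo in a required name is much easier to spot there than in a confused tool call twenty turns in.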
Step 2: Implement the executors
The tool definitions above are just schemas. Now we write the actual Python that runs when the model calls each one. The read and list functions are real operations with no side effects. The write function does not write anything yet; it stashes the proposal in a dict so we can show a diff and ask for approval later.
# Staged writes live here: { relative_path: proposed_content }
pending_writes = {}

def _safe_path(rel_path: str) -> Path:
    """Resolve a path and refuse anything outside ROOT."""
    full = (ROOT / rel_path).resolve()
    # Compare path components, not raw strings: a string-prefix test would
    # accept a sibling directory such as "<ROOT>-backup".
    if not full.is_relative_to(ROOT):
        raise ValueError(f"Path escapes repo root: {rel_path}")
    return full

def list_files(subdir: str = "") -> str:
    base = _safe_path(subdir) if subdir else ROOT
    out = []
    for p in base.rglob("*"):
        if p.is_file() and ".git" not in p.parts and "node_modules" not in p.parts:
            out.append(str(p.relative_to(ROOT)))
    return "\n".join(sorted(out)) or "(no files)"

def read_file(path: str) -> str:
    return _safe_path(path).read_text(encoding="utf-8")

def write_file(path: str, content: str) -> str:
    _safe_path(path)  # validate before staging
    pending_writes[path] = content
    return f"Staged a proposed change to {path} ({len(content)} chars). Not yet written to disk."

DISPATCH = {
    "list_files": list_files,
    "read_file": read_file,
    "write_file": write_file,
}
The _safe_path guard is small but it is the difference between an agent that edits your project and one that can edit ~/.ssh/config. The model will sometimes produce a path you did not expect, so the executor, not the model, enforces the boundary.
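If you want to convince yourself the boundary holds, here is a standalone sketch that exercises the same idea against a throwaway directory. The safe_path and is_allowed helpers are illustrative names of mine, and this version does the containment check with Path.is_relative_to. It also shows why a plain string-prefix comparison is not enough: a sibling directory whose name merely starts with the repo's name slips through that kind of test.

```python
import tempfile
from pathlib import Path


def safe_path(root: Path, rel_path: str) -> Path:
    """Resolve rel_path under root and refuse anything that lands outside it."""
    full = (root / rel_path).resolve()
    if not full.is_relative_to(root):
        raise ValueError(f"Path escapes repo root: {rel_path}")
    return full


def is_allowed(root: Path, rel_path: str) -> bool:
    try:
        safe_path(root, rel_path)
        return True
    except ValueError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve() / "repo"
    root.mkdir()
    candidates = ["src/app.py", "../outside.txt", "../repo-backup/x.py", "/etc/hosts"]
    results = {c: is_allowed(root, c) for c in candidates}
    # What a naive string-prefix test would say about the sibling directory:
    sibling = (root / "../repo-backup/x.py").resolve()
    prefix_check_fooled = str(sibling).startswith(str(root))

print(results)
print(prefix_check_fooled)
```

Only the first candidate should come back allowed. The last line prints True, which is exactly the hole a prefix comparison leaves open.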
Step 3: The agent loop
Here is the core. We send the user's refactor request plus the tools, read the model's output, run any function calls it asked for, feed the results back, and repeat until it stops calling tools and gives a final answer.
The shape to internalize: tool calls arrive as items in response.output with type == "function_call". Each has a name, an arguments field (a JSON string you parse), and a call_id. You reply by appending an item of type "function_call_output" carrying the same call_id and your output string. Match those call_id values exactly or the model loses track of which result belongs to which call.
SYSTEM = """You are a careful refactoring agent working in a real codebase.
Workflow:
1. Use list_files and read_file to understand the code before changing anything.
2. Make a short plan and state it before you start editing.
3. Apply the refactor by calling write_file with COMPLETE new file contents,
   once per file you need to change. Span as many files as the change requires.
4. Never propose a write to a file you have not read in this session.
5. When every file is staged, summarize what you changed and why, then stop.
"""

def run_agent(task: str, max_turns: int = 40):
    input_list = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": task},
    ]
    for turn in range(max_turns):
        response = client.responses.create(
            model="gpt-5.5",
            input=input_list,
            tools=tools,
            reasoning={"effort": "high"},  # long agentic work earns the deeper thinking
        )
        # Carry the model's own output items forward as conversation state.
        input_list += response.output
        called_a_tool = False
        for item in response.output:
            if item.type != "function_call":
                continue
            called_a_tool = True
            name = item.name
            try:
                # Parse inside the try so malformed arguments are reported
                # back to the model instead of crashing the loop.
                args = json.loads(item.arguments)
                result = DISPATCH[name](**args)
            except Exception as e:
                result = f"ERROR: {e}"
            input_list.append({
                "type": "function_call_output",
                "call_id": item.call_id,
                "output": str(result),
            })
        if not called_a_tool:
            print(response.output_text)  # final summary from the model
            return
    print("Hit max turns without finishing. Review the staged changes carefully.")
A few decisions worth calling out. I set reasoning.effort to high rather than the default medium, because multi-file refactors are exactly the case where the extra deliberation pays off (the default is fine for cheaper, simpler tasks). I append response.output back into input_list wholesale so the model keeps its own reasoning and tool-call context across turns. And max_turns exists so a confused agent caps out instead of looping forever and billing you for it.
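The tool-handling part of the loop is easy to test without spending a token. Below is a minimal sketch under stated assumptions: answer_tool_calls is a hypothetical helper that mirrors the inner loop, the SimpleNamespace items are stand-ins for SDK output objects, and the echo tool exists only for this demo.

```python
import json
from types import SimpleNamespace


def answer_tool_calls(output, dispatch):
    """Run each function_call in `output` and build the matching reply items."""
    replies = []
    for item in output:
        if item.type != "function_call":
            continue
        try:
            args = json.loads(item.arguments)
            result = dispatch[item.name](**args)
        except Exception as e:  # report failures to the model instead of crashing
            result = f"ERROR: {e}"
        replies.append({
            "type": "function_call_output",
            "call_id": item.call_id,  # echo the id of the call being answered
            "output": str(result),
        })
    return replies


fake_output = [
    SimpleNamespace(type="function_call", name="echo",
                    arguments='{"text": "hi"}', call_id="call_a"),
    SimpleNamespace(type="function_call", name="missing_tool",
                    arguments="{}", call_id="call_b"),
]
replies = answer_tool_calls(fake_output, {"echo": lambda text: text.upper()})
for reply in replies:
    print(reply["call_id"], reply["output"])
```

Each reply carries the call_id of the call it answers, and the unknown tool comes back as an ERROR string rather than an exception, so the model gets a chance to correct itself on the next turn.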
Step 4: Plan first, then apply
Notice that the loop above never touches disk. By the time run_agent returns, pending_writes holds every proposed file change, but the working tree is untouched. That separation is the whole safety story: the model plans and proposes in one phase, and a human approves in another.
This two-phase design is also why the system prompt asks the model to state its plan before editing. You get to read the reasoning, see which files it intends to change, and catch a bad assumption before any byte is written.
Step 5: Dry-run diff and approval
Before we write anything, we show a real diff of each staged change against what is on disk, then ask for a yes. This is the gate. Nothing gets past it without a human typing y.
import difflib

def review_and_apply():
    if not pending_writes:
        print("No changes were proposed.")
        return
    for path, new_content in pending_writes.items():
        target = _safe_path(path)
        old = target.read_text(encoding="utf-8") if target.exists() else ""
        diff = difflib.unified_diff(
            old.splitlines(keepends=True),
            new_content.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
        print(f"\n===== {path} =====")
        print("".join(diff) or "(no textual difference)")
    answer = input(f"\nApply {len(pending_writes)} file change(s)? [y/N] ").strip().lower()
    if answer != "y":
        print("Aborted. Nothing written.")
        return
    for path, new_content in pending_writes.items():
        target = _safe_path(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(new_content, encoding="utf-8")
    print(f"Wrote {len(pending_writes)} file(s).")
    pending_writes.clear()
Read those diffs like you would read a teammate's pull request, because that is what they are. The model is good, not infallible, and a five-line diff in the wrong file is much easier to catch here than after it ships.
Step 6: Run the tests
Once the files are written, run your test suite and check the result. If something broke, you have a clean git baseline to revert to, and you can feed the failures back into a follow-up request.
def run_tests(cmd: str = "pytest -q"):
    print(f"\nRunning: {cmd}")
    result = subprocess.run(cmd, shell=True, cwd=ROOT, capture_output=True, text=True)
    print(result.stdout[-4000:])
    print(result.stderr[-2000:])
    return result.returncode == 0

# Putting it together:
if __name__ == "__main__":
    run_agent("Rename the function `fetch_user` to `get_user` everywhere it is "
              "defined or called, and update the imports accordingly.")
    review_and_apply()
    if run_tests():
        print("Tests pass.")
    else:
        print("Tests failing. Inspect, then `git checkout .` to revert if needed.")
That is the full arc: the agent reads and plans, you approve a diff, the files get written, and the tests tell you whether the refactor actually held together.
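Feeding failures back into a follow-up request can be as simple as building a new task string from the test output. Here is one way to sketch it. The helper name build_followup_task, its wording, and the 4000-character cap are my own assumptions, so adjust them to taste.

```python
def build_followup_task(original_task: str, test_output: str, max_chars: int = 4000) -> str:
    """Compose a follow-up request that includes the tail of the failing test output."""
    # Keep only the tail: test runners usually print failures and the summary last.
    tail = test_output[-max_chars:]
    return (
        "A previous refactor was applied and the test suite now fails.\n"
        f"Original task: {original_task}\n"
        "Fix the failures without undoing the refactor. Test output (tail):\n"
        f"{tail}"
    )


example = build_followup_task(
    "Rename fetch_user to get_user.",
    "x" * 5000 + "\nFAILED tests/test_users.py::test_lookup",
    max_chars=200,
)
print(example)
```

To use it, you would have to capture the test output as a string (the run_tests function above only prints it) and pass the result to run_agent as a fresh task, then go through the same diff gate again.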
Gotchas
Big repos blow past the context window. GPT-5.5 has a large context (over a million tokens), but list_files on a real project plus the contents of every file the model reads adds up faster than you would think, and you pay for every token of it. Do not feed the whole tree in blindly. Scope list_files to the relevant subdirectory, and let the model read files on demand rather than front-loading everything. If a refactor genuinely spans hundreds of files, break it into batches.
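Batching can be mechanical. The sketch below groups file paths by their parent directory and caps each batch, so each agent run gets a related, bounded set of files. batch_by_directory is a hypothetical helper, and the default of 20 files per batch is a guess rather than a tuned number.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def batch_by_directory(paths: list[str], max_per_batch: int = 20) -> list[list[str]]:
    """Group paths by parent directory, then split each group into capped batches."""
    by_dir = defaultdict(list)
    for p in sorted(paths):
        by_dir[str(PurePosixPath(p).parent)].append(p)
    batches = []
    for directory in sorted(by_dir):
        files = by_dir[directory]
        for i in range(0, len(files), max_per_batch):
            batches.append(files[i:i + max_per_batch])
    return batches


paths = ["src/api/a.py", "src/db/x.py", "src/api/c.py", "src/api/b.py"]
batches = batch_by_directory(paths, max_per_batch=2)
print(batches)
# [['src/api/a.py', 'src/api/b.py'], ['src/api/c.py'], ['src/db/x.py']]
```

Grouping by directory is a heuristic. Files that import each other do not always live together, so for a rename you may prefer to batch by which files mention the symbol.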
Destructive edits are the real danger. write_file takes complete file contents, which means a model that misremembers a file can quietly drop a function it did not mention. The diff review in step 5 is your main defense, so do not automate it away just because the agent has been right ten times in a row. Working only inside a committed git tree matters too. git checkout . is the undo button for tracked files, and you want it to exist. Files the agent created from scratch are untracked, so git checkout . leaves them in place; check git status and delete those by hand.
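For Python files, you can back up the diff review with a cheap structural check. This sketch uses the standard ast module to compare top-level function and class names before and after a proposed rewrite. The helper names are mine. An intentional rename shows up as a dropped name too, so treat the result as a prompt to look, not a verdict.

```python
import ast


def top_level_names(source: str) -> set[str]:
    """Names of functions and classes defined at module level."""
    tree = ast.parse(source)
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }


def dropped_definitions(old: str, new: str) -> set[str]:
    """Top-level definitions present in old but absent from new."""
    return top_level_names(old) - top_level_names(new)


old_source = """
def fetch_user(uid):
    return uid

def format_user(user):
    return str(user)
"""
new_source = """
def get_user(uid):
    return uid
"""
dropped = dropped_definitions(old_source, new_source)
print(sorted(dropped))  # ['fetch_user', 'format_user']
```

Here fetch_user is the expected rename, while format_user is the kind of quiet casualty you want flagged before you type y.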
Tests are the safety net, not a formality. An agent can produce a refactor that looks perfect in the diff and still breaks behavior. The test run is what tells you the rename did not miss a dynamic call or a string reference. If your repo has thin test coverage, a refactor agent is riskier there, and that is worth knowing before you point one at it.
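A plain text search for the old name is a useful complement to the tests, because it finds string references that no import follows. The following is a sketch, not a replacement for your test suite: find_leftovers is a hypothetical helper, it only looks at the suffixes you pass in, and the demo runs against a temporary directory.

```python
import re
import tempfile
from pathlib import Path


def find_leftovers(root: Path, old_name: str, suffixes=(".py",)):
    """Return (relative_path, line_number, line) for each whole-word match of old_name."""
    pattern = re.compile(rf"\b{re.escape(old_name)}\b")
    hits = []
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or path.suffix not in suffixes or ".git" in rel.parts:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((rel.as_posix(), lineno, line.strip()))
    return hits


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "app.py").write_text(
        'from api import get_user\nhandler = getattr(api, "fetch_user")\n',
        encoding="utf-8",
    )
    hits = find_leftovers(root, "fetch_user")

print(hits)
# [('app.py', 2, 'handler = getattr(api, "fetch_user")')]
```

The getattr call is the classic miss: the import was updated, but the name lives on inside a string that only fails at runtime.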
Cost adds up across turns. Every turn resends the growing input_list, so a forty-turn session re-bills a lot of context. Keep max_turns sane, use high reasoning only when the task warrants it, and watch your usage the first few runs so the bill does not surprise you.
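A rough back-of-the-envelope model shows why turn count matters so much. Assume, purely for illustration, that the conversation starts at 2,000 input tokens and grows by 3,000 tokens per turn. Both numbers are invented. Because each turn resends everything so far, total input tokens grow roughly with the square of the number of turns. The sketch ignores output and reasoning tokens and any provider-side discounts.

```python
def total_input_tokens(base: int, per_turn: int, turns: int) -> int:
    """Sum of input sizes when every turn resends the full history so far."""
    return sum(base + t * per_turn for t in range(turns))


# Hypothetical numbers: 2,000-token start, 3,000 tokens added per turn.
ten_turns = total_input_tokens(2000, 3000, 10)
forty_turns = total_input_tokens(2000, 3000, 40)
print(ten_turns)    # 155000
print(forty_turns)  # 2420000
```

Four times the turns costs about sixteen times the input tokens in this toy model, which is the argument for scoping tasks tightly instead of raising max_turns.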
Wrapping up
The interesting shift with GPT-5.5 is not that it writes better functions, it is that it can hold a multi-file change in its head long enough to finish the job without drifting. That makes a refactor agent genuinely useful instead of a toy.
The agent we built is deliberately small: three tools, a loop, a diff gate, a test run. You can grow it from here, maybe with an apply_patch-style tool for surgical edits instead of full rewrites, or a git tool so it can branch and commit on its own. But keep the two phases separate. The model proposes, a human approves, and the tests get the final word. Let the agent move fast inside those rails and it will save you real time. Take the rails away and it will eventually rewrite a file you cared about.
API details here reflect OpenAI's docs as of June 2026 (model id gpt-5.5, the Responses API, and the function-calling shape). If you are reading this later, check the current docs, since model ids and API surfaces move.