Automating code deletion with Gemini (and a little Python)§

Today I used Gemini 2.0 Flash and a little Python to automate the process of removing code from over 200 GN build files. Here’s how it went.

Background§

Back in Q1 2025 we migrated pigweed.dev from GN to Bazel. I.e. previously we built the site with GN and now we build it with Bazel. We kept the old GN build around in case some unforeseen issues with the Bazel build came up and we needed to quickly revert to the “known good” GN build. Fast forward to today, no such issues have popped up. The Bazel build has been running smoothly in production for a few months now. Keeping the old GN docs build is slowing down Pigweed contributors, because they have to keep both the GN and Bazel docs builds happy. It’s time to turndown the GN build.

The actual process of turning down the GN docs build is fairly simple. I just needed to remove the documentation generation (docgen) features from all of the BUILD.gn files. It’s easy to detect the existence of docgen features; you just search for a few certain keywords (e.g. pw_docgen or pw_doc_group) within each BUILD.gn file’s contents. It’s probably feasible to automate this through regex-based automation but I wanted to find out if Gemini could handle this task. Also, if this approach works, there are lots of other problems that aren’t reducible to regex automation where this approach may come in handy.

Approach§

My general approach was to:

  • Grab all source files in the repo.

  • Filter out everything except the BUILD.gn files.

  • Filter out the BUILD.gn files that do not use the docgen features.

  • Use Gemini to remove the docgen features from each BUILD.gn file, one-by-one. I.e. one Gemini API invocation for each file.

My former boss Meggin Kearney calls this the hybrid approach to GenAI. For some tasks, such as listing out all files and determining relevant files, you’re sometimes better off just using “classical automation” e.g. a Python script. Only use GenAI for the things that can’t be handled by classical automation.

Implementation§

  1. Create and activate a virtual environment in the root directory of the repository:

    python3 -m venv venv
    . venv/bin/activate.fish
    
  2. Install the Gemini Python library:

    python3 -m pip install google-genai
    
  3. Save the following script as edit.py:

    from json import dump, load
    from os import environ, walk
    from pathlib import Path
    from subprocess import run
    from typing import Dict
    
    from google.genai import Client
    
    
    def _is_ignored(root: Path, target: Path, ignored: list[Path]) -> bool:
        """Check if Git is ignoring the path."""
        # Ignore Git's directory itself.
        if str(target).lower().endswith(".git"):
            return True
        # Check if this path matches something in ``.gitignore``.
        command = ["git", "-C", str(root), "check-ignore", str(target)]
        result = run(command, capture_output=True, text=True)
        return str(target) in result.stdout
    
    
    def _is_in_ignored_dir(target: Path, ignored: list[Path]):
        """Check if this path is in an ignored directory."""
        for maybe_parent_dir in ignored:
            if str(maybe_parent_dir) in str(target):
                return True
        return False
    
    
    def collect_paths(root: Path) -> list[Path]:
        """Collect all paths in the repository."""
        paths: list[Path] = []
        ignored: list[Path] = []
        for current_working_dir, _, files in walk(root):
            cwd = Path(current_working_dir)
            if _is_in_ignored_dir(cwd, ignored):
                continue
            if _is_ignored(Path(root), cwd, ignored):
                ignored.append(cwd)
                continue
            for file in files:
                path = cwd / Path(file)
                if _is_ignored(Path(root), path, ignored):
                    ignored.append(path)
                    continue
                paths.append(path)
        return paths
    
    
    def _is_gn_build_file(path: Path) -> bool:
        """Check if the path is a GN build file."""
        return str(path).endswith("BUILD.gn")
    
    
    def _uses_docgen(path: Path) -> bool:
        """Check if the GN build file has any of the docgen keywords."""
        with open(path, "r") as f:
            content = f.read()
        keywords = ["pw_docgen", "pw_doc_group", "pw_size_diff", "pw_doc_gen"]
        for keyword in keywords:
            if keyword in content:
                return True
        return False
    
    
    def filter_paths(root: Path, paths: list[Path]) -> list[Path]:
        targets = []
        for path in paths:
            if not _is_gn_build_file(path):
                continue
            if not _uses_docgen(path):
                continue
            targets.append(path)
        return targets
    
    
    def _remove_backticks(edits: str) -> str:
        """Remove the backticks that Gemini adds at the start and end of the output."""
        lines = edits.splitlines()
        if lines[0].startswith("```"):
            lines.pop(0)
        last = len(lines) - 1
        if lines[last].startswith("```"):
            lines.pop(last)
        return "\n".join(lines)
    
    
    def remove_docgen_code(prompt: str, target: Path, gemini: Client) -> None:
        print(f"[info] Editing {str(target)}")
        with open(target, "r") as f:
            src = f.read()
        contents = prompt + src
        response = gemini.models.generate_content(
            model="gemini-2.0-flash",
            contents=contents,
        )
        if response.text is None:
            return
        edits = _remove_backticks(response.text)
        with open(target, "w") as f:
            f.write(edits)
    
    
    def main():
        root = Path(".")
        paths = collect_paths(root)
        targets = filter_paths(root, paths)
        print(f"[info] {len(targets)} files will be edited")
        with open("prompt.md", "r") as f:
            prompt = f.read()
        gemini = Client(api_key=environ["GEMINI_API_KEY"])
        for target in targets:
            remove_docgen_code(prompt, target, gemini)
    
    
    if __name__ == "__main__":
        main()
    

    (This script assumes that git is a globally available command.)

  4. Save the instructions as prompt.md:

    # GN DOCS BUILD TURNDOWN
    
    ## BACKGROUND
    
    Previously we built our docs with GN. Now, we build them with Bazel.
    We no longer need the GN document generation (docgen) features.
    
    ## GOAL
    
    Your task is to remove the GN docgen features from the GN file that
    I provide you. You must not modify any other lines in the GN files.
    You must output the edited GN file with no explanation. The keywords
    `docs`, `pw_docgen`, `pw_doc_group`, `pw_size_diff`, and `pw_doc_gen`
    indicate docgen features that need to be removed.
    
    ## EXAMPLE
    
    When provided a file like this:
    
    ```
    # Copyright 2024 The Pigweed Authors
    #
    # Licensed under the Apache License, Version 2.0 (the "License"); you may not
    # use this file except in compliance with the License. You may obtain a copy of
    # the License at
    #
    #     https://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
    # WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
    # License for the specific language governing permissions and limitations under
    # the License.
    import("//build_overrides/pigweed.gni")
    import("$dir_pw_docgen/docs.gni")
    import("$dir_pw_unit_test/test.gni")
    import("$dir_pw_build/target_types.gni")
    import("$dir_pw_unit_test/test.gni")
    
    config("public_include_path") {
      include_dirs = [ "public" ]
      visibility = [ ":*" ]
    }
    
    pw_source_set("my_library") {
      public = [ "public/my_library/foo.h" ]
      deps = [":an", ":unsorted", ":list"]
      public_configs = [ ":public_include_path",
      ]
    }
    
    pw_doc_group("docs") { sources = [ "docs.rst" ] }
    
    pw_test_group("tests") {
    }
    ```
    
    You should modify the file like this:
    
    ```
    # Copyright 2024 The Pigweed Authors
    #
    # Licensed under the Apache License, Version 2.0 (the "License"); you may not
    # use this file except in compliance with the License. You may obtain a copy of
    # the License at
    #
    #     https://www.apache.org/licenses/LICENSE-2.0
    #
    # Unless required by applicable law or agreed to in writing, software
    # distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
    # WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
    # License for the specific language governing permissions and limitations under
    # the License.
    import("//build_overrides/pigweed.gni")
    import("$dir_pw_unit_test/test.gni")
    import("$dir_pw_build/target_types.gni")
    import("$dir_pw_unit_test/test.gni")
    
    config("public_include_path") {
      include_dirs = [ "public" ]
      visibility = [ ":*" ]
    }
    
    pw_source_set("my_library") {
      public = [ "public/my_library/foo.h" ]
      deps = [":an", ":unsorted", ":list"]
      public_configs = [ ":public_include_path",
      ]
    }
    
    pw_test_group("tests") {
    }
    ```
    
    ## INSTRUCTIONS
    
    Remove the GN docgen features from the following file. Remember that
    the keywords `docs`, `pw_docgen`, `pw_doc_group`, `pw_size_diff`, and
    `pw_doc_gen` represent the docgen features that should be deleted.
    You must output the edited GN file completely, without explanation.
    

    (In the Python script, the source code of a single BUILD.gn file is appended after these instructions.)

  5. Run the script:

    python3 edit.py
    

Results§

235 files were modified. Perhaps most miraculously, the build continued to work after all of Gemini’s removals, modulo a couple minor issues.

Here’s the change: https://pwrev.dev/286672

Cost§

I think my total Gemini API costs were between $30-40. I’m not sure at the moment because Google Cloud seems to be having issues around not displaying billing costs in realtime.

If I had been more careful with making sure the script worked completely before ever invoking the Gemini API, I think the cost would have been between $10-20. As I was developing the script, I would let Gemini run on a bunch of files and then realize that something was wrong and would have to start all over again.

Time§

The script took 20 minutes to run. There are some obvious ways to add more parallelism, but I was already running into quota limit issues. It was simpler for me to keep the script single-threaded, processing only one file at a time.

It took me about a day to get the setup working completely. Right now, it feels a bit like I fell in xkcd’s Automation trap. But if I am able to successfully adapt this setup for other tasks with minimal further customization, then I do think I will start to see some significant time savings.

Reviewing all 235 modified files took another 30 or 40 minutes.

Accuracy§

Gemini 2.0 Flash was very good at only removing the code that I told it to remove and leaving the rest untouched. I only saw one generation error, where Gemini left off a quotation mark:

"extendhfsf2.c",
extendsfdf2.c",

Extra backticks§

Due to how the prompt is structured, Gemini assumed that it should start and end its response with triple backtick characters. Writing a little Python to remove the extra backticks was easier than coaxing Gemini to stop doing this.

Going beyond the instructions§

In some of the files, Gemini went beyond the instructions that I gave it by removing pw_size_diff code. This was interesting, because Gemini was technically correct to remove this code. pw_size_diff is basically a docgen feature. But I did not tell Gemini to remove this code. In this case, it worked OK. In a team meeting we decided that the pw_size_diff code should be removed. (I subsequently added pw_size_diff to the list of docgen keywords to look for.) But in other situations it may be a problem if Gemini does not follow my instructions closely.

There was one (and only one!) case where two extra dependencies were added to a rule that creates a C++ library. I’m not sure why it happened only once in the 235 files that were edited. Pretty weird and disconcerting.

Following the instructions too closely§

In other cases, Gemini followed my instructions too closely. E.g. there was an if block like this:

# We depend on emboss, so we can only compute size when emboss is in the build.
if (dir_pw_third_party_emboss != "") {
  pw_size_diff("use_passthrough_proxy_size_report") {
    title = "pw_bluetooth_proxy Passthrough Size Report"
    base = "$dir_pw_bloat:bloat_base"
    binaries = [
      {
        target = "size_report:use_passthrough_proxy"
        label = "Create and use proxy as a simple passthrough"
      },
    ]
  }
} else {
  pw_size_diff("use_passthrough_proxy_size_report") {
    title = "pw_bluetooth_proxy Passthrough Size Report"
    base = "$dir_pw_bloat:bloat_base"
    binaries = [
      {
        target = "$dir_pw_bloat:bloat_base"
        label = "Emboss not configured."
      },
    ]
  }
}

Gemini correctly deleted the pw_size_diff code but left behind a now empty and useless if block:

# We depend on emboss, so we can only compute size when emboss is in the build.
if (dir_pw_third_party_emboss != "") {
} else {
}

Newline munging§

A lot of newlines at the end of files got messed up. This was not a big deal because Pigweed has a utility (pw format) for enforcing consistent newline style at the end of all files. If we didn’t have pw format this would have been very annoying to fix.