TL;DR
A private GKE cluster’s outbound traffic to *.googleapis.com and *.pkg.dev flows through Cloud NAT by default and pays $0.0385/GB data processing on every byte, in both directions. The GCP UI says “Private Google Access is in effect” for the subnet, which makes it sound like that traffic already bypasses NAT. It does not. To bypass NAT for Google API traffic, I added a private Cloud DNS zone resolving the Google API hostnames to the restricted.googleapis.com VIP range (199.36.153.4/30) and a VPC route sending that /30 via the default internet gateway. After that, the traffic stays on Google’s backbone and skips the NAT gateway.
The same fix shifted nothing on a second cluster. NAT flow logs revealed cloud-sql-proxy connecting to Cloud SQL public IPs on port 3307 was the dominant traffic there. Fix: allocate a private IP on each Cloud SQL instance and pass --private-ip to the proxy. Hourly NAT receive bytes fell from 1.5-4 GB during work hours to a flat 0.3-0.4 GB.
What I noticed
The dominant networking SKU on one of our private GKE projects was Networking Cloud Nat Data Processing, billing about $600/month and spiking to $74 on a single day. I assumed most of that was image pulls and pushes. The cluster hosts our self-hosted GitHub Actions runners, so every CI build pushes a container image to Artifact Registry and pulls a base image too.
First attempt: moving Artifact Registry to the same region
The first hypothesis was: “the multi-region Artifact Registry sits outside europe-north1, so pulls and pushes cross regions and pay both inter-region egress and NAT data processing.”
I migrated all the image repos to a new project hosting a regional Artifact Registry in europe-north1 (europe-north1-docker.pkg.dev/<project>/...), updated every consumer workflow to push and pull from there, and watched.
The cross-region egress line did drop, which was worth doing on its own. The Cloud NAT data processing line did not. Still $5–30 per day on the baseline, still spiking when CI was busy.
That was the clue: NAT billed the same traffic regardless of which region the endpoint lived in.
Why the new Artifact Registry didn’t help
GCP bills Cloud NAT data processing per GB of traffic that traverses the NAT gateway, on both directions of every flow. An image pull of a 500 MB layer costs ~500 MB on the response leg even though the request itself is tiny. NAT doesn’t care that both ends of the flow are in europe-north1. The thing it cares about is whether the destination IP is one the VPC routes directly, or one it has to send through NAT.
By default, a pod on a private GKE cluster looks up europe-north1-docker.pkg.dev and gets back a public IP. The VPC’s only egress path to a public IP is via Cloud NAT. So the pull traffic, same region or not, still walks through the gateway and pays.
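For scale, with assumed numbers: a build that pulls a 500 MB base image and pushes a 1 GB image moves roughly 1.5 GB through the gateway, about $0.06 of data processing at $0.0385/GB. A couple of hundred CI builds a day lands squarely in the $5–30/day range we were seeing.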
“Private Google Access is in effect”
When I opened the subnet in the GCP console it said, in green:
Private Google Access is in effect (even though it has not been enabled manually) for packets sent from this subnet’s primary and secondary IP ranges because Cloud NAT is configured for those ranges.
The string reads like reassurance: PGA is on, you’re good. It does not mean what it sounds like. It means “you have a working path to Google APIs because Cloud NAT exists.” It does not say the traffic bypasses NAT. The private_ip_google_access flag on the subnet matters when there is no NAT at all: an air-gapped subnet that needs Google API reachability without an external path.
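For reference, the subnet flag the message refers to is a single setting in Terraform; a minimal sketch, with the subnet name and range as placeholders:
resource "google_compute_subnetwork" "gke_nodes" {
  project                  = var.gcp_project
  name                     = "gke-nodes"    # placeholder
  region                   = "europe-north1"
  network                  = google_compute_network.vpc.id
  ip_cidr_range            = "10.0.0.0/20"  # placeholder
  private_ip_google_access = true           # only decisive when the subnet has no NAT path at all
}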
Once I read it that way the bill made sense. We had connectivity to Google APIs, and the path went through NAT, and we paid for it.
The bypass
To keep traffic to Google APIs off Cloud NAT, I needed three pieces:
- A private Cloud DNS zone that resolves the relevant hostnames to the private VIPs.
- A VPC route that sends that VIP range via the internal Google backbone.
- A firewall rule allowing the VPC to reach the VIP range (defensive, usually open by default).
GCP publishes two VIP ranges for this, documented at Configure Private Google Access:
- private.googleapis.com at 199.36.153.8/30. Covers the full Google API surface.
- restricted.googleapis.com at 199.36.153.4/30. Covers a subset, but enforces VPC Service Controls boundaries.
I went with restricted because the list covers everything I use (Artifact Registry, Secret Manager, Cloud Storage, Cloud Logging, BigQuery, IAM), and the VPC-SC enforcement is a free safety upgrade. If you find yourself needing an API that isn’t on the restricted list, swap to private.
Terraform for the DNS and route pieces; the firewall rule is covered below:
locals {
private_googleapis_zones = {
googleapis = "googleapis.com."
pkg-dev = "pkg.dev."
gcr-io = "gcr.io."
}
}
resource "google_dns_managed_zone" "private_googleapis" {
for_each = local.private_googleapis_zones
project = var.gcp_project
name = "private-${each.key}"
dns_name = each.value
visibility = "private"
private_visibility_config {
networks {
network_url = google_compute_network.vpc.self_link
}
}
}
resource "google_dns_record_set" "private_googleapis_a" {
for_each = local.private_googleapis_zones
project = var.gcp_project
managed_zone = google_dns_managed_zone.private_googleapis[each.key].name
name = each.value
type = "A"
ttl = 300
rrdatas = ["199.36.153.4", "199.36.153.5", "199.36.153.6", "199.36.153.7"]
}
resource "google_dns_record_set" "private_googleapis_wildcard" {
for_each = local.private_googleapis_zones
project = var.gcp_project
managed_zone = google_dns_managed_zone.private_googleapis[each.key].name
name = "*.${each.value}"
type = "CNAME"
ttl = 300
rrdatas = [each.value]
}
resource "google_compute_route" "restricted_googleapis" {
project = var.gcp_project
name = "restricted-googleapis-vip"
network = google_compute_network.vpc.name
dest_range = "199.36.153.4/30"
next_hop_gateway = "default-internet-gateway"
priority = 1000
}
The wildcard CNAME is what catches everything. europe-north1-docker.pkg.dev, secretmanager.googleapis.com, storage.googleapis.com, all the regional and service-specific subdomains land on the private VIPs without me having to enumerate them.
The route is the surprising line. The next hop is default-internet-gateway even though we want this traffic to stay off the internet. GCP routes packets destined for the 199.36.153.0/24 VIP range over its own backbone regardless of what the next-hop says; you just need a route to that range for the VPC’s routing table to deliver packets there. The name is misleading but the behaviour is correct.
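The third piece from the list above, the firewall rule, usually needs nothing: the implied default egress rule already allows outbound traffic. If your VPC carries a deny-all egress rule, a sketch like this opens the path (the rule name and priority are assumptions):
resource "google_compute_firewall" "allow_restricted_googleapis_egress" {
  project            = var.gcp_project
  name               = "allow-restricted-googleapis-egress"  # assumed name
  network            = google_compute_network.vpc.name
  direction          = "EGRESS"
  priority           = 900  # assumed; must outrank any deny-all egress rule
  destination_ranges = ["199.36.153.4/30"]

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }
}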
Proving it works
After the terraform applied I exec’d into a pod on the cluster and asked DNS what those names resolve to now:
kubectl run dns-check --rm -i --restart=Never --image=alpine -- sh -c '
apk add --no-cache bind-tools curl >/dev/null
for host in europe-north1-docker.pkg.dev secretmanager.googleapis.com; do
printf "%-45s " "$host"
dig +short "$host" | head -1
done
echo
for host in europe-north1-docker.pkg.dev secretmanager.googleapis.com; do
printf "%-45s " "https://$host/"
curl -sI -o /dev/null -w "%{http_code} (resolved=%{remote_ip})\n" --max-time 5 "https://$host/"
done
'
Output:
europe-north1-docker.pkg.dev 199.36.153.5
secretmanager.googleapis.com 199.36.153.7
https://europe-north1-docker.pkg.dev/ 302 (resolved=199.36.153.5)
https://secretmanager.googleapis.com/ 404 (resolved=199.36.153.7)
Before the change, those hostnames resolved to public IPs in the 64.x/74.x/142.x ranges and the response traffic walked through Cloud NAT. Now they resolve to the restricted VIPs and stay on the backbone.
The 404 on Secret Manager’s root is expected. There’s no resource at /. What I’m checking is that the connection completes and returns a real HTTP response in about a second. If the route were wrong the curl would hang and time out.
Reversibility
Delete the four resources and traffic falls back to public DNS resolution and the existing NAT path. The NAT gateway never went anywhere. It’s still in the VPC ready to handle external destinations (docker.io upstream on pull-through cache misses, npm, pypi, GitHub, Sentry, Slack webhooks, anything not on *.googleapis.com/*.pkg.dev/*.gcr.io). All of that keeps working; only the Google API subset stops being expensive.
What I’d watch for
The Networking Cloud Nat Data Processing line on the billing export drops within a day or two of the change. The drop is proportional to how much of your NAT traffic was Google API egress in the first place. For a CI-heavy cluster pushing containers all day, that’s most of it. For a cluster whose workloads call out to a lot of third-party APIs, the savings will be smaller and the NAT line will keep the non-Google portion.
“Private Google Access is in effect” describes connectivity. The billing is a separate question, and I assumed for too long that the green checkmark answered both.
Second cluster, no drop
The bypass shaved 10% off the CI cluster bill. Flow logs showed the rest going to github.com and Microsoft IPs: runner control plane, action downloads, artifact uploads. github.com has no private VIP and no backbone bypass. Those bytes need workflow hygiene (shallow clones, action caches, pre-baked runner images); I left them for a separate sweep.
Days later I applied the same fix to our pre-prod cluster: a dozen backend services, queues, batch jobs, and a few Postgres databases behind Cloud SQL. Baseline NAT bill was ~$185/month against the CI cluster’s ~$600.
I shipped the same restricted.googleapis.com configuration, confirmed with dig from a pod that the API hostnames resolved to the restricted VIPs, and watched the bill.
The Networking Cloud Nat Data Processing byte counter did not move. The bypass worked. I could see Google API traffic on the backbone from the VPC routes and from inside a pod. Something else was driving the bill.
Finding where the bytes went
I’d guessed image pulls and Google API traffic dominated the Cloud NAT line. On the CI cluster the guess held. On the pre-prod cluster nothing moved, so I stopped guessing.
I enabled NAT flow logging at filter=ALL for a few hours. By default Cloud NAT only logs errors; full logging captures every connection. Ingestion costs a few dollars, trivial against the bill I was investigating.
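If the gateway is managed in Terraform, the toggle is a log_config block on the existing google_compute_router_nat resource; a minimal sketch, with the router and NAT names assumed:
resource "google_compute_router_nat" "nat" {
  project                            = var.gcp_project
  name                               = "nat-gateway"                      # assumed; matches the existing gateway
  router                             = google_compute_router.router.name  # assumed router resource
  region                             = "europe-north1"
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"

  log_config {
    enable = true
    filter = "ALL"  # ERRORS_ONLY, TRANSLATIONS_ONLY, or ALL
  }
}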
I ran one aggregation:
gcloud logging read 'logName="projects/<proj>/logs/compute.googleapis.com%2Fnat_flows"' \
--project=<proj> --limit=10000 \
--format='value(jsonPayload.connection.dest_ip,jsonPayload.connection.dest_port)' \
| awk '{print $1":"$2}' | sort | uniq -c | sort -rn | head
Output, truncated and anonymised to /24s:
7493 34.88.x.x:3307
1368 34.88.y.y:3307
1197 35.228.z.z:3307
1011 140.82.121.x:443
373 35.228.w.w:3307
...
Port 3307 is Cloud SQL. The top five destinations were Cloud SQL public IPs in europe-north1. The busiest, the Postgres metadata DB an Airflow scheduler hits on a one-second poll, accounted for half the NAT flows. GitHub control plane traffic from a handful of background jobs took the next bucket.
Why Cloud SQL Auth Proxy goes via public IP
A Cloud SQL instance has a public IP by default and an optional private IP. Cloud SQL Auth Proxy picks the public IP unless you pass --private-ip and the instance has one.
So the sidecar opens TCP to a public Cloud SQL IP, the VPC routes it through Cloud NAT, and Google charges $0.0385/GB on every response byte for the lifetime of the workload. For an Airflow scheduler polling its metadata DB on a one-second loop, that’s a steady stream of small queries against a flat per-GB rate.
The fix is two steps, both reversible:
- Allocate a private IP on each Cloud SQL instance. In Terraform that’s a private_network line in the ip_configuration block. This restarts the instance. A regional HA pair fails over for 30 to 60 seconds; a single-zone instance cold-restarts for a few minutes. The public IP stays put so existing consumers keep working.
- Add --private-ip to every cloud-sql-proxy invocation. The next time the proxy starts it picks the private endpoint.
After both land, the proxy reaches the instance over VPC peering and Cloud NAT never sees the bytes.
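For the Terraform side of the first step, a minimal sketch, assuming the VPC already has a private services access connection (google_service_networking_connection) in place; instance name, version, and tier are placeholders:
resource "google_sql_database_instance" "postgres" {
  project          = var.gcp_project
  name             = "preprod-postgres"  # placeholder
  region           = "europe-north1"
  database_version = "POSTGRES_15"       # placeholder

  settings {
    tier = "db-custom-2-7680"  # placeholder

    ip_configuration {
      ipv4_enabled    = true                           # keep the public IP so existing consumers keep working
      private_network = google_compute_network.vpc.id  # allocates the private IP over the peering
    }
  }
}
The second step is only the extra --private-ip flag on the proxy container’s args; no Terraform involved.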
Verifying it used the private IP
The proxy logs Listening on 127.0.0.1:5432 and Accepted connection either way. Check at the network layer instead. From inside the proxy container’s network namespace:
$ kubectl debug <pod> --image=nicolaka/netshoot --target=cloud-sql-proxy --profile=netadmin -- ss -ant
...
ESTAB 172.18.9.150:48774 10.99.24.20:3307
10.99.24.20 is the private IP I’d just allocated, and the proxy picked it.
NAT flow logs corroborated it. After the cutover, flows from the pod’s node to the instance’s public IP stopped. Hourly NAT receive bytes on the cluster fell from 1.5–4 GB during work hours to a flat 0.3–0.4 GB. The drop held overnight and into the next morning.
What I’d watch for now
The original Google API bypass still pays off on a CI cluster pushing containers all day. I shipped it to a second cluster by reflex without checking what that cluster’s bill was for. The Cloud SQL traffic sat in flow logs I hadn’t bothered to enable.
If I were starting again on a fresh cluster I’d do this in order:
- Enable NAT flow logging at filter=ALL for an hour or two during peak.
- Aggregate by dest_ip:dest_port. Most of the time the first ten lines carry the answer.
- Fix the biggest bucket.
For a CI cluster it’s usually image pulls and Google API egress. For a service cluster it’s Cloud SQL via public IP, or a payment provider’s API. Where there’s a private VIP I bypass; otherwise I live with it or redesign.
“Private Google Access is in effect” describes connectivity, not billing. cloud-sql-proxy picks the public IP by default. Next time I’ll check both before assuming the bill will drop.