I spent a couple of days migrating our monitoring stack from self-hosted kube-prometheus-stack (KPS) to GKE’s native Google Managed Prometheus (GMP). The end result is simpler, cheaper, and removes about 2 TiB of persistent storage we no longer need. But the migration had enough non-obvious gotchas that I wanted to write it all down.
Why migrate? Link to heading
kube-prometheus-stack bundles Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, and the Prometheus operator into one Helm chart. It works well, but on GKE you’re duplicating what the platform already provides:
- GKE managed collection handles metric scraping via collectors in `gmp-system`
- Cloud Monitoring stores metrics and supports PromQL natively in Metrics Explorer
- Managed kube-state-metrics and cAdvisor are available as GKE components — no need to self-host
- GKE managed alertmanager handles alert routing without running your own StatefulSet
With KPS, I was maintaining Prometheus replicas with 800 GiB PVCs each in production, plus Alertmanager, Grafana, and a bunch of ServiceMonitors. GMP replaces all of that with managed infrastructure.
The migration strategy Link to heading
I rolled this out environment by environment — staging first, then production — with a long soak period between. The approach:
- Deploy GMP alongside KPS (both running in parallel)
- Validate GMP alerts fire correctly
- Swap DNS so GMP gets the primary alertmanager hostname
- Scale KPS to zero
- Decommission KPS entirely
Running both in parallel is important. You’ll get duplicate alerts for a while, but that’s much better than missing alerts during the cutover.
Step 1: Enable GMP on the cluster Link to heading
In Terraform, enable managed collection and the built-in components:
resource "google_container_cluster" "cluster" {
  # ...
  monitoring_config {
    managed_prometheus {
      enabled = true
    }
    enable_components = [
      "SYSTEM_COMPONENTS",
      "STORAGE",
      "HPA",
      "POD",
      "DAEMONSET",
      "DEPLOYMENT",
      "STATEFULSET",
      "CADVISOR",
      "KUBELET",
    ]
  }
}
SYSTEM_COMPONENTS deploys kube-state-metrics and a few other collectors. The rest give you granular resource metrics — storage, HPA, pod/deployment/statefulset/daemonset status, cAdvisor container metrics, and kubelet stats. Once this is applied, you’ll see collector pods in the gmp-system namespace.
You’ll also need `roles/monitoring.metricWriter` on the GKE node service account so the collectors can write metrics to Cloud Monitoring.
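In Terraform that grant looks roughly like this — `var.project_id` and `google_service_account.gke_nodes` are placeholders for however you reference your project and node service account:

```terraform
# Hypothetical names — adjust to your own project variable and node SA resource.
resource "google_project_iam_member" "gmp_metric_writer" {
  project = var.project_id
  role    = "roles/monitoring.metricWriter"
  member  = "serviceAccount:${google_service_account.gke_nodes.email}"
}
```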
Step 2: Set up alerting rules with ClusterRules Link to heading
GMP uses its own CRDs instead of PrometheusRule. The main one for alerting is ClusterRules:
apiVersion: monitoring.googleapis.com/v1
kind: ClusterRules
metadata:
  name: kubernetes-apps
  namespace: gmp-public
spec:
  groups:
    - name: kubernetes-apps
      interval: 60s
      rules:
        - alert: KubePodCrashLooping
          expr: max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
I organised these using kustomize — a base/ directory with cluster-agnostic rules, and per-environment overlays for things like GCP project IDs in console URLs:
google-managed-prometheus/
├── base/
│ ├── kustomization.yaml
│ ├── cluster-rules-kubernetes-apps.yaml
│ ├── cluster-rules-node-exporter.yaml
│ └── ...
├── staging/
│ ├── kustomization.yaml
│ └── cluster-rules.yaml # environment overrides
└── production/
└── ...
One thing to note: GMP ClusterRules must go in the gmp-public namespace (or another namespace you’ve configured). The rule evaluator only watches specific namespaces.
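GMP also has a namespaced `Rules` CRD if you want rules evaluated only against metrics from a single namespace. A minimal sketch — the recording rule itself is illustrative, not from my setup:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: my-app-rules
  namespace: production # evaluated against this namespace's metrics only
spec:
  groups:
    - name: my-app
      interval: 60s
      rules:
        - record: job:up:sum
          expr: sum by (job) (up)
```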
Step 3: Set up PodMonitoring Link to heading
PodMonitoring is GMP’s replacement for ServiceMonitor. It tells the collectors which pods to scrape:
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: my-app
  namespace: production
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  endpoints:
    - port: metrics
      interval: 30s
The main difference from ServiceMonitor: PodMonitoring is namespace-scoped by default. If your app runs in production, the PodMonitoring goes in production. There’s also ClusterPodMonitoring for cross-namespace scraping.
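For completeness, here is a sketch of the cross-namespace variant — same shape, but cluster-scoped, so no `metadata.namespace`:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: ClusterPodMonitoring
metadata:
  name: my-app-all-namespaces # cluster-scoped: matches pods in any namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  endpoints:
    - port: metrics
      interval: 30s
```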
Step 4: Configure the managed alertmanager Link to heading
This is where the biggest gotcha lives.
GMP ships a managed alertmanager in gmp-system. You configure it by creating a secret named alertmanager in the gmp-public namespace:
kubectl create secret generic alertmanager \
  --namespace gmp-public \
  --from-file=config.yaml=alertmanager-config.yaml
The config format is standard alertmanager YAML. But there’s a critical limitation: the managed alertmanager does not support external template files.
Why external templates don’t work Link to heading
The managed alertmanager uses a config-reloader sidecar that copies config.yaml from the mounted secret to a shared emptyDir volume. The alertmanager container reads from that volume. But the config-reloader only copies the file named config.yaml — any other files in the secret (like slack.tmpl) are not copied to the shared volume and are invisible to the alertmanager container.
This means you can’t do:
# This won't work with GMP managed alertmanager
templates:
- '/etc/alertmanager/config/*.tmpl'
The fix: inline everything Link to heading
You have to inline all your Go templates directly in the alertmanager config YAML. Use YAML literal block scalars (|) and Go template whitespace trimming ({{- / -}}) to keep things manageable:
receivers:
  - name: slack-alerts
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        color: |
          {{- if eq .Status "firing" -}}
          {{- if eq (index .Alerts 0).Labels.severity "critical" -}}
          danger
          {{- else if eq (index .Alerts 0).Labels.severity "error" -}}
          danger
          {{- else if eq (index .Alerts 0).Labels.severity "warning" -}}
          warning
          {{- else -}}
          #439FE0
          {{- end -}}
          {{- else -}}
          good
          {{- end -}}
        title: |
          [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts | len }}{{ end }}] {{ (index .Alerts 0).Labels.alertname }}
        text: |
          {{- range .Alerts }}
          *Env:* {{ .Labels.namespace }}
          {{- if .Labels.pod }} | *Pod:* {{ .Labels.pod }}{{ end }}
          {{- if .Annotations.summary }}
          *Summary:* {{ .Annotations.summary }}
          {{- end }}
          {{- if .Annotations.description }}
          *Description:* {{ .Annotations.description }}
          {{- end }}
          {{ end }}
This is verbose but it works. The whitespace trimming ({{- and -}}) is essential — without it you get blank lines everywhere in your Slack messages.
If you have complex templates (silence links with label deduplication, for instance), they all need to be inlined in the url: or text: fields directly. It’s ugly but there’s no alternative with the managed alertmanager today.
Silence link gotcha Link to heading
If your templates generate Alertmanager silence links, you’ll need to inline the full label deduplication logic. A silence URL needs matchers for each unique label, and the Go template for that involves range, variable assignment, and urlquery — all of which must be inlined in a single YAML field. I found it easier to keep the original template logic and just move it inline rather than trying to simplify it.
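As a rough sketch of what that looks like inlined — this adapts the widely-circulated kube-prometheus-style silence template into a Slack action button. The `#/silences/new?filter=` path and the percent-encoding of the matcher braces are assumptions to validate against your Alertmanager UI version:

```yaml
slack_configs:
  - channel: '#alerts'
    actions:
      - type: button
        text: 'Silence'
        url: |
          {{- .ExternalURL }}/#/silences/new?filter=%7B
          {{- range .CommonLabels.SortedPairs -}}
            {{- if ne .Name "alertname" -}}
              {{- .Name }}%3D%22{{ .Value | urlquery }}%22%2C%20
            {{- end -}}
          {{- end -}}
          alertname%3D%22{{ .CommonLabels.alertname | urlquery }}%22%7D
```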
Step 5: OperatorConfig Link to heading
The OperatorConfig CRD controls how GMP behaves cluster-wide. The key setting is telling the rule evaluator to use the managed alertmanager:
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  name: config
  namespace: gmp-public
# Note: OperatorConfig has no spec — its fields sit at the top level.
collection:
  filter:
    matchOneOf: []
managedAlertmanager:
  configSecret:
    name: alertmanager
    key: config.yaml
Apply this manually — it’s a one-time operation:
kubectl apply -f operator-config.yaml
Note: earlier GMP versions required you to explicitly configure the rule evaluator’s alertmanager target. Current versions automatically route to the managed alertmanager when ClusterRules exist, so you can omit the rules.alerting.alertmanagers section.
Step 6: Node exporter Link to heading
The enable_components list covers workload and cluster-level metrics (pods, deployments, cAdvisor, kubelet), but it doesn’t include host-level metrics — things like CPU steal, memory pressure, disk I/O, and network stats at the node level. Those come from node-exporter, which GMP doesn’t manage for you.
So you need to self-deploy it. Google provides official guidance for this. I deployed a DaemonSet in gmp-public alongside the ClusterRules and OperatorConfig — it makes sense to keep GMP-related infrastructure together rather than scattering it into monitoring.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: gmp-public
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter # must match the selector above
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.8.2
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --path.rootfs=/host/root
          ports:
            - containerPort: 9100
              name: metrics
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /host/root
              readOnly: true
      volumes:
        - name: proc
          hostPath: { path: /proc }
        - name: sys
          hostPath: { path: /sys }
        - name: root
          hostPath: { path: / }
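The DaemonSet alone doesn’t get scraped — it needs its own PodMonitoring. A minimal sketch, assuming the `app: node-exporter` selector label and port 9100 from above:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: node-exporter
  namespace: gmp-public
spec:
  selector:
    matchLabels:
      app: node-exporter
  endpoints:
    - port: 9100 # the node-exporter listen port
      interval: 30s
```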
Step 7: Querying metrics Link to heading
With GMP, you don’t need a self-hosted Prometheus frontend or Grafana. Cloud Monitoring’s Metrics Explorer supports PromQL natively:
https://console.cloud.google.com/monitoring/metrics-explorer?project=YOUR_PROJECT
Switch the query language to PromQL and your existing queries work as-is. For dashboards, use Cloud Monitoring dashboards — they support PromQL-based widgets and even SQL-based analytics queries for more complex visualisations.
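For example, a query I’d use to sanity-check that node-exporter data is flowing — standard PromQL, assuming the default node-exporter metric names:

```promql
# Per-node CPU usage rate over 5 minutes, excluding idle time
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
```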
This was the easiest win. No more maintaining Grafana, managing its PVCs, or keeping dashboard JSON in sync.
Step 8: The DNS hostname swap Link to heading
I wanted zero downtime on the alertmanager UI. The approach:
- Create a temporary DNS record for KPS alertmanager (`kps-alertmanager.example.com`)
- Point the primary hostname (`alertmanager.example.com`) at the GMP alertmanager
- Update any internal services (like incident management tools) to use the in-cluster address: `alertmanager.gmp-system.svc.cluster.local:9093`
- After validation, remove the temporary KPS DNS record
This way, anyone bookmarking alertmanager.example.com seamlessly switches to the GMP version.
Step 9: Decommissioning kube-prometheus-stack Link to heading
This is where the order matters. Don’t just helm uninstall and walk away.
9a. Scale down first Link to heading
Disable components in your KPS values and let ArgoCD sync:
# values.yaml
prometheus:
  prometheusSpec:
    replicas: 0
alertmanager:
  alertmanagerSpec:
    replicas: 0
nodeExporter:
  enabled: false
kubeStateMetrics:
  enabled: false
prometheusOperator:
  enabled: false
grafana:
  enabled: false
This scales everything to zero while keeping the Helm release intact. Monitor for a few days to make sure GMP is handling everything.
9b. Enable ArgoCD pruning Link to heading
If you manage KPS through ArgoCD with prune: false (common for stateful workloads), you’ll need to enable pruning before the Helm uninstall. Otherwise, orphaned resources will stick around. If you run into sync issues along the way, I wrote about troubleshooting stuck ArgoCD syncs.
# ArgoCD Application
spec:
  syncPolicy:
    automated:
      prune: true # Enable before uninstall
9c. Uninstall the Helm release Link to heading
helm uninstall kube-prometheus-stack -n monitoring
9d. Clean up orphaned CRDs Link to heading
KPS leaves behind PrometheusRule and ServiceMonitor resources even after uninstall. Check for orphans:
kubectl get prometheusrules -n monitoring
kubectl get servicemonitors -n monitoring
Delete anything that’s no longer needed. The prometheus-operator CRDs themselves can stay if you have other uses for them.
9e. Snapshot and delete PVCs Link to heading
Before deleting Prometheus PVCs, consider snapshotting them. On GCP, a disk snapshot costs ~€0.024/GB/month versus ~€0.16/GB/month for the live disk — and snapshots are billed on used, compressed bytes rather than provisioned size, so the real bill is usually well below that. For an 800 GiB Prometheus PVC, that’s at most ~€19/month instead of ~€128/month — and you still have the data if you need to investigate historical metrics.
# Snapshot the disk backing the PVC
gcloud compute disks snapshot DISK_NAME \
  --snapshot-names=prometheus-final-snapshot \
  --zone=ZONE \
  --project=PROJECT

# Then delete the PVC
kubectl delete pvc prometheus-data-0 -n monitoring
For Alertmanager PVCs, I didn’t bother with snapshots — the data is transient silence/notification state that’s not worth preserving.
9f. Clean up DNS and routes Link to heading
Remove any DNS records, ingress resources, or HTTPRoutes that pointed at the old KPS services. Don’t forget WAF rules if you had hostname-specific firewall exceptions.
What I’d do differently Link to heading
Start with the alertmanager template inlining. I initially tried to use external .tmpl files and spent time debugging why templates weren’t found. If I’d known about the config-reloader limitation upfront, I’d have gone straight to inlining.
Don’t try to disable the managed alertmanager. If you deploy ClusterRules, GMP’s rule evaluator automatically starts the managed alertmanager. I spent time trying to prevent it from running while I was still using a self-deployed one. Just embrace it — configure it properly and let it take over.
Put GMP-adjacent resources in gmp-public. Node-exporter and the alertmanager oauth2-proxy belong alongside the ClusterRules and OperatorConfig — it keeps all GMP infrastructure in one place. The catch: gmp-public is managed by the GKE addon manager in Reconcile mode, so any labels you add to the namespace (like gateway-access=istio for an Istio Gateway) may be wiped on cluster upgrades. The fix is to declare the namespace in your GitOps repo with the labels you want — ArgoCD will reapply them after any upgrade strips them.
Snapshot Prometheus PVCs before deleting. This seems obvious in hindsight, but it’s easy to forget when you’re in cleanup mode. The snapshot costs almost nothing and gives you a safety net for historical data.
Cost savings Link to heading
For our setup, the migration eliminated:
| Resource | Count | Size | Monthly cost |
|---|---|---|---|
| Prometheus PVCs (production) | 2 | 800 GiB each | ~€256 |
| Prometheus PVCs (staging) | 2 | 250 GiB each | ~€80 |
| Alertmanager PVCs | 4 | 10 GiB each | ~€6 |
| GMP alertmanager PVC | 1 | 1 GiB | ~€0.16 |
| Grafana PVC | 1 | 10 GiB | ~€1.60 |
| Total saved | | 2,151 GiB | ~€344/month |
| Snapshot (retained) | 1 | 359 GiB | −€9/month |
| Net savings | | | ~€335/month |
Plus you’re no longer running Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics pods — that’s compute savings on top. If you’re also looking to optimise the underlying node costs, I covered GKE ComputeClass cost optimisation separately.
The full checklist Link to heading
Here’s what the complete migration looks like, condensed into a checklist:
Terraform / infrastructure:
- Enable `managed_prometheus.enabled = true` on the cluster
- Enable `kube-state-metrics`, `cadvisor`, `kubelet` components
- Grant `monitoring.metricWriter` to the GKE node service account
- Create temporary DNS for old alertmanager

Kubernetes resources:
- Create `ClusterRules` for alerting (base + per-environment overrides)
- Create `PodMonitoring` for application and infrastructure services
- Deploy node-exporter DaemonSet + PodMonitoring
- Create alertmanager secret with inlined templates
- Apply `OperatorConfig`
- Set up HTTPRoute/Ingress for alertmanager UI
Validation:
- GMP collectors running in `gmp-system`
- Alerts firing correctly to Slack/PagerDuty
- Cloud Monitoring Metrics Explorer works with PromQL
- Alertmanager UI accessible
Cutover:
- Swap DNS — GMP gets the primary alertmanager hostname
- Update internal services to use in-cluster GMP alertmanager address
- Scale KPS to zero
- Soak for a few days
Decommission:
- Enable ArgoCD pruning on KPS application
- Uninstall KPS Helm release
- Delete orphaned CRDs (PrometheusRule, ServiceMonitor)
- Snapshot Prometheus PVCs
- Delete all monitoring PVCs
- Remove old DNS records, ingress/HTTPRoutes, WAF rules
- Clean up ArgoCD application manifests and values files
Further reading Link to heading
- Google Managed Prometheus documentation — official GMP docs
- ClusterRules reference — GMP alerting CRDs
- PodMonitoring reference — metric scraping configuration
- Alertmanager configuration — managed alertmanager setup
- Cloud Monitoring PromQL — querying with PromQL in Metrics Explorer