I spent a couple of days migrating our monitoring stack from self-hosted kube-prometheus-stack (KPS) to GKE’s native Google Managed Prometheus (GMP). The end result is simpler, cheaper, and removes about 2 TiB of persistent storage we no longer need. But the migration had enough non-obvious gotchas that I wanted to write it all down.
Why migrate? Link to heading
kube-prometheus-stack bundles Prometheus, Alertmanager, Grafana, node-exporter, kube-state-metrics, and the Prometheus operator into one Helm chart. It works well, but on GKE you’re duplicating what the platform already provides:
- GKE managed collection handles metric scraping via collectors in `gmp-system`
- Cloud Monitoring stores metrics and supports PromQL natively in Metrics Explorer
- Managed kube-state-metrics and cAdvisor are available as GKE components — no need to self-host
- GKE managed alertmanager handles alert routing without running your own StatefulSet
With KPS, I was maintaining Prometheus replicas with 800 GiB PVCs each in production, plus Alertmanager, Grafana, and a bunch of ServiceMonitors. GMP replaces all of that with managed infrastructure.
The migration strategy Link to heading
I rolled this out environment by environment — staging first, then production — with a long soak period between. The approach:
- Deploy GMP alongside KPS (both running in parallel)
- Validate GMP alerts fire correctly
- Swap DNS so GMP gets the primary alertmanager hostname
- Scale KPS to zero
- Decommission KPS entirely
Running both in parallel is important. You’ll get duplicate alerts for a while, but that’s much better than missing alerts during the cutover.
Step 1: Enable GMP on the cluster Link to heading
In Terraform, enable managed collection and the built-in components:
resource "google_container_cluster" "cluster" {
  # ...
  monitoring_config {
    managed_prometheus {
      enabled = true
    }
    enable_components = [
      "SYSTEM_COMPONENTS",
      "STORAGE",
      "HPA",
      "POD",
      "DAEMONSET",
      "DEPLOYMENT",
      "STATEFULSET",
      "CADVISOR",
      "KUBELET",
    ]
  }
}
SYSTEM_COMPONENTS deploys kube-state-metrics and a few other collectors. The rest give you granular resource metrics — storage, HPA, pod/deployment/statefulset/daemonset status, cAdvisor container metrics, and kubelet stats. Once this is applied, you’ll see collector pods in the gmp-system namespace.
You’ll also need `roles/monitoring.metricWriter` on the GKE node service account so the collectors can write metrics to Cloud Monitoring.
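In Terraform that grant looks roughly like this — `var.project_id` and `google_service_account.gke_nodes` are placeholders for however you reference your project and node service account:

```terraform
# Hypothetical names — adjust to your own project variable and node SA resource.
resource "google_project_iam_member" "gmp_metric_writer" {
  project = var.project_id
  role    = "roles/monitoring.metricWriter"
  member  = "serviceAccount:${google_service_account.gke_nodes.email}"
}
```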
Step 2: Set up alerting rules with ClusterRules Link to heading
GMP uses its own CRDs instead of PrometheusRule. The main one for alerting is ClusterRules:
apiVersion: monitoring.googleapis.com/v1
kind: ClusterRules
metadata:
  name: kubernetes-apps
  namespace: gmp-public
spec:
  groups:
    - name: kubernetes-apps
      interval: 60s
      rules:
        - alert: KubePodCrashLooping
          expr: max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
I organised these using kustomize — a base/ directory with cluster-agnostic rules, and per-environment overlays for things like GCP project IDs in console URLs:
google-managed-prometheus/
├── base/
│ ├── kustomization.yaml
│ ├── cluster-rules-kubernetes-apps.yaml
│ ├── cluster-rules-node-exporter.yaml
│ └── ...
├── staging/
│ ├── kustomization.yaml
│ └── cluster-rules.yaml # environment overrides
└── production/
└── ...
One thing to note: GMP ClusterRules must go in the gmp-public namespace (or another namespace you’ve configured). The rule evaluator only watches specific namespaces.
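GMP also has a namespaced `Rules` CRD if you want rules evaluated only against metrics from a single namespace. A minimal sketch — the recording rule itself is illustrative, not from my setup:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: my-app-rules
  namespace: production # evaluated against this namespace's metrics only
spec:
  groups:
    - name: my-app
      interval: 60s
      rules:
        - record: job:up:sum
          expr: sum by (job) (up)
```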
Step 3: Set up PodMonitoring Link to heading
PodMonitoring is GMP’s replacement for ServiceMonitor. It tells the collectors which pods to scrape:
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: my-app
  namespace: production
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  endpoints:
    - port: metrics
      interval: 30s
The main difference from ServiceMonitor: PodMonitoring is namespace-scoped by default. If your app runs in production, the PodMonitoring goes in production. There’s also ClusterPodMonitoring for cross-namespace scraping.
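For completeness, here is a sketch of the cross-namespace variant — same shape, but cluster-scoped, so no `metadata.namespace`:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: ClusterPodMonitoring
metadata:
  name: my-app-all-namespaces # cluster-scoped: matches pods in any namespace
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  endpoints:
    - port: metrics
      interval: 30s
```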
Step 4: Configure the managed alertmanager Link to heading
This is where the biggest gotcha lives.
GMP ships a managed alertmanager in gmp-system. You configure it by creating a secret named alertmanager in the gmp-public namespace:
kubectl create secret generic alertmanager \
  --namespace gmp-public \
  --from-file=config.yaml=alertmanager-config.yaml
The config format is standard alertmanager YAML. But there’s a critical limitation: the managed alertmanager does not support external template files.
Why external templates don’t work Link to heading
The managed alertmanager uses a config-reloader sidecar that copies config.yaml from the mounted secret to a shared emptyDir volume. The alertmanager container reads from that volume. But the config-reloader only copies the file named config.yaml — any other files in the secret (like slack.tmpl) are not copied to the shared volume and are invisible to the alertmanager container.
This means you can’t do:
# This won't work with GMP managed alertmanager
templates:
- '/etc/alertmanager/config/*.tmpl'
The fix: inline everything Link to heading
You have to inline all your Go templates directly in the alertmanager config YAML. Use YAML literal block scalars (|) and Go template whitespace trimming ({{- / -}}) to keep things manageable:
receivers:
  - name: slack-alerts
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        color: |
          {{- if eq .Status "firing" -}}
          {{- if eq (index .Alerts 0).Labels.severity "critical" -}}
          danger
          {{- else if eq (index .Alerts 0).Labels.severity "error" -}}
          danger
          {{- else if eq (index .Alerts 0).Labels.severity "warning" -}}
          warning
          {{- else -}}
          #439FE0
          {{- end -}}
          {{- else -}}
          good
          {{- end -}}
        title: |
          [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts | len }}{{ end }}] {{ (index .Alerts 0).Labels.alertname }}
        text: |
          {{- range .Alerts }}
          *Env:* {{ .Labels.namespace }}
          {{- if .Labels.pod }} | *Pod:* {{ .Labels.pod }}{{ end }}
          {{- if .Annotations.summary }}
          *Summary:* {{ .Annotations.summary }}
          {{- end }}
          {{- if .Annotations.description }}
          *Description:* {{ .Annotations.description }}
          {{- end }}
          {{ end }}
This is verbose but it works. The whitespace trimming ({{- and -}}) is essential — without it you get blank lines everywhere in your Slack messages.
If you have complex templates (silence links with label deduplication, for instance), they all need to be inlined in the url: or text: fields directly. It’s ugly but there’s no alternative with the managed alertmanager today.
Silence link gotcha Link to heading
If your templates generate Alertmanager silence links, you’ll need to inline the full label deduplication logic. A silence URL needs matchers for each unique label, and the Go template for that involves range, variable assignment, and urlquery — all of which must be inlined in a single YAML field. I found it easier to keep the original template logic and just move it inline rather than trying to simplify it.
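As a rough sketch of what that looks like inlined — this adapts the widely-circulated kube-prometheus-style silence template into a Slack action button. The `#/silences/new?filter=` path and the percent-encoding of the matcher braces are assumptions to validate against your Alertmanager UI version:

```yaml
slack_configs:
  - channel: '#alerts'
    actions:
      - type: button
        text: 'Silence'
        url: |
          {{- .ExternalURL }}/#/silences/new?filter=%7B
          {{- range .CommonLabels.SortedPairs -}}
            {{- if ne .Name "alertname" -}}
              {{- .Name }}%3D%22{{ .Value | urlquery }}%22%2C%20
            {{- end -}}
          {{- end -}}
          alertname%3D%22{{ .CommonLabels.alertname | urlquery }}%22%7D
```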
Step 5: OperatorConfig Link to heading
The OperatorConfig CRD controls how GMP behaves cluster-wide. The key setting is telling the rule evaluator to use the managed alertmanager:
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  name: config
  namespace: gmp-public
# Note: OperatorConfig has no spec — its fields sit at the top level.
collection:
  filter:
    matchOneOf: []
managedAlertmanager:
  configSecret:
    name: alertmanager
    key: config.yaml
Apply this manually — it’s a one-time operation:
kubectl apply -f operator-config.yaml
Note: earlier GMP versions required you to explicitly configure the rule evaluator’s alertmanager target. Current versions automatically route to the managed alertmanager when ClusterRules exist, so you can omit the rules.alerting.alertmanagers section.
Step 6: Node exporter Link to heading
The enable_components list covers workload and cluster-level metrics (pods, deployments, cAdvisor, kubelet), but it doesn’t include host-level metrics — things like CPU steal, memory pressure, disk I/O, and network stats at the node level. Those come from node-exporter, which GMP doesn’t manage for you.
So you need to self-deploy it. Google provides official guidance for this. I deployed a DaemonSet in gmp-public alongside the ClusterRules and OperatorConfig — it makes sense to keep GMP-related infrastructure together rather than scattering it into monitoring.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: gmp-public
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter # must match the selector above
    spec:
      hostPID: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v1.8.2
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --path.rootfs=/host/root
          ports:
            - containerPort: 9100
              name: metrics
          volumeMounts:
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: root
              mountPath: /host/root
              readOnly: true
      volumes:
        - name: proc
          hostPath: { path: /proc }
        - name: sys
          hostPath: { path: /sys }
        - name: root
          hostPath: { path: / }
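The DaemonSet alone doesn’t get scraped — it needs its own PodMonitoring. A minimal sketch, assuming the `app: node-exporter` selector label and port 9100 from above:

```yaml
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: node-exporter
  namespace: gmp-public
spec:
  selector:
    matchLabels:
      app: node-exporter
  endpoints:
    - port: 9100 # the node-exporter listen port
      interval: 30s
```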
Step 7: Querying metrics Link to heading
With GMP, you don’t need a self-hosted Prometheus frontend or Grafana. Cloud Monitoring’s Metrics Explorer supports PromQL natively:
https://console.cloud.google.com/monitoring/metrics-explorer?project=YOUR_PROJECT
Switch the query language to PromQL and your existing queries work as-is. For dashboards, use Cloud Monitoring dashboards — they support PromQL-based widgets and even SQL-based analytics queries for more complex visualisations.
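For example, a query I’d use to sanity-check that node-exporter data is flowing — standard PromQL, assuming the default node-exporter metric names:

```promql
# Per-node CPU usage rate over 5 minutes, excluding idle time
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
```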
This was the easiest win. No more maintaining Grafana, managing its PVCs, or keeping dashboard JSON in sync.
Step 8: The DNS hostname swap Link to heading
I wanted zero downtime on the alertmanager UI. The approach:
- Create a temporary DNS record for KPS alertmanager (`kps-alertmanager.example.com`)
- Point the primary hostname (`alertmanager.example.com`) at the GMP alertmanager
- Update any internal services (like incident management tools) to use the in-cluster address: `alertmanager.gmp-system.svc.cluster.local:9093`
- After validation, remove the temporary KPS DNS record
This way, anyone bookmarking alertmanager.example.com seamlessly switches to the GMP version.
Step 9: Decommissioning kube-prometheus-stack Link to heading
This is where the order matters. Don’t just helm uninstall and walk away.
9a. Scale down first Link to heading
Disable components in your KPS values and let ArgoCD sync:
# values.yaml
prometheus:
  prometheusSpec:
    replicas: 0
alertmanager:
  alertmanagerSpec:
    replicas: 0
nodeExporter:
  enabled: false
kubeStateMetrics:
  enabled: false
prometheusOperator:
  enabled: false
grafana:
  enabled: false
This scales everything to zero while keeping the Helm release intact. Monitor for a few days to make sure GMP is handling everything.
9b. Enable ArgoCD pruning Link to heading
If you manage KPS through ArgoCD with prune: false (common for stateful workloads), you’ll need to enable pruning before the Helm uninstall. Otherwise, orphaned resources will stick around. If you run into sync issues along the way, I wrote about troubleshooting stuck ArgoCD syncs.
# ArgoCD Application
spec:
  syncPolicy:
    automated:
      prune: true # Enable before uninstall
9c. Uninstall the Helm release Link to heading
helm uninstall kube-prometheus-stack -n monitoring
9d. Clean up orphaned CRDs Link to heading
KPS leaves behind PrometheusRule and ServiceMonitor resources even after uninstall. Check for orphans:
kubectl get prometheusrules -n monitoring
kubectl get servicemonitors -n monitoring
Delete anything that’s no longer needed. The prometheus-operator CRDs themselves can stay if you have other uses for them.
9e. Snapshot and delete PVCs Link to heading
Before deleting Prometheus PVCs, consider snapshotting them. On GCP, a disk snapshot costs ~€0.024/GB/month versus ~€0.16/GB/month for the live disk — and snapshots are billed on used, compressed bytes rather than provisioned size, so the real bill is usually well below that. For an 800 GiB Prometheus PVC, that’s at most ~€19/month instead of ~€128/month — and you still have the data if you need to investigate historical metrics.
# Snapshot the disk backing the PVC
gcloud compute disks snapshot DISK_NAME \
  --snapshot-names=prometheus-final-snapshot \
  --zone=ZONE \
  --project=PROJECT

# Then delete the PVC
kubectl delete pvc prometheus-data-0 -n monitoring
For Alertmanager PVCs, I didn’t bother with snapshots — the data is transient silence/notification state that’s not worth preserving.
9f. Clean up DNS and routes Link to heading
Remove any DNS records, ingress resources, or HTTPRoutes that pointed at the old KPS services. Don’t forget WAF rules if you had hostname-specific firewall exceptions.
What I’d do differently Link to heading
Start with the alertmanager template inlining. I initially tried to use external .tmpl files and spent time debugging why templates weren’t found. If I’d known about the config-reloader limitation upfront, I’d have gone straight to inlining.
Don’t try to disable the managed alertmanager. If you deploy ClusterRules, GMP’s rule evaluator automatically starts the managed alertmanager. I spent time trying to prevent it from running while I was still using a self-deployed one. Just embrace it — configure it properly and let it take over.
Put GMP-adjacent resources in gmp-public. Node-exporter and the alertmanager oauth2-proxy belong alongside the ClusterRules and OperatorConfig — it keeps all GMP infrastructure in one place. The catch: gmp-public is managed by the GKE addon manager in Reconcile mode, so any labels you add to the namespace (like gateway-access=istio for an Istio Gateway) may be wiped on cluster upgrades. The fix is to declare the namespace in your GitOps repo with the labels you want — ArgoCD will reapply them after any upgrade strips them.
Snapshot Prometheus PVCs before deleting. This seems obvious in hindsight, but it’s easy to forget when you’re in cleanup mode. The snapshot costs almost nothing and gives you a safety net for historical data.
Cost savings Link to heading
For our setup, the migration eliminated:
| Resource | Count | Size | Monthly cost |
|---|---|---|---|
| Prometheus PVCs (production) | 2 | 800 GiB each | ~€256 |
| Prometheus PVCs (staging) | 2 | 250 GiB each | ~€80 |
| Alertmanager PVCs | 4 | 10 GiB each | ~€6 |
| GMP alertmanager PVC | 1 | 1 GiB | ~€0.16 |
| Grafana PVC | 1 | 10 GiB | ~€1.60 |
| Total saved | | 2,151 GiB | ~€344/month |
| Snapshot (retained) | 1 | 359 GiB | −€9/month |
| Net savings | | | ~€335/month |
Plus you’re no longer running Prometheus, Alertmanager, Grafana, node-exporter, and kube-state-metrics pods — that’s compute savings on top. If you’re also looking to optimise the underlying node costs, I covered GKE ComputeClass cost optimisation separately.
The full checklist Link to heading
Here’s what the complete migration looks like, condensed into a checklist:
Terraform / infrastructure:
- Enable `managed_prometheus.enabled = true` on the cluster
- Enable `kube-state-metrics`, `cadvisor`, `kubelet` components
- Grant `monitoring.metricWriter` to the GKE node service account
- Create temporary DNS for old alertmanager

Kubernetes resources:
- Create `ClusterRules` for alerting (base + per-environment overrides)
- Create `PodMonitoring` for application and infrastructure services
- Deploy node-exporter DaemonSet + PodMonitoring
- Create alertmanager secret with inlined templates
- Apply `OperatorConfig`
- Set up HTTPRoute/Ingress for alertmanager UI
Validation:
- GMP collectors running in `gmp-system`
- Alerts firing correctly to Slack/PagerDuty
- Cloud Monitoring Metrics Explorer works with PromQL
- Alertmanager UI accessible
Cutover:
- Swap DNS — GMP gets the primary alertmanager hostname
- Update internal services to use in-cluster GMP alertmanager address
- Scale KPS to zero
- Soak for a few days
Decommission:
- Enable ArgoCD pruning on KPS application
- Uninstall KPS Helm release
- Delete orphaned CRDs (PrometheusRule, ServiceMonitor)
- Snapshot Prometheus PVCs
- Delete all monitoring PVCs
- Remove old DNS records, ingress/HTTPRoutes, WAF rules
- Clean up ArgoCD application manifests and values files
Further reading Link to heading
- Google Managed Prometheus documentation — official GMP docs
- ClusterRules reference — GMP alerting CRDs
- PodMonitoring reference — metric scraping configuration
- Alertmanager configuration — managed alertmanager setup
- Cloud Monitoring PromQL — querying with PromQL in Metrics Explorer