The Postgres ownership migration that looked done but wasn't

TL;DR Link to heading

I moved object ownership in Postgres from a shared role to a dedicated role per service, one environment at a time, with the application serving throughout. REASSIGN OWNED does it in one statement and moves every object class. A FOR ALL TABLES publication blocked it on one database, so I went per-object and left functions, types, and schemas behind. A pg_shdepend sweep caught the gap before production, and it is my completeness check now.

Motivation Link to heading

One shared role owned most objects across several services’ databases. I wanted each service to own its objects under a dedicated role, so its migration tooling could run with least privilege instead of borrowing the shared role. The plan read simple: reassign ownership, flip the config that records the owner, repeat per environment.

One environment at a time, lower first Link to heading

Each environment is a separate database with its own copy of the schema, so I ran the migration in lower environments before production. A mistake in a lower environment stays there. The lock waits, the permission-denied blip, the object classes I forgot: each surfaced in a lower environment first and let me fix the procedure before production ran it.

The order inside an environment stayed the same each time: reassign ownership with live SQL, verify it, then flip the config that records the owner. Running the SQL first means the config change lands as a no-op confirmation rather than the step that moves data. A surprise then shows up in the SQL, where I am watching, instead of in an automated apply.

REASSIGN OWNED moves everything Link to heading

The one-liner does the whole job for a normal role:

REASSIGN OWNED BY app TO svc_admin;

This reassigns every object app owns in the current database: tables, sequences, views, schemas, functions, types. It is database-scoped, so I connected to each database and ran it once. Every service but one needed nothing more, and every object moved.

The publication that blocked it Link to heading

One service’s database had a logical-replication publication declared FOR ALL TABLES, feeding a change-data-capture pipeline. REASSIGN OWNED tries to move the publication along with everything else, and a non-superuser cannot own a FOR ALL TABLES publication, so the statement aborts:

ERROR:  permission denied to change owner of publication "..."

REASSIGN runs atomically, so a failure here moves nothing. I left the publication on its existing owner (the change-data-capture consumer reattaches regardless of who owns it) and reassigned the rest object by object.

Per-object, on live tables Link to heading

I generated ALTER statements for the object kinds I thought of:

SELECT format('ALTER TABLE %I.%I OWNER TO %I;', n.nspname, c.relname, 'svc_admin')
FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE pg_get_userbyid(c.relowner) = 'app'
  AND c.relkind IN ('r','p','S','v','m')
  AND n.nspname NOT IN ('pg_catalog','information_schema','pg_toast');

The database was serving traffic, so the reassign had to stay non-blocking. ALTER ... OWNER rewrites one catalog row (pg_class.relowner) and touches no table data, so it holds its lock only for that update, not for a table scan or rewrite. The lock is still the risk. ALTER ... OWNER takes an AccessExclusiveLock. A pending request for that lock waits behind the running query, and every query arriving afterward queues behind the request, so a few seconds on a busy table stalls traffic. A short lock_timeout caps that wait, so an ALTER that cannot take the lock in time gives up instead of forming the queue. I looped: run the batch, regenerate the still-unmoved list, run it again, until nothing remained. A quiet table flips in milliseconds. The busiest table hit the timeout and moved on the next pass. No statement held a lock long enough to count as downtime.

For about ten seconds the busiest table threw permission denied at a high-QPS reader, then cleared. The grants never changed. Postgres invalidates its catalog caches when an owner changes, and a query mid-flight can recheck permissions against a half-updated catalog row before it settles. I thought I had broken access; the table was healthy the instant the ALTER committed. Seeing it first in a lower environment meant production did not surprise me.

The table check passed afterwards, so I called that service done.

pg_shdepend finds what the table check misses Link to heading

I ran a wider check out of caution, with nothing pointing at a problem. The table count said the migration was done, but the pg_shdepend sweep found objects it had missed. It is a shared catalog recording every dependency on a role across the cluster, so one query per instance lists everything a role still owns, in any database, of any object class:

SELECT d.datname AS db, cl.relname AS catalog, s.deptype, count(*)
FROM pg_shdepend s
JOIN pg_roles r ON r.oid = s.refobjid
JOIN pg_class cl ON cl.oid = s.classid
LEFT JOIN pg_database d ON d.oid = s.dbid
WHERE r.rolname = 'app'
GROUP BY 1, 2, 3
ORDER BY 1, 2;

deptype = 'o' means the role still owns the object; deptype = 'a' means it holds a grant. The catalog column names the kind: pg_class (tables, views, sequences), pg_proc (functions), pg_type (types), pg_namespace (schemas), pg_publication, pg_default_acl.

The sweep found functions, composite and enum types, domains, and the schemas themselves still owned by app in the one service I had moved object by object. PG14 adds a quieter version of the same trap: a user-defined range type carries a companion multirange type (typtype = 'm'), and ALTER TYPE <range> OWNER does not move it. A predicate of typtype IN ('e','d','c','r') misses it.

The services I had reassigned with the one-liner came back clean. REASSIGN OWNED moves every object class, so it had covered functions and types and schemas without me thinking about them. The gap existed only where the publication forced me object by object, and only because my list stopped at relations. I had flipped one lower environment of that service table-only and never rechecked it, so it sat half-migrated until the sweep. I treat an empty owner result from this query as the completeness check now.

Ownership, grants, and membership are three separate layers Link to heading

Changing an object’s owner leaves the grants other roles hold on it untouched, and it leaves role membership untouched. I had explicit DML grants for the old shared role on a handful of tables in another database. They looked like residue worth clearing. Before revoking, I checked one thing: the old role was a member of the new owning role, with inheritance, so it kept full access to those tables through membership no matter what the explicit grants said:

SELECT has_table_privilege('app', 'svc.some_table', 'SELECT');  -- still true via membership

Revoking the explicit grants removed the redundant ACL entries without removing any access path. I would confirm that inheritance before assuming a revoke is either safe or meaningful.

Dropping the old role, in order Link to heading

Retiring the old per-service login came last, and the order mattered in three places.

Ownership had to be fully moved first. DROP ROLE refuses while the role still owns an object or holds a grant, and it lists every dependency it finds. The pg_shdepend sweep had to return no owner rows before I tried, the same reason I stopped trusting the table count.

The migration tooling connects as one fixed role and manages each service’s objects, so it needed membership in the new owning role to set default privileges on the new owner’s tables. The old role was already a member of the new one from the transition. Adding the reverse membership on top makes a cycle that Postgres rejects, so I reversed the direction in two steps: revoke the old role’s membership in the new one, then grant the new owning role membership in the connection role.

The login is managed outside Postgres, by something like an IAM directory, and deleting it there drops the underlying role. Dropping a role also strips its membership grants. Run as one step, the role can disappear before the membership revoke executes, and the revoke then errors on a role that no longer exists. I revoke the memberships first while the role still exists, then delete the user.

My approach now Link to heading

I run this lowest environment first, one at a time. A short lock_timeout and a retry loop keep the live reassign from blocking traffic, so the application stays up. I reach for REASSIGN OWNED first, because it moves every object class and I cannot forget one. When a publication or a shared database forces the per-object path, I treat a non-empty owner result from the sweep as unfinished. By the time production runs, a lower environment has already shown me what breaks.