A big yap about building security before you write features.

(it's kinda long but idc, I need to rant)

So I have been going a bit mental working on this govtech CRM this past week, and I have the urge to blabber about it here. My working style has usually been build first and "add security later," but I had to flip it given I was working on a govtech app. Spending an entire week doing nothing but security infrastructure felt weird initially ngl.

I've seen what happens when you try to bolt it on afterward. You end up with if (user.role === 'admin') checks sprinkled across 200 files, a database that blindly trusts whatever the app layer tells it, and a Jira ticket called "add security headers" that's been in the backlog since Kanye's Graduation era. So I decided to be a bit more paranoid this time.

I've written this post to walk through every security pattern I implemented at the foundation layer, in the order they build on each other. All of it is meant to run in production, protecting constituent PII for government offices. I'll show the attack surface first, then the defense, because that's how I actually thought about each one. I know most people here aren't into this stuff but imma rant about it anyways.

right, let's get into it.


The Mental Model: Defense in Depth

Before any code, the framing matters.

Defense in depth means you don't trust any single layer to hold. You assume the app layer will have bugs. You assume someone will forget a WHERE clause. You assume a misconfigured middleware will let something through. So you build multiple independent barriers, each of which can stop an attack on its own when the layer in front of it fails.

Here's the stack I built, outermost to innermost:

[ Rate Limiter ]          ← blocks abuse at the edge
[ Security Headers ]      ← protects the browser
[ Auth + Session ]        ← verifies identity
[ Tenant Resolution ]     ← determines which organization
[ Row-Level Security ]    ← database enforces isolation

Each layer is independent. If any one fails, the others still hold. That's the whole idea.


Layer 1: Stopping Abuse Before It Hits Your App

Here's the scenario that keeps me up at night: a login endpoint with no rate limiting, sitting on the open internet, protecting a database full of constituent PII. Every credential-stuffing bot on the planet can hammer it as fast as their bandwidth allows.

So the first thing I built was a rate limiter. Now, the naive implementation looks like this:

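// ❌ Broken: INCR and EXPIRE are two separate commands
const count = await redis.incr(key);
// A crash right here leaves the key with no TTL = immortal counter
await redis.expire(key, 60); // also resets the TTL on every single request
if (count > limit) return 429;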

There's a subtle bug here that will ruin your week. If the process crashes between INCR and EXPIRE, the counter lives forever. One unlucky crash and your rate limiter becomes a permanent lockout. Exactly this has happened to me before: a key with no TTL permanently rate-limited a real user.

The fix is atomicity. Both operations happen together or neither does:

// ✅ Atomic Lua script — no race condition
const luaScript = `
  local current = redis.call('INCR', KEYS[1])
  if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
  end
  return current
`;
const count = await redis.eval(luaScript, 1, key, windowSeconds);

Redis executes Lua scripts atomically, so no other command can run between the INCR and the EXPIRE, and the TTL is only set on the first hit of a window. The Redis docs still recommend this approach. MULTI/EXEC transactions can't branch on intermediate results, which makes them a poor fit for "only set the TTL if this was the first increment" logic.

Quick note on algorithms: what I'm showing above is a fixed window counter, which is fine for login throttling. If you want something more sophisticated, the sliding window counter has become the consensus default for API rate limiting: low memory, near-exact accuracy, and no boundary burst problem where someone dumps 100 requests at the end of one window and 100 more at the start of the next. This Redis rate limiting tutorial walks through five algorithms if you want to go deep. And for inspiration at massive scale, Let's Encrypt published in early 2025 that they migrated their rate limiting to GCRA: a single timestamp per key, sub-millisecond evaluation at billion-certificate scale. Pretty wild.
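
To make the sliding window counter concrete, here's a minimal in-memory sketch. It is an illustration under assumptions, not the limiter from this project: createSlidingWindowLimiter and its parameters are made-up names, and a real version would keep the two counters in Redis instead of a Map. The trick is to weight the previous window's count by how much of it still overlaps the last windowMs.

```typescript
// Sliding window counter, in-memory sketch (hypothetical names throughout).
interface WindowState {
  windowStart: number; // start of the current fixed window, in ms
  current: number;     // allowed hits in the current window
  previous: number;    // allowed hits in the window right before it
}

function createSlidingWindowLimiter(limit: number, windowMs: number) {
  const state = new Map<string, WindowState>();

  // Returns true if the request is allowed, false if it should get a 429.
  return function allow(key: string, nowMs: number): boolean {
    const windowStart = Math.floor(nowMs / windowMs) * windowMs;
    let s = state.get(key);

    if (!s) {
      s = { windowStart, current: 0, previous: 0 };
      state.set(key, s);
    } else if (s.windowStart !== windowStart) {
      // Roll over. If more than one full window passed, old hits no longer count.
      const adjacent = windowStart - s.windowStart === windowMs;
      s.previous = adjacent ? s.current : 0;
      s.current = 0;
      s.windowStart = windowStart;
    }

    // Fraction of the current window that has elapsed, between 0 and 1.
    const elapsed = (nowMs - windowStart) / windowMs;
    // Estimated hits inside the last windowMs.
    const estimated = s.previous * (1 - elapsed) + s.current;

    if (estimated >= limit) return false;
    s.current += 1;
    return true;
  };
}
```

With a limit of 10 per second, ten hits at t=900ms followed by one more at t=1000ms gets rejected, where a plain fixed window would have handed out a fresh budget of ten.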

I applied different limits by authentication state: 100 requests/minute for authenticated users, 20/minute for anonymous requests. The login endpoint gets its own stricter budget.
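
A rough sketch of how those budgets could be picked per request. The 100 and 20 come from the paragraph above; the login path, the login limit of 5, and the key format are assumptions made up for the example, since the post only says the login budget is "stricter".

```typescript
type Caller =
  | { kind: "authenticated"; userId: string }
  | { kind: "anonymous"; ip: string };

interface Budget {
  key: string;           // Redis key the counter lives under
  limit: number;         // max requests per window
  windowSeconds: number;
}

const LOGIN_PATH = "/api/auth/sign-in"; // hypothetical route
const LOGIN_LIMIT = 5;                  // assumption, not a number from the post

function budgetFor(caller: Caller, path: string): Budget {
  const who =
    caller.kind === "authenticated" ? `user:${caller.userId}` : `ip:${caller.ip}`;

  // The login endpoint gets its own counter so API traffic can't use up
  // (or hide inside) the login budget.
  if (path === LOGIN_PATH) {
    return { key: `rl:login:${who}`, limit: LOGIN_LIMIT, windowSeconds: 60 };
  }

  const limit = caller.kind === "authenticated" ? 100 : 20;
  return { key: `rl:api:${who}`, limit, windowSeconds: 60 };
}
```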

One implementation detail worth flagging: Next.js middleware runs in an edge-like environment without direct TCP access to Redis. Regular clients like ioredis don't work there. So the middleware calls an internal API route that does the Redis work:

Middleware → POST /api/internal/rate-limit → Redis

Slightly more hops, but it keeps the runtime boundaries clean. If you're on Vercel, Upstash's @upstash/ratelimit is worth looking at. It works over HTTP REST, so it runs natively in Edge Runtime with fixed window, sliding window, and token bucket out of the box.

One decision you need to make upfront that nobody talks about: what happens when Redis goes down? Do you fail open (let everyone through) or fail closed (block everyone)? For a login endpoint protecting PII, I fail closed. For general API routes, I fail open. There's no universally right answer, but you need to pick one deliberately rather than discovering your choice during an outage.
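
One way to make that choice explicit in code is to wrap the limiter call and pass the failure policy in as an argument. This is a sketch with hypothetical names (decide, FailureMode); the check callback stands in for whatever actually talks to Redis.

```typescript
type FailureMode = "open" | "closed";

interface Decision {
  allowed: boolean;
  degraded: boolean; // true when the limiter backend could not be reached
}

async function decide(
  check: () => Promise<boolean>, // resolves true when the caller is under quota
  onFailure: FailureMode,
): Promise<Decision> {
  try {
    return { allowed: await check(), degraded: false };
  } catch {
    // Backend is down or timed out: apply the policy that was chosen up front.
    return { allowed: onFailure === "open", degraded: true };
  }
}
```

A login route would call it with "closed" and a general API route with "open". The degraded flag is there so an outage shows up in logs or metrics instead of passing silently.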


Layer 2: Security Headers (What the Browser Needs to Know)

Security headers are instructions you send to the browser with every response. Your users never see them. They're for the browser itself, telling it what to trust and what to block.

Without them, here's what's possible: an attacker injects a <script> tag into your page via stored XSS, a compromised CDN, or a malicious browser extension, and the browser happily executes it. It steals cookies. Redirects users. Exfiltrates form data. The browser has no way to know the script shouldn't be there.

Here's the full set I apply in middleware:

const headers = new Headers(response.headers);

// Prevent XSS by controlling which scripts can execute
headers.set('Content-Security-Policy', buildCsp(nonce));

// Force HTTPS for 2 years, including subdomains
headers.set(
  'Strict-Transport-Security',
  'max-age=63072000; includeSubDomains; preload'
);

// Block this app from being embedded in iframes (clickjacking)
headers.set('X-Frame-Options', 'DENY');

// Prevent MIME sniffing attacks
headers.set('X-Content-Type-Options', 'nosniff');

// Don't leak referring URL when navigating away
headers.set('Referrer-Policy', 'strict-origin-when-cross-origin');

The interesting one, and the one that'll fight you the hardest, is CSP.

The CSP Nonce Pattern

A Content Security Policy tells the browser: "only execute scripts that I've explicitly approved." The old approach was whitelisting domains, but that's fragile. If you whitelist a CDN, any script on that CDN becomes trusted — including malicious ones. A better approach is per-request nonces.

A nonce is a random token generated fresh for every single HTTP request. Your CSP header says "only run scripts that have this nonce." Your server-rendered scripts include the same token. The browser checks they match.

// In middleware: generate a new nonce for this request
const nonce = Buffer.from(crypto.randomUUID()).toString('base64');

// Pass it forward to the app
requestHeaders.set('x-nonce', nonce);

// Build the CSP directive with that nonce
const csp = `
  script-src 'self' 'nonce-${nonce}' 'strict-dynamic';
  frame-ancestors 'none';
  base-uri 'self';
`.replace(/\s+/g, ' ').trim();

headers.set('Content-Security-Policy', csp);

An attacker who injects a <script> tag can't execute it because they don't know this request's nonce. It's regenerated every time.

The 'strict-dynamic' keyword is important. It means scripts loaded by a trusted (nonced) script are also trusted. This is how Next.js code splitting still works under a strict CSP. But here's something that catches everyone off guard: in browsers that support it, strict-dynamic silently disables all host-based allowlisting, 'self' included. If you add *.googletagmanager.com alongside strict-dynamic, the allowlist gets ignored. By design, per CSP Level 3.

OWASP's CSP cheat sheet and web.dev's strict CSP guide both recommend a backward-compatible pattern for this: script-src 'nonce-{random}' 'strict-dynamic' https: 'unsafe-inline'. Modern browsers ignore the https: and 'unsafe-inline' fallbacks when nonce/strict-dynamic is present. Older browsers use them as graceful degradation. Best of both worlds.
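
For reference, here's one possible shape for a buildCsp helper that emits that backward-compatible pattern. The post never shows buildCsp's body, so this is a sketch under assumptions: the nonce check and the directive list (the fallback script-src plus the frame-ancestors and base-uri directives from earlier) are choices made for the example.

```typescript
// Accepts base64 or base64url nonces only, so a bad value can't smuggle
// extra sources into the header.
const NONCE_PATTERN = /^[A-Za-z0-9+\/_-]+={0,2}$/;

function buildCsp(nonce: string): string {
  if (!NONCE_PATTERN.test(nonce)) {
    throw new Error("CSP nonce must be base64 or base64url");
  }

  const directives = [
    // Browsers that understand nonces and strict-dynamic ignore the
    // https: and 'unsafe-inline' fallbacks; older browsers fall back to them.
    `script-src 'nonce-${nonce}' 'strict-dynamic' https: 'unsafe-inline'`,
    "frame-ancestors 'none'",
    "base-uri 'self'",
  ];

  return directives.join("; ");
}
```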

The Next.js Reality Check

I want to be honest here: CSP nonces in Next.js App Router are a bit of a war zone right now. The official guide shows the pattern, and it works for Server Components. But there are real, open bugs you will hit.

The biggest architectural trade-off nobody tells you about: nonces force dynamic rendering on every page. You need export const dynamic = 'force-dynamic' in your root layout. Static pages can't use nonces because a build-time nonce would be identical for every user, which defeats the entire purpose. That means no CDN edge caching, slower initial loads, higher serverless bills. For my govtech platform handling PII, that trade-off is obvious. For a marketing site? Maybe think twice.

Specific things that will bite you: dynamic imports via next/dynamic create <link rel="preload"> tags that don't receive nonces, causing CSP violations even with strict-dynamic. In Next.js 15, you need to call await headers() in page components or your middleware-set CSP won't apply in production. And if you're eyeing Next.js 16's cacheComponents, it's incompatible with nonce-based CSP entirely.

Three mistakes I made and only found later: generating nonces in next.config.js headers (that runs at build time, not per request, and it's the most common CSP nonce mistake people make); only setting CSP on response headers without passing the nonce via request headers so the renderer can extract it; and forgetting that React's HMR needs 'unsafe-eval' in dev mode. That last one is fine for development. If it leaks to production, your entire CSP just becomes decorative.
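
A cheap guard against that last mistake is a check that inspects the CSP string a production build actually sends. The sketch below is an assumption about how such a check could look (findCspProblems is a made-up name, not part of Next.js or any library).

```typescript
// Returns the sources listed under script-src, or [] if the directive is absent.
function scriptSrcSources(csp: string): string[] {
  for (const directive of csp.split(";")) {
    const parts = directive.trim().split(/\s+/);
    if (parts[0]?.toLowerCase() === "script-src") return parts.slice(1);
  }
  return [];
}

// Lists reasons this CSP should not ship to production. Empty array = fine.
function findCspProblems(csp: string): string[] {
  const sources = scriptSrcSources(csp);
  const problems: string[] = [];

  if (sources.indexOf("'unsafe-eval'") !== -1) {
    problems.push("script-src allows 'unsafe-eval'");
  }
  if (!sources.some((s) => s.startsWith("'nonce-"))) {
    problems.push("script-src has no nonce");
  }
  return problems;
}
```

Run it in CI against the header from a production build and fail the job if the array isn't empty.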


Layer 3: Auth With Tenant Context Built In

Authentication answers "who are you." In a multi-tenant app, you also need "which organization do you belong to, and what's your role there?"

Most auth setups you'd see on YouTube stop at the first question and leave the second to a separate DB lookup on every request. That works, but it means every handler starts with a database call just to figure out context. At scale, that's a lot of wasted queries for information that rarely changes.

I solved this with session enrichment: embedding tenant context into the session at creation time, not lookup time. When a user logs in, a hook runs immediately after the session row is created:

// packages/features/auth/lib/better-auth.ts
databaseHooks: {
  session: {
    create: {
      after: async (session) => {
        const membership = await prisma.membership.findFirst({
          where: {
            userId: session.userId,
            accepted: true,
          },
          include: { organization: true },
        });

        if (membership) {
          await prisma.session.update({
            where: { id: session.id },
            data: {
              organizationId: membership.organizationId,
              role: membership.role,
            },
          });
        }
      },
    },
  },
},

Now the session row itself contains organizationId and role. Middleware reads this from a secure cookie and injects it as request headers:

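// apps/web/middleware.ts
const session = await readSessionFromCookie(request);

// Drop any client-supplied copies first so these headers can't be spoofed
for (const name of ['x-vulcan-organization-id', 'x-vulcan-user-id', 'x-vulcan-role']) {
  requestHeaders.delete(name);
}

if (session?.organizationId) {
  requestHeaders.set('x-vulcan-organization-id', session.organizationId);
  requestHeaders.set('x-vulcan-user-id', session.userId);
  requestHeaders.set('x-vulcan-role', session.role);
}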

Every downstream handler (tRPC, API routes, server components) reads these headers. No database call. No extra latency.
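
On the reading side, it helps to go through one function that refuses to continue when any header is missing, rather than letting each handler read them ad hoc. A minimal sketch, assuming the three x-vulcan-* header names above; readTenantContext and HeaderBag are hypothetical names, and HeaderBag is just the slice of the standard Headers API that the function needs.

```typescript
// Anything with a Headers-style get(), e.g. the request headers object.
interface HeaderBag {
  get(name: string): string | null;
}

interface TenantContext {
  organizationId: string;
  userId: string;
  role: string;
}

function readTenantContext(headers: HeaderBag): TenantContext {
  const organizationId = headers.get("x-vulcan-organization-id");
  const userId = headers.get("x-vulcan-user-id");
  const role = headers.get("x-vulcan-role");

  // Fail closed: a partially populated context is treated as no context.
  if (!organizationId || !userId || !role) {
    throw new Error("Missing tenant context headers");
  }
  return { organizationId, userId, role };
}
```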

The Stale Data Trade-Off

The honest downside of session enrichment is that if a user's role changes mid-session, the session retains old permissions until it refreshes. This is the classic freshness-vs-performance trade-off.

For my use case, sessions last 8 hours with a 15-minute refresh window, and role changes are infrequent admin operations. If permissions change rapidly in your app, you'll want shorter-lived tokens. Clerk uses 60-second tokens with automatic background refresh, for instance. The right answer depends entirely on how much staleness you can tolerate.

session: {
  expiresIn: 60 * 60 * 8,         // 8-hour sessions
  updateAge: 60 * 15,             // extend expiry at most once per 15 min of use
  cookieCache: {
    enabled: true,
    maxAge: 60 * 5,               // cookie-cached for 5 minutes
  },
},
cookie: {
  secure: true,
  httpOnly: true,
  sameSite: 'strict',
}

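httpOnly means JavaScript can't read the cookie; only the browser sends it with requests, so an XSS attack can't steal the session token. sameSite: 'strict' means the cookie won't be sent on cross-site requests, which shuts down the classic CSRF vector. (Requests from a sibling subdomain still count as same-site, so it isn't a complete CSRF defense on its own.)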

One more thing: CVE-2025-29927 demonstrated that Next.js middleware could be skipped entirely on unpatched versions by sending a crafted x-middleware-subrequest header. So don't rely solely on middleware for authorization. Check auth in your server-side handlers too. Belt and suspenders.
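
As a sketch of what "check auth in the handler too" can look like: a small guard that each handler calls before doing any work. The role names and their ordering are assumptions made up for the example (the post never lists its roles), and requireRole is a hypothetical helper.

```typescript
// Lowest to highest privilege. Hypothetical role names.
const ROLES = ["viewer", "staff", "admin"] as const;
type Role = (typeof ROLES)[number];

class ForbiddenError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ForbiddenError";
  }
}

// Throws unless `actual` is a known role at or above `minimum`.
function requireRole(actual: string | null | undefined, minimum: Role): Role {
  const rank = ROLES.indexOf(actual as Role);
  if (rank === -1) {
    // Missing or unrecognized role: deny rather than guess.
    throw new ForbiddenError("Missing or unknown role");
  }
  if (rank < ROLES.indexOf(minimum)) {
    throw new ForbiddenError(`Requires at least the ${minimum} role`);
  }
  return actual as Role;
}
```

If middleware is bypassed and the role header never arrives, the handler throws instead of running with no checks at all.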


Layer 4: The Database Enforces Tenant Isolation, Not Just Your App

This is the most important pattern in the whole post. And the one most developers skip.

Let me paint the picture. You've got 50 government offices on your platform, each with thousands of constituent records. The typical approach:

const cases = await prisma.case.findMany({
  where: {
    organizationId: currentOrgId, // 🤞 hope we never forget this
  },
});

This works until it doesn't. Until someone writes a new query and forgets the where. Until a bug sends the wrong currentOrgId. Until a new engineer doesn't know the convention exists because it was never documented, just passed down like folklore in a Slack thread nobody can find.

When it fails, it fails silently. No error. No crash. Just... someone else's data showing up. In govtech, that's not an embarrassing bug, that's a breach that makes the news. And it's not hypothetical: in 2025, over 170 Supabase applications were found with missing RLS policies on auto-generated tables. One leak exposed 13,000 users' records.

The defense-in-depth answer is PostgreSQL Row-Level Security.

How RLS Works

RLS lets you attach policies directly to database tables. These policies are evaluated by the database engine itself — before any data is returned, regardless of what the application queries.

-- Enable RLS on the cases table
ALTER TABLE cases ENABLE ROW LEVEL SECURITY;
ALTER TABLE cases FORCE ROW LEVEL SECURITY;

-- Policy: users can only see rows belonging to their organization
CREATE POLICY tenant_isolation ON cases
  USING (organization_id = current_setting('app.current_office_id')::uuid);

FORCE ROW LEVEL SECURITY is critical and the single most commonly missed step. Without it, the table owner (usually the role your app connects as) bypasses RLS entirely. Bytebase's excellent post on RLS footguns calls this out as pitfall #1. Every table gets both ENABLE and FORCE. No exceptions.
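
Since every table needs the same three statements, one option is to generate them so nobody can forget the FORCE line. This is a sketch, not the project's migration tooling: rlsStatements is a made-up helper, and it assumes every tenant-scoped table uses the same column name and the same app.current_office_id setting shown above.

```typescript
// Only plain lowercase identifiers are accepted, because the names are
// spliced directly into SQL text.
const IDENTIFIER = /^[a-z_][a-z0-9_]*$/;

function rlsStatements(table: string, tenantColumn = "organization_id"): string[] {
  for (const id of [table, tenantColumn]) {
    if (!IDENTIFIER.test(id)) {
      throw new Error(`Unsafe SQL identifier: ${id}`);
    }
  }

  return [
    `ALTER TABLE ${table} ENABLE ROW LEVEL SECURITY;`,
    `ALTER TABLE ${table} FORCE ROW LEVEL SECURITY;`,
    `CREATE POLICY tenant_isolation ON ${table} ` +
      `USING (${tenantColumn} = current_setting('app.current_office_id')::uuid);`,
  ];
}
```

The output would be pasted into (or written out as) a raw SQL migration file. Policy names are scoped per table in PostgreSQL, so reusing tenant_isolation on every table is fine.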

Your app provides the tenant context by setting a session variable at the start of every request:

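// packages/prisma/client.ts
import { Prisma } from '@prisma/client';

// Runs `work` inside a transaction that carries the tenant context.
// Generic over T so handlers can return their query results.
export async function withTenantContext<T>(
  context: { organizationId: string; userId: string },
  work: (db: Prisma.TransactionClient) => Promise<T>
): Promise<T> {
  return prisma.$transaction(async (tx) => {
    // is_local = true scopes both settings to this transaction only
    await tx.$executeRaw`
      SELECT set_config('app.current_office_id', ${context.organizationId}, true),
             set_config('app.current_user_id', ${context.userId}, true)
    `;
    return work(tx);
  });
}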

And every tRPC handler that touches data calls it:

export async function listCasesHandler({ ctx }: TRPCHandlerArgs) {
  return withTenantContext(
    { organizationId: ctx.organizationId, userId: ctx.userId },
    async (db) => {
      // RLS automatically filters to this organization
      // No WHERE organizationId = ... needed
      return db.case.findMany({ orderBy: { createdAt: 'desc' } });
    }
  );
}

Clean. No manual filtering. The database handles it.

The set_config(..., true) means the setting is local to the current transaction. When the transaction ends, the context clears. No bleed between requests.

The Sharp Edges

I'd be doing you a disservice if I left out the gotchas. RLS has several, and they're not obvious until they bite you.

It fails silently. An UPDATE that should hit 100 rows hits 0 with no error, no warning. A SELECT returns empty instead of the 50 records you expected. You'll debug this for hours before realizing the policy is filtering everything out. And EXPLAIN ANALYZE as a superuser won't even show the RLS conditions, because superusers bypass RLS.

Performance can crater. RLS policies act as implicit WHERE clauses in execution plans. The dangerous trap: if your policy calls a function that isn't marked LEAKPROOF, PostgreSQL can't reorder operations around it, which can block index usage and force full table scans. Queries go from milliseconds to minutes. Always index the columns your RLS policies reference. Supabase's docs report 100x improvements just from adding btree indexes on policy columns.

Views bypass RLS by default. PostgreSQL views behave like SECURITY DEFINER: they run with the view owner's privileges, not the caller's. If a superuser (or any other role that's exempt from RLS) owns a view over an RLS-protected table, all policies get skipped. PostgreSQL 15+ added security_invoker = true to fix this, but you have to set it explicitly.

Unique constraints leak data across tenants. A global UNIQUE(email) constraint will throw a "duplicate key" error if another tenant already has that email, revealing the email exists in someone else's data. Fix it by scoping uniqueness: CREATE UNIQUE INDEX users_email_per_tenant ON users(tenant_id, lower(email)).

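FK checks ignore RLS. PostgreSQL runs referential integrity checks with row security bypassed, so a foreign key can point at a parent row the current tenant can't even see, and whether an insert succeeds or fails can reveal that a row exists in someone else's data. If that matters for a table, put the tenant column into the foreign key itself so references can't cross tenants. Really fun to debug at 11pm.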

The set_config Subtlety

One more thing that bit me during testing. After a transaction-scoped variable is used and the transaction ends, PostgreSQL doesn't always cleanly remove it; it can reset the value to an empty string rather than unsetting it entirely. If your policy does text comparison instead of UUID casting, '' could theoretically match rows with empty tenant IDs.

The safest practice: use current_setting('app.current_office_id') without the missing_ok parameter, and keep the ::uuid cast in the policy. That way, if the context was never set, the query throws an error, and if it was reset to an empty string, the cast throws instead of silently returning wrong data. Fail closed. Always.
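
The same fail-closed idea can be mirrored on the app side by refusing to open a tenant transaction unless the ID actually looks like a UUID. A small sketch; assertTenantId is a hypothetical helper, and the regex only checks the textual 8-4-4-4-12 hex shape, not the UUID version bits.

```typescript
const UUID_SHAPE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Returns the value unchanged if it looks like a UUID, throws otherwise.
function assertTenantId(value: unknown): string {
  if (typeof value !== "string" || !UUID_SHAPE.test(value)) {
    // Empty strings, undefined and garbage all land here.
    throw new Error("Refusing to run a query without a valid tenant id");
  }
  return value;
}
```

Calling it on organizationId before set_config means a missing or empty context fails in the app with a clear message, and the database cast stays as the backstop.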

A Note on Prisma

I want to be transparent: Prisma's RLS story is functional but not first-class. There's no native RLS support in the schema language (GitHub issue #12735, still open). Policies go in raw SQL migration files. Prisma's own RLS example extension states it's not intended for production.

The interactive transaction pattern works, but there are documented issues with batch transactions causing blocking queries. If you're doing high-throughput work, test connection pool behavior under load. And this one's a showstopper if you miss it: make sure your app connects as a non-superuser role. Prisma's default setup often connects as postgres, which bypasses all RLS. Defeats the entire point.

Why Two Barriers Beat One

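With RLS in place, a forgotten WHERE clause is no longer enough to cause a cross-tenant leak. That now takes two simultaneous failures: the app layer runs an unscoped or wrongly scoped query AND the database policy fails to block it. The chance of both failing at the same time is far lower than either alone. The one thing RLS can't catch is the app handing it the wrong organizationId in the first place, so that value should only ever come from the session, never from user input.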

that's defense in depth.


Layer 5: Proving It All Works

The last pattern isn't code, it's verification.

Security controls that aren't tested aren't controls. They're hopes.

After building all of this, I wrote scripts to prove it actually works:

// packages/prisma/scripts/verify-rls.mjs
async function verifyTenantIsolation() {
  const orgA = await prisma.organization.create({
    data: { name: 'Office A', ... }
  });
  const orgB = await prisma.organization.create({
    data: { name: 'Office B', ... }
  });

  // Create a case belonging to Org A
  await withTenantContext(
    { organizationId: orgA.id, userId: testUser.id },
    async (db) => {
      await db.case.create({
        data: { title: 'Org A Case', organizationId: orgA.id }
      });
    }
  );

  // Try to read Org A's case while authenticated as Org B
  const leaked = await withTenantContext(
    { organizationId: orgB.id, userId: testUser.id },
    async (db) => {
      return db.case.findMany(); // Should return empty
    }
  );

  if (leaked.length > 0) {
    console.error('❌ RLS FAILURE: Cross-tenant data leak detected');
    process.exit(1);
  }

  console.log('✅ Tenant isolation verified');
}

This runs in CI on every push. Not once, not when I remember. Every. Single. Push.

I'd also recommend testing the inverse: verify that queries without set_config fail hard, not silently succeed. And test as both the postgres superuser (confirm the policy exists) and your app role (confirm it's enforced). Don't assume. Prove it.
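
A sketch of what the inverse test could look like. assertFailsClosed is a hypothetical helper: it takes any async query and passes only if that query throws. In a real script the callback would run a query through the app role without calling set_config first.

```typescript
// Passes (resolves) only if `query` rejects. A query that succeeds without
// tenant context is treated as a test failure.
async function assertFailsClosed(
  label: string,
  query: () => Promise<unknown>,
): Promise<void> {
  let result: unknown;
  try {
    result = await query();
  } catch {
    return; // the database refused, which is the outcome we want
  }
  throw new Error(
    `${label}: query succeeded without tenant context (got ${JSON.stringify(result)})`,
  );
}
```

Note that an empty result counts as a failure here too. "Returned nothing" and "refused to run" are different outcomes, and only the second one proves the missing-context path fails hard.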


The Full Picture

Here's how all five layers interact on a single request:

1. Request hits middleware
   → Rate limiter checks Redis
   → Over quota? → 429. Under? → Continue.

2. Middleware generates CSP nonce
   → Attaches to response + request headers
   → Applies all security headers

3. Middleware reads session cookie
   → Extracts organizationId + role
   → Injects as x-vulcan-organization-id + x-vulcan-role
   → No auth? → Redirect to login.

4. tRPC handler receives request
   → Reads org/user context from headers
   → Calls withTenantContext → scoped transaction
   → Sets app.current_office_id in PostgreSQL

5. Database query executes
   → RLS policies evaluate each row
   → Only matching rows returned
   → No application-level filter needed

Each layer protects against a different failure mode. Together they make a platform you can hand to government offices handling thousands of constituents' personal data and sleep at night.


What I'd Tell Myself Before Starting

Security infrastructure feels like yak shaving when you're eager to build features. Every day on CSP headers and RLS policies is a day you're not building the actual features.

But here's the math: building security in at the start costs one week. Retrofitting it onto a running system with live data? Months. Downtime. Migrations. Rewriting core assumptions you baked in on day one.

There's a less obvious payoff too. Building RLS first meant every feature I shipped afterward could assume tenant isolation existed. For govtech, where the data is constituent PII and the stakes are public trust, there's never a second option. But even if you're building something that tracks nothing more sensitive than recipe collections, these patterns will save you. Yeah, the cost of adding them now is a week. But the cost of not having them when you need them in prod is incalculable.

Build the walls before you move the furniture.


Go check your headers at securityheaders.com. Connect your app to the database as a non-superuser role. Add FORCE ROW LEVEL SECURITY to every table, then test as the application role, not postgres.

nuf said,
Kay
