TL;DR
- Treat security as product requirements: contract-first, short-lived tokens, object-level auth, schema validation, rate limits, replay guards, evidence by default.
- Wire small, reliable gates into CI and portable policies at edge + service.
- Start with 5 changes in 30 days: API-BOM, JWT aud/iss checks, object-ownership checks on money + identity flows, write-route rate limits + normalization, webhook signatures with 5-minute replay window.
Who this is for, and how to use it
Engineers, staff ICs, tech leads, and platform folks who build and run APIs. Use this playbook to scaffold new services, retrofit legacy ones, and standardize controls across languages. Skim Quickstart to land the basics, then wire the CI jobs and runtime policies. Reference the snippets whenever you open a PR.
Day-0 Quickstart (copy these into your next PR)
- Add a contract
  - For REST: OpenAPI 3.1 with securitySchemes, strict request/response schemas, and explicit error shapes.
  - For GraphQL: schema SDL checked in, persisted queries only.
  - For gRPC: .proto files with method allowlists in the proxy.
- Add token checks
  - Verify iss, aud, exp, and signature; rotate keys; prefer short TTLs.
  - Block tokens with wrong aud even if valid.
- Enforce object-level authorization (BOLA stop)
  - Every read/write checks tenant + subject ownership, near the data.
- Rate-limit and normalize
  - Soft limits on reads, stricter on writes; collapse near-dupe payloads.
- Replay + webhook hygiene
  - HMAC signature with timestamp; 5-minute window; idempotency keys.
- Log decisions, not secrets
  - Structured logs with correlation IDs; mask PII; short retention.
Threat model (engineer edition)
- IDOR/BOLA: object ownership missing or inconsistent.
- Mass assignment: extra fields accepted and honored.
- Token misuse: missing aud/iss checks; long TTLs; reuse across services.
- Parser & resource exhaustion: large, nested, or weird payloads.
- Webhook replay/spoof: no signature check or long replay window.
- Version drift: deprecated routes still live; unowned endpoints (“shadow/zombie”).
Fix with contract-first design, strict types, portable policies, and negative tests on every change.
Design: contract-first (REST)
openapi: 3.1.0
info: { title: Payments API, version: "1.2.0" }
components:
  securitySchemes:
    oauth2:
      type: oauth2
      flows:
        clientCredentials:
          tokenUrl: https://idp.example.com/oauth2/token
          scopes:
            payments.read: "Read payments"
            payments.write: "Create payments"
  schemas:
    PaymentCreate:
      type: object
      additionalProperties: false
      required: [amount, currency]
      properties:
        amount: { type: integer, minimum: 1 }
        currency: { type: string, enum: [USD, EUR, GBP] }
security: [{ oauth2: [payments.read] }]
paths:
  /v1/payments:
    post:
      security: [{ oauth2: [payments.write] }]
      requestBody:
        required: true
        content:
          application/json: { schema: { $ref: "#/components/schemas/PaymentCreate" } }
      responses:
        "201": { description: Created }
        "400": { description: Schema violation }
        "401": { description: Bad token }
        "403": { description: Forbidden }
AuthN & AuthZ: practical patterns
Node (Express) - JWT checks + object authorization
import express from "express";
import jwt from "jsonwebtoken";
import jwksClient from "jwks-rsa";

const app = express();
const client = jwksClient({ jwksUri: "https://idp.example.com/.well-known/jwks.json" });
const ISSUER = "https://idp.example.com/";
const AUD = "api://payments";

// Resolve the signing key by `kid`; pass lookup errors through instead of dereferencing a missing key.
function getKey(header, cb) {
  client.getSigningKey(header.kid, (err, key) => (err ? cb(err) : cb(null, key.getPublicKey())));
}

function requireJwt(req, res, next) {
  const match = /^Bearer (.+)$/.exec(req.headers.authorization || "");
  if (!match) return res.status(401).json({ error: "missing_token" });
  // The `audience` option also handles tokens whose `aud` claim is an array.
  jwt.verify(match[1], getKey, { algorithms: ["RS256"], issuer: ISSUER, audience: AUD }, (err, decoded) => {
    if (err) return res.status(401).json({ error: "invalid_token" });
    req.user = decoded;
    next();
  });
}

// `db` stands for the service's data-access layer.
app.get("/v1/accounts/:id", requireJwt, async (req, res) => {
  const acct = await db.accounts.findById(req.params.id);
  // Same response for "missing" and "not yours" so callers cannot probe for valid IDs.
  if (!acct || acct.tenant_id !== req.user.tid || acct.owner !== req.user.sub)
    return res.status(403).json({ error: "forbidden" });
  res.json(acct);
});
Python (FastAPI) - schema + ownership
from typing import Literal

from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel, ConfigDict, Field

app = FastAPI()


class PaymentCreate(BaseModel):
    # Reject unknown fields (mass assignment); Pydantic v2 syntax.
    model_config = ConfigDict(extra="forbid")
    amount: int = Field(gt=0)
    currency: Literal["USD", "EUR", "GBP"]


def verify_token(token: str) -> dict:
    # Parse and verify JWT signature, iss, aud, exp using your library of choice.
    # Return dict with 'sub' and 'tid' on success.
    raise NotImplementedError  # fail closed until a real verifier is wired in


@app.post("/v1/payments", status_code=201)
def create_payment(body: PaymentCreate, authorization: str = Header(None)):
    if not authorization or not authorization.startswith("Bearer "):
        raise HTTPException(401, "missing_token")
    user = verify_token(authorization[len("Bearer "):])
    # Ownership example for write: ensure user can create for this tenant
    if user["tid"] != "tenant-123":
        raise HTTPException(403, "forbidden")
    return {"status": "created"}
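The verify_token stub leaves the claim checks to a library. As a rough illustration of what those checks amount to once the signature has been verified, here is a minimal sketch using only the standard library. The names check_claims and TokenError, and the 30-second leeway, are assumptions for this example and not part of any particular JWT library.

```python
import time
from typing import Any, Mapping, Optional


class TokenError(Exception):
    """Raised when a decoded token fails a claim check."""


def check_claims(
    claims: Mapping[str, Any],
    issuer: str,
    audience: str,
    now: Optional[float] = None,
    leeway_s: int = 30,  # assumed clock-skew allowance
) -> None:
    """Validate iss, aud, and exp on claims whose signature is already verified."""
    now = time.time() if now is None else now

    if claims.get("iss") != issuer:
        raise TokenError("wrong_issuer")

    # The aud claim may be a single string or a list of strings.
    aud = claims.get("aud")
    if isinstance(aud, str):
        audiences = [aud]
    elif isinstance(aud, (list, tuple)):
        audiences = list(aud)
    else:
        audiences = []
    if audience not in audiences:
        raise TokenError("wrong_audience")

    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or now > exp + leeway_s:
        raise TokenError("expired_or_missing_exp")
```

In a handler, a TokenError would map to a 401. The function only inspects claims; it does nothing about signature verification or key rotation, which still belong to the JWT library.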
Request normalization, rate limits, and replay protection
NGINX (snippet) - JWT file + rate limit + normalization
map $http_authorization $jwt { "~^Bearer (.+)$" $1; default ""; }
limit_req_zone $binary_remote_addr zone=write_zone:10m rate=10r/s;

server {
  listen 443 ssl;
  location /v1/ {
    # Token validation via JWT key file (or use an auth subrequest)
    auth_jwt "secured"; # requires nginx-plus or module; otherwise external auth
    auth_jwt_key_file /etc/nginx/jwk.json;
    # Normalize
    proxy_set_header X-Normalized "1"; # pair with app-level canonicalization
    limit_req zone=write_zone burst=20 nodelay;
    proxy_pass http://payments_upstream;
  }
}
Idempotency + replay window (Node)
Webhook signature verify (Python)
import hashlib
import hmac
import time


def verify(sig_header: str, payload: str, secret: str) -> bool:
    # Header format: t=unix,s=hex(hmac_sha256(f"{t}.{payload}", secret))
    try:
        # Parse by key so the field order in the header does not matter.
        parts = dict(p.strip().split("=", 1) for p in sig_header.split(","))
        t, s = parts["t"], parts["s"]
        # Reject anything outside the 5-minute replay window.
        if abs(time.time() - int(t)) > 300:
            return False
        mac = hmac.new(secret.encode(), f"{t}.{payload}".encode(), hashlib.sha256).hexdigest()
        # Constant-time comparison.
        return hmac.compare_digest(mac, s.lower())
    except Exception:
        return False
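Signature and timestamp checks bound how long a captured request stays valid, but inside that window the same event can still arrive twice. One way to cover that is to remember event IDs (or idempotency keys) for at least the length of the window and return the stored result on a repeat. The sketch below is an in-memory illustration only; ReplayGuard and its method names are hypothetical, and a real deployment would likely back this with a shared store.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class ReplayGuard:
    """In-memory record of seen event IDs, kept for window_s seconds."""

    def __init__(self, window_s: int = 300) -> None:
        self.window_s = window_s
        self._seen: Dict[str, Tuple[float, Any]] = {}  # event_id -> (first_seen, result)

    def _evict(self, now: float) -> None:
        # Drop entries older than the window so memory stays bounded.
        expired = [k for k, (ts, _) in self._seen.items() if now - ts > self.window_s]
        for k in expired:
            del self._seen[k]

    def handle(
        self, event_id: str, process: Callable[[], Any], now: Optional[float] = None
    ) -> Tuple[Any, bool]:
        """Run process() at most once per event_id; return (result, replayed)."""
        now = time.time() if now is None else now
        self._evict(now)
        if event_id in self._seen:
            return self._seen[event_id][1], True
        result = process()
        self._seen[event_id] = (now, result)
        return result, False
```

Once an entry ages out, the same ID would be processed again, so this only works when paired with the timestamp check: a delivery old enough to have been evicted should already fail the 5-minute window.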
GraphQL hardening
- Disable raw POSTs; persisted queries only.
- Enforce depth and cost limits; disallow introspection in prod.
- Put authorization at resolver level for sensitive fields.
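To make the depth limit concrete, here is a rough sketch that measures how deeply selection sets nest in a query string, using only the standard library. It is an approximation under stated assumptions: it counts braces outside of parentheses, skips strings and comments, and does not expand fragments, so a production gate would more likely rely on a GraphQL parser or a validation rule in the server. The function name selection_depth is hypothetical.

```python
def selection_depth(query: str) -> int:
    """Return the maximum nesting depth of selection sets in a GraphQL document."""
    depth = max_depth = parens = 0
    i, n = 0, len(query)
    while i < n:
        ch = query[i]
        if query.startswith('"""', i):   # block string: skip to the closing """
            end = query.find('"""', i + 3)
            i = n if end == -1 else end + 3
            continue
        if ch == '"':                    # ordinary string: skip it, honoring escapes
            i += 1
            while i < n and query[i] != '"':
                i += 2 if query[i] == "\\" else 1
            i += 1
            continue
        if ch == "#":                    # comment: skip to end of line
            end = query.find("\n", i)
            i = n if end == -1 else end
            continue
        if ch == "(":
            parens += 1
        elif ch == ")":
            parens -= 1
        elif ch == "{" and parens == 0:  # braces inside (...) are input objects, not selections
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}" and parens == 0:
            depth -= 1
        i += 1
    return max_depth
```

A gate would compare the result against a configured cap and reject the operation before execution. With persisted queries, the same check can run once at registration time instead of per request.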
Apollo Server (persisted queries)
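import { ApolloServer } from "@apollo/server";
// Package and export names as given; confirm them against your Apollo Server version.
import { createPersistedQueryPlugin } from "@apollo/server-plugin-persisted-queries";
// typeDefs and resolvers come from your schema module.
const server = new ApolloServer({ typeDefs, resolvers, introspection: false, plugins: [createPersistedQueryPlugin()] });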
gRPC controls
- mTLS between services; method allowlists at proxy.
- Deadlines on every call; max message size; RBAC per method.
Go - unary interceptor sketch
// principalKey is an unexported context key type, which avoids collisions with string keys.
type principalKey struct{}

func authInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	md, ok := metadata.FromIncomingContext(ctx)
	vals := md.Get("authorization")
	if !ok || len(vals) == 0 { // missing header must not panic on index
		return nil, status.Error(codes.Unauthenticated, "missing token")
	}
	token := strings.TrimPrefix(vals[0], "Bearer ")
	sub, tid, err := verifyJWT(token) // verify signature, iss, aud, exp
	if err != nil {
		return nil, status.Error(codes.Unauthenticated, "bad token")
	}
	if !allowed(info.FullMethod, sub, tid) { // method-level RBAC/ABAC
		return nil, status.Error(codes.PermissionDenied, "forbidden")
	}
	ctx = context.WithValue(ctx, principalKey{}, sub)
	return handler(ctx, req)
}
CI: gates that don’t flake (GitHub Actions example)
name: api-ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install deps
        run: npm ci
      - name: Contract checks
        run: npm run openapi:lint && npm run openapi:bundle
      - name: Negative tests (curl)
        run: |
          ./scripts/neg-tests.sh # includes IDOR, overposting, expired/wrong aud token cases
      - name: GraphQL safety
        run: npm run gql:persist && npm run gql:check-depth-cost
      - name: Fuzz light
        run: npm run fuzz:payloads
Sample negative tests (bash)
# Each check exits non-zero when the status code does not match; run the script with `set -e`.
# 1) Wrong audience
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer $WRONG_AUD" https://api.example.com/v1/payments | grep -q "^401$"
# 2) IDOR attempt
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer $TOKEN_A" "https://api.example.com/v1/accounts/${OTHER_USER}" | grep -q "^403$"
# 3) Overposting
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer $TOKEN_A" -H "Content-Type: application/json" \
  -d '{"amount":100,"currency":"USD","role":"admin"}' \
  https://api.example.com/v1/payments | grep -qE "^(400|422)$"
Observability for security
Log shape (JSON)
{
  "ts":"2025-09-04T12:00:00Z",
  "correlation_id":"c-49f8",
  "route":"POST /v1/payments",
  "principal":"sub:a1f3...",
  "tenant":"tid:t-77",
  "decision":"allow",
  "reason":"scopes:payments.write",
  "latency_ms":42,
  "pii_masked":true
}
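A small helper can emit that shape consistently and mask obvious PII before anything is written. The sketch below is one possible way to do it with the standard library. The function names, the list of sensitive keys, and the email pattern are assumptions for illustration; real masking rules depend on the data the service handles.

```python
import json
import re
from datetime import datetime, timezone

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SENSITIVE_KEYS = {"authorization", "password", "token", "secret"}  # assumed list


def mask(value):
    """Recursively mask sensitive keys and email-like strings."""
    if isinstance(value, dict):
        return {
            k: "***" if k.lower() in SENSITIVE_KEYS else mask(v)
            for k, v in value.items()
        }
    if isinstance(value, list):
        return [mask(v) for v in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("***@***", value)
    return value


def decision_log(correlation_id, route, principal, tenant, decision, reason,
                 latency_ms, extra=None):
    """Build one JSON log line in the shape shown above."""
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "correlation_id": correlation_id,
        "route": route,
        "principal": principal,
        "tenant": tenant,
        "decision": decision,
        "reason": reason,
        "latency_ms": latency_ms,
        "pii_masked": True,
    }
    if extra:
        record["extra"] = mask(extra)  # free-form context is masked before it is logged
    return json.dumps(record)
```

Keeping the decision fields fixed and pushing anything free-form through mask() means a new call site cannot accidentally log a raw header or email address.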
Detections to wire
- 403 spikes by route and principal
- Schema violation bursts on a route or version
- Tokens used across multiple services in short time window
- Repeated webhook IDs or timestamp skew
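As an illustration of the first detection, the following sketch counts 403 responses per route and principal over a sliding window and signals when a threshold is reached. The class name and the default threshold and window are assumptions; in practice this logic usually lives in the log pipeline or alerting system instead of the service.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class ForbiddenSpikeDetector:
    """Sliding-window count of 403s keyed by (route, principal)."""

    def __init__(self, threshold: int = 20, window_s: int = 60) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._hits: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def observe(self, route: str, principal: str, status: int, ts: float) -> bool:
        """Record one response; return True when 403s in the window reach the threshold."""
        if status != 403:
            return False
        hits = self._hits[(route, principal)]
        hits.append(ts)
        # Drop timestamps that have fallen out of the window.
        while hits and ts - hits[0] > self.window_s:
            hits.popleft()
        return len(hits) >= self.threshold
```

Keying by both route and principal keeps one noisy client from hiding behind aggregate traffic, and keeps one busy route from masking a quiet probe elsewhere.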
API discovery and API-BOM
Track this in code (CSV/JSON in repo) and surface via dashboard.
service, path, method, version, owner, data_class, auth, pii_fields, last_seen, risk
payments, /v1/payments, POST, 1, @team-pay, sensitive, oauth2, amount|currency, 2025-09-01, high
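Because the inventory lives in the repo, a CI step can read it and flag entries that look unowned, unauthenticated, or stale. Here is a minimal sketch assuming the CSV layout above. The function name audit_bom, the finding labels, and the 30-day staleness threshold are hypothetical choices for this example.

```python
import csv
import io
from datetime import date, timedelta
from typing import List, Tuple


def audit_bom(text: str, today: date, stale_after_days: int = 30) -> List[Tuple[str, str]]:
    """Return (endpoint, finding) pairs for rows that need attention."""
    findings = []
    # skipinitialspace handles the ", " separators used in the inventory file.
    for row in csv.DictReader(io.StringIO(text), skipinitialspace=True):
        endpoint = f'{row["method"]} {row["path"]}'
        if not row["owner"].strip():
            findings.append((endpoint, "no_owner"))        # shadow/zombie candidate
        if row["auth"].strip().lower() in {"", "none"}:
            findings.append((endpoint, "no_auth"))
        last_seen = date.fromisoformat(row["last_seen"].strip())
        if today - last_seen > timedelta(days=stale_after_days):
            findings.append((endpoint, "stale"))           # candidate for deprecation review
    return findings
```

A non-empty result could fail the build or open a ticket, depending on how strict the gate should be for that repository.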
First 90 days engineering plan
30 days - baseline
- Schemas published for top services; JWT aud/iss checks everywhere; write-route limits + normalization; KPI baseline.
60 days - harden critical paths
- Ownership checks on account, payment, password flows; persisted queries + depth/cost caps; CI jobs for IDOR + overposting.
90 days - clean up and codify
- Remove dead routes; enforce deprecation calendar; auth/schema error budgets with rollback; updated service template.
RACI that keeps velocity
- Platform: shared auth libs, policy bundles, CI jobs, templates.
- Service teams: adopt bundles, implement ownership checks, add tests.
- Security: define rules, budgets, evidence; review violations.
- SRE: run gateways/mesh, alerts, rollbacks; publish reliability SLOs.
- Product: deprecation windows; partner comms on version changes.
Anti-patterns to retire
Custom auth sprinkled in services; only-at-edge schema checks; lenient staging vs strict prod; long-lived tokens; PII in logs; alerts with no owners; launching breaking changes without shadow validation.
Introduction to Levo, how we help
Levo gives privacy-preserving runtime visibility and contract validation without moving payloads out of your boundary. Findings become policies and CI tests you adopt service by service. Pricing stays predictable as services and environments grow, so platform teams can raise reliability and speed at the same time.
See how this looks in practice: book a demo for a short working session on your two highest-risk flows.
Conclusion
Contracts, portable policies, and small reliable tests make reliability a pipeline property, not a late-night firefight. Ship these guardrails with every service and you’ll go faster with fewer incidents.
Related: Learn how Levo is solving API security with its fix-first approach and a product that is scale-agnostic, data-privacy-first, and growth-immune in pricing: Levo's API Solution.
FAQs
What’s the fastest way to stop BOLA today?
Add a shared helper for tenant + subject checks. Call it on every read/write of sensitive resources. Add two negative tests per route.
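A shared helper of this kind can be very small. The sketch below assumes resources are dict-like records with tenant_id and owner fields and that the verified token carries sub and tid claims, mirroring the field names in the Express example. Principal, Forbidden, and require_owner are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


class Forbidden(Exception):
    """Map this to HTTP 403 in the web layer."""


@dataclass(frozen=True)
class Principal:
    sub: str  # subject claim from the verified token
    tid: str  # tenant claim from the verified token


def require_owner(principal: Principal, resource: Optional[dict]) -> dict:
    """Return the resource only if tenant and owner both match the caller."""
    # A missing record and a foreign record fail the same way,
    # so the response does not reveal which IDs exist.
    if resource is None:
        raise Forbidden("forbidden")
    if resource.get("tenant_id") != principal.tid or resource.get("owner") != principal.sub:
        raise Forbidden("forbidden")
    return resource
```

The two negative tests per route then become: same tenant with a different owner, and a different tenant entirely. Both should raise Forbidden.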
Do I put auth at edge or in code?
Both. Edge for coarse checks and token validity. In code for object ownership and business rules. Keep both as code.
Will rate limits break customers?
Use soft limits on reads, stricter on writes. Scope by principal and route. Monitor 429s and tune weekly.
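To show what "scope by principal and route" can look like in application code, here is a minimal token-bucket sketch keyed on both. It is in-memory and single-process, so treat it as an illustration of the scoping idea only. The class name and the rates used in the usage note are assumptions.

```python
import time
from typing import Dict, Optional, Tuple


class TokenBucketLimiter:
    """Token bucket per (principal, route); refills at rate_per_s up to burst."""

    def __init__(self, rate_per_s: float, burst: int) -> None:
        self.rate = rate_per_s
        self.burst = burst
        self._buckets: Dict[Tuple[str, str], Tuple[float, float]] = {}  # key -> (tokens, last_ts)

    def allow(self, principal: str, route: str, now: Optional[float] = None) -> bool:
        """Spend one token if available; a False result maps to HTTP 429."""
        now = time.monotonic() if now is None else now
        key = (principal, route)
        tokens, last = self._buckets.get(key, (float(self.burst), now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        self._buckets[key] = (tokens - 1.0 if allowed else tokens, now)
        return allowed
```

A service might keep two instances, for example a generous one for reads and a stricter one for writes, and pick between them by HTTP method. The counts of rejected calls per key are the numbers to review when tuning.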
How do I make GraphQL safe without killing flexibility?
Persisted queries only, depth and cost caps, field-level auth for sensitive data. Version queries like you version REST.
gRPC best practices in one line?
mTLS + SPIFFE IDs, method allowlists, deadlines, max message size, and proxy-level RBAC.
How do I avoid flaky CI gates?
Small tests with realistic fixtures; assert on exact contract violations; run the noisy gates in monitor for a sprint before blocking.
What should I log?
Correlation ID, route, principal, decision, reason, latency. Mask PII. Keep debug logs short-lived.
How do I detect IDOR early?
Alert on 403 spikes and sequential ID access. Pair with schema-violation alerts for the same route.
How do I secure webhooks?
HMAC signatures with timestamps; 5-minute replay window; store nonce per event; idempotent handlers.
How do I retrofit a legacy service?
Wrap at gateway with token validation + schema checks + limits. Add ownership in code next sprint. Plan version retirement.
What is “request normalization”?
Canonicalize field order, trim whitespace, lower-case headers where safe, de-dup params, bound array sizes. Prevent near-duplicate floods.
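A short sketch of the payload side of this: trim strings, bound array sizes, and serialize with sorted keys so that two bodies differing only in field order or stray whitespace produce the same fingerprint. The function names and the 100-item bound are assumptions for illustration; header and query-parameter handling is left out.

```python
import hashlib
import json

MAX_ARRAY_ITEMS = 100  # assumed bound; tune per route


def normalize(value):
    """Trim strings and keys, and reject oversized arrays, recursively."""
    if isinstance(value, dict):
        return {k.strip(): normalize(v) for k, v in value.items()}
    if isinstance(value, list):
        if len(value) > MAX_ARRAY_ITEMS:
            raise ValueError("array_too_large")
        return [normalize(v) for v in value]
    if isinstance(value, str):
        return value.strip()
    return value


def fingerprint(payload: dict) -> str:
    """Hash the canonical form; equal fingerprints mean near-duplicate payloads."""
    canonical = json.dumps(normalize(payload), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

Counting repeated fingerprints per principal within a short window is one way to spot near-duplicate floods that a plain request counter would treat as distinct traffic.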
Where do I store evidence for audits?
In the repo with policy code, test results, and dashboards as artifacts. Automate a weekly export.
How do I handle agent/LLM traffic?
Allowlist tools/routes, cap outputs, scrub prompts from logs, monitor vector-store access for sensitive terms.
When do I block vs monitor?
Monitor in lower envs and for new rules; block high-risk routes in prod. Tie to violation budgets with rollback rules.
How do I keep templates from drifting?
CI checks that enforce template files and versions; a quarterly “template sync” PR across services.
What’s the minimum “secure by default” scaffold?
OpenAPI/SDL, JWT helper, ownership helper, request validator, rate-limit headers, idempotency + replay guards, structured logging, basic neg tests.
How do I make error messages safe?
Return codes and generic reasons; put detail in logs with masking; never echo raw queries or secrets.
Any tips for partner sandboxes?
Same policies as prod, lower thresholds; seed test data; rotate sandbox credentials often; publish a replay policy.
How do I prove ROI to leadership?
Show fewer incidents, faster MTTR, stable launch metrics, shorter security questionnaires, and predictable cost while traffic grows.
How do I get teams to adopt this?
Make the secure path the easy path: templates, helpers, and passing CI by default. Recognition for teams that remove dead routes and hit budgets.