Building Muntra: 1000 lines of Go, two days of nginx

I shipped Muntra this week, yay! It's an in-house, drop-in replacement for Umami written in Go. The code took a couple of focused evenings. The deploy took sixteen hours. This is the blog version of how that went. Of course I had (not just a little) help from AI, btw, duh!

This isn't a "how I 10x'd analytics" post. It's a war story about what happens when you write a small, tidy, GDPR-compliant-by-design service and then try to wire it into a real production system that has scar tissue from years of choices made by a previous version of yourself.

Act 1: Why I built it

I run a small monorepo of SvelteKit tenants on a single VPS. idunworks.com is one of them, along with a couple of variants hosted elsewhere. The whole thing fits comfortably in a few gigabytes of RAM except for one stubborn process: the Umami tracker. To have privacy you need to roll your own or give up your data to the big sharks. I don't particularly like sharks, nor paying to keep them off my back.

Umami works. For about 48 hours. Then its RSS creeps from a couple hundred megabytes toward 1 GB, the OOM Grim Reaper eventually visits, Docker restarts it, and the cycle repeats. The community-blessed fix is a cron job that restarts the container every 24 hours. That's not a fix, that's a confession.

The root cause isn't a bug, it's the architecture. Umami is a Next.js application that happens to do tracking, not a tracker that happens to have a dashboard. Next.js keeps a hot-module cache, a server-component cache, an image-optimization cache. Prisma keeps a separate Rust/WASM query engine plus a connection pool with its own buffers. None of these caches are bounded. Add five sites' worth of dashboard queries and Node's old-generation heap grows monotonically until GC can't keep up. That's a textbook Node memory curve. It will never be fixed by a patch release because it isn't a bug.

So I asked the cheapest question in engineering: what's the smallest thing that does ingestion plus a query API, and bring-your-own dashboard? The answer turned out to be about a thousand lines of Go.

The constraints I picked:

Single distroless binary. Image ≤ 25 MB. Steady-state RSS ≤ 50 MB across all tenants.
Redis as a bounded buffer (hard cap 128 MB, fail loud if it overflows).
Postgres as the durable store.
No in-process caches that can grow without an explicit eviction policy.
GDPR-clean by construction, not by configuration.

I have enough trouble keeping my own memory bounded these days. I don't need the same problem on my servers.

Ah, almost forgot to mention: should you ever need help with the most privacy-oriented and compliant tracking system there is, I'm open to suggestions. And yes, I can give you an upfront quote with zero hidden fees.

Act 2: The build (and it gets even nerdier from this point)

The build was the fun part. Single Go module, around a thousand lines of code, structured as:

internal/collect/handler.go — parse incoming JSON, hash the visitor, push to Redis.
internal/salt/salt.go — daily-rotating salt, race-safe via SETNX with 25-hour TTL.
internal/flush/worker.go — pops batches off Redis, slams them into Postgres with pgx.CopyFrom, requeues on failure.
internal/rollup/worker.go — re-upserts the current and previous hour/day buckets into pre-aggregated tables every 15 minutes.
internal/api/handler.go — bearer-authed /stats, /timeseries, /breakdown, /live endpoints.
internal/tracker/tracker.js — fifty lines of vanilla JS, embedded with go:embed, served at /script.js. sendBeacon first, fetch fallback, patched history.pushState for SPA navigation.
internal/migrate/migrate.go — walks schema/*.sql on boot, runs each via pgx.Exec, every statement idempotent (CREATE … IF NOT EXISTS, ADD COLUMN IF NOT EXISTS). No version table. Just re-run everything every boot. If it took 50 ms last time, it will take 50 ms this time, and there's no possible "did this DDL run?" footgun.
internal/auth/auth.go — bearer middleware with subtle.ConstantTimeCompare.
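
To make the "no version table, just re-run everything" idea from migrate.go concrete, here is a minimal sketch of such a walker. It is not Muntra's actual code: the execFunc type stands in for a pgx Exec call so the example has no third-party imports, and the file names and SQL in demo are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// execFunc runs one SQL script. In a real service this would wrap the
// database connection's Exec; here it is a plain function (an assumption)
// so the sketch stays stdlib-only.
type execFunc func(ctx context.Context, sql string) error

// applyAll runs every *.sql file in fsys in lexical order and returns the
// names it executed. There is no version table: each file is expected to be
// idempotent, so running all of them on every boot is safe.
func applyAll(ctx context.Context, fsys fs.FS, exec execFunc) ([]string, error) {
	names, err := fs.Glob(fsys, "*.sql")
	if err != nil {
		return nil, err
	}
	sort.Strings(names)
	var done []string
	for _, name := range names {
		body, err := fs.ReadFile(fsys, name)
		if err != nil {
			return done, fmt.Errorf("read %s: %w", name, err)
		}
		if err := exec(ctx, string(body)); err != nil {
			return done, fmt.Errorf("apply %s: %w", name, err)
		}
		done = append(done, name)
	}
	return done, nil
}

// demo writes two hypothetical migration files to a temp dir and applies
// them with a fake exec that accepts everything.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "schema")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	files := map[string]string{
		"002_add_referrer.sql": "ALTER TABLE events ADD COLUMN IF NOT EXISTS referrer text;",
		"001_events.sql":       "CREATE TABLE IF NOT EXISTS events (id bigserial PRIMARY KEY);",
	}
	for name, sql := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(sql), 0o644); err != nil {
			return nil, err
		}
	}
	noop := func(_ context.Context, _ string) error { return nil }
	return applyAll(context.Background(), os.DirFS(dir), noop)
}

func main() {
	applied, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(applied)
}
```

The ordering comes purely from file names, which is why a numeric prefix matters. The safety comes purely from the IF NOT EXISTS discipline in the SQL itself; the walker does nothing to enforce it.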

The visitor hash is the GDPR-clean trick: sha256(ip || user_agent || daily_salt), where the salt is rotated every UTC midnight and the old value is allowed to expire from Redis. After 25 hours nobody, not me and not someone holding a subpoena, can correlate a visitor across days. The trade-off is that "unique visitors" is a daily concept, never a lifetime one. That's a feature, not a limitation.
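
The hash itself is tiny. A minimal sketch, assuming the three inputs are simply fed into SHA-256 in order; the zero-byte separator between fields is my own addition to keep field boundaries unambiguous, and the salts here are placeholder strings rather than anything read from Redis:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// visitorHash returns the hex SHA-256 of ip, userAgent and salt.
// A zero byte separates the fields (an assumption, not from the post) so
// that shifting a character from one field to the next changes the hash.
func visitorHash(ip, userAgent string, salt []byte) string {
	h := sha256.New()
	h.Write([]byte(ip))
	h.Write([]byte{0})
	h.Write([]byte(userAgent))
	h.Write([]byte{0})
	h.Write(salt)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	monday := []byte("placeholder-salt-monday")
	tuesday := []byte("placeholder-salt-tuesday")

	a := visitorHash("203.0.113.7", "Mozilla/5.0", monday)
	b := visitorHash("203.0.113.7", "Mozilla/5.0", monday)
	c := visitorHash("203.0.113.7", "Mozilla/5.0", tuesday)

	fmt.Println(a == b) // true: same visitor, same day
	fmt.Println(a == c) // false: same visitor, next day's salt
}
```

Once Monday's salt has expired there is no input left from which Monday's hashes could be recomputed, which is the whole point.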

Per-site origin validation went in from the start: a MUNTRA_SITE_ORIGINS env map binds each site_id to its allowed Origin hosts.

MUNTRA_SITE_ORIGINS=site1:site1.com,site1:site2.com,site3:site3.com

If the Origin header on /collect doesn't match the registered hosts for the claimed site_id, the handler returns 403. Without this, anyone can spoof events from anywhere and pollute someone else's stats.
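
As a rough illustration of that check, here is a sketch that parses the site_id:host format shown above and validates an Origin header against it. The type and function names are hypothetical, and comparing on hostname only (ignoring scheme and port) is an assumption about how strict the match should be:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// siteOrigins maps a site_id to the set of hosts allowed to send events for it.
type siteOrigins map[string]map[string]bool

// parseSiteOrigins parses "site_id:host,site_id:host,...". A site_id may
// appear more than once to register several hosts.
func parseSiteOrigins(raw string) (siteOrigins, error) {
	out := siteOrigins{}
	for _, pair := range strings.Split(raw, ",") {
		pair = strings.TrimSpace(pair)
		if pair == "" {
			continue
		}
		site, host, ok := strings.Cut(pair, ":")
		if !ok || site == "" || host == "" {
			return nil, fmt.Errorf("bad entry %q, want site_id:host", pair)
		}
		if out[site] == nil {
			out[site] = map[string]bool{}
		}
		out[site][strings.ToLower(host)] = true
	}
	return out, nil
}

// allowed reports whether the Origin header value belongs to siteID.
// Anything unparseable, empty, or unknown is rejected.
func (s siteOrigins) allowed(siteID, origin string) bool {
	u, err := url.Parse(origin)
	if err != nil || u.Hostname() == "" {
		return false
	}
	return s[siteID][strings.ToLower(u.Hostname())]
}

func main() {
	origins, err := parseSiteOrigins("site1:site1.com,site1:site2.com,site3:site3.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(origins.allowed("site1", "https://site2.com")) // true
	fmt.Println(origins.allowed("site3", "https://site1.com")) // false: handler would return 403
	fmt.Println(origins.allowed("site1", ""))                  // false: missing Origin
}
```

The Origin header is set by the browser, so this stops other sites' pages from polluting your stats; it does nothing against a script that forges the header outright.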

Two small details that bit me but were worth the cost:

pgx v5.9.2 requires Go 1.25. The toolchain auto-bumped on first build and I didn't notice for ten (ok 100) minutes.
Distroless static-debian12 has no shell, so Docker's HEALTHCHECK CMD curl … doesn't work. The fix was to add a muntra healthcheck subcommand so the binary can check itself. The healthcheck command in the Dockerfile is just ["/muntra", "healthcheck"].
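
The self-check pattern looks something like the sketch below. The listen address, port 8080 and the /health path are assumptions for illustration; only the "healthcheck" subcommand name comes from the Dockerfile line above.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// healthcheck performs one GET against url and returns a process exit
// code: 0 for HTTP 200, 1 for anything else (including network errors).
func healthcheck(url string) int {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "healthcheck:", err)
		return 1
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Fprintln(os.Stderr, "healthcheck: status", resp.StatusCode)
		return 1
	}
	return 0
}

func main() {
	// "muntra healthcheck" probes the already-running server and exits.
	if len(os.Args) > 1 && os.Args[1] == "healthcheck" {
		os.Exit(healthcheck("http://127.0.0.1:8080/health")) // address is an assumption
	}
	// Normal startup (HTTP server, flush and rollup workers) would go here.
	fmt.Println("serving")
}
```

Docker only looks at the exit code, so the binary needs no shell, no curl, and no extra files in the image.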

The final binary is 21.5 MB. That's worth pausing on. Umami's container image is around 300 MB before any application state. Muntra's entire binary is smaller than Umami's package.json after pnpm install. Steady-state RSS in production has been hovering around 38 MB across all four live tenants. The Redis buffer has never exceeded 4 MB.

End-to-end this part is satisfying. The hard part isn't tracking. The hard part is not being Next.js.

Act 3: The deploy from hell

Caveat upfront, because it matters: every single problem in this section was self-inflicted. None of them were Go's fault, Muntra's design's fault, Docker's fault, or nginx's fault. They were all mine. That's why this section exists. If the deploy had gone smoothly I'd have nothing useful to say.

My deploy model has worked for years. The monorepo lives on my Mac. make ship SITE=<tenant> syncs the relevant subtree to .standalone/<tenant>/. That standalone copy gets pushed to a per-tenant GitHub repo. The server pulls from that repo. It's a chain — monorepo → .standalone → GitHub → server — and only adjacent links can be diffed against each other. The reason for that complexity is unrelated to Muntra, but Muntra sat on top of it.

Failure one: env config drift.

I added PUBLIC_MUNTRA_URL=https://site1.com/muntra to env/site1.config in the monorepo, ran make ship SITE=site1, server pulled, container env on the server… empty. Two hours of "why is my env var empty" before I bothered to actually read my own sync script. The script only propagated env/tyr.config and env/shared.secrets.template explicitly. Other env/<site>.config additions were silently dropped.

The fix was an additive merge: for each KEY=VALUE line in the monorepo's env config that's missing from the server side, append it; never overwrite existing server-side values, because a few tenants have hand-tuned overrides like DISABLE_CSRF_CHECK that absolutely must survive. Five lines of bash. Two hours to find. I should have read the script before guessing.
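
The real fix was bash, but the rule is easier to see spelled out. Here is the same additive-merge logic sketched in Go, with hypothetical function names; the example values are illustrative only:

```go
package main

import (
	"fmt"
	"strings"
)

// envKey extracts KEY from a "KEY=VALUE" line. Blank lines and comments
// are reported as not-a-key.
func envKey(line string) (string, bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return "", false
	}
	k, _, ok := strings.Cut(line, "=")
	k = strings.TrimSpace(k)
	return k, ok && k != ""
}

// mergeAdditive appends every KEY=VALUE line from source whose KEY is not
// already present in target. Existing target lines are never changed, so
// hand-tuned server-side overrides survive.
func mergeAdditive(target, source []string) []string {
	seen := map[string]bool{}
	for _, line := range target {
		if k, ok := envKey(line); ok {
			seen[k] = true
		}
	}
	out := append([]string(nil), target...)
	for _, line := range source {
		k, ok := envKey(line)
		if !ok || seen[k] {
			continue
		}
		seen[k] = true
		out = append(out, line)
	}
	return out
}

func main() {
	server := []string{"DISABLE_CSRF_CHECK=1", "PUBLIC_SITE_ID=abc"}
	monorepo := []string{"PUBLIC_SITE_ID=zzz", "PUBLIC_MUNTRA_URL=https://site1.com/muntra"}
	for _, line := range mergeAdditive(server, monorepo) {
		fmt.Println(line)
	}
}
```

The flip side of never overwriting is that a value changed in the monorepo will not reach a server that already has the key. That is the price of protecting the overrides.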

Failure two: BuildKit cache served a stale .env layer.

With env propagation fixed, the container env on the server now had PUBLIC_MUNTRA_URL set correctly. But the built SvelteKit bundle still had features.muntra = !1 (minifier shorthand for false). The sidebar link to the dashboard was hidden. The admin page showed "not configured."

This one cost me four or five hours, and the diagnostic loop was genuinely confusing: runtime env vars correct, docker compose config showed correct resolved build args, but the baked JavaScript bundle was stale. Vite bakes import.meta.env.PUBLIC_X at build time from the .env file in the build context. My Dockerfile wrote that file via:

RUN echo "PUBLIC_X=${PUBLIC_X}" > .env && echo "PUBLIC_Y=${PUBLIC_Y}" >> .env && ...

BuildKit's cache for that RUN step keys on the literal command text plus the previous image state. When the ARG value changed but the command text and earlier layer hashes didn't shift, BuildKit got a cache hit. The cached layer still had the old .env (with PUBLIC_MUNTRA_URL= empty). Vite happily baked an empty string. features.muntra compiled to false. Production showed no dashboard.

The fix was structural: rewrite the env-file-writing step as a single printf so the command text becomes a different string when the args change, and stop relying on cache behavior I'd misunderstood:

RUN printf '%s\n' \
  "PUBLIC_MUNTRA_URL=${PUBLIC_MUNTRA_URL}" \
  "PUBLIC_SITE_ID=${PUBLIC_SITE_ID}" \
  > .env

Then I made --no-cache the default in scripts/deploy-ui.sh. The two-to-three-minute extra build time is cheap insurance against another evening like that one.

Failure three: bypassing my own Makefile.

Out of frustration with the cache problem, I ssh'd into the server and ran a manual docker compose --env-file env/site1.config build --no-cache idun-ui from /opt/site1.com. That command worked. But it bypassed the color-aware $(DC) macro the Makefile uses — the one that dispatches against the active blue/green compose project name. Without that, my manual compose invocation picked the default project name, recreated the UI container under a new project, orphaned it from the running blue stack's DB, and the live site went 500.

Five minutes of "did I just take site1 down" panic. Recovery was make deploy-ui from /opt/site1.com, which restored the correct project naming, port binding, and DB linkage. The site came back up. I was at least 2 months older... ok, maybe not :)

The entire reason the Makefile exists is to wrap exactly this kind of state. Don't bypass it. Don't bypass it even if you "know better." Especially don't bypass it if you think you know better, because the moments you think you know better are precisely the moments your wetware is too tired to model multi-project Docker composition state correctly.

I deleted a note in my own memory called "office-hours surgical recreate", a shortcut I'd been telling myself was sometimes okay. Its existence in my head was actively making me worse at deploying.

Failure four: nginx /muntra/ block placed in the wrong server stanza.

Each tenant's nginx vhost has two server blocks: an HTTP one that listens on :80 and 301s to HTTPS, and an HTTPS one on :443 ssl that does the actual proxying. My provisioning script used awk to insert a new /muntra/ location block "before the first location / { it finds." The first location / { it finds is inside the HTTP redirect block. So HTTPS requests for /muntra/health fell through to the HTTPS block's catch-all and returned a SvelteKit 404 JSON page.

This was the most "obvious in hindsight" bug of the entire day. Ninety minutes of diagnostics because grep '/muntra/' /etc/nginx/sites-enabled/site1.com confirmed the line was in the file — just in the wrong half of the file. Fix: an awk that tracks in_https = 1 once it sees listen .* 443, and only inserts inside that server's location / {. Plus, idempotency — the script now removes any existing /muntra/ blocks before inserting, so it self-corrects on tenants where the broken version already ran.
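
For anyone who would rather read Go than awk, here is a sketch of the same two-step logic: strip any existing /muntra/ block, then insert before the first location / { that appears after a listen … 443 line. It is line-oriented and assumes the HTTP server block comes before the HTTPS one, as described above; the proxy_pass targets in the demo are made up.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	listen443 = regexp.MustCompile(`^\s*listen\s+.*\b443\b`)
	rootLoc   = regexp.MustCompile(`^\s*location\s+/\s*\{`)
	muntraLoc = regexp.MustCompile(`^\s*location\s+/muntra/\s*\{`)
)

// stripMuntra drops every "location /muntra/ { ... }" block, wherever it
// sits, by counting braces until the block closes.
func stripMuntra(lines []string) []string {
	var out []string
	depth := 0
	for _, l := range lines {
		if depth == 0 && !muntraLoc.MatchString(l) {
			out = append(out, l)
			continue
		}
		depth += strings.Count(l, "{") - strings.Count(l, "}")
	}
	return out
}

// insertMuntra removes old /muntra/ blocks, then inserts block just before
// the first "location / {" seen after a "listen ... 443" line.
func insertMuntra(conf, block string) string {
	var out []string
	inHTTPS, inserted := false, false
	for _, l := range stripMuntra(strings.Split(conf, "\n")) {
		if listen443.MatchString(l) {
			inHTTPS = true
		}
		if inHTTPS && !inserted && rootLoc.MatchString(l) {
			out = append(out, strings.Split(block, "\n")...)
			inserted = true
		}
		out = append(out, l)
	}
	return strings.Join(out, "\n")
}

func main() {
	conf := "server {\n    listen 80;\n    location / {\n        return 301 https://$host$request_uri;\n    }\n}\n" +
		"server {\n    listen 443 ssl;\n    location / {\n        proxy_pass http://127.0.0.1:5176;\n    }\n}"
	block := "    location /muntra/ {\n        proxy_pass http://127.0.0.1:8080/;\n    }"
	fmt.Println(insertMuntra(conf, block))
}
```

Because the strip step runs first, applying it twice gives the same file as applying it once, and a block that an earlier broken run left in the :80 server gets cleaned up along the way.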

Failure five: backup files in sites-enabled.

My provisioning script did the responsible thing — cp $VHOST $VHOST.bak.$(date +%s) before editing. That backup landed at /etc/nginx/sites-enabled/theoutdoorhub.eu.bak.1778573790. nginx parses every file in sites-enabled/ as a vhost. Result: "conflicting server name" warnings on every reload, and the backup file competing with the patched one for theoutdoorhub.eu traffic — whichever loaded first won the race. Fix: backups go to /var/backups/nginx-muntra/, which nginx doesn't read.

I had a moment of dark humour there. Years of writing "always back up the file you're editing" as a rule, and the rule itself was a bug.

Failure six: the script didn't compute derived ports.

The provisioning script restarted idun-ui via docker compose. That compose file needs UI_PORT, DB_PORT, IDUN_API_PORT, etc., all of which are derived from PORT_OFFSET (e.g., UI_PORT = 5176 + PORT_OFFSET). The script sourced env/<site>.config, which has PORT_OFFSET, but never computed the derived ports. Compose substituted empty into the ports: block. Docker auto-assigned an ephemeral port — 32871 or whatever happened to be free. nginx's upstream still pointed to the expected port. 502s until I caught it.
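
The arithmetic is trivial; the bug was that nothing refused to run without it. A small sketch of "derive the port or fail loudly", using only the one formula given above (UI_PORT = 5176 + PORT_OFFSET). The function name is hypothetical, and the other ports would follow the same pattern with their own base values:

```go
package main

import (
	"fmt"
	"strconv"
)

// uiPortBase comes from the formula in the text: UI_PORT = 5176 + PORT_OFFSET.
const uiPortBase = 5176

// uiPort derives UI_PORT from the raw PORT_OFFSET value. An empty or
// malformed offset is an error rather than a silent empty substitution.
func uiPort(portOffset string) (int, error) {
	if portOffset == "" {
		return 0, fmt.Errorf("PORT_OFFSET is not set")
	}
	off, err := strconv.Atoi(portOffset)
	if err != nil || off < 0 {
		return 0, fmt.Errorf("PORT_OFFSET %q is not a non-negative integer", portOffset)
	}
	return uiPortBase + off, nil
}

func main() {
	for _, raw := range []string{"3", ""} {
		port, err := uiPort(raw)
		if err != nil {
			fmt.Println("refusing to run compose:", err)
			continue
		}
		fmt.Printf("UI_PORT=%d\n", port)
	}
}
```

An error at this point costs a few seconds. An empty substitution in the ports: block costs a run of 502s.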

The fix is technically a copy-paste of the port-derivation block from scripts/deploy-ui.sh. The better fix is to stop writing new scripts that recompute things deploy-ui.sh already correctly computes. Which is the same lesson as failure three, wearing a different shirt.

Failure seven: blue/green DB binding broke a full blue-green deploy.

Trying to force a clean rebuild, I ran a full make deploy SITE=site1 — the proper blue-green path that brings up a green project sharing blue's DB. Green's skjold-api crashed at startup trying to reach blue's DB at host.docker.internal:5473. Connection refused. Blue's DB container is bound to 127.0.0.1:5473 on the host — loopback only. Containers going through the docker0 bridge can't reach a loopback-only listener on the host. This was a pre-existing latent bug that would have surfaced the next time anyone tried to deploy site1 via the blue-green path. My deploy just happened to be the one that found it. Filed for a separate session. Reverted to a same-project rebuild.

Failure eight: rebuilding shared services pulled a newer SQLAlchemy that broke heimdall.

Configuring make shared-rebuild to also rebuild Muntra, I rebuilt the shared stack from scratch. That pulled a newer SQLAlchemy that now hard-errors when pool_size and max_overflow are passed alongside a SQLite URL (NullPool doesn't accept pool tuning args). Older versions warned. The new version refuses. Heimdall, which has nothing to do with Muntra, crash-looped. Two-line fix in heimdall to only pass pool args if the URL isn't SQLite, but for forty-five minutes I thought I'd somehow broken the translation service from a Go analytics deploy.

There were probably another four or five small failures I'm not going to enumerate — typos, an ssh hostname that didn't resolve for one diagnostic command, a heredoc mangled through ssh that produced a syntactically corrupt config, a stash/pop on the server that left a real merge conflict I had to resolve by hand. Cumulatively, sixteen hours. Maybe more — I stopped looking at the clock around hour twelve.

The meta-lesson is uncomfortable and worth saying clearly:

I didn't have a bug. I had a class of bugs. Every individual failure had its own simple fix, but the underlying pattern — "I wrote the abstraction and then bypassed it" — was the same thing five times in a row. The fix isn't a better script. The fix is to actually use the abstraction every time, including the times when I'm frustrated and tempted to drop one layer down.

That's not a profound observation. Every engineer learns it eventually. I learned it again on a Sunday.

Act 4: Honest comparison and what I'd do differently

Muntra vs Umami. Muntra is roughly twenty times smaller in RAM and image size. Muntra has no web dashboard; that's bring-your-own. Umami has a polished React UI that handles ninety percent of what most people need out of the box. If you don't want to write a dashboard, use Umami and accept the memory cost. It's a reasonable trade.

Muntra vs Plausible. Plausible is also pleasant, also self-hostable, and unlike Muntra has a real product team and a real UI. Pick Plausible if you'd rather pay for the cloud version or run a heavier stack (self-hosted Elixir plus ClickHouse). Muntra is for people who'd rather write a thousand lines of Go than run someone else's web app.

Muntra vs Matomo. Not a fair fight. Different category. Matomo is an enterprise analytics suite. Muntra is a pinhole camera.

What I'd do differently if I started over:

Don't extend a monorepo's existing per-tenant deploy script. Start with an isolated standalone repo where the deploy path is direct — no sync layer, no fork-out to standalones. Most of Act 3's pain existed because of how my multi-tenant deploy already worked, not because of anything in Muntra's design. A greenfield service should get a greenfield deploy.
--no-cache Docker builds by default for anything that takes env-driven build args. Two to three minutes of build time is nothing compared to debugging a cached stale layer that's silently lying to you.
Origin validation from day one, not "after I noticed anyone could spoof events from anywhere." I had a window of maybe six hours where the staging endpoint would happily accept events from any origin. Nothing bad happened, but it could have.
Schema migrations on startup from day one. I almost shipped this with manual psql -f migrations as the documented path. The auto-apply walker is fifty lines of code and removes an entire class of "did this DDL run yet on tenant N?" footguns. It's the kind of thing that costs nothing to add early and is genuinely hard to add later.
Use my own Makefile. Every. Single. Time. Even when I think I know better.

Wrap

Muntra is on GitHub. AGPL-3.0. Roughly a thousand lines of Go. The whole binary is smaller than Umami's package.json. It's parallel-running alongside Umami across four production tenants right now while I confirm that event counts match — once that's been stable for a couple of weeks, Umami goes in the bin and I get back its 1 GB of RAM for something useful.

Best feature: in three months it'll still be using fifty megabytes of RAM. Worst feature: it doesn't have a UI yet. Both are by design. Oh, I forgot: the dashboard I built against Umami still works! I was too lazy to rewrite it, so Muntra's output uses the same naming. That means if you had Umami wired into your own dash, like I did, most things keep working after the switch. Don't tell anyone ;)

The code took two evenings. The deploy took two days. The takeaway isn't about analytics, it isn't about Go, and it isn't about Umami. It is that infrastructure-as-discipline (use your abstractions, trust your tools, don't bypass) matters more than infrastructure-as-cleverness. That's not a Muntra lesson. That's the lesson every shipping engineer pays for, again, on a Sunday, in their undies, swearing at nginx.

If you build something with it, or if you've hit one of the same walls, I'd love to hear about it.

Comments? Hit me!