Chasing the vibe-coded bottlenecks
I’ve been building Slips — a collaborative, real-time task list app — as part of a series of lightweight, self-hostable tools. One constraint: the backend would be written primarily by an LLM, with me in the role of technical director rather than primary author. The result was three functionally identical server implementations: Node.js, Go, and Swift. Same API surface, same SQLite persistence, same WebSocket sync protocol. Drop-in replacements for each other.
It started as a productivity experiment. It ended up being a pretty good audit of where LLMs fall short, what “idiomatic” code actually costs at runtime, and a few embarrassing assumptions on my part.
The Setup
Slips has real-time sync over WebSockets using Automerge and CRDTs. The HTTP side is standard CRUD — create a list, fetch it by share token, manage tasks. Every request that looks up a list derives an ID by hashing the token with SHA-256. Sounds boring. Turns out where you hash things matters a lot.
All three backends run SQLite in WAL mode (more on that later), on the same machine (Apple M1 Pro, macOS), benchmarked with a Go tool running 200 ops at 10 concurrent workers.
Node.js came first and got the most hand-holding — detailed prompts, iterative corrections, explicit direction. Go and Swift were reimplementations with lighter prompting: “here’s what this does, look at that fugly JS code, use the language the way it was meant to be used.”
Node.js
The stack was reasonable: Express, better-sqlite3, ws. Standard stuff, nothing surprising.
The first benchmark came back at around 1,400 ops/s on sequential POSTs. Not embarrassing, but also not where I expected to land.
Web Crypto surprise
The token hashing function — called on every single request — was using crypto.subtle.digest. That’s the modern, standards-compliant Web Crypto API. It’s also async, so every request was dispatching to the thread pool and resolving a Promise to compute one SHA-256 hash.
Swapping to createHash('sha256') from Node’s built-in crypto module — a synchronous C++ binding, direct calls to OpenSSL — took sequential POST throughput from ~1,153 to ~2,848 ops/s. 2.5x from one function call. Yikes.
The same file had an interesting token generation implementation: a spread operator into btoa() followed by three regex passes to sanitize the base64 output. What in the holy convoluted batman is this. One randomBytes(32).toString('base64url') call does the same thing. As if it just tried things until it made it work — which tracks, because this implementation needed the most hand-holding and still seemed to not fully grasp the spec even after we worked it out upfront.
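For comparison, here is roughly what the one-liner looks like on the Go side. This is a minimal sketch, not the code from the repo: the function name `newShareToken` and the 32-byte length are assumptions carried over from the Node.js call above.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// newShareToken returns a URL-safe token built from 32 random bytes.
// The name and the byte count are assumptions made for this sketch.
func newShareToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	// RawURLEncoding is unpadded base64url: only A-Z, a-z, 0-9, '-' and '_'.
	// No regex clean-up pass is needed afterwards.
	return base64.RawURLEncoding.EncodeToString(buf), nil
}

func main() {
	tok, err := newShareToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(tok, len(tok))
}
```

The encoding already produces a URL-safe alphabet, so there is nothing left to sanitize.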
Logging overhead
Debug output to stderr with the default log level was eating a measurable slice of request time. LOG_LEVEL=error recovered 2,975 ops/s — matching the historical best. Not a criticism of the LLM; just a reminder that benchmarking with debug logging on is benchmarking the logger.
SQLite WAL mode
WAL decouples readers from writers: readers get a consistent snapshot while a write is in progress, instead of serializing behind it. Writers still take turns (SQLite allows one at a time), but each commit appends to the log instead of journaling the old pages and rewriting the database file in place, so the turns are shorter. With 10 concurrent workers this matters a lot. Measured on the Go backend: ~5x improvement on both sequential and concurrent writes. All three backends ended up with WAL enabled; Go and Node needed an explicit PRAGMA, Swift’s GRDB enables it automatically via DatabasePool. Why that isn’t the default to begin with, I have no idea.
Final Node.js: 2,975 ops/s sequential POST, ~1,024 ops/s GET by token, 121.8 MB RSS idle.
Swift
I expected Swift to be the fastest: ARC instead of GC, native ARM64, Apple hardware, Apple frameworks on Apple silicon. Seemed like a slam dunk.
First benchmark: 139 ops/s on GET by token. Go was doing over 3,000. Uh-oh.
The actor hop
The LLM structured the backend as two actors: API, which handles routing, and Store, which it calls into for persistence. That sounds like a natural Swift 6 architecture: separate concerns, compile-time safety, clean. It’s also two cooperative executor suspensions per request, each costing ~5–15μs. The code was correct, it passed strict concurrency checking, and it was 20× slower than Go.
The fix: turn on Swift 6 mode with strict concurrency checking, use lightweight sendable types, value types instead of reference types, keep one Store actor for actual database access. Let the compiler provide additional guardrails the LLM had to follow. Throughput jumped to ~3,300 ops/s.
33 allocations for a hex string
Every request hex-encodes a SHA-256 hash. The LLM used hash.map { String(format: "%02x", $0) }.joined(), a pattern you’ll find in tutorials, in Apple’s docs, all over Stack Overflow.
String(format:) routes through CFStringCreateWithFormat. For each of the 32 bytes: box the UInt8 into NSNumber, allocate an autoreleased CFString, bridge back to a Swift String, land it in an intermediate array. 33 heap allocations per call, all touching the Obj-C autorelease pool. At 5,000 req/s, that’s 165,000 unnecessary allocations per second.
Replaced with a pure-Swift nibble lookup table writing into a pre-allocated [UInt8] buffer, final string via String(decoding:as:). One allocation. GET by token improved by 57%.
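The lookup-table trick itself is not Swift-specific. Below is a minimal sketch of the same technique in Go rather than the Swift code from the project; the helper name `hexEncode` is made up for illustration. Each byte is split into two 4-bit nibbles, and each nibble indexes into a 16-character table.

```go
package main

import "fmt"

// hexDigits maps a 4-bit nibble (0-15) to its lowercase hex character.
const hexDigits = "0123456789abcdef"

// hexEncode is a hypothetical helper: it writes two characters per input
// byte into one pre-sized buffer, then converts that buffer to a string.
func hexEncode(src []byte) string {
	dst := make([]byte, len(src)*2)
	for i, b := range src {
		dst[i*2] = hexDigits[b>>4]     // high nibble
		dst[i*2+1] = hexDigits[b&0x0f] // low nibble
	}
	return string(dst)
}

func main() {
	fmt.Println(hexEncode([]byte{0x00, 0x0f, 0xa5, 0xff})) // prints 000fa5ff
}
```

No format strings, no per-byte intermediate strings: the cost is one buffer plus the final string.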
The same Foundation-bridging antipattern showed up in two more places: token validation using CharacterSet (bridges to NSCharacterSet) and token generation using replacingOccurrences(of:with:) (bridges to NSString). All converted.
Going further
After the lookup table, two avoidable allocations remained: the intermediate [UInt8] buffer and a Data copy of the input token. String(unsafeUninitializedCapacity:initializingUTF8With:) writes directly into the String’s storage. withContiguousStorageIfAvailable reads the token’s UTF-8 bytes without copying (Swift 5+ stores strings as UTF-8 internally).
Across 100,000 iterations on M1:
| Version | Time | Allocations |
|---|---|---|
| Original (map + String(format:)) | 36,844 ns/op | 33+ |
| Lookup table with [UInt8] buffer | 7,467 ns/op | 3 |
| String(unsafeUninitializedCapacity:) | 6,051 ns/op | 2 |
| + no-copy UTF-8 input | 5,617 ns/op | 1 |
Roughly 85% less time per op (36,844 ns down to 5,617 ns) from eliminating Obj-C bridging and copies. Swift can be fast. Getting there means knowing which APIs stay in Swift and which ones quietly drop into Obj-C, and that knowledge doesn’t come from the docs.
Final Swift: 5,049 ops/s GET by token, 8,620 ops/s concurrent POSTs, 56.1 MB RSS. Caveat: Swift’s NIO event loop hit anomalous CPU usage (up to 576% at idle) in some sessions. Historical best was 14,030 ops/s on concurrent writes.
Go
The LLM picked well from the start: net/http, mattn/go-sqlite3 (CGO), gorilla/websocket. Most of the optimization work here was pushing a good baseline further rather than fixing structural mistakes.
WAL + CGO
Switching from modernc.org/sqlite (pure Go, machine-translated from the SQLite C source) to mattn/go-sqlite3 (CGO, native library) bumped GET throughput from ~3,608 to ~6,861 ops/s: the CGO version gets the full optimized SQLite C library. WAL on top of that roughly doubled write throughput.
Per-token shard mutexes
One sync.RWMutex protecting all provider state meant 10 concurrent goroutines writing to 10 different lists still serialized. Split into 64 shard-level mutexes keyed by FNV-1a hash of the share token — the provider lock is held briefly for map lookups, the heavy work (Automerge, crypto, SQLite) runs under just the per-token lock.
| Metric | Before | After | Change |
|---|---|---|---|
| POST seq (ops/s) | 2,907 | 5,327 | +83% |
| POST c=10 (ops/s) | 9,122 | 15,677 | +72% |
| GET by token (ops/s) | 4,739 | 7,536 | +59% |
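The sharding described above can be sketched like this. It is a simplified illustration, not the actual provider code: the type `tokenLocks`, the helpers `shardIndex` and `withToken`, and the use of 32-bit FNV-1a with a plain sync.Mutex per shard are assumptions for the example. The short-lived provider lock around the map lookup is left out.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shardCount matches the 64 shards mentioned in the text.
const shardCount = 64

// tokenLocks is a hypothetical type: a fixed array of mutexes, one per shard.
type tokenLocks struct {
	shards [shardCount]sync.Mutex
}

// shardIndex hashes the token with 32-bit FNV-1a and maps it onto a shard.
func shardIndex(token string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(token)) // Write on a hash.Hash never returns an error
	return h.Sum32() % shardCount
}

// withToken runs fn while holding only the lock of the token's shard, so
// work on tokens in other shards is not blocked.
func (l *tokenLocks) withToken(token string, fn func()) {
	m := &l.shards[shardIndex(token)]
	m.Lock()
	defer m.Unlock()
	fn()
}

func main() {
	var locks tokenLocks
	tokens := []string{"list-a", "list-b", "list-c"}
	counts := make([]int, len(tokens)) // one counter per token

	var wg sync.WaitGroup
	for i, tok := range tokens {
		for w := 0; w < 10; w++ {
			wg.Add(1)
			go func(i int, tok string) {
				defer wg.Done()
				// Writers to the same token serialize; others run in parallel
				// unless they happen to share a shard.
				locks.withToken(tok, func() { counts[i]++ })
			}(i, tok)
		}
	}
	wg.Wait()
	fmt.Println(counts) // prints [10 10 10]
}
```

Two different tokens can still land in the same shard and contend with each other. With 64 shards and 10 workers that is rare enough not to matter, and a fixed array avoids having to manage a lock per list.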
fmt.Sprintf vs hex.EncodeToString
fmt.Sprintf("%x", h) uses reflection. hex.EncodeToString(h[:]) is a direct byte operation. Small per-call, measurable at throughput.
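A minimal sketch of the two variants side by side, assuming the ID is simply the hex-encoded SHA-256 of the share token. The function names `listIDSprintf` and `listIDHex` are made up for the example. Both return the same string; only the path they take to get there differs.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// listIDSprintf formats the [32]byte digest through fmt, which has to
// inspect the argument's type at runtime before it can print the bytes.
func listIDSprintf(token string) string {
	h := sha256.Sum256([]byte(token))
	return fmt.Sprintf("%x", h)
}

// listIDHex slices the digest and encodes it directly, byte by byte.
func listIDHex(token string) string {
	h := sha256.Sum256([]byte(token))
	return hex.EncodeToString(h[:])
}

func main() {
	a := listIDSprintf("example-token")
	b := listIDHex("example-token")
	fmt.Println(a == b, len(b)) // prints true 64
}
```

Since the output is identical, the swap is a drop-in change with no effect on stored IDs.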
Final Go: 6,866 ops/s GET by token, 12,122 ops/s concurrent POSTs, 44.5 MB RSS.
The Final Comparison
Tested 2026-05-25 on Apple M1 (arm64), macOS 26.0. Three backends, same session, fresh starts, clean databases.
| Metric | Go | Swift | Node.js |
|---|---|---|---|
| POST list (seq) ops/s | 4,370 | 3,301 | 2,975 |
| POST list P50 latency | 0.20ms | 0.25ms | 0.26ms |
| POST list (c=10) ops/s | 12,122 | 8,620 | 2,479 |
| GET by token (seq) ops/s | 6,866 | 5,049 | 1,024 |
| GET by token P50 latency | 0.13ms | 0.19ms | 0.69ms |
| Memory idle RSS | 44.5 MB | 56.1 MB* | 121.8 MB |
| Binary size | 10 MB | 17 MB | ~350 MB† |
*Swift RSS varies; historical range 27–56 MB.
†Node.js binary size includes node_modules.
What I got wrong
Swift won’t be fastest by default. The actor model is genuinely good — the safety guarantees are worth having — but the LLM naturally reaches for the most correct-looking structure, not the most profiled one. Two actors sounds right. Two cooperative suspensions per request at 5–15μs each is 139 ops/s.
Node.js won’t be dramatically slower. Sequential write gap is Go at 4,370 vs Node.js at 2,975 — 1.5x, not an order of magnitude. Concurrent writes are worse (12,122 vs 2,479), but that’s a fundamental architectural constraint: better-sqlite3 is synchronous and serializes on the event loop thread. When Node.js can hand work to C++ it does it efficiently. OpenSSL through the native crypto module isn’t a slow JavaScript wrapper. V8 has had a lot of investment. The runtime isn’t what’s slow.
Bravo to Node.js for holding its own, despite the icky, slow JavaScript it’s said to be running underneath.
Go won’t just be a nice middle ground. It won across the board, often by a significant margin. The stdlib is extremely well optimized: SHA-256 is native Go with hand-written assembly that uses the CPU’s SHA-2 instructions where available, hex encoding is direct byte ops with no reflection, and the HTTP server is Google’s own production-grade work. The LLM leaned on the stdlib wherever it could and the choices held up. Go’s single-minded approach to simplicity and performance tuning does make you want to hold everything else to the same standard.
The actual takeaway
In every case, the initial code looked correct. The Swift actor chain passed strict concurrency checking. Web Crypto is the officially recommended modern API. String(format:) is in Apple’s own documentation examples. None of these were bugs.
LLMs optimize for code that looks right and follows documented patterns — which is most of what’s been written. They don’t profile. They don’t have a feel for what a given abstraction costs.
What ended up being useful wasn’t writing more code. It was asking “what is this actually doing at runtime?” when something already looked clean, and being willing to measure rather than assume. And then having your assumptions turn out to be wrong anyway.
The Node.js version got the most guidance, as it was the first implementation. The more autonomous Go and Swift implementations had more interesting structural problems — “less hand-holding” just meant the LLM made its own decisions, and some of those were questionable.
Augmenting the coding environment’s system prompt, using proper guardrails in agent config, or reaching for ecosystem-specific best-practice files would probably steer the LLM faster. But the experiment was to see what the defaults are. And many of those defaults are bottlenecked by the model’s knowledge cutoff — which is a different and increasingly depressing story.
These benchmarks are single-machine, in-process, no network latency. The LLMs used were a mix of locally-hosted Qwen 3.6 27B, OpenCode Go’s DeepSeek V4 Flash, and some Claude Code.