Chasing the vibe-coded bottlenecks
Three backends, three languages, and what the benchmarks taught me about my own assumptions.
I’ve been building Slips, a collaborative, real-time task list app and part of a series of lightweight, self-hostable collaboration tools, with one constraint: the backend would be written primarily by an LLM, with me in the role of technical director rather than primary author. The result was three functionally identical server implementations: Node.js, Go, and Swift. Same API surface, same SQLite persistence, same WebSocket sync protocol. For the client apps, each backend must be essentially a drop-in replacement.
The experiment started as a productivity exercise. It turned into something more interesting: a systematic audit of where LLMs may fall short, what “idiomatic” code actually costs at runtime, and some wrong assumptions I had about language and runtime performance.
The Setup
Slips is a collaborative task list with real-time sync to multiple clients over WebSockets. The HTTP API covers the usual operations: create a list, fetch it by share token, manage tasks. The share token is a random base64 string, and the list ID is derived from it via SHA-256, so every request that looks up a list performs a hash computation. The WebSocket layer handles live sync with the Automerge library, which is built on CRDTs.
All three backends use SQLite as the persistence layer (WAL mode, which I’ll come back to), run on the same machine (Apple M1, macOS), and were benchmarked with a Go-based tool running 200 operations at 10 concurrent workers.
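For a picture of what “200 operations at 10 concurrent workers” means in practice, here is a minimal sketch of such a harness in Go. It is not the actual benchmark tool: RunBench and the stand-in operation are hypothetical, and a real run would replace the sleep with an HTTP request against one of the backends.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// RunBench (hypothetical name) executes op totalOps times, spread across
// `workers` goroutines, and returns throughput in ops/s plus the error count.
func RunBench(totalOps, workers int, op func() error) (opsPerSec float64, failures int64) {
	// Pre-fill a queue with one entry per operation, then close it so
	// workers exit once the queue is drained.
	jobs := make(chan struct{}, totalOps)
	for i := 0; i < totalOps; i++ {
		jobs <- struct{}{}
	}
	close(jobs)

	var wg sync.WaitGroup
	var failed atomic.Int64
	start := time.Now()
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				if err := op(); err != nil {
					failed.Add(1)
				}
			}
		}()
	}
	wg.Wait()
	elapsed := time.Since(start)
	return float64(totalOps) / elapsed.Seconds(), failed.Load()
}

func main() {
	var calls atomic.Int64
	rate, failures := RunBench(200, 10, func() error {
		calls.Add(1)
		time.Sleep(100 * time.Microsecond) // stand-in for an HTTP request
		return nil
	})
	fmt.Printf("%d ops, %d failures, %.0f ops/s\n", calls.Load(), failures, rate)
}
```

Setting workers to 1 gives the sequential numbers; 10 gives the c=10 numbers reported later.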
The Node.js implementation came first and received the most guidance - detailed prompts, iterative corrections, explicit direction toward correct behavior, and successful manual tests. The Go and Swift versions were reimplementations with lighter prompting: “here’s what this does, look at that fugly js code, use the target language the way it was meant to be used.”
Round 1: Node.js — Getting it Working
The Node.js backend reached a working state relatively quickly. Express for HTTP, better-sqlite3 for synchronous SQLite access, ws for WebSockets. The LLM’s choices were reasonable and pretty conventional. They matched what I’ve used in the past: good-enough defaults that don’t drag you down a rabbit hole of dependencies.
The first benchmark session revealed numbers that, on the surface, looked acceptable: around 1,400 ops/s on sequential POSTs. Not fast, but not obviously broken either. By the time benchmarking started, all three implementations already existed and were roughly functionally equivalent.
Web Crypto surprise
The function that derives a list ID from a share token - called on every single request - was implemented using the Web Crypto API (crypto.subtle.digest). This is the modern, standards-compliant way to hash data in JavaScript. It’s also asynchronous, which means every request was performing a thread pool dispatch and a Promise task just to compute a SHA-256 hash.
The fix was switching to Node’s built-in crypto module, specifically createHash('sha256'), a synchronous C++ binding that goes more or less straight to OpenSSL with no Promise and no thread pool round trip. Sequential POST throughput went from ~1,153 to ~2,848 ops/s. A roughly 2.5x improvement from one function call change. Yikes.
The same file had a secondary issue in token generation: a spread operator feeding into btoa() followed by three separate regex passes to clean up the base64 output - what in the holy convoluted batman is this. Replaced with a single randomBytes(32).toString('base64url') call. Same result, one native operation. I was quite surprised how convoluted the original implementation was, as if the LLM just tried things until something worked. This implementation was also the most painful to produce and required the most hand-holding: the requirements never seemed to be grasped in their entirety, even though we had worked out the spec before the first code hit the runtime.
Logging overhead
At the default log level, debug output to stderr was consuming a measurable fraction of request time. With LOG_LEVEL=error, sequential POST throughput recovered to 2,975 ops/s — matching the historical best. This isn’t a criticism of the LLM; it’s a reminder that benchmarking with debug logging enabled is benchmarking the logger. On the other hand, the real world deployment requires logging, so benchmaxxing cannot drive such decisions. Maybe the logging layer needs some tweaks of its own.
SQLite WAL mode
The SQLite journal mode defaults to a rollback journal that serializes all reads behind active writes. WAL (Write-Ahead Logging) decouples readers and writers by appending changes to a separate file, letting readers operate against a consistent snapshot while a write is in progress. For a workload with 10 concurrent writers, this matters enormously.
The impact was measured directly on the Go backend (where it was easiest to isolate): WAL mode delivered approximately 5x improvement on sequential writes and 5.1x on concurrent writes. All three backends ended up with WAL enabled, though the mechanism differed: Go and Node.js required an explicit PRAGMA, while Swift’s GRDB library enables it automatically when using DatabasePool. Why WAL isn’t the default in the first place beats me.
Final Node.js numbers: 2,975 ops/s sequential POST, ~1,024 ops/s GET by token, 121.8 MB RSS (resident set size, the physical memory actually in use) at idle.
Round 2: Swift — Where My Assumptions Broke
I expected Swift to be the fastest of the three, possibly by a significant margin. The reasoning: ARC eliminates garbage collector pauses, Swift compiles to native ARM64 machine code with aggressive optimization, Apple’s frameworks are tuned for Apple silicon. The LLM chose Hummingbird 2 for HTTP (a solid NIO-based framework) and GRDB for SQLite (which handles WAL automatically via DatabasePool).
The first benchmark told a different story.
The actor hop
The initial Swift implementation modeled the backend with two actors: an API actor handling routing logic, calling into a Store actor for persistence. This is a natural way to think about the architecture in Swift’s concurrency model — separate concerns, separate actors, compile-time safety.
The cost: each actor boundary is a cooperative executor suspension. Two hops per request, each carrying a ~5–15μs overhead. The benchmark result was 139 ops/s on GET by token. Go was doing over 3,000 on the same test.
This wasn’t the LLM doing something wrong in any naive sense. The code was correct, it was idiomatic Swift 6, it compiled cleanly with strict concurrency checking. The problem was that “idiomatic” and “performant” diverged significantly at this particular boundary.
The fix was restructuring to a single actor hop: a Sendable class for the API layer (zero executor cost from route handlers) calling into a single Store actor for actual persistence. Throughput jumped to ~3,300 ops/s. The actor boundary is still there where it matters — around actual database writes — but the unnecessary intermediate hop is gone.
33 allocations for a hex string
The SHA-256 hash of a share token needs to be hex-encoded on every request. The LLM reached for hash.map { String(format: "%02x", $0) }.joined() — a common Swift pattern that looks completely innocuous.
It isn’t. String(format:) routes through CFStringCreateWithFormat, which means Objective-C bridging. For each of the 32 bytes in a SHA-256 hash:
- The UInt8 is boxed into an NSNumber
- CFStringCreateWithFormat allocates an autoreleased CFString
- That bridges back to a Swift String with a separate allocation
- The resulting string lands in an intermediate array
33 heap allocations per call, all hitting the Obj-C autorelease pool. At 5,000 requests/second, that’s 165,000 unnecessary allocations per second.
The replacement: a pure-Swift lookup table mapping nibbles directly to ASCII bytes, writing into a pre-allocated [UInt8] buffer, then constructing the final string with a single String(decoding:as:) call. One allocation instead of 33. GET by token throughput improved +57%.
The same Foundation-bridging antipattern appeared in token validation (using CharacterSet, which bridges to NSCharacterSet) and token generation (using replacingOccurrences(of:with:), which bridges to NSString). All three call sites were converted to pure-Swift equivalents.
Going further: unsafe buffer tricks
After the lookup table fix, there were still two unnecessary allocations in the hot path: an intermediate [UInt8] buffer for the hex output, and a Data copy of the input token for the SHA-256 computation.
String(unsafeUninitializedCapacity:initializingUTF8With:) writes hex characters directly into the String’s internal storage, bypassing the intermediate buffer. Calling withContiguousStorageIfAvailable on the token’s utf8 view reads its bytes from internal storage without a copy (since Swift 5, native strings are stored as UTF-8).
The microbenchmark result across 100,000 iterations on M3:
| Version | Time | Allocations |
|---|---|---|
| Original (map + String(format:)) | 36,844 ns/op | 33+ |
| Lookup table with [UInt8] buffer | 7,467 ns/op | 3 |
| String(unsafeUninitializedCapacity:) | 6,051 ns/op | 2 |
| + no-copy UTF-8 input | 5,617 ns/op | 1 |
A roughly 85% reduction in that function’s overhead (36,844 to 5,617 ns/op), purely from eliminating Obj-C bridging and redundant copies.
Final Swift numbers: 5,049 ops/s GET by token, 8,620 ops/s concurrent POSTs, 56.1 MB RSS. Note: Swift’s concurrent throughput varies across runs — the NIO event loop showed elevated CPU usage (up to 576% at idle) in some sessions, which dragged down benchmark scores. Historical best was 14,030 ops/s on concurrent writes.
Round 3: Go — Standing on the Shoulders of stdlib
The Go backend benefited from a combination of LLM choices that happened to align well with Go’s strengths from the start: net/http for HTTP, mattn/go-sqlite3 (CGO) for SQLite, gorilla/websocket for WebSockets. Standard choices, well-trodden path.
The performance optimizations here were less about correcting structural mistakes and more about pushing a good baseline further.
WAL mode and CGO
Switching from modernc.org/sqlite (a pure-Go port, machine-translated from SQLite’s C source) to mattn/go-sqlite3 (CGO, native SQLite library) improved GET throughput significantly - from ~3,608 to ~6,861 ops/s on token lookups - because the CGO version runs the natively compiled, fully optimized SQLite C library. Combined with WAL mode, write throughput approximately doubled.
Per-token shard mutexes
The original implementation used a single sync.RWMutex protecting the entire provider state. Under concurrent load, all goroutines writing to different lists were still serializing on this one lock.
The fix was splitting into 64 shard-level mutexes, keyed by FNV-1a hash of the share token. The provider-level lock is now only held briefly for map lookups; the CPU-heavy work — Automerge operations, cryptography, SQLite writes — runs under only the per-token shard lock. Different tokens can write concurrently.
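A minimal sketch of the sharding idea, assuming nothing about the real provider beyond what is described above. ShardedLocks, shardIndex and WithToken are hypothetical names rather than the project’s actual code, and the per-token counters stand in for the Automerge and SQLite work.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const shardCount = 64

// ShardedLocks (hypothetical) holds a fixed set of mutexes.
// A given token always maps to the same mutex.
type ShardedLocks struct {
	shards [shardCount]sync.Mutex
}

// shardIndex hashes the token with FNV-1a and reduces it to a shard index.
func shardIndex(token string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(token)) // Write on a hash.Hash never returns an error
	return h.Sum32() % shardCount
}

// WithToken runs fn while holding only the lock for the token's shard,
// so work on tokens in other shards can proceed in parallel.
func (s *ShardedLocks) WithToken(token string, fn func()) {
	m := &s.shards[shardIndex(token)]
	m.Lock()
	defer m.Unlock()
	fn()
}

func main() {
	var locks ShardedLocks
	tokens := []string{"token-a", "token-b", "token-c"}
	counters := make([]int, len(tokens)) // one independent slot per token

	var wg sync.WaitGroup
	for i, tok := range tokens {
		for n := 0; n < 100; n++ {
			wg.Add(1)
			go func(i int, tok string) {
				defer wg.Done()
				// Writers to the same token serialize on its shard lock.
				locks.WithToken(tok, func() { counters[i]++ })
			}(i, tok)
		}
	}
	wg.Wait()
	fmt.Println(counters) // prints [100 100 100]
}
```

Two tokens that happen to hash to the same shard still serialize on each other; with 64 shards that trades a little contention for a small, fixed number of mutexes.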
Results:
| Metric | Before | After | Change |
|---|---|---|---|
| POST seq (ops/s) | 2,907 | 5,327 | +83% |
| POST c=10 (ops/s) | 9,122 | 15,677 | +72% |
| GET by token (ops/s) | 4,739 | 7,536 | +59% |
fmt.Sprintf vs hex.EncodeToString
A small but measurable optimization: the original DeriveListID used fmt.Sprintf("%x", h) to hex-encode the SHA-256 hash. fmt.Sprintf uses reflection internally to format its arguments. hex.EncodeToString(h[:]) is a direct memory operation with a single allocation for the output string. The difference is small per call, but measurable at throughput.
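The two variants side by side, as a sketch. The function names and the assumption that the token’s raw bytes are hashed directly are mine; only the fmt.Sprintf("%x", h) and hex.EncodeToString(h[:]) calls come from the description above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// deriveListIDSprintf mirrors the original approach: fmt inspects the
// argument's type at runtime before formatting it as hex.
func deriveListIDSprintf(token string) string {
	h := sha256.Sum256([]byte(token))
	return fmt.Sprintf("%x", h)
}

// DeriveListID hex-encodes the digest directly via encoding/hex.
func DeriveListID(token string) string {
	h := sha256.Sum256([]byte(token))
	return hex.EncodeToString(h[:])
}

func main() {
	token := "example-share-token" // hypothetical token value
	a := deriveListIDSprintf(token)
	b := DeriveListID(token)
	fmt.Println(b)
	fmt.Println(a == b) // both variants produce the same 64-character hex string
}
```

Because the output is identical, the swap is safe to make without touching stored IDs; the difference only shows up under go test -bench or a profiler.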
Final Go numbers: 6,866 ops/s GET by token, 12,122 ops/s concurrent POSTs, 44.5 MB RSS.
The Final Comparison
Tested 2026-05-25 on Apple M1 (arm64), macOS 26.0. All three backends in the same session, fresh starts, clean databases.
| Metric | Go | Swift | Node.js |
|---|---|---|---|
| POST list (seq) ops/s | 4,370 | 3,301 | 2,975 |
| POST list P50 latency | 0.20ms | 0.25ms | 0.26ms |
| POST list (c=10) ops/s | 12,122 | 8,620 | 2,479 |
| GET by token (seq) ops/s | 6,866 | 5,049 | 1,024 |
| GET by token P50 latency | 0.13ms | 0.19ms | 0.69ms |
| Memory idle RSS | 44.5 MB | 56.1 MB* | 121.8 MB |
| Binary size | 10 MB | 17 MB | ~350 MB† |
*Swift RSS was higher than Go’s in this session (56.1 MB vs 44.5 MB) but varies; historical range 27–56 MB.
†Node.js binary size includes node_modules.
What I Got Wrong
“Swift will be fastest”
The reasoning was: no garbage collector pauses (ARC handles memory), native machine code, Apple hardware. What I underestimated was the cost of Swift’s concurrency model. The actor system is genuinely powerful and its safety guarantees are worth having — but actor boundaries have real runtime cost, and the LLM naturally reached for the most “correct” structure without profiling implications in mind.
Beyond concurrency, the Obj-C bridging legacy is a trap for anyone who learned Swift before Swift 5’s clean break. The String(format:) pattern is in tutorials, in Apple’s own documentation examples, in thousands of Stack Overflow answers. It’s idiomatic — and it’s expensive in such hot paths.
Swift can be extremely fast. Getting there requires knowing which APIs are pure-Swift and which ones drop into Obj-C under the hood. That knowledge doesn’t come from reading documentation - it comes from benchmarking.
“Node.js will be dramatically slower”
The final gap on sequential writes is Go at 4,370 vs Node.js at 2,975 — roughly 1.5x, not an order of magnitude. On concurrent writes it’s worse (12,122 vs 2,479), but that’s a fundamental architectural constraint: better-sqlite3 is synchronous and serializes on the main thread.
Node.js’s strength is that when it can offload to C++, it does so efficiently. The native crypto module isn’t a JavaScript wrapper with overhead — it’s OpenSSL with a thin binding layer. The V8 engine has had decades of investment. The runtime isn’t slow; the question is whether you’re doing work in JavaScript or in the C++ layer it sits on top of.
Bravo to Node.js for holding its own, despite the icky, slow JavaScript it is said to run underneath.
“Go is a middle ground”
Go won across the board, often by a significant margin. What I didn’t fully appreciate was how well optimized Go’s standard library is. The SHA-256 implementation ships hand-written assembly that uses the CPU’s SHA instructions where available. The hex encoding uses direct byte operations with no reflection. The HTTP server is production-grade and heavily optimized - it’s Google’s baby after all. Goroutines are lightweight and fast.
The LLM went with stdlib throughout the Go implementation, and took some impressive wins. The Go team’s religious approach to simplicity, efficiency, and constant performance tuning of both runtime and library surely inspires us vegans of the server-side world to strive for the same when using it.
Reflections on LLM-Assisted Development
The most interesting finding isn’t in the benchmark numbers — it’s in the pattern of where the LLMs went wrong.
In every case, the initial code was correct by conventional standards. The Swift actor chain passed strict concurrency checking. The Web Crypto call used the officially recommended modern API. The String(format:) hex encoding is in the Swift documentation. None of these were bugs; they were choices that looked right until you measured them.
LLMs optimize for code that looks right, reads well, and follows documented patterns. They’ve been trained on the entire corpus of human-written code, which skews heavily toward “working” rather than “optimal.” They don’t profile. They don’t have an intuition for what a particular abstraction costs at runtime.
What an experienced developer adds to this loop isn’t more code — the LLM handles that. It’s the mental model that asks “what is this actually doing at runtime?” when something looks clean already. It’s the habit of measuring before assuming. And, as this experiment showed, it’s the willingness to have your prior assumptions proven wrong by the numbers.
The Node.js version required the most guidance during development. It also revealed the clearest antipattern (Web Crypto async vs native sync) and delivered the cleanest optimization story. The more “autonomous” Go and Swift implementations had more interesting structural problems to untangle.
I’m not sure what conclusion to draw from that, except that “less hand-holding” doesn’t mean “better code” — it means the LLM made its own decisions, and some of those decisions were questionable.
Surely augmenting the system prompt for the coding environment, setting some proper guardrails and directions in agent instruction files, or using a set of advanced skills freely available for each ecosystem would steer the LLM in the right direction much faster, but the experiment was to see what the defaults are. And many of those defaults stem purely from the models’ knowledge cut-off, which is a completely different story, and one that is gradually becoming much more depressing.
Numbers Don’t Lie, But They Do Require Context
A few caveats worth noting:
- These benchmarks run on a single machine, in-process, with no network latency. Real-world throughput for all three would be bottlenecked by I/O long before hitting these numbers.
- Swift’s NIO event loop showed anomalous CPU usage (up to 576% at idle) in some sessions. The historical best for Swift concurrent writes was 14,030 ops/s — significantly higher than the 8,620 captured in the final comparison session.
- Node.js concurrent write throughput (2,479 ops/s) reflects a fundamental architectural choice: better-sqlite3 is synchronous and serializes on the event loop thread. A different SQLite strategy (WAL + connection pool + worker threads) could change that picture, at the cost of implementation complexity.
- Memory numbers for Node.js (121.8 MB) include the V8 heap baseline. For long-running production services this may not matter; for constrained environments it does.
- The LLMs used throughout the experiment were a mix of locally hosted Qwen 3.6 27B, OpenCode Go’s DeepSeek V4 Flash, and some Claude Code, so results may vary.

