Migrating to Custom Observability: 6 Months of Pain, Then Payoff
We hit $47,000/month on our observability stack before someone finally asked the obvious question: what exactly are we paying for?
The answer was uncomfortable. Datadog was ingesting 2.3TB of logs daily, most of which nobody looked at. Our custom metrics had ballooned to 140,000 unique time series, with engineering teams adding new ones faster than we could track. APM traces were sampled so aggressively to control costs that debugging production issues had become archaeological guesswork.
So we built our own. It took six months, cost us two sprints of product work, and at one point I seriously considered whether we’d made a catastrophic mistake. But 18 months later, our observability costs are $8,200/month, our debugging capabilities are stronger, and we own the roadmap completely.
This is the honest story of that migration — the parts that went wrong, the decisions that saved us, and the framework for knowing whether this path makes sense for your organization.
The Real Cost of Managed Observability
Before diving into the migration, it’s worth understanding why managed observability costs spiral. The pricing models are designed around dimensions that grow faster than your revenue.
Logs are priced by ingestion volume. Every new microservice, every verbose library, every developer who adds debug logging “just for now” increases your bill. We found that 60% of our log volume came from three services that logged every HTTP request body in full — a practice that started as a debugging aid years ago and became invisible technical debt.
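To see how fast that compounds, here’s a back-of-envelope sketch. The per-GB ingest price is an assumed illustrative figure, not our actual contract rate:

```python
# Back-of-envelope ingestion math. The per-GB price is an assumed
# illustrative figure, not an actual vendor quote.
daily_volume_gb = 2_300      # ~2.3TB/day at our peak
price_per_gb = 0.10          # assumption, for illustration only
body_logging_share = 0.60    # share from the three full-body-logging services

monthly_cost = daily_volume_gb * price_per_gb * 30
print(f"ingest: ~${monthly_cost:,.0f}/month, "
      f"~${monthly_cost * body_logging_share:,.0f} of it from three services")
```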
Metrics are priced by cardinality. High-cardinality labels (user IDs, request IDs, dynamic paths) can turn a single metric into thousands of unique time series. One team added a customer_id label to their latency metrics “for debugging.” Useful? Occasionally. Expensive? $3,400/month for that single decision.
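The math here is worth internalizing. Billable series are the unique label-value combinations that actually occur, and the worst case is the full cross product, so one high-cardinality label multiplies everything else. A sketch with hypothetical counts:

```python
from math import prod

# Billable time series = unique label-value combinations that occur;
# worst case is the full cross product. All counts below are hypothetical.
label_cardinalities = {"service": 20, "endpoint": 50, "status_class": 3}
base = prod(label_cardinalities.values())     # 20 * 50 * 3 = 3,000 series

active_customers = 10_000                     # hypothetical customer count
print(f"{base:,} series -> up to {base * active_customers:,} with customer_id")
```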
APM is priced by span volume or host count. Either way, as you scale horizontally or increase instrumentation depth, costs grow. And the sampling required to control costs directly undermines the tool’s value — you’re paying premium prices for incomplete data.
The managed observability vendors know this. Their incentives are structurally misaligned with yours. They want you to instrument everything and retain it forever. You want actionable insights at sustainable cost.
The Migration Architecture
We didn’t build from scratch. That would have been genuine insanity. Instead, we assembled a stack from proven open-source components with clear interfaces between them.
Logs: Vector for collection and transformation, ClickHouse for storage and querying. Vector handles parsing, filtering, and routing at the edge. We drop debug-level logs in production unless explicitly requested, reducing volume by 70% before anything hits storage. ClickHouse’s columnar storage and compression mean our 700GB daily log volume costs about $180/month in S3-backed storage.
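Our actual rule lives in a Vector transform, but the logic is simple enough to sketch in Python. Field names and the opt-in list here are illustrative:

```python
def should_ship(event: dict) -> bool:
    """Edge filter: drop debug logs in production unless a service has
    explicitly opted in (e.g., during an incident). Field names are
    illustrative; our real rule is a Vector transform."""
    opted_in = {"checkout", "payments"}  # hypothetical opt-in list

    if event.get("env") != "production":
        return True
    if event.get("level") != "debug":
        return True
    return event.get("service") in opted_in


# This event would be dropped before it ever reaches storage.
print(should_ship({"env": "production", "level": "debug", "service": "search"}))
```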
Metrics: OpenTelemetry collectors feeding into VictoriaMetrics. We chose VictoriaMetrics over Prometheus for its better long-term storage and lower operational overhead. Cardinality limits are enforced at the collector level — no more surprise $3,400 labels.
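We won’t reproduce our collector config here, but the shape of the policy looks roughly like this sketch. The budget and the overflow behavior are illustrative, not the exact semantics of any OpenTelemetry processor:

```python
class CardinalityLimiter:
    """Sketch of collector-side enforcement: once a metric exhausts its
    series budget, new label combinations fold into a single overflow
    series instead of creating fresh ones. Limits are illustrative."""

    def __init__(self, max_series_per_metric: int = 1000):
        self.max_series = max_series_per_metric
        self.seen: dict[str, set] = {}

    def resolve(self, metric: str, labels: dict) -> dict:
        key = tuple(sorted(labels.items()))
        series = self.seen.setdefault(metric, set())
        if key in series or len(series) < self.max_series:
            series.add(key)
            return labels
        return {"cardinality_overflow": "true"}  # budget exhausted


limiter = CardinalityLimiter(max_series_per_metric=2)
print(limiter.resolve("http_latency", {"service": "a"}))  # kept
print(limiter.resolve("http_latency", {"service": "b"}))  # kept
print(limiter.resolve("http_latency", {"service": "c"}))  # folded into overflow
```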
Traces: OpenTelemetry with Jaeger, backed by the same ClickHouse cluster. We implemented tail-based sampling, which samples traces based on their outcome rather than randomly at the start. Errored traces and slow requests are captured at 100%; normal traffic is sampled at 1%. This gives us full debugging capability for the requests that matter.
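The decision itself is small; the hard part is buffering (more on that below). A sketch of the policy, with illustrative thresholds and field names:

```python
import random

def keep_trace(spans: list[dict], slow_ms: float = 1000.0,
               baseline_rate: float = 0.01) -> bool:
    """Tail-based sampling decision, made after a trace's spans have
    arrived. Thresholds and field names are illustrative."""
    has_error = any(s.get("status") == "error" for s in spans)
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)

    if has_error or duration >= slow_ms:
        return True                          # errors and slow requests: keep 100%
    return random.random() < baseline_rate   # normal traffic: keep ~1%


spans = [{"start_ms": 0, "end_ms": 1500, "status": "ok"}]
print(keep_trace(spans))   # True: slow trace, kept at 100%
```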
Alerting: Grafana for visualization and alerting. We briefly considered building custom dashboards but quickly realized that Grafana’s ecosystem is too mature to replicate.
The total infrastructure runs on three nodes (16 vCPU, 64GB RAM each) plus S3 storage. At scale, ClickHouse and VictoriaMetrics can cluster horizontally, but we haven’t needed that yet.
Six Months of Pain
The architecture sounds clean in retrospect. The reality was messier.
Month 1: We underestimated Vector’s learning curve. It’s powerful but idiosyncratic. Our first configuration attempted to replicate Datadog’s log parsing exactly, which was the wrong approach. We spent two weeks on edge cases that affected 0.1% of logs.
Month 2: ClickHouse query performance was terrible. Turns out, we’d designed our table schema around how we thought we’d query logs, not how we actually did. Rebuilding the schema and backfilling data cost us a week.
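For the curious, the second-attempt schema looked roughly like the sketch below. The column names, TTL, and clickhouse-connect client call are simplified illustrations, not our production DDL; the point is that the ORDER BY key matches the queries we actually run ("this service, this time window"):

```python
import clickhouse_connect  # pip install clickhouse-connect

# Simplified second-attempt schema. Columns and TTL are illustrative.
DDL = """
CREATE TABLE IF NOT EXISTS logs (
    ts        DateTime64(3),
    service   LowCardinality(String),
    level     LowCardinality(String),
    trace_id  String,
    message   String
)
ENGINE = MergeTree
PARTITION BY toDate(ts)
ORDER BY (service, ts)  -- service first: nearly every query is per-service
TTL toDateTime(ts) + INTERVAL 30 DAY
"""

client = clickhouse_connect.get_client(host="localhost")
client.command(DDL)
```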
Month 3: Tail-based sampling broke in subtle ways. Our initial implementation held traces in memory for 30 seconds waiting for all spans to arrive. Some slow traces took longer. We either lost them entirely or ran out of memory. The fix required rearchitecting the collector layer.
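The rearchitected buffer boils down to two escape hatches: a per-trace deadline and a hard span cap, either of which forces a sampling decision with whatever spans have arrived. A simplified sketch, with illustrative limits:

```python
import time
from collections import OrderedDict

class TraceBuffer:
    """Sketch: buffer spans per trace, but force a decision once a trace
    ages past its deadline or the buffer hits a hard span cap, instead
    of waiting indefinitely and running out of memory."""

    def __init__(self, deadline_s: float = 30.0, max_spans: int = 500_000):
        self.deadline_s = deadline_s
        self.max_spans = max_spans
        self.traces: OrderedDict[str, dict] = OrderedDict()  # oldest first
        self.total_spans = 0

    def add(self, trace_id: str, span: dict, decide) -> None:
        state = self.traces.setdefault(trace_id,
                                       {"spans": [], "t0": time.monotonic()})
        state["spans"].append(span)
        self.total_spans += 1
        self._evict(decide)

    def _evict(self, decide) -> None:
        now = time.monotonic()
        while self.traces:
            trace_id, state = next(iter(self.traces.items()))
            expired = now - state["t0"] >= self.deadline_s
            over_cap = self.total_spans > self.max_spans
            if not (expired or over_cap):
                break
            self.traces.popitem(last=False)
            self.total_spans -= len(state["spans"])
            decide(trace_id, state["spans"])  # sample now, with what arrived
```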
Month 4: Engineers started complaining. Loudly. The new dashboards didn’t have feature parity with Datadog. Saved queries were gone. The UI was “clunky.” Several teams quietly spun up their own managed observability trials, threatening to fragment our stack entirely.
Month 5: We ran both systems in parallel, which doubled costs temporarily. The Datadog contract renewal deadline forced the issue. We made the hard call to cut over completely, accepting that some capabilities would be degraded for a period.
Month 6: Things stabilized. Query patterns became familiar. Missing features were built. The complaints shifted from “this is broken” to “can we add X?” — a fundamentally different conversation.

When Custom Observability Makes Sense
After living with this decision, I can articulate when building makes sense and when it doesn’t.
Build when: Your observability costs exceed $25,000/month with a clear growth trajectory. You have at least one engineer who’s genuinely interested in this domain and will own it long-term. Your scale is large enough that off-the-shelf solutions require aggressive sampling that degrades their value. You’ve outgrown the vendor’s mental model of how observability should work.
Don’t build when: You’re under $10,000/month and the vendor’s pricing still scales with your ability to pay. You don’t have someone who wants to own this system operationally. Your engineering culture can’t tolerate temporary capability regression. You’re hoping this will somehow be easier than using a managed solution.
The middle zone ($10,000-$25,000/month) is genuinely ambiguous. It depends on your growth rate, team composition, and tolerance for operational complexity.
One factor we didn’t anticipate: custom observability dramatically improved our debugging culture. Because we controlled the stack, we could add capabilities that vendors don’t prioritize. Correlating logs with traces is seamless. Querying metrics and logs together in a single interface is natural. Adding custom dimensions to tracing that would be prohibitively expensive on managed platforms costs us nothing.
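For example, a single query can pull every error log emitted inside a slow trace. The sketch below assumes the illustrative logs schema from earlier plus a hypothetical traces table with a duration_ms column:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

# One query, both signals: error logs emitted inside slow traces.
# Table and column names follow the illustrative schema sketched above.
rows = client.query("""
    SELECT l.ts, l.service, l.message, t.duration_ms
    FROM logs AS l
    INNER JOIN traces AS t ON l.trace_id = t.trace_id
    WHERE t.duration_ms > 1000
      AND l.level = 'error'
      AND l.ts > now() - INTERVAL 1 HOUR
    ORDER BY l.ts DESC
    LIMIT 100
""").result_rows
```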
The ownership also changed how teams think about instrumentation. When adding a high-cardinality label meant arguing with the platform team rather than an invisible cost increase next month, teams became more thoughtful about what they actually needed.
The Unsexy Truth About Total Cost
Our $8,200/month breaks down as: $5,400 for infrastructure (compute + storage), $2,800 for engineering time (roughly 10 hours/month of maintenance and improvements, valued at $280/hour fully loaded).
That engineering time number is real but easy to misrepresent. The first six months required 30-40 hours/month. Months 7-12 dropped to 15-20 hours/month as the system stabilized. Now it’s genuinely around 10 hours unless we’re adding major capabilities.
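If you want to sanity-check the economics against your own numbers, a crude payback model is enough. The team size, sprint length, and sprint-to-dollars conversion below are hypothetical assumptions, not figures from our books:

```python
# Crude payback model. Monthly figures are from the post; the build-cost
# inputs (team size, sprint length) are hypothetical assumptions.
RATE = 280                      # fully loaded $/hour
managed = 47_000                # old monthly bill
custom = 8_200                  # steady-state monthly cost (infra + upkeep)

build_hours = 2 * 2 * 40 * 4    # assumed: 2 sprints x 2 weeks x 40h x 4 engineers
parallel_months = 5             # months paying for both stacks at once

build_cost = build_hours * RATE + parallel_months * custom
monthly_savings = managed - custom

print(f"payback: ~{build_cost / monthly_savings:.1f} months after cutover")
```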
The comparison isn’t just $47,000 vs $8,200. It’s also about what you get for the money. Our traces aren’t sampled into uselessness. Our logs aren’t truncated to control costs. Our metrics don’t have artificial cardinality limits. We can answer questions about production behavior that were effectively impossible before, not because the tools couldn’t do it, but because the cost of doing it was prohibitive.
Making the Decision
If you’re considering this path, start with an honest accounting of your current observability spend, hidden costs included. What’s the fully loaded figure once you count engineering time spent working around limitations? What questions can’t you answer because the data is sampled, truncated, or too expensive to query?
Then do a real proof of concept. Not a weekend experiment — a month-long trial running in parallel with your production stack. You’ll discover where the complexity actually lives, and it’s rarely where you expect.
The payoff is real, but it’s not immediate. You’re trading vendor dependency and escalating costs for operational ownership and upfront investment. For us, that trade was unambiguously correct. For plenty of organizations, it isn’t.
The question isn’t whether custom observability is better. It’s whether it’s better for you, right now, given everything else you need to build.
Building infrastructure that scales without vendor lock-in is core to what we do at Koalabs. If you’re evaluating your observability strategy or navigating similar build-vs-buy decisions, we’re happy to share what we’ve learned.