Rust CGP in a nutshell, is Iceberg the wrong spec?, Spark's staying power, a Froid-inspired UDF engine, and more
Today we talk about Rust again, why Iceberg might have a metadata issue, Spark's future, unit tests versus data-quality checks, and building a better UDF engine inspired by Microsoft’s Froid framework.
🤖 Rust Context-Generic Programming (CGP) in a Nutshell
An insightful take on Rust CGP made the rounds on our engineering team this week. Here are a few highlights:
CGP swaps “objects” for contexts: functions declare the few capabilities they need via consumer traits, and any struct that offers those capabilities can satisfy them. The compiler wires everything together, so there’s no runtime cost.
It tackles Rust’s big headaches—bloated trait APIs, orphan-rule conflicts, forced public types, copy-pasted code—by cleanly splitting provider-and-consumer roles and letting overlapping implementations live side-by-side.
Power tools include the `HasField` and `cgp_auto_getter` macros, extensible records/builders for piecemeal struct construction, and extensible variants/visitors for enums that grow without breaking old code.
Our Takeaway: Use CGP when systems have lots of inter-dependent services or tight performance demands; stick with plain trait-object DI for small, stable parts. CGP keeps tests easy and mixes real or mock back-ends just by swapping contexts, as the sketch below illustrates.
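To make the pattern concrete, here is a minimal hand-rolled sketch in plain Rust. It does not use the cgp crate's macros, and the trait and type names (`HasUserStore`, `CanGreetUser`, `InMemoryStore`, `MockStore`) are invented for illustration: a consumer trait declares the one capability it needs, a blanket impl supplies the behaviour for any context that offers it, and a mock context can be swapped in with no runtime cost.

```rust
use std::collections::HashMap;

// Capability trait: the minimal surface a consumer needs from its context.
trait HasUserStore {
    fn lookup(&self, id: u64) -> Option<String>;
}

// Consumer trait: the behaviour we want, expressed against capabilities only.
trait CanGreetUser {
    fn greet(&self, id: u64) -> String;
}

// Blanket ("provider") impl: any context exposing `HasUserStore`
// automatically knows how to greet. The compiler wires this statically.
impl<Ctx: HasUserStore> CanGreetUser for Ctx {
    fn greet(&self, id: u64) -> String {
        match self.lookup(id) {
            Some(name) => format!("Hello, {name}!"),
            None => "Hello, stranger!".to_string(),
        }
    }
}

// A "real" context backed by an in-memory map.
struct InMemoryStore {
    users: HashMap<u64, String>,
}

impl HasUserStore for InMemoryStore {
    fn lookup(&self, id: u64) -> Option<String> {
        self.users.get(&id).cloned()
    }
}

// A mock context for tests: swap it in without touching `CanGreetUser`.
struct MockStore;

impl HasUserStore for MockStore {
    fn lookup(&self, _id: u64) -> Option<String> {
        Some("test-user".to_string())
    }
}

fn main() {
    let real = InMemoryStore {
        users: HashMap::from([(1, "Ada".to_string())]),
    };
    println!("{}", real.greet(1));       // Hello, Ada!
    println!("{}", MockStore.greet(42)); // Hello, test-user!
}
```

The same shape scales to many capabilities: each consumer trait stays small, and adding a new context never forces edits to existing providers.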
🧊 Iceberg, The Right Idea – The Wrong Spec
A contrarian blog post by the Database Doctor, “Iceberg, The Right Idea – The Wrong Spec”, argues from storage history that object-store table formats risk repeating 1990s storage mistakes:
The author shows how old-school filesystems and today’s object stores struggle with millions of tiny files, locking, and fragmentation; in contrast, classic databases solved these “space-management” headaches decades ago.
Because databases already handle storage, compression, atomic writes, and bit-rot, pushing data into object stores just shifts cost and pain to users while giving cloud vendors more lock-in.
Data lake tables still need fast, tiny metadata updates, yet object stores are not the place for that; even at multi-petabyte scale, metadata stays small enough to fit on a single server, so a real metadata database beats file blobs.
Conclusion: openness matters, but be wary of the spec until it tackles metadata and lock-in as cleanly as proven database engines do.
Our Takeaway: Before betting 100% on open table formats, scrutinise how they handle metadata at scale to get the best ROI.
💥 Spark: Still Got the Fire?
An r/dataengineering thread asks whether Apache Spark is passé in 2025 and finds the community surprisingly loyal; the short answer: “yes, when the data’s truly big.”
Petabyte shops say Spark “just works” for multi-PB tables
Critics highlight launch latency and opaque debugging, steering smaller teams to Polars first
Hidden cost of “tool churn”: migrating stacks as volumes grow burns hiring and context-switching budget
Many would still choose Spark once workloads cross the petabyte barrier; below that, lightweight engines win on ergonomics
Our Takeaway: Spark isn’t dead—just reserve it for workloads that justify the cluster costs.
🧪 Unit Tests ≠ Data Quality Checks
Another great Reddit debate separates build-time tests from run-time data validation.
Unit testing is about making sure that a dependency change or code refactor doesn’t produce bad code that gives wrong results. Integration and end-to-end testing are about the whole integrated pipeline performing as expected.
Data quality checks are about checking the integrity of production data as it flows, every time it flows. It’s a “runtime” construct, i.e., it runs after your code is released (the sketch below this list contrasts the two).
The community agrees the two are complementary; DQ systems should be platform-agnostic and drive feedback to data stewards.
Overlap is rare: unit tests may cover complex transforms, but DQ thresholds still catch unexpected business-data anomalies.
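As a rough illustration of the split (the function names and thresholds here are invented, not from the thread): a unit test pins the behaviour of a pure transform at build time, while a data-quality check runs against whatever production data actually arrives.

```rust
/// Pure transform: convert cents to whole dollars (rounding down).
fn cents_to_dollars(cents: i64) -> i64 {
    cents / 100
}

/// Runtime data-quality check: flag batches whose null rate exceeds a
/// threshold, regardless of how correct the transform code is.
fn check_amount_quality(amounts: &[Option<i64>], max_null_rate: f64) -> Result<(), String> {
    let nulls = amounts.iter().filter(|a| a.is_none()).count();
    let null_rate = nulls as f64 / amounts.len().max(1) as f64;
    if null_rate > max_null_rate {
        return Err(format!("null rate {null_rate:.2} exceeds {max_null_rate:.2}"));
    }
    Ok(())
}

fn main() {
    // Run time: validate a (made-up) batch as it flows through the pipeline.
    let batch = vec![Some(1999), None, Some(-50), Some(42_00)];
    match check_amount_quality(&batch, 0.10) {
        Ok(()) => println!("batch passed DQ checks"),
        Err(e) => println!("DQ alert: {e}"),
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // Build time: a refactor that breaks the rounding rule fails CI.
    #[test]
    fn converts_cents_to_whole_dollars() {
        assert_eq!(cents_to_dollars(1999), 19);
    }
}
```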
Our Takeaway: Keep CI tests and data-quality observability as distinct layers—each fails fast in its own domain.
We’ve built a faster Froid-inspired UDF engine!
UDFs let you keep loops and business rules right inside SQL, but the usual multi-statement kind makes the engine hop out for every row, killing speed.
That drag comes from three things: context-switch overhead, row-by-row work, and an optimizer that can’t peek inside the function—turning some queries 10-1000× slower.
Databricks and Snowflake sidestep the pain by only allowing single-expression UDFs, which forces devs to cram logic into one line or shift it to external code.
e6data’s Froid-style “inliner” dissolves multi-statement UDFs back into one relational plan, so the optimizer can parallelize freely and queries run orders of magnitude faster, without changing user code; the toy sketch below shows the core idea.
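Here is a toy sketch of the inlining idea, not e6data’s actual engine or internal APIs: the imperative body of a scalar UDF (a single IF/RETURN in this example) is collapsed into one CASE expression that can be spliced into the calling query’s plan, so the optimizer sees ordinary relational expressions instead of an opaque per-row call.

```rust
#[derive(Clone, Debug)]
enum Expr {
    Column(String),
    LiteralInt(i64),
    Gt(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    /// CASE WHEN <cond> THEN <then> ELSE <otherwise> END
    Case { cond: Box<Expr>, then: Box<Expr>, otherwise: Box<Expr> },
}

/// A simplified multi-statement UDF body:
///   IF amount > 1000 THEN RETURN amount * 2 ELSE RETURN amount
enum Stmt {
    IfReturn { cond: Expr, then: Expr, otherwise: Expr },
}

/// "Inline" the UDF: collapse the imperative body into one expression tree
/// that can be substituted directly into the calling query's plan.
fn inline_udf(body: &[Stmt]) -> Expr {
    // In this toy, a body is a single IF/RETURN; Froid handles the general
    // case by chaining statements into nested CASEs and derived tables.
    match &body[0] {
        Stmt::IfReturn { cond, then, otherwise } => Expr::Case {
            cond: Box::new(cond.clone()),
            then: Box::new(then.clone()),
            otherwise: Box::new(otherwise.clone()),
        },
    }
}

fn main() {
    let udf_body = vec![Stmt::IfReturn {
        cond: Expr::Gt(
            Box::new(Expr::Column("amount".into())),
            Box::new(Expr::LiteralInt(1000)),
        ),
        then: Expr::Mul(
            Box::new(Expr::Column("amount".into())),
            Box::new(Expr::LiteralInt(2)),
        ),
        otherwise: Expr::Column("amount".into()),
    }];

    // The planner now sees one CASE expression instead of an opaque UDF call,
    // so it can vectorize, parallelize, and push filters through it.
    println!("{:?}", inline_udf(&udf_body));
}
```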
Community & Events
Lakehouse Days Replay – The full recording of our “From Stream to Lakehouse” session is now live on e6data’s YouTube.
Inside e6data’s Froid-Inspired UDF Engine: an in-depth blog series by our engineering team
Meet Us at MachineCon USA – We’ll be on-site; come say hi!
Hiring: We’re growing! Check out open engineering roles [here]