Agent-Built JUnit Suites, CDC vs Daily Snapshots, DuckLake vs Iceberg, Queries, Costs, and more
Every Friday, we deliver your weekend win: copy-paste tutorial, cost-optimisation technique, CFPs worth your pitch, and fresh ideas from the field. Stop surfing fluff.
🔧 We used a few agents to add JUnit runner support for regression runs. It sort of worked?
An engineer from our team pointed IntelliJ’s Junie agent (Claude 3.7 under the hood) at a hand-rolled regression harness (~2000 SQL queries) and asked it to surface each query as an IDE-runnable JUnit test. Here is what went down:
Agent #1: Junie + JUnit 4 reflection adapter. A single prompt asked Junie to analyse RegressionTest.runTests and emit an adapter that registers every query as a JUnit test. Junie delivered code that works, but it hinges on brittle reflection hooks and produces hash-based test names with no Run/Debug actions.
Agent #1, retry: JUnit 5 parameterized / dynamic tests. After nudging Junie toward Jupiter, it rewrote the runner, solving the naming issue but introducing new ones: incremental discovery kept appending tests mid-run, only ~70% of the 2,006 tests were detected, and per-test stdout/stderr never reached the console.
Edge cases that still bite:
IDE agent loops on feedback instead of asking for help.
Environment variables (default schema, storage endpoint) are ignored, so failed tests hit remote buckets.
Our Takeaway: IDE agents can scaffold a runner in minutes, but deep plumbing—stable discovery, clean logs, env-aware execution—still needs a human who knows JUnit extension points. Treat today’s agents as junior devs: great at boilerplate, lost in integration hell.
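For the curious, here is roughly the shape a human-written runner takes. This is a minimal JUnit 5 sketch, not Junie's actual output: loadQueries() and executeQuery() are hypothetical stand-ins for the real harness, and STORAGE_ENDPOINT is our invented name for the env-var guard that keeps failing tests off remote buckets:

```java
import java.util.List;
import java.util.Map;

import org.junit.jupiter.api.Assumptions;
import org.junit.jupiter.api.DynamicTest;
import org.junit.jupiter.api.TestFactory;

class RegressionSuite {

    // Hypothetical stand-in: the real harness loads ~2,000 queries from disk.
    static Map<String, String> loadQueries() {
        return Map.of(
                "q0001_select_count", "SELECT COUNT(*) FROM t",
                "q0002_null_check", "SELECT id FROM t WHERE id IS NULL");
    }

    // Hypothetical stand-in: run the SQL and diff against a stored baseline.
    static void executeQuery(String sql) {
    }

    @TestFactory
    List<DynamicTest> regressionQueries() {
        // Skip the whole suite when the environment isn't configured,
        // instead of letting failing tests hit remote buckets.
        Assumptions.assumeTrue(System.getenv("STORAGE_ENDPOINT") != null,
                "STORAGE_ENDPOINT not set; skipping regression queries");

        // One dynamic test per query, materialized up front and sorted so
        // discovery is stable and names are readable (no hashes).
        return loadQueries().entrySet().stream()
                .sorted(Map.Entry.comparingByKey())
                .map(e -> DynamicTest.dynamicTest(e.getKey(),
                        () -> executeQuery(e.getValue())))
                .toList();
    }
}
```

Materializing and sorting the full list up front keeps discovery stable across runs; whether the IDE offers per-test Run/Debug actions on dynamic tests still depends on its JUnit 5 support.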
🔄 CDC vs. Daily Snapshots?
Data engineers on Reddit recently argued (and sort of concluded) that CDC is cheaper, safer and more granular than nightly full snapshots:
Pulling changes straight from the DB WAL avoids table locks and SELECT * hell.
Real-time extraction doesn’t force real-time delivery: buffer the stream and land it hourly or daily instead.
For 100M+ row tables, daily snapshots hammer network & compute; Debezium-style CDC streams are faster.
Our Takeaway: Unless your tables are tiny, CDC + append-only logs win on both cost and correctness, while still letting teams materialize snapshot views whenever they like.
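If "Debezium-style CDC streams" sounds abstract, here is a minimal sketch of Debezium's embedded engine tailing a Postgres WAL. The connection details, the orders table, and the offsets file are placeholders on our side, exact property names vary slightly across Debezium versions, and a real pipeline would buffer events into an append-only log instead of printing them:

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

public class WalTail {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "orders-cdc");
        props.setProperty("connector.class",
                "io.debezium.connector.postgresql.PostgresConnector");
        props.setProperty("plugin.name", "pgoutput");
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "cdc");
        props.setProperty("database.password", "cdc");
        props.setProperty("database.dbname", "shop");
        props.setProperty("topic.prefix", "shop");
        props.setProperty("table.include.list", "public.orders");
        // File-backed offsets let the engine resume where it left off.
        props.setProperty("offset.storage",
                "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/cdc-offsets.dat");

        // Row-level changes come straight off the WAL: no table locks,
        // no SELECT * sweeps over 100M+ rows.
        DebeziumEngine<ChangeEvent<String, String>> engine =
                DebeziumEngine.create(Json.class)
                        .using(props)
                        // A real pipeline would append to a log here and let
                        // downstream jobs land it hourly or daily.
                        .notifying(event -> System.out.println(event.value()))
                        .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine); // engine.close() stops it on shutdown
    }
}
```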
🦆 DuckLake vs. Iceberg — Metadata Wars (Again?)
r/dataengineering is running hot with another metadata war this week:
One-binary catalog. DuckLake boots from a single DuckDB file; you’re querying with plain SQL in minutes, no Hive metastore, no Spark session spin-up.
Adoption hurdle. Redditors call it a “pure improvement” over Iceberg but say they’ll jump only once Spark / Trino / Flink connectors land and it survives real petabyte-scale tests.
Could just fold in. Even Iceberg fans note its catalog is already pluggable; a SQL-backed option like DuckLake might slip upstream instead of forking the ecosystem.
Our Takeaway: The fight isn’t about table file formats but about where metadata lives. We’ll be watching connector support: once the major engines speak DuckLake, JSON catalogs might become legacy.
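To show how low the barrier actually is, here is a minimal sketch of that one-binary workflow through DuckDB's JDBC driver. It assumes the ducklake extension is installable in your DuckDB build, and the file and table names are ours, not a tested petabyte setup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DuckLakeQuickstart {
    public static void main(String[] args) throws Exception {
        // In-memory DuckDB session; the DuckLake catalog itself is one file.
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement st = conn.createStatement()) {
            st.execute("INSTALL ducklake");
            st.execute("LOAD ducklake");
            // The attached file is the whole catalog: no Hive metastore,
            // no Spark session spin-up.
            st.execute("ATTACH 'ducklake:my_lake.ducklake' AS lake");
            st.execute("CREATE TABLE IF NOT EXISTS lake.events (id INTEGER, payload VARCHAR)");
            st.execute("INSERT INTO lake.events VALUES (1, 'hello')");
            try (ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM lake.events")) {
                rs.next();
                System.out.println(rs.getLong(1)); // plain SQL, minutes in
            }
        }
    }
}
```

Everything the catalog knows lives in my_lake.ducklake plus the data files it points at, which is exactly the metadata-placement argument above.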
💡 We’re cooking up a Query & Cost Optimization Hub
We’re launching a hub that distills cost-optimising and latency-crushing playbooks:
One for every major compute engine: Snowflake, Databricks, BigQuery, Redshift, Athena, ClickHouse, Fabric, Starburst, and more.
Filter by engine, choose “query” or “cost” mode, and grab beginner-to-advanced tactics.
The first one is out: Snowflake Cost Optimisation for Beginners. The rest will follow soon.