How to check Snowflake costs, why your Spark pipeline is slow, IndiCS CFPs, NL2SQL MCP, and more
Every Friday, we deliver your weekend win: a copy-paste tutorial, a cost-optimisation technique, CFPs worth your pitch, and fresh ideas from the field. Stop surfing fluff.
🤖 Is Your Snowflake Bill Really High? Here is How to Check
Warehouse size – Anything above LARGE (>8 credits/hr/cluster) is usually too much. Cost doubles at each size step (XS 1 → 4XL 128 credits/hr); leave a 4XL up for 8 hrs and you burn ~1,024 credits, i.e. $2k–$6k at $2–$6 per credit.
Idle ratio – If a warehouse sits idle >30–40% of the time, you’re paying for nothing (billing is per-second, with a 60-second minimum each time the warehouse resumes).
Cloud-services share – When cloud-services charges exceed 10% of daily compute, you’re past the daily free allowance and likely over-using serverless features.
Query cost – An interactive query over 0.1 credit (~$0.20–$0.60) starts to sting; dashboard queries usually finish under 0.01 credit.
Account run-rate – If your 90-day burn rate is >20% above your annual commitment pace, you’re in high-cost territory.
Serverless features – Snowpipe, materialised views, auto-clustering, etc. should stay <15% of total compute; once they creep higher, review them.
Takeaway – If two or more bullets fire, you have real savings hiding in plain sight; the sketch below shows how to pull a couple of these numbers straight from ACCOUNT_USAGE. Watch next week for our beginner & advanced optimisation guides.
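A minimal sketch of the first two checks, assuming the snowflake-connector-python package and a role that can read the SNOWFLAKE.ACCOUNT_USAGE share (the connection parameters are placeholders):

```python
# pip install snowflake-connector-python
import snowflake.connector

# Placeholder credentials -- substitute your own account, user, and auth.
conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    role="ACCOUNTADMIN",  # ACCOUNT_USAGE views need an elevated role
)

CHECKS = {
    # Cloud-services share of daily compute; flag days above ~10%.
    # (Approximate: the official free-allowance adjustment lives in
    # METERING_DAILY_HISTORY, but this ratio is a good early warning.)
    "cloud_services_share": """
        SELECT TO_DATE(start_time) AS day,
               SUM(credits_used_cloud_services)
                 / NULLIF(SUM(credits_used_compute), 0) AS cs_share
        FROM snowflake.account_usage.warehouse_metering_history
        WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
        GROUP BY 1
        ORDER BY 1
    """,
    # 30-day credit burn per warehouse; oversized warehouses float to the top.
    "warehouse_burn": """
        SELECT warehouse_name, SUM(credits_used) AS credits_30d
        FROM snowflake.account_usage.warehouse_metering_history
        WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
        GROUP BY 1
        ORDER BY 2 DESC
    """,
}

cur = conn.cursor()
for name, sql in CHECKS.items():
    print(f"--- {name} ---")
    for row in cur.execute(sql).fetchall():
        print(row)
```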
📣 Call for Proposals: ACM India IndiCS Seminars
IndiCS is ACM India’s Dagstuhl‑style, fully‑immersive seminar series that brings 45 invited researchers together for 3–5 days of deep dives at rotating venues across India. The program funds student travel and encourages open‑ended collaboration on frontier CS topics.
Next deadline: 31 July 2025 (for seminars in autumn/winter 2025). A second window closes 31 December 2025 for spring/summer 2026 slots.
🐌 Spark’s Hybrid Engine: Unified ≠ Fast
Vanilla Spark juggles three execution paths: row-oriented Volcano iterators, JVM whole-stage codegen, and a thin slice of vectorisation. That Swiss-army flexibility serves logs, streams, ML, and SQL in one runtime, but it also brings GC churn, branchy code, and frequent fallbacks when UDFs or exotic types appear.
For classic lakehouse OLAP (Parquet + flat schemas, few UDFs), a fully vectorised kernel wins, so vendors are swapping out Spark’s final execution stage:
Databricks Photon, Apache Gluten, and DataFusion Comet drop C++/Rust kernels into the DAG and report 2–4× speed-ups with no API break.
Spark’s own vectorised path is narrow (only some Parquet/ORC scans use columnar batches), so whole-stage codegen still dominates most queries.
Takeaway: Spark is slow for pure OLAP by design, not accident. If your workload is >90% columnar SQL, a vectorised back-end or Arrow-native engine can cut runtime and cost; keep vanilla Spark for truly mixed jobs (config sketch below). Read more.
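For a feel of what the swap looks like, here is a sketch of enabling Apache Gluten (with its Velox C++ kernel) on a PySpark job. The plugin class and settings follow Gluten’s documentation at the time of writing and vary by version, and you also need the matching Gluten bundle jar on the classpath, so treat this as illustrative rather than copy-paste:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("olap-with-native-backend")
    # Load the Gluten plugin: supported physical operators are replaced
    # with native vectorised ones; the rest fall back to vanilla Spark.
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
    # Gluten ships a columnar shuffle to avoid row<->column conversions.
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
    # Native kernels allocate off-heap memory; Gluten requires this.
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "8g")
    .getOrCreate()
)

# The query runs either way thanks to fallback; inspect the plan for
# columnar/Velox nodes to confirm the native path is actually used.
spark.read.parquet("s3://bucket/events/").createOrReplaceTempView("events")
spark.sql("SELECT dt, COUNT(*) AS n FROM events GROUP BY dt").explain()
```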
🧪 Why Data Engineers Still Skip Tests — Reddit Reacts
An r/dataengineering thread boiled the problem down to four points:
Cost & deadlines – quick-and-dirty DAGs ship first; bug fixes get billed to the next sprint.
Skills gap – many DEs start in SQL/Excel land and meet pytest only after pain.
Unstable inputs – schemas drift, APIs mutate; unit tests feel brittle, so teams lean on downstream DQ checks.
Tooling debt – unclear lines between unit tests, data-quality checks, and prod monitors stall adoption despite dbt, GX, SQLMesh.
Takeaway: testing is an org-level habit, not a novel tech problem. Seed the habit with lightweight assertions + CI on critical transforms (a minimal pytest sketch follows); rerunning petabyte jobs is the expensive path.
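Here is what that lightweight habit can look like: a critical transform and its assertions in one file, run by pytest in CI on every commit. The transform and its field names are invented for illustration:

```python
import pytest

def normalise_order(raw: dict) -> dict:
    """Critical transform: clean one raw order record."""
    return {
        "order_id": str(raw["order_id"]).strip(),
        "amount_cents": round(float(raw["amount"]) * 100),
        "currency": raw.get("currency", "USD").upper(),
    }

def test_happy_path():
    out = normalise_order({"order_id": " 42 ", "amount": "19.99"})
    assert out == {"order_id": "42", "amount_cents": 1999, "currency": "USD"}

def test_schema_drift_fails_loudly():
    # When the upstream schema drifts, fail in CI instead of loading junk
    # and rerunning the petabyte job later.
    with pytest.raises(KeyError):
        normalise_order({"id": 42, "amount": "19.99"})
```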
🧠 Ask in English, Get Perfect SQL — Even Across Hundreds of Tables
We’re sipping our own Kool-Aid again at e6data: NL2SQL v0 plugs into e6-MCP and turns a plain-language prompt into a rock-solid, schema-faithful query, even when your data store looks like a city map.
How? An agentic triple-play:
Vector search surfaces the likeliest tables & columns
Random-walk graph traversal traces real relationships, not guesswork (toy sketch after this list)
Cross-attention re-ranker locks the winning set, then lets the agent self-reflect and patch any schema slip-ups
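For intuition only (this is not e6data’s implementation, and the schema graph is invented), here is a toy sketch of the random-walk step: starting from the tables vector search surfaced, short walks over the foreign-key graph up-weight tables that are genuinely connected and let unrelated look-alikes fade:

```python
import random
from collections import Counter

# Hypothetical schema graph: table -> tables reachable via foreign keys.
FK_GRAPH = {
    "orders": ["customers", "order_items"],
    "order_items": ["orders", "products"],
    "customers": ["orders"],
    "products": ["order_items"],
    "audit_log": [],  # vector-similar to the prompt, but unconnected noise
}

def walk_scores(seeds, steps=3, walks=200, rng_seed=7):
    """Score tables by how often short random walks from the seeds visit them."""
    rng = random.Random(rng_seed)
    visits = Counter()
    for _ in range(walks):
        node = rng.choice(seeds)
        for _ in range(steps):
            visits[node] += 1
            neighbours = FK_GRAPH.get(node, [])
            if not neighbours:
                break  # dead end: unconnected tables stop accumulating weight
            node = rng.choice(neighbours)
    return visits.most_common()

# Seeds come from vector search; the walk promotes the related cluster.
print(walk_scores(seeds=["orders", "audit_log"]))
```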
The result: accurate answers, zero hard-coded limits. Where Databricks stalls at 25 tables and Snowflake taps out at 10, NL2SQL sails past, ready for enterprise-scale schemas.
➡️ Join the private beta
Community & Events:
Blog: Iceberg Catalogs 2025 — a deep dive into modern metadata management across Project Nessie, Apache Gravitino, Apache Polaris, Lakekeeper, and more. [Read here]
Event: Lakehouse Days — “Real‑time Streaming Ingest” on 12 July, Bangalore with speakers from Confluent. Subscribe to the calendar for early registration.
Hiring: We’re growing! Check out open engineering roles [here].