Semantic Layers Beat Prompts, Flink Goes Agentic, Flotilla's 18x Speed Jump, and Why Your Spark Jobs Struggle With Images
How AI is reshaping data infrastructure: from agentic stream processing to production-grade semantic layers and purpose-built multimodal engines.
🔍 “Show me your GitHub” - The Quest for Real Data Engineering Projects
A mid-level data engineer asked the r/dataengineering community to share production-grade GitHub repos, hoping to see how senior engineers structure projects, choose tools, and design architectures. The responses revealed a fundamental problem: most real data engineering work is proprietary.
The standout contribution: tobymao’s sqlglot, a SQL parser and transpiler that handles dialect translation across 20+ SQL variants (it’s what powers dialect conversion in dbt, Airflow, and a bunch of other tools you probably use daily - see the transpile example after this list)
The technical gap is real: junior engineers are learning from tutorials with toy datasets, not systems that handle petabyte-scale data with complex SLA requirements, data quality checks, and orchestration patterns
The consensus: strong fundamentals matter more than tool knowledge (CI/CD pipelines, proper testing, schema management, idempotent workflows - the infrastructure that keeps data pipelines from becoming unmaintainable messes)
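If you haven’t used sqlglot, here’s a minimal taste of what it does. The query itself is made up, but `transpile` is sqlglot’s documented entry point:

```python
# pip install sqlglot
import sqlglot

# A Snowflake-flavored query using dialect-specific syntax (DATEADD, QUALIFY).
snowflake_sql = """
SELECT user_id,
       DATEADD(day, -7, CURRENT_TIMESTAMP()) AS week_ago
FROM events
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ts DESC) = 1
"""

# sqlglot parses the query into an AST, then re-renders the tree
# in the target dialect's syntax.
print(sqlglot.transpile(snowflake_sql, read="snowflake", write="duckdb")[0])
```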
⚡ When SQL Isn’t Enough (And When It Absolutely Is)
A Reddit thread kicked off with a pointed question: if Snowflake’s SQL handles complex transformations (CTEs, window functions, UDFs), why do we need Spark, Airflow, and dbt? The answers cut through a common architectural confusion.
Spark vs. SQL is a category error: Spark is a distributed compute engine that happens to support SQL (via Spark SQL), while Snowflake is a cloud data warehouse where SQL is the primary interface (you’re comparing an execution engine to a storage-compute platform)
The cost-performance trade-off: Snowflake charges for compute time, so transforming 10TB of raw logs directly in Snowflake can be 5-10x more expensive than preprocessing with Spark on cheaper compute, then loading cleaned data (the sketch after this list shows the pattern)
Where Snowflake SQL breaks down: when you need to read from Kafka streams, hit external APIs mid-transformation, or apply complex ML models that aren’t SQL-expressible (Snowpark helps, but you’re still constrained by what the warehouse supports)
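To make the cost argument concrete, here’s a hedged sketch of the preprocess-then-load pattern - bucket paths, columns, and table layout are all hypothetical:

```python
# "Preprocess on cheap compute, load clean data" - illustrative only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("log-preprocess").getOrCreate()

# Raw, messy logs sitting on object storage (hypothetical path).
raw = spark.read.json("s3://raw-bucket/app-logs/2025/11/*/")

clean = (
    raw
    .filter(F.col("event_type").isNotNull())   # drop malformed events
    .withColumn("ts", F.to_timestamp("ts"))    # normalize timestamps
    .dropDuplicates(["event_id"])              # keep re-runs idempotent
    .select("event_id", "user_id", "event_type", "ts")
)

# Write compact Parquet to a stage; Snowflake then ingests it with COPY INTO,
# so warehouse credits are only spent on the already-cleaned subset.
clean.write.mode("overwrite").parquet("s3://curated-bucket/app-logs/2025-11/")
```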
🤖 Flink Gets Agentic: Real-Time AI That Actually Responds to Events
Flink Agents is a new framework from the Apache Flink community that combines stateful stream processing with LLM-based agents. If you’ve been trying to build AI systems that react to real-time event streams (not batch data), this architecture is worth understanding.
The technical setup: Flink handles event ingestion, stateful processing, and exactly-once semantics, while AI agents (LLM-powered) make decisions based on streaming context (think fraud detection that adapts its rules based on emerging patterns, not static thresholds)
Why this matters: traditional AI data tools work in request-response mode with static data; Flink Agents can maintain conversation state across millions of concurrent event streams with sub-second latency (Flink’s distributed snapshotting gives you fault tolerance without losing agent state - a plain-PyFlink sketch of the pattern follows this list)
Real use cases emerging: real-time content moderation for live streams that adapts to context, intelligent alerting systems that reduce false positives by understanding event sequences, automated trading systems that process market data and execute decisions in the same pipeline
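Flink Agents’ own API is still young, so treat the following as a sketch of the underlying pattern in plain PyFlink rather than the framework’s actual interface: keyed, checkpointed state holds per-entity context, and a stubbed `call_llm` stands in for the model call.

```python
# Illustrative pattern only - plain PyFlink, NOT the Flink Agents API.
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor


def call_llm(context: str, event: str) -> str:
    """Hypothetical stub: send rolling context plus the new event to a model."""
    return "review" if "refund" in event else "ok"


class AgentFunction(KeyedProcessFunction):
    def open(self, runtime_context: RuntimeContext):
        # Per-key conversation state; Flink's distributed snapshots checkpoint
        # it, so agent context survives failures with exactly-once semantics.
        self.context_state = runtime_context.get_state(
            ValueStateDescriptor("agent_context", Types.STRING())
        )

    def process_element(self, value, ctx):
        key, event = value
        history = self.context_state.value() or ""
        decision = call_llm(history, event)
        self.context_state.update((history + " | " + event)[-2000:])  # bound context size
        yield f"{key}: {decision}"


env = StreamExecutionEnvironment.get_execution_environment()
events = env.from_collection(
    [("user-1", "login"), ("user-1", "refund x3"), ("user-2", "login")],
    type_info=Types.TUPLE([Types.STRING(), Types.STRING()]),
)
events.key_by(lambda e: e[0]) \
      .process(AgentFunction(), output_type=Types.STRING()) \
      .print()
env.execute("agent-pattern-sketch")
```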
🧠 Text-to-SQL Is Just the Appetizer: Building Production AI Data Analysts
Pedro Nascimento’s deep-dive on building Findly’s AI data analyst is packed with architectural lessons from taking text-to-SQL from demo to production. His core argument: SQL generation is 20% of the problem; the other 80% is context, validation, and multi-step reasoning.
Multi-agent architecture: separate agents for planning (decompose “analyze cohort retention” into specific steps), SQL generation (with compile-time validation), Python execution (for post-query transforms like statistical tests), and result synthesis (the system runs 5-7 LLM calls per complex query, not one)
Semantic layer as context: they use Malloy to define metrics, joins, and business logic in code, then compile it to optimized SQL (this gives you type-checking and prevents the LLM from hallucinating table relationships - it’s working with a known schema graph)
RAG as a recommendation pipeline: keyword search (BM25) for exact term matching → embedding search (dense retrieval) for semantic similarity → fine-tuned reranker (instruction-following model) to pick the top-k most relevant schema fragments (they found off-the-shelf rerankers underperform by 15-20% without domain fine-tuning - a skeletal version of this pipeline follows the list)
Latency architecture: fast models (GPT-4o mini) for planning and simple queries, reasoning models (Claude Sonnet, Gemini 2.5 Pro) for complex SQL generation, aggressive caching at every layer (they hit 200-300ms for cached queries, 3-5s for complex uncached analysis)
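Here’s what that three-stage retrieval pipeline looks like in skeletal form. `rank_bm25` is a real package, but `embed()` and `rerank()` are hypothetical stubs you’d replace with a real encoder and a fine-tuned cross-encoder:

```python
# Three-stage retrieval sketch: BM25 -> dense retrieval -> rerank.
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

schema_fragments = [
    "table orders: order_id, user_id, total_amount, created_at",
    "table users: user_id, signup_date, plan, region",
    "table events: event_id, user_id, event_type, ts",
]

def embed(text):
    """Hypothetical stub - swap in a real embedding model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def rerank(query, candidates):
    """Hypothetical stub - swap in a fine-tuned cross-encoder."""
    overlap = lambda c: len(set(query.split()) & set(c.split()))
    return sorted(candidates, key=overlap, reverse=True)

def top_k(scores, k=2):
    return set(np.argsort(scores)[-k:])

query = "monthly retention by signup cohort"

# Stage 1: keyword scores catch exact term matches.
bm25 = BM25Okapi([f.split() for f in schema_fragments])
keyword_scores = bm25.get_scores(query.split())

# Stage 2: dense similarity catches semantic matches BM25 misses.
q_vec = embed(query)
dense_scores = np.array([embed(f) @ q_vec for f in schema_fragments])

# Stage 3: union both candidate sets, then rerank for the final top-k.
candidates = [schema_fragments[i] for i in top_k(keyword_scores) | top_k(dense_scores)]
print(rerank(query, candidates)[:2])
```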
🚀 Flotilla: When Spark Can’t Handle Your Multimodal Data (18x Faster)
Daft’s new distributed execution engine Flotilla is purpose-built for workloads that mix structured data with images, videos, PDFs, and audio. If you’ve tried processing terabytes of images with Spark and watched it crawl, the architectural choices here are instructive.
Performance claims: 18x faster than Spark and Ray Data on multimodal benchmarks (they tested on image embedding generation across 10TB+ datasets - the kind of preprocessing needed for training vision models or RAG over images)
The technical difference: Spark’s task scheduler assumes uniform task duration and homogeneous data, which breaks down when some tasks process 100KB images and others handle 50MB videos (Flotilla does content-aware scheduling and dynamic resource allocation based on actual data characteristics)
Where this matters: ML preprocessing pipelines that need to resize/normalize millions of images, extract frames from videos, run inference models (CLIP embeddings, OCR), and join results with structured metadata (doing this efficiently requires understanding data size distribution, not just partition count - a short Daft sketch follows this list)
No manual tuning: Spark typically needs careful partition sizing, memory configuration, and shuffle tuning for multimodal data; Flotilla profiles data characteristics and auto-tunes (which matters when your data engineer doesn’t want to become a Spark performance expert)
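Daft’s Python API makes the shape of these pipelines clear. This is a hedged sketch - paths and column names are hypothetical, and it’s worth checking Daft’s docs for the current URL/image expression methods:

```python
# Hedged sketch of a multimodal Daft pipeline; paths and columns hypothetical.
import daft
from daft import col

# Structured metadata with a column of image URLs.
df = daft.read_parquet("s3://my-bucket/image-metadata/*.parquet")

df = (
    df
    .with_column("image_bytes", col("image_url").url.download())  # fetch media
    .with_column("image", col("image_bytes").image.decode())      # bytes -> image
    .with_column("thumb", col("image").image.resize(224, 224))    # normalize for a model
)

# Under Flotilla, these tasks are scheduled by actual data size instead of
# assuming uniform partitions, so a 50MB video row and a 100KB thumbnail
# aren't treated as equal units of work.
df.show(3)
```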
💡 e6data AI Analyst Early Access
We’re launching early access to e6data AI Analyst, and honestly, it’s about time someone built a data querying interface that actually understands how humans think about data. Ask questions exactly as they occur to you, get contextual multi-turn conversations, and for once, actually enjoy the follow-up process.
95%+ accuracy on enterprise workloads: Because “pretty good” genuinely isn’t good enough when you’re dealing with 1000+ tables and the kind of complex relationships that make SQL joins look like abstract art
Multi-turn conversations that make sense: Your data can finally talk back in a way that doesn’t require translating human curiosity into rigid query syntax
Zero migration headaches: Works with your existing data platform because we know exactly how much “fun” those migration projects actually are
→ Get early access here