Data Engineer Interview Help: SQL, Pipelines, and System Design Questions

Hello! I’ve been collecting data engineering interview questions for a while now, specifically the SQL and pipeline ones, because those tend to be the most technically specific and the hardest to prepare for with generic guides.

This post is about the SQL and pipeline questions that come up in data engineering interviews. Not frontend or backend coding questions (those are different). Not system design at the architecture level (that’s its own topic). Just the SQL and pipeline stuff that shows up in the technical screen and onsite rounds.

1. SQL questions for data engineering (not data analyst SQL)

Data engineering SQL is different from analyst SQL. It’s less about aggregation and more about correctness, performance, and edge cases. Here’s what I mean.

The classic question is something like: “Write a query to find all rows where the same customer_id appears more than once with a different email.” Seems simple. But interviewers are watching whether you reach for GROUP BY + HAVING, or whether you know about window functions and can use COUNT() OVER (PARTITION BY customer_id) to avoid a subquery. Both work. The window function approach signals you’re thinking about query plans.

Another one I’ve seen at multiple companies: “Given a table of events with timestamps, find sessions where a user was inactive for more than 30 minutes.” This is a LAG() problem. The key is computing the time difference between consecutive events per user, then grouping on session breaks. If you’ve only used LAG() in a tutorial and never in a real pipeline, it’s worth writing this one out from scratch a few times.

CTEs (Common Table Expressions) come up constantly, not just as a syntax question but as a “can you refactor this monster subquery” exercise. Interviewers will sometimes show you a deeply nested query and ask you to make it readable. CTEs are usually the right move. The secondary question is often “when would you NOT use a CTE?” — and the answer involves performance on large scans where a temp table or materialized view might be faster.

Slowly Changing Dimensions (SCDs) are worth knowing. Type 2 SCD questions — where you need to track historical state of a record — come up often in warehousing contexts. The SQL to implement a Type 2 upsert is something you should be able to sketch out, including how you’d handle the valid_from and valid_to columns.

2. Pipeline architecture questions

These questions test whether you’ve actually built and operated pipelines or just read about them.

ETL vs. ELT comes up a lot. The short version: ETL transforms before loading (older model, good for strict compliance or limited storage), ELT loads raw then transforms inside the warehouse (modern model, works well with BigQuery, Snowflake, Redshift). Interviewers want to hear you reason about trade-offs, not recite a definition.

Idempotency is one of those words that shows up in every data engineering job description, and interviewers test it directly. “What does it mean for a pipeline to be idempotent?” The answer: running it multiple times with the same input should produce the same output with no side effects. The follow-up is usually: “How would you make this pipeline idempotent?” — and the answer involves things like MERGE/UPSERT patterns, timestamp-based partitioning for overwrites, and avoiding append-only writes without deduplication.

Late-arriving data is a practical problem that tests operational experience. “Your pipeline runs at midnight. What happens if some events arrive at 2am for yesterday’s date?” Good answers mention backfill strategies, watermarking in streaming systems (Flink, Spark Structured Streaming), and how you’d alert on completeness without blocking downstream consumers.

Testing pipelines is an area where junior engineers often give weak answers. Unit tests on transformations are straightforward. But interviewers at senior levels want to hear about data quality assertions (tools like Great Expectations or dbt tests), schema contract testing, and how you’d catch a silent failure where the pipeline ran successfully but produced wrong results.

3. The data modeling questions people skip

I think people underprepare data modeling relative to SQL and Spark. That’s a mistake, especially for warehouse-focused roles.

Star schema vs. snowflake schema is the entry-level question. Know the trade-offs: star schema is simpler for queries, snowflake is more normalized and saves storage but requires more joins. Interviewers sometimes ask “when would you pick one over the other?” and the answer depends on query patterns, not rules.

Fact tables and dimension tables. Know the difference and know why a fact table with a low cardinality dimension column directly embedded (rather than normalized out) can sometimes be fine and sometimes causes problems. This shows up as a “how would you model this?” question with a specific domain — e-commerce orders, SaaS subscriptions, etc.

The Stack Overflow Developer Survey 2024 shows that dbt has become extremely common in modern data stacks. If you’re interviewing at a company with a warehouse-centric setup, knowing dbt’s model materialization options (table, view, incremental, ephemeral) and when to use each is close to required at this point.

4. Distributed systems and Spark

Not every data engineering interview goes here. But mid-level and above at larger companies, it usually does.

The classic question: “Explain what happens when you call .collect() on a large Spark DataFrame.” The answer involves understanding the difference between transformations (lazy, build a DAG) and actions (trigger execution), and why collecting a large dataset to the driver can crash your job. Interviewers want to see you reason about driver vs. executor memory.

Data skew is a practical problem. “Your Spark job is slow and you notice one task is taking 10x longer than the others. What’s probably happening?” This is a skew question. Solutions include salting keys, repartitioning, and broadcast joins for small lookup tables. If you’ve dealt with skew in production you’ll have specific stories. If not, study the concepts and be honest that you haven’t hit it at scale yet.

CAP theorem still shows up, usually in the context of choosing between systems. “You’re designing a pipeline that needs both high availability and strong consistency. What’s the trade-off?” The correct answer acknowledges you can’t fully have both under network partition (the P in CAP), and then discusses what consistency model is actually needed for the use case.

5. What interviewers actually want from you

I’ve noticed that most pipeline and SQL questions in data engineering interviews aren’t about finding the one correct answer. They’re about seeing how you think when you hit something unfamiliar.

The candidates who move forward are usually the ones who say “I haven’t seen this exact pattern, but here’s how I’d approach it” and then actually work through the problem aloud. That’s a skill you have to practice, not just know about.

Craqly’s AI interview practice lets you work through data engineering scenarios conversationally, which is genuinely useful for that thinking-aloud muscle. It doesn’t replace building real pipelines, but it’s good for drilling the explanation layer.

The BLS reports that data engineering and related roles are among the fastest-growing in the computer and IT sector through 2032. The interview bar has risen with the demand. Knowing the concepts isn’t enough anymore — you need to be able to talk through them clearly.

That’s mostly it. The SQL and pipeline material is learnable. The communication piece under pressure is what trips most people who technically know the answers.

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top