
    Data Engineer Interview Help: SQL, Pipelines, and System Design Questions

    Essential data engineering interview questions covering SQL, ETL pipelines, data modeling, and distributed systems — with AI-powered preparation tips.

    March 10, 2026
    5 min read
    Craqly Team

    What Data Engineering Interviews Cover

    Data engineering interviews in 2026 test a unique combination of skills: SQL mastery, distributed systems knowledge, pipeline architecture, data modeling, and increasingly, familiarity with AI/ML data infrastructure. The breadth is challenging — you might write complex SQL window functions, design a real-time streaming pipeline, and discuss data governance policies all in the same interview.

    SQL Questions (The Foundation)

    1. Write a query to find the second highest salary in each department.

    Approach: Window functions — DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC). Filter for rank = 2. Discuss why DENSE_RANK vs ROW_NUMBER vs RANK matters.
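A minimal sketch, assuming a hypothetical employees(department, salary) table:

```sql
-- Rank salaries within each department, highest first.
SELECT department, salary
FROM (
    SELECT
        department,
        salary,
        DENSE_RANK() OVER (
            PARTITION BY department
            ORDER BY salary DESC
        ) AS salary_rank
    FROM employees
) ranked
WHERE salary_rank = 2;
```

The choice of ranking function matters when salaries tie: with DENSE_RANK, a tie at the top still leaves rank 2 pointing at the second-highest distinct salary; ROW_NUMBER would assign 2 to one of the tied top rows and return the wrong salary; RANK would skip rank 2 entirely after a tie at rank 1.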

    2. Explain the difference between WHERE and HAVING.

    WHERE filters rows before grouping. HAVING filters groups after aggregation. WHERE cannot reference aggregate functions; HAVING can.
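A short illustration, again assuming a hypothetical employees table with a hire_date column:

```sql
-- WHERE filters individual rows before GROUP BY runs;
-- HAVING filters the aggregated groups afterwards.
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= DATE '2020-01-01'   -- row-level filter
GROUP BY department
HAVING AVG(salary) > 100000;           -- group-level filter
```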

    3. What are window functions and when do you use them?

    Functions that operate over a set of rows related to the current row. ROW_NUMBER, RANK, DENSE_RANK for ranking. LAG, LEAD for accessing adjacent rows. SUM/AVG OVER for running totals. Critical for analytics queries without self-joins.
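A compact example of the last two cases, assuming a hypothetical daily_sales(sale_date, amount) table:

```sql
SELECT
    sale_date,
    amount,
    -- previous day's amount; NULL for the first row
    LAG(amount) OVER (ORDER BY sale_date) AS prev_day_amount,
    -- cumulative sum up to and including the current row
    SUM(amount) OVER (
        ORDER BY sale_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM daily_sales;
```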

    4. Explain query optimization strategies.

    Indexing strategy, EXPLAIN plan analysis, avoiding SELECT *, reducing subqueries (use CTEs or JOINs), partitioning large tables, materialized views for expensive aggregations, query caching.
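One habit worth demonstrating is reading the plan before rewriting the query. A sketch in PostgreSQL syntax, with an illustrative orders table:

```sql
-- EXPLAIN ANALYZE executes the query and reports the actual plan,
-- row counts, and timings, exposing sequential scans and bad estimates.
EXPLAIN ANALYZE
SELECT customer_id, SUM(total) AS lifetime_value
FROM orders
WHERE created_at >= DATE '2026-01-01'
GROUP BY customer_id;

-- If the plan shows a full scan driven by the date filter, an index helps:
CREATE INDEX idx_orders_created_at ON orders (created_at);
```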

    5. What is a CTE and when do you prefer it over subqueries?

Common Table Expressions improve readability for complex queries, and recursive CTEs handle hierarchical data. Engines optimize CTEs differently than subqueries: PostgreSQL before version 12, for example, always materialized them as an optimization fence, while newer versions inline them when safe, so know your specific engine.
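A recursive CTE sketch for hierarchical data, assuming a hypothetical employees(id, name, manager_id) table where the top of the hierarchy has manager_id IS NULL:

```sql
WITH RECURSIVE org_chart AS (
    -- anchor: the root of the hierarchy
    SELECT id, name, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    -- recursive step: attach each employee to their manager's row
    SELECT e.id, e.name, e.manager_id, oc.depth + 1
    FROM employees e
    JOIN org_chart oc ON e.manager_id = oc.id
)
SELECT id, name, depth
FROM org_chart
ORDER BY depth;
```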

    Data Pipeline Questions

    6. Design an ETL pipeline for processing 10TB of daily data.

    Batch vs streaming decision, data ingestion (Kafka/Kinesis for streaming, S3/GCS for batch), processing (Spark/Beam), transformation logic, data quality checks, loading strategy, monitoring and alerting.

    7. What is the difference between ETL and ELT?

ETL transforms data before loading it (the traditional approach, constrained by the transform layer's compute). ELT loads raw data first and transforms it inside the warehouse, leveraging the warehouse's compute. ELT is dominant in 2026, enabled by powerful warehouses such as BigQuery and Snowflake.
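A minimal sketch of the "T" in ELT happening inside the warehouse, assuming a hypothetical raw.raw_events table loaded as-is from the source:

```sql
-- Transform raw, already-loaded data into an analytics-ready table
-- entirely with warehouse compute (the ELT pattern).
CREATE TABLE analytics.daily_signups AS
SELECT
    CAST(event_timestamp AS DATE) AS signup_date,
    COUNT(*) AS signups
FROM raw.raw_events
WHERE event_type = 'signup'
GROUP BY CAST(event_timestamp AS DATE);
```

In practice a tool like dbt would manage such transformations as versioned, tested models.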

    8. How do you handle late-arriving data?

    Watermarks in streaming systems, reprocessing windows, lambda architecture (batch corrects streaming), event-time vs processing-time semantics, idempotent processing.

    9. Explain exactly-once semantics in streaming.

Three delivery guarantees: at-most-once, at-least-once, and exactly-once. Exactly-once requires idempotent sinks and transactional processing. Kafka supports it with transactional producers and consumers; Flink provides it through checkpointing combined with transactional or idempotent sinks.

    10. How do you test data pipelines?

    Unit tests for transformation logic, integration tests with sample data, data quality assertions (Great Expectations), schema validation, regression testing against known outputs.

    Data Modeling Questions

    11. Explain star schema vs snowflake schema.

    Star: fact table surrounded by denormalized dimensions (simpler queries, more storage). Snowflake: normalized dimensions (less storage, more complex joins). Star is preferred for analytical workloads.
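A schematic DDL sketch with illustrative names:

```sql
-- Star schema: one fact table keyed to denormalized dimensions.
CREATE TABLE dim_customer (
    customer_key  INT PRIMARY KEY,
    customer_name VARCHAR(200),
    region        VARCHAR(100)   -- denormalized: no separate region table
);

CREATE TABLE fact_sales (
    sale_id      BIGINT PRIMARY KEY,
    customer_key INT REFERENCES dim_customer (customer_key),
    sale_date    DATE,
    amount       DECIMAL(12, 2)
);

-- A snowflake schema would instead normalize region into its own
-- dim_region table referenced from dim_customer, adding a join.
```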

    12. What is a slowly changing dimension?

    Dimension data that changes over time. Type 1: overwrite (no history). Type 2: add new row with version/date range (full history). Type 3: add column for previous value (limited history).
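A Type 2 sketch with illustrative names, showing how a change closes out the current row before a new version is inserted:

```sql
CREATE TABLE dim_customer_scd2 (
    surrogate_key BIGINT PRIMARY KEY,
    customer_id   INT,           -- natural (business) key
    address       VARCHAR(300),
    valid_from    DATE,
    valid_to      DATE,          -- e.g. '9999-12-31' while current
    is_current    BOOLEAN
);

-- When customer 42 moves: expire the current row...
UPDATE dim_customer_scd2
SET valid_to = CURRENT_DATE, is_current = FALSE
WHERE customer_id = 42 AND is_current = TRUE;
-- ...then INSERT a new row with the new address,
-- valid_from = CURRENT_DATE and is_current = TRUE.
```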

    13. How do you model event data?

    Event sourcing pattern: immutable event log, derived views for different query patterns. Schema: event_id, event_type, timestamp, entity_id, payload (JSON). Partition by time, index by entity.
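A DDL sketch of that schema in PostgreSQL syntax (event_time stands in for timestamp, which collides with the SQL type name):

```sql
CREATE TABLE events (
    event_id   UUID,
    event_type VARCHAR(100),
    event_time TIMESTAMP NOT NULL,
    entity_id  BIGINT,
    payload    JSONB                -- flexible, schema-on-read attributes
) PARTITION BY RANGE (event_time);  -- partition by time

CREATE INDEX idx_events_entity ON events (entity_id);  -- index by entity
```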

    Distributed Systems Questions

    14. Explain the CAP theorem with practical examples.

Consistency, Availability, Partition tolerance: a distributed system can guarantee at most two. In practice, partitions do happen, so the real choice is CP (strong consistency, possibly unavailable during a partition) or AP (always available, eventually consistent). DynamoDB with its default eventually consistent reads is AP; Spanner is CP.

    15. How does Apache Spark work internally?

    Driver creates DAG of stages from transformations. Stages split into tasks distributed across executors. Lazy evaluation — transformations build the DAG, actions trigger execution. Shuffle operations define stage boundaries.

    16. What is data partitioning and why does it matter?

    Splitting data across nodes/files by a key. Good partitioning: even distribution, query-aligned (avoid full scans). Common strategies: hash partitioning, range partitioning, time-based partitioning.
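A time-based range partitioning sketch in PostgreSQL syntax, with an illustrative page_views table:

```sql
CREATE TABLE page_views (
    view_id    BIGINT,
    user_id    BIGINT,
    created_at TIMESTAMP NOT NULL
) PARTITION BY RANGE (created_at);

-- Queries filtering on created_at scan only the matching partition.
CREATE TABLE page_views_2026_03 PARTITION OF page_views
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

-- Hash partitioning would instead spread rows evenly by key:
-- PARTITION BY HASH (user_id), with one PARTITION OF ... FOR VALUES
-- WITH (MODULUS n, REMAINDER k) table per bucket.
```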

    Modern Data Stack Questions

    17. Compare Snowflake, BigQuery, and Redshift.

    Snowflake: multi-cloud, separation of storage/compute, virtual warehouses. BigQuery: serverless, slot-based pricing, best for GCP ecosystem. Redshift: tight AWS integration, provisioned clusters or serverless.

    18. What is a data lakehouse?

    Combines data lake flexibility (raw storage, any format) with data warehouse features (ACID transactions, schema enforcement, SQL interface). Technologies: Delta Lake, Apache Iceberg, Apache Hudi.

    19. How do you implement data quality at scale?

    Schema validation, null/uniqueness checks, statistical profiling, freshness monitoring, Great Expectations for assertion frameworks, data contracts between producers and consumers.
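Hand-rolled versions of the first few checks, written as plain SQL against an illustrative orders table; frameworks like Great Expectations generate and schedule similar assertions:

```sql
-- Null check: a required column must always be populated.
SELECT COUNT(*) AS null_violations
FROM orders
WHERE customer_id IS NULL;

-- Uniqueness check: the business key must not repeat.
SELECT order_id, COUNT(*) AS dup_count
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Freshness check: alert if the newest record is too old.
SELECT MAX(created_at) AS latest_record
FROM orders;
```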

    20. Explain data mesh architecture.

    Domain-oriented ownership, data as a product, self-serve data infrastructure, federated governance. Shifts from centralized data team to domain teams owning their data products.

    AI-Powered Interview Preparation for Data Engineers

    Data engineering interviews combine deep SQL knowledge, distributed systems concepts, and architecture design — a breadth that is hard to fully prepare for. You use documentation and references daily in your work, but interviews expect recall from memory.

    Craqly's AI assistant provides real-time suggestions during your interview. When asked about window functions, it reminds you of the syntax and edge cases. When asked to design a pipeline, it suggests components you might forget under pressure.

    The AI supplements your real experience — it does not replace it. Practice with mock interviews to get comfortable with the workflow.
