

# How to Pass AWS Certified Data Engineer Associate (DEA-C01) in 2026: Complete Study Guide

The AWS Certified Data Engineer Associate (DEA-C01) is AWS's dedicated certification for data engineers. It validates your ability to build data pipelines, manage data storage, operate data systems in production, and apply security and governance to data workloads on AWS.

Unlike the Solutions Architect exam, which covers the full breadth of AWS, the DEA-C01 goes deep on the services data engineers use every day: AWS Glue, Amazon Kinesis, Amazon Redshift, AWS Lake Formation, Amazon EMR, and Amazon Athena. If you build data pipelines on AWS, this certification validates your expertise in a highly visible way.

This guide covers everything you need: exam format, all four domains in detail, key architectural patterns, and a 5-week study plan.

---

## Exam Facts at a Glance

| Detail | Value |
|---|---|
| Exam code | DEA-C01 |
| Exam cost | $150 USD |
| Number of questions | 65 |
| Time limit | 130 minutes |
| Passing score | 720 / 1000 (scaled score) |
| Format | Multiple choice, multiple response |
| Delivery | Pearson VUE (online or test center) |
| Validity | 3 years |
| Prerequisites | None (1-2 years of data engineering experience recommended) |

At $150, this is one of the more affordable AWS certifications. The 130-minute window for 65 questions gives you 2 minutes per question — manageable if you know the material, tight if you are guessing on data-specific services.

---

## Domain Breakdown

### Domain 1: Data Ingestion and Transformation (34%)

The largest domain by a significant margin. It covers everything that happens before data lands in its final storage destination: collecting data from sources, moving it, and reshaping it into usable formats.
Key topics:

- AWS Glue: crawlers, ETL jobs (Spark, Python Shell, Ray, Streaming), DynamicFrame API
- AWS Glue Data Catalog as the central metadata repository
- Amazon Kinesis Data Streams: real-time ingestion, shard capacity, consumers
- Amazon Kinesis Data Firehose: delivery to S3/Redshift/OpenSearch, transformation with Lambda
- Amazon Kinesis Data Analytics (now Amazon Managed Service for Apache Flink): SQL on streams
- Amazon MSK (Managed Streaming for Apache Kafka): Kafka-compatible streaming
- AWS Database Migration Service (DMS): full load, CDC, and full load + CDC modes
- AWS DataSync: accelerated file and object storage migration
- Apache Spark transformation patterns on AWS (EMR, Glue)
- AWS Step Functions for orchestrating multi-step pipelines
- Amazon EventBridge for event-driven pipeline triggers

### Domain 2: Data Store Management (26%)

This domain covers the design and operation of data storage systems — both where to store data and how to manage it efficiently.

Key topics:

- Amazon S3 as a data lake foundation: partitioning strategies, storage classes, lifecycle policies
- AWS Lake Formation: centralized permissions, data lake creation, blueprints
- Amazon Redshift: distribution styles, sort keys, COPY/UNLOAD commands, Spectrum
- Amazon Redshift Serverless: RPU (Redshift Processing Units) capacity model
- Amazon RDS and Aurora for transactional source systems
- Amazon DynamoDB for NoSQL: on-demand vs provisioned capacity, DynamoDB Streams
- Amazon OpenSearch Service for log analytics and search
- Apache Iceberg, Hudi, and Delta Lake format support on S3

### Domain 3: Data Operations and Support (22%)

Once pipelines are built, they need to be monitored, maintained, and optimized. This domain tests your operational knowledge.
Key topics:

- AWS Glue job monitoring: CloudWatch metrics, job bookmarks, error handling
- Amazon CloudWatch: custom metrics, dashboards, alarms for pipeline health
- AWS Glue Workflows for multi-job orchestration
- Amazon Managed Workflows for Apache Airflow (MWAA) for complex DAG orchestration
- AWS Step Functions for serverless workflow orchestration
- AWS Glue DataBrew for visual data preparation and profiling
- Redshift performance tuning: EXPLAIN, ANALYZE, WLM (Workload Management)
- EMR cluster types: instance groups vs instance fleets, auto-scaling
- Troubleshooting DMS replication: task logs, latency metrics

### Domain 4: Data Security and Governance (18%)

The smallest domain, but one where mistakes are costly in production. It covers encryption, access control, and compliance for data workloads.

Key topics:

- AWS Lake Formation column-level and row-level security (cell-level security)
- Lake Formation permissions vs S3 bucket policies (understand which takes precedence)
- AWS Glue Data Catalog resource policies
- Amazon Macie for PII discovery in S3
- AWS KMS encryption for S3, Redshift, Glue, DynamoDB
- S3 server-side encryption: SSE-S3, SSE-KMS, SSE-C
- VPC endpoints for S3 and Redshift (keep data traffic off the internet)
- AWS CloudTrail for API audit logging
- Amazon Athena workgroups for cost control and access separation

---

## AWS Glue: The ETL Core

AWS Glue appears throughout every domain and is the single most important service for this exam. Understand it at a deeper level than most AWS certifications require.
### Glue Job Types

| Job Type | Runtime | Best For |
|---|---|---|
| Spark | Apache Spark (distributed) | Large-scale batch ETL, complex transformations |
| Python Shell | Pure Python (single node) | Small data, API calls, lightweight processing |
| Ray | Ray framework (distributed Python) | ML preprocessing, Python-native parallelism |
| Streaming | Spark Streaming | Near-real-time ETL from Kinesis or Kafka |

The exam distinguishes between job types. Use Spark for scale, Python Shell for simplicity when data is small, and Streaming when you need continuous processing.

### DynamicFrame vs DataFrame

AWS Glue introduces the DynamicFrame as an alternative to Spark's DataFrame. The key differences:

- **DynamicFrame** handles semi-structured data with inconsistent schemas (e.g., a JSON field that is sometimes a string, sometimes an array). Fields with ambiguous types are represented as a choice type that you resolve explicitly (for example, with `resolveChoice`).
- **DataFrame** requires a fixed schema — schema inconsistencies cause errors.
- You can convert between them: `dyf.toDF()` and `DynamicFrame.fromDF(df, glue_ctx, "name")`.

💡 **Exam Tip:** Use DynamicFrame when source data has schema inconsistencies (a common real-world scenario). Convert to DataFrame when you need Spark SQL operations or Spark MLlib.

### Glue Crawlers

Crawlers automatically scan data stores (S3, JDBC, DynamoDB) and populate the Glue Data Catalog with table metadata. Key points:

- Crawlers infer schema and create or update table definitions
- Crawlers can detect partition changes and update the catalog automatically
- Scheduled crawlers run on a cron schedule; you can also trigger them on demand
- For large S3 buckets with a known partition structure, consider maintaining the catalog manually to avoid crawler costs

### Job Bookmarks

Glue job bookmarks enable incremental processing — only data added since the last job run is processed. Bookmarks work for S3 sources (by tracking file modification timestamps) and JDBC sources (by tracking the maximum values of bookmark keys).
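The S3 flavor of that behavior can be sketched in a few lines of plain Python. This is a hypothetical toy model (the `incremental_files` helper is invented for illustration, not part of the Glue API): track the newest modification timestamp processed so far, and pick up only files newer than it on the next run.

```python
# Toy model of Glue job-bookmark behavior for an S3 source.
# Illustrative only -- not the actual Glue implementation.

def incremental_files(listing, bookmark):
    """listing: list of (key, mtime) tuples; bookmark: last processed mtime or None.
    Returns (keys to process, updated bookmark)."""
    fresh = [(k, t) for k, t in listing if bookmark is None or t > bookmark]
    next_bookmark = max((t for _, t in fresh), default=bookmark)
    return [k for k, _ in fresh], next_bookmark

# First run: no bookmark yet, so every file is processed.
listing = [("s3://bkt/a.json", 100), ("s3://bkt/b.json", 200)]
keys, bm = incremental_files(listing, None)   # both files; bookmark advances to 200

# Second run: a new file arrives; only it is picked up.
listing.append(("s3://bkt/c.json", 300))
keys, bm = incremental_files(listing, bm)     # only c.json; bookmark advances to 300
```

The exam trap to remember: if a job runs with bookmarks enabled but fails mid-run, the bookmark is not advanced, so the next run reprocesses the same input.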
---

## Amazon Kinesis: Streaming Data Ingestion

### Kinesis Data Streams

Kinesis Data Streams provides real-time data ingestion with sub-second latency. Key capacity facts:

- Each shard supports **1 MB/s write** and **2 MB/s read** throughput
- Data is retained for 24 hours by default (up to 365 days with extended retention)
- Records are ordered within a shard, not across shards
- Shard count determines total throughput — scale by splitting shards

**Enhanced fan-out**: dedicated 2 MB/s read throughput per consumer per shard, using HTTP/2 push delivery. Use it when multiple consumers need to read at full throughput simultaneously.

### Kinesis Data Firehose

Firehose is a fully managed delivery service — there is no consumer code to write. It buffers data and delivers it to destinations (S3, Redshift, OpenSearch, Splunk, HTTP endpoints). Buffer settings:

- **Buffer size**: 1 MB to 128 MB (delivery triggers when the buffer is full)
- **Buffer interval**: 60 seconds to 900 seconds (delivery triggers at the interval even if the buffer is not full)

Whichever threshold is hit first triggers delivery. For dynamic partitioning, Firehose can use an inline Lambda function or its built-in JQ expressions to route records to different S3 prefixes based on record content.

💡 **Exam Tip:** Kinesis Data Streams is for real-time processing with custom consumers and the ability to replay records within the retention window (delivery is at-least-once, not exactly-once). Firehose is for managed delivery to storage/analytics destinations and does not provide replay.

---

## Amazon Redshift Architecture

Redshift is a columnar, massively parallel processing (MPP) data warehouse built for analytical queries against large datasets.
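The MPP model hinges on how rows are assigned to compute slices. A toy Python sketch of KEY-style distribution (the `slice_for_key` helper is hypothetical and is not Redshift's real hash function) shows the core idea: rows sharing a distribution key always hash to the same slice, which is what lets joins on that key run without moving data between nodes.

```python
import hashlib

def slice_for_key(dist_key, num_slices):
    """Toy KEY-style distribution: hash the key, take it modulo the slice count.
    Illustrative only -- Redshift's internal hash function is different."""
    digest = hashlib.md5(str(dist_key).encode()).hexdigest()
    return int(digest, 16) % num_slices

# Every row with customer_id "cust-42" lands on the same slice, so a
# fact-dimension join on customer_id needs no cross-slice data movement.
assert slice_for_key("cust-42", 8) == slice_for_key("cust-42", 8)
```

This is also why KEY distribution on a skewed column is a classic anti-pattern: if most rows share one key value, one slice does most of the work.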
### Distribution Styles

| Style | How Data Is Distributed | Best For |
|---|---|---|
| EVEN | Round-robin across all slices | Tables with no clear join key, staging tables |
| KEY | Rows with the same key value go to the same slice | Fact-dimension joins, large table joins |
| ALL | Full copy of the table on every node | Small dimension tables joined frequently |
| AUTO | Redshift chooses (EVEN for small, KEY for large) | Default — let Redshift decide |

### Sort Keys

Sort keys determine the physical sort order of rows on disk. Range scans benefit enormously from well-chosen sort keys because Redshift can skip entire blocks.

- **Compound sort key**: Columns are sorted in order (like a B-tree index). Most efficient for queries that filter on the leading columns.
- **Interleaved sort key**: Equal weight is given to each column. Useful when queries filter on different columns in different patterns. Has higher VACUUM overhead.

### COPY Command

The COPY command is the recommended way to load data into Redshift. It parallelizes the load across nodes and slices:

```sql
COPY sales
FROM 's3://mybucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS PARQUET;
```

Key COPY parameters:

- `MANIFEST`: Load only the specific files listed in a manifest JSON file
- `REGION`: Specify the S3 bucket region if it differs from the Redshift cluster region
- `COMPUPDATE ON`: Run compression analysis during COPY (slower load but more efficient storage)
- `STATUPDATE ON`: Update table statistics after the load

### Redshift Spectrum

Spectrum lets you query data directly in S3 from Redshift without loading it. Use it for historical data that is rarely queried, or to combine S3 data with Redshift table data in a single query. Spectrum tables are defined as external tables in the Glue Data Catalog.

---

## AWS Lake Formation

Lake Formation centralizes data lake governance.
Instead of managing S3 bucket policies and IAM policies separately for every table, you grant permissions at the Lake Formation level and it enforces them.

### Lake Formation vs S3 Bucket Policies

A common exam question: if a user has S3 full access but no Lake Formation permissions, can they read a table registered with Lake Formation? The answer: **Lake Formation permissions take precedence over S3 permissions for registered data locations.** A user needs both S3 access (or Lake Formation grants the IAM role S3 access on their behalf) AND Lake Formation table/column permissions to read the data through the Glue Data Catalog or Athena.

### Column-Level Security

Lake Formation can restrict access to specific columns using column permissions. A data analyst role might see all columns except PII columns (SSN, credit card number). This is enforced transparently — the analyst simply sees fewer columns when querying through Athena or Redshift Spectrum.

Row-level security restricts rows based on filter expressions — for example, a regional analyst only sees rows where `region = 'US-EAST'`. Combining row filters with column restrictions is called cell-level security.

---

## Pipeline Orchestration Options

| Service | Type | Best For |
|---|---|---|
| AWS Glue Workflows | Managed, Glue-native | Multi-job Glue pipelines |
| AWS Step Functions | Serverless, general-purpose | Cross-service workflows, error handling |
| Amazon MWAA | Managed Apache Airflow | Complex DAGs, teams already using Airflow |
| Amazon EventBridge | Event-driven | Trigger pipelines on AWS events (S3 PUT, DynamoDB Streams) |

Step Functions is the AWS-native choice for serverless orchestration. MWAA is the right choice when teams have existing Airflow DAGs or need Airflow's rich ecosystem of operators.

---

## 5-Week Study Plan

**Week 1 — Glue and Ingestion**

Study AWS Glue in depth: crawlers, job types, DynamicFrame vs DataFrame, job bookmarks. Build an ETL job that reads from S3 (CSV), transforms with a DynamicFrame, and writes to S3 (Parquet).
Study the Glue Data Catalog structure.

**Week 2 — Streaming and Migration**

Study the Kinesis Data Streams shard capacity model. Build a Firehose delivery stream to S3 with a Lambda transformation. Study the DMS replication modes (full load, CDC, full load + CDC). Read the MSK vs Kinesis comparison documentation.

**Week 3 — Redshift and S3 Data Lake**

Study Redshift distribution styles and sort keys. Run COPY from S3 with a MANIFEST. Query external tables with Redshift Spectrum. Study Athena: partitioning, workgroups, output location. Build a Lake Formation database with column-level permissions.

**Week 4 — Security, Governance, and Operations**

Study the Lake Formation permissions model vs S3 bucket policies. Configure Macie for PII detection. Review CloudWatch metrics for Glue, Kinesis, and Redshift. Study Redshift WLM and VACUUM types. Review Step Functions for ETL orchestration.

**Week 5 — Practice Exams and Review**

Take full-length practice exams under timed conditions. Focus on Glue job bookmark behavior, Kinesis throughput calculations, Lake Formation vs S3 policy precedence, and DMS CDC behavior — the most common exam traps. Review every wrong answer against the documentation.

---

## Study Resources

**Free:**

- AWS documentation: Glue Developer Guide, Redshift Database Developer Guide, Kinesis Developer Guide
- AWS Skill Builder: Data Engineer learning plan
- AWS official DEA-C01 exam guide
- AWS re:Invent sessions on YouTube (search "Data Engineering on AWS 2024")

**Paid:**

- CertLand DEA-C01 practice exam — 340 questions covering all four domains with detailed explanations and exam tips
- Udemy: "AWS Certified Data Engineer Associate" courses

---

## Final Tips

The DEA-C01 is a scenario-based exam. Questions describe a business situation — a company needs to ingest 10 TB/day of clickstream data with sub-second latency and a requirement to replay historical data — and ask which service configuration satisfies all constraints.
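That example scenario rewards quick arithmetic: the replay requirement rules out Firehose, and the Kinesis Data Streams shard count follows from the 1 MB/s per-shard write limit. A rough sketch, assuming evenly distributed traffic (real sizing must also cover peak rates):

```python
import math

# Back-of-the-envelope shard count for the example scenario:
# 10 TB/day of clickstream data into Kinesis Data Streams,
# with each shard accepting 1 MB/s of writes.
# Assumes perfectly even traffic -- illustrative only.
tb_per_day = 10
mb_per_sec = tb_per_day * 1024 * 1024 / 86400   # sustained write rate, ~121.4 MB/s
shards = math.ceil(mb_per_sec / 1)               # 1 MB/s write capacity per shard
print(shards)                                    # → 122
```

Being able to do this kind of calculation in your head (TB/day to MB/s, then divide by per-shard capacity) is exactly what the "Kinesis throughput calculations" exam trap tests.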
Practice building pipelines, not just reading documentation. Stand up a Glue job, a Kinesis stream, and an Athena table. The hands-on familiarity will help you eliminate wrong answers quickly.

At $150, this is one of the best-value AWS certifications for cloud data engineers. Five focused weeks of study, combined with hands-on practice, is sufficient for candidates with 1-2 years of AWS data engineering experience.

Good luck.
