Course Introduction
Professional certification
Google Professional Data Engineer
Design, build, and operate data systems on Google Cloud. This guide focuses on pipelines, storage/processing choices, security, and the exam scenarios you will face.
Storage & models
Choose between BigQuery, Bigtable, Spanner, Cloud SQL based on consistency, scale, and latency.
Pipelines
Batch/streaming patterns with Dataflow, Pub/Sub, Dataproc, and Composer control.
Security & governance
IAM, CMEK, VPC Service Controls, lineage, and data quality checkpoints.
Status
The interactive guide is live; more labs and flashcards are coming.
Official Exam Objectives
Focus on design, implementation, and operationalization of data pipelines and ML models on Google Cloud.
Exam Overview
Length: 2 hours
Format: 50-60 multiple-choice and multiple-select questions
Prereqs: strong SQL and Python/Java, plus distributed systems and data modeling fundamentals.
Focus: selecting the right service based on cost, latency, throughput, and access patterns.
Section 1: Designing Data Processing Systems
Storage + pipelines.
Selecting the right storage
BigQuery: serverless OLAP; partitioning + clustering for cost/perf (see the table-creation sketch after this list).
Bigtable: wide-column, sub-10ms latency; row key design matters.
Spanner: global SQL, strong consistency.
Cloud SQL: regional OLTP and lift & shift.
GCS: data lake storage + lifecycle tiers.
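Partitioning and clustering are set at table creation. A minimal sketch with the google-cloud-bigquery client; the project, dataset, and column names are hypothetical:

```python
# Minimal sketch: create a day-partitioned, clustered BigQuery table.
# Project, dataset, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

table = bigquery.Table("my-project.analytics.events", schema=schema)
# Partition by day on the event timestamp so filtered queries prune scanned bytes.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
)
# Cluster by a frequently filtered column for additional block pruning.
table.clustering_fields = ["customer_id"]

client.create_table(table)
```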
Designing data pipelines
Batch vs streaming: bounded vs unbounded data.
Pub/Sub: ingestion layer; ordering keys when order matters.
Dataflow: Beam for batch/stream; windows, watermarks, triggers.
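To make windows and unbounded sources concrete, here is a minimal streaming Beam sketch that reads from Pub/Sub and counts elements per fixed window; the topic name is hypothetical:

```python
# Minimal sketch: streaming Beam pipeline with fixed windows.
# Topic name is hypothetical; run on Dataflow with --runner=DataflowRunner.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # Unbounded source: reading Pub/Sub makes this a streaming pipeline.
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
        # Fixed 60-second windows; watermarks decide when each window closes.
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "KeyByValue" >> beam.Map(lambda v: (v, 1))
        | "CountPerWindow" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```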
Section 2: Building & Operationalizing Pipelines
ETL, orchestration, migration.
Build pipelines
Dataflow: autoscaling, templates.
Dataproc: lift & shift Hadoop/Spark; preemptible/Spot VMs for secondary workers (see the sketch after this list).
Cloud Data Fusion: no-code ETL.
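A minimal sketch of submitting a PySpark job to an existing Dataproc cluster with the google-cloud-dataproc client; project, region, cluster name, and script URI are hypothetical:

```python
# Minimal sketch: submit a PySpark job to an existing Dataproc cluster.
# Project, region, cluster, and script path are hypothetical.
from google.cloud import dataproc_v1

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/etl.py"},
}

operation = client.submit_job_as_operation(
    request={"project_id": "my-project", "region": "us-central1", "job": job}
)
result = operation.result()  # blocks until the job finishes
print(result.status.state)
```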
Orchestration
Composer: Airflow DAGs across services.
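Composer runs standard Airflow, so a DAG is plain Python. A minimal sketch using the Google provider's GCS-to-BigQuery operator; the DAG ID, bucket, and table names are hypothetical:

```python
# Minimal sketch of an Airflow DAG for Cloud Composer.
# DAG ID, bucket, and table names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="daily_gcs_to_bq",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Load each day's files from GCS into a BigQuery table.
    load = GCSToBigQueryOperator(
        task_id="load_events",
        bucket="my-bucket",
        source_objects=["events/{{ ds }}/*.json"],
        destination_project_dataset_table="my-project.analytics.events",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
    )
```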
Migration
Storage Transfer Service: S3/Azure/HTTP to GCS.
Transfer Appliance: physical device for petabyte-scale offline migrations.
BigQuery Data Transfer Service: scheduled ingestion from SaaS sources (e.g., Google Ads, YouTube).
Section 3: Operationalizing ML
Pre-built APIs, BigQuery ML, Vertex AI.
Pre-built ML APIs
Vision, Speech-to-Text, Natural Language, Video Intelligence, Translation: reach for these first when the use case is standard.
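These APIs need no training data. A minimal sketch calling Vision label detection; the image URI is hypothetical:

```python
# Minimal sketch: call a pre-built ML API (Vision label detection).
# The GCS image URI is hypothetical.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image(source=vision.ImageSource(image_uri="gs://my-bucket/photo.jpg"))

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, label.score)
```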
BigQuery ML
Use case: SQL-first modeling for data analysts.
Models: regression, k-means, matrix factorization, time series.
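Training happens entirely in SQL inside the warehouse. A minimal sketch run through the Python client; the dataset, model, and column names are hypothetical:

```python
# Minimal sketch: train and evaluate a BigQuery ML model with SQL.
# Dataset, model, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# CREATE MODEL runs inside BigQuery; no data leaves the warehouse.
client.query(
    """
    CREATE OR REPLACE MODEL `analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, churned
    FROM `analytics.customers`
    """
).result()

# ML.EVALUATE returns standard metrics (precision, recall, ROC AUC, ...).
for row in client.query(
    "SELECT * FROM ML.EVALUATE(MODEL `analytics.churn_model`)"
).result():
    print(dict(row))
```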
Vertex AI
AutoML: low-code training.
Training: custom jobs (TF/PyTorch); see the sketch after this list.
Feature Store: avoid training/serving skew.
Pipelines: reproducible MLOps workflows.
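A minimal sketch of a Vertex AI custom training job via the aiplatform SDK; the project, bucket, script, and container image are hypothetical (verify the current prebuilt training container URIs in the docs):

```python
# Minimal sketch: launch a Vertex AI custom training job.
# Project, bucket, script, and container URI are hypothetical.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-bucket",
)

job = aiplatform.CustomTrainingJob(
    display_name="train-churn-model",
    script_path="train.py",  # local training script, uploaded by the SDK
    # Prebuilt TF training image; check the docs for current URIs.
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12:latest",
)

job.run(replica_count=1, machine_type="n1-standard-4")
```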
Section 4: Ensuring Solution Quality
Security + monitoring.
Security & compliance
IAM: dataset/table/row-level security; policy tags for columns.
DLP: discover and mask PII (see the sketch after this list).
CMEK: customer-managed keys for BigQuery/GCS.
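A minimal sketch of DLP de-identification that masks email addresses in free text; the project ID is hypothetical:

```python
# Minimal sketch: mask PII with the Cloud DLP API.
# Project ID is hypothetical.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

response = client.deidentify_content(
    request={
        "parent": "projects/my-project",
        # Look for email addresses...
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        # ...and replace each matched character with '#'.
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        "primitive_transformation": {
                            "character_mask_config": {"masking_character": "#"}
                        }
                    }
                ]
            }
        },
        "item": {"value": "Contact: jane@example.com"},
    }
)
print(response.item.value)  # email replaced by '#' characters
```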
Monitoring & optimization
Monitoring: alerts for failures and latency.
Best practices: avoid SELECT *; filter on partitions and cluster columns; denormalize with nested and repeated fields.
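A dry run reports bytes scanned without executing the query, which makes these practices measurable. A minimal sketch; the table and column names are hypothetical:

```python
# Minimal sketch: estimate query cost with a BigQuery dry run.
# Table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Selecting only needed columns from one partition keeps scanned bytes low.
job = client.query(
    """
    SELECT customer_id, amount
    FROM `analytics.events`
    WHERE DATE(event_ts) = '2024-06-01'
    """,
    job_config=config,
)
print(f"Would scan {job.total_bytes_processed} bytes")
```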
Cheatsheet: Which Service Should I Use?
SQL analytics on petabytes → BigQuery
Low latency + high throughput writes → Bigtable
Global consistency (banking/OLTP) → Spanner
Lift & shift MySQL/Postgres → Cloud SQL
Lift & shift Hadoop/Spark → Dataproc
Serverless batch/stream processing → Dataflow
Orchestrating workflows/DAGs → Cloud Composer
Streaming ingestion → Pub/Sub
ML for SQL users → BigQuery ML
No-code ETL → Cloud Data Fusion
Flashcards
Test your knowledge of GCP Services
Architecture Decision Diagrams
Storage Decision