Course Introduction
Professional certification
Google Professional Data Engineer
Design, build, and operate data systems on Google Cloud. This guide focuses on pipelines, storage/processing choices, security, and the exam scenarios you will face.
Storage & models
Choose between BigQuery, Bigtable, Spanner, Cloud SQL based on consistency, scale, and latency.
Pipelines
Batch/streaming patterns with Dataflow, Pub/Sub, Dataproc, and Composer control.
Security & governance
IAM, CMEK, VPC Service Controls, lineage, and data quality checkpoints.
Status
The interactive guide is live; more labs and flashcards are coming.
Official Exam Objectives
Focus on design, implementation, and operationalization of data pipelines and ML models on Google Cloud.
Exam Overview
Length: 2 hours
Format: 50-60 multiple-choice and multiple-select questions
Prereqs: strong SQL and Python/Java, plus distributed systems and data modeling fundamentals.
Focus: selecting the right service based on cost, latency, throughput, and access patterns.
Section 1: Designing Data Processing Systems
Storage + pipelines.
Selecting the right storage
BigQuery: serverless OLAP; partitioning + clustering for cost/perf (see the table-creation sketch after this list).
Bigtable: wide-column, sub-10ms latency; row key design matters.
Spanner: global SQL, strong consistency.
Cloud SQL: regional OLTP and lift & shift.
GCS: data lake storage + lifecycle tiers.
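Partitioning and clustering are set at table creation. A minimal sketch with the google-cloud-bigquery client; the project, dataset, and column names are hypothetical:

```python
# Minimal sketch: create a day-partitioned, clustered BigQuery table.
# Project, dataset, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

schema = [
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

table = bigquery.Table("my-project.analytics.events", schema=schema)
# Partition by day on the event timestamp so filtered queries prune scanned bytes.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_ts",
)
# Cluster by a frequently filtered column for additional block pruning.
table.clustering_fields = ["customer_id"]

client.create_table(table)
```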
Designing data pipelines
Batch vs streaming: bounded vs unbounded data.
Pub/Sub: ingestion layer; ordering keys when order matters.
Dataflow: Beam for batch/stream; windows, watermarks, triggers.
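To make windows and unbounded sources concrete, here is a minimal streaming Beam sketch that reads from Pub/Sub and counts elements per fixed window; the topic name is hypothetical:

```python
# Minimal sketch: streaming Beam pipeline with fixed windows.
# Topic name is hypothetical; run on Dataflow with --runner=DataflowRunner.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # Unbounded source: reading Pub/Sub makes this a streaming pipeline.
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
        # Fixed 60-second windows; watermarks decide when each window closes.
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        | "KeyByValue" >> beam.Map(lambda v: (v, 1))
        | "CountPerWindow" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```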
Section 2: Building & Operationalizing Pipelines
ETL, orchestration, migration.
Build pipelines
Dataflow: autoscaling, templates.
Dataproc: lift & shift Hadoop/Spark; preemptible/Spot VMs for secondary workers (see the sketch after this list).
Cloud Data Fusion: no-code ETL.
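A minimal sketch of submitting a PySpark job to an existing Dataproc cluster with the google-cloud-dataproc client; project, region, cluster name, and script URI are hypothetical:

```python
# Minimal sketch: submit a PySpark job to an existing Dataproc cluster.
# Project, region, cluster, and script path are hypothetical.
from google.cloud import dataproc_v1

client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-cluster"},
    "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/etl.py"},
}

operation = client.submit_job_as_operation(
    request={"project_id": "my-project", "region": "us-central1", "job": job}
)
result = operation.result()  # blocks until the job finishes
print(result.status.state)
```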
Orchestration
Composer: Airflow DAGs across services.
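Composer runs standard Airflow, so a DAG is plain Python. A minimal sketch using the Google provider's GCS-to-BigQuery operator; the DAG ID, bucket, and table names are hypothetical:

```python
# Minimal sketch of an Airflow DAG for Cloud Composer.
# DAG ID, bucket, and table names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="daily_gcs_to_bq",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Load each day's files from GCS into a BigQuery table.
    load = GCSToBigQueryOperator(
        task_id="load_events",
        bucket="my-bucket",
        source_objects=["events/{{ ds }}/*.json"],
        destination_project_dataset_table="my-project.analytics.events",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
    )
```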
Migration
Storage Transfer Service: S3/Azure/HTTP to GCS.
Transfer Appliance: physical device for petabyte-scale offline migrations.
BigQuery Data Transfer Service: scheduled ingestion from SaaS sources (e.g., Google Ads, YouTube).
Section 3: Operationalizing ML
Pre-built APIs, BigQuery ML, Vertex AI.
Pre-built ML APIs
Vision, Speech-to-Text, Natural Language, Video Intelligence, Translation: reach for these first when the use case is standard.
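These APIs need no training data. A minimal sketch calling Vision label detection; the image URI is hypothetical:

```python
# Minimal sketch: call a pre-built ML API (Vision label detection).
# The GCS image URI is hypothetical.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image(source=vision.ImageSource(image_uri="gs://my-bucket/photo.jpg"))

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, label.score)
```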
BigQuery ML
Use case: SQL-first modeling for data analysts.
Models: regression, k-means, matrix factorization, time series.
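Training happens entirely in SQL inside the warehouse. A minimal sketch run through the Python client; the dataset, model, and column names are hypothetical:

```python
# Minimal sketch: train and evaluate a BigQuery ML model with SQL.
# Dataset, model, and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

# CREATE MODEL runs inside BigQuery; no data leaves the warehouse.
client.query(
    """
    CREATE OR REPLACE MODEL `analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, churned
    FROM `analytics.customers`
    """
).result()

# ML.EVALUATE returns standard metrics (precision, recall, ROC AUC, ...).
for row in client.query(
    "SELECT * FROM ML.EVALUATE(MODEL `analytics.churn_model`)"
).result():
    print(dict(row))
```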
Vertex AI
AutoML: low-code training.
Training: custom jobs (TF/PyTorch); see the sketch after this list.
Feature Store: avoid training/serving skew.
Pipelines: reproducible MLOps workflows.
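A minimal sketch of a Vertex AI custom training job via the aiplatform SDK; the project, bucket, script, and container image are hypothetical (verify the current prebuilt training container URIs in the docs):

```python
# Minimal sketch: launch a Vertex AI custom training job.
# Project, bucket, script, and container URI are hypothetical.
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",
    location="us-central1",
    staging_bucket="gs://my-bucket",
)

job = aiplatform.CustomTrainingJob(
    display_name="train-churn-model",
    script_path="train.py",  # local training script, uploaded by the SDK
    # Prebuilt TF training image; check the docs for current URIs.
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12:latest",
)

job.run(replica_count=1, machine_type="n1-standard-4")
```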
Section 4: Ensuring Solution Quality
Security + monitoring.
Security & compliance
IAM: dataset/table/row-level security; policy tags for columns.
DLP: discover and mask PII (see the sketch after this list).
CMEK: customer-managed keys for BigQuery/GCS.
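A minimal sketch of DLP de-identification that masks email addresses in free text; the project ID is hypothetical:

```python
# Minimal sketch: mask PII with the Cloud DLP API.
# Project ID is hypothetical.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()

response = client.deidentify_content(
    request={
        "parent": "projects/my-project",
        # Look for email addresses...
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        # ...and replace each matched character with '#'.
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        "primitive_transformation": {
                            "character_mask_config": {"masking_character": "#"}
                        }
                    }
                ]
            }
        },
        "item": {"value": "Contact: jane@example.com"},
    }
)
print(response.item.value)  # email replaced by '#' characters
```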
Monitoring & optimization
Monitoring: alerts for failures and latency.
Best practices: avoid SELECT *; filter on partitions and cluster columns; denormalize with nested and repeated fields.
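A dry run reports bytes scanned without executing the query, which makes these practices measurable. A minimal sketch; the table and column names are hypothetical:

```python
# Minimal sketch: estimate query cost with a BigQuery dry run.
# Table and column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Selecting only needed columns from one partition keeps scanned bytes low.
job = client.query(
    """
    SELECT customer_id, amount
    FROM `analytics.events`
    WHERE DATE(event_ts) = '2024-06-01'
    """,
    job_config=config,
)
print(f"Would scan {job.total_bytes_processed} bytes")
```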
Cheatsheet: Which Service Should I Use?
SQL analytics on petabytes → BigQuery
Low latency + high throughput writes → Bigtable
Global consistency (banking/OLTP) → Spanner
Lift & shift MySQL/Postgres → Cloud SQL
Lift & shift Hadoop/Spark → Dataproc
Serverless batch/stream processing → Dataflow
Orchestrating workflows/DAGs → Cloud Composer
Streaming ingestion → Pub/Sub
ML for SQL users → BigQuery ML
No-code ETL → Cloud Data Fusion
Flashcards
Test your knowledge of GCP Services
Architecture Decision Diagrams
Storage Decision