Foundation of Data Systems
Chapter 1: Basics of Modern Data Architecture
Exploring the landscape of data engineering
What is data engineering?
Types of data engineering problems
Responsibilities and challenges of a Java data architect
Data architect versus data engineer
Challenges of a data architect
Techniques to mitigate those challenges
Chapter 2: Data Storage and Databases
Understanding data types, formats, and encodings
Understanding file, block, and object storage
The data lake, data warehouse, and data mart
Databases and their types
Data model design considerations
Chapter 3: Identifying the Right Data Platform
Virtualization and containerization platforms
Benefits of virtualization
Benefits of containerization
Benefits of cloud computing
Choosing the correct platform
When to choose virtualization versus containerization
Choosing between on-premise versus cloud-based solutions
Choosing between various cloud vendors
Building Data Processing Pipelines
Chapter 4: A Batch-Based Solution to Ingesting Data in a Data Warehouse
Understanding the problem and source data
Understanding the source data
Building an effective data model
Relational data warehouse schemas
Evaluation of the schema design
Implementing and unit testing the solution
Chapter 5: Architecting a Batch Processing Pipeline
Developing the architecture and choosing the right tools
Architecting the solution
Factors that affect your choice of storage
Determining storage based on cost
The cost factor in the processing layer
Implementing the solution
Profiling the source data
Writing the Spark application
Deploying and running the Spark application
Developing and testing a Lambda trigger
Performance tuning a Spark job
Querying the ODL using AWS Athena
Chapter 6: Architecting a Real-Time Processing Pipeline
Understanding and analyzing the streaming problem
Architecting the solution
Implementing and verifying the design
Setting up Apache Kafka on your local machine
Developing the Kafka streaming application
Unit testing a Kafka Streams application