Instructor & Course Logistics

Instructor: Prof. Gil Salu

Meeting Times: Tuesday / Thursday, 9:00 AM – 10:20 AM
Location: Library 209

Office Hours: 12:00 PM – 2:00 PM
Office Location: TBD

Appointments via Zoom are available outside of office hours. If you need to meet, send me a note and we will find a time.

Distributed Systems for Data Science is a hands-on course that teaches students to build and manage data pipelines at scale using cloud and distributed computing technologies. Topics include AWS services (Lambda, S3, DynamoDB, SQS), Git/GitHub workflows, Apache Spark with Databricks, Snowflake data warehousing, streaming with Kafka, and data engineering tools like dbt and Airflow. Students complete labs deploying serverless APIs, building distributed data pipelines, and working with modern file/table formats (Parquet, Delta, Iceberg).

Meeting Times

Tuesdays and Thursdays, 9:00 AM – 10:20 AM, in person in Library 209.

Key Dates (per campus calendar)

Weekly Topics (Weeks 1–15)

  1. Week 1 (Tue Jan 27, Thu Jan 29) — Git fundamentals; AWS onboarding (IAM, CLI, budgets). Lab: branch conflict + GitHub Pages.
  2. Week 2 (Tue Feb 3, Thu Feb 5) — S3 static sites; CloudFront overview. Lab: publish static site to S3 (+ optional CloudFront).
  3. Week 3 (Tue Feb 10, Thu Feb 12) — Serverless: API Gateway + Lambda with DynamoDB. Lab: deploy Python API.
  4. Week 4 (Tue Feb 17, Thu Feb 19) — Scaling fundamentals + AWS patterns (SQS/SNS/EventBridge). Short design critique.
  5. Week 5 (Tue Feb 24, Thu Feb 26) — AWS Lab Week: Distributed Cipher (Lambda + SQS + DynamoDB + S3) design and implementation.
  6. Week 6 (Tue Mar 3, Thu Mar 5) — Data at scale: pandas limits → Polars; benchmarking.
  7. Week 7 (Tue Mar 10, Thu Mar 12) — Spark 101 + Databricks notebooks. Midterm opens Thu evening.
  8. Week 8 (Mar 16–20) — Spring Break (no class).
  9. Week 9 (Tue Mar 24, Thu Mar 26) — File formats (CSV/Avro/Parquet); table formats (Delta/Iceberg). Midterm due Tue night.
  10. Week 10 (Tue Mar 31 — no class; Thu Apr 2) — Snowflake fundamentals.
  11. Week 11 (Tue Apr 7, Thu Apr 9) — Load from S3; operating warehouses & cost control.
  12. Week 12 (Tue Apr 14, Thu Apr 16) — Streaming with Redpanda/Kafka; producer/consumer lab.
  13. Week 13 (Tue Apr 21, Thu Apr 23) — Project/buffer week; integration time.
  14. Week 14 (Tue Apr 28 — no class; Thu Apr 30) — Snowpark for Python + ML.
  15. Week 15 (Tue May 5, Thu May 7) — Project studio + final presentations; final short quiz released.

Assessments & Weights 

Tooling & Accounts

Using Git in this Class

We will use GitHub Classroom to distribute starter code and run autograding. In Week 1 you will complete a guided merge-conflict exercise and publish a GitHub Pages site from a dedicated branch.

Submission, Deadlines, and Late Work

This course is designed around steady progress rather than one-off high-stakes deadlines. You are expected to submit work on time, but the policies below are meant to support learning rather than penalize recovery.

If you fall behind, the correct move is to submit incomplete or imperfect work rather than nothing at all. The goal is to keep you engaged with the material.

Use of AI and External Tools

AI tools are part of modern data science practice, and their thoughtful use is allowed and encouraged in this course.

If you are unsure whether a particular use of AI is appropriate for an assignment, ask. Asking first is always the right call and will never count against you.

Course Flexibility and Adjustments

This syllabus reflects the intended structure of the course, but distributed systems is a fast-moving field and class pacing varies by cohort, so the schedule may be adjusted during the term.

Your responsibility is to stay engaged with Canvas announcements and in-class guidance.

Attendance and Participation

This course is built around in-class labs, walkthroughs, and design discussion. Attendance and active participation are expected and directly support your ability to complete graded work.

If you must miss class, you are responsible for catching up on material and announcements. Reach out early if attendance becomes an ongoing issue.

Required and Optional Resources

There is no required textbook for this course. This is intentional and based on prior student feedback.

Instead, we will rely on a mix of instructor material, official documentation, and selected online references. Any required readings or videos will be clearly linked in Canvas.

Frequently Useful References

You are not expected to read all of these end-to-end. They are reference material to support labs, projects, and independent exploration.