DS 3000: Foundations of Data Science

DS 5110: Data Management and Processing

John Rachlin

When Lecture: MTWR 3:20p-5:00p ET (Live/Online)
E-mail j.rachlin@northeastern.edu
Web Website
Office Hours Friday 2-6pm ET/Boston on Zoom and by appointment only.
Reserve 20 minutes for one-on-one help

Teaching Assistants

Please see Piazza for any scheduling updates and connection information. All times are ET/Boston.
Name Office Hours Zoom
Mahek Aggarwal Mon 8p-10p, Wed 8a-9a zoom
Colbe Chang Tue/Wed 5:30p-7:30p zoom
Sofie Cook TBD zoom
Ashiwn Kasargode Mon/Tue 10:15a-12:15p zoom
Andre Kirby Tue/Wed 1p-3p zoom
Emily Liu Mon/Tue 11a-1p zoom
Khushi Morparia TBD zoom
David Pogrebitskiy Mon/Wed 12n-2p zoom
Rishita Shroff Mon/Wed 11a-1p zoom
Abigail Sodergren Thu 10:30-12:30 zoom
Sri Sreepada TBD zoom
Prerna Chander Mon/Wed 9:30a-11:30a zoom
Vinesh Kumar Gande Thu/Fri 9:30a-11:30a zoom
Franc O Tue 11a-2p zoom

About the course

Catalog Description

DS3000: Introduces core modern data science technologies and methods that provide a foundation for subsequent Data Science classes. Covers: working with tensors and applied linear algebra in standard numerical computing libraries (e.g., NumPy); processing and integrating data from a variety of structured and unstructured sources; introductory concepts in probability, statistics, and machine learning; basic data visualization techniques; and now-standard data science tools such as Jupyter notebooks.

DS5110: Introduces students to the core tasks in data science, including data collection, storage, tidying, transformation, processing, management, and modeling for the purpose of extracting knowledge from raw observations. Programming is a cross-cutting aspect of the course. Offers students an opportunity to gain experience with data science tasks and tools through short assignments. Includes a term project based on real-world data.

4.000 Credit Hours
Prerequisites: Assumes an intermediate-level understanding of Python programming (DS5010 or DS2500 or the equivalent).

My vision for these two courses

This semester, at my suggestion, DS3000 (Undergraduate) and DS5110 (Graduate) are being cross-listed. This means that the content of these two courses is being merged. Both sections will attend the live lectures MTWR 3:20p-5:00p ET/Boston. Why is this being done? As the co-director of the Master in Data Science program here at Northeastern, one of my functions is to help define and refine the curriculum of our MS/DS core courses and elective offerings. I very much view both DS5110 and DS3000 as bringing together mathematical and programming foundations to explore key concepts, algorithms, models, and applications in Data Science. From here you will continue your studies in Data Science to explore Machine Learning, Natural Language Processing, Database Systems, Visualization, and other topics relevant to Data Science. A course in Data Science Foundations sets the stage for these subsequent topics. My goal for this course is to explore a number of interesting topics in Data Science without overlapping too much with material that you have already covered, or will cover in the future. In addition, you will expand your programming skills for data manipulation, visualization, analysis, and algorithmic development.

This class is not recorded

Other than the second week of class, this class is not recorded. I strongly recommend that you come to every live session with your camera on and your microphone muted. In the event you need to miss class, I will post lecture notes and coding examples that evening. (Materials should be available by 8pm.) I very much welcome active discussion in my classrooms. To ask a question, feel free to virtually raise your hand or post your question in the chat.

In week 2 (May 15th-18th), I will be in Boston for faculty meetings. Pre-recorded lectures will be made available and we will temporarily run the class asynchronously. In week 3 and for the rest of the semester, we will return to live / synchronous lectures on Zoom.

Readings

There is no single textbook that I know of on Data Science Foundations. The subject is still very new and constantly evolving. Below I list a number of books that I have found interesting and well-written. Readings will be assigned from selected chapters. Most of these books are available through O'Reilly E-Books. As Northeastern students, you have free access!

Title Deitel and Deitel (2019): Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud, 1ed. (Pearson)
Buy online Amazon.com
Digital (free)
Description The Deitels’ Introduction to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and the Cloud offers a unique approach to teaching introductory Python programming, appropriate for both computer-science and data-science audiences. Providing the most current coverage of topics and applications, the book is paired with extensive traditional supplements as well as Jupyter Notebooks supplements. Real-world datasets and artificial-intelligence technologies allow students to work on projects making a difference in business, industry, government and academia. Hundreds of examples, exercises, projects (EEPs), and implementation case studies give students an engaging, challenging and entertaining introduction to Python programming and hands-on data science.
Title Provost and Fawcett (2013): Data Science for Business. (O'Reilly)
Buy online Amazon.com
Digital (free)
Description Based on an MBA course Provost has taught at New York University over the past ten years, Data Science for Business provides examples of real-world business problems to illustrate these principles. You’ll not only learn how to improve communication between business stakeholders and data scientists, but also how to participate intelligently in your company’s data science projects. You’ll also discover how to think data-analytically, and fully appreciate how data science methods can support business decision-making.
Title Page S. (2021): The Model Thinker: What you need to know to make data work for you. (Basic Books)
Buy online Amazon.com
Description In The Model Thinker, social scientist Scott E. Page shows us the mathematical, statistical, and computational models—from linear regression to random walks and far beyond—that can turn anyone into a genius. At the core of the book is Page's "many-model paradigm," which shows the reader how to apply multiple models to organize the data, leading to wiser choices, more accurate predictions, and more robust designs.
Title Fuhrer C., Verdier O., and Solem J. (2021): Scientific Computing with Python (2ed)
Buy online Amazon.com
Digital (free)
Description This book will help you to explore new Python syntax features and create different models using scientific computing principles. The book presents Python alongside mathematical applications and demonstrates how to apply Python concepts in computing with the help of examples involving Python 3.8. You'll use pandas for basic data analysis to understand the modern needs of scientific computing, and cover data module improvements and built-in features. You'll also explore numerical computation modules such as NumPy and SciPy, which enable fast access to highly efficient numerical algorithms. By learning to use the plotting module Matplotlib, you will be able to represent your computational results in talks and publications. A special chapter is devoted to SymPy, a tool for bridging symbolic and numerical computations.