Introduction to Data Science (Spring 2021)

MBA course, Harvard Business School, 2021

Teaching Fellow

Over the past decade, many firms have invested heavily in business infrastructure to collect, store, and analyze data effectively. A wide range of roles across finance, marketing, human resources, operations, innovation, and strategy now rely heavily on data to inform and implement critical decisions. Indeed, many firms strategically differentiate themselves by their ability to translate their vast amounts of data into meaningful insights that help them gain an edge over their competitors.

The Introduction to Data Science (IDS) course provides students with the necessary foundations to effectively derive and evaluate data-driven insights to inform managerial decisions. In this hands-on course, students will learn how to view and solve business problems from a data perspective. The course will introduce fundamental principles that will enable MBA graduates to understand the opportunities and limitations of analytics, develop a solid grasp of two widely used software tools (Tableau and R), and build a robust data analytics mindset.

Importantly, students will have opportunities to see how data science is used across a broad range of business environments. The course will focus on managers’ roles in data science projects, including hypothesis generation and testing, model design, interpretation of results, and the formulation of actionable recommendations.

Data science is an interdisciplinary field that combines principles from statistics and computer science with substantive domain knowledge to extract useful insights from data. The tools, technologies, and methodologies employed in data science are numerous; however, they broadly fall into four categories that mimic the typical process flow of a data science project and comprise the course’s four modules.

Module 1: Description

The first step in analyzing data is to understand it by visualizing it and computing basic descriptive summary statistics (e.g., average, standard deviation, maximum, and minimum). We begin the course by introducing students to Tableau, a powerful and widely used enterprise data visualization tool. Often, visualizing data is enough to answer basic descriptive questions (such as, which types of customers are buying different products?), devise more complex hypotheses about various relationships (such as, what types of innovations might increase sales?), and identify irregularities (such as mistakes in the data collection or outlier data).

This module will also introduce R, a statistical programming language widely used by data scientists, to compute descriptive statistics and perform simple visualizations. Descriptive statistics of key business metrics, such as sales, revenue, and customer churn, are aggregations of data that should form the information backbone of every enterprise.
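As a rough illustration of the descriptive statistics mentioned in this module, the sketch below computes the average, standard deviation, minimum, and maximum of a small data set. The course itself uses Tableau and R; Python is used here only for illustration, and the `daily_sales` figures and the `describe` helper are hypothetical names and values invented for this example.

```python
import statistics

# Hypothetical daily sales figures (units sold), invented for illustration.
daily_sales = [120, 135, 128, 150, 142, 90, 160]


def describe(values):
    """Return the basic descriptive statistics named in the module text."""
    return {
        "average": statistics.mean(values),
        # statistics.stdev computes the sample (n - 1) standard deviation.
        "standard_deviation": statistics.stdev(values),
        "minimum": min(values),
        "maximum": max(values),
    }


summary = describe(daily_sales)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In a summary like this, the unusually low value of 90 stands out against the other days, which is the kind of irregularity that a visualization or a quick look at the minimum can surface.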

Module 2: Statistical Inference

Most companies collect data that represent just a fraction of the entire population of interest (for example, the students taking the IDS course are a sample of the HBS MBA population; customers who have purchased a coffee at Starbucks are a sample of the coffee-drinking population; a company’s customers in January are a sample of all of its customers throughout the year). Statistical inference, one of the fundamental pillars of data science, is the practice of using a sample to learn something (i.e., draw inferences) about the full population. Through hypothesis tests and confidence intervals, we will determine if the differences we observe in summary statistics across different groups in the sample (e.g., the two sets of customers in an A/B test) are due to random fluctuations or systematic differences in the population. We will also use linear regression to determine whether relationships seen in the sample hold more generally in the population.
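To make the idea of a confidence interval concrete, here is a minimal sketch that estimates the difference in average order value between two groups of customers and attaches a 95% interval to it. It assumes a normal approximation, which is reasonable for large samples; for small samples such as the toy data here, a t-distribution would be more appropriate. The data, the function name `diff_in_means_ci`, and the use of Python rather than the course's R are all assumptions made for illustration.

```python
import math
import statistics
from statistics import NormalDist


def diff_in_means_ci(group_a, group_b, confidence=0.95):
    """Normal-approximation confidence interval for mean(group_b) - mean(group_a)."""
    diff = statistics.mean(group_b) - statistics.mean(group_a)
    # Standard error of the difference, built from each group's sample variance.
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    # Critical value of the standard normal distribution (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff + z * se


# Hypothetical order values (in dollars) for two sets of customers.
group_a = [20, 22, 19, 24, 21, 23, 20, 22]
group_b = [25, 27, 24, 28, 26, 29, 25, 27]

low, high = diff_in_means_ci(group_a, group_b)
print(f"95% CI for the difference in means: ({low:.2f}, {high:.2f})")
```

If the interval excludes zero, the observed difference is unlikely to be explained by random fluctuation alone; if it includes zero, the sample does not provide clear evidence of a systematic difference in the population.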

Module 3: Prediction

Prediction is a process that uses historical data to forecast future events (for example, using January’s sales data to determine which customers are likely to return in February). The growth in predictive models is in large part responsible for the increased uptake of data science across so many industry sectors. Today, predictive models affect most aspects of our everyday life, from Netflix’s recommendation algorithm to Google’s search engine to food placement on grocery store shelves. Although these companies use extremely advanced algorithms, we will focus on two fundamental methods: linear regression and logistic regression. Mastering these methods will allow students to develop forecasts that apply across various business problems and industries.
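The sketch below shows, under simplifying assumptions, how linear regression turns historical data into a forecast: it fits a straight line to past observations by ordinary least squares and then evaluates that line at a new input. It uses a single predictor and plain Python rather than the course's R; the data and the function names `fit_simple_linear_regression` and `predict` are hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x (one predictor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x_new):
    """Forecast the outcome for a new value of the predictor."""
    return intercept + slope * x_new


# Hypothetical history: monthly ad spend (in $1,000s) and units sold.
ad_spend = [1, 2, 3, 4, 5]
units_sold = [12, 14, 17, 19, 22]

intercept, slope = fit_simple_linear_regression(ad_spend, units_sold)
forecast = predict(intercept, slope, 6)
print(f"Forecast units sold at $6,000 ad spend: {forecast:.1f}")
```

Logistic regression follows the same fit-then-predict pattern but models the probability of a yes/no outcome, such as whether a customer returns next month.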

Module 4: Causal Inference

Our final module focuses on studying how actions, interventions, or treatments (e.g., launching a new predictive algorithm, changing the color of a button on a user interface, or increasing the price of a product) impact business metrics (e.g., engagement, click-through rate, or daily units sold). In this module, students will learn the primary method for establishing causal relationships: randomized experiments (e.g., A/B tests). This simple idea underlies the scientific method and has revolutionized how managers make decisions, as it allows them to almost instantaneously discern the preferences of their customers, evaluate their firm’s initiatives, and ultimately test their hypotheses. Experimentation is now an integral part of many companies’ product development cycles, allowing managers to continuously challenge their working hypotheses and make pivots that ultimately lead to better innovations. Experimentation is often the final step of a data science project, and it is also where we conclude the IDS course.
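A randomized experiment of the kind described in this module can be sketched in two steps: randomly assign users to a control or treatment group, then compare the outcome metric across the groups. The example below is a minimal illustration, assuming a two-proportion z-test on click-through rates; the click counts and the function names `assign_groups` and `ab_test` are hypothetical, and Python stands in for the course's R.

```python
import math
import random
from statistics import NormalDist


def assign_groups(user_ids, seed=0):
    """Randomly split users into control and treatment groups of (nearly) equal size."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(user_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def ab_test(clicks_a, n_a, clicks_b, n_b):
    """Two-proportion z-test comparing click-through rates of groups A and B."""
    rate_a = clicks_a / n_a
    rate_b = clicks_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return rate_b - rate_a, z, p_value


control, treatment = assign_groups(range(10))

# Hypothetical results: 200 of 5,000 control users clicked, 260 of 5,000 treated users clicked.
lift, z, p_value = ab_test(200, 5000, 260, 5000)
print(f"Lift: {lift:.3f}, z = {z:.2f}, p-value = {p_value:.4f}")
```

Because assignment is random, the two groups should be comparable on average, so a difference in click-through rate that is too large to attribute to chance can be interpreted as the effect of the treatment.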