Chapter 14 Logistic Regression

Introduction

The simple and multiple linear regression methods we studied in Chapters 10 and 11 are used to model the relationship between a quantitative response variable and one or more explanatory variables. In this chapter, we describe similar methods that are used when the response variable is a categorical variable with two possible values.

The methods detailed in this chapter will help us answer questions such as

In general, we call the two outcomes of the response variable “success” and “failure” and represent them by 1 (for a success) and 0 (for a failure). The mean of the response variable is then the proportion of 1s, that is, p = P(success), the probability of a success.
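Here is a minimal sketch in Python (the outcomes listed are hypothetical, not from any example in this chapter) showing how a success/failure response is coded as 1s and 0s and how the mean of the coded values is the sample proportion that estimates p:

```python
# Hypothetical outcomes for a binary response variable.
outcomes = ["success", "failure", "success", "success", "failure"]

# Represent each success by 1 and each failure by 0.
y = [1 if outcome == "success" else 0 for outcome in outcomes]

# The mean of the 0/1 values is the proportion of successes,
# which estimates p = P(success).
p_hat = sum(y) / len(y)
print(p_hat)  # 0.6 for this hypothetical sample
```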

If our data consist of n independent observations, each with the same probability of success p, we have the binomial setting, and we can use the inference methods of Chapter 8. What is new in this chapter is that the data now include at least one explanatory variable x and that the probability p depends on the value of x.
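In symbols (a sketch of the setup only; the form of the dependence of p on x is developed later in the chapter), each observation is a binomial count based on a single trial whose success probability is determined by its own value of the explanatory variable:

```latex
y_i \sim B\!\left(1,\; p(x_i)\right), \qquad
p(x_i) = P(\text{success} \mid x = x_i), \qquad
i = 1, \dots, n
```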

For example, suppose that we are studying whether a student applicant receives (y=1) or is denied (y=0) financial aid. We think that p, the probability that an applicant receives aid, may be related to the explanatory variables (a) the financial support of the parents, (b) the income and savings of the applicant, and (c) whether the applicant has received financial aid before. Just as in multiple linear regression, the explanatory variables can be either categorical or quantitative. Because it is now a probability rather than a mean that depends on explanatory variables, we need inference methods that ensure 0 ≤ p ≤ 1. Logistic regression is a statistical method for describing these kinds of relationships.1
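As a hedged illustration (not part of the financial aid study described above), the following Python sketch simulates data with a single hypothetical explanatory variable, applicant income, and fits a logistic regression using the statsmodels package. The variable names and coefficient values are made up, but the sketch shows the key point: the fitted probabilities necessarily fall between 0 and 1.

```python
# A minimal sketch, assuming the statsmodels package is available; the data
# below are simulated for illustration and are not from the financial aid study.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

# Hypothetical explanatory variable: applicant income (in $1000s).
income = rng.uniform(10, 100, size=n)

# Simulate aid decisions (1 = receives aid, 0 = denied) so that the
# probability of receiving aid decreases as income increases.
true_p = 1 / (1 + np.exp(-(4.0 - 0.08 * income)))
aid = rng.binomial(1, true_p)

# Fit the logistic regression of aid on income.
X = sm.add_constant(income)          # adds the intercept column
result = sm.Logit(aid, X).fit(disp=False)
print(result.params)                 # estimated intercept and slope

# The predicted probabilities all lie between 0 and 1.
p_hat = result.predict(X)
print(p_hat.min(), p_hat.max())
```

By contrast, an ordinary least-squares fit of the 0/1 response on income could produce fitted values below 0 or above 1, which is one reason we need a different model for a probability.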