Predicting an Individual Retention Rate Using Statistical Analysis

Hyun-Joo Kim

 

 

Table of Contents

Abstract

1.  Introduction

2. Data and Analysis

2.1  Data

2.2  Analysis

2.3  Prediction and Application

2.4  Model Validation

3.  Conclusion

4.  Future Study

Appendix

References

 

Abstract

 

Retention rate is a common measure of student success at universities.  Retention matters because it reflects how well a university and its community serve students, and it plays an important role in the quality of education.  Truman State University also hopes to increase its retention rate.

 

Students choose to leave Truman for various reasons.  Over the last several years, many discussions have sought to identify why students leave, and much effort has gone into helping students who are considering leaving Truman.  This project focuses on statistically analyzing the relationship between an individual student’s retention and the possible causes raised in those discussions, including financial, academic, and social aspects.

 

This project aims to predict the retention of an individual student by estimating the probability that a freshman is retained.  This will help us find appropriate ways and times to help students who are at risk of leaving.  The project statistically analyzes freshman data to find out which factors influence students’ decisions about staying, and it may eventually help increase Truman’s overall retention rate.

 

The correlation between retention and each factor will be studied.  A statistical model will then be developed from the relationships between retention and the possible reasons students leave.  Logistic regression, a very common method for analyzing a binary response variable with multiple covariates, will be used to analyze the data and build the models.  An optimal set of factors that may determine retention will be identified, and the resulting model will be used to predict whether an individual student is highly likely to leave Truman.


1. Introduction 

 

Retention rate is a common measure of student success at universities.  Retention matters because it reflects how well a university and its community serve students, and it plays an important role in the quality of education. Over the last few decades, many studies have related retention to factors including gender (Spady (1971), Pascarella, Duby, and Iverson (1983), Stage and Hossler (1989)), race (Braxton, Duster, and Pascarella (1988), Stage and Hossler (1989)), the parents’ educational background (Pascarella and Terenzini (1978), Stage (1988)), a student’s educational aspiration (Bean (1982), Metzner and Bean (1987)), college GPA (Bean (1982, 1983), Cabrera et al. (1992, 1993)), the effect of financial aid (DesJardins et al. (1999), Hochstein and Butler (1983)), and many others. Truman State University is no exception. Over the last several years, there have been many studies of student retention at Truman, and efforts to improve the current retention rate are ongoing (Ishiyama, 2000).

 

In this project, the relationship between an individual’s retention and the possible causes discussed in some of these internal and external studies is analyzed statistically, in the hope that this will provide a better understanding of incoming students and their needs and will ultimately increase the overall retention rate by making appropriate help possible. The underlying idea is that if we can identify early the students who are more likely to leave Truman, and also know the reasons for their struggles, we will be able to give them more appropriate, more efficient, and more timely help. Early identification is especially important because many students decide to leave Truman in the first couple of weeks of their freshman year. If we can act sooner, we will be able to prevent students from leaving early. Statistical analysis of freshman data and retention can provide such early identification.

 

Many previous studies have dealt with the relationship between retention and individual factors. This project examines the relationship between retention and the potential factors collectively, so that interactions and correlations between factors can also be taken into account. The Pearson correlation between the retention variable and each individual factor is studied. An optimal set of factors that determine an individual’s retention is found using a logistic regression model. The model is then used to predict whether an individual student is highly likely to leave Truman, and the prediction is compared with the actual retention. The final model is assessed using model validation procedures, including cross-tabulation and the ROC curve.

 

2. Data and Analysis

 

2.1. Data

Many factors might affect a student’s decision to stay at or leave Truman. Some students who are far from their hometown might have a hard time adjusting to the university. Some might find the courses too difficult. Some might lose their scholarship and find themselves in financial jeopardy. Some might have a hard time making personal connections at the beginning of their college life and decide to try other places. In this project, many possible factors are considered and analyzed together with retention, including financial aspects (family income, financial aid, hours spent working), academic aspects (expectation of changing major, high school GPA, ACT composition scores, hours spent studying, and pursuit of an advanced degree), and social aspects (distance from home, ethnicity, gender, hours spent socializing, hours spent exercising, and whether or not a student is a first-generation college student). Table 1 summarizes the possible covariates (factors).

 

Table 1.  Possible factors on the retention

Financial factor          Academic factor            Social factor
Family income             Expect to change major     Hometown (distance from Kirksville)
Financial aid             High school GPA            Ethnicity
How many hours working    ACT (composition)          Sex
                          How many hours studying    How many hours socializing
                          Pursue advanced degree     How many hours exercising
                                                     First generation

 

Note that the financial aid variable is the number of financial aid sources rather than the amount the student received. The expect-to-change-major variable takes the value 1 for no chance, 2 for little chance, 3 for some chance, and 4 for a good chance of changing major. Pursuing an advanced degree is a categorical variable equal to 0 if a student is pursuing at most a bachelor’s degree and 1 if the student is pursuing a degree beyond the bachelor’s. The race (ethnicity) variable is categorized as White, African American, American Indian, Asian, or other. Gender is 0 for male and 1 for female. First generation is 1 for a first-generation college student and 0 otherwise. Note that the covariates are a mixture of continuous, ordinal, and categorical variables.

Two-year retention is the response variable. If a student is still at Truman at the beginning of her/his junior year, the response variable (retention) is 1; otherwise it is 0. Two-year retention is important because students often make their decision within the first two years rather than later, in their junior or senior year.

 

CIRP (Cooperative Institutional Research Program, freshman survey) and CSEQ (College Student Experiences Questionnaire) data were organized by Dr. John Ishiyama (TSU, Political Science) and his students (Ishiyama, 2000). The analysis is based on these data. The freshman data from 1996 and 1997 (with two-year retention measured at the beginning of 1998 and 1999, respectively) are used to develop a statistical model.

 

2.2. Analysis

Initially, the correlation between the response variable and each factor is studied. Among the factors, family income (.071), hours working for pay (-.062), change of major (.068), high school GPA (.131), ACT (.101), study time (.079), and first generation (-.087) show the strongest Pearson correlations with retention, although all of these coefficients are small in absolute terms. Pursuit of highest degree, race, and socializing hours show weaker correlations, and financial aid, gender, and exercising hours show the weakest.
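As an illustration of this first step, the following minimal sketch computes a Pearson correlation between a binary retention indicator and one factor. The function name pearson_r and the toy values are assumptions made for this example only; they are not the study's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data (not from the study): 1 = retained, 0 = left.
retained = [1, 1, 0, 1, 0, 1, 1, 0]
hs_gpa = [3.8, 3.5, 3.0, 3.9, 3.4, 3.2, 3.7, 3.1]

r = pearson_r(hs_gpa, retained)
print(f"r = {r:.3f}")
```

With a 0/1 response, this quantity is the point-biserial correlation, so a positive value means that retained students tend to have higher values of the factor.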

 

A logistic regression model is developed to study the relationship between a binary response variable and the possible covariates. This is an appropriate statistical method for the current data, since two-year retention is a binary variable (0 or 1) and we are looking for an explanation of the relationship between retention and many covariates. Various model selection procedures are run, including backward stepwise and forward stepwise regression. These are very common procedures that either eliminate unnecessary variables from the model or add necessary variables to it. Various significance levels are also used. The following table summarizes the variables chosen by some of the model selection procedures and significance levels.
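To make the idea of fitting a logistic regression concrete, here is a minimal sketch with a single covariate, fitted by plain gradient ascent on the log-likelihood. This is not the procedure used in this project (a statistical package would normally do the fitting); the function names, the learning rate, and the simulated data are all assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    """Inverse of the logit: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, n_iter=1000):
    """Fit ln(p/(1-p)) = b0 + b1*x by gradient ascent on the mean log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iter):
        # Residuals y - p are the per-observation gradient of the log-likelihood.
        residuals = [y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys)]
        b0 += lr * sum(residuals) / n
        b1 += lr * sum(r * x for r, x in zip(residuals, xs)) / n
    return b0, b1

# Simulated data (purely illustrative, not the Truman data).
random.seed(0)
true_b0, true_b1 = 0.5, 1.0
xs = [random.gauss(0.0, 1.0) for _ in range(400)]
ys = [1 if random.random() < sigmoid(true_b0 + true_b1 * x) else 0 for x in xs]

b0_hat, b1_hat = fit_logistic(xs, ys)
print(f"b0 = {b0_hat:.3f}, b1 = {b1_hat:.3f}")
```

A positive fitted slope means that the estimated probability of the outcome increases with the covariate, which is how the coefficient signs of a fitted model are read.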


 

Table 2. Model selection by stepwise selection procedure

Model selection procedure     Chosen variables
Backward stepwise (α = .05)   Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree, sex, first generation
Backward stepwise (α = .1)    Income, change of major, high school GPA, ACT composition scores, hour spent studying
Forward stepwise (α = .05)    Income, change of major, high school GPA, ACT composition scores, hour spent studying
Forward stepwise (α = .1)     Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree, sex, first generation

 

An optimal set of factors is not clear in this case, since different procedures select different sets. In particular, pursuit of highest degree, sex, and first generation may or may not be in the model depending on which model selection procedure is used. In fact, these stepwise procedures are known to be less reliable than more sophisticated model selection tools, and many other model selection criteria have been studied and proposed for more reliable selection.

 

AIC (Akaike Information Criterion; Akaike, 1973) and AICc (corrected AIC; Hurvich & Tsai, 1989) are two of the best-known model selection criteria that can be applied to logistic regression models. Both estimate the discrepancy between the unknown true model and the current candidate model, so a smaller value implies that the candidate model is closer to the true model and is therefore the better model. AIC and AICc are computed for the candidate models, and the results are provided in Table 3.
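For reference, the commonly quoted forms of the two criteria are AIC = -2 ln L + 2k and AICc = AIC + 2k(k+1)/(n-k-1), where ln L is the maximized log-likelihood, k is the number of estimated parameters, and n is the sample size. The sketch below evaluates them. The input values are hypothetical, and it is an assumption that the figures in Table 3 were computed with exactly this parameter count.

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: -2 ln L + 2k."""
    return -2.0 * log_likelihood + 2.0 * k

def aicc(log_likelihood, k, n):
    """Corrected AIC: AIC + 2k(k+1)/(n-k-1)."""
    return aic(log_likelihood, k) + (2.0 * k * (k + 1)) / (n - k - 1)

# Hypothetical inputs (not taken from the study).
log_lik = -740.0   # maximized log-likelihood of a candidate model
k = 10             # e.g. nine covariates plus an intercept
n = 2000           # number of students used in the fit

print(f"AIC  = {aic(log_lik, k):.3f}")
print(f"AICc = {aicc(log_lik, k, n):.3f}")
```

The correction term shrinks as n grows relative to k, which is why AIC and AICc are nearly identical when the sample is large.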

 

Note that between five and ten variables are considered for the candidate models. Five variables (income, expect to change major, high school GPA, ACT composition score, and hours spent studying) are almost certainly important, and all the candidate models include them. However, it is less obvious whether some or all of the other five variables (the five with the next smallest p values: pursuit of highest degree, sex, first generation, financial aid, and Asian) should be included in the model. Table 3 summarizes the model selection results for these candidate models. Model 1 includes only the five essential variables. Models 2 and 3 are the best two models among those with six covariates (the five essential variables and one of the five uncertain variables). Models 4 and 5 are the best two models among those with seven covariates (the five essential variables and two of the five uncertain variables), and so on.

 

Table 3. Model selection from AIC, AICc method

Model   AIC        AICc       Variables
1       2273.135   2273.172   Income, change of major, high school GPA, ACT composition scores, hour spent studying
2       1991.843   1991.893   Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree
3       1703.089   1703.139   Income, change of major, high school GPA, ACT composition scores, hour spent studying, financial aid
4       1491.680   1491.744   Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree, financial aid
5       1699.966   1700.030   Income, change of major, high school GPA, ACT composition scores, hour spent studying, first generation, financial aid
6       1490.593   1490.673   Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree, sex, financial aid
7       1700.342   1700.422   Income, change of major, high school GPA, ACT composition scores, hour spent studying, sex, first generation, financial aid
8       1488.883   1488.981   Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree, sex, first generation, financial aid
9       1490.036   1490.154   Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree, sex, first generation, financial aid, Asian

 

Note that the model with income, change of major, high school GPA, ACT composition score, hours spent studying, pursuit of highest degree, sex, first generation, and financial aid (model 8) has the smallest AIC and AICc. By these criteria, model 8 is the closest to the unknown true model among the candidates. The stepwise selection procedures give slightly different results from AIC and AICc, and both differ from the Pearson correlation study. Statistically, the AIC and AICc results are considered more reliable than the others, so model 8 in Table 3 is used as the final model. (Since models 4 and 6 in Table 3 have AIC and AICc values similar to those of model 8 and have fewer variables, one might obtain similar prediction results using model 4 or 6.) The final model includes the following factors.

 

Table 4. Variables included in the final model

Financial factor    Academic factor            Social factor
Family income       Expect to change major     Sex
Financial Aid       High school GPA            First generation
                    ACT (sub scores)
                    How many hours studying
                    Pursue advanced degree

 

The general form of the logistic regression model is

ln( p(x) / (1 - p(x)) ) = b0 + b1 X1 + b2 X2 + … + bk Xk,

where p(x) is the probability that a student is retained at Truman after 2 years.  From Table 5 (model coefficients), the final model equation can be written as

ln( p(x) / (1 - p(x)) ) = -3.295 + .054 Income + .252 Change of major + .234 High school GPA
        + .035 ACT composition + .138 Hours spent studying + .023 Financial aid
        + .264 Pursuing highest degree - .227 Sex - .275 First generation.

Note that a few variables have p values larger than the usual significance level of 0.05. However, because AIC and AICc chose this model as the closest to the true model, these variables are kept in the final model.

 

  

Table 5. Model coefficients

Variable    B        S.E.    Wald     Df   Sig.   Exp(B)
INCOME      .054     .028    3.608    1    .057   1.055
CHANGEM     .252     .071    12.476   1    .000   1.286
HSGPA       .234     .063    13.754   1    .000   1.264
ACTCOMP     .035     .023    2.226    1    .136   1.035
STUDING     .138     .047    8.733    1    .003   1.148
AIDGRANT    .023     .025    .849     1    .357   1.023
HIGHEST     .264     .168    2.471    1    .116   1.302
SEX         -.227    .140    2.616    1    .106   .797
FIRSTGEN    -.275    .142    3.747    1    .053   .760
Constant    -3.295   .732    20.266   1    .000   .037
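The columns of Table 5 are related in a simple way: Exp(B) is the exponential of B, that is, the multiplicative change in the odds of retention for a one-unit increase in the variable, and the Wald statistic with one degree of freedom is (B / S.E.) squared. The sketch below recomputes both from the rounded B and S.E. entries of Table 5. The dictionary name is an assumption, and because the inputs are rounded to three decimals, the recomputed Wald values only approximate the printed ones.

```python
import math

# (B, S.E.) pairs copied from Table 5; the dictionary name is hypothetical.
table5 = {
    "INCOME": (0.054, 0.028),
    "CHANGEM": (0.252, 0.071),
    "HSGPA": (0.234, 0.063),
    "ACTCOMP": (0.035, 0.023),
    "STUDING": (0.138, 0.047),
    "AIDGRANT": (0.023, 0.025),
    "HIGHEST": (0.264, 0.168),
    "SEX": (-0.227, 0.140),
    "FIRSTGEN": (-0.275, 0.142),
}

odds_ratios = {}
wald_stats = {}
for name, (b, se) in table5.items():
    odds_ratios[name] = math.exp(b)    # Exp(B): odds ratio per unit increase
    wald_stats[name] = (b / se) ** 2   # Wald chi-square statistic, 1 df
    print(f"{name:9s} Exp(B) = {odds_ratios[name]:.3f}  Wald = {wald_stats[name]:.3f}")
```

An odds ratio above 1 (for example, for HSGPA) indicates higher odds of retention as the variable increases, while a value below 1 (for example, for FIRSTGEN) indicates lower odds.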

 

 

From the previous logistic regression equation, several conclusions can be drawn.

 

2.3. Prediction and application

Using the logistic regression model developed in the previous section, we can predict an individual’s retention probability from the chosen covariates (factors).  Once a student with a low predicted retention probability is identified, both academic and RCP advisors will be able to give the student appropriate advice.  This may influence the student positively and encourage him or her to stay at Truman. Some prediction examples are given in Table 6.
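The prediction step amounts to evaluating the final model equation and inverting the logit, p(x) = 1 / (1 + exp(-(b0 + b1 X1 + … + bk Xk))). The following minimal sketch does this with the rounded coefficients of the final model. The dictionary keys and function name are assumptions, and the covariate values are those listed for student 1 in Table 6, used exactly as coded there.

```python
import math

# Rounded coefficients of the final model; the key names are hypothetical.
INTERCEPT = -3.295
COEFFICIENTS = {
    "income": 0.054,
    "change_major": 0.252,
    "hs_gpa": 0.234,
    "act_comp": 0.035,
    "study_hours": 0.138,
    "financial_aid": 0.023,
    "highest_degree": 0.264,
    "sex": -0.227,
    "first_generation": -0.275,
}

def retention_probability(student):
    """Predicted probability of two-year retention for one student."""
    logit = INTERCEPT + sum(
        COEFFICIENTS[name] * student[name] for name in COEFFICIENTS
    )
    return 1.0 / (1.0 + math.exp(-logit))

# Covariate values of student 1 in Table 6, as coded in that table.
student_1 = {
    "income": 6, "change_major": 3, "hs_gpa": 7, "act_comp": 24,
    "study_hours": 3, "financial_aid": 16, "highest_degree": 1,
    "sex": 2, "first_generation": 0,
}

p = retention_probability(student_1)
print(f"predicted retention probability: {p:.3f}")
```

For this student the result is approximately .70, in line with the retention probability reported in Table 6. Small differences for other rows are to be expected because the coefficients are rounded to three decimals.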

 

 

Table 6.  Individual actual and predicted retention rate

No.   Actual      Retention     Income   Change   HS GPA   ACT    Study   Financial   Degree   Sex   First
      Retention   probability            major             comp   hour    Aid         pursue         generation
1     1           .702          6        3        7        24     3       16          1        2     0
2     1           .832          11       4        7        31     3       15          1        2     0
3     0           .699          8        2        6        34     1       19          1        1     0
4     0           .759          11       2        8        31     4       13          1        2     1

5

1

.704

7

3

8

23

4

13