Predicting an Individual Retention Rate Using Statistical Analysis
Hyun-Joo Kim
Retention rate is a measure of student success in many universities. Retention is important because it implies how well a university and its community serve students and it plays an important role in the quality of the education. Truman State University also hopes to increase the retention rate.
Students choose to leave Truman for various reasons. Over the last several years, many discussions have taken place to identify why students leave, and many efforts have been made to help students who are considering leaving Truman. This project statistically analyzes the relationship between an individual's retention and the possible causes raised on many occasions, including financial, academic, and social aspects.
The goal of this project is to predict the retention of an individual student by estimating the probability that an individual freshman is retained. This will help us find appropriate ways and times to help these students. The project statistically analyzes freshman data to find out which factors influence students' retention decisions, and it may eventually help increase Truman's overall retention rate.
The correlation between retention and each factor will be studied. A statistical model will then be developed from the relationships between retention and the possible reasons students leave. Logistic regression, a very common method for analyzing a binary response variable with various covariates, will be used to analyze the data and build the statistical models. An optimal set of factors that may determine retention will be found, and the developed model will be used to predict whether an individual student is highly likely to leave Truman.
Retention rate is a measure of student success in many universities. Retention is important because it implies how well a university and its community serve students, and it has an important role in the quality of the education. For the last few decades, there have been many studies of retention involving many factors, including gender (Spady (1971), Pascarella, Duby, and Iverson (1983), Stage and Hossler (1989)), race (Braxton, Duster, and Pascarella (1988), Stage and Hossler (1989)), the parents' educational background (Pascarella and Terenzini (1978), Stage (1988)), a student's educational aspiration (Bean (1982), Metzner and Bean (1987)), college GPA (Bean (1982, 1983), Cabrera et al. (1992, 1993)), the effect of financial aid (DesJardins et al. (1999), Hochstein and Butler (1983)), and many others. Truman State University is no exception. For the last several years, there have been many studies about student retention, and we are trying to improve the current retention at Truman (Ishiyama, 2000).
In this project, the relationship between an individual's retention and the possible causes discussed in some of these internal and external studies is statistically analyzed, in the hope that this will provide a better understanding of incoming students and their needs and ultimately increase the overall retention rate through appropriate help. The approach is based on the idea that if we can identify early the students who are more likely to leave Truman, and also know the reasons for their struggles, we will be able to give them more appropriate, more efficient, and more timely help. Early identification is especially important because many students decide to leave Truman in the first couple of weeks of their freshman year. If we can act sooner, we will be able to prevent students from leaving early. Statistical analysis of freshman data and retention can provide such early identification.
Many previous studies have dealt with the relationship between retention and individual factors. This project examines the relationship between retention and the potential factors collectively, so that the interactions and correlations between factors can also be taken into account. The Pearson correlation between the retention variable and each individual factor is studied. An optimal set of factors that determine an individual's retention is found with a logistic regression model. Whether an individual student is highly likely to leave Truman is predicted using the developed model and compared with the actual retention. The final model is assessed using model validation procedures, including cross-tabulation and the ROC curve.
There are many factors that might affect students' decisions about staying at or leaving Truman. Some students who are far away from their hometown might have a hard time adjusting to the university. Some might find the courses too difficult. Some might lose their scholarship and find themselves in financial jeopardy. Some students might have a hard time making personal connections at the beginning of their college life and decide to try other places. In this project, many possible factors are considered and analyzed together with retention, including financial aspects (family income, financial aid, hours spent working), academic aspects (expecting to change major, high school GPA, ACT composition scores, hours spent studying, and pursuing an advanced degree), and social aspects (distance from home, ethnicity, gender, hours spent socializing, hours spent exercising, and whether or not a student is a first-generation college student). Table 1 summarizes the possible covariates (factors).
Table 1. Possible factors on the retention
| Financial factor | Academic factor | Social factor |
|---|---|---|
| Family income | Expect to change major | Hometown (distance from Kirksville) |
| Financial aid | High school GPA | Ethnicity |
| How many hours working | ACT (composition) | Sex |
| | How many hours studying | How many hours socializing |
| | Pursue advanced degree | How many hours exercising |
| | | First generation |
Note that the financial aid variable is the number of financial aid sources rather than the amount the student received. The expect-to-change-major variable takes the value 1 for no chance, 2 for little chance, 3 for some chance, and 4 for a good chance of changing major. Pursuing an advanced degree is a categorical variable equal to 0 if a student is pursuing up to a bachelor's degree and 1 if the student is pursuing a degree beyond the bachelor's. The race variable is categorized as White, African American, American Indian, Asian, and other. Gender is 0 for male and 1 for female. First generation is 1 for a first-generation college student and 0 otherwise. Note that the covariates are a mixture of continuous, ordinal, and categorical variables.
Two-year retention is the response variable. If a student is still at Truman at the beginning of her/his junior year, the response variable (retention) is 1; otherwise it is 0. Two-year retention is important because students often make their decision within the first two years rather than later in their junior or senior year.
The CIRP (Cooperative Institutional Research Program freshman survey) and CSEQ (College Student Experiences Questionnaire) data were organized by Dr. John Ishiyama (TSU, Political Science) and his students (Ishiyama, 2000), and the analysis is based on these data. The freshman data from 1996 to 1997 (with two-year retention recorded at the beginning of 1998 and 1999) are used to develop a statistical model.
Initially, the correlation between the response variable and each factor is studied. The Pearson correlations show that family income (.071), hours working for pay (-.062), change of major (.068), high school GPA (.131), ACT (.101), study time (.079), and first generation (-.087) are the factors most strongly correlated with retention. Pursuit of the highest degree, race, and socializing hours are moderately correlated with retention, and financial aid, gender, and exercising hours are weakly correlated with it.
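As a minimal sketch of this first step, the Pearson correlation between a 0/1 retention indicator and one covariate can be computed directly from its definition. The data values and the names `pearson`, `covariate`, and `retained` are hypothetical and are not taken from the CIRP/CSEQ data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values for eight students (illustration only).
covariate = [2.0, 2.5, 3.0, 3.0, 3.5, 3.5, 4.0, 4.0]
retained = [0, 0, 1, 0, 0, 1, 1, 1]  # 1 = retained after two years, 0 = left

r = pearson(covariate, retained)
print(round(r, 3))
```

Because the retention variable is binary, this coefficient only measures a linear association with each factor taken one at a time, which is one reason the factors are later analyzed jointly.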
A logistic regression model is developed to study the relationship between the binary response variable and the possible covariates. This is an appropriate statistical method for the current data, since two-year retention is a binary variable (0 or 1) and we are looking for an explanation of the relationship between retention and many covariates. Various model selection procedures are run, including backward stepwise and forward stepwise regression. These are very common model selection procedures that either eliminate unnecessary variables from the model or add necessary variables to it. Various significance levels are also used. Table 2 summarizes the variables chosen by some of the model selection procedures and significance levels.
Table 2. Model selection by stepwise selection procedure
| Model selection procedure | Chosen variables |
|---|---|
| Backward stepwise (α = .05) | Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree, sex, first generation |
| Backward stepwise (α = .1) | Income, change of major, high school GPA, ACT composition scores, hour spent studying |
| Forward stepwise (α = .05) | Income, change of major, high school GPA, ACT composition scores, hour spent studying |
| Forward stepwise (α = .1) | Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree, sex, first generation |
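Each stepwise procedure in Table 2 repeatedly refits a logistic regression and compares the fits. As a minimal sketch of one such fit, assuming a single covariate plus an intercept, the maximum likelihood estimates can be found by Newton-Raphson iteration. The data, the function name `fit_logistic`, and the fixed iteration count are hypothetical and only illustrative; the models in this paper were not necessarily computed this way.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Fit ln(p/(1-p)) = b0 + b1*x by Newton-Raphson (maximum likelihood)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of the negative Hessian
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 linear system H * delta = g.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data for eight students (illustration only).
x = [2.0, 2.5, 3.0, 3.0, 3.5, 3.5, 4.0, 4.0]
y = [0, 0, 1, 0, 0, 1, 1, 1]  # 1 = retained, 0 = left

b0, b1 = fit_logistic(x, y)
print(b0, b1)
```

A stepwise procedure would wrap fits like this one in a loop that adds or drops covariates according to a significance test at the chosen α.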
An optimal set of factors is not clear in this case, since different procedures give different candidate sets. In particular, the pursuit of highest degree, sex, and first generation variables may or may not be in the model depending on which model selection procedure is used. In fact, these stepwise procedures are known to be less reliable than more sophisticated statistical model selection tools, and many other model selection criteria have recently been studied and proposed for more reliable model selection.
AIC (Akaike Information Criterion; Akaike, 1973) and AICc (corrected AIC; Hurvich & Tsai, 1989) are two of the best-known model selection criteria that can be applied to logistic regression models. Both criteria estimate the discrepancy between the unknown true model and the current candidate model. Thus, a smaller value implies that the candidate model is closer to the true model and is therefore the better model. AIC and AICc are computed for the candidate models, and the results are provided in Table 3.
Note that candidate models with between five and ten variables are considered. It is fairly certain that five variables (income, expect to change major, high school GPA, ACT composition score, and hours spent studying) are important in the model (all the models include these five variables). However, it is less obvious whether some or all of the other five variables (the five variables with the next smallest p-values: pursuit of highest degree, sex, first generation, financial aid, and Asian) should be included in the model. Table 3 summarizes the model selection results for these candidate models. Model 1 includes only the five essential variables. Models 2 and 3 are the best two models among those with six covariates (the five essential variables and one of the five uncertain variables). Models 4 and 5 are the best two models among those with seven covariates (the five essential variables and two of the five uncertain variables), and so on.
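For reference, the two criteria can be sketched in their commonly stated forms, AIC = -2 ln L + 2k and AICc = AIC + 2k(k+1)/(n-k-1), where ln L is the maximized log-likelihood, k is the number of estimated parameters (including the intercept), and n is the sample size. The text does not state the exact formulas used for Table 3, so treating these forms as the ones applied here is an assumption, and the numbers in the example are hypothetical.

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: -2 ln L + 2k."""
    return -2.0 * log_likelihood + 2.0 * k

def aicc(log_likelihood, k, n):
    """Corrected AIC: AIC plus a small-sample penalty that vanishes as n grows."""
    return aic(log_likelihood, k) + (2.0 * k * (k + 1)) / (n - k - 1)

# Hypothetical fit: log-likelihood -100 with 3 parameters on 28 observations.
print(aic(-100.0, 3))       # 206.0
print(aicc(-100.0, 3, 28))  # 207.0
```

With a sample as large as a freshman cohort, the correction term is small, which is consistent with the AIC and AICc columns of Table 3 differing only in the second decimal place.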
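The candidate set described above (the five essential variables plus any subset of the five uncertain ones) can be enumerated as in the following sketch. The variable labels and the name `candidate_models` are illustrative only; in the actual analysis each candidate would then be fitted and scored by AIC and AICc.

```python
from itertools import combinations

ESSENTIAL = ["income", "change of major", "high school GPA",
             "ACT composition", "hours studying"]
UNCERTAIN = ["pursuing highest degree", "sex", "first generation",
             "financial aid", "Asian"]

def candidate_models():
    """Return every model made of the essential variables plus a subset of the uncertain ones."""
    models = []
    for extra in range(len(UNCERTAIN) + 1):
        for subset in combinations(UNCERTAIN, extra):
            models.append(ESSENTIAL + list(subset))
    return models

models = candidate_models()
print(len(models))  # 32 candidate models with 5 to 10 variables
```

Only the best one or two models of each size are reported in Table 3.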
Table 3. Model selection from AIC, AICc method
| Model | Variables | AIC | AICc |
|---|---|---|---|
| 1 | Income, change of major, high school GPA, ACT composition scores, hour spent studying | 2273.135 | 2273.172 |
| 2 | Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree | 1991.843 | 1991.893 |
| 3 | Income, change of major, high school GPA, ACT composition scores, hour spent studying, financial aid | 1703.089 | 1703.139 |
| 4 | Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree, financial aid | 1491.680 | 1491.744 |
| 5 | Income, change of major, high school GPA, ACT composition scores, hour spent studying, first generation, financial aid | 1699.966 | 1700.030 |
| 6 | Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree, sex, financial aid | 1490.593 | 1490.673 |
| 7 | Income, change of major, high school GPA, ACT composition scores, hour spent studying, sex, first generation, financial aid | 1700.342 | 1700.422 |
| 8 | Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree, sex, first generation, financial aid | 1488.883 | 1488.981 |
| 9 | Income, change of major, high school GPA, ACT composition scores, hour spent studying, pursuing highest degree, sex, first generation, financial aid, Asian | 1490.036 | 1490.154 |
Note that the model with income, change of major, high school GPA, ACT composition score, hours spent studying, pursuit of highest degree, sex, first generation, and financial aid (model 8) has the smallest AIC and AICc. This means that model 8 is estimated to be the closest model to the unknown true model. Note that the stepwise selection procedures give slightly different results from AIC and AICc, and that both differ from the Pearson correlation study. Statistically, the AIC and AICc results are considered more reliable than the others, so we use model 8 in Table 3 as the final model. (Since models 4 and 6 in Table 3 have AIC and AICc values similar to those of model 8 and have fewer variables, one might obtain similar prediction results using model 4 or 6.) The final model includes the following factors.
Table 4. Variable included in the final model
| Financial factor | Academic factor | Social factor |
|---|---|---|
| Family income | Expect to change major | Sex |
| Financial Aid | High school GPA | First generation |
| | ACT (sub scores) | |
| | How many hours studying | |
| | Pursue advanced degree | |
The general formula for logistic regression is
Ln ( p(x)/(1-p(x)) ) = b0 + b1 X1 + b2 X2 + … + bk Xk,
where p(x) is the probability that a student is retained at Truman after 2 years. From Table 5 (model coefficients), the final model equation can be written as
Ln ( p(x)/(1-p(x)) ) = -3.295 + .054 Income + .252 Change of major + .234 High school GPA
+ .035 ACT composition + .138 Hours spent studying + .023 Financial aid
+ .264 Pursuing highest degree - .227 Sex - .275 First generation.
Note that a few variables have p-values larger than the usual significance level of 0.05. However, AIC and AICc chose this model as the closest to the true model, so these variables are kept in the final model.
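To illustrate how the fitted equation turns covariate values into a retention probability, the sketch below inverts the logit, p = 1 / (1 + exp(-logit)), using the coefficients quoted above. The dictionary keys and the function name `retention_probability` are hypothetical labels; the example student uses the covariate values listed for student 1 in Table 6.

```python
import math

# Coefficients of the final model, as quoted in the equation above.
INTERCEPT = -3.295
COEFFICIENTS = {
    "income": 0.054,
    "change_major": 0.252,
    "hs_gpa": 0.234,
    "act_comp": 0.035,
    "study_hours": 0.138,
    "financial_aid": 0.023,
    "highest_degree": 0.264,
    "sex": -0.227,
    "first_generation": -0.275,
}

def retention_probability(student):
    """Predicted probability of two-year retention for one student's coded covariates."""
    logit = INTERCEPT + sum(COEFFICIENTS[name] * student[name] for name in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-logit))

# Covariate values of student 1 in Table 6 (in the coding used in that table).
student_1 = {
    "income": 6, "change_major": 3, "hs_gpa": 7, "act_comp": 24,
    "study_hours": 3, "financial_aid": 16, "highest_degree": 1,
    "sex": 2, "first_generation": 0,
}

p = retention_probability(student_1)
print(round(p, 3))
```

For this student the logit is about 0.855, which gives a probability close to the .702 reported in Table 6.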
Table 5. Model Coefficient
| Variable | B | S.E. | Wald | Df | Sig. | Exp(B) |
|---|---|---|---|---|---|---|
| INCOME | .054 | .028 | 3.608 | 1 | .057 | 1.055 |
| CHANGEM | .252 | .071 | 12.476 | 1 | .000 | 1.286 |
| HSGPA | .234 | .063 | 13.754 | 1 | .000 | 1.264 |
| ACTCOMP | .035 | .023 | 2.226 | 1 | .136 | 1.035 |
| STUDING | .138 | .047 | 8.733 | 1 | .003 | 1.148 |
| AIDGRANT | .023 | .025 | .849 | 1 | .357 | 1.023 |
| HIGHEST | .264 | .168 | 2.471 | 1 | .116 | 1.302 |
| SEX | -.227 | .140 | 2.616 | 1 | .106 | .797 |
| FIRSTGEN | -.275 | .142 | 3.747 | 1 | .053 | .760 |
| Constant | -3.295 | .732 | 20.266 | 1 | .000 | .037 |
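The Exp(B) column of Table 5 is the exponential of the coefficient B, which is the multiplicative change in the odds of retention for a one-unit increase in that variable with the other covariates held fixed. A minimal sketch of that relationship, using three coefficients from Table 5 (the helper names are hypothetical):

```python
import math

# Coefficients B taken from Table 5.
B = {"INCOME": 0.054, "CHANGEM": 0.252, "FIRSTGEN": -0.275}

def odds_ratio(b):
    """Multiplicative change in the odds of retention per one-unit increase."""
    return math.exp(b)

def percent_change_in_odds(b):
    """The same effect expressed as a percentage change in the odds."""
    return 100.0 * (math.exp(b) - 1.0)

for name, b in B.items():
    print(name, round(odds_ratio(b), 3), round(percent_change_in_odds(b), 1))
```

For example, the FIRSTGEN value of about 0.760 corresponds to roughly 24% lower odds of retention for first-generation students, other covariates being equal.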
From the previous logistic regression equation, several conclusions can be drawn.
2.3. Prediction and application
Using the logistic regression model developed in the previous section, we can predict an individual’s retention rate with the chosen covariates (factors). Once we recognize a student with a low potential retention rate, both academic and RCP advisors will be able to give appropriate advice for the student. This may influence the student positively and encourage them to stay at Truman. Some of the prediction examples are given in Table 6.
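One way to act on these predictions is to flag students whose predicted probability falls below a cutoff and to cross-tabulate predicted against actual retention. The sketch below uses the actual retention and predicted probability of the first five students listed in Table 6; the cutoff of 0.70 and the function names are hypothetical choices for illustration, not values taken from the analysis.

```python
# (student, actual retention, predicted retention probability) from Table 6.
TABLE_6 = [(1, 1, 0.702), (2, 1, 0.832), (3, 0, 0.699), (4, 0, 0.759), (5, 1, 0.704)]
THRESHOLD = 0.70  # hypothetical cutoff, chosen only for this example

def flag_at_risk(rows, threshold):
    """Students whose predicted retention probability is below the cutoff."""
    return [student for student, _, prob in rows if prob < threshold]

def cross_tabulate(rows, threshold):
    """Counts keyed by (actual, predicted) retention, each coded 0 or 1."""
    counts = {(actual, predicted): 0 for actual in (0, 1) for predicted in (0, 1)}
    for _, actual, prob in rows:
        predicted = 1 if prob >= threshold else 0
        counts[(actual, predicted)] += 1
    return counts

flagged = flag_at_risk(TABLE_6, THRESHOLD)
table = cross_tabulate(TABLE_6, THRESHOLD)
print(flagged)  # [3]
print(table)
```

Raising the cutoff flags more students for advising at the cost of more false alarms; sweeping it over all values traces out the ROC curve mentioned in the introduction.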
Table 6. Individual actual and predicted retention rate
| Student | Actual Retention | Retention probability | Income | Change major | H GPA | ACT comp | Study hour | Financial Aid | Degree pursue | Sex | First generation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | .702 | 6 | 3 | 7 | 24 | 3 | 16 | 1 | 2 | 0 |
| 2 | 1 | .832 | 11 | 4 | 7 | 31 | 3 | 15 | 1 | 2 | 0 |
| 3 | 0 | .699 | 8 | 2 | 6 | 34 | 1 | 19 | 1 | 1 | 0 |
| 4 | 0 | .759 | 11 | 2 | 8 | 31 | 4 | 13 | 1 | 2 | 1 |
|
5 |
1 |
.704 |
7 |
3 |
8 |
23 |
4 |
13 |