Predicting Admission Yield Rate

Project Overview

The project “Predicting Admission Yield Rate” was developed as part of the Advanced Analytics Capstone to address a major challenge faced by educational institutions: accurately forecasting which admitted students will proceed to enroll, particularly following recent Immigration, Refugees and Citizenship Canada (IRCC) policy changes.
Each academic year, universities and colleges admit thousands of students, but a significant proportion of these offers do not translate into confirmed enrollments. This “admission-to-enrollment” gap, reflected in the yield rate, has a direct impact on institutional planning, resource allocation, tuition revenue, and student opportunity costs.
 
In early 2024, IRCC introduced a national cap on study permits and a new requirement for Provincial Attestation Letters (PALs): documents issued by provinces confirming that an admitted international student’s seat falls within the province’s allocation quota.
While designed to regulate international enrollment, this policy created a challenge: once a PAL is issued to a student who later fails to enroll, that seat cannot easily be reassigned to another applicant. In practice, this means colleges lose valuable study spaces and tuition opportunities, while other qualified applicants are denied offers because the quota appears full.
Recognizing this problem, the project sought to build a machine learning classification model capable of predicting the likelihood of a student enrolling after receiving an admission offer. This predictive insight would allow institutions to identify at-risk students early, target communications more effectively, and make data-driven admission decisions that ensure each available seat is filled by a likely enrollee. By identifying high-risk non-enrollees early, schools could:
  • Reallocate attestation letters more efficiently,
  • Reduce seat wastage under the new IRCC limits, and
  • Strengthen the overall yield rate and planning accuracy for upcoming academic terms.
The predictive system was designed to use schools’ historical admissions and application data to highlight risk factors such as application timing, payment patterns, and program type.
A Power BI dashboard was created to visualize key metrics: yield trends, international vs. domestic conversion rates, and time-based patterns, allowing the school’s admissions office to make informed, data-driven decisions in real time.
 

Problem and Challenges

While the concept of predicting enrollment outcomes appears straightforward, the project encountered multiple technical and data-related challenges that shaped both the modeling process and the final insights.

1. Lack of Negative Class Data

A primary limitation was the absence of a clear dataset for students who were admitted but did not enroll. The data systems were optimized for tracking successful enrollments, meaning the project team had to engineer a proxy target variable.
This was achieved by matching applicant IDs between the admissions and enrollment databases—if an admitted student’s ID was missing from the enrollment records, they were classified as “not enrolled.”
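A minimal sketch of this proxy-labelling step, assuming two pandas extracts that share an applicant ID column (file and column names here are illustrative, not the institution's actual schema):

```python
import pandas as pd

# Hypothetical extracts: one row per admitted applicant / per confirmed enrollee
admissions = pd.read_csv("admissions.csv")
enrollments = pd.read_csv("enrollments.csv")

# An admitted applicant whose ID never appears in the enrollment records
# is labelled 0 ("not enrolled"); everyone else is labelled 1.
admissions["enrolled"] = admissions["applicant_id"].isin(
    enrollments["applicant_id"]
).astype(int)

print(admissions["enrolled"].value_counts(normalize=True))
```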

2. Missing Immigration and Visa Data

For international students, visa and immigration approval status is one of the strongest predictors of actual enrollment. Unfortunately, the database did not include this critical information, and IRCC does not publish data on applicants whose study permits are refused. Without it, the model could not fully explain why certain applicants, especially international ones, failed to enroll.
The absence of these external features limited the predictive accuracy and made it difficult to model the effects of new policies such as IRCC’s Provincial Attestation Letters (PALs) and the 2024 international student permit cap.

3. Data Leakage and Overfitting

During model development, the team identified a data leakage problem, as one of the columns in the dataset was highly correlated with the target variable, resulting in unrealistically high model accuracy.
Even Azure AutoML, which was used for benchmarking, failed to flag this issue because its feature selection consistently prioritized the leaked column.
The problem was eventually detected through chi-square feature analysis, which showed that the feature’s distribution tracked the target almost exactly.
Looking back, running feature importance or SHAP analysis earlier could have revealed the leakage sooner. This experience reinforced the need for manual feature audits even when using automated ML tools.
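A chi-square scan of this kind might look like the following sketch, assuming an encoded feature table that already contains the engineered target (file and variable names are illustrative, not the project's actual code):

```python
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Encoded feature table with the proxy target built earlier (illustrative names)
features = pd.read_csv("encoded_features.csv")
target = features.pop("enrolled")

# chi2 requires non-negative inputs, so rescale features to [0, 1] first
scores, p_values = chi2(MinMaxScaler().fit_transform(features), target)

report = (pd.DataFrame({"feature": features.columns,
                        "chi2_score": scores,
                        "p_value": p_values})
          .sort_values("chi2_score", ascending=False))
print(report.head(10))

# A feature whose score dwarfs the rest and mirrors the target distribution
# is a leakage suspect and should be audited before any model is trained.
```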

4. Feature Engineering Complexity

Creating meaningful predictive features required extensive temporal analysis. Key engineered features included:
  • Application-to-Visa Gap: number of days between application submission and visa application date.
  • Application-to-Payment Duration: time between initial application and payment of admission fees.
  • Offer-to-Enrollment Lag: duration between offer issuance and confirmed enrollment.
However, not all these features had equal relevance across applicant types. Domestic students, for example, often apply later and don’t face visa-related delays, making time-based features less predictive for them. This introduced feature noise when modeling both populations together.
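The day-gap features can be derived directly from the raw timestamps; a brief sketch with illustrative column names:

```python
import pandas as pd

# Illustrative column names for the raw admissions extract
df = pd.read_csv("admissions.csv", parse_dates=[
    "application_date", "visa_application_date",
    "payment_date", "offer_date", "enrollment_date",
])

# Gap features expressed in whole days
df["app_to_visa_days"] = (df["visa_application_date"] - df["application_date"]).dt.days
df["app_to_payment_days"] = (df["payment_date"] - df["application_date"]).dt.days
df["offer_to_enrol_days"] = (df["enrollment_date"] - df["offer_date"]).dt.days

# Domestic applicants have no visa dates, so app_to_visa_days is left missing
# rather than filled with a misleading default.
```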

5. Segmentation Oversight

Initially, a single model was built for both domestic and international students. Later evaluation revealed this to be a significant oversight, since each group is influenced by different factors:
  • International students: affected by visa timelines, immigration policies, housing, and early applications.
  • Domestic students: influenced more by local factors such as program demand, commuting distance, or financial aid.
The team concluded that separate models for each group would yield higher accuracy and interpretability, as a one-size-fits-all model blurred the distinctions between these behavioral segments.
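One way such segmentation could be implemented is to fit one model per applicant type; a sketch under stated assumptions (the student_type flag and the choice of Random Forest are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# df holds the engineered features, a student_type flag, and the enrolled target
segment_models = {}
for segment, group in df.groupby("student_type"):  # e.g. "domestic", "international"
    X_seg = group.drop(columns=["enrolled", "student_type"])
    y_seg = group["enrolled"]
    segment_models[segment] = RandomForestClassifier(
        n_estimators=300, random_state=42
    ).fit(X_seg, y_seg)
```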

6. AutoML Limitations

Although Azure AutoML was used to validate results, it did not automatically identify the data leakage issue and occasionally overfitted certain high-cardinality features. This experience underscored that AutoML should complement, not replace, human insight, especially in exploratory academic projects with limited data provenance.


Feature Selection Techniques

Six feature selection methods were compared:
Random Forest Importance, PCA, Recursive Feature Elimination (RFE), LASSO (L1), Mutual Information, and Tree-Based Selection.
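A sketch of how several of these methods might be compared side by side, assuming X is the prepared feature matrix and y the enrolled/not-enrolled target (hyperparameters shown are placeholders):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# X: encoded feature matrix, y: enrolled/not-enrolled target (prepared earlier)
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10).fit(X, y)
lasso = SelectFromModel(
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
).fit(X, y)
mi = pd.Series(mutual_info_classif(X, y), index=X.columns).sort_values(ascending=False)
rf_imp = pd.Series(
    RandomForestClassifier(n_estimators=300, random_state=42).fit(X, y).feature_importances_,
    index=X.columns,
).sort_values(ascending=False)

print("RFE keeps:", list(X.columns[rfe.support_]))
print("LASSO keeps:", list(X.columns[lasso.get_support()]))
print("Top mutual information:\n", mi.head(10))
print("Top Random Forest importance:\n", rf_imp.head(10))
```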

Data Balancing Techniques

Addressed target imbalance (62.6% vs 37.4%) using:
  • SMOTE for synthetic oversampling,
  • RandomUnderSampler, and
  • SMOTE + Tomek Links for cleaner class boundaries.
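A minimal sketch of the three resampling strategies with imbalanced-learn, applied to the training split only (variable names are illustrative):

```python
from imblearn.combine import SMOTETomek
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# X_train / y_train come from the usual train-test split
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)
X_tomek, y_tomek = SMOTETomek(random_state=42).fit_resample(X_train, y_train)

# Only the training data is resampled; the test set keeps the original
# 62.6% / 37.4% class distribution so evaluation stays realistic.
```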

Models Trained

  • Logistic Regression (baseline linear classifier)
  • Random Forest (ensemble decision-tree model)
  • XGBoost (optimized gradient boosting model)
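The three candidates could be instantiated and fitted on the balanced training data roughly as follows (a sketch reusing the X_tomek / y_tomek names from the balancing example above; hyperparameters are placeholders):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
    "xgboost": XGBClassifier(eval_metric="logloss", random_state=42),
}

# Fit each candidate on the SMOTE + Tomek resampled training data
fitted = {name: model.fit(X_tomek, y_tomek) for name, model in models.items()}
```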

Model Evaluation Metrics

Accuracy, F1 Score, Precision, Recall, ROC-AUC, and PR-AUC.
Cross-validated with RandomizedSearchCV (5-fold) for hyperparameter tuning.
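A hedged sketch of the tuning step for the XGBoost candidate, optimizing recall in the search (the parameter ranges and the use of a single scoring metric here are illustrative; the project tracked all of the metrics listed above):

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_distributions = {
    "n_estimators": randint(100, 600),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_distributions=param_distributions,
    n_iter=30,
    scoring="recall",   # recall is the headline metric for this problem
    cv=5,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_tomek, y_tomek)
print(search.best_params_, search.best_score_)
```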

Why Recall Was the Most Important Metric


What Recall Means (in simple terms)

Recall measures how many of the actual enrollees we successfully identified.
It answers the question:
“Of all the students who eventually enrolled, how many did our model correctly predict as likely to enroll?”
In other words, it is safer for the model to overestimate enrollment than to miss actual enrollees.
Mathematically:
Recall = TP / (TP + FN)
Where:
  • True Positives (TP) = Students who enrolled, and the model predicted they would.
  • False Negatives (FN) = Students who enrolled, but the model predicted they wouldn’t.
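For illustration (hypothetical numbers): if 400 admitted students eventually enrolled and the model flagged 340 of them as likely enrollees, then TP = 340, FN = 60, and recall = 340 / (340 + 60) = 0.85, meaning 15% of real enrollees were missed.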

💡 Why Recall Matters More Than Precision Here

In many machine learning projects, Precision and Recall are balanced.
But in this case, recall carries more real-world importance because of the nature of the problem.
If a student who would have enrolled is incorrectly classified as “not likely to enroll,” the institution might:
  • Delay or skip sending follow-up reminders or enrollment support.
  • Lose track of real demand for a program.
  • End up issuing too few attestation letters in future cycles.
So, maximizing recall ensures the school captures as many true enrollees as possible.

Results

The best-performing model was XGBoost trained with SMOTE + Tomek Links resampling and RFE feature selection.