DonorsChoose

Application status prediction for DonorsChoose.org

The goal of the project is to predict whether or not a DonorsChoose.org project proposal submitted by a teacher will be approved, using the text of project descriptions as well as additional metadata about the project, teacher, and school.

Dataset

The train.csv data set provided by DonorsChoose contains the following features:

| Feature | Description |
|---------|-------------|
| project_id | A unique identifier for the proposed project. Example: p036502 |
| project_title | Title of the project. Example: "Art Will Make You Happy!" |
| project_grade_category | Grade level of students for which the project is targeted. One of the following enumerated values: Grades PreK-2, Grades 3-5, Grades 6-8, Grades 9-12 |
| project_subject_categories | One or more (comma-separated) subject categories for the project from the following list of values: Applied Learning, Care & Hunger, Health & Sports, History & Civics, Literacy & Language, Math & Science, Music & The Arts, Special Needs, Warmth. Examples: "Music & The Arts"; "Literacy & Language, Math & Science" |
| school_state | State where the school is located (two-letter U.S. postal code). Example: WY |
| project_subject_subcategories | One or more (comma-separated) subject subcategories for the project. Example: "Literature & Writing, Social Sciences" |
| project_resource_summary | An explanation of the resources needed for the project. Example: "My students need hands on literacy materials to manage sensory needs!" |
| project_essay_1 | First application essay: "Introduce us to your classroom" |
| project_essay_2 | Second application essay: "Tell us more about your students" |
| project_essay_3 | Third application essay: "Describe how your students will use the materials you're requesting" |
| project_essay_4 | Fourth application essay: "Close by sharing why your project will make a difference" |
| project_submitted_datetime | Datetime when the project application was submitted. Example: 2016-04-28 12:43:56.245 |
| teacher_id | A unique identifier for the teacher of the proposed project. Example: bdf8baa8fedef6bfeec7ae4ff1c15c56 |
| teacher_prefix | Teacher's title. One of the following enumerated values: nan, Dr., Mr., Mrs., Ms., Teacher. |
| teacher_number_of_previously_posted_projects | Number of project applications previously submitted by the same teacher. Example: 2 |

Additionally, the resources.csv data set provides more data about the resources required for each project. Each line in this file represents a resource required by a project:

| Feature | Description |
|---------|-------------|
| id | A project_id value from the train.csv file. Example: p036502 |
| description | Description of the resource. Example: "Tenor Saxophone Reeds, Box of 25" |
| quantity | Quantity of the resource required. Example: 3 |
| price | Price of the resource required. Example: 9.95 |

Note: Many projects require multiple resources. The id value corresponds to a project_id in train.csv, so you can use it as a key to retrieve all resources needed for a project:
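A minimal pandas sketch of this join (file and column names as described above; the derived project_cost column is an assumption for later use, not a column in the original data):

```python
import pandas as pd

train = pd.read_csv("train.csv")
resources = pd.read_csv("resources.csv")

# Total cost of each resource line, then total cost per project.
resources["total_cost"] = resources["quantity"] * resources["price"]
cost_per_project = resources.groupby("id")["total_cost"].sum().rename("project_cost")

# Join the aggregated cost back to the project rows via project_id.
train = train.merge(cost_per_project, left_on="project_id", right_index=True, how="left")
```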

The data set contains the following label (the value you will attempt to predict):

| Label | Description |
|-------|-------------|
| project_is_approved | A binary flag indicating whether DonorsChoose approved the project. A value of 0 indicates the project was not approved; a value of 1 indicates it was approved. |

Exploratory Data Analysis

  • Univariate Analysis
    • Distribution of the target variable
    • Distribution of categorical variables
      • Distribution of School State with respect to the target variable (acceptance rate) and the number of projects submitted: look at the extreme values (see the plotting sketch after this list)
      • Distribution of Teacher Prefix with respect to the target variable (acceptance rate) and the number of projects submitted: look at the extreme values
      • Distribution of Project Grade Category with respect to the target variable (acceptance rate) and the number of projects submitted: look at the extreme values
      • Distribution of Project Subject Categories with respect to the target variable (acceptance rate) and the number of projects submitted: look at the extreme values
      • Distribution of Project Subject Subcategories with respect to the target variable (acceptance rate) and the number of projects submitted: look at the extreme values
      • Project title length vs. the target variable (acceptance rate)
      • Resource summary length vs. the target variable (acceptance rate)
      • Number of previously posted projects by the same teacher vs. the target variable (acceptance rate)
    • Distribution of numerical variables
      • Cost of the project vs. the target variable (acceptance rate), and the distribution of project cost
  • Bivariate Analysis
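As an illustration of the categorical analyses above, a minimal sketch (matplotlib assumed, and the train frame from the earlier join sketch) that compares acceptance rate by school state:

```python
import matplotlib.pyplot as plt

# Acceptance rate and project count per state; the extreme values at
# either end of the sorted rates are the ones worth inspecting.
by_state = train.groupby("school_state")["project_is_approved"].agg(["mean", "count"])
by_state = by_state.sort_values("mean")

by_state["mean"].plot(kind="bar", figsize=(14, 4))
plt.ylabel("Acceptance rate")
plt.title("Acceptance rate by school state")
plt.show()
```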

Data Preprocessing

  • Preprocessing of Categorical Features
    • School State
    • Teacher Prefix
    • Project Grade Category
    • Project Subject Categories
    • Project Subject Subcategories

The first step is to convert all the feature values into standard names so that they contain no spaces and each category is a single word. We then have two choices for encoding categorical features:

  • Convert them into ordinal numbers using LabelEncoder (even though the categories have no inherent order)
  • Convert them into one-hot vectors using OneHotEncoder

The choice depends on the model we select: a model like Naive Bayes or a decision tree can use the ordinal values directly, while a model like logistic regression or SVM needs the one-hot encoding. So we will try both approaches and see which one works better.
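A minimal scikit-learn sketch of both encodings (using school_state as the example column):

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

states = train["school_state"].fillna("unknown")

# Option 1: ordinal codes (usable directly by tree-based models).
le = LabelEncoder()
state_ordinal = le.fit_transform(states)

# Option 2: one-hot vectors (better suited to linear models such as
# logistic regression or SVM).
ohe = OneHotEncoder(handle_unknown="ignore")
state_onehot = ohe.fit_transform(states.values.reshape(-1, 1))
```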

For features that are a list of categories, we again have two choices:

  • Treat each combination as a single category by concatenating all the categories
  • Split them into individual categories and represent them as a bag-of-words vector

If the number of distinct concatenated combinations is limited, we can use the first approach; if it is large, we use the second approach.
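For the second approach, CountVectorizer can treat each category token as a word. A sketch, assuming the categories were already normalized to single tokens (the column name project_subject_categories_clean is hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each row holds space-separated single-word category tokens,
# e.g. "Literacy_Language Math_Science".
vectorizer = CountVectorizer(binary=True, token_pattern=r"[^ ]+")
subject_bow = vectorizer.fit_transform(train["project_subject_categories_clean"])
```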

  • Preprocessing of Numerical Features
    • Teacher Number of Previously Posted Projects
    • Price of the Project
    We have two approaches for preprocessing numerical features (see the sketch after this list):
    • Standardize the data using StandardScaler
    • Scale the data into a fixed range using MinMaxScaler
  • Text Preprocessing
    • Project Title
    • Project Essay
    • Project Resource Summary
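Before turning to text, a minimal scaling sketch for the numerical features above (the project_cost column is the assumed derived column from the earlier join sketch; scalers should be fit on the training split only to avoid leakage):

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

numeric_cols = ["teacher_number_of_previously_posted_projects", "project_cost"]

# Zero mean / unit variance.
standardized = StandardScaler().fit_transform(train[numeric_cols])

# Alternatively, scale each column into [0, 1].
minmaxed = MinMaxScaler().fit_transform(train[numeric_cols])
```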

Text preprocessing can be done in several different ways, depending on how we want to represent the text as vectors. For methods like TF-IDF and Bag of Words, we perform the following steps:

  • Lowercase the text
  • Tokenize the text
  • Remove stop words
  • Remove words containing numbers (depends on the problem statement)
  • Remove words containing special characters (depends on the problem statement)
  • Remove punctuation and irregular words
  • Apply stemming or lemmatization
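A minimal cleaning sketch implementing the steps above (NLTK assumed for stop words and stemming; the essay_clean column name is an assumption):

```python
import re
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_text(text: str) -> str:
    text = text.lower()
    # Drop everything except letters and spaces (removes numbers and
    # special characters; adjust to the problem statement).
    text = re.sub(r"[^a-z ]+", " ", text)
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(stemmer.stem(t) for t in tokens)

train["essay_clean"] = train["project_essay_1"].fillna("").apply(clean_text)
```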

Once we have a clean corpus we can build vectors such as TF-IDF or Bag of Words. There are a few more considerations when using these methods: we can use n-grams to capture the local context of words, use raw term frequencies or IDF-weighted (TF-IDF) values as features, and we can also build a TF-IDF-weighted average of word2vec vectors as a feature.
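A minimal TF-IDF sketch with unigrams and bigrams over the cleaned text (the vocabulary limits shown are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=5)
essay_tfidf = tfidf.fit_transform(train["essay_clean"])
```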

If we are using a word2vec-style method, we do not need all of the steps above. We can simply tokenize the text, look up each word's vector in the word2vec model, and map out-of-vocabulary words to an UNK token. The sum (or average) of the vectors of all the words in a sentence gives the vector for the sentence; these are dense word vectors. The same idea applies if we use BERT or another transformer-based model, where the vector corresponding to the CLS token serves as the sentence vector.
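A sketch of sentence vectors built by averaging word vectors (gensim's pretrained vectors assumed; this sketch simply skips out-of-vocabulary tokens rather than using an explicit UNK vector):

```python
import numpy as np
import gensim.downloader

# Pretrained 300-d vectors (large download); any word2vec-style
# KeyedVectors would work here.
w2v = gensim.downloader.load("word2vec-google-news-300")

def sentence_vector(text: str) -> np.ndarray:
    # Average the vectors of in-vocabulary tokens.
    vecs = [w2v[t] for t in text.split() if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
```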

Machine Learning Models

Since this is a binary classification problem, we have the following models to try:

  • Naive Bayes, Decision Tree, Random Forest, XGBoost

    These models can handle ordinal features.

  • SVM, Logistic Regression

    These models need one-hot encoded features. Ordinal encodings are not suitable here because they impose an artificial order on the categories, and both models learn a weight per feature.
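A minimal baseline sketch tying the earlier feature sketches together (the variables state_onehot, subject_bow, and essay_tfidf come from those assumed sketches, not from the original code):

```python
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stack the sparse feature blocks side by side.
X = hstack([state_onehot, subject_bow, essay_tfidf]).tocsr()
y = train["project_is_approved"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000, C=1.0)
clf.fit(X_train, y_train)
print("Test AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```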

Results

Here we present only the best result for each model; we tried different hyperparameters and feature engineering techniques to obtain these. Out of all the models, the best one is LSTM + NN, which needs some explanation. We used an LSTM to obtain a feature embedding for the text data, an embedding matrix for each categorical feature, and the numerical features as-is. We then concatenated all the features and passed them through a feed-forward neural network to get the final prediction.
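This architecture could look roughly like the following Keras sketch; the layer sizes, vocabulary/sequence parameters, and the single categorical input shown are illustrative assumptions, not the exact configuration used:

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN = 40000, 300   # assumed text parameters
N_STATES = 51                      # assumed cardinality of school_state
N_NUMERIC = 2                      # e.g. prior projects, project cost

# Text branch: embedding + LSTM produce a learned text representation.
text_in = keras.Input(shape=(SEQ_LEN,), name="essay_tokens")
x = layers.Embedding(VOCAB_SIZE, 128)(text_in)
x = layers.LSTM(64)(x)

# Categorical branch: a small embedding per categorical feature.
state_in = keras.Input(shape=(1,), name="school_state")
s = layers.Flatten()(layers.Embedding(N_STATES, 8)(state_in))

# Numerical features are passed through as-is.
num_in = keras.Input(shape=(N_NUMERIC,), name="numeric")

# Concatenate all branches and classify with a small dense head.
merged = layers.concatenate([x, s, num_in])
h = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(h)

model = keras.Model([text_in, state_in, num_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC()])
```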

| Model | Train AUC | Test AUC |
|-------|-----------|----------|
| KNN | 1.000 | 0.560 |
| Naive Bayes | 0.715 | 0.650 |
| Decision Tree | 0.686 | 0.685 |
| GBDT | 0.747 | 0.710 |
| LSTM + NN | 0.844 | 0.754 |