RecSys Challenge
Deep funnel optimization for app installations focusing on user privacy.
In this challenge, we are provided with a real-world ad dataset from the ShareChat and Moj apps, which serves as a benchmark for research into deep funnel optimization with a focus on user privacy. As part of the challenge, ShareChat released an anonymized dataset corresponding to roughly 10M random users who visited the ShareChat and Moj apps over three months.
Dataset
We are provided with the following data:
Feature | Description
---|---
Row ID | f_0
Date | f_1
Categorical Features | f_2 … f_32
Binary Features | f_33 … f_41
Numerical Features | f_42 … f_79
Is_click | [Binary] Did the user click on the ad
Is_install | [Binary] Did the user install the app
As the data is anonymized, we do not have the names of the features, and since it is already preprocessed we work with it as given. A basic description of the data is given below:
Statistic | Count
---|---
Total Rows | 3,485,852
Only Installs | 358,645 (~10%)
Only Clicks | 518,357 (~15%)
Both Installed and Clicked | 247,957 (~7%)
Unique Dates | 22
Data Exploration
We did some exploration of the range of the features and the number of classes in each categorical feature. We also explored the correlation between the features and the target variable, as well as the class imbalance in the dataset.
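A minimal sketch of this kind of exploration with pandas, assuming the training data is in a local `train.csv` and the labels are stored as `is_clicked` and `is_installed` (the actual file name, separator, and label column names may differ):

```python
import pandas as pd

# Assumed file name, separator, and label columns; adjust to the actual dataset.
df = pd.read_csv("train.csv", sep="\t")

cat_cols = [f"f_{i}" for i in range(2, 33)]   # categorical features f_2 ... f_32
num_cols = [f"f_{i}" for i in range(42, 80)]  # numerical features f_42 ... f_79

# Range of the numerical features
print(df[num_cols].describe().T[["min", "max", "mean", "std"]])

# Number of classes (cardinality) of each categorical feature
print(df[cat_cols].nunique().sort_values(ascending=False))

# Correlation of the numerical features with the install label
print(df[num_cols + ["is_installed"]].corr()["is_installed"].sort_values())

# Class imbalance of the two targets
print(df["is_clicked"].value_counts(normalize=True))
print(df["is_installed"].value_counts(normalize=True))
```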
Data Preprocessing
All the data is already numeric, so the usual initial steps of standardization and encoding (one-hot or label encoding) are not required. What remains is to check for missing values and handle them, to check for outliers and remove them, to check for highly correlated features and drop the redundant ones, and to address the class imbalance with techniques like SMOTE.
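A minimal sketch of these checks, continuing from the exploration snippet above (the correlation threshold and the use of SMOTE on the install label are illustrative choices, not the exact pipeline used):

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Fraction of missing values per column (decide whether to drop or impute)
missing = df.isna().mean().sort_values(ascending=False)
print(missing[missing > 0])

# Drop one feature from every highly correlated pair (0.95 threshold is illustrative)
corr = df[num_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]

X = df.drop(columns=to_drop + ["f_0", "f_1", "is_clicked", "is_installed"])
y = df["is_installed"]

# Rebalance the install class with SMOTE (requires fully numeric, non-missing input)
X_res, y_res = SMOTE(random_state=42).fit_resample(X.fillna(0), y)
```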
Imputation
For categorical features we tried mode imputation, KNN-based imputation, and probabilistic imputation. For numerical features we tried mean imputation. The missing values can also be assigned a separate category of their own.
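A sketch of these imputation options with scikit-learn, reusing the `cat_cols` and `num_cols` lists from the exploration snippet (probabilistic imputation is not shown, and the commented lines are alternatives rather than the exact setup used):

```python
from sklearn.impute import SimpleImputer, KNNImputer

# Mode imputation for categorical features, mean imputation for numerical ones
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])

# Alternative: KNN-based imputation (expensive on ~3.5M rows, so sample first)
# df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])

# Alternative: treat missing categorical values as a separate category
# df[cat_cols] = df[cat_cols].fillna(-1)
```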
Models
We tried a number of models, including Logistic Regression, Random Forest, XGBoost, LightGBM, CatBoost, and Naive Bayes, as well as stacking of models. To tune their hyperparameters we tried techniques such as Grid Search, Random Search, and Bayesian Optimization.
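As an example of the Bayesian optimization step, here is a minimal Optuna sketch for tuning an XGBoost classifier on the install label, assuming the `X` and `y` from the preprocessing snippet above (the search ranges and trial count are illustrative, not the tuned values reported below):

```python
import optuna
import xgboost as xgb
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

def objective(trial):
    params = {
        "objective": "binary:logistic",
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "n_estimators": trial.suggest_int("n_estimators", 200, 1000),
    }
    model = xgb.XGBClassifier(**params)
    model.fit(X_train, y_train)
    preds = model.predict_proba(X_val)[:, 1]
    return log_loss(y_val, preds)  # minimize validation log loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```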
Results
Model | Val Loss | Test Loss | Description
---|---|---|---
xgb_all_feat | 5.968 | 6.373 | Hyperparameter tuning with Optuna
f1_xgb_catb_best | NaN | 6.375 | Combined XGB and CatBoost with the F1 formula
xgb_all_feat_xgb_chain | 5.99 | 6.385 | XGB on top of another XGB model
catboost_all_feat | 5.739 | 6.461 | CatBoost with all features, tuned with Optuna
xgb_stacked_kfold_logistic | NaN | 6.611 | XGB stacked with k-fold logistic regression
xgb_calibrated_logistic | NaN | 6.613 | Calibrated XGB with logistic regression
xgb_stack_all_cat_num_xgb | 6.026 | 6.643 | Stack of rows 1, 2 and 3 above (Optuna) with an XGB model
xgb_num_cat_all_avg | NaN | 6.690 | Simple average of all the XGB models of row 1
xgb_cat_feat | 6.248 | 6.927 | Hyperparameter tuning with Optuna, using only categorical features
xgb_float_all_feat | 5.79 | 7.254 | All categorical values converted to probabilities
xgb_num_feat | 6.245 | 7.401 | Hyperparameter tuning with Optuna, using only numerical features
xgb_stack_all_cat_num | 7.685 | 9.910 | Stack of rows 1, 2 and 3 above (Optuna) with a logistic regression model