The Great India Data Science Challenge Solution (Edgeverve) | asingleneuron
The Great India Data Science Challenge was a machine learning hiring challenge organized by Edgeverve on the HackerEarth platform.
It was a single-round challenge, and the problem statement focused on NLP (text classification).
I ended up with 100% accuracy on the test dataset for this problem. Curious to know how?
PROBLEM STATEMENT:
Data consists of Invoice details for multiple customers as described below:
- Inv_ID (Invoice ID): Unique number representing Invoice created by supplier/vendor
- Vendor_Code (Vendor ID): Unique number representing Vendor/Seller in the procurement system
- GL_Code: Account’s Reference ID
- Inv_Amt: Invoice Amount
- Item_Description: Description of Item Purchased Example: “Corporate Services Human Resources Contingent Labor/Temp Labor Contingent Labor/Temp Labor”
- Product Category: Category of Product for which Invoice is raised. A pseudo product category is represented in the dataset as CLASS-???, where ? is a digit.
The train dataset has 5566 rows and 6 features; the test dataset has 2446 rows and 5 features.
Task:
Our task is to predict the Product Category from the given invoice information.
Evaluation Metric:
Evaluation Metric for this problem was Accuracy Score.
UNDERSTAND THE DATA:
Target Distribution:
Let's draw a frequency distribution plot of Product Category.
From the plot, it's very clear that this is an imbalanced dataset. Product categories like CLASS-1758, CLASS-1274, CLASS-1522, CLASS-1250, and CLASS-1376 are majority classes, while CLASS-1688, CLASS-2015, CLASS-2146, CLASS-1838, and CLASS-1957 are minority classes.
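The same imbalance can be checked numerically. Here is a minimal sketch, assuming the training data sits in a pandas DataFrame with a `Product_Category` column; the toy rows are made up for illustration and are not the real class counts:

```python
import pandas as pd

# Toy stand-in for the training data (counts are illustrative only).
train = pd.DataFrame({
    "Product_Category": ["CLASS-1758"] * 5 + ["CLASS-1274"] * 3 + ["CLASS-1688"]
})

# Absolute and relative frequency of each class, most frequent first.
class_counts = train["Product_Category"].value_counts()
class_share = train["Product_Category"].value_counts(normalize=True)

print(class_counts)
print(class_share.round(3))
```

A large gap between the first and last rows of `class_counts` is the numeric counterpart of the skew visible in the plot.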
GL_Code Feature:
GL_Code feature has 9 unique categories.
Blue bars represent the frequency distribution of GL_Code in the training dataset and orange bars represent the test dataset. From the above plot, it's very clear that the distribution of GL_Code is the same in both datasets.
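The visual comparison can be backed by a small table of relative frequencies. This is a sketch under assumptions: the `train` and `test` DataFrames and the GL code values below are hypothetical placeholders, not values from the real data.

```python
import pandas as pd

# Hypothetical GL codes, used only to show the comparison.
train = pd.DataFrame({"GL_Code": ["GL-1", "GL-1", "GL-2", "GL-2"]})
test = pd.DataFrame({"GL_Code": ["GL-1", "GL-2"]})

# Share of each GL_Code in train and in test, side by side.
shares = pd.concat(
    [
        train["GL_Code"].value_counts(normalize=True),
        test["GL_Code"].value_counts(normalize=True),
    ],
    axis=1,
    keys=["train", "test"],
).fillna(0.0)

# A small absolute difference means the two distributions agree for that code.
shares["abs_diff"] = (shares["train"] - shares["test"]).abs()
print(shares)
```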
Vendor_Code Feature:
Vendor_Code feature has 1253 unique values.
The above plot is the combined frequency distribution of the Vendor_Code feature for the training and test datasets. From the plot, we can observe that the train and test distributions are not the same. You can clearly see orange bars for a few vendor codes but no blue bars for the same codes. This means that vendor codes like VENDOR_1712, VENDOR_1714, etc. are present only in the test dataset.
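Vendor codes that appear only in the test data can also be listed with a set difference. A minimal sketch: VENDOR_1712 and VENDOR_1714 come from the observation above, while the other IDs are hypothetical and the lists stand in for the real `Vendor_Code` columns.

```python
# Stand-ins for the Vendor_Code columns of the train and test datasets.
train_vendors = ["VENDOR_1001", "VENDOR_1002", "VENDOR_1002"]
test_vendors = ["VENDOR_1002", "VENDOR_1712", "VENDOR_1714"]

# Codes that occur in test but never in train.
only_in_test = sorted(set(test_vendors) - set(train_vendors))

# Fraction of test rows whose vendor was never seen during training.
unseen_share = sum(v in only_in_test for v in test_vendors) / len(test_vendors)

print(only_in_test)
print(round(unseen_share, 3))
```

A high `unseen_share` would be a warning that a model relying heavily on Vendor_Code may not generalize to the test rows.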
Inv_Amt Feature:
Inv_Amt is a numerical feature that represents the Invoice Amount. Let's see how the Invoice Amount is distributed in the train and test datasets.
The above curve has a rectangular shape, which indicates that Inv_Amt has a uniform distribution. A uniform distribution has a constant probability density.
If all the values of invoice amount are equally probable, then the ML model is unlikely to gain any valuable information from this feature.
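One rough way to check the "rectangular shape" without a plot is to bin the values and see whether every bin holds about the same number of rows. The sketch below uses simulated amounts, not the real Inv_Amt column, and the range 0 to 100 is an arbitrary assumption:

```python
import numpy as np

# Simulated invoice amounts (assumption: the real column would be used instead).
rng = np.random.default_rng(0)
inv_amt = rng.uniform(0.0, 100.0, size=10_000)

# Count how many values fall into each of 10 equal-width bins.
counts, _ = np.histogram(inv_amt, bins=10)

# For a uniform distribution every bin should hold roughly the same count.
expected = inv_amt.size / len(counts)
max_rel_dev = np.max(np.abs(counts - expected)) / expected

print(counts)
print(round(float(max_rel_dev), 3))
```

A small `max_rel_dev` is consistent with a uniform distribution; a formal test (for example a chi-square goodness-of-fit test) would be needed for a stronger statement.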
Item_Description Feature:
Item_Description is a text feature that contains the description of the purchased item.
IMPUTING MISSING VALUES:
Missing values in the train and test datasets are as follows:
There are no missing values.
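A check like this is usually a one-liner in pandas. A minimal sketch, assuming the data is in DataFrames named `train` and `test`; the rows below are made-up placeholders:

```python
import pandas as pd

# Placeholder rows; the real datasets would be loaded from the challenge files.
train = pd.DataFrame({
    "Inv_Id": [1, 2],
    "Vendor_Code": ["VENDOR_1001", "VENDOR_1002"],
    "Inv_Amt": [10.5, 99.0],
})
test = pd.DataFrame({
    "Inv_Id": [3],
    "Vendor_Code": ["VENDOR_1712"],
    "Inv_Amt": [42.0],
})

# Number of missing values per column in each dataset.
missing_train = train.isnull().sum()
missing_test = test.isnull().sum()

print(missing_train)
print(missing_test)
```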
MODEL WITH SELECTED FEATURES:
I started with a simple model that uses only three features (GL_Code, Vendor_Code, and Inv_Amt):
- Numerically encode the categorical values by using LabelEncoder.
- Feed the feature matrix to XGBClassifier.
- Achieved 0.914 validation score.
print("Accuracy : ",accuracy_score(y_valid, np.argmax(y_pred_valid, axis=1)))
- Achieved 0.899 test score.
- The model was able to predict 32 product_categories in test data.
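The encoding step can be sketched as follows. One assumption is worth flagging: because some vendor codes exist only in the test data, this sketch fits each LabelEncoder on the train and test values combined, which may or may not match what the original notebook does. All data values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows standing in for the real train and test datasets.
train = pd.DataFrame({
    "GL_Code": ["GL-1", "GL-2", "GL-1"],
    "Vendor_Code": ["VENDOR_1001", "VENDOR_1002", "VENDOR_1001"],
    "Inv_Amt": [10.5, 99.0, 47.2],
})
test = pd.DataFrame({
    "GL_Code": ["GL-2", "GL-1"],
    "Vendor_Code": ["VENDOR_1002", "VENDOR_1712"],
    "Inv_Amt": [15.0, 63.4],
})

feature_cols = ["GL_Code", "Vendor_Code", "Inv_Amt"]

for col in ["GL_Code", "Vendor_Code"]:
    encoder = LabelEncoder()
    # Assumption: fit on train + test so test-only codes do not raise an error.
    encoder.fit(pd.concat([train[col], test[col]]))
    train[col] = encoder.transform(train[col])
    test[col] = encoder.transform(test[col])

# Numeric feature matrices, ready to feed to a classifier such as XGBClassifier.
X_train = train[feature_cols].to_numpy()
X_test = test[feature_cols].to_numpy()
print(X_train)
```

Label encoding imposes an arbitrary order on the codes, which tree-based models such as XGBClassifier generally tolerate better than linear models do.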
Feature Importance Graph:
xgb.plot_importance(clf, importance_type='gain');
In the above plot, f0, f1, and f2 represent the GL_Code, Vendor_Code, and Inv_Amt features respectively.
We can see that the feature contributing the lowest gain is Inv_Amt. This matches our intuition, because we have seen that Inv_Amt has a uniform distribution and all the values of this feature are equally probable.
MODEL WITH BOW :
The second attempt was a model that uses the text feature (Item_Description):
- Preprocess the Item_Description feature
  - create word tokens
  - text normalization: convert all the words to lowercase
  - digits removal
  - special characters removal
  - punctuation removal
  - stopwords removal
  - clean extra spaces
- Extract BOW (Bag of Words) features
  - feed the cleaned tokens into TfidfVectorizer
  - use the maximum number of features, including both unigrams and bigrams
- Feed the BOW features to XGBClassifier
- Achieved 0.9988 validation score.
print("Accuracy : ",accuracy_score(y_valid, np.argmax(y_pred_valid, axis=1)))
- Achieved 0.999 test score.
- The model was able to predict 33 product_categories in test_data.
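The preprocessing and feature extraction steps above can be sketched as follows. Several things here are assumptions rather than the original code: the tiny `STOPWORDS` set is a stand-in for a full stopword list, the second description is made up, and no cap is placed on the vocabulary size.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in stopword list (assumption); a full list would be used in practice.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def clean_description(text):
    """Lowercase, strip digits and non-letters, drop stopwords, squeeze spaces."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # digits removal
    text = re.sub(r"[^a-z\s]", " ", text)  # special characters and punctuation
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)                # single spaces between tokens

descriptions = [
    "Corporate Services Human Resources Contingent Labor/Temp Labor "
    "Contingent Labor/Temp Labor",
    "Purchase of 12 Laptops & Accessories!",  # made-up example
]
cleaned = [clean_description(d) for d in descriptions]

# Unigrams and bigrams, weighted by TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
bow_features = vectorizer.fit_transform(cleaned)

print(cleaned)
print(bow_features.shape)
```

The resulting sparse matrix `bow_features` has one row per invoice and one column per unigram or bigram, and it can be passed directly to a classifier.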
WINNING OOF:
The final model is based on OOF (Out of Fold) prediction. The idea was to ensemble 5 XGBClassifiers in the hope of achieving an improved score on the test data.
- Split the BOW features into 5 folds, using StratifiedKFold.
- Train 5 different XGBClassifiers, each time holding out one fold for validation and using the other 4 folds for training.
- Predict on the test data with each of the 5 trained XGBClassifier(max_depth=6) models.
- Average the predictions together to make the final prediction.
- Achieved 0.999 validation score.
print("Validation Scores :", y_valid_scores)
print("Average Score: ",np.round(np.mean(y_valid_scores),3))
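The fold-and-average procedure can be sketched as below. To keep the sketch light on dependencies it uses scikit-learn's LogisticRegression as a stand-in classifier and synthetic data from `make_classification`; neither is from the original solution, which uses XGBClassifier(max_depth=6) on the BOW features. The helper name `oof_ensemble` is hypothetical. It also assumes the labels are encoded as 0..n_classes-1 and that every class appears in each training fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

def oof_ensemble(make_model, X, y, X_test, n_splits=5, seed=42):
    """Train one model per fold and average their test probabilities."""
    n_classes = len(np.unique(y))
    test_proba = np.zeros((X_test.shape[0], n_classes))
    y_valid_scores = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in skf.split(X, y):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        # Score this fold's model on its held-out fold.
        y_pred_valid = model.predict_proba(X[valid_idx])
        y_valid_scores.append(
            accuracy_score(y[valid_idx], np.argmax(y_pred_valid, axis=1))
        )
        # Accumulate the average of the per-fold test probabilities.
        test_proba += model.predict_proba(X_test) / n_splits
    return np.argmax(test_proba, axis=1), y_valid_scores

# Synthetic data in place of the real BOW features (assumption).
X_all, y_all = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X, y, X_test = X_all[:240], y_all[:240], X_all[240:]

test_pred, y_valid_scores = oof_ensemble(
    lambda: LogisticRegression(max_iter=1000), X, y, X_test
)
print("Validation Scores :", y_valid_scores)
print("Average Score: ", np.round(np.mean(y_valid_scores), 3))
```

Passing a factory such as `lambda: XGBClassifier(max_depth=6)` instead of the LogisticRegression stand-in would bring the sketch closer to the setup described above.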
That's all for the challenge. It was fun to achieve a 1.0 test score. We have also seen that simple EDA can give us a better intuition about the features.
KAGGLE KERNEL
GITHUB REPO
Please feel free to give your valuable feedback and suggestions; they will definitely help me understand your expectations. THANK YOU !!