The Great India Data Science Challenge Solution (Edgeverve) | asingleneuron

The Great India Data Science Challenge was a machine learning hiring challenge organized by Edgeverve on the HackerEarth platform.

It was a single-round challenge, and the problem statement was focused on NLP (text classification).

I ended up with 100% accuracy on the test dataset for this problem. Curious to know how?




PROBLEM STATEMENT:

The data consists of invoice details for multiple customers, as described below:

  • Inv_ID (Invoice ID): Unique number representing Invoice created by supplier/vendor
  • Vendor_Code (Vendor ID): Unique number representing Vendor/Seller in the procurement system
  • GL_Code: Account’s Reference ID
  • Inv_Amt: Invoice Amount
  • Item_Description: Description of the item purchased. Example: “Corporate Services Human Resources Contingent Labor/Temp Labor Contingent Labor/Temp Labor”
  • Product_Category: Category of product for which the invoice is raised. A pseudo product category is represented in the dataset as CLASS-???, where each ? is a digit.
The train dataset has 5566 rows and 6 features; the test dataset has 2446 rows and 5 features.

Task:

Our task is to predict the Product Category from the given invoice information.

Evaluation Metric:

The evaluation metric for this problem was the accuracy score.



UNDERSTAND THE DATA:

Target Distribution:

Let's draw a frequency distribution plot of Product Category.
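Here's a minimal sketch of how this plot can be drawn, assuming the data is loaded into pandas DataFrames named train and test (the file names and the Product_Category column name are assumptions based on the problem statement):

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv('train.csv')   # placeholder file names
test = pd.read_csv('test.csv')

# Frequency of each product category, most common first
train['Product_Category'].value_counts().plot(kind='bar', figsize=(15, 4))
plt.xlabel('Product_Category')
plt.ylabel('Frequency')
plt.show()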


From the plot, it's very clear that this is an imbalanced dataset. Product categories like CLASS-1758, CLASS-1274, CLASS-1522, CLASS-1250, and CLASS-1376 fall in the majority classes, while CLASS-1688, CLASS-2015, CLASS-2146, CLASS-1838, and CLASS-1957 fall in the minority classes.


GL_Code Feature:

The GL_Code feature has 9 unique categories.
Blue bars represent the frequency distribution of GL_Code for the training dataset, and orange bars represent it for the test dataset. From the plot, it's very clear that the distribution of GL_Code is the same for both datasets.
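Here's a sketch of how such a side-by-side comparison can be drawn, normalizing the frequencies so the differently sized datasets are comparable (reusing the train and test DataFrames from above):

import pandas as pd
import matplotlib.pyplot as plt

# Align the normalized GL_Code frequencies of both datasets on one index
gl = pd.DataFrame({
    'train': train['GL_Code'].value_counts(normalize=True),
    'test': test['GL_Code'].value_counts(normalize=True),
})
gl.plot(kind='bar', figsize=(10, 4))  # blue = train, orange = test
plt.ylabel('Relative frequency')
plt.show()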

Vendor_Code Feature:

The Vendor_Code feature has 1253 unique values.


The above plot is the combined frequency distribution of the Vendor_Code feature for the training and test datasets. From the plot, we can observe that the train and test distributions are not the same: you can clearly see orange bars for a few vendor codes with no blue bars alongside them. This means vendor codes like VENDOR_1712, VENDOR_1714, etc. are present only in the test dataset.
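One way to list such test-only vendor codes is a simple set difference (a sketch, reusing the same DataFrames):

# Vendor codes that appear in the test set but never in the train set
test_only = set(test['Vendor_Code']) - set(train['Vendor_Code'])
print(len(test_only), 'vendor codes appear only in the test data')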


Inv_Amt Feature:

Inv_Amt is a numerical feature that represents the invoice amount. Let's see how the invoice amount is distributed across the train and test datasets.

The curve above has a roughly rectangular shape, which suggests that Inv_Amt follows a uniform distribution. A uniform distribution has constant probability density.

If all values of the invoice amount are roughly equally probable, then the ML model is highly unlikely to gain any valuable information from this feature.
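Beyond eyeballing the plot, the uniformity can be sanity-checked with a Kolmogorov–Smirnov test against a uniform distribution over the observed range (a sketch, using the train DataFrame from above):

from scipy import stats

amt = train['Inv_Amt']
# KS test against Uniform(min, max); a large p-value means we cannot reject uniformity
print(stats.kstest(amt, 'uniform', args=(amt.min(), amt.max() - amt.min())))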


Item_Description Feature:

The Item_Description feature is a text feature that contains the description of the purchased item.



MISSING VALUES:

Missing values in the train and test datasets are as follows:
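A quick sketch of the check, reusing the train and test DataFrames loaded above:

# Count missing values per column in each dataset
print(train.isnull().sum())
print(test.isnull().sum())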

There are no missing values. 



MODEL WITH SELECTED FEATURES:

Started with a simple model that uses only three features (GL_Code, Vendor_Code, and Inv_Amt):
  • Numerically encode the categorical values using LabelEncoder.
  • Feed the feature matrix to XGBClassifier (a sketch follows the results below).
  • Achieved 0.914 validation score.

print("Accuracy : ",accuracy_score(y_valid, np.argmax(y_pred_valid, axis=1)))
Accuracy :  0.9143712574850299

  • Achieved 0.899 test score.

  • The model was able to predict 32 product_categories in the test data.
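Here's a minimal sketch of this baseline, reusing the train DataFrame from the EDA sketches above; the split ratio and default hyperparameters are my assumptions, not necessarily those of the original notebook:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X = train[['GL_Code', 'Vendor_Code', 'Inv_Amt']].copy()

# Numerically encode the two categorical columns; Inv_Amt is already numeric
for col in ['GL_Code', 'Vendor_Code']:
    X[col] = LabelEncoder().fit_transform(X[col])

y = LabelEncoder().fit_transform(train['Product_Category'])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = XGBClassifier()
clf.fit(X_train, y_train)

y_pred_valid = clf.predict_proba(X_valid)
print("Accuracy : ", accuracy_score(y_valid, np.argmax(y_pred_valid, axis=1)))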


Feature Importance Graph:


xgb.plot_importance(clf, importance_type='gain');

In the above plot, f0, f1, and f2 represent the GL_Code, Vendor_Code, and Inv_Amt features respectively.
We can see that the feature contributing the lowest gain is Inv_Amt. This matches our earlier intuition: Inv_Amt has a uniform distribution, so all of its values are roughly equally probable.
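The same gain numbers can also be read off directly and mapped back to the real feature names (a sketch, reusing clf from the baseline above):

# XGBoost falls back to the generic names f0, f1, ... when trained on unnamed arrays;
# the mapping below covers that case and passes real names through unchanged
gain = clf.get_booster().get_score(importance_type='gain')
names = {'f0': 'GL_Code', 'f1': 'Vendor_Code', 'f2': 'Inv_Amt'}
for f, score in sorted(gain.items(), key=lambda kv: kv[1], reverse=True):
    print(names.get(f, f), ':', round(score, 2))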



MODEL WITH BOW:

The second attempt was a model using the text feature (Item_Description):

  • Preprocess the Item_Description feature:
    • create word tokens
    • normalize the text: convert all words to lowercase
    • remove digits
    • remove special characters
    • remove punctuation
    • remove stopwords
    • clean up extra spaces
  • Extract BOW (bag-of-words) features:
    • feed the cleaned tokens into TfidfVectorizer
    • cap the maximum number of features and include both unigrams and bigrams
  • Feed the BOW features to XGBClassifier (a sketch of the full pipeline appears at the end of this section).
  • Achieved 0.9988 validation score.

print("Accuracy : ",accuracy_score(y_valid, np.argmax(y_pred_valid, axis=1)))
Accuracy :  0.9988023952095808

  • Achieved 0.999 test score.

  • The model was able to predict 33 product_categories in the test data.
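Here's a minimal sketch of the full pipeline; the cleaning function, the max_features cap, and the split are illustrative assumptions rather than the exact settings of the winning notebook. It reuses train, test, and the encoded labels y from the earlier sketches:

import re
import numpy as np
from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()                     # normalize to lowercase
    text = re.sub(r'\d+', ' ', text)        # remove digits
    text = re.sub(r'[^a-z\s]', ' ', text)   # remove special characters/punctuation
    tokens = [w for w in text.split() if w not in stop_words]  # drop stopwords
    return ' '.join(tokens)                 # split/join also squeezes extra spaces

train['clean_desc'] = train['Item_Description'].apply(clean_text)
test['clean_desc'] = test['Item_Description'].apply(clean_text)

# TF-IDF weighted bag of words over unigrams and bigrams, with a capped vocabulary
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X_bow = vectorizer.fit_transform(train['clean_desc'])
X_test_bow = vectorizer.transform(test['clean_desc'])

X_train, X_valid, y_train, y_valid = train_test_split(
    X_bow, y, test_size=0.3, random_state=42)

clf = XGBClassifier()
clf.fit(X_train, y_train)
y_pred_valid = clf.predict_proba(X_valid)
print("Accuracy : ", accuracy_score(y_valid, np.argmax(y_pred_valid, axis=1)))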




WINNING OOF:


The final model is based on OOF (Out of Fold) predictions. The idea was to ensemble 5 XGBClassifiers together in the hope of achieving an improved score on the test data.

  • Split the BOW features into 5 folds using StratifiedKFold.
  • Train 5 different XGBClassifiers, each holding out one fold for validation and training on the other 4 folds.
  • Make predictions with the 5 trained XGBClassifiers (max_depth=6).
  • Average the predictions together to make the final prediction (see the sketch at the end of this section).
  • Achieved 0.999 average validation score.

print("Validation Scores :", y_valid_scores)
print("Average Score: ",np.round(np.mean(y_valid_scores),3))
Validation Scores : [0.9982222222222222, 0.9991055456171736, 0.9991015274034142, 1.0, 0.9972850678733032]
Average Score:  0.999

  • Achieved 1.0 test score.


  • The final model was able to predict 34 product_categories in the test data.
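Here's a sketch of the OOF scheme, reusing X_bow, X_test_bow, and y from the BOW sketch above (StratifiedKFold with 5 splits assumes every class has at least 5 training examples):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
y_valid_scores = []
test_preds = np.zeros((X_test_bow.shape[0], len(np.unique(y))))

for train_idx, valid_idx in skf.split(X_bow, y):
    clf = XGBClassifier(max_depth=6)
    clf.fit(X_bow[train_idx], y[train_idx])

    # Score the held-out fold
    valid_proba = clf.predict_proba(X_bow[valid_idx])
    y_valid_scores.append(
        accuracy_score(y[valid_idx], np.argmax(valid_proba, axis=1)))

    # Accumulate test predictions; dividing by n_splits averages the 5 models
    test_preds += clf.predict_proba(X_test_bow) / skf.n_splits

final_pred = np.argmax(test_preds, axis=1)
print("Validation Scores :", y_valid_scores)
print("Average Score: ", np.round(np.mean(y_valid_scores), 3))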


That's all for the challenge. It was fun to achieve a 1.0 test score. We have also seen that simple EDA can give us better intuition about the features.

KAGGLE KERNEL 


GITHUB REPO



Please feel free to share your valuable feedback/suggestions; they will definitely help me understand your expectations. THANK YOU!!






