Building a Recommendation System

Introduction

Recommendation systems are everywhere: YouTube, Netflix, supermarket coupons, Amazon, BestBuy, search engines, and so on. They are often used to offer carefully selected products to consumers and relevant information to web users. In this post, I will walk you through a recommendation system I implemented to help car shop employees select the correct label in the billing system. You can also follow along in the GitHub repository.

Data Set

The data set was obtained from the car shop in question. It is a sample of a database of bill descriptions generated over thirteen years (around 5 thousand bills). The data, stored in a DataFrame, contains the bill description and a label, which can take one or multiple values from the 9 total categories (Appendix B). The first five rows of the DataFrame:

	description	label
0	upstream 02 sensor	5
1	install dual pipes	9
2	cqs 18 818285 front struts	1
3	oil change / mount 4 tires & ballance	2,8
4	labor on racks & pinion	9

Model

Because there are more than two categories, a one-vs-all (or one-vs-rest) classifier with logistic regression as the base classifier will be used. This classifier trains one binary logistic classifier per class and outputs the label with the highest probability.
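As a small illustration of the one-vs-rest idea, the sketch below fits the classifier on a handful of toy descriptions. The descriptions and the three label names are invented for this example and are not taken from the car shop data; it only shows that one binary model is trained per class and that the class probabilities can be compared afterwards.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data, invented for illustration only (not the car shop data set)
descriptions = [
    "oil change", "oil change and filter",
    "front brake pads", "rear brake pads and rotors",
    "battery replacement", "new battery and starter",
]
labels = ["oil", "oil", "brakes", "brakes", "electrical", "electrical"]

features = TfidfVectorizer().fit_transform(descriptions)
ovr = OneVsRestClassifier(LogisticRegression()).fit(features, labels)

# One binary logistic regression is trained for each distinct class
n_binary_models = len(ovr.estimators_)

# Probabilities for the first description, one column per class in ovr.classes_
probabilities = ovr.predict_proba(features[:1])
predicted = ovr.classes_[probabilities.argmax()]
print(n_binary_models, list(ovr.classes_), predicted)
```

The columns of predict_proba follow the order of ovr.classes_, which is what makes it possible to rank the classes and not just take the single best one.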

To feed the data into the classifier, the labels have to be properly formatted as a "one-hot" encoded string (Appendix A). Once formatted, the data set is split into training and testing sets, and the training set is fed into OneVsRestClassifier() with the following lines of code:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Hold out 25% of the bills for testing
X_train, X_test, Y_train, Y_test = train_test_split(
    DF['description'], DF['emb_label'], test_size=0.25)

# Learn the TF-IDF vocabulary on the training descriptions only
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Wrap a logistic regression in a one-vs-rest classifier and train it
base_classifier = LogisticRegression()
clf = OneVsRestClassifier(base_classifier)
clf.fit(X_train_tfidf, Y_train)
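The accuracy of a classifier like this can be estimated on the held-out test set. The following is a minimal sketch of one way to do that with accuracy_score; it is not necessarily how the figure quoted in this post was computed. The toy descriptions, the random_state and the stratify argument are assumptions added so the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Toy stand-in for DF['description'] and DF['emb_label'] (invented data)
descriptions = [
    "oil change", "oil change and filter", "synthetic oil change", "oil and filter change",
    "front brake pads", "rear brake pads", "brake lines replaced", "brake pads and rotors",
    "new battery", "battery and starter", "alternator replaced", "starter replaced",
]
labels = (["0,1,0,0,0,0,0,0,0"] * 4
          + ["0,0,0,0,0,0,0,1,0"] * 4
          + ["0,0,0,0,0,1,0,0,0"] * 4)

# stratify keeps every class in both splits; random_state makes the split repeatable
X_train, X_test, Y_train, Y_test = train_test_split(
    descriptions, labels, test_size=0.25, random_state=0, stratify=labels)

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

clf = OneVsRestClassifier(LogisticRegression()).fit(X_train_tfidf, Y_train)

# Fraction of test descriptions whose predicted label string matches exactly
accuracy = accuracy_score(Y_test, clf.predict(X_test_tfidf))
print(f"accuracy: {accuracy:.2f}")
```

Because each label is a single one-hot string, a prediction only counts as correct when the whole string matches, so a bill with two categories is scored as wrong even if one of them was identified.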

Despite an accuracy of only 68%, the classifier is extremely useful for our purpose: we will suggest not only the classifier's best guess but also a collection of the top candidates. To retrieve the top recommendations, the probabilities of each class must be sorted. This is done with the following lines of code:

import pandas as pd

def top_label_recommendation(description: str, number_of_recommendations=3):
    probability_values = clf.predict_proba(vectorizer.transform([description]))
    # Use a DataFrame to sort the probabilities while keeping the original
    # index, which links each probability to its entry in clf.classes_
    probability_values_DF = pd.DataFrame(probability_values.transpose(), columns=['probability'])
    probability_values_DF.sort_values(by='probability', ascending=False, inplace=True, ignore_index=False)

    label_recommendation = []
    for index in probability_values_DF.index[:number_of_recommendations]:
        label_recommendation += one_hot_to_label_description(clf.classes_[index])
    # Remove duplicate categories (a set does not preserve the ranking order)
    return list(set(label_recommendation))
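The helper one_hot_to_label_description used above is not listed in this post. A minimal sketch of what it might look like follows, assuming that position i of the one-hot string corresponds to index i of the category table in Appendix B; the CATEGORIES list and the function body are assumptions, not the original implementation.

```python
# Category names copied from Appendix B; list position = table index (assumed mapping)
CATEGORIES = [
    "Shocks, Control Arms, Tires, Alignment",
    "Oil Change, Ignition, Fuel System",
    "Manufacturer Service Intervals",
    "Dashboard, Door Locks, Windows",
    "Check Engine Light, Inspections",
    "Alternator, Battery, Starter, Switches",
    "AC System, Blower Motor",
    "ABS Control Module, Brake Lines, Brake Pads",
    "No category",
]


def one_hot_to_label_description(one_hot: str) -> list:
    """Return the category names whose position is set to 1 in the one-hot string."""
    flags = one_hot.split(",")
    return [CATEGORIES[i] for i, flag in enumerate(flags) if flag.strip() == "1"]


# A bill labeled "2,8" is encoded as "0,1,0,0,0,0,0,1,0" by the code in Appendix A
print(one_hot_to_label_description("0,1,0,0,0,0,0,1,0"))
```

Returning a list, even for a single category, matches the way the function is used above, where its result is appended to label_recommendation with +=.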

Results

The code works as expected. Here are a few examples:

#EX: 1
for element in top_label_recommendation('oil change, rear tie rod pass side'):
    print(element)
#OUTPUT
#Oil Change, Ignition, Fuel System
#No category
#Shocks, Control Arms, Tires, Alignment

#EX: 2
for element in top_label_recommendation('oxygen sensor up stream replacement', 4):
    print(element)
#OUTPUT
#ABS Control Module, Brake Lines, Brake Pads
#Oil Change, Ignition, Fuel System
#No category
#Check Engine Light, Inspections

Appendix

(A) Label formatting code:

#WE CREATE A NEW COLUMN WITH A PROPER LABEL TO ENCODE
def one_hot_label(label: str):
    place_holder_list = [0 for _ in range(9)]
    for ind_label in label.split(','):
        if int(ind_label)<=9:
            place_holder_list[int(ind_label)-1] = 1
    return ','.join(list(map(str, place_holder_list)))

DF['emb_label'] = DF['label'].apply(one_hot_label)

(B) Category table

index	category
0	Shocks, Control Arms, Tires, Alignment
1	Oil Change, Ignition, Fuel System
2	Manufacturer Service Intervals
3	Dashboard, Door Locks, Windows
4	Check Engine Light, Inspections
5	Alternator, Battery, Starter, Switches
6	AC System, Blower Motor
7	ABS Control Module, Brake Lines, Brake Pads
8	No category