Data labeling is a critical step in building supervised machine-learning models. The high quality and large number of labels required to train a model make this task an arduous and expensive endeavor; in many cases, the labeling is done by humans. Thanks to advances in large language models such as OpenAI's GPT series, the labeling process can now be done for a fraction of the cost, especially if we classify the data in batches to reduce token usage.
In this blog, I will walk you through the code I used to label a database of bill descriptions. You can find the complete code here:
Data Set Used
The data shown in this blog comes from bills accumulated over the years by a car repair center. The bill descriptions are the subject of study here.
|   | description |
|---|---|
| 0 | 2 EXHAUST PIPE |
| 1 | 2 EXTENSION PIPES |
| 2 | 2 rear struts |
| 3 | MASTER CYLINDER |
| 4 | oil change & filter |
They will be sorted into the following categories:
1. Shocks, Control Arms, Tires, Alignment
2. Oil Change, Ignition, Fuel System
3. Manufacturer Service Intervals
4. Dashboard, Door Locks, Windows
5. Check Engine Light, Inspections
6. Alternator, Battery, Starter, Switches
7. AC System, Blower Motor
8. ABS Control Module, Brake Lines, Brake Pads
9. None
Data Cleaning
Before feeding the data into the OpenAI API, it needs to be cleaned within some parameters that will make the final cleaning process much easier. To accomplish this, we will use pandas DataFrames. Characters such as "|", "-", ",", and double spaces will be removed. Additionally, empty strings and strings that contain only numbers will be set to null (np.nan). All of this, plus some extra cleaning, is done with the cleaning function shown in Appendix A.
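As a rough, self-contained sketch of what that normalization looks like (a simplified stand-in for the full function in Appendix A, using a made-up sample DataFrame):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real data has thousands of bill descriptions
df = pd.DataFrame({'description': ['2 EXHAUST | PIPE', 'oil change,  filter', '1234', '   ']})

col = df['description'].str.lower()
# strip noise characters (regex=False so '|' is treated literally)
for ch in ['|', '-', ',']:
    col = col.str.replace(ch, ' ', regex=False)
# collapse repeated whitespace and trim
col = col.str.replace(r'\s+', ' ', regex=True).str.strip()
# empty strings -> NaN
col = col.replace('', np.nan)
# numeric-only strings -> NaN
col = col.mask(pd.to_numeric(col, errors='coerce').notnull())
df['description'] = col
```

After this pass, "2 EXHAUST | PIPE" becomes "2 exhaust pipe", while the purely numeric and blank rows become NaN and can be dropped.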
Data Labeling
To label the data, the cleaned DataFrame is passed through OpenAI's API in batches of 100 rows to reduce the number of tokens used. The API call and the prompt are set up in the following function, which is applied to our DataFrame.
import openai

def gpt_category_bot(bill_description: str):
    prompt = f"""You are a car expert. Label the next data:
{bill_description}
In one or multiple of the next categories:
1. Shocks, Control Arms, Tires, Alignment
2. Oil Change, Ignition, Fuel System
3. Manufacturer Service Intervals
4. Dashboard, Door Locks, Windows
5. Check Engine Light, Inspections
6. Alternator, Battery, Starter, Switches
7. AC System, Blower Motor
8. ABS Control Module, Brake Lines, Brake Pads
9. None
return a two column table with each data entry and the category numbers (not the name of the category), use "|" to separate the columns
"""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{'role': 'user', 'content': prompt}],
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0)
    print_billing(prompt=prompt, response=response)  # helper that logs token usage (not shown)
    return response.choices[0].message.content
Here "bill_description" is a string containing a batch of 100 rows from the DataFrame. The batches of descriptions are obtained by iterating over the rows of the DataFrame with the code shown in Appendix B.
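The batching idea itself is simple: join up to 100 descriptions into one newline-separated string per API call. A minimal standalone sketch (the loop I actually used, which also handles API errors, is in Appendix B; `make_batches` is a hypothetical helper name):

```python
def make_batches(descriptions, batch_size=100):
    """Join descriptions into newline-separated strings of at most batch_size rows each."""
    batches = []
    for start in range(0, len(descriptions), batch_size):
        chunk = descriptions[start:start + batch_size]
        batches.append('\n'.join(str(d) for d in chunk))
    return batches

# Example with a tiny batch size: five descriptions become three batches
batches = make_batches(['a', 'b', 'c', 'd', 'e'], batch_size=2)
```

Each batch shares one copy of the prompt, which is where the token savings come from.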
Result Cleaning
Even though a table format was requested from ChatGPT, there is still plenty of room for formatting errors. First, the data must be extracted from OpenAI's response, and the labels generated by the API have to be standardized so they can be reassigned to the original DataFrame.
import re

def label_cleaning(label_list: list):
    result = []
    for item in label_list:
        if len(item) == 0:
            continue
        # strip leading/trailing table borders
        if item[0] == '|':
            item = item[1:]
        if item[-1] == '|':
            item = item[:-1]
        # skip rows that don't contain a description/label pair
        if len(item.split('|')) < 2:
            continue
        # join all label columns and normalize spacing around commas
        labels = ','.join(item.split('|')[1:])
        labels = re.sub(r',\s+', ',', labels)
        labels = re.sub(r'\s+,', ',', labels)
        labels = labels.strip()
        if labels and labels[0] == ',':
            labels = labels[1:]
        # collapse whitespace in the description
        description = re.sub(r'\s+', ' ', item.split('|')[0]).strip()
        result.append(description + '|' + labels)
    return result
Finally, the labels stored in the resulting list have to be paired with the descriptions in the DataFrame with the following lines of code:
DF['label'] = np.nan
for row in clean_labels:
    description, label = row.split('|')[0], row.split('|')[1]
    x = DF[DF['description'] == description].index
    # .loc avoids pandas' chained-assignment warning
    DF.loc[x, 'label'] = label
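As an aside, the same reassignment can be done without an explicit loop by building a description-to-label dictionary and using pandas' `map`; a sketch under the assumption that descriptions are unique (the sample rows below are made up):

```python
import pandas as pd

# Hypothetical cleaned output rows in "description|label" form
clean_labels = ['2 rear struts|1', 'oil change & filter|2']

DF = pd.DataFrame({'description': ['2 rear struts', 'master cylinder', 'oil change & filter']})

# Build a lookup from the cleaned rows and map it onto the DataFrame;
# descriptions with no matching row stay NaN
lookup = dict(row.split('|', 1) for row in clean_labels)
DF['label'] = DF['description'].map(lookup)
```

If the same description can appear with different labels, the dictionary keeps only the last one, so the loop version is safer in that case.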
Results and Conclusions
After finishing this process, our data is labeled. The final product looks like this:
|   | description | label |
|---|---|---|
| 0 | 2 exhaust pipe | 9 |
| 1 | 2 extension pipes | 9 |
| 2 | 2 rear struts | 1 |
| 3 | master cylinder | 8 |
| 4 | oil change & filter | 2 |
Of course, OpenAI's API isn't perfect, and some loss of information and mistakes are expected. Although 12 of the entries were not labeled, it is still cheaper to label the data in batches than to label each entry individually: the average token usage per description is 15, but if labeled individually, the token usage can easily reach 150 tokens per description.
The classified data can now be used to train machine-learning models and to extract insights.
Appendix
A. Cleaning function:
def remove_cluster_from_a_column(DF: pd.DataFrame, column_name: str):
    DF[column_name] = DF[column_name].str.lower()
    # remove dates and dollar amounts first, before the dashes they contain
    # are stripped (regex {m,n} quantifiers take exactly two bounds, e.g. \d{1,4})
    DF[column_name] = DF[column_name].str.replace(r'\d{1,4}-\d{1,2}-\d{1,4}', ' ', regex=True)
    DF[column_name] = DF[column_name].str.replace(r'\d{1,4}/\d{1,2}/\d{1,4}', ' ', regex=True)
    DF[column_name] = DF[column_name].str.replace(r'\$\s?\d{1,5}', ' ', regex=True)
    # expand common abbreviations
    DF[column_name] = DF[column_name].str.replace('w/', 'with ', regex=False)
    DF[column_name] = DF[column_name].str.replace('r/l', 'right left', regex=False)
    DF[column_name] = DF[column_name].str.replace('r & l', 'right left', regex=False)
    # strip literal punctuation (regex=False so '.', '|', and '*' are not treated as patterns)
    for char in [',', '.', '-', '`', '#', '|', '*']:
        DF[column_name] = DF[column_name].str.replace(char, ' ', regex=False)
    # collapse repeated whitespace
    DF[column_name] = DF[column_name].str.replace(r'\s+', ' ', regex=True)
    DF[column_name] = DF[column_name].str.strip()
    # empty strings -> NaN
    DF[column_name] = DF[column_name].replace('', np.nan).replace(r'^\s*$', np.nan, regex=True)
    # drop the strings with length 2 or less
    DF = DF[~(DF[column_name].str.len() < 3)]
    # drop NaN
    DF = DF.dropna(subset=[column_name])
    # drop only-numerical entries
    DF = DF[~pd.to_numeric(DF[column_name], errors='coerce').notnull()]
    # return the filtered frame: the dropped rows don't propagate through the argument
    return DF
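One detail worth flagging in the patterns above: Python regex quantifiers take at most two bounds, so something like `\d{1,2,3,4}` is not a valid repetition (it should be `\d{1,4}`), and `$` must be escaped to match a literal dollar sign. A quick standalone check of working versions of these patterns (the sample text is made up):

```python
import re

# {m,n} takes exactly two bounds: \d{1,4} matches one to four digits
date_pattern = re.compile(r'\d{1,4}[-/]\d{1,2}[-/]\d{1,4}')
# '$' is an anchor unless escaped
price_pattern = re.compile(r'\$\s?\d{1,5}')

text = 'front brakes 2021-05-03 $250 pads'
cleaned = price_pattern.sub(' ', date_pattern.sub(' ', text))
```

Both the date and the price are blanked out, leaving only the description words (plus extra spaces for a later whitespace-collapsing pass).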
B. Iterative loop:
number_of_iterations = DF.shape[0]
max_number_of_descriptions = 100  # batch size
raw_labels = []
index = 0
# assumes DF has a clean 0..n-1 index (e.g. DF.reset_index(drop=True) after cleaning)
while index < number_of_iterations:
    descriptions = DF['description'][index] + "\n"
    # collect up to one batch of rows to categorize
    for i in range(1, max_number_of_descriptions, 1):
        if index + i > number_of_iterations - 1:
            i -= 1
            break
        descriptions += str(DF['description'][index + i]) + "\n"
    # to avoid crashes, multiple exceptions were considered
    try:
        response = gpt_category_bot(bill_description=descriptions)
    except openai.error.Timeout:
        print('TIMEOUT ERROR')
        time.sleep(10)
        continue
    except openai.error.APIError:
        print('API ERROR')
        time.sleep(10)
        continue
    except openai.error.RateLimitError:
        print('RATE LIMIT EXCEEDED')
        time.sleep(60)
        continue
    # store each line of the response for later cleaning
    for ind_response in response.split('\n'):
        raw_labels.append(ind_response)
    index = index + i + 1
    print(index)
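The sleep-and-continue pattern above can also be factored into a small generic retry helper; a sketch, with the retryable exception types, attempt count, and delay all being assumptions you would tune for your own use:

```python
import time

def call_with_retries(fn, retryable=(Exception,), max_attempts=3, delay=0.01):
    """Call fn(), retrying on the given exception types with a fixed delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: let the error propagate
            time.sleep(delay)

# Example: a flaky function that fails twice, then succeeds
attempts = {'n': 0}
def flaky():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise TimeoutError('transient')
    return 'ok'

result = call_with_retries(flaky, retryable=(TimeoutError,))
```

In the loop above, the equivalent would be wrapping the `gpt_category_bot` call, with a longer delay for rate-limit errors.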