Data labeling is a critical step in building supervised machine-learning models. The high quality and large number of labels required to train a model make this task an arduous and expensive endeavor; in many cases, the labeling is done by humans. Thanks to advances in large language models such as OpenAI's GPT series, the labeling process can now be done for a fraction of the cost, especially if we classify the data in batches to reduce token usage.
In this blog, I will walk you through the code I used to label a database of bill descriptions. You can find the complete code here:
Data Set Used
The data shown in this blog comes from bills accumulated over the years by a car repair center. The bill descriptions are the subject of study here.
|   | description |
|---|---|
| 0 | 2 EXHAUST PIPE |
| 1 | 2 EXTENSION PIPES |
| 2 | 2 rear struts |
| 3 | MASTER CYLINDER |
| 4 | oil change & filter |
They will be sorted into the following categories:
1. Shocks, Control Arms, Tires, Alignment
2. Oil Change, Ignition, Fuel System
3. Manufacturer Service Intervals
4. Dashboard, Door Locks, Windows
5. Check Engine Light, Inspections
6. Alternator, Battery, Starter, Switches
7. AC System, Blower Motor
8. ABS Control Module, Brake Lines, Brake Pads
9. None
Data Cleaning
Before feeding the data into the OpenAI API, it needs to be cleaned within some parameters that will make the final cleaning process much easier. To accomplish this, we will use pandas DataFrames. Characters such as "|", "-", ",", and double spaces will be removed. Additionally, empty strings and strings that contain only numbers will be set to null (np.nan). All of this, plus some extra cleaning, is done with the cleaning function shown in Appendix A.
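As a rough, self-contained sketch of what that normalization looks like (a simplified stand-in for the full function in Appendix A, using a made-up sample DataFrame):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real data has thousands of bill descriptions
df = pd.DataFrame({'description': ['2 EXHAUST | PIPE', 'oil change,  filter', '1234', '   ']})

col = df['description'].str.lower()
# strip noise characters (regex=False so '|' is treated literally)
for ch in ['|', '-', ',']:
    col = col.str.replace(ch, ' ', regex=False)
# collapse repeated whitespace and trim
col = col.str.replace(r'\s+', ' ', regex=True).str.strip()
# empty strings -> NaN
col = col.replace('', np.nan)
# numeric-only strings -> NaN
col = col.mask(pd.to_numeric(col, errors='coerce').notnull())
df['description'] = col
```

After this pass, "2 EXHAUST | PIPE" becomes "2 exhaust pipe", while the purely numeric and blank rows become NaN and can be dropped.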
Data Labeling
To label the data, the cleaned DataFrame is passed through OpenAI's API in batches of 100 rows to reduce the number of tokens used. The API call and the prompt are set up in the following function, which is applied to our DataFrame.
import openai

def gpt_category_bot(bill_description: str):
    prompt = f"""You are a car expert. Label the next data:
{bill_description}
In one or multiple of the next categories:
1. Shocks, Control Arms, Tires, Alignment
2. Oil Change, Ignition, Fuel System
3. Manufacturer Service Intervals
4. Dashboard, Door Locks, Windows
5. Check Engine Light, Inspections
6. Alternator, Battery, Starter, Switches
7. AC System, Blower Motor
8. ABS Control Module, Brake Lines, Brake Pads
9. None
return a two column table with each data entry and the category numbers (not the name of the category), use "|" to separate the columns
"""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{'role': 'user', 'content': prompt}],
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0)
    print_billing(prompt=prompt, response=response)  # helper that logs token usage (not shown)
    return response.choices[0].message.content
Here "bill_description" is a string containing a batch of 100 rows from the DataFrame. The batches of descriptions are obtained by iterating over the rows of the DataFrame with the code shown in Appendix B.
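The batching idea itself is simple: join up to 100 descriptions into one newline-separated string per API call. A minimal standalone sketch (the loop I actually used, which also handles API errors, is in Appendix B; `make_batches` is a hypothetical helper name):

```python
def make_batches(descriptions, batch_size=100):
    """Join descriptions into newline-separated strings of at most batch_size rows each."""
    batches = []
    for start in range(0, len(descriptions), batch_size):
        chunk = descriptions[start:start + batch_size]
        batches.append('\n'.join(str(d) for d in chunk))
    return batches

# Example with a tiny batch size: five descriptions become three batches
batches = make_batches(['a', 'b', 'c', 'd', 'e'], batch_size=2)
```

Each batch shares one copy of the prompt, which is where the token savings come from.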
Result Cleaning
Even though a table format was requested from ChatGPT, there is still plenty of room for formatting errors. First, the data must be extracted from OpenAI's response, and the labels generated by the API have to be standardized so they can be reassigned to the original DataFrame.
import re

def label_cleaning(label_list: list):
    result = []
    for item in label_list:
        if len(item) == 0:
            continue
        # strip leading/trailing table borders
        if item[0] == '|':
            item = item[1:]
        if item[-1] == '|':
            item = item[:-1]
        # skip rows that don't contain a description/label pair
        if len(item.split('|')) < 2:
            continue
        # join all label columns and normalize spacing around commas
        labels = ','.join(item.split('|')[1:])
        labels = re.sub(r',\s+', ',', labels)
        labels = re.sub(r'\s+,', ',', labels)
        labels = labels.strip()
        if labels and labels[0] == ',':
            labels = labels[1:]
        # collapse whitespace in the description
        description = re.sub(r'\s+', ' ', item.split('|')[0]).strip()
        result.append(description + '|' + labels)
    return result
Finally, the labels stored in the resulting list have to be paired with the descriptions in the DataFrame with the following lines of code:
DF['label'] = np.nan
for row in clean_labels:
    description, label = row.split('|')[0], row.split('|')[1]
    x = DF[DF['description'] == description].index
    # .loc avoids pandas' chained-assignment warning
    DF.loc[x, 'label'] = label
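As an aside, the same reassignment can be done without an explicit loop by building a description-to-label dictionary and using pandas' `map`; a sketch under the assumption that descriptions are unique (the sample rows below are made up):

```python
import pandas as pd

# Hypothetical cleaned output rows in "description|label" form
clean_labels = ['2 rear struts|1', 'oil change & filter|2']

DF = pd.DataFrame({'description': ['2 rear struts', 'master cylinder', 'oil change & filter']})

# Build a lookup from the cleaned rows and map it onto the DataFrame;
# descriptions with no matching row stay NaN
lookup = dict(row.split('|', 1) for row in clean_labels)
DF['label'] = DF['description'].map(lookup)
```

If the same description can appear with different labels, the dictionary keeps only the last one, so the loop version is safer in that case.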
Results and Conclusions
After finishing this process, our data is labeled. The final product looks like this:
|   | description | label |
|---|---|---|
| 0 | 2 exhaust pipe | 9 |
| 1 | 2 extension pipes | 9 |
| 2 | 2 rear struts | 1 |
| 3 | master cylinder | 8 |
| 4 | oil change & filter | 2 |
Of course, OpenAI's API isn't perfect, and some loss of information and mistakes are expected. Although 12 of the entries were not labeled, it is still cheaper to label the data in batches than to label each entry individually: the average token usage per description is 15, but if labeled individually, the token usage can easily reach 150 tokens per description.
The classified data can now be used to train machine-learning models and to extract insights.
Appendix
A. Cleaning function:
def remove_cluster_from_a_column(DF: pd.DataFrame, column_name: str):
    DF[column_name] = DF[column_name].str.lower()
    # remove dates and dollar amounts first, before the dashes they contain
    # are stripped (regex {m,n} quantifiers take exactly two bounds, e.g. \d{1,4})
    DF[column_name] = DF[column_name].str.replace(r'\d{1,4}-\d{1,2}-\d{1,4}', ' ', regex=True)
    DF[column_name] = DF[column_name].str.replace(r'\d{1,4}/\d{1,2}/\d{1,4}', ' ', regex=True)
    DF[column_name] = DF[column_name].str.replace(r'\$\s?\d{1,5}', ' ', regex=True)
    # expand common abbreviations
    DF[column_name] = DF[column_name].str.replace('w/', 'with ', regex=False)
    DF[column_name] = DF[column_name].str.replace('r/l', 'right left', regex=False)
    DF[column_name] = DF[column_name].str.replace('r & l', 'right left', regex=False)
    # strip literal punctuation (regex=False so '.', '|', and '*' are not treated as patterns)
    for char in [',', '.', '-', '`', '#', '|', '*']:
        DF[column_name] = DF[column_name].str.replace(char, ' ', regex=False)
    # collapse repeated whitespace
    DF[column_name] = DF[column_name].str.replace(r'\s+', ' ', regex=True)
    DF[column_name] = DF[column_name].str.strip()
    # empty strings -> NaN
    DF[column_name] = DF[column_name].replace('', np.nan).replace(r'^\s*$', np.nan, regex=True)
    # drop the strings with length 2 or less
    DF = DF[~(DF[column_name].str.len() < 3)]
    # drop NaN
    DF = DF.dropna(subset=[column_name])
    # drop only-numerical entries
    DF = DF[~pd.to_numeric(DF[column_name], errors='coerce').notnull()]
    # return the filtered frame: the dropped rows don't propagate through the argument
    return DF
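One detail worth flagging in the patterns above: Python regex quantifiers take at most two bounds, so something like `\d{1,2,3,4}` is not a valid repetition (it should be `\d{1,4}`), and `$` must be escaped to match a literal dollar sign. A quick standalone check of working versions of these patterns (the sample text is made up):

```python
import re

# {m,n} takes exactly two bounds: \d{1,4} matches one to four digits
date_pattern = re.compile(r'\d{1,4}[-/]\d{1,2}[-/]\d{1,4}')
# '$' is an anchor unless escaped
price_pattern = re.compile(r'\$\s?\d{1,5}')

text = 'front brakes 2021-05-03 $250 pads'
cleaned = price_pattern.sub(' ', date_pattern.sub(' ', text))
```

Both the date and the price are blanked out, leaving only the description words (plus extra spaces for a later whitespace-collapsing pass).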
B. Iterative loop:
number_of_iterations = DF.shape[0]
max_number_of_descriptions = 100  # batch size
raw_labels = []
index = 0
# assumes DF has a clean 0..n-1 index (e.g. DF.reset_index(drop=True) after cleaning)
while index < number_of_iterations:
    descriptions = DF['description'][index] + "\n"
    # collect up to one batch of rows to categorize
    for i in range(1, max_number_of_descriptions, 1):
        if index + i > number_of_iterations - 1:
            i -= 1
            break
        descriptions += str(DF['description'][index + i]) + "\n"
    # to avoid crashes, multiple exceptions were considered
    try:
        response = gpt_category_bot(bill_description=descriptions)
    except openai.error.Timeout:
        print('TIMEOUT ERROR')
        time.sleep(10)
        continue
    except openai.error.APIError:
        print('API ERROR')
        time.sleep(10)
        continue
    except openai.error.RateLimitError:
        print('RATE LIMIT EXCEEDED')
        time.sleep(60)
        continue
    # store each line of the response for later cleaning
    for ind_response in response.split('\n'):
        raw_labels.append(ind_response)
    index = index + i + 1
    print(index)
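The sleep-and-continue pattern above can also be factored into a small generic retry helper; a sketch, with the retryable exception types, attempt count, and delay all being assumptions you would tune for your own use:

```python
import time

def call_with_retries(fn, retryable=(Exception,), max_attempts=3, delay=0.01):
    """Call fn(), retrying on the given exception types with a fixed delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # out of attempts: let the error propagate
            time.sleep(delay)

# Example: a flaky function that fails twice, then succeeds
attempts = {'n': 0}
def flaky():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise TimeoutError('transient')
    return 'ok'

result = call_with_retries(flaky, retryable=(TimeoutError,))
```

In the loop above, the equivalent would be wrapping the `gpt_category_bot` call, with a longer delay for rate-limit errors.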