United States Department of Labor

Injuries, Illnesses, and Fatalities

Automated Coding of Injury and Illness Data

The Survey of Occupational Injuries and Illnesses (SOII) collects data from sampled establishments on OSHA forms 300 and 301. We use the information provided on these forms to generate detailed statistics on the characteristics of cases involving injury or illness.

In order to generate these statistics, survey staff must convert the text entries in the OSHA forms to standard codes used by BLS, as indicated in the table below:

OSHA field | SOII code | Coding taxonomy used
Job title | Occupation | Standard Occupational Classification
What was the employee doing just before the incident occurred? | Event or exposure | Occupational Injury and Illness Classification System
What happened? | Nature of injury or illness and Event or exposure | Occupational Injury and Illness Classification System
What was the injury or illness? | Nature of injury or illness and Part of body | Occupational Injury and Illness Classification System
What object or substance directly harmed the employee? | Source of injury or illness and Secondary source of injury or illness | Occupational Injury and Illness Classification System
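The mapping in the table above can be written down as a small lookup structure. The following Python sketch is only an illustration: the names `SOII_CODES_BY_OSHA_FIELD` and `fields_informing` are hypothetical and do not come from any BLS system.

```python
# Hypothetical lookup table transcribed from the table above:
# each OSHA form field maps to the SOII code(s) it helps determine.
SOII_CODES_BY_OSHA_FIELD = {
    "Job title": ["Occupation"],
    "What was the employee doing just before the incident occurred?": [
        "Event or exposure",
    ],
    "What happened?": [
        "Nature of injury or illness",
        "Event or exposure",
    ],
    "What was the injury or illness?": [
        "Nature of injury or illness",
        "Part of body",
    ],
    "What object or substance directly harmed the employee?": [
        "Source of injury or illness",
        "Secondary source of injury or illness",
    ],
}


def fields_informing(code):
    """Return the OSHA fields that feed a given SOII code, in table order."""
    return [
        field
        for field, codes in SOII_CODES_BY_OSHA_FIELD.items()
        if code in codes
    ]


# Collect the distinct SOII codes that appear anywhere in the table.
ALL_SOII_CODES = sorted(
    {code for codes in SOII_CODES_BY_OSHA_FIELD.values() for code in codes}
)
```

For example, `fields_informing("Event or exposure")` returns the two questions about what the employee was doing and what happened.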

The set of all fields, taken together, is considered the case "narrative." Prior to survey year 2014, BLS relied exclusively on humans to code cases. In 2014, BLS began using machine learning to code a subset of cases.

To use machine learning, we first select a learning algorithm and then train it on large quantities of previously coded SOII narratives. During this process, the algorithm calculates how strongly various features, such as words, pairs of words, and other items, are associated with the codes that can be assigned. After training, we use the algorithm to estimate the best codes for each uncoded narrative and assign those codes if the model’s confidence exceeds a predetermined threshold.

From 2014 through 2017, BLS used regularized multinomial logistic regression. In 2018, BLS switched to deep neural networks with character-level convolutional embeddings and Long Short-Term Memory (LSTM) recurrent layers (the source code is publicly available). In 2019, BLS began autocoding secondary source for the first time.
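The feature extraction and confidence-threshold steps described above can be illustrated with a minimal Python sketch. It is not the BLS implementation. It assumes a linear model whose per-code weights have already been learned (training is omitted), scores each code with a softmax over word and word-pair counts, and assigns a code only when the top probability exceeds the threshold. The function names, the weights, the labels `FALL` and `CUT`, and the threshold value are all made up for illustration; the labels are not real classification codes.

```python
import math
from collections import Counter


def extract_features(narrative):
    """Count single words and adjacent word pairs in a narrative."""
    tokens = narrative.lower().split()
    features = Counter(tokens)
    features.update(f"{a} {b}" for a, b in zip(tokens, tokens[1:]))
    return features


def softmax(scores):
    """Convert raw per-code scores into probabilities that sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {code: math.exp(s - top) for code, s in scores.items()}
    total = sum(exps.values())
    return {code: e / total for code, e in exps.items()}


def autocode(narrative, weights, threshold):
    """Return (code, confidence); code is None when confidence is too low."""
    feats = extract_features(narrative)
    scores = {
        code: sum(w.get(f, 0.0) * n for f, n in feats.items())
        for code, w in weights.items()
    }
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    confidence = probs[best]
    # Assign the code only if confidence exceeds the threshold;
    # otherwise leave the case for a human coder.
    return (best if confidence > threshold else None), confidence


# Made-up weights standing in for a trained model (illustrative only).
EXAMPLE_WEIGHTS = {
    "FALL": {"fell": 2.0, "ladder": 1.5, "fell from": 1.0},
    "CUT": {"cut": 2.0, "knife": 1.5},
}
```

With these toy weights, `autocode("Fell from ladder", EXAMPLE_WEIGHTS, 0.9)` assigns `FALL`, while a narrative with no known features, such as `"hurt hand"`, splits the probability evenly and is left uncoded.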

BLS's use of autocoding has generally expanded over time. In 2014, only 26 percent of occupation codes were assigned by machine learning. By 2019, automatic coding had been expanded to include all six coding tasks (occupation, nature, part, source, secondary source, and event), with the model assigning approximately 85 percent of all codes. Autocoding dropped in 2020 because of COVID-19: the model learns to code from previously coded data, and 2020 was the first year COVID-19 cases were collected, so all 2020 cases mentioning ‘covid’ or ‘corona’ were manually coded.
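The 2020 rule for COVID-19 cases amounts to a keyword filter applied before autocoding. The sketch below shows one way such a filter could look. It assumes case-insensitive substring matching, which the text does not specify, and the function names are hypothetical.

```python
# Keywords named in the text; substring matching is an assumption.
MANUAL_KEYWORDS = ("covid", "corona")


def needs_manual_coding(narrative, keywords=MANUAL_KEYWORDS):
    """Return True if the narrative mentions any of the keywords."""
    text = narrative.lower()
    return any(keyword in text for keyword in keywords)


def split_cases(narratives):
    """Separate narratives into (manual, autocode_candidates) lists."""
    manual, candidates = [], []
    for narrative in narratives:
        if needs_manual_coding(narrative):
            manual.append(narrative)
        else:
            candidates.append(narrative)
    return manual, candidates
```

Under this assumption, a narrative such as "Employee tested positive for COVID-19" would be routed to human coders, while "Slipped on wet floor" would remain a candidate for autocoding.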

For additional technical information on our techniques, please contact


Last Modified Date: November 23, 2021