The data annotation process – especially efficient data labeling at scale for machine learning projects – presents several complexities. Since data is the raw material on which machine learning projects are built, ensuring its quality is essential. If labels lack precision and quality, an entire, highly complex artificial intelligence project can suffer, as its predictive models may be invalidated. Have you ever wondered under what conditions this data is produced?
It is crucial that the people carrying out data annotation tasks know the context of their work: what the labeling is for and what the objectives are of the project in which they are an essential link. It is also important that they are aware of the impact of their work on the final quality of the dataset, and therefore that their tasks are recognized and valued.
Data preparation, loading, and cleaning are known to demand up to 45% of the time dedicated to working with data. Applying complex ontologies, attributes, and various types of annotations to train and deploy machine learning models adds further difficulty. Therefore, training the people responsible for data annotation, guaranteeing good working conditions, and ensuring their well-being are central to improving the chances that the labeling meets the expected quality.
The big challenge: data processing and labeling
Today, companies have an abundance of data, arguably an excess of it. The big challenge is how to process and label it so that it is usable. Precisely labeled data helps machine learning systems establish reliable models for pattern recognition, which forms the foundation of every AI project.
As data labeling requires managing a large volume of work, companies often need to find an external team. In such circumstances, ensuring smooth communication and collaboration between taggers and data scientists is vital to maintain quality control, validate data, and resolve issues and concerns that may arise.
In addition to language and geographic issues, other aspects can affect the interpretation of the data and, therefore, its correct annotation and labeling. An annotator's experience in the specific domain and their cultural associations will imprint a bias that can only be controlled if there is awareness of it during the process. When there is no single “correct” answer for subjective data, the data operations team can establish clear instructions to guide how the annotators understand each data point, as sketched below.
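As a minimal illustration of handling subjective data, a data operations team might collect labels from several annotators per item, keep the majority label, and flag low-agreement items for adjudication or guideline revision. The sketch below is only an example under assumptions: the sentiment labels, the three-annotator setup, and the 0.7 agreement threshold are hypothetical, not taken from any particular project.

```python
from collections import Counter

def aggregate_labels(labels_per_item, min_agreement=0.7):
    """Majority-vote aggregation that flags low-agreement items for review.

    labels_per_item: list of lists, one inner list of annotator labels per data point.
    min_agreement: fraction of annotators that must agree for the majority
                   label to be accepted without further adjudication.
    """
    results = []
    for labels in labels_per_item:
        counts = Counter(labels)
        majority_label, votes = counts.most_common(1)[0]
        agreement = votes / len(labels)
        results.append({
            "label": majority_label,
            "agreement": agreement,
            "needs_review": agreement < min_agreement,  # send back for adjudication
        })
    return results

# Hypothetical sentiment labels from three annotators on four data points
items = [
    ["positive", "positive", "positive"],
    ["positive", "neutral", "positive"],
    ["negative", "neutral", "positive"],   # full disagreement: are the guidelines clear?
    ["neutral", "neutral", "negative"],
]
for item, result in zip(items, aggregate_labels(items)):
    print(item, "->", result)
```

Items that come back with needs_review set are exactly the data points where the instructions did not lead annotators to a shared understanding, which is where clearer guidance pays off.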
Some studies focus on the problem of individual annotator bias in data annotation. However, there are also new lines of research. From this perspective, because annotators follow the exact instructions provided by clients, the interpretation of the data that they perform “is profoundly limited by the interests, values, and priorities of the more powerful (financial) stakeholders.” That is, interpretations and labels “are imposed vertically on the annotators and, through them, on the data.” It would therefore be wrong to naturalize the hierarchical power that operates behind annotation processes.
Even where the data is presumably more “objective”, challenges remain, especially if labeling analysts do not know the context of their work and lack good instructions and feedback processes.
Without neglecting the other factors involved in data annotation, what is observed in practice is that training is a key aspect of this process, since it helps annotators adequately understand the project and produce annotations that are valid (accurate) and reliable (consistent).
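One common way to quantify the “reliable (consistent)” part is inter-annotator agreement. The sketch below is purely illustrative and uses assumed data: it compares two hypothetical annotators on the same items with scikit-learn's cohen_kappa_score, where values near 1.0 indicate consistent labeling and low values suggest that the guidelines or training need revisiting.

```python
# Illustrative only: measuring consistency between two hypothetical annotators
# with Cohen's kappa (chance-corrected agreement), via scikit-learn.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["cat", "dog", "dog", "cat", "bird", "cat", "dog", "bird"]
annotator_b = ["cat", "dog", "cat", "cat", "bird", "cat", "dog", "dog"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # near 1.0 = consistent; near 0 = chance-level agreement
```

Tracking a metric like this before and after annotator training makes the effect of that training visible, rather than leaving the quality of the dataset to assumption.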