Hand labeling considered harmful

Labeling training data is the one step in the data pipeline that has resisted automation. It’s time to change that. By Shayan Mohanty and Hugo Bowne-Anderson @oreilly.com.

There are serious challenges with software and models: the data they’re trained on, how they’re developed, how they’re deployed, and their impact on stakeholders. These challenges commonly result in algorithmic bias and in models that lack interpretability and explainability.

The article takes a deep dive into:

  • Hand labels and algorithmic bias
  • Uninterpretable, unexplainable
  • On auditing
  • The prohibitive costs of hand labeling
  • The efficacy of automation techniques

There are no “gold labels”: even the best-known hand-labeled datasets have label error rates of at least 5%. According to various papers, introducing expensive hand labels sparingly into largely programmatically generated datasets gives you the best trade-off between labeling effort and model accuracy on SOTA (state-of-the-art) architectures, something hand labeling alone can’t match. Very interesting read!
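
That combination is usually called programmatic labeling or weak supervision. Below is a minimal, hypothetical sketch of the idea in Python: a few heuristic labeling functions generate the bulk of the labels, a majority vote combines them, and a small amount of hand labeling is spent only where the heuristics abstain. The task, function names, and data are illustrative assumptions, not something the article prescribes.

```python
# Minimal sketch of programmatic labeling (weak supervision) on a toy
# spam-detection task. Everything here is illustrative; the article does
# not prescribe a specific library or API.

ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_offer(text):
    # Heuristic: promotional wording suggests spam.
    return SPAM if "free offer" in text.lower() else ABSTAIN

def lf_has_greeting(text):
    # Heuristic: a personal greeting suggests a legitimate message.
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

def lf_many_exclamations(text):
    # Heuristic: excessive punctuation suggests spam.
    return SPAM if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_offer, lf_has_greeting, lf_many_exclamations]

def majority_vote(text):
    """Combine labeling-function votes; abstain when no function fires."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) != ABSTAIN]
    return max(set(votes), key=votes.count) if votes else ABSTAIN

unlabeled = [
    "Hello Maria, are we still meeting tomorrow?",
    "FREE OFFER!!! Click now!!!",
    "Reminder: your invoice is attached",
]

# Cheap, programmatic labels for the bulk of the data...
programmatic = [(x, majority_vote(x)) for x in unlabeled]

# ...plus a small, expensive set of hand labels, spent only where the
# labeling functions abstain (the "sparing" use of hand labels the
# article describes). In practice a human annotator supplies these.
hand_labeled = [("Reminder: your invoice is attached", HAM)]

training_set = [(x, y) for x, y in programmatic if y != ABSTAIN] + hand_labeled
print(training_set)
```

On a real project the heuristics would be noisier and the majority vote would typically be replaced by a learned label model, but the effort split is the point: many cheap programmatic labels, few expensive hand labels.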

[Read More]

Tags: cio, big-data, data-science, analytics