June 21-24, 2022
Austin, Texas, USA + Virtual


This schedule is displayed in Central Daylight Time (UTC-5).

IMPORTANT NOTE: Timing of sessions and room locations are subject to change.

Labeling Tools are Great, but What About Quality Checks? - Marcus Edel & Jakub Piotr Cłapa, Collabora Ltd.

Datasets are the backbone of machine learning (ML), but some are more critical than others. A core set of them is used by researchers to evaluate ML models and so to track how ML capabilities advance over time. One of the best known is ImageNet, which kicked off the modern ML revolution; another is Lyft's dataset for training self-driving cars. Over the years, studies have found that these datasets can contain serious flaws. ImageNet, for example, has labels that are flat-out wrong: a mushroom labeled a spoon, a lion labeled a monkey. In the Lyft dataset, several cars are not annotated at all. All these datasets have one thing in common: they were built with a highly error-prone annotation pipeline and little or no quality checking.

We worked on an open-source tool that combines novel unsupervised machine-learning pipelines to help annotators and machine-learning engineers identify and filter out potential label errors. In this talk, we will share our findings on how label errors affect the training process, discuss possible implications, and dive into how we leveraged unsupervised learning to filter out annotation errors, with real-world examples throughout.
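The abstract does not spell out the pipeline, but the general idea of using unsupervised structure in the data to surface suspicious labels can be sketched. The Python sketch below is a hypothetical illustration, not Collabora's tool: it assumes precomputed embeddings from a pretrained encoder and flags samples whose label disagrees with most of their nearest neighbors in embedding space. The function name and the k and agreement_threshold parameters are illustrative assumptions.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_suspect_labels(embeddings, labels, k=10, agreement_threshold=0.2):
    """Flag samples whose label disagrees with most of their k nearest
    neighbors in embedding space, a simple unsupervised proxy for
    potential annotation errors. Hypothetical sketch, not the exact
    pipeline presented in the talk."""
    labels = np.asarray(labels)
    # Fit k-NN on the embeddings; k + 1 because each point is returned
    # as its own nearest neighbor and is dropped below.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, neighbor_idx = nn.kneighbors(embeddings)
    suspects = []
    for i, neighbors in enumerate(neighbor_idx):
        neighbors = neighbors[1:]  # drop the sample itself
        # Fraction of neighbors that share this sample's label.
        agreement = np.mean(labels[neighbors] == labels[i])
        if agreement < agreement_threshold:
            suspects.append(i)
    return np.asarray(suspects, dtype=int)

In a setup like this, the flagged indices would typically be routed back to annotators for review rather than dropped automatically, since a low-agreement label can also indicate a genuinely ambiguous or rare example.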

Speakers

Jakub Piotr Cłapa

Machine Learning Engineer, Collabora Ltd.
Jakub Piotr Cłapa is a generalist engineer and researcher currently working on embedded deep learning at Collabora. He is very interested in unsupervised learning of the environment as a way to enable autonomous robots. Over the previous 10+ years he shipped full-stack...

Marcus Edel

Machine Learning Engineer, Collabora Ltd.
Marcus Edel is a software engineer in the machine-learning team at Collabora Ltd., where he leads the effort to optimise and apply deep networks for inference, with a focus on embedded devices. Marcus completed his graduate studies in 2020 with a focus on fast algorithms for core...



Thursday June 23, 2022 2:55pm - 3:35pm CDT
Room 303/304 (Level 3)