Welcome to Labelbox Catalog, your command center for understanding, curating, and preparing unstructured data for your machine learning workflows. Before you can build a high-performing model, you need to deeply understand the data you’re working with. Catalog is designed to move you from raw data to curated, label-ready datasets with confidence and speed. Think of Catalog as an interactive, searchable index of all your training data. It provides the tools to not just see your data, but to interact with it, ask questions of it, and organize it in powerful ways.

What can you do with Catalog?

  • Visualize and explore your data to uncover insights: Instead of guessing what’s in your dataset, you can directly visualize it. Spot imbalances, identify outliers, find rare edge cases, and understand the distribution of your data before it ever touches a model. Use the gallery view for a visual survey, the list view for metadata analysis, and the analytics view to see statistical breakdowns.
  • Find specific data with powerful search and filtering: Move beyond simple filename searches. Catalog allows you to build complex queries to find the exact data you need. You can filter by a rich set of attributes, including metadata, annotation class, dataset, and project, and you can even search the content of the data itself using AI-powered search methods.
  • Curate and organize datasets for any workflow: Your raw data is just the beginning. Catalog helps you organize it for specific tasks. You can create static batches of data to send to a labeling project or define dynamic slices that automatically track specific subsets of your data over time, like “all images flagged for review.”
  • Take targeted action on your data: Finding data is only half the battle. Catalog is fully integrated with the rest of the Labelbox platform, allowing you to take immediate action on your findings. Select a group of data rows and, with a few clicks, you can add metadata, export them for analysis, or send them directly to a labeling project.
By providing a single, unified interface to explore and manage all your unstructured data, Catalog empowers you to make smarter, data-driven decisions throughout the entire model development lifecycle.
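
The difference between a dynamic slice and a static batch, described in the list above, can be sketched in a few lines of Python. This is a conceptual illustration only: `DataRow`, `Slice`, `Batch`, and the `status` metadata field are hypothetical stand-ins invented for this example, not Labelbox SDK classes or fields.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


# Hypothetical stand-ins for illustration; these are NOT Labelbox SDK classes.
@dataclass
class DataRow:
    row_id: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Slice:
    """A saved query: membership is recomputed every time it is read."""
    name: str
    predicate: Callable[[DataRow], bool]

    def rows(self, catalog: list[DataRow]) -> list[DataRow]:
        return [row for row in catalog if self.predicate(row)]


@dataclass
class Batch:
    """A fixed group of row IDs captured when the batch is created."""
    name: str
    row_ids: tuple[str, ...]


def flagged_for_review(row: DataRow) -> bool:
    # Assumed metadata convention: a "status" field set to "flagged".
    return row.metadata.get("status") == "flagged"


catalog = [
    DataRow("row-1", {"status": "flagged"}),
    DataRow("row-2", {"status": "ok"}),
]

review_slice = Slice("flagged for review", flagged_for_review)

# The batch snapshots whatever the slice matches right now.
review_batch = Batch(
    "review batch 1",
    tuple(row.row_id for row in review_slice.rows(catalog)),
)

# New data arrives that matches the slice's filter.
catalog.append(DataRow("row-3", {"status": "flagged"}))

slice_ids = [row.row_id for row in review_slice.rows(catalog)]
batch_ids = list(review_batch.row_ids)

print(slice_ids)  # ['row-1', 'row-3']  (the slice picked up the new row)
print(batch_ids)  # ['row-1']           (the batch did not change)
```

In this sketch the slice stores only the filter, so it always reflects the current data, while the batch stores a fixed list of IDs and changes only if you edit it.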

Key concepts

To effectively navigate and use Catalog, it’s important to understand its core components. These are the fundamental building blocks you’ll encounter as you explore and manage your data.
Data rows
  • What it is: The most basic unit in Labelbox, representing a single item of your data. A data row is a pointer to a single data asset (e.g., an image, a video, a text file, a medical image) along with all its associated information, including metadata, attachments, and annotations.
  • Why it’s important: Every action in Catalog, from filtering to labeling, is performed on one or more data rows.

Datasets
  • What it is: A dataset is a top-level container that holds a collection of data rows, often grouped by a specific project, data source, or collection period. You organize your data into datasets when you upload it to Labelbox.
  • Why it’s important: Datasets provide the initial organization for your data and serve as the starting point for exploration in Catalog.

Slices
  • What it is: A slice is a dynamic, saved query that represents a subset of your data. Think of it as a “smart folder” or a saved search that always stays up-to-date.
  • Why it’s important: Slices let you continuously monitor specific subsets of your data without re-applying filters. When new data is added to the dataset that matches the slice’s filters, it automatically appears in the slice. This is perfect for tracking data quality issues or monitoring for specific edge cases.

Batches
  • What it is: A batch is a static, fixed group of data rows that you can send to a labeling project.
  • Why it’s important: Batches are the primary mechanism for queuing up work for human labelers. Once created, a batch does not change unless you manually add or remove items, providing a stable workload for your labeling projects.

Embeddings
  • What it is: Embeddings are powerful numerical representations (vectors) of your data that capture its semantic meaning. They are the engine behind Catalog’s AI-powered search features. Labelbox can generate these for you, or you can provide your own.
  • Why it’s important: Embeddings allow you to search for data based on meaning and similarity, not just metadata. They power features like similarity search (“find more images like this one”) and natural language search (“find images of a dog playing in a park”), making it possible to find relevant data in a more intuitive and powerful way.
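
To make the idea of similarity search over embeddings concrete, here is a minimal Python sketch of a "find more images like this one" query. It assumes cosine similarity as the distance measure, which is a common choice but is not confirmed here as the metric Catalog uses. The file names and three-dimensional vectors are made up for illustration (real embeddings typically have hundreds of dimensions), and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, top_k=2):
    """Return the top_k (id, score) pairs closest to the query item."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id  # do not return the query itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up embeddings: the first axis loosely stands for "dog outdoors",
# the last axis for "urban scene".
embeddings = {
    "dog_park.jpg": [0.9, 0.1, 0.0],
    "dog_beach.jpg": [0.8, 0.2, 0.1],
    "city_street.jpg": [0.0, 0.2, 0.9],
    "puppy_lawn.jpg": [0.85, 0.15, 0.05],
}

results = most_similar("dog_park.jpg", embeddings)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

With these example vectors, the two dog images rank above the street scene because their vectors point in nearly the same direction as the query, regardless of any metadata attached to them.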