What can you do with Catalog?
- Visualize and explore your data to uncover insights: Instead of guessing what’s in your dataset, you can directly visualize it. Spot imbalances, identify outliers, find rare edge cases, and understand the distribution of your data before it ever touches a model. Use the gallery view for a visual survey, the list view for metadata analysis, and the analytics view to see statistical breakdowns.
- Find specific data with powerful search and filtering: Move beyond simple filename searches. Catalog allows you to build complex queries to find the exact data you need. You can filter by a rich set of attributes including metadata, annotation-class, dataset, project, and even the content of the data itself using AI-powered search methods.
- Curate and organize datasets for any workflow: Your raw data is just the beginning. Catalog helps you organize it for specific tasks. You can create static batches of data to send to a labeling project or define dynamic slices that automatically track specific subsets of your data over time, like “all images flagged for review.”
- Take targeted action on your data: Finding data is only half the battle. Catalog is fully integrated with the rest of the Labelbox platform, allowing you to take immediate action on your findings. Select a group of data rows and, with a few clicks, you can add metadata, export them for analysis, or send them directly to a labeling project.
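
The difference between a static batch and a dynamic slice, mentioned in the list above, can be sketched in a few lines of plain Python. This is a conceptual illustration only and does not use the Labelbox SDK: the `DataRow`, `Slice`, and `Batch` classes and the `flagged_for_review` metadata key are hypothetical names invented for this example. A slice stores a filter and re-evaluates it on each read, while a batch stores a fixed list of data row IDs taken at creation time.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataRow:
    """Hypothetical stand-in for a data row: an ID plus metadata."""
    id: str
    metadata: Dict[str, object] = field(default_factory=dict)


def flagged_for_review(row: DataRow) -> bool:
    # Hypothetical filter: "all images flagged for review".
    return bool(row.metadata.get("flagged_for_review", False))


class Slice:
    """Dynamic: keeps the filter and re-applies it every time it is read."""

    def __init__(self, predicate: Callable[[DataRow], bool]) -> None:
        self.predicate = predicate

    def row_ids(self, catalog: List[DataRow]) -> List[str]:
        return [row.id for row in catalog if self.predicate(row)]


class Batch:
    """Static: keeps the IDs that were selected when it was created."""

    def __init__(self, rows: List[DataRow]) -> None:
        self.row_ids = [row.id for row in rows]


catalog = [DataRow("a", {"flagged_for_review": True}), DataRow("b")]

review_slice = Slice(flagged_for_review)
review_batch = Batch([row for row in catalog if flagged_for_review(row)])

# New data arrives after the slice and the batch were created.
catalog.append(DataRow("c", {"flagged_for_review": True}))

slice_ids = review_slice.row_ids(catalog)  # picks up the new matching row
batch_ids = review_batch.row_ids           # unchanged since creation
```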
Key concepts
To effectively navigate and use Catalog, it’s important to understand its core components. These are the fundamental building blocks you’ll encounter as you explore and manage your data.

| Term | Definition |
|---|---|
| Data rows | What it is: The most basic unit in Labelbox, representing a single item of your data. A data row is a pointer to a single data asset (e.g., an image, a video, a text file, a medical image) along with all its associated information, including metadata, attachments, and annotations. Why it’s important: Every action in Catalog—from filtering to labeling—is performed on one or more data rows. |
| Datasets | What it is: A dataset is a top-level container that holds a collection of data rows, often grouped by a specific project, data source, or collection period. You organize your data into datasets when you upload it to Labelbox. Why it’s important: Datasets provide the initial organization for your data and serve as the starting point for exploration in Catalog. |
| Slices | What it is: A slice is a dynamic, saved query that represents a subset of your data. Think of it as a “smart folder” or a saved search that always stays up to date. Why it’s important: Slices let you continuously monitor specific subsets of your data without re-applying filters. When new data that matches the slice’s filters is added to the dataset, it automatically appears in the slice. This makes slices well suited to tracking data quality issues or monitoring for specific edge cases. |
| Batches | What it is: A batch is a static, fixed group of data rows that you can send to a labeling project. Why it’s important: Batches are the primary mechanism for queuing up work for human labelers. Once created, a batch does not change unless you manually add or remove items, providing a stable workload for your labeling projects. |
| Embeddings | What it is: Embeddings are powerful numerical representations (vectors) of your data that capture its semantic meaning. They are the engine behind Catalog’s AI-powered search features. Labelbox can generate these for you, or you can provide your own. Why it’s important: Embeddings allow you to search for data based on meaning and similarity, not just metadata. They power features like similarity search (“find more images like this one”) and natural language search (“find images of a dog playing in a park”), making it possible to find relevant data in a more intuitive and powerful way. |
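
To make the idea of similarity search over embeddings more concrete, here is a minimal sketch in plain Python. It assumes cosine similarity as the comparison measure, which is a common choice but is not confirmed here as what Catalog uses internally. The three-dimensional vectors, the row IDs, and the `most_similar` helper are all made up for illustration; real embeddings typically have hundreds of dimensions and come from a model.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(
    query: Sequence[float],
    embeddings: Dict[str, List[float]],
    top_k: int = 2,
) -> List[str]:
    """Return the IDs of the top_k rows whose vectors are closest to the query."""
    scored = [
        (row_id, cosine_similarity(query, vector))
        for row_id, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [row_id for row_id, _ in scored[:top_k]]


# Toy embeddings keyed by hypothetical data row IDs.
embeddings = {
    "dog_park": [0.9, 0.1, 0.0],
    "dog_beach": [0.8, 0.2, 0.1],
    "city_street": [0.0, 0.1, 0.9],
}

# "Find more images like this one": use an existing row's vector as the query.
results = most_similar(embeddings["dog_park"], embeddings, top_k=2)
```

The query row itself ranks first because a vector is always most similar to itself, followed by the row whose vector points in nearly the same direction. Natural language search follows the same pattern, except that the query vector would come from embedding the text prompt rather than from an existing data row.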