Datasets overview
All data that is successfully ingested into Adobe Experience Platform is persisted within the Data Lake as datasets. A dataset is a storage and management construct for a collection of data, typically a table, that contains a schema (columns) and fields (rows). Datasets also contain metadata that describes various aspects of the data they store.
This document provides a high-level overview of datasets in Experience Platform.
Creating datasets and tracking metadata
Catalog Service is the system of record for data location and lineage within Experience Platform, and is used to create and manage datasets. Catalog tracks the metadata for each dataset, which includes a reference to the Experience Data Model (XDM) schema the dataset conforms to (explained in the next section) and the number of records ingested into that dataset.
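The relationship between a dataset and the metadata tracked for it can be sketched with a small in-memory model. This is only an illustration, not the Catalog Service API: the `DatasetRecord` and `Catalog` classes, their method names, and the schema identifier used below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical stand-in for the metadata kept for one dataset."""
    name: str
    schema_ref: str        # identifier of the XDM schema the dataset conforms to
    record_count: int = 0  # number of records ingested so far


class Catalog:
    """Toy in-memory catalog (an assumption for illustration only)."""

    def __init__(self):
        self._datasets = {}

    def create_dataset(self, name, schema_ref):
        # Every dataset is created with a reference to its schema.
        if name in self._datasets:
            raise ValueError(f"dataset {name!r} already exists")
        record = DatasetRecord(name=name, schema_ref=schema_ref)
        self._datasets[name] = record
        return record

    def record_ingestion(self, name, rows):
        # Update the tracked record count after a successful ingestion.
        self._datasets[name].record_count += rows

    def get(self, name):
        return self._datasets[name]


if __name__ == "__main__":
    catalog = Catalog()
    catalog.create_dataset("loyalty", schema_ref="example/loyalty-members")
    catalog.record_ingestion("loyalty", rows=250)
    print(catalog.get("loyalty"))
```

The point of the sketch is that the schema reference and the record count live alongside the dataset as metadata, separate from the stored data itself.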
See the Catalog Service overview for more information.
Enforcing constraints on dataset data
Experience Data Model (XDM) is the standardized framework by which Platform organizes customer experience data. All data that is ingested into Platform must conform to a pre-defined XDM schema before it can be persisted in the Data Lake as a dataset.
All datasets contain a reference to the XDM schema that constrains the format and structure of the data that they can store. Attempting to upload data that does not conform to a dataset's XDM schema causes ingestion to fail.
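The effect of a schema constraint can be sketched as a validation step that runs before anything is stored. The schema format here (a mapping of field name to Python type, with every field required), the `IngestionError` exception, and the function names are simplifying assumptions; real XDM schemas are considerably richer than this.

```python
class IngestionError(Exception):
    """Raised when data does not conform to the dataset's schema."""


# Hypothetical, heavily simplified schema: field name -> required Python type.
LOYALTY_SCHEMA = {"member_id": str, "points": int}


def validate_record(record, schema):
    """Check one record against the schema, raising on any mismatch."""
    missing = schema.keys() - record.keys()
    unexpected = record.keys() - schema.keys()
    if missing or unexpected:
        raise IngestionError(
            f"missing fields: {sorted(missing)}, "
            f"unexpected fields: {sorted(unexpected)}"
        )
    for field_name, expected_type in schema.items():
        if not isinstance(record[field_name], expected_type):
            raise IngestionError(
                f"field {field_name!r} must be of type {expected_type.__name__}"
            )


def ingest(records, schema):
    """Validate every record first; nothing is stored if any record fails."""
    for record in records:
        validate_record(record, schema)
    return list(records)


if __name__ == "__main__":
    stored = ingest([{"member_id": "m1", "points": 10}], LOYALTY_SCHEMA)
    print(len(stored), "record(s) stored")
    try:
        ingest([{"member_id": "m2", "points": "ten"}], LOYALTY_SCHEMA)
    except IngestionError as err:
        print("ingestion failed:", err)
```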
For more information on XDM, see the XDM System overview.
Ingesting data into datasets
Adobe Experience Platform Data Ingestion represents the multiple methods by which Platform ingests data from various sources. Regardless of the method of ingestion, all successfully ingested data is converted to batch files. Batches are units of data that consist of one or more files to be ingested as a single unit. These batch files are then added to dedicated datasets and persisted within the Data Lake.
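The idea of a batch as one or more files handled as a single unit can be sketched as follows. The `Batch` and `Dataset` classes and the file names are hypothetical and exist only to illustrate the grouping; they do not reflect how Platform stores batches internally.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Batch:
    """One or more files that are ingested together as a single unit."""
    batch_id: str
    files: tuple


@dataclass
class Dataset:
    """Toy dataset that only remembers which batches were added to it."""
    name: str
    batches: dict = field(default_factory=dict)

    def add_batch(self, batch):
        # A batch is accepted or rejected as a whole, never file by file.
        if not batch.files:
            raise ValueError("a batch must contain at least one file")
        if batch.batch_id in self.batches:
            raise ValueError(f"batch {batch.batch_id!r} was already added")
        self.batches[batch.batch_id] = batch

    def file_count(self):
        return sum(len(batch.files) for batch in self.batches.values())


if __name__ == "__main__":
    dataset = Dataset("loyalty")
    dataset.add_batch(Batch("batch-001", ("part-0.parquet", "part-1.parquet")))
    print(dataset.file_count(), "file(s) across", len(dataset.batches), "batch(es)")
```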
See the Data Ingestion overview for more information.
Labels applied to datasets from schemas
Adobe Experience Platform Data Governance allows you to manage customer data in order to ensure compliance with regulations, restrictions, and policies applicable to data use. The Data Governance framework allows you to apply usage labels to categorize data according to the usage policies that apply to that data. Labels can be applied to individual schemas, fields within those schemas, and entire individual datasets. When labels are applied directly to a schema, those labels are propagated to all existing and future datasets that are based on that schema.
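One way to picture label propagation is to resolve a dataset's labels at read time as the union of its own labels and the labels of its schema. The `Governance` class, its method names, and the label strings below are illustrative assumptions, not the Data Governance API, and field-level labels are left out for brevity.

```python
class Governance:
    """Toy model of usage labels on schemas and datasets."""

    def __init__(self):
        self.schema_labels = {}   # schema name -> set of labels
        self.dataset_schema = {}  # dataset name -> schema name
        self.dataset_labels = {}  # dataset name -> set of labels

    def create_dataset(self, dataset, schema):
        self.dataset_schema[dataset] = schema

    def label_schema(self, schema, label):
        self.schema_labels.setdefault(schema, set()).add(label)

    def label_dataset(self, dataset, label):
        self.dataset_labels.setdefault(dataset, set()).add(label)

    def effective_labels(self, dataset):
        # Resolving at read time means schema labels reach both existing
        # datasets and datasets created after the label was applied.
        schema = self.dataset_schema[dataset]
        inherited = self.schema_labels.get(schema, set())
        own = self.dataset_labels.get(dataset, set())
        return inherited | own


if __name__ == "__main__":
    gov = Governance()
    gov.create_dataset("existing", schema="loyalty")
    gov.label_schema("loyalty", "C2")
    gov.create_dataset("future", schema="loyalty")
    gov.label_dataset("future", "I1")
    print(sorted(gov.effective_labels("existing")))
    print(sorted(gov.effective_labels("future")))
```

A label applied to a single dataset stays local to that dataset, while a label applied to the schema shows up on every dataset based on it.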
See the Data Governance overview for more information on the service. For steps on how to work with usage labels in Platform, refer to the Data Governance documentation.
Datasets in downstream Platform services
Once datasets contain ingested data, downstream Platform services use them to update customer profiles, gain insights through machine learning, and more.
The following is a list of downstream services that use datasets for various operations. Please review the documentation for each service for more information.
- Data Access API: Allows you to access and download the contents of files stored within datasets.
- Adobe Experience Platform Identity Service: Bridges identities across devices and systems, linking datasets together based on the identity fields defined by the XDM schemas they conform to.
- Real-Time Customer Profile: Leverages Identity Service to create detailed customer profiles from your datasets in real time. Real-Time Customer Profile pulls data from the Data Lake and persists customer profiles in its own separate data store.
- Adobe Experience Platform Segmentation Service: Allows you to build segments and generate audiences from your Real-Time Customer Profile data. These audiences can then be exported to their own datasets within the Data Lake.
- Adobe Experience Platform Data Science Workspace: Uses machine learning and artificial intelligence to uncover insights in large datasets.
- Adobe Experience Platform Query Service: Allows you to use standard SQL to query data in Experience Platform, joining any datasets within the Data Lake and capturing query results as a new dataset for use in reporting, Data Science Workspace, or Real-Time Customer Profile.
- Adobe Experience Platform Destinations Service: Allows you to export datasets to your desired cloud storage or email marketing destinations, for reporting or data science activities.
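The Query Service pattern described in the list above (join datasets with standard SQL, then capture the result as a new dataset) can be sketched locally. The example uses Python's built-in `sqlite3` module as a stand-in, so it does not connect to Query Service, and the table names, columns, and rows are made up for illustration.

```python
import sqlite3


def run_demo():
    """Join two toy tables and capture the result as a new table."""
    conn = sqlite3.connect(":memory:")

    # Two hypothetical source "datasets".
    conn.execute("CREATE TABLE members (member_id TEXT, region TEXT)")
    conn.execute("CREATE TABLE purchases (member_id TEXT, points INTEGER)")
    conn.executemany(
        "INSERT INTO members VALUES (?, ?)",
        [("m1", "EMEA"), ("m2", "APAC")],
    )
    conn.executemany(
        "INSERT INTO purchases VALUES (?, ?)",
        [("m1", 10), ("m1", 5), ("m2", 7)],
    )

    # Join the sources and store the query result as a new table.
    conn.execute(
        """
        CREATE TABLE points_by_region AS
        SELECT m.region AS region, SUM(p.points) AS total_points
        FROM members AS m
        JOIN purchases AS p ON p.member_id = m.member_id
        GROUP BY m.region
        """
    )

    rows = conn.execute(
        "SELECT region, total_points FROM points_by_region ORDER BY region"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for region, total_points in run_demo():
        print(region, total_points)
```

The derived table plays the role of the new dataset: it can be queried on its own afterwards without re-running the join.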
Next steps
By reading this document, you have been introduced to the core uses of datasets in Experience Platform, as well as the various Platform services that utilize datasets. For more details on the many ways datasets are used in Platform, please review the service documentation linked throughout this overview.
For steps on how to interact with datasets within the Experience Platform UI, see the datasets user guide.