Export Datasets to Cloud Storage Destinations

Last update: Tue Dec 17 2024 00:00:00 GMT+0000 (Coordinated Universal Time)

Topics:
Destinations

CREATED FOR:

Admin
User

AVAILABILITY

This functionality is available to customers who have purchased the Real-Time CDP Prime or Ultimate package, Adobe Journey Optimizer, or Customer Journey Analytics. Contact your Adobe representative for more information.

This article explains the workflow required to export datasets from Adobe Experience Platform to your preferred cloud storage location, such as Amazon S3, SFTP locations, or Google Cloud Storage, by using the Experience Platform UI.

You can also use the Experience Platform APIs to export datasets. Read the export datasets API tutorial for more information.

Datasets available for exporting

The datasets that you can export vary based on the Experience Platform application (Real-Time CDP, Adobe Journey Optimizer), the tier (Prime or Ultimate), and any add-ons that you purchased (for example: Data Distiller).

Use the table below to understand which dataset types you can export depending on your application, product tier, and any add-ons purchased:

| Application/Add-on | Tier | Datasets available for export |
| --- | --- | --- |
| Real-Time CDP | Prime | Profile and Experience Event datasets created in the Experience Platform UI after ingesting or collecting data through Sources, Web SDK, Mobile SDK, Analytics Data Connector, and Audience Manager. |
| Real-Time CDP | Ultimate | Profile and Experience Event datasets created in the Experience Platform UI after ingesting or collecting data through Sources, Web SDK, Mobile SDK, Analytics Data Connector, and Audience Manager. System-generated Profile Snapshot dataset. |
| Adobe Journey Optimizer | Prime | Refer to the Adobe Journey Optimizer documentation. |
| Adobe Journey Optimizer | Ultimate | Refer to the Adobe Journey Optimizer documentation. |
| Customer Journey Analytics | All | Profile and Experience Event datasets created in the Experience Platform UI after ingesting or collecting data through Sources, Web SDK, Mobile SDK, Analytics Data Connector, and Audience Manager. |
| Data Distiller | Data Distiller (Add-on) | Derived datasets created through Query Service. |

Video tutorial

Watch the video below for an end-to-end explanation of the workflow described on this page, benefits of using the export dataset functionality, and some suggested use cases.

Transcript

In this video, I'll show you how to export an Experience Platform dataset to a cloud storage destination, as well as review the benefits and use cases for this feature. These are the topics I'll cover. I'll give you a moment to finish reviewing this slide.

Using cloud storage destinations, an easy-to-use interface and workflow is available to export event data and CRM records from Experience Platform. The core tenets of data governance and supporting an open framework apply. Data labels are enforced for dataset exports, as is the case for other types of destination workflows. Our framework supports interoperability between Experience Platform and public clouds to support use cases involving data stored in the data lake to be used in external systems.

Real-Time CDP and Journey Optimizer customers have several primary use cases for exporting Experience Platform datasets and using a straightforward method for doing this from the destinations catalog. Datasets can be exported for use in external machine learning and business intelligence tools to support analytical use cases, such as reporting and informing better audience creation. Log data can be exported for the purpose of monitoring the health and performance of email marketing campaigns.

These are the cloud storage destinations that can be used for dataset exports. If you don't see the dataset option in the export workflow, check your user permissions. You'll need the View Destinations and Manage and Activate Dataset Destinations permissions. Additionally, you'll want to ensure you have the proper data management permissions for datasets, namely View Datasets.

Next, I'll demo configuring a cloud storage account I'll use for dataset exports. I'm logged in to Experience Platform. I'll select Destinations below Connections in the left navigation panel. Once the Destinations catalog displays, I'll scroll to the Categories section and select Cloud Storage. I'll be working with an Amazon S3 account. While you may have existing Amazon S3 connections, you'll need to create a new connection for exporting datasets.

In the top right corner of the Amazon S3 destination card, I'll select the Dataset icon. This opens a new panel on the right. I'll select the Configure New Destination link. In the Configure New Destination screen, I'll select an existing account. From the Select Destination account modal, I'll choose one from the list and then I'll click on Select in the upper right corner. Under Destination Details, I'll select Datasets. Notice the destination also supports prospects and audiences. Then I'll fill in the Name, Description, Bucket Name, and Folder Path fields. The last two fields specify the area and path of the S3 account where the dataset files will be stored. For the File Type field, I have two options, JSON and Parquet. I'll choose JSON. Towards the bottom, I can also choose a compression format, which I'll do now. The choice is relative to the file type selected above. Then I can choose any alerts I wish to receive for this export. When I'm done, I'll select Next in the top right. This step of the Configure New Destination workflow prompts me to select the marketing action appropriate for this connection. I'll choose Data Export from the list and then Create in the top right.

Now that I've set up the new connection for the dataset export in my Amazon S3 account, I'll demonstrate the workflow to export one. I'll choose the Activate command for Amazon S3. I'll select Datasets as the data type. This shows me the new connection I just created. I'll choose this and select Next in the upper right corner. The Select Dataset step in the workflow lets me view and choose the dataset I want to export. Notice some of the Journey Optimizer datasets in the list as well. Once I choose the dataset, I'll select Next in the upper right corner.

On the Scheduling step, the File Export option is set to Export Incremental Files. The first exported incremental file includes all existing data in that dataset, functioning as a backfill. Next I can choose a frequency setting. My choices are daily and hourly. Hourly has several different presets associated with it. I could customize the start date if I needed to do that as well. Now I'll click Next in the upper corner. This takes me to a review step where I can verify my settings. Once everything looks good, I'll select Finish in the upper right corner.

That's it! It's very straightforward as you can see. After this, you would connect to the S3 bucket to confirm you see your files there. An export time is appended to the end of each file. This concludes the Dataset Export video. Hopefully you'll be able to export datasets using a cloud storage destination. Thanks and good luck!

Supported destinations

Currently, you can export datasets to the cloud storage destinations highlighted in the screenshot and listed below.

Destinations catalog page showing which destinations support dataset exports.

When to activate audiences or export datasets

Some file-based destinations in the Experience Platform catalog support both audience activation and dataset export.

Consider activating audiences when you want your data structured into profiles grouped by audience interests or qualifications.
Alternatively, consider dataset exports when you are looking to export raw datasets, which are not grouped or structured by audience interests or qualifications. You could use this data for reporting, data science workflows, and many other use cases. For example, as an administrator, data engineer, or analyst, you can export data from Experience Platform to synchronize with your data warehouse, use in BI analysis tools, external cloud ML tools, or store in your system for long-term storage needs.

This document contains all the information necessary to export datasets. If you want to activate audiences to cloud storage or email marketing destinations, read Activate audience data to batch profile export destinations.

Prerequisites

To export datasets to cloud storage destinations, you must have successfully connected to a destination. If you haven't done so already, go to the destinations catalog, browse the supported destinations, and configure the destination that you want to use.

Required permissions

To export datasets, you need the View Destinations, View Datasets, and Manage and Activate Dataset Destinations access control permissions. Read the access control overview or contact your product administrator to obtain the required permissions.

To ensure that you have the necessary permissions to export datasets and that the destination supports exporting datasets, browse the destinations catalog. If a destination has an Activate or an Export datasets control, then you have the appropriate permissions.

Select your destination

Follow the instructions to select a destination where you can export your datasets:

Go to Connections > Destinations, and select the Catalog tab.
Select Activate or Export datasets on the card corresponding to the destination that you want to export datasets to.
Select Data type Datasets and select the destination connection that you want to export datasets to, then select Next.

TIP

If you want to set up a new destination to export datasets, select Configure new destination to trigger the Connect to destination workflow.

Destination activation workflow with Datasets control highlighted.

The Select datasets view appears. Proceed to the next section to select your datasets for export.

Select your datasets

Use the check boxes to the left of the dataset names to select the datasets that you want to export to the destination, then select Next.

Dataset export workflow showing the Select datasets step where you can select which datasets to export.

Schedule dataset export

Use the Scheduling step to:

Set a start date and an end date, as well as an export cadence for your dataset exports.
Configure if the exported dataset files should export the complete membership of the dataset or just incremental changes to the membership on each export occurrence.
Customize the folder path in your storage location where datasets should be exported. Read more about how to edit the export folder path.

Use the Edit schedule control on the page to edit the export cadence of exports, as well as to select whether to export full or incremental files.

Edit schedule control highlighted in the Scheduling step.

The Export incremental files option is selected by default. With this option, the first export consists of one or multiple files representing a full snapshot of the dataset. Subsequent files contain only the incremental additions to the dataset since the previous export. You can also select Export full files. In this case, select the frequency Once for a one-time full export of the dataset.

IMPORTANT

The first incremental file export includes all existing data in the dataset, functioning as a backfill. The export can contain one or multiple files.

Dataset export workflow showing the scheduling step.

Use the Frequency selector to select the export frequency:
- Daily: Schedule incremental file exports once a day, every day, at the time you specify.
- Hourly: Schedule incremental file exports every 3, 6, 8, or 12 hours.
Use the Time selector to choose the time of day, in UTC format, when the export should take place.
Use the Date selector to choose the interval when the export should take place.
Select Save to save the schedule and proceed to the Review step.
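
To reason about when files are likely to land in your storage location, the frequency options above can be sketched in a few lines of Python. This is only an illustration: the helper name `hourly_export_times` is hypothetical, and the sketch assumes that exports recur at a fixed interval counted from the start time you pick, which this page does not state explicitly.

```python
from datetime import datetime, timedelta, timezone

# Hourly presets listed in the Frequency selector above.
ALLOWED_HOURLY_INTERVALS = {3, 6, 8, 12}


def hourly_export_times(start, interval_hours, count):
    """Return `count` export times spaced `interval_hours` apart, in UTC.

    Assumption: exports recur at a fixed interval counted from `start`.
    """
    if interval_hours not in ALLOWED_HOURLY_INTERVALS:
        raise ValueError(
            f"interval must be one of {sorted(ALLOWED_HOURLY_INTERVALS)}"
        )
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    start_utc = start.astimezone(timezone.utc)
    return [start_utc + timedelta(hours=interval_hours * i) for i in range(count)]


if __name__ == "__main__":
    first = datetime(2024, 12, 17, 0, 0, tzinfo=timezone.utc)
    for export_time in hourly_export_times(first, 8, 4):
        print(export_time.isoformat())
```

The times are kept in UTC throughout because the Time selector works in UTC.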

NOTE

For dataset exports, the file names have a preset, default format, which cannot be modified. See the section Verify successful dataset export for more information and examples of exported files.

Edit folder path

Select Edit folder path to customize the folder structure in your storage location where exported datasets are deposited.

Edit folder path control highlighted in the scheduling step.

You can use several available macros to customize a desired folder name. Double-click a macro to add it to the folder path and use / between the macros to separate the folders.

Macros selection highlighted in custom folder modal window.

After selecting the desired macros, you can see a preview of the folder structure that will be created in your storage location. The first level in the folder structure represents the Folder path that you indicated when you connected to the destination to export datasets.

Preview of folder path highlighted in custom folder modal window.

Review

On the Review page, you can see a summary of your selection. Select Cancel to abandon the flow, Back to modify your settings, or Finish to confirm your selection and start exporting datasets to the destination.

Dataset export workflow showing the review step.

Verify successful dataset export

When exporting datasets, Experience Platform creates one or multiple .json or .parquet files in the storage location that you provided. Expect new files to be deposited in your storage location according to the export schedule you provided.

Experience Platform creates a folder structure in the storage location you specified, where it deposits the exported dataset files. The default folder export pattern is shown below, but you can customize the folder structure with your preferred macros.

TIP

The first level in this folder structure - folder-name-you-provided - represents the Folder path that you indicated when you connected to the destination to export datasets.

folder-name-you-provided/datasetID/exportTime=YYYYMMDDHHMM

The default file name is randomly generated and ensures that exported file names are unique.
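
If you process exported files downstream, it can help to split an object path into its parts. The following is a minimal sketch based only on the default pattern shown above. The function name `parse_export_path` and the sample path are hypothetical, and treating the `exportTime` value as UTC is an assumption. The sketch does not apply if you customized the folder structure with macros.

```python
import re
from datetime import datetime, timezone

# Matches the default pattern:
#   folder-name-you-provided/datasetID/exportTime=YYYYMMDDHHMM[/file-name]
EXPORT_PATH = re.compile(
    r"^(?P<folder>.+)/(?P<dataset_id>[^/]+)/exportTime=(?P<export_time>\d{12})(?:/.*)?$"
)


def parse_export_path(path):
    """Split a default-pattern export path into folder, dataset ID, and export time."""
    match = EXPORT_PATH.match(path)
    if match is None:
        raise ValueError(f"path does not follow the default export pattern: {path}")
    # Assumption: the exportTime value is expressed in UTC.
    export_time = datetime.strptime(match["export_time"], "%Y%m%d%H%M").replace(
        tzinfo=timezone.utc
    )
    return match["folder"], match["dataset_id"], export_time


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    print(parse_export_path("exports/abc123/exportTime=202412170300/part-0001.json.gz"))
```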

Sample dataset files

The presence of these files in your storage location is confirmation of a successful export. To understand how the exported files are structured, you can download a sample .parquet file or .json file.

Compressed dataset files

In the connect to destination workflow, you can select the exported dataset files to be compressed, as shown below:

File type and compression selection when connecting to a destination to export datasets.

Note the difference in file format between the two file types, when compressed:

When exporting compressed JSON files, the exported file format is json.gz. The format of the exported JSON is NDJSON (newline-delimited JSON, one record per line), which is a standard interchange format in the big data ecosystem. Adobe recommends using an NDJSON-compatible client to read the exported files.
When exporting compressed Parquet files, the exported file format is gz.parquet.

Exports to JSON files are supported in a compressed mode only. Exports to Parquet files are supported in a compressed and uncompressed mode.
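
Because the compressed JSON export is gzip-wrapped NDJSON, it must be read line by line rather than parsed as a single JSON document. The sketch below shows one way to do this with the Python standard library. The function name `read_ndjson_gz` and the file name are hypothetical, and the sketch assumes UTF-8 encoded files.

```python
import gzip
import json


def read_ndjson_gz(path):
    """Yield one parsed record per non-empty line of a gzip-compressed NDJSON file."""
    # Assumption: the exported file is UTF-8 encoded.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    for record in read_ndjson_gz("exported-dataset.json.gz"):
        print(record)
```

Reading the file as a generator keeps memory use flat, which matters because a single export can contain many large files.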

Remove datasets from destinations

To remove datasets from an existing dataflow, follow the steps below:

Log in to the Experience Platform UI and select Destinations from the left navigation bar. Select Browse from the top header to view your existing destination dataflows.

Destination browse view with a destination connection shown and the rest blurred out.

TIP
Select the filter icon on the top left to launch the sort panel. The sort panel provides a list of all your destinations. You can select more than one destination from the list to see a filtered selection of dataflows associated with the selected destination.

From the Activation data column, select the datasets control to view all datasets mapped to this export dataflow.
The Activation data page for the destination appears. Use the checkboxes on the left side of the dataset list to select the datasets which you want to remove, then select Remove datasets in the right rail to trigger the remove dataset confirmation dialog.
In the confirmation dialog, select Remove to immediately remove the dataset from exports to the destination.

Dataset export entitlements

Refer to the product description documents to understand how much data you are entitled to export for each Experience Platform application, per year. For example, you can view the Real-Time CDP Product Description.

Note that the data export entitlements for different applications are not additive. For example, this means that if you purchase Real-Time CDP Ultimate and Adobe Journey Optimizer Ultimate, the profile export entitlement will be the larger of the two entitlements, as per the product descriptions. Your volume entitlements are calculated by taking your total number of licensed profiles and multiplying by 500 KB for Real-Time CDP Prime or 700 KB for Real-Time CDP Ultimate to determine how much volume of data you are entitled to.

On the other hand, if you purchased add-ons such as Data Distiller, the data export limit that you are entitled to represents the sum of the product tier and the add-on.

You can view and track your profile exports against your contractual limits in the license usage dashboard.
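
The volume calculation described in this section is simple arithmetic, sketched below for illustration. The function names are hypothetical, and the conversion to terabytes assumes decimal units (1 TB = 10^9 KB), which this page does not specify. Always rely on your product description and the license usage dashboard for the authoritative figures.

```python
# Per-profile allowance in KB, as stated above for Real-Time CDP.
KB_PER_PROFILE = {"prime": 500, "ultimate": 700}


def export_entitlement_kb(licensed_profiles, tier):
    """Return the export volume entitlement in KB for a Real-Time CDP tier."""
    return licensed_profiles * KB_PER_PROFILE[tier.lower()]


def kb_to_tb(kilobytes):
    """Convert KB to TB. Assumption: decimal units, 1 TB = 10**9 KB."""
    return kilobytes / 10**9


if __name__ == "__main__":
    profiles = 10_000_000  # hypothetical number of licensed profiles
    for tier in ("Prime", "Ultimate"):
        volume_kb = export_entitlement_kb(profiles, tier)
        print(f"{tier}: {volume_kb} KB (about {kb_to_tb(volume_kb)} TB)")
```

The sketch covers a single application only. It does not model the rule that entitlements across applications are not additive, nor the sum that applies to add-ons such as Data Distiller.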

Known limitations

Keep in mind the following limitations for the general availability release of dataset exports:

Experience Platform may export multiple files even for small datasets. Dataset export is designed for system-to-system integration and optimized for performance, hence the number of exported files is not customizable.
Exported file names are currently not customizable.
Datasets created via API are currently not available for export.
The UI does not currently block you from deleting a dataset that is being exported to a destination. Do not delete any datasets that are being exported to destinations. Remove the dataset from a destination dataflow before deleting it.
Monitoring metrics for dataset exports are currently mixed with numbers for profile exports so they do not reflect the true export numbers.
Data with a timestamp older than 365 days is excluded from dataset exports. For more information, view the guardrails for scheduled dataset exports.
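
To estimate in advance which records fall outside the 365-day window from the last limitation above, you could run a check like the following against your own data. This is a rough sketch: the function name `is_within_export_window` is hypothetical, and the exact way Experience Platform evaluates the cutoff (reference time and boundary handling) is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Guardrail stated above: data older than 365 days is excluded.
MAX_AGE = timedelta(days=365)


def is_within_export_window(record_timestamp, now=None):
    """Return True if a timezone-aware record timestamp is at most 365 days old.

    Assumption: age is measured against the current time in UTC, and a
    record exactly 365 days old is still included.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - record_timestamp <= MAX_AGE


if __name__ == "__main__":
    reference = datetime(2024, 12, 17, tzinfo=timezone.utc)
    print(is_within_export_window(datetime(2024, 6, 1, tzinfo=timezone.utc), reference))
    print(is_within_export_window(datetime(2023, 6, 1, tzinfo=timezone.utc), reference))
```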

Frequently Asked Questions

Can we generate a file without a folder if we just save at / as the folder path? Also, if we don't require a folder path, how will files with duplicate names be generated in a folder or location?

Answer

Starting with the September 2024 release, it is possible to customize the folder name and even use / for exporting files for all datasets in the same folder. Adobe does not recommend this for destinations exporting multiple datasets, as system-generated filenames belonging to different datasets will be mixed in the same folder.

Can you route the manifest file to one folder and data files into another folder?

Answer

No, there is no capability to copy the manifest file to a different location.

Can we control the sequencing or timing of file delivery?

Answer

There are options for scheduling the export. There are no options for delaying or sequencing the copy of the files. They are copied to your storage location as soon as they are generated.

What formats are available for the manifest file?

Answer

The manifest file is in .json format.

Is there API availability for the manifest file?

Answer

No API is available for the manifest file, but it includes a list of files comprising the export.

Can we add additional details to the manifest file (i.e., record count)? If so, how?

Answer

There is no possibility to add additional info to the manifest file. The record count is available via the flowRun entity (queryable via API). Read more in destinations monitoring.

How are data files split? How many records per file?

Answer

Data files are split per the default partitioning in the Experience Platform data lake. Larger datasets have a higher number of partitions. The default partitioning is not configurable by the user as it is optimized for reading.

Can we set a threshold (number of records per file)?

Answer

No, it is not possible.

How do we resend a data set in the event that the initial send is bad?

Answer

Retries are in place automatically for most types of system errors.
