Data Ingestion Overview


Last update: August 20, 2024

Topics:
Data Ingestion

CREATED FOR:

Developer

Adobe Experience Platform brings data from multiple sources together to help marketers better understand the behavior of their customers. Adobe Experience Platform Data Ingestion represents the multiple methods by which Experience Platform ingests data from these sources, as well as how that data is persisted within the Data Lake for use by downstream Experience Platform services.

This document introduces the three main ways in which data is ingested into Experience Platform, with links to their respective overview documentation for more detailed information.

Batch ingestion

Batch ingestion allows you to ingest data into Experience Platform as batch files. Batches are units of data that consist of one or more files to be ingested as a single unit. Once ingested, batches provide metadata that describes the number of records successfully ingested, as well as any failed records and associated error messages.

Manually uploaded data files, such as flat CSV files (mapped to XDM schemas) and Parquet dataframes, must be ingested using this method.

See the batch ingestion overview for more information.

TIP

Use single-line JSON instead of multi-line JSON as input for batch ingestion. Single-line JSON allows for better performance as the system can divide one input file into multiple chunks and process them in parallel, whereas multi-line JSON cannot be split. This can significantly reduce data processing costs and improve batch processing latency.
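As a rough illustration of the tip above, the following sketch converts pretty-printed (multi-line) JSON into single-line JSON, with one compact object per line. The sample records and the `to_single_line_json` helper are hypothetical, and the sketch assumes the multi-line input is a JSON array of records.

```python
import json
from typing import Any, Iterable


def to_single_line_json(records: Iterable[dict[str, Any]]) -> str:
    """Serialize each record as one compact JSON object per line."""
    return "\n".join(
        json.dumps(record, separators=(",", ":"), ensure_ascii=False)
        for record in records
    )


# Hypothetical multi-line (pretty-printed) input holding an array of records.
multi_line = """
[
  {
    "id": "a1",
    "person": {"name": {"firstName": "Ada"}}
  },
  {
    "id": "b2",
    "person": {"name": {"firstName": "Lin"}}
  }
]
"""

single_line = to_single_line_json(json.loads(multi_line))
print(single_line)
```

Because every line in the output is a complete JSON object, a reader can split the file at any line boundary and parse each chunk independently, which is the property the tip relies on.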

Streaming ingestion

Streaming ingestion allows you to send data from client- and server-side devices to Experience Platform in real time. Experience Platform supports the use of data inlets to stream incoming experience data, which is persisted in streaming-enabled datasets within the Data Lake. Data inlets can be configured to automatically authenticate the data they collect, ensuring that the data is coming from a trusted source.
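A minimal sketch of what sending one record to a data inlet might look like is shown below. The URL, IDs, and token are hypothetical placeholders, and the header/body envelope and its field names are assumptions rather than a confirmed request format, so check the streaming ingestion overview before relying on them. The request is only built here, not sent.

```python
import json
import urllib.request
from typing import Any, Optional

# Hypothetical placeholders: substitute the values for your own inlet and schema.
INLET_URL = "https://example.com/collection/CONNECTION_ID"
SCHEMA_ID = "https://ns.example.com/tenant/schemas/SCHEMA_ID"
ORG_ID = "ORG_ID@ExampleOrg"
DATASET_ID = "DATASET_ID"


def build_streaming_request(
    xdm_entity: dict[str, Any], access_token: Optional[str] = None
) -> urllib.request.Request:
    """Wrap one XDM record in an assumed header/body envelope as an HTTP POST."""
    schema_ref = {
        "id": SCHEMA_ID,
        # Assumed content type for the schema reference.
        "contentType": "application/vnd.adobe.xed-full+json;version=1",
    }
    payload = {
        "header": {"schemaRef": schema_ref, "imsOrgId": ORG_ID, "datasetId": DATASET_ID},
        "body": {"xdmMeta": {"schemaRef": schema_ref}, "xdmEntity": xdm_entity},
    }
    headers = {"Content-Type": "application/json"}
    if access_token:
        # Only needed when the inlet is configured to authenticate incoming data.
        headers["Authorization"] = f"Bearer {access_token}"
    return urllib.request.Request(
        INLET_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_streaming_request(
    {"eventType": "web.pageView", "timestamp": "2024-08-20T00:00:00Z"},
    access_token="TOKEN",
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the payload easy to inspect and test before any data reaches a live inlet.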

See the streaming ingestion overview for more information.

Sources

Experience Platform allows you to set up source connections to various data providers. These connections enable you to authenticate to your external data sources, set times for ingestion runs, and manage ingestion throughput.

Source connections can be configured to gather data from other Adobe applications (such as Adobe Analytics and Adobe Audience Manager), third-party cloud storage sources (such as Azure Blob, Amazon S3, FTP servers, and SFTP servers), and third-party CRM systems (such as Microsoft Dynamics and Salesforce).

See the Sources overview for more information.

ML-Assisted schema creation

To quickly integrate new data sources, you can now use machine learning algorithms to generate a schema from sample data. This automation simplifies the creation of accurate schemas, reduces errors, and speeds up the process from data collection to analysis and insights.

See the ML-assisted schema creation guide for more information on this workflow.

Next steps and additional resources

This document provided a brief introduction to the different aspects of Data Ingestion in Experience Platform. Please continue to read the overview documentation for each ingestion method to familiarize yourself with their different capabilities, use cases, and best practices. You can also supplement your learning by watching the ingestion overview video below. For information on how Experience Platform tracks the metadata for ingested records, see the Catalog Service overview.

WARNING

The term "Unified Profile" that is used in the following video is out of date. The terms "Profile" or "Real-Time Customer Profile" are the correct terms used in the Experience Platform documentation. Please refer to the documentation for the latest functionality.


Transcript

Hi there, I'm going to give you a quick overview of how to ingest data into Adobe Experience Platform. Data ingestion is a fundamental step to getting your data in Experience Platform so you can use it to build 360-degree real-time customer profiles and use them to provide meaningful experiences. Adobe Experience Platform allows data to be ingested from various external sources while giving you the ability to structure, label and enhance incoming data using Platform services. You can ingest data from various sources such as Adobe applications, enterprise sources, databases, stream data using a web or mobile SDK and many others. Platform is API friendly and lets you ingest data using batch and streaming APIs. Experience Platform provides tools to ensure that the ingested data is XDM compliant and helps prepare that data for real-time customer profiles and other services. You can ingest data into Platform using various sources. You can configure a streaming source connector in Platform that provides an HTTP API endpoint, and then you can either do a batch ingestion or stream data into Platform using the endpoint. You can drag and drop files into the UI and ingest them with the batch mode. You can also configure a source connector in the UI that will ingest data from the origin system using the most appropriate mode for that system. Source connectors ingest data using either batch ingestion or streaming ingestion. Platform provides you with state-of-the-art streaming infrastructure to collect, enrich and activate data in real time. Streaming ingestion APIs make it easy for customers to ingest data from real-time messaging systems, other first-party systems and partners. When data is streamed to Platform, it is verified to ensure that it's coming from trusted sources and that it's in the XDM format. The data is then placed on the Experience Platform pipeline for consumption by other services as fast as possible.
Different services within Platform then consume the data from the pipeline. In the next step, the data is stored in the Data Lake as a dataset, comprised of batches and files that can be accessed by various Platform components. All datasets contain a reference to the XDM schema that constrains the format and structure of the data that they can store. Attempting to upload data that does not conform to a dataset's XDM schema will cause the ingestion to fail. Any data that is configured to be processed into the profile gets flagged for immediate processing into the identity graph and profile store. With Real-Time Customer Profile, you can see a holistic view of each individual customer by combining data from multiple channels, including online, offline, CRM and third-party. Profile allows you to consolidate your customer data into a unified view, offering an actionable, timestamped account of every customer interaction. With Adobe Experience Platform Query Service, you can pull all your stored customer datasets, including behavioral, CRM, point-of-sale data and more, into one place and run fast petabyte-scale SQL queries to discover the story behind customer behavior and generate impactful insights using a BI tool of your choice. Although Real-Time Customer Profile requires real-time data ingestion and activation, there are still many use cases where batch ingestion is needed. Many first-party and third-party systems do not support streaming ingestion yet. Plus, you might want to completely refresh the data in Platform with an updated version from your own data lake, such as a monthly refresh of your product catalog. In addition, if you want to upload large volumes of data, batch ingestion is still the optimal method to load terabytes of data into Platform. To support these use cases, Platform provides batch data ingestion pipelines that allow you to ingest data from any system.
The batch pipeline validates, transforms and partitions data before it's stored in the Data Lake. This ensures that the data is stored in the most optimized format to support easy access at petabyte scale. Let's take a quick look at the source connector example to get a better understanding. When you log in to Platform, you will see Sources in the left navigation. Clicking Sources will take you to the source catalog screen where you can see all of the source connectors currently available in Platform. For our video, let's use the Amazon S3 cloud storage to perform a batch ingestion. Click on the add data option and choose an existing Amazon account and then move to the next step. In this step, we choose the source file for data ingestion and verify the file data format. Note that the ingested file data can be formatted as XDM JSON, XDM Parquet or delimited. Currently for delimited files, you have an option to preview sample data of the source file. You can also choose a custom delimiter for your source data. For streaming and batch ingestion, Adobe Experience Platform currently supports the following file formats. For data ingestion, another requirement is to have a dataset to store the incoming data. A dataset is a storage and management construct for a collection of data, typically a table, that contains columns derived from a schema, and the ingested data gets stored as rows. All datasets are based on existing XDM schemas, which provide constraints for what the ingested data should contain and how it should be structured. Experience Platform uses schemas to describe the structure of data in a consistent and reusable way. Before ingesting data into Platform, a schema must be composed to describe the data structure and provide constraints on the type of data that can be contained within each field, so data can be validated as it moves between systems. A schema consists of a base class and zero or more mix-ins.
First you assign a class that defines what a schema is, for example, an individual profile or an experience event. Next you can add mix-ins, which are reusable components defining fields like personal details, preferences or addresses. Adobe Experience Platform provides standard classes and mix-ins related to these classes. If there is a need, you can also define a custom class or a custom mix-in for your use case. Data Prep allows data engineers to map, transform and validate source data to and from Experience Data Model (XDM). Data Prep appears as a mapping step in the data ingestion process. Data engineers can use Data Prep to perform data manipulation during ingestion. You can define simple pass-through mappings to assign source input attributes to XDM target attributes, or create calculated fields to perform in-row calculations that can be assigned to XDM attributes. In this example, you can combine the first name and the last name source fields to populate the full name target field using a concatenation operation. Similarly, you can also transform a particular field by applying string, numeric or date manipulation functions provided by Platform. Let's select a frequency for this batch ingestion and move to the next step. With the help of error diagnostics, Platform allows users to generate error reports for newly ingested batches. Error diagnostics for failed records can be downloaded using the API. Partial ingestion enables the ingestion of valid records of new batch data, within a specified error threshold. The error threshold enables the configuration of the percentage of acceptable errors before the entire batch fails. Let's review the changes and save your configuration. At this step, we have successfully configured a data ingestion flow from a source location to Platform.
Adobe Experience Platform allows data to be ingested from various external sources while giving you the ability to structure, label and enhance incoming data using Platform services.
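The mapping step and the partial-ingestion error threshold described in the transcript can be sketched roughly as follows. This is an illustration under stated assumptions, not Platform's implementation: the source field names, the `MAPPINGS` table, and the `ingest_batch` helper are hypothetical, and the sketch assumes a batch fails only when the error percentage exceeds the threshold.

```python
from typing import Any, Callable

Record = dict[str, Any]

# Hypothetical mapping table: XDM target path -> function of the source row.
MAPPINGS: dict[str, Callable[[Record], Any]] = {
    "person.name.firstName": lambda row: row["first_name"],  # pass-through mapping
    "person.name.lastName": lambda row: row["last_name"],    # pass-through mapping
    # Calculated field: concatenate first and last name into the full name.
    "person.name.fullName": lambda row: f'{row["first_name"]} {row["last_name"]}',
}


def map_row(row: Record) -> Record:
    """Apply every mapping; raises KeyError when a source field is missing."""
    return {target: compute(row) for target, compute in MAPPINGS.items()}


def ingest_batch(
    rows: list[Record], error_threshold_pct: float
) -> tuple[str, list[Record], list[str]]:
    """Return (status, ingested records, error messages) for one batch."""
    ingested: list[Record] = []
    errors: list[str] = []
    for index, row in enumerate(rows):
        try:
            ingested.append(map_row(row))
        except KeyError as exc:
            errors.append(f"row {index}: missing source field {exc}")
    error_pct = 100 * len(errors) / len(rows) if rows else 0.0
    if error_pct > error_threshold_pct:
        # Too many bad records: the whole batch fails and nothing is ingested.
        return "failed", [], errors
    return "success", ingested, errors


rows = [
    {"first_name": "Ada", "last_name": "Lovelace"},
    {"first_name": "Lin"},  # invalid: last_name is missing
    {"first_name": "Alan", "last_name": "Turing"},
    {"first_name": "Grace", "last_name": "Hopper"},
]

status, records, errors = ingest_batch(rows, error_threshold_pct=30)
print(status, len(records), errors)
```

With one bad row out of four (25%), a 30% threshold ingests the three valid records and reports the error, while a 10% threshold would fail the entire batch.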
