Decile-based derived datasets use case
Derived datasets facilitate complicated use cases for analyzing data from the data lake that can be used with other downstream Platform services or published into your Real-Time Customer Profile data.
This example use case demonstrates how to create decile-based derived datasets for use with your Real-Time Customer Profile data. Using an airline loyalty scenario as an example, this guide informs you how to create a dataset that uses categorical deciles to segment and create audiences based on ranked attributes.
The following key concepts are illustrated:
- Schema creation for decile bucketing.
- Categorical decile creation.
- Creation of complex derived datasets.
- Calculation of deciles over a lookback period.
- An example query to demonstrate aggregation, ranking, and adding unique identities to allow for audiences to be generated based on these decile buckets.
Getting started
This guide requires a working understanding of query execution in Query Service and the following components of 蜜豆视频 Experience Platform:
- Real-Time Customer Profile overview: Provides a unified, real-time consumer profile based on aggregated data from multiple sources.
- Basics of schema composition: An introduction to Experience Data Model (XDM) schemas and the building blocks, principles, and best practices for composing schemas.
- How to enable a schema for Real-Time Customer Profile: This tutorial outlines the steps necessary to add data to Real-Time Customer Profile.
- How to define a custom data type: Data types are used as reference-type fields in classes or schema field groups and allow for the consistent use of a multi-field structure that can be included anywhere in the schema.
Objectives
The example given in this document uses deciles to build derived datasets for ranking data from an airline loyalty schema. Derived datasets allow you to maximise the utility of your data by identifying an audience based on the top 鈥榥鈥 % for a chosen category.
Build decile-based derived datasets
To define the ranking of deciles based on a particular dimension and a corresponding metric, a schema must be designed to allow for decile bucketing.
This guide uses an airline loyalty dataset to demonstrate how to use Query Service to build deciles based on the miles flown over various lookback periods.
Use Query Service to create deciles
Using Query Service, you can create a dataset that contains categorical deciles, which can then be segmented to create audiences based on attribute ranking. The concepts displayed in the following examples can be applied to create other decile bucket datasets, as long as a category is defined and a metric is available.
The example airline loyalty data uses an XDM ExperienceEvents class. Each event is a record of a business transaction for mileage, either credited or debited, and the membership loyalty status of either 鈥淔lyer鈥, 鈥淔requent鈥, 鈥淪ilver鈥, or 鈥淕old鈥. The primary identity field is membershipNumber
.
Sample datasets
The initial airline loyalty dataset for this example is 鈥淎irline Loyalty Data鈥, and has the following schema. Note that the primary identity for the schema is _profilefoundationreportingstg.membershipNumber
.
Sample data
The following table displays the sample data contained in the _profilefoundationreportingstg
object used for this example. It provides context for the use of decile buckets to create complex derived datasets.
_profilefoundationreportingstg
has been omitted from the start of the namespace in the column titles and subsequent mentions throughout the document..membershipNumber
.emailAddress.address
.transactionDate
.transactionType
.transactionDetails
.mileage
.loyaltyStatus
Generate decile datasets
In the airline loyalty data seen above, the .mileage
value contains the number of miles flown by a member for every individual flight taken. This data is used to create deciles for the number of miles flown over lifetime lookbacks and a variety of lookback periods. For this purpose, a dataset is created that contains deciles in a map data type for each lookback period and an appropriate decile for each lookback period assigned under membershipNumber
.
Create an 鈥淎irline Loyalty Decile Schema鈥 to create a decile dataset using Query Service.
Enable the schema for Real-Time Customer Profile
Data being ingested into Experience Platform for use by Real-Time Customer Profile must conform to an Experience Data Model (XDM) schema that is enabled for Profile. In order for a schema to be enabled for Profile, it must implement either the XDM Individual Profile or XDM ExperienceEvent class.
Enable your schema for use in Real-Time Customer Profile using the Schema Registry API or the Schema Editor user interface. Detailed instructions on how to enable a schema for Profile are available in their respective documentation.
Next, create a data type to be reused for all decile-related field groups. The creation of the decile field group is a one-time step per sandbox. It can also be reused for all decile-related schemas.
Create an identity namespace and mark it as the primary identifier identity-namespace
Any schema created for use with deciles must have a primary identity assigned. You can define an identity field in the 蜜豆视频 Experience Platform Schemas UI, or through the Schema Registry API.
Query Service also allows you to set an identity or a primary identity for ad hoc schema dataset fields directly through SQL. See the documentation on setting a secondary identity and primary identity in ad hoc schema identities for more information.
Create a query for calculating deciles over a lookback period create-a-query
The following example demonstrates the SQL query for calculating a decile over a lookback period.
A template can be made either using the Query Editor in the UI, or through the Query Service API.
CREATE TABLE AS airline_loyality_decile
{ WITH summed_miles_1 AS (
SELECT _profilefoundationreportingstg.membershipNumber AS membershipNumber,
_profilefoundationreportingstg.loyaltyStatus AS loyaltyStatus,
SUM(_profilefoundationreportingstg.mileage) AS totalMiles
FROM airline_loyalty_data
WHERE _profilefoundationreportingstg.transactionDate < (MAKE_DATE(YEAR(CURRENT_DATE), MONTH(CURRENT_DATE), 1) - MAKE_YM_INTERVAL(0, 0))
GROUP BY 1,2
),
summed_miles_3 AS (
SELECT _profilefoundationreportingstg.membershipNumber AS membershipNumber,
_profilefoundationreportingstg.loyaltyStatus AS loyaltyStatus,
SUM(_profilefoundationreportingstg.mileage) AS totalMiles
FROM airline_loyalty_data
WHERE _profilefoundationreportingstg.transactionDate < (MAKE_DATE(YEAR(CURRENT_DATE), MONTH(CURRENT_DATE), 1) - MAKE_YM_INTERVAL(0, 1))
GROUP BY 1,2
),
summed_miles_6 AS (
SELECT _profilefoundationreportingstg.membershipNumber AS membershipNumber,
_profilefoundationreportingstg.loyaltyStatus AS loyaltyStatus,
SUM(_profilefoundationreportingstg.mileage) AS totalMiles
FROM airline_loyalty_data
WHERE _profilefoundationreportingstg.transactionDate < (MAKE_DATE(YEAR(CURRENT_DATE), MONTH(CURRENT_DATE), 1) - MAKE_YM_INTERVAL(0, 4))
GROUP BY 1,2
),
rankings_1 AS (
SELECT membershipNumber,
loyaltyStatus,
totalMiles,
NTILE(10) OVER (PARTITION BY loyaltyStatus ORDER BY totalMiles DESC) AS decileBucket
FROM summed_miles_1
),
rankings_3 AS (
SELECT membershipNumber,
loyaltyStatus,
totalMiles,
NTILE(10) OVER (PARTITION BY loyaltyStatus ORDER BY totalMiles DESC) AS decileBucket
FROM summed_miles_3
),
rankings_6 AS (
SELECT membershipNumber,
loyaltyStatus,
totalMiles,
NTILE(10) OVER (PARTITION BY loyaltyStatus ORDER BY totalMiles DESC) AS decileBucket
FROM summed_miles_6
),
map_1 AS (
SELECT membershipNumber,
MAP_FROM_ARRAYS(COLLECT_LIST(loyaltyStatus), COLLECT_LIST(decileBucket)) AS decileMonth1
FROM rankings_1
GROUP BY membershipNumber
),
map_3 AS (
SELECT membershipNumber,
MAP_FROM_ARRAYS(COLLECT_LIST(loyaltyStatus), COLLECT_LIST(decileBucket)) AS decileMonth3
FROM rankings_3
GROUP BY membershipNumber
),
map_6 AS (
SELECT membershipNumber,
MAP_FROM_ARRAYS(COLLECT_LIST(loyaltyStatus), COLLECT_LIST(decileBucket)) AS decileMonth6
FROM rankings_6
GROUP BY membershipNumber
),
all_memberships AS (
SELECT DISTINCT _profilefoundationreportingstg.membershipNumber AS membershipNumber FROM airline_loyalty_data
)
SELECT STRUCT(
all_memberships.membershipNumber AS membershipNumber,
STRUCT(
map_1.decileMonth1 AS decileMonth1,
map_3.decileMonth3 AS decileMonth3,
map_6.decileMonth6 AS decileMonth6
) AS decilesMileage
) AS _profilefoundationreportingstg
FROM all_memberships
LEFT JOIN map_1 ON (all_memberships.membershipNumber = map_1.membershipNumber)
LEFT JOIN map_3 ON (all_memberships.membershipNumber = map_3.membershipNumber)
LEFT JOIN map_6 ON (all_memberships.membershipNumber = map_6.membershipNumber)
}
Query review
Sections of the example query are examined in greater detail below.
Lookback periods
The decile data type contains a bucket for 1, 3, 6, 9, 12, and lifetime lookbacks. The query uses the lookback periods of 1, 3, and 6 months, so each section will contain some 鈥渞epeated鈥 queries in order to create temporary tables for each lookback period.
decileMonthAll
.Aggregation
Use common table expressions (CTE) to aggregate the mileage together before creating decile buckets. This provides the total miles for a specific lookback period. CTEs exist temporarily and are only usable within the scope of the larger query.
summed_miles_1 AS (
SELECT _profilefoundationreportingstg.membershipNumber AS membershipNumber,
_profilefoundationreportingstg.loyaltyStatus AS loyaltyStatus,
SUM(_profilefoundationreportingstg.mileage) AS totalMiles
FROM airline_loyalty_data
WHERE _profilefoundationreportingstg.transactionDate < (MAKE_DATE(YEAR(CURRENT_DATE), MONTH(CURRENT_DATE), 1) - MAKE_YM_INTERVAL(0, 0))
GROUP BY 1,2
)
The block is repeated twice in the template (summed_miles_3
and summed_miles_6
) with a change in the date calculation in order to generate the data for the other lookback periods.
It is important to note the identity, dimension, and metric columns for the query (membershipNumber
, loyaltyStatus
and totalMiles
respectively).
Ranking
Deciles allow you to perform categorical bucketing. To create the ranking number, the NTILE
function is used with a parameter of 10
within a WINDOW grouped by the loyaltyStatus
field. This results in a ranking from 1 to 10. Set the ORDER BY
clause of the WINDOW
to DESC
to ensure that a ranking value of 1
is given to the greatest metric within the dimension.
rankings_1 AS (
SELECT membershipNumber,
loyaltyStatus,
totalMiles,
NTILE(10) OVER (PARTITION BY loyaltyStatus ORDER BY totalMiles DESC) AS decileBucket
FROM summed_miles_1
)
Map aggregation
With multiple lookback periods, you need to create the decile bucket maps in advance using the MAP_FROM_ARRAYS
and COLLECT_LIST
functions. In the example snippet, MAP_FROM_ARRAYS
creates a map with a pair of keys (loyaltyStatus
) and values (decileBucket
) arrays. COLLECT_LIST
returns an array with all values in the specified column.
map_1 AS (
SELECT membershipNumber,
MAP_FROM_ARRAYS(COLLECT_LIST(loyaltyStatus), COLLECT_LIST(decileBucket)) AS decileMonth1
FROM rankings_1
GROUP BY membershipNumber
)
Unique identities
The list of unique identities (membershipNumber
) is required to create a unique list of all of the memberships.
all_memberships AS (
SELECT DISTINCT _profilefoundationreportingstg.membershipNumber AS membershipNumber FROM airline_loyalty_data
)
membershipNumber
can be done in the final step.Stitch together all temporary data
The final step is to stitch together all of the temporary data into a form that is identical to the structure of the deciles in the field group.
SELECT STRUCT(
all_memberships.membershipNumber AS membershipNumber,
STRUCT(
map_1.decileMonth1 AS decileMonth1,
map_3.decileMonth3 AS decileMonth3,
map_6.decileMonth6 AS decileMonth6
) AS decilesMileage
) AS _profilefoundationreportingstg
FROM all_memberships
LEFT JOIN map_1 ON (all_memberships.membershipNumber = map_1.membershipNumber)
LEFT JOIN map_3 ON (all_memberships.membershipNumber = map_3.membershipNumber)
LEFT JOIN map_6 ON (all_memberships.membershipNumber = map_6.membershipNumber)
If only lifetime data is available, your query would appear as follows:
SELECT STRUCT(
rankings.membershipNumber AS membershipNumber,
STRUCT(
MAP_FROM_ARRAYS(COLLECT_LIST(loyaltyStatus), COLLECT_LIST(decileBucket)) AS decileMonthAll
) AS decilesMileage
) AS _profilefoundationreportingstg
FROM rankings
GROUP BY rankings.membershipNumber
A correlation between the ranking number and the percentile is guaranteed in the query results because of the use of deciles. Each rank equates to 10%, so identifying an audience based on the top 30% only needs to target ranks 1, 2, and 3.
Run the query template
Run the query to populate the decile dataset. You can also save the query as a template and schedule it to run at a cadence. When saved as a template, the query can also be updated to use the create and insert pattern that references the table_exists
command. More information on how to use the table_exists
command can be found in the SQL syntax guide.
Next steps
The example use case provided above highlights steps to make decile-based derived datasets available in Real-Time Customer Profile. This allows for Segmentation Service, either via a user interface or RESTful API, to be able to generate audiences based on these decile buckets. See the Segmentation Service overview for information on how to create, evaluate, and access segments.