Model Authoring Using the Adobe Experience Platform Platform SDK

Last update: Mon Aug 05 2024 00:00:00 GMT+0000 (Coordinated Universal Time)

NOTE

Data Science Workspace is no longer available for purchase.

This documentation is intended for existing customers with prior entitlements to Data Science Workspace.

This tutorial explains how to convert code from data_access_sdk_python to the new platform_sdk, with examples in both Python and R. It covers the following operations:

Build authentication
Basic reading of data
Basic writing of data

Build authentication

Authentication is required to make calls to Adobe Experience Platform. It consists of an API key, an organization ID, a user token, and a service token.

Python

If you are using Jupyter Notebook, use the following code to build the client_context:

client_context = PLATFORM_SDK_CLIENT_CONTEXT

If you are not using Jupyter Notebook, or you need to change the organization, use the following code sample:

from platform_sdk.client_context import ClientContext
client_context = ClientContext(api_key={API_KEY},
              org_id={ORG_ID},
              user_token={USER_TOKEN},
              service_token={SERVICE_TOKEN})
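Outside of a notebook, one way to avoid hard-coding the four credentials is to read them from environment variables. The following is a minimal sketch under stated assumptions, not part of the SDK: the environment variable names and the `load_credentials` helper are hypothetical. Only the keyword names (`api_key`, `org_id`, `user_token`, `service_token`) come from the sample above.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
ENV_NAMES = {
    "api_key": "PLATFORM_API_KEY",
    "org_id": "PLATFORM_ORG_ID",
    "user_token": "PLATFORM_USER_TOKEN",
    "service_token": "PLATFORM_SERVICE_TOKEN",
}


def load_credentials(environ=None):
    """Return the four ClientContext keyword arguments as a dict.

    Raises KeyError listing every variable that is missing or empty.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in ENV_NAMES.values() if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {arg: environ[name] for arg, name in ENV_NAMES.items()}
```

With the variables set, the resulting dictionary can be unpacked into the constructor shown above, for example `ClientContext(**load_credentials())`.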

R

If you are using Jupyter Notebook, use the following code to build the client_context:

library(reticulate)
use_python("/usr/local/bin/ipython")
psdk <- import("platform_sdk")

py_run_file("../.ipython/profile_default/startup/platform_sdk_context.py")
client_context <- py$PLATFORM_SDK_CLIENT_CONTEXT

If you are not using Jupyter Notebook, or you need to change the organization, use the following code sample:

library(reticulate)
use_python("/usr/local/bin/ipython")
psdk <- import("platform_sdk")
client_context <- psdk$client_context$ClientContext(api_key={API_KEY},
              org_id={ORG_ID},
              user_token={USER_TOKEN},
              service_token={SERVICE_TOKEN})

Basic reading of data

With the new Platform SDK, the maximum read size is 32 GB, with a maximum read time of 10 minutes.

If a read takes too long, try narrowing it with one of the following options:

Filtering data by offset and limit
Filtering data by date
Filtering data by column
Getting sorted results

NOTE

The organization is set within the client_context.

Python

To read data in Python, please use the code sample below:

from platform_sdk.dataset_reader import DatasetReader
dataset_reader = DatasetReader(client_context, "{DATASET_ID}")
df = dataset_reader.limit(100).read()
df.head()

R

To read data in R, please use the code sample below:

DatasetReader <- psdk$dataset_reader$DatasetReader
dataset_reader <- DatasetReader(client_context, "{DATASET_ID}")
df <- dataset_reader$read()
df

Filter by offset and limit

Because filtering by batch ID is no longer supported, use offset and limit to scope the data you read.

Python

df = dataset_reader.limit(100).offset(1).read()
df.head()
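Offset and limit can also be combined to walk through a dataset page by page, which keeps each read small. The following is a minimal sketch, not an SDK feature: `read_in_pages` and its `make_reader` argument are hypothetical names. It assumes that `offset(n)` skips the first `n` rows and that a page shorter than the requested size marks the end of the data. A fresh reader is created for each page in case the chained calls keep state between reads.

```python
def read_in_pages(make_reader, page_size, max_pages=None):
    """Yield successive pages of rows until a short or empty page is returned.

    make_reader: zero-argument callable returning a new reader object that
                 supports .limit(n).offset(n).read() (an assumption based on
                 the calls shown in this section).
    """
    page_index = 0
    while max_pages is None or page_index < max_pages:
        reader = make_reader()
        page = reader.limit(page_size).offset(page_index * page_size).read()
        if len(page) == 0:
            break  # nothing left to read
        yield page
        if len(page) < page_size:
            break  # short page: assume this was the last one
        page_index += 1
```

Each page is yielded as returned by `read()`, so the caller decides whether to process pages one at a time or combine them. For example, `make_reader` could be `lambda: DatasetReader(client_context, "{DATASET_ID}")`.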

R

df <- dataset_reader$limit(100L)$offset(1L)$read()
df

Filter by date

Granularity of date filtering is now defined by the timestamp, rather than being set by the day.

Python

df = dataset_reader.where(
    dataset_reader['timestamp'].gt('2019-04-10 15:00:00')
    .And(dataset_reader['timestamp'].lt('2019-04-10 17:00:00'))
).read()
df.head()
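Because filtering works at timestamp granularity, a long time range can be split into smaller windows and read one window at a time. The following is a minimal sketch using only the Python standard library. `time_windows` is a hypothetical helper, and the timestamp format is assumed from the strings in the example above.

```python
from datetime import datetime, timedelta

# Assumed from the timestamp strings used in the filter example.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def time_windows(start, end, step):
    """Split [start, end) into consecutive (window_start, window_end) string pairs."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    windows = []
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)  # the last window may be shorter
        windows.append(
            (cursor.strftime(TIMESTAMP_FORMAT), upper.strftime(TIMESTAMP_FORMAT))
        )
        cursor = upper
    return windows
```

Each pair can then be used as the bounds of a `where()` filter. Using `ge()` for the lower bound and `lt()` for the upper bound keeps rows that fall exactly on a window boundary from being skipped or read twice.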

R

df2 <- dataset_reader$where(
    dataset_reader['timestamp']$gt('2018-12-10 15:00:00')$
    And(dataset_reader['timestamp']$lt('2019-04-10 17:00:00'))
)$read()
df2

The new Platform SDK supports the following operations:

Operation                        Function
Equals (=)                       eq()
Greater than (>)                 gt()
Greater than or equal to (>=)    ge()
Less than (<)                    lt()
Less than or equal to (<=)       le()
And (&)                          And()
Or (|)                           Or()

Filter by selected columns

To further refine your reading of data, you can also filter by column name.

Python

df = dataset_reader.select(['column-a','column-b']).read()

R

df <- dataset_reader$select(c('column-a','column-b'))$read()

Get sorted results

Results can be sorted by one or more columns of the target dataset, each with its own order (asc or desc).

In the following example, the dataframe is sorted by "column-a" first in ascending order. Rows having the same values for "column-a" are then sorted by "column-b" in descending order.

Python

df = dataset_reader.sort([('column-a', 'asc'), ('column-b', 'desc')]).read()
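To make the expected ordering concrete, the following plain-Python sketch applies the same list of (column, direction) pairs to a few in-memory rows. It does not call the SDK. `sort_rows` is a hypothetical helper that only illustrates the ordering described above.

```python
def sort_rows(rows, sort_spec):
    """Sort a list of dicts by (column, 'asc' | 'desc') pairs.

    The first pair is the most significant sort key.
    """
    result = list(rows)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last produces the combined ordering.
    for column, direction in reversed(sort_spec):
        if direction not in ("asc", "desc"):
            raise ValueError("direction must be 'asc' or 'desc'")
        result.sort(key=lambda row: row[column], reverse=(direction == "desc"))
    return result


rows = [
    {"column-a": 2, "column-b": 1},
    {"column-a": 1, "column-b": 1},
    {"column-a": 1, "column-b": 3},
]
ordered = sort_rows(rows, [("column-a", "asc"), ("column-b", "desc")])
# ordered (column-a, column-b): (1, 3), (1, 1), (2, 1)
```

The two rows that share the value 1 in "column-a" come first, and between them the row with the larger "column-b" value leads because that column is sorted in descending order.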

R

df <- dataset_reader$sort(list(tuple('column-a', 'asc'), tuple('column-b', 'desc')))$read()

Basic writing of data

NOTE

The organization is set within the client_context.

To write data in Python or R, use one of the following examples:

Python

from platform_sdk.models import Dataset
from platform_sdk.dataset_writer import DatasetWriter

dataset = Dataset(client_context).get_by_id("{DATASET_ID}")
dataset_writer = DatasetWriter(client_context, dataset)
write_tracker = dataset_writer.write({PANDA_DATAFRAME}, file_format='json')

R

dataset <- psdk$models$Dataset(client_context)$get_by_id("{DATASET_ID}")
dataset_writer <- psdk$dataset_writer$DatasetWriter(client_context, dataset)
write_tracker <- dataset_writer$write({PANDA_DATAFRAME}, file_format='json')

Next steps

Once you have configured the platform_sdk data loader, the data undergoes preparation and is then split into the train and val datasets. To learn about data preparation and feature engineering, see the section on data preparation and feature engineering in the tutorial for creating a recipe using JupyterLab notebooks.
