Creating a Pipeline

Create a Data Science pipeline to run a task.

Ensure that you have created the necessary policies, authentication, and authorization for pipelines.

Important

For proper operation of script steps, ensure that you have added the following rule to a dynamic group policy:

all {resource.type='datasciencepipelinerun', resource.compartment.id='<pipeline-run-compartment-ocid>'}

Before you begin:

You can create pipelines by using the ADS SDK, OCI Console, or the OCI SDK.

Using ADS to create pipelines makes it easier to develop the pipeline, its steps, and their dependencies. ADS supports reading and writing the pipeline to and from a YAML file, and you can use ADS to view a visual representation of the pipeline. We recommend that you use ADS to create and manage pipelines by using code.

    1. Use the Console to sign in to a tenancy with the necessary policies.
    2. Open the navigation menu and click Analytics & AI. Under Machine Learning, click Data Science.
    3. Select the compartment that contains the project that you want to use.

      All projects in the compartment are listed.

    4. Click the name of the project.

      The project details page opens and lists the notebook sessions.

    5. Under Resources, click Pipelines.
    6. Click Create pipeline.
    7. (Optional) Select a different compartment for the pipeline.
    8. (Optional) Enter a name and description for the pipeline (limit of 255 characters). If you don't provide a name, a name is automatically generated.

      For example, pipeline2022808222435.

    9. Click Add pipeline steps to start defining the workflow for the pipeline.
    10. In the Add pipeline step panel, select one of the following options, and then finish the pipeline creation:
    From a Job

    The pipeline step uses an existing job. Select one of the jobs in the tenancy.

    1. Enter a unique name for the step. You can't repeat a step name in a pipeline.
    2. (Optional) Enter a step description, which can help you find step dependencies.
    3. (Optional) If this step depends on another step, select one or more steps to run before this step.
    4. Select the job for the step to run.
    5. (Optional) Enter or select any of the following values to control this pipeline step:
      Custom environment variable key and value

      The environment variables for this pipeline step.

      Value

      The value for the custom environment variable key.

      You can click Additional custom environment key to specify more variables.

      Command line arguments

      The command line arguments that you want to use for running the pipeline step.

      Maximum runtime (in minutes)

      The maximum number of minutes that the pipeline step is allowed to run. The service cancels the pipeline run if its runtime exceeds the specified value. The maximum runtime is 30 days (43,200 minutes). We recommend that you configure a maximum runtime on all pipeline runs to prevent runaway pipeline runs.

    6. Click Save to add the step and return to the Create pipeline page.
    7. (Optional) Click +Add pipeline steps to add more steps to complete your workflow, and repeat the preceding steps.
    8. (Optional) Create a default pipeline configuration that's used when the pipeline is run by entering environment variables, command line arguments, and maximum runtime options. See step 5 for an explanation of these fields.
    9. (Optional) Select a Compute shape by clicking Select and following these steps:
      1. Select an instance type.
      2. Select a shape series.
      3. Select one of the supported Compute shapes in the series.
      4. Select the shape that best suits how you want to use the resource. For the AMD shape, you can use the default or set the number of OCPUs and memory.

        For each OCPU, select up to 64 GB of memory and a maximum total of 512 GB. The minimum amount of memory allowed is either 1 GB or a value matching the number of OCPUs, whichever is greater.

      5. Click Select shape.
    10. For Block Storage, enter the amount of storage that you want to use, between 50 GB and 10,240 GB (10 TB). You can change the value in 1 GB increments. The default value is 100 GB.
    11. (Optional) To use logging, click Select, and then ensure that Enable logging is selected.
      1. Select a log group from the list. You can change to a different compartment to specify a log group in a different compartment from the pipeline.
      2. Select one of the following to store all stdout and stderr messages:
        Enable automatic log creation

        Data Science automatically creates a log when the job starts.

        Select a log

        Select a log to use.

      3. Click Select to return to the Create pipeline page.
    12. (Optional) Click Show advanced options to add tags to the pipeline.
    13. (Optional) Enter the tag namespace (for a defined tag), key, and value to assign tags to the resource.

      To add more than one tag, click Add tag.

      Tagging describes the various tags that you can use to organize and find resources, including cost-tracking tags.

    14. Click Create.

      After the pipeline is in an active state, you can use pipeline runs to repeatedly run the pipeline.

    From a Script

    The step runs a script. You must upload an artifact containing all the code for the step to run.

    1. Enter a unique name for the step. You can't repeat a step name in a pipeline.
    2. (Optional) Enter a step description, which can help you find step dependencies.
    3. (Optional) If this step depends on another step, select one or more steps to run before this step.
    4. Drag a step file into the box, or click select a file to browse for and select it.
    5. In Entry point, select one file to be the entry point of the step. This is useful when you have many files.
    6. (Optional) Enter or select any of the following values to control this pipeline step:
      Custom environment variable key and value

      The environment variables for this pipeline step.

      Value

      The value for the custom environment variable key.

      You can click Additional custom environment key to specify more variables.

      Command line arguments

      The command line arguments that you want to use for running the pipeline step.

      Maximum runtime (in minutes)

      The maximum number of minutes that the pipeline step is allowed to run. The service cancels the pipeline run if its runtime exceeds the specified value. The maximum runtime is 30 days (43,200 minutes). We recommend that you configure a maximum runtime on all pipeline runs to prevent runaway pipeline runs.

    7. (Optional) Create a default pipeline configuration that's used when the pipeline is run by entering environment variables, command line arguments, and maximum runtime options. See step 6 for an explanation of these fields.
    8. For Block Storage, enter the amount of storage that you want to use, between 50 GB and 10,240 GB (10 TB). You can change the value in 1 GB increments. The default value is 100 GB.
    9. Click Save to add the step and return to the Create pipeline page.
    10. (Optional) Use +Add pipeline steps to add more steps to complete your workflow by repeating the preceding steps.
    11. (Optional) Create a default pipeline configuration that's used when the pipeline is run by entering environment variables, command line arguments, and maximum runtime options. See step 6 for an explanation of these fields.
    12. For Block Storage, enter the amount of storage that you want to use, between 50 GB and 10,240 GB (10 TB). You can change the value in 1 GB increments. The default value is 100 GB.
    13. (Optional) To use logging, click Select, and then ensure that Enable logging is selected.
      1. Select a log group from the list. You can change to a different compartment to specify a log group in a different compartment from the pipeline.
      2. Select one of the following to store all stdout and stderr messages:
        Enable automatic log creation

        Data Science automatically creates a log when the job starts.

        Select a log

        Select a log to use.

      3. Click Select to return to the Create pipeline page.
    14. (Optional) Click Show advanced options to add tags to the pipeline.
    15. (Optional) Enter the tag namespace (for a defined tag), key, and value to assign tags to the resource.

      To add more than one tag, click Add tag.

      Tagging describes the various tags that you can use to organize and find resources, including cost-tracking tags.

    16. Click Create.

      After the pipeline is in an active state, you can use pipeline runs to repeatedly run the pipeline.

  • You can use the OCI SDK for Python to create a pipeline, as in the following example. The environment variables that you set in the payload control the pipeline run.

    1. Create a pipeline:

      The following parameters are available to use in the payload:

      Parameter name Required Description
      Pipeline (top level)
      projectId Required The project OCID to create the pipeline in.
      compartmentId Required The compartment OCID to create the pipeline in.
      displayName Optional The name of the pipeline.
      infrastructureConfigurationDetails Optional

      Default infrastructure (compute) configuration to use for all the pipeline steps, see infrastructureConfigurationDetails for details on the supported parameters.

      Can be overridden by the pipeline run configuration.

      logConfigurationDetails Optional

      Default log to use for all the pipeline steps, see logConfigurationDetails for details on the supported parameters.

      Can be overridden by the pipeline run configuration.

      configurationDetails Optional

      Default configuration for the pipeline run, see configurationDetails for details on supported parameters.

      Can be overridden by the pipeline run configuration.

      freeformTags Optional Tags to add to the pipeline resource.
      stepDetails
      stepName Required Name of the step. Must be unique in the pipeline.
      description Optional Free text description for the step.
      stepType Required CUSTOM_SCRIPT or ML_JOB
      jobId Required* For ML_JOB steps, this is the job OCID to use for the step run.
      stepInfrastructureConfigurationDetails Optional*

      Default infrastructure (Compute) configuration to use for this step, see infrastructureConfigurationDetails for details on the supported parameters.

      Can be overridden by the pipeline run configuration.

      *Must be defined on at least one level (precedence based on priority, 1 being highest):

      1. pipeline run
      2. step
      3. pipeline

      stepConfigurationDetails Optional*

      Default configuration for the step run, see configurationDetails for details on supported parameters.

      Can be overridden by the pipeline run configuration.

      *Must be defined on at least one level (precedence based on priority, 1 being highest):

      1. pipeline run
      2. step
      3. pipeline

      dependsOn Optional List of steps that must be completed before this step begins. This creates the pipeline workflow dependencies graph.
      infrastructureConfigurationDetails
      shapeName Required Name of the Compute shape to use. For example, VM.Standard2.4.
      blockStorageSizeInGBs Required Number of GBs to use as the attached storage for the VM.
      logConfigurationDetails
      enableLogging Required Defines whether to use logging.
      logGroupId Required Log group OCID to use for the logs. The log group must be created and available when the pipeline runs.
      logId Optional* Log OCID to use for the logs when not using the enableAutoLogCreation parameter.
      enableAutoLogCreation Optional If set to True, a log for each pipeline run is created.
      configurationDetails
      type Required Only DEFAULT is supported.
      maximumRuntimeInMinutes Optional Time limit in minutes for the pipeline to run.
      environmentVariables Optional

      Environment variables to provide for the pipeline step runs.

      For example:

      "environmentVariables": {
          "CONDA_ENV_TYPE": "service"
      }

      Review the list of service supported environment variables.

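      # Payload for a pipeline with two custom script steps; the postprocess step depends on the preprocess step.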
      pipeline_payload = {
          "projectId": "<project_id>",
          "compartmentId": "<compartment_id>",
          "displayName": "<pipeline_name>",
          "pipelineInfrastructureConfigurationDetails": {
              "shapeName": "VM.Standard2.1",
              "blockStorageSizeInGBs": "50"
          },
          "pipelineLogConfigurationDetails": {
              "enableLogging": True,
              "logGroupId": "<log_group_id>",
              "logId": "<log_id>"
          },
          "pipelineDefaultConfigurationDetails": {
              "type": "DEFAULT",
              "maximumRuntimeInMinutes": 30,
              "environmentVariables": {
                  "CONDA_ENV_TYPE": "service",
                  "CONDA_ENV_SLUG": "classic_cpu"
              }
          },
          "stepDetails": [
              {
                  "stepName": "preprocess",
                  "description": "Preprocess step",
                  "stepType": "CUSTOM_SCRIPT",
                  "stepInfrastructureConfigurationDetails": {
                      "shapeName": "VM.Standard2.4",
                      "blockStorageSizeInGBs": "100"
                  },
                  "stepConfigurationDetails": {
                      "type": "DEFAULT",
                      "maximumRuntimeInMinutes": 90,
                      "environmentVariables": {
                          "STEP_RUN_ENTRYPOINT": "preprocess.py",
                          "CONDA_ENV_TYPE": "service",
                          "CONDA_ENV_SLUG": "onnx110_p37_cpu_v1"
                      }
                  }
              },
              {
                  "stepName": "postprocess",
                  "description": "Postprocess step",
                  "stepType": "CUSTOM_SCRIPT",
                  "stepInfrastructureConfigurationDetails": {
                      "shapeName": "VM.Standard2.1",
                      "blockStorageSizeInGBs": "80"
                  },
                  "stepConfigurationDetails": {
                      "type": "DEFAULT",
                      "maximumRuntimeInMinutes": 60
                  },
                  "dependsOn": ["preprocess"]
              },
          ],
          "freeformTags": {
              "freeTags": "cost center"
          }
      }
      pipeline_res = dsc.create_pipeline(pipeline_payload)
      pipeline_id = pipeline_res.data.id

      Until all pipeline step artifacts are uploaded, the pipeline remains in the CREATING state. (A sketch of creating the dsc client used in these steps, and of waiting for the pipeline to become ACTIVE, follows these steps.)

    2. Upload a step artifact:

      After an artifact is uploaded, it can't be changed.

      fstream = open(<file_name>, "rb")
       
      dsc.create_step_artifact(pipeline_id, step_name, fstream, content_disposition=f"attachment; filename={<file_name>}")
    3. Update a pipeline:

      You can only update a pipeline when it's in an ACTIVE state.

      update_pipeline_details = {
          "displayName": "pipeline-updated"
      }
      dsc.update_pipeline(<pipeline_id>, update_pipeline_details)
    4. Start a pipeline run:

      pipeline_run_payload = {
          "projectId": project_id,
          "displayName": "pipeline-run",
          "pipelineId": <pipeline_id>,
          "compartmentId": <compartment_id>,
      }
      dsc.create_pipeline_run(pipeline_run_payload)
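
    The preceding steps call methods on a Data Science client named dsc, which the example doesn't show being created. The following is a minimal sketch of creating that client with the OCI SDK for Python and then waiting for the pipeline to leave the CREATING state; the configuration file location and the timeout value are assumptions:

      import oci

      # Load credentials from the default OCI configuration file (~/.oci/config).
      config = oci.config.from_file()

      # Data Science client; referenced as dsc in the preceding steps.
      dsc = oci.data_science.DataScienceClient(config)

      # After creating the pipeline and uploading all step artifacts (steps 1 and 2),
      # optionally wait for the pipeline to reach the ACTIVE lifecycle state.
      response = oci.wait_until(
          dsc,
          dsc.get_pipeline(pipeline_id),
          "lifecycle_state",
          "ACTIVE",
          max_wait_seconds=600,
      )
      print(response.data.lifecycle_state)
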
  • The ADS SDK is also a publicly available Python library that you can install with this command:

    pip install oracle-ads

    You can use the ADS SDK to create and run pipelines.
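
    For example, the following is a minimal sketch of defining and creating a single-step pipeline with the ADS builder classes. It follows the builder pattern in the ADS documentation, but the exact class and method names can vary by oracle-ads version, and the shape, conda environment slug, script file name, and OCIDs are placeholders:

    from ads.pipeline import CustomScriptStep, Pipeline, PipelineStep, ScriptRuntime

    # Infrastructure (Compute shape and block storage) for the step.
    infrastructure = (
        CustomScriptStep()
        .with_shape_name("VM.Standard2.1")
        .with_block_storage_size(50)
    )

    # Runtime that points at the script to run and a service conda environment.
    runtime = (
        ScriptRuntime()
        .with_source("preprocess.py")
        .with_service_conda("generalml_p38_cpu_v1")
    )

    # A single custom script step; add more steps and dependencies as needed.
    step = (
        PipelineStep("preprocess")
        .with_infrastructure(infrastructure)
        .with_runtime(runtime)
    )

    pipeline = (
        Pipeline("<pipeline_name>")
        .with_compartment_id("<compartment_id>")
        .with_project_id("<project_id>")
        .with_step_details([step])
    )

    pipeline.create()              # Create the pipeline resource.
    pipeline_run = pipeline.run()  # Start a pipeline run.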