Jobs

Troubleshoot your jobs and job runs.

Can't create log object on behalf of the user Error During Job Run Creation

If job run creation fails and you see the following lifecycle details:

The specified log group is not found or not authorized. Cannot create log object on behalf of the user.
Ensure the log group is valid and the user has appropriate permissions configured
Incorrect Log Group OCID

Ensure that the log group OCID specified in the job run create configuration is correct.

Incorrect Permissions

You are missing permissions. The user creating the job run must have permissions on log groups and log content. These permissions ensure that the user can access the specified log group and log object, and they let the service create a new log object on behalf of the user when enableAutoLogCreation is enabled.

allow group <group-name> to manage log-groups in compartment <log-compartment-name>
allow group <group-name> to use log-content in compartment <log-compartment-name>

Common mistakes are:

  • Only giving the user use permissions on log groups. The manage permission is required when enableAutoLogCreation is enabled.
  • Allowing the wrong group. The group refers to the group that the creator of the job run belongs to. If you are creating job runs using instance principals, the policies must reference the dynamic group instead, for example:
    allow dynamic-group <instance-principal-dynamic-group-name> to manage log-groups in compartment <log-compartment-name>
    allow dynamic-group <instance-principal-dynamic-group-name> to use log-content in compartment <log-compartment-name>

Bring Your Own Container Job Run Failure When Downloading the Image

If a bring your own container job run fails with errors when downloading the image, check the following:

  • The host might be missing from the image path. The correct format for the image path is <region-key>.ocir.io/<tenancy-namespace>/<repository-name>:<tag>. A common mistake is omitting the first portion of the path, the registry host (see the sketch after this list).
  • The container image is in a different region than the job run. Data Science jobs don't support pulling images from OCIR across regions; ensure that the container image is in the same region as the job run.
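
As a quick sanity check, this Python sketch verifies that an image reference includes the registry host in the documented format; the region key iad and the repository name are hypothetical examples:

import re

# Mirrors the documented format:
# <region-key>.ocir.io/<tenancy-namespace>/<repository-name>:<tag>
IMAGE_PATTERN = re.compile(r"^[a-z0-9-]+\.ocir\.io/[^/]+/[^:]+:[\w.\-]+$")

def has_valid_image_path(image: str) -> bool:
    return bool(IMAGE_PATTERN.match(image))

print(has_valid_image_path("iad.ocir.io/mytenancy/ml-jobs:1.0"))  # True
print(has_valid_image_path("mytenancy/ml-jobs:1.0"))              # False: host is missing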

Why Isn't Fast Launch an Option in the Console When Creating a Job

The fast launch option is available only in regions that support it. Not all regions and realms support this feature; for example, it is generally not supported in Dedicated Region Cloud@Customer (DRCC) realms.

The same is true for the ListFastLaunchJobConfigs API endpoint. The API returns the list of fast launch options, so in regions where fast launch isn't supported the response is an error or an empty list.
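
A minimal sketch for checking availability programmatically with the OCI Python SDK, assuming the SDK's standard snake_case method name for this operation and a configuration file at ~/.oci/config (the compartment OCID is a placeholder):

import oci

config = oci.config.from_file()  # reads the DEFAULT profile from ~/.oci/config
client = oci.data_science.DataScienceClient(config)

try:
    # <compartment-ocid> is a placeholder; substitute your own compartment.
    response = client.list_fast_launch_job_configs(compartment_id="<compartment-ocid>")
    if response.data:
        for fast_launch_config in response.data:
            print(fast_launch_config)  # an available fast launch shape configuration
    else:
        print("Fast launch isn't supported in this region (empty list).")
except oci.exceptions.ServiceError as error:
    print(f"Fast launch isn't supported in this region: {error.code}")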

There is currently no capacity for the specified shape Error

If this error appears in the lifecycle details when creating a job run, there is currently no capacity for the requested shape. Retry later, try another region, or use a different shape family.
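
A minimal sketch for detecting this condition with the OCI Python SDK, assuming an existing job run OCID (placeholder below):

import time
import oci

config = oci.config.from_file()
client = oci.data_science.DataScienceClient(config)

# <job-run-ocid> is a placeholder for the run that failed.
run = client.get_job_run("<job-run-ocid>").data
if run.lifecycle_state == "FAILED" and "no capacity" in (run.lifecycle_details or ""):
    # Back off before recreating the run, or switch region or shape family.
    time.sleep(300)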

401 NotAuthenticated Error When Making Requests to the Data Science API

This type of error is unrelated to the Data Science service. Rather, it's an issue with how the request is created and signed on the client side.

If you are using a user principal to make the request, some common mistakes are:

  • Having invalid API keys; see assigning keys.
  • Making a request immediately after uploading a public key. The identity information needs time to propagate across the regions in a realm; this typically completes within five minutes, though occasionally more time is required.
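
One way to isolate signing problems from the Data Science service is to validate the configuration and make a lightweight signed request with the OCI Python SDK, as in this sketch:

import oci

config = oci.config.from_file()     # loads user, tenancy, fingerprint, and key file
oci.config.validate_config(config)  # raises on a malformed configuration

# Listing region subscriptions exercises request signing end to end; a 401 here
# confirms the problem is the keys or configuration, not the Data Science API.
identity = oci.identity.IdentityClient(config)
regions = identity.list_region_subscriptions(config["tenancy"]).data
print([region.region_name for region in regions])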

Job Run Logging Integration Is Enabled but Logs Aren't Generated

A job run can be created successfully and reach the IN_PROGRESS state, yet no logs appear in the log object. Typically, this occurs when policies are missing or incorrect: the job run must have permission to write to the job run log.

First, define a dynamic group for the job run resource:

all { resource.type='datasciencejobrun', resource.compartment.id='<job-run-compartment-ocid>' }

Then set this dynamic group access:

allow dynamic-group <job-runs-dynamic-group> to use log-content in compartment <log-compartment-name>

Some common mistakes are:

  • An incorrect compartment is specified. Notice that the compartments referenced in the preceding policies are different.
    • For the dynamic group definition, it is the compartment of the job run.
    • For the policy statement for access to log content, it is the compartment of the log.
  • Defining the dynamic group using the compartment.id instead of the resource.compartment.id.
  • An incorrect resource type was included in the dynamic group definition. Often the dynamic group was defined for the notebook session resource and doesn't include the job run resource. The datasciencejobrun resource principal is what writes logs for job run logging integration, so it must be included in the dynamic group definition.

Job Run Logging Integration Is Enabled but the Logs Appear Truncated

Data Science jobs support integration with the OCI Logging service for automatic logging. If the logs appear truncated or incomplete, it is likely because of the following Logging service limits:

  • Each entry must be less than 1MB.
  • No log data field can exceed 10,000 characters.

If the data exceeds these limits, then the log entry is truncated during ingestion.
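
If a job emits very long lines, one workaround is to split the output into chunks below the field limit before printing, as in this minimal sketch:

# No single log field can exceed 10,000 characters, so split long messages
# before printing them.
MAX_FIELD_CHARS = 10_000

def log_chunked(message: str, limit: int = MAX_FIELD_CHARS) -> None:
    for start in range(0, len(message), limit):
        print(message[start : start + limit])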

Job Run Metrics Have No Data

If you don't see job run metrics during or after job processing, the required policies are likely not configured. Ensure that you have the following policy:

allow group <user-group-name> to read metrics in compartment <compartment-name>

The compartment is the compartment of the job run.
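
To confirm the policy works outside the Console, you can query the Monitoring service directly with the OCI Python SDK. In this sketch the metric namespace and metric name are placeholders; look up the actual values in the Data Science metrics documentation:

import datetime
import oci

config = oci.config.from_file()
monitoring = oci.monitoring.MonitoringClient(config)

details = oci.monitoring.models.SummarizeMetricsDataDetails(
    namespace="<metric-namespace>",
    query="<metric-name>[1m].mean()",
    start_time=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
    end_time=datetime.datetime.utcnow(),
)
response = monitoring.summarize_metrics_data(
    compartment_id="<job-run-compartment-ocid>",
    summarize_metrics_data_details=details,
)
print(response.data)  # an authorization error raised here points at the missing policy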

Job run artifact execution failed with exit code ___ Error

This means that the job code exited with the indicated exit code. Enable logging integration, and ensure that the code contains sufficient log statements to debug the issue.
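
For example, wrapping the job's entry point as follows makes failures visible in the logs and exits with an explicit code (main is a stand-in for your job logic):

import sys
import traceback

def main() -> None:
    ...  # your job logic goes here

if __name__ == "__main__":
    try:
        main()
    except Exception:
        traceback.print_exc()  # written to the job run log when logging is enabled
        sys.exit(1)            # the exit code surfaced in the lifecycle details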

Job Run Exit Code Isn't Indicated

Jobs report the exit code when a job run fails. This information is available in the job run's lifecycle details field and is supported for all job runs, including bring your own container job runs.

If the exit code that you know the job run failed with isn't reported there, the exit code likely isn't being propagated correctly.

Some common mistakes are:

  • If you are using a shell script as an entry point that launches other files (for example, other Python files), the shell script must capture the exit code from the internal file execution and then exit with that captured code (a Python equivalent is sketched after this list).
  • Throwing exceptions might not be sufficient. The file run (or container for bring your own container) must explicitly exit with an exit code. In Python, this is done using sys.exit(ERROR_CODE).
  • Using an incorrect type for the exit code value. Typically, the incorrect type used is a string. Exit codes must be integers between 1 and 255, as described in Job with Exit Codes.
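
For reference, this sketch shows the same propagation pattern in a Python launcher; train.py is a hypothetical inner file:

import subprocess
import sys

# Run the inner file and capture its exit code.
result = subprocess.run([sys.executable, "train.py"])

# Exit with the captured code so the job run's lifecycle details can report it.
sys.exit(result.returncode)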

Job Run Invalid Entry Point

Setting JOB_RUN_ENTRYPOINT to a file that doesn't exist, or that isn't at the specified location, results in this error:

Job run bootstrap failure: invalid job run entry point (JOB_RUN_ENTRYPOINT).
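
JOB_RUN_ENTRYPOINT must match a path inside the uploaded job artifact. A minimal sketch of setting it with the OCI Python SDK, where entry/main.py is a hypothetical path within a zip artifact:

import oci

# JOB_RUN_ENTRYPOINT must point at a file that exists inside the job artifact;
# entry/main.py is a hypothetical path within an uploaded zip artifact.
job_configuration = oci.data_science.models.DefaultJobConfigurationDetails(
    environment_variables={"JOB_RUN_ENTRYPOINT": "entry/main.py"},
)
print(job_configuration)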