
Running PDI-CLI on AWS

You can use PDI-CLI images on AWS to run jobs with the Kitchen command and transformations with the Pan command.

Prerequisites for installing PDI-CLI on AWS

Observe the following prerequisites before installing Pentaho:

  • A stable version of Docker must be installed on your workstation.
  • You must have an AWS account to complete this installation.
  • Amazon AWS CLI must be installed on your workstation.
  • The following software versions are supported:

    Application   Supported version
    Docker        v20.10.21 or a later stable version
    AWS CLI       v2.x
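
    As a quick check, you can confirm both tools are installed and report supported versions from a shell (the sample output in the comments is illustrative):

    docker --version   # e.g. Docker version 20.10.21
    aws --version      # e.g. aws-cli/2.9.0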

Process overview for running PDI-CLI on AWS

Use the following steps to deploy PDI-CLI on the AWS cloud platform:

  1. Download and extract Pentaho for AWS
  2. Create an Amazon ECR
  3. Load and push the Pentaho Docker image to ECR
  4. Create an S3 bucket
  5. Configure and execute PDI-CLI

Download and extract Pentaho for AWS

Download and extract the package files needed to install Pentaho.

Procedure

  1. Navigate to the Support Portal and download the AWS version of the Docker image with the corresponding license file for the applications you want to install on your workstation.

  2. Extract the image to view the directories and the readme file.

    The image package file (<package-name>.tar.gz) contains the following:
    Directory or file name   Content description
    image                    Directory containing all the Pentaho source images.
    yaml                     Directory containing YAML configuration files and various utility files.
    README.md                File containing a link to detailed information about what is provided in this release.

Create an Amazon ECR for PDI-CLI

Before pushing the Pentaho image to AWS, you need to create an Amazon ECR repository.

For information on how to create an Amazon ECR, see instructions for creating a private repository on AWS.
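
If you prefer the command line to the AWS Management Console, you can also create the repository with the AWS CLI. A minimal sketch; the repository name and region below are examples to replace with your own:

    # Create a private ECR repository (name and region are examples)
    aws ecr create-repository --repository-name pentaho/pdi-cli --region us-east-1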

Load and push the Pentaho Docker image to ECR

Select and tag the Pentaho Docker image and then push it to the ECR registry.

Procedure

  1. Navigate to the image directory containing the Pentaho tar.gz files.

  2. Select and load the tar.gz file into the local registry by running the following command:

    docker load -i <pentaho-image>.tar.gz
  3. Record the name of the source image that was loaded into the registry by using the following command:

    docker images
  4. Tag the source image so it can be pushed to the cloud platform by using the following command:

    docker tag <source-image>:<tag> <target-repository>:<tag>
  5. Push the image file into the ECR registry using the following Docker command:

    docker push <target-repository>:<tag>
    The AWS Management Console displays the uploaded image URI.

    For general AWS instructions on how to push an image to AWS, see Pushing a Docker image.

  6. Record the newly created ECR repository URI in the Worksheet for AWS hyperscaler.
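
Pushing to ECR requires an authenticated Docker session, which the steps above assume. The following sketch shows the full sequence; the account ID, region, file name, repository name, and tag are placeholder values to replace with your own:

    # Authenticate Docker to your private ECR registry
    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

    # Load, tag, and push the Pentaho image
    docker load -i pentaho-pdi-cli.tar.gz
    docker tag pentaho/pdi-cli:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/pentaho/pdi-cli:latest
    docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/pentaho/pdi-cli:latest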

Create an S3 bucket for PDI-CLI

You must create an S3 bucket to deploy the PDI-CLI image on AWS.

Procedure

  1. Create an S3 bucket.

    To create an S3 bucket, see Creating a bucket. To upload a file to S3, see Uploading objects.
  2. Record the newly created S3 bucket name in the Worksheet for AWS hyperscaler.

  3. Upload files into the S3 bucket.

    After the S3 bucket is created, manually create any needed directories as shown in the table below and upload the relevant files to the appropriate directory location by using the AWS Management Console (example AWS CLI commands are sketched after these tables). The relevant Pentaho directories are explained below:

    Directory   Actions

    /root
      All the files in the root of the S3 bucket are copied to the /home/pentaho/data-integration/data directory.
      If you need to copy a file to the /home/pentaho/data-integration/data directory, drop the file in the root directory of the S3 bucket.
      This directory contains the Pentaho license files.

    jdbc-drivers
      If your Pentaho installation needs JDBC drivers, do the following:
      1. Add the jdbc-drivers directory to your S3 bucket.
      2. Place the drivers in this directory.
      Any files within this directory are copied to Pentaho's lib directory.

    plugins
      If your Pentaho installation needs additional plugins installed, do the following:
      1. Add the plugins directory to your S3 bucket.
      2. Copy your plugins to the plugins directory.
      Any files within this directory are copied to Pentaho's plugins directory. For this reason, the plugins should be organized in their own directories as expected by Pentaho.

    metastore
      Pentaho can execute jobs and transformations, some of which require additional information that is usually stored in the Pentaho metastore.
      If you need to provide your Pentaho metastore, copy your local .pentaho directory to the metastore directory of the S3 bucket (you can name the directory something else by passing a variable). From there, the content of the .pentaho directory is copied to the /home/pentaho/.pentaho folder within the Docker image.

    The relevant Pentaho files are explained below:

    File   Actions

    content-config.properties
      The content-config.properties file is used by the Pentaho Docker image to specify which S3 files to copy over and where to place them.
      The instructions are populated as multiple lines in the following format:

        ${KETTLE_HOME_DIR}/<some-dir-or-file>=${APP_DIR}/<some-dir>

      A template for this file can be found in the templates project directory. The template has an entry where the file context.xml is copied to the required location within the Docker image:

        ${KETTLE_HOME_DIR}/context.xml=${APP_DIR}/context.xml

    content-config.sh
      This is a bash script that can be used to configure files, change file and directory ownership, move files around, install missing apps, and so on.
      You can add it to the S3 bucket. It is executed in the Docker image after the other files are processed.
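
As referenced above, you can also create and populate the bucket from the AWS CLI instead of the Management Console. A minimal sketch; the bucket name, region, and local file layout are assumptions to replace with your own:

    # Create the bucket (name and region are examples)
    aws s3 mb s3://my-pentaho-bucket --region us-east-1

    # Upload license files to the bucket root
    aws s3 cp ./licenses/ s3://my-pentaho-bucket/ --recursive

    # Upload JDBC drivers, plugins, and the local metastore to their directories
    aws s3 cp ./jdbc-drivers/ s3://my-pentaho-bucket/jdbc-drivers/ --recursive
    aws s3 cp ./plugins/ s3://my-pentaho-bucket/plugins/ --recursive
    aws s3 cp ~/.pentaho/ s3://my-pentaho-bucket/metastore/ --recursive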
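
If you use content-config.sh, it is an ordinary bash script run inside the image. The sketch below is illustrative only; the file names, paths, and ownership shown are assumptions, not part of the shipped template:

    #!/bin/bash
    # Executed in the Docker image after the S3 content has been processed.

    # Example: make an uploaded helper script executable
    # (run_extra_step.sh is a hypothetical file name)
    chmod +x /home/pentaho/data-integration/data/run_extra_step.sh

    # Example: fix ownership of copied files
    # (the pentaho user and lib path are assumptions about the image layout)
    chown -R pentaho:pentaho /home/pentaho/data-integration/lib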

Configure and execute PDI-CLI

Configure and run AWS Batch using the PDI-CLI image.

Refer to the AWS instructions for the following steps at Getting Started with AWS Batch.

Procedure

  1. Navigate to the AWS Batch home page.

  2. Create a compute environment by selecting Compute environments and following the instructions.

  3. Create a job queue by selecting Job queues and following the instructions.

  4. Create a job definition by selecting Job definitions and following the instructions.

    Provide the image name (the ECR image URI recorded in the worksheet) in the section for configuring the container.
  5. Create a job by selecting Jobs and following the instructions.

    In the Environment Variables section, configure the following variables:

    Variable   Description

    PROJECT_S3_LOCATION
      Configures the S3 path from which the data is downloaded and then copied into the container.
      Example: Set PROJECT_S3_LOCATION to s3://pentaho-samples/

    METASTORE_LOCATION
      Configures the path from which the metastore content and configuration are downloaded and then copied to /home/pentaho/.pentaho in the container.
      Example: Set METASTORE_LOCATION to metastore

    PROJECT_STARTUP_JOB
      Path of the KJB file to execute.
      Example: Set PROJECT_STARTUP_JOB to jobs/run_job_write_to_s3/read_csv_from_s3_job.kjb
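
    Once the job queue and job definition exist, you can also submit a job from the AWS CLI. A minimal sketch; the job, queue, and definition names are hypothetical:

    aws batch submit-job \
      --job-name pdi-cli-sample \
      --job-queue pdi-cli-queue \
      --job-definition pdi-cli-jobdef \
      --container-overrides '{"environment": [
        {"name": "PROJECT_S3_LOCATION", "value": "s3://pentaho-samples/"},
        {"name": "METASTORE_LOCATION", "value": "metastore"},
        {"name": "PROJECT_STARTUP_JOB", "value": "jobs/run_job_write_to_s3/read_csv_from_s3_job.kjb"}]}'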

Results

You can now run Pentaho transformations and jobs using PDI-CLI.

Worksheet for AWS hyperscaler

Use the following worksheet for important information needed during installation and configuration of Pentaho.

Variable           Record your setting

ECR_IMAGE_URI
  (only Platform/PDI Server and Carte Server)

RDS_HOSTNAME
  (only Platform/PDI Server and Carte Server)

RDS_PORT
  (only Platform/PDI Server and Carte Server)

S3_BUCKET_NAME

EKS_CLUSTER_NAME
  (only Platform/PDI Server and Carte Server)