Hitachi Vantara Lumada and Pentaho Documentation

Installing the Platform Server or PDI Server on AWS

These instructions provide the steps necessary to deploy Docker images of the Platform Server or PDI Server on AWS.

Prerequisites for installing the Platform or PDI Server on AWS

Observe the following prerequisites before installing the Platform or PDI Server:

  • A stable version of Docker must be installed on your workstation.
  • You must have an AWS account to complete this installation.
  • Amazon AWS CLI must be installed on your workstation.
  • The following software versions are supported:

    Application    Supported version
    EKS            v1.x
    Docker         v20.10.21 or a later stable version
    AWS CLI        v2.x
    Python         v3.x
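
You can quickly confirm the installed tool versions from your workstation console, for example:

    docker --version
    aws --version
    python3 --version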

Process overview for installing the Platform or PDI Server on AWS

Use the following steps to deploy the Platform Server or PDI Server on the AWS cloud platform:

  1. Download and extract Platform or PDI Server for AWS
  2. Create an Amazon ECR
  3. Load and push the Pentaho Docker image to ECR
  4. Create an RDS database
  5. Create an S3 bucket
  6. Create an EKS cluster and add a node group
  7. Install the Platform or PDI Server on AWS

You can also perform the following operations:

  • Update the Platform or PDI Server licenses on AWS
  • Dynamically update server configuration content from S3

Download and extract Platform or PDI Server for AWS

Download and open the package files that contain the files needed to install Pentaho.

Procedure

  1. Navigate to the Support Portal and download to your workstation the AWS version of the Docker image and the corresponding license file for the applications you want to install.

  2. Extract the package to view the directories and the readme file.

    The image package file (<package-name>.tar.gz) contains the following:
    Directory or file name    Content description
    image                     Directory containing all the Pentaho source images.
    sql-scripts               Directory containing SQL scripts for various operations.
    yaml                      Directory containing YAML configuration files and various utility files.
    README.md                 File containing a link to detailed information about what we are providing for this release.

Create an Amazon ECR

Before pushing the Pentaho image to AWS, you need to create an Amazon Elastic Container Registry (ECR) repository.

Procedure

  1. Create an ECR repository to load the Pentaho image.

    For information on how to create an Amazon ECR, see instructions for creating a private repository on AWS.
  2. Record the name of the ECR repository that you have created in the Worksheet for AWS hyperscaler.
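
If you prefer the AWS CLI to the console, a repository can also be created with a command such as the following; the repository name and region are placeholders for your own values:

    aws ecr create-repository --repository-name <pentaho-repository> --region us-east-1

The command output includes the repository URI, which you also record in the next section.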

Load and push the Pentaho Docker image to ECR

Select and tag the Pentaho Docker image and then push it to the ECR registry.

Procedure

  1. Navigate to the image directory containing the Pentaho tar.gz files.

  2. Select and load the tar.gz file into the local registry by running the following command:

    docker load -i <pentaho-image>.tar.gz
  3. Record the name of the source image that was loaded into the registry by using the following command:

    docker images
  4. Tag the source image so it can be pushed to the cloud platform by using the following command:

    docker tag <source-image>:<tag> <target-repository>:<tag>
  5. Push the image file into the ECR registry using the following Docker command:

    docker push <target-repository>:<tag>
    The AWS Management Console displays the uploaded image URI.

    For general AWS instructions on how to push an image to AWS, see Pushing a Docker image.

  6. Record the newly created ECR repository URI in the Worksheet for AWS hyperscaler.
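
Put together, and including the ECR authentication that AWS requires before a push, steps 2 through 5 might look like the following; the account ID, region, repository name, image file name, and tag are placeholders for your own values:

    # Authenticate Docker to the ECR registry.
    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 111122223333.dkr.ecr.us-east-1.amazonaws.com

    # Load, tag, and push the Pentaho image.
    docker load -i <pentaho-image>.tar.gz
    docker tag <source-image>:<tag> 111122223333.dkr.ecr.us-east-1.amazonaws.com/<repository>:<tag>
    docker push 111122223333.dkr.ecr.us-east-1.amazonaws.com/<repository>:<tag>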

Create an RDS database

Use these instructions to create a Relational Database Service (RDS) database in AWS.

Procedure

  1. Create an RDS PostgreSQL database for Pentaho to use as its repository database.

    See the AWS instructions at Creating and connecting to a PostgreSQL DB instance and apply the settings in the table below.
    Section            Actions
    Create database

    Choose Standard create.

    Select the PostgreSQL engine.

    Set the engine version to a PostgreSQL version supported by the Components Reference, such as PostgreSQL 13.5-R1.

    Templates

    It is recommended to select the Free tier option.

    Note: For this installation, the Free tier PostgreSQL database is used with a set of options as an example. However, you are free to use other database servers with different options as necessary.
    Settings

    Set the DB instance identifier.

    Retain the default user name postgres and set the Master password.

    Use the default password authentication setting.

    Use the default values for the rest of the settings in this section.

    Instance configuration

    Use the default settings for each section.

    Storage

    Use the default settings for each section.

    Connectivity

    Set the Virtual private cloud (VPC) and the DB subnet group to any of the options available to you. If in doubt, use the default values.

    Select Public access.

    Make sure that the VPC security groups selected have a rule enabling communication to the database through the PostgreSQL port, which is 5432 by default.

    For other options, use the default settings.

    Database authentication

    Use the default setting Password authentication.

  2. Run the scripts in the sql-scripts folder of the distribution in numbered order (one way to run them is sketched after this procedure).

  3. From the AWS Management Console connection tab, record the database Endpoint and Port number in the Worksheet for AWS hyperscaler.
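
One way to run the scripts from a workstation with the PostgreSQL client installed is sketched below; the endpoint is the value you record in the worksheet, and the loop assumes the numeric prefixes of the script files sort in the intended order:

    # Run each SQL script against the RDS instance in numbered order.
    for script in sql-scripts/*.sql; do
      psql -h <rds-endpoint> -p 5432 -U postgres -f "$script"
    done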

Create an S3 bucket

Create an S3 bucket only if you want to do one or more of the following. Otherwise, proceed to Create an EKS cluster and add a node group.

  • Store the Pentaho license. (Alternatively, you can store the license in the Kubernetes secret. In that case, do not store the license in the S3 bucket as storing a license in both places is not supported by Pentaho.)
  • Add third party JAR files like JDBC drivers or custom JAR files for Pentaho to use.
  • Customize the default Pentaho configuration.
  • Replace the server files.
  • Upload or update the metastore.
  • Add files to the Platform and PDI Server's /home/pentaho/.kettle directory. This is mapped to the "KETTLE_HOME_DIR" environment variable, which is used by the content-config.properties file.

Procedure

  1. Create an S3 bucket.

    To create an S3 bucket, see Creating a bucket. To upload a file to S3, see Uploading objects.
  2. Record the newly created S3 bucket name in the Worksheet for AWS hyperscaler.

  3. Upload files into the S3 bucket.

    After the S3 bucket is created, manually create any needed directories as shown in the table below and upload the relevant files to an appropriate directory location by using the AWS Management Console. The relevant Pentaho directories are explained below:
    Directory            Actions
    /root

    All the files in the S3 bucket are copied to the Platform and PDI Server's /home/pentaho/.kettle directory.

    If you need to copy a file to the /home/pentaho/.kettle directory, drop the file in the root directory of the S3 bucket.

    licenses

    The licenses directory contains the Pentaho license files. However, the Server Secret Generation Tool documented in Install the Platform or PDI Server on AWS automatically retrieves the needed license file from the proper location, as long as you download the license file with the image distribution as described in Download and extract Platform or PDI Server for AWS.

    Without the license file, the server asks for it the first time you connect to Pentaho. You can provide the file, but it will not be persisted, and the server will ask for it every time you reboot.

    The license file can be found in your local .pentaho directory.

    custom-lib

    If Pentaho needs custom JAR libraries, add the custom-lib directory to your S3 bucket and place the libraries there.

    Any files within this directory will be copied to Pentaho’s lib directory.

    jdbc-drivers

    If your Pentaho installation needs JDBC drivers, do the following:

    1. Add the jdbc-drivers directory to your S3 bucket.
    2. Place the drivers in this directory.

    Any files within this directory will be copied to Pentaho’s lib directory.

    plugins

    If your Pentaho installation needs additional plugins installed, do the following:

    1. Add the plugins directory to your S3 bucket.
    2. Copy your plugins to the plugins directory.

    Any files within this directory are copied to Pentaho’s plugins directory. For this reason, the plugins should be organized in their own directories as expected by Pentaho.

    drivers

    If your Pentaho installation needs big data drivers installed, do the following:

    1. Add the drivers directory to your S3 bucket.
    2. Place the big data drivers in this directory.

    Any files placed within this directory will be copied to Pentaho’s drivers directory.

    metastore

    Pentaho can execute jobs and transformations. Some of these require additional information that is usually stored in the Pentaho metastore.

    If you need to provide your Pentaho metastore to Pentaho, copy your local metastore directory to the root of the S3 bucket. From there, the metastore directory is copied to the proper location within the Docker image.

    server-structured-override

    The server-structured-override directory is the last resort if you want to make changes to any other files in the image at runtime.

    For example, you could use it for configuring authentication and authorization.

    Any files and directories within this directory will be copied to the pentaho-server directory the same way they appear in the server-structured-override directory.

    If the same files exist in the pentaho-server directory, they will be overwritten.

    The relevant Pentaho files are explained below:
    File            Actions
    context.xml

    The Pentaho configuration YAML is included with the image in the templates project directory and is used to install this product. You must set the RDS host and RDS port parameters when you install Pentaho. Upon installation, the parameters in the configuration YAML are used to generate a custom context.xml file for the Pentaho installation so it can connect to the database-specific repository.

    If these are the only changes required in your context.xml, you don’t need to provide a context.xml in your S3 bucket. On the other hand, if you need to configure additional parameters in your context.xml, provide your customized context.xml file in your S3 bucket.

    In the context.xml template, replace the <RDS_HOST_NAME> and <RDS_PORT> entries with the values you recorded on the Worksheet for AWS hyperscaler.

    content-config.properties

    The content-config.properties file is used by the Pentaho Docker image to provide instructions on which S3 files to copy over and where to place them.

    The instructions are populated as multiple lines in the following format:

    ${KETTLE_HOME_DIR}/<some-dir-or-file>=${SERVER_DIR}/<some-dir>

    A template for this file can be found in the templates project directory. A minimal example is also shown after this table.

    The template has an entry where the file context.xml is copied to the required location within the Docker image:

    ${KETTLE_HOME_DIR}/context.xml=${SERVER_DIR}/tomcat/webapps/pentaho/META-INF/context.xml
    content-config.sh

    This is a bash script that can be used to configure files, change file and directory ownership, move files around, install missing apps, and so on.

    You can add it to the S3 bucket.

    It is executed in the Docker image after the other files are processed.

    metastore.zip

    Pentaho can execute jobs and transformations. Some of these require additional information that is usually stored in the Pentaho metastore.

    If you need to provide your Pentaho metastore to Pentaho, zip the content of your local .pentaho directory with the name metastore.zip and add it to the root of the S3 bucket. The metastore.zip file is extracted to the proper location within the Docker image.

    Note: VFS connections cannot be copied from PDI to the hyperscaler server the same way as named connections. You need to connect to Pentaho on the hyperscaler and create the new VFS connection there.
    For instructions on how to dynamically update server configuration content from the S3 bucket, see Dynamically update server configuration content from S3.
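
As an illustration of the content-config.properties format described above, a minimal file might contain only the context.xml entry from the template plus one additional entry; the second line and its file name are hypothetical:

    ${KETTLE_HOME_DIR}/context.xml=${SERVER_DIR}/tomcat/webapps/pentaho/META-INF/context.xml
    ${KETTLE_HOME_DIR}/my-custom.properties=${SERVER_DIR}/pentaho-solutions/system/my-custom.properties

Files can be uploaded to the bucket with the AWS Management Console or with the AWS CLI, for example (the bucket name is a placeholder):

    aws s3 cp content-config.properties s3://<my-pentaho-bucket>/content-config.properties
    aws s3 cp licenses/ s3://<my-pentaho-bucket>/licenses/ --recursive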

Create an EKS cluster and add a node group

Use Amazon Elastic Kubernetes Service (EKS) to create a cluster for running the Platform or PDI Server.

Procedure

  1. Create an EKS cluster on AWS.

    For instructions, see Create an Amazon EKS cluster. For a beginner's introduction to EKS, see Getting started with Amazon EKS. For information about creating roles to delegate permissions to an AWS service, see Create a role.
    Settings            Actions
    Cluster service role

    You can select any existing role, as long as the following policies are attached to the role:

    • AmazonEKSClusterPolicy
    • AmazonS3FullAccess
    • AmazonEKSServicePolicy
    VPC

    In the Networking section, do the following:

    1. Select an existing VPC. The selected VPC populates a group of subnets. The VPC should be created before you create a computing or cloud stack.
    2. Make sure that the Auto-assign public IPv4 address property under subnets is set to Yes.
    Cluster endpoint access

    Select the Public and private option.

    Amazon VPC CNI

    CoreDNS

    kube-proxy

    Select all three EKS add-ons with their default configurations.

  2. Record the newly created EKS cluster name in the Worksheet for AWS hyperscaler.

  3. On the Compute tab under Node groups, add a node group to the EKS cluster by clicking Add node group.

    Note: The EKS cluster must be in the Active state before you start creating a node group. For further instructions, see Create a managed node group.
  4. In the Node group configuration section, add the group Name.

  5. Select a Node IAM role from the list or create a new role. Make sure that the role contains the following policies:

    • AmazonS3FullAccess
    • AmazonEC2ContainerRegistryReadOnly
    • AmazonEKSWorkerNodePolicy
    • AmazonEKS_CNI_Policy
  6. Set the instance type to one that has at least 8 GB of memory.

  7. In the Node group scaling configuration section, set the value for Desired size, Minimum size, and Maximum size to the desired number of nodes.

  8. In the Node group network configuration section, select the subnets for your node group.

  9. For the subnets, set the Auto-assign public IPv4 address property to Yes.

    For further instructions, contact your AWS administrator or see IP addressing for your VPCs and subnets.
  10. Select a load balancer.

    For instructions on how to create an AWS Application load balancer, see Application load balancing on Amazon EKS.
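
These steps use the AWS Management Console. If you prefer the command line, eksctl can create a roughly equivalent cluster and managed node group in one command; this is only a sketch with placeholder values, and you still need to verify the IAM policies, add-ons, subnet settings, and load balancer described above:

    eksctl create cluster \
      --name <my-eks-cluster> \
      --region us-east-1 \
      --nodegroup-name pentaho-nodes \
      --node-type m5.large \
      --nodes 2 --nodes-min 1 --nodes-max 3

The m5.large instance type is used here only because it meets the 8 GB memory minimum; substitute any type that satisfies that requirement.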

Install the Platform or PDI Server on AWS

When your AWS environment is properly configured, you can proceed to install the Platform or PDI Server.

Procedure

  1. Retrieve the kubeconfig from the EKS cluster.

    In the workstation console, obtain the kubeconfig from the EKS cluster you created by using the following command:
    aws eks update-kubeconfig --name <my_eks_cluster_name> --region <my_EKS_region>
  2. To configure the Platform or PDI Server YAML file, open the file pentaho-server-aws-rds-<lb-type>.yaml in the yaml project directory.

    lb-type    When to use
    alb        Use if you installed the AWS Application load balancer.
    nginx      Use if you have installed the NGINX Ingress Controller.
  3. Add the Pentaho license by running one of the following scripts in the distribution.

    • Windows: Run start.bat
    • Linux: Run start.sh
    This opens the Server Secret Generation Tool.
  4. Complete the configuration page of the Server Secret Generation Tool by adding the license files and using the values you recorded in the Worksheet for AWS hyperscaler.

    Server Secret Generation Tool configuration page
  5. Click Generate Yaml.

  6. Retrieve the Platform or PDI Server entry point URI information.

    The Platform or PDI Server entry point URI information can be retrieved by running either of the following commands on the workstation console:
    kubectl get ingress -n pentaho-server
    or
    echo $( kubectl get ingress -n pentaho-server -o jsonpath='{.items..hostname}' )
    The port number is 80 by default.
  7. If the Pentaho license file is stored in an S3 bucket, deploy the Platform or PDI Server using the following command:

    kubectl apply -f <path to Pentaho deployment YAML>
  8. Test the Platform or PDI Server by retrieving the LoadBalancer Ingress URI.

    This is done by running the following command in your workstation console:
    echo $( kubectl get ingress -n pentaho-server -o jsonpath='{.items..hostname}' )
    Note: The port number for this load balancer is 80, not 8080.
  9. Use the URI that you received in the previous step in a Pentaho-supported browser to open the Platform or PDI Server login screen and access the Platform or PDI Server.

    Field       Default value
    Username    admin
    Password    password
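
Before logging in, you can optionally confirm from the workstation console that the deployment is healthy; a quick check, assuming the pentaho-server namespace used in the steps above (the pod name is a placeholder shown by the first command):

    kubectl get pods -n pentaho-server
    kubectl logs -n pentaho-server <pentaho-server-pod-name>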

Update the Platform or PDI Server licenses on AWS

How your license is refreshed or updated depends on how it is stored.

Execute the following instructions from within each of the active Platform or PDI Server replicas:
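
To open a shell inside a running replica, you can use kubectl exec. This is a sketch, assuming the pods run in the pentaho-server namespace used during installation; look up the pod name first and substitute it for the placeholder:

    kubectl get pods -n pentaho-server
    kubectl exec -it -n pentaho-server <pentaho-server-pod-name> -- /bin/bash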

Update a license when stored in an S3 bucket

Complete the following steps to update the license:

Procedure

  1. Navigate to the /home/pentaho directory within the Docker container.

  2. Run the load-s3-data.sh script.

  3. Run the copy-license.sh script.

Update a license when stored in a Kubernetes secret

If your license is stored in a Kubernetes secret, complete the following steps to update the license:

Procedure

  1. Navigate to the /home/pentaho directory within the Docker container.

  2. Run the copy-license.sh script.

Dynamically update server configuration content from S3

If the content of the S3 bucket has changed and you need to reflect those changes in the Platform or PDI Server, follow these instructions:

Before you begin

Before deploying the Platform or PDI Server, set the value of the allow_live_config property in the file pentaho-server-aws-rds.yaml to true.

Procedure

  1. Navigate to the relevant directory where the configuration needs to be updated.

  2. Prepare the configuration update script shown in the next step by setting the <config_command> part of the script to one of the command options in the following table.

    Command option    Description
    load_from_s3      Copies the content of the bucket to the server's /home/pentaho/.kettle directory.
    restart           Restarts the Platform or PDI Server without restarting the pod.
    update_config     Executes load_from_s3, executes all the configuration and initialization scripts, and then executes the restart command.
    update_license    Executes load_from_s3 and updates the Platform or PDI Server license from the Kubernetes secret or S3 bucket.
    Note: Using the restart or update_config command causes a server restart, which disrupts the Platform or PDI Server's use of sticky sessions and impacts user sessions.
  3. Run the configuration update script.

    Note: If you have multiple Platform or PDI Server replicas, remove the comment marker (#) in front of sleep 60 in the script below.
    for pod in $( kubectl get pods -o name -n pentaho-server )
    do
      echo "Forwarding port on pod: $pod"
      pid=$( kubectl port-forward -n pentaho-server $pod 8090:8090 1>/dev/null & echo $! )
      while ! nc -z localhost 8090; do
        sleep 0.1
      done
      echo "Executing command ..."
      result=$( curl http://localhost:8090/<config_command> )
      echo "Command result: $result"
      echo "Killing port forward pid: $pid"
      while kill -9 $pid 2>/dev/null; do
        sleep 1
      done
      # sleep 60
    done
  4. Check that the servers restart properly.
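
One way to confirm a clean restart is to tail the server log of each replica; note that the restart command restarts the server process without restarting the pod, so kubectl get pods will not show a pod restart (the pod name is a placeholder):

    kubectl logs -f -n pentaho-server <pentaho-server-pod-name>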

Worksheet for AWS hyperscaler

To access the common worksheet for the AWS hyperscaler, go to Worksheet for AWS hyperscaler.