Installation on Kubernetes

This topic guides you through installing Lumada Data Catalog on Kubernetes. Review the System requirements before you proceed.

Loading Docker images

By default, the Hitachi Vantara-owned Docker images are not publicly available. These images are included in the artifacts as a TAR GZ file and must first be loaded into a private container registry using the provided ldc-load-images.sh script. You may need to log in to your private container registry before running the script.

On the terminal of your Kubernetes cluster, run the ldc-load-images.sh script, specifying your private registry and the path to where the images are stored:

./ldc-load-images.sh -r <private registry> --images <path to ldc-images TAR GZ file>
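
For example, assuming a hypothetical Azure container registry named myregistry.azurecr.io and an artifact file named ldc-images.tar.gz in the current directory, the invocation might look like the following (log in to the registry first if required):

docker login myregistry.azurecr.io
./ldc-load-images.sh -r myregistry.azurecr.io --images ./ldc-images.tar.gz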

If executed successfully, the following Docker images will be available on your private registry (note that the tags for each Docker image will vary for each Lumada Data Catalog release):

  • lumada-catalog/app-server:<app-server tag>
  • lumada-catalog/agent:<agent tag>
  • lumada-catalog/mongodb-migration-tool:<mongodb-migration-tool tag>
  • lumada-catalog/spark:<spark tag>
  • lumada-catalog/mongodb:<mongodb tag>
  • lumada-catalog/rest-server-native:<rest-server tag>

Minimal Helm chart values

You will need to customize certain values that you provide to the Helm chart. You can do this by creating a custom-values.yml file, which you can use to override default Helm chart configuration during installation.

Create this file in the same location where your Data Catalog artifacts are stored. At a minimum, some Data Catalog services must be exposed so that you can access them from your browser. Two minimal examples of exposing these services are shown below:

Exposing services using NodePort - For local debug purposes only

This example configuration will expose:

  • Keycloak
    • HTTPS Port 30843
  • App-server (Data Catalog UI)
    • HTTPS port 31083
keycloak:
  service:
    type: NodePort
    httpsType: NodePort
app-server:
  untrustedCertsPolicy: ALLOW
  service:
    type: NodePort
  keycloak:
    authServerUrl: "https://<k8s node hostname>:30843"
global:
  registry: <private registry>

Where <k8s node hostname> is the hostname of the server running the Kubernetes cluster, and <private registry> is the registry containing the loaded Data Catalog Docker images. This is an arbitrary value based on how your organization’s registry was set up. For example, an Azure container registry would follow the format myregistry.azurecr.io.
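
If you are not sure which hostname to use for <k8s node hostname>, one way to list your cluster’s nodes and their addresses (assuming you have kubectl access) is:

kubectl get nodes -o wide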

Exposing services using an Ingress controller - For production environments

For Ingress, it is assumed that your Kubernetes cluster already has the relevant DNS configuration (a domain name that you own and can create DNS records for).

In this example, the configuration will expose services on the following subdomains:

  • Keycloak via keycloak-dev1.hv.com
  • App-server (Data Catalog UI) via app-server-dev1.hv.com
keycloak:
  ingress:   
    enabled: true
    hosts:
    - host: keycloak-dev1.hv.com  
      paths:
      - path: /
        pathType: Prefix
    tls:
    - hosts:
      - "keycloak-dev1.hv.com"
      secretName: keycloak-ingress-certs
app-server:
  ingress:
    enabled: true
    hosts:
    - host: app-server-dev1.hv.com
      paths:
      - path: /
        pathType: Prefix
    tls:
    - hosts:
      - "app-server-dev1.hv.com"
      secretName: app-server-ingress-certs
  keycloak:
    authServerUrl: "https://keycloak-dev1.hv.com/auth"
global:
  registry: <private registry>

Where <private registry> is the registry containing the loaded Data Catalog Docker images. This is an arbitrary value based on how your organization’s registry was set up. For example, an Azure container registry would follow the format myregistry.azurecr.io.
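
The TLS sections above reference the secrets keycloak-ingress-certs and app-server-ingress-certs. As a sketch only, assuming you already have a certificate and key file for each host (the file names below are placeholders), such secrets could be created with kubectl before installing the chart:

kubectl create secret tls keycloak-ingress-certs --cert=keycloak.crt --key=keycloak.key -n <namespace>
kubectl create secret tls app-server-ingress-certs --cert=app-server.crt --key=app-server.key -n <namespace>

Where <namespace> is the namespace used for the Data Catalog installation.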

Customizing Helm chart values for production

You can customize Helm chart values for your production environment. Use the following guidelines.

Large properties

The large properties configuration determines where metadata from your job runs is stored. By default, the large properties location is set to the local MinIO component included in the Helm chart, which is suitable for debug purposes. For production, however, this location should be an object store owned by your organization (for example, AWS S3 or MinIO).

You can override the default large properties settings by adding the following agent configuration overrides in your custom-values.yml file. This provides the minimum settings needed to point the large properties location to your organization’s object store:

app-server:
  <other app-server custom values>
  configurationOverrides:
    - propertyKey: ldc.metadata.hdfs.large_properties.uri
      value: <large properties URI>
      component: __template_agent
    - propertyKey: ldc.metadata.hdfs.large_properties.attributes
      value:
        - fs.s3a.access.key=<access key>
        - fs.s3a.secret.key=<secret key>
        - fs.s3a.endpoint=<object store endpoint>
        - fs.s3a.path.style.access=true
        - fs.s3a.threads.max=20
        - fs.s3a.connection.maximum=200
      component: __template_agent
    - propertyKey: ldc.metadata.hdfs.large_properties.path
      value: <large properties path>
      component: __template_agent

Where:

  • <access key>, <secret key>, and <object store endpoint> refer to the connection details for your organization’s object store. This will be used by Data Catalog to connect and write to your object storage.
  • <large properties URI> is the URI of your object store. For an S3 bucket, you can set the URI to one of two formats, depending on the environment in which the agent is running:
    • For most cases: s3a://<Bucket Name>
    • For a remote agent running on EMR: s3://<Bucket Name>
  • <large properties path> refers to the location in the object store where metadata will be stored.
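
As an illustration only, with a hypothetical S3-compatible endpoint, bucket, and path (replace every value with your organization’s own details), the override block might look like this:

app-server:
  configurationOverrides:
    - propertyKey: ldc.metadata.hdfs.large_properties.uri
      value: s3a://ldc-metadata
      component: __template_agent
    - propertyKey: ldc.metadata.hdfs.large_properties.attributes
      value:
        - fs.s3a.access.key=<access key>
        - fs.s3a.secret.key=<secret key>
        - fs.s3a.endpoint=https://s3.us-east-1.amazonaws.com
        - fs.s3a.path.style.access=true
        - fs.s3a.threads.max=20
        - fs.s3a.connection.maximum=200
      component: __template_agent
    - propertyKey: ldc.metadata.hdfs.large_properties.path
      value: /large_properties
      component: __template_agent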

If you need to change a single agent’s large properties configuration, you can do this in the user interface by navigating to Management and then Configurations, and changing the following MISC values for your agent:

  • Attributes for discovery cache metadata store
  • URI for discovery cache metadata store
  • Relative location for a large properties metadata store

Spark history server

The Spark driver history location is where Spark execution logs are written. By default, the Spark history server location is the local MinIO component included in the Helm chart. You can set a new Spark history server location, such as an S3 bucket, by overriding the spark-history-server configuration in your custom-values.yml file:

spark-history-server:
  <other spark history server configuration>
  historyServer:
    events:
      s3:
        createBucket: true
        # either use existingSecretName OR access keys
        existingSecretName: "<existing secret name>"
        accessKey: "<access key>"
        secretKey: "<secret key>"
        # aws s3 session token (optional)
        sessionToken: ""
        endpoint: "<endpoint>"
        bucket: <bucket name>
        eventLogPath: "<event log path>"
        # s3 event history file system log directory
        historyPath: "<history path>"

Where:

  • <existing secret name> refers to a Kubernetes secret that holds the credentials to your S3 bucket (this would have been set up by your Kubernetes admin separately). Note that if you are using an existing secret name, accessKey and secretKey do not need to be set here.
  • <access key> and <secret key> refer to the connection details to your organization’s S3 bucket. Note that if you are using keys to access your S3 bucket, you do not need to define existingSecretName.
  • <endpoint> is the endpoint of the S3 or MinIO bucket. By default, this points to local MinIO accessible to the agent via http://<release name>-minio-bundled:9000.
  • <bucket name> is the name of the bucket where the Spark event logs are stored.
  • <event log path> and <history path> are locations in the bucket where the logs are stored. By default, these are both set to /events/.
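
As a hypothetical example that uses an existing Kubernetes secret (so accessKey and secretKey are omitted) and keeps the default /events/ paths, the override might look like this; the secret name, endpoint, and bucket are placeholders for your own values:

spark-history-server:
  historyServer:
    events:
      s3:
        createBucket: true
        existingSecretName: "spark-history-s3-credentials"
        endpoint: "https://s3.us-east-1.amazonaws.com"
        bucket: ldc-spark-history
        eventLogPath: "/events/"
        historyPath: "/events/"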

MongoDB

Though Data Catalog comes with its own MongoDB, you can point it to an external MongoDB instance by providing that instance’s URI in your custom-values.yml:

app-server:
  mongodbURI: <MongoDB URI>

Where <MongoDB URI> is in the form mongodb://<username>:<password>@<mongodb host>.
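
For example, a hypothetical external instance could be referenced as follows (host and credentials are placeholders):

app-server:
  mongodbURI: mongodb://catalog:examplePassword@mongodb.example.com:27017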

Keycloak

Though Data Catalog comes with its own Keycloak component for authentication, you can point Data Catalog to your organization’s own Keycloak instance by overriding the Keycloak configuration in your custom-values.yml:

keycloak:
  authServerUrl: <auth server url>
  callbackUrl: <callback Url>
  realm: <realm>
  clientID: <client ID>  
  clientSecret: <client secret>
  authUser: <username>
  authPass: <password>
  # -- Mapping of OAuth profile fields (on the right) to those in LDC Application (on the left)
  userFields:
    id: sub
    email: email
    username: preferred_username
    firsName: given_name
    lastName: family_name

Where:

  • <auth server url> is the (accessible) base URL of your Keycloak realm's authorization endpoint.
  • <callback Url> is the URL that Keycloak redirects the user to after successful authentication.
  • <realm> is the name of your Keycloak realm.
  • <client ID> is a value that will match your Application Name, resource, or OAuth Client Name.
  • <client secret> is the value of your OAuth client secret.
  • <username> and <password> are the credentials of the user used for role syncing.
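
As a sketch only, an override pointing Data Catalog at a hypothetical external Keycloak realm named ldc-realm might look like this; every value is a placeholder for your organization’s own Keycloak details:

keycloak:
  authServerUrl: https://keycloak.example.com/auth
  callbackUrl: https://catalog.example.com
  realm: ldc-realm
  clientID: ldc-client
  clientSecret: <client secret>
  authUser: <username>
  authPass: <password>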

JDBC JAR file location

JDBC JAR files are required by Data Catalog agents to access and process data sources. In the custom-values.yml file for the Helm chart, these JAR files are referred to as "seed JDBC" JAR files. For the local agent, custom-values.yml can be configured to pull additional seed JDBC JAR files in two ways:

  • s3 – During initialization, the agent pulls JDBC JAR files from a specified location (for example, your organization’s S3 bucket)
  • http – During initialization, the agent pulls the JAR files it requires from a provided list of download links.

This is useful if the local agent unexpectedly terminates, because a replacement agent will retrieve the required JAR files during initialization.

S3 configuration

By default, the local agent pulls JDBC JAR files from a bucket in the local MinIO component included in the Helm chart. For production, it is recommended that you set this location to your organization’s own object store, such as an AWS S3 or MinIO bucket. You can override the local agent’s JAR file location by adding the following to your custom-values.yml file:

agent:
  seedJDBC:
    sources:
      - s3
    # default values used in default secret
    s3:
      existingSecretName: "<secret name>"
      accessKey: <access key>
      secretKey: <secret key>
      bucket: <bucket name>
      path: <path>
      endpoint: "<endpoint>"

Where:

  • <access key> and <secret key> refer to the connection details to your organization’s S3 or MinIO bucket.
  • <endpoint> is the endpoint of the S3 or MinIO bucket. By default, this points to local MinIO accessible to the agent via http://<release name>-minio-bundled:9000.
  • <bucket name> is the name of the S3 or MinIO bucket.
  • <path> refers to the location in the S3 or MinIO bucket where the JAR files will be placed. By default, this is set to ext/jdbc.
  • <secret name> refers to the name of the object store. By default, this is set to <release name>-minio-bundled.

HTTP configuration

For HTTP configuration, you can provide a list of download links to different JAR files in your custom-values.yml:

agent:
  seedJDBC:
    sources:
      - http
    http:
      # array of links
      list:
      - <download link to a JDBC jar>
      - <download link to another JDBC jar>
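
For example, to have the agent download the PostgreSQL JDBC driver from Maven Central during initialization (the link is shown for illustration only; supply whichever driver JAR files your data sources require):

agent:
  seedJDBC:
    sources:
      - http
    http:
      list:
      - https://repo1.maven.org/maven2/org/postgresql/postgresql/42.7.3/postgresql-42.7.3.jar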

Agent Spark JAR staging

Not to be confused with the Spark history server, this configuration is defined on the agent's side and provides a path where the Spark driver can upload additional JAR files for executors.

By default, this location is on the local MinIO component included in the Helm chart, which can be used for debug purposes. However, as a best practice for a production environment, you should set this location to an object store owned by your organization (for example, AWS S3). You can do this by overriding the agent configuration in your custom-values.yml file:

agent:
  spark:
    jarUpload:
      endpoint: "<endpoint>"
      existingSecretName: "<secret name>"
      accessKey: <access key>
      secretKey: <secret key>
      secretToken:
      bucket: <bucket name>
      path: <path>

Where:

  • <secret name> refers to the name of the object store. By default, this is set to <release name>-minio-bundled.
  • <access key> and <secret key> refer to the connection details to your organization’s S3 or MinIO bucket.
  • <endpoint> is the endpoint of the S3 or MinIO bucket. By default, this points to local MinIO accessible to the agent via http://<release name>-minio-bundled:9000.
  • <bucket name> is the name of the bucket in the object store where the JAR files will be placed.
  • <path> refers to the location in the bucket where the JAR files will be placed. By default, this is set to /cluster_jars.

Once you have made your edits, save the file, and run a helm upgrade using the updated custom-values.yml file:

helm upgrade --wait <release name> ldc-<version number>.tgz -f custom-values.yml -n <namespace>

Where <release name> and <namespace> refer to values that were set during the initial helm installation process, and <version number> refers to the version number of your Data Catalog software.
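
For example, using the release name, namespace, and chart from the install example later in this article:

helm upgrade --wait ldc7 ldc-7.0.1.tgz -f custom-values.yml -n ldc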

REST server

The Data Catalog REST server API leverages the Quarkus platform and contains two components:

  • An API that you can use to interact with Data Catalog.
  • The Swagger UI, which contains the documentation for all supported REST server API calls. Each supported API call has an example payload that the call can consume, and a list of possible responses.

A user with relevant permissions must be authorized before any of the REST server API calls can run. If calls are run using the Swagger user interface, the page includes an Authorize button that will redirect the user to an authorization URL.

Note: The examples below assume that Keycloak and the REST server can be accessed using the same hostname.

Exposing the REST server using NodePort

This example exposes the service using a node port on HTTP port 31088. The configuration includes a variable authServerUrl that points to the Keycloak component included in the Helm chart, using the default HTTPS port 30843 and the default Keycloak realm ldc-realm.

rest-server:
  service:
    type: NodePort
    nodePort: 31088
  keycloak:
    authServerUrl: "https://<hostname>:30843/auth/realms/ldc-realm"

You can access the components for this HTTP setup as follows:

  • You can access the API endpoint at: http://<hostname>:31088/api/v1
  • You can access the Swagger UI endpoint at: http://<hostname>:31088/swagger-ui
    • Alternatively, you can retrieve the Swagger definition YAML for these calls from: http://<hostname>:31088/api-docs

Exposing the REST server using Ingress

In the example below, the REST server is exposed over HTTPS using an Ingress configuration. The configuration also includes a variable authServerUrl that points to the Keycloak component included in the Helm chart, using the default HTTPS port 30843 and the default Keycloak realm ldc-realm.

rest-server:
  keycloak:
    authServerUrl: "https://<hostname>:30843/realms/ldc-realm"
  ingress:
    enabled: true
    hosts:
    - paths:
      - path: /api/v1
        pathType: Prefix
      - path: /swagger-ui
        pathType: Prefix
      - path: /api-docs
        pathType: Prefix

Components for this setup can be accessed as follows:

  • The API endpoint can be called from https://<hostname>/api/v1.
  • The Swagger UI endpoint can be accessed at https://<hostname>/swagger-ui.
  • You can retrieve the Swagger definition YAML for these calls from https://<hostname>/api-docs.
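
For example, assuming the Ingress configuration above and a self-signed certificate, you could download the Swagger definition YAML with curl (the output file name is arbitrary):

curl -k https://<hostname>/api-docs -o ldc-rest-api.yml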

Deploy Data Catalog with Helm

In this example, a Helm install is done where the Helm release name is ldc7, the namespace is ldc, the custom values file is called custom-values.yml, and the Helm chart used is called ldc-7.0.1.tgz.

Note: This is an example only. Customize the command for your environment as necessary, such as for your version of the Data Catalog software.

Procedure

  1. Create a namespace where your Data Catalog components will reside.

    kubectl create namespace ldc
  2. Use a command like the following to deploy the chart, specifying the release name, paths to the Helm chart and custom values, and the namespace created in the previous step:

    • Example:

      helm install --wait ldc7 ldc-7.0.1.tgz -f custom-values.yml -n ldc

    Typically, this Helm install takes a few minutes to fully complete.

  3. On your web browser, confirm that you can access the Data Catalog login page. The URL will vary based on what was set in custom-values.yml:

    • If the NodePort configuration has been used, this will be at https://<hostname>:<app-server.service.httpsNodePort>
    • If the Ingress controller configuration has been used, this will depend on what was set under app-server.ingress.hosts.
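
    If the login page is not reachable, one way to confirm that all Data Catalog pods have started (using the namespace from this example) is:

      kubectl get pods -n ldc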

Uninstall

To uninstall the Helm chart, run the helm uninstall command, passing the release name and namespace that were specified during install.

helm uninstall ldc7 -n ldc

Troubleshooting

Use the following guidelines for troubleshooting your installation or uninstall issues.

Install

If a server is reused to install Data Catalog, a resource conflict message may appear when a previous release was not cleanly uninstalled (or the uninstall failed):

Error: rendered manifests contain a resource that already exists.
Unable to continue with install: existing resource conflict:

If this error appears, manually delete any pre-existing resources:

kubectl delete all -l "app.kubernetes.io/instance=ldc" -l "release=ldc7"

Uninstall

In most cases, the provided helm uninstall command should suffice. However, if the uninstall does not complete successfully, you can run the command with the uninstall hooks disabled:

helm uninstall <release name> --no-hooks -n <namespace>

If there are any orphaned or incomplete jobs, these can be found and deleted manually:

kubectl get jobs -o name | grep <release name>
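
For example, a possible way to list and then remove such jobs in one pass (verify the listed names before deleting) is:

kubectl get jobs -o name -n <namespace> | grep <release name> | xargs kubectl delete -n <namespace>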