Hitachi Vantara Lumada and Pentaho Documentation

System requirements

The Lumada Data Catalog software builds a metadata catalog from data assets residing in tools and databases. It profiles the data assets to produce field-level data quality statistics and to identify representative data so users can efficiently analyze the content and quality of the data.

Data Catalog requires specific external components and applications to operate optimally. This article provides a list of those components and applications along with details of their use and the versions Data Catalog supports. If you have questions about your particular computing environment, contact Hitachi Vantara Lumada and Pentaho Support.


Kubernetes cluster requirements

You need a Kubernetes cluster to host the Data Catalog components. The following lists the requirements for the Kubernetes installation.

Minimum hardware requirements
  • 16 GB RAM
  • 8 cores
  • 100 GB storage
  • Operating system: there is no hard requirement, though Linux or a similar operating system is typically used.
Kubernetes cluster
  • Kubernetes version 1.23
Software for cluster
  • Helm version 3.8.x
  • kubectl version 1.21
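A quick way to confirm your tooling meets these minimums is a shell version comparison. The `version_ge` helper below is written for this sketch (it is not part of Data Catalog), and the hard-coded version strings are hypothetical stand-ins for values you would parse from `helm version --short` and `kubectl version --client`:

```shell
# version_ge A B: succeeds when dotted version A >= B.
# Helper written for this sketch; relies on `sort -V` (GNU coreutils).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Hypothetical installed versions; in practice parse them from
# `helm version --short` and `kubectl version --client`.
HELM_VER="3.8.2"
KUBECTL_VER="1.21.14"

version_ge "$HELM_VER" "3.8.0"   && echo "helm meets the 3.8.x minimum"
version_ge "$KUBECTL_VER" "1.21" && echo "kubectl meets the 1.21 minimum"
```

Running the real commands on your workstation before the Helm deployment avoids discovering a version mismatch mid-install.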

For installation instructions, see Installation on Kubernetes.

Web browsers

Data Catalog supports major versions of web browsers that are publicly available.

  • Google Chrome* (Recommended): 110.0.5481.100 (Official Build) (x86_64)
  • iOS Safari*: 16.3 (17614., 17614)
  • Microsoft Edge*: 110.0.1587.50 (Official Build) (x86_64)
  • Mozilla Firefox*: 110.0 (64-bit)

* Backward version compatibility depends on the changes made in browser libraries. Contact Hitachi Vantara Lumada and Pentaho Support for any specific version compatibility.


Identity and access management

For Data Catalog, Keycloak 20.0.1 is the default identity and access management tool, which allows you to create a user database with custom roles and groups. Keycloak is installed during the Helm deployment of Data Catalog.
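As a sketch of how a client would talk to the bundled Keycloak, the snippet below builds the standard Keycloak OpenID Connect token endpoint URL. The host, realm name, client ID, and credentials are hypothetical placeholders; port 30843 is the default authentication port listed on this page:

```shell
# Hypothetical placeholders: replace host and realm with your values.
KC_HOST="${KC_HOST:-datacatalog.example.com}"
REALM="${REALM:-ldc}"

# Keycloak 17+ serves realms at /realms (no /auth prefix).
TOKEN_URL="https://${KC_HOST}:30843/realms/${REALM}/protocol/openid-connect/token"
echo "Token endpoint: $TOKEN_URL"

# Example password-grant request (client ID, user, and password are
# illustrative only; uncomment and adjust for your realm):
# curl -k -s "$TOKEN_URL" \
#   -d grant_type=password -d client_id=ldc-client \
#   -d username=alice -d password=secret
```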

Port and firewall requirements

If Data Catalog users, including service users, need to access Data Catalog across a firewall, allow access to the following ports at the cluster IP address.

Note: The ports listed below are the default ports that Data Catalog uses when installed. You can change these ports during installation. If you need to change them after installation, contact Hitachi Vantara Lumada and Pentaho Support.

  • Secure Data Catalog browser application (HTTPS), port 31083: Grants users access to the Data Catalog application.
  • Authentication and access management, port 30843: Manages user authentication and access through Keycloak.
  • Metadata repository (MongoDB), port 30017: Stores metadata collected from processing functions in a MongoDB repository.
  • Object storage (MinIO), port 30900: Serves as object storage for debugging purposes.
  • Metadata repository REST API endpoint (HTTP), port 31088: Used internally by the Lumada Data Catalog Application Server and agent components to communicate with the repository.
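A firewall check for these defaults can be sketched with `nc`. The cluster address below is a placeholder (it defaults to loopback purely for illustration); substitute your cluster IP, and note that a "blocked or closed" result can mean either a firewall rule or a service that is not running:

```shell
# Hypothetical cluster address; replace with your cluster IP.
CLUSTER_IP="${CLUSTER_IP:-127.0.0.1}"

# Default Data Catalog NodePorts from the table above.
PORTS="31083 30843 30017 30900 31088"

for port in $PORTS; do
  # -z: scan without sending data; -w 1: one-second timeout.
  if nc -z -w 1 "$CLUSTER_IP" "$port" 2>/dev/null; then
    echo "port $port reachable"
  else
    echo "port $port blocked or closed"
  fi
done
```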

Data sources

Data Catalog can process data from different data sources. The following are the supported data source types and versions:

  • MS SQL Server: 2019
  • Oracle: 12, 19c, 21c
  • Snowflake: 3.13 (as per the JDBC JAR)
  • Document DB: MongoDB
  • Object store: Amazon S3; Hitachi Content Platform (HCP) 9.4.0; Microsoft Azure Data Lake Storage (ADLS)
  • Columnar DB: Apache Hive 3.1.3 (for EMR and CDP), 3.1.0 (for HDP)
  • Hadoop Distributed File System (HDFS): 3.2.1 (for EMR), 3.1.1 (for CDP and HDP)
  • Hitachi Network-Attached Storage Platform (HNAS): NFS v3, v4; SMB v2, v3
  • Vertica: 10.1ce, 11.1ce
  • Virtual DB: Denodo 8

Remote agent

A Lumada Data Catalog Agent is responsible for initiating, executing, and monitoring jobs that communicate with the data sources, process the data, and create fingerprints. Refer to the following sections for remote agent installation requirements, supported distributions, and Kerberos environments.


Installation requirements

The following are the requirements for installing a remote agent:

  • 8 cores
  • 64 GB RAM
  • 100 GB storage
  • Data Catalog has already been set up on a Kubernetes cluster
  • The server hosting your remote agent can connect to the Data Catalog server
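The last requirement, connectivity from the agent host to the Data Catalog server, can be verified with a quick probe before running the agent installer. The host name below is a hypothetical placeholder, and 31083 is the default HTTPS application port from the port table earlier on this page:

```shell
# Hypothetical placeholders; replace with your Data Catalog endpoint.
DC_HOST="${DC_HOST:-datacatalog.example.com}"
DC_PORT="${DC_PORT:-31083}"

echo "Checking https://${DC_HOST}:${DC_PORT}/ ..."
# -k: tolerate a self-signed certificate; any HTTP status (even 4xx)
# proves the host and port are reachable through the firewall.
curl -k -s --connect-timeout 3 -o /dev/null \
  -w 'HTTP status: %{http_code}\n' "https://${DC_HOST}:${DC_PORT}/" \
  || echo "cannot reach ${DC_HOST}:${DC_PORT}; check firewall rules"
```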

Distributions

Remote agent setup supports the following distributions, each with its own requirements. Choose the distribution that fits your Data Catalog setup:

  • Amazon Elastic Map Reduce (EMR): version 6.7.0
    Note: When prompted by the remote agent script, set up the remote agent using the Data Catalog service user for Hadoop.
  • Cloudera Data Platform (CDP): version 7.1.3

See Configure Data Catalog for CDP for configuration information.

Kerberos environments

You can also enable Kerberos on your remote agent server. Kerberos-enabled environments add extra security between your remote agent and the Data Catalog cluster, so additional configuration is required.

Kerberos has the following requirements:

  • Your Hadoop administrator has created a service user on your environment.
  • A Kerberos keytab file has already been set up for your service user on the Kerberos machine.
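With those two prerequisites in place, the agent host typically acquires a ticket from the keytab before jobs run. The principal name and keytab path below are hypothetical examples; use the values your Hadoop administrator provided:

```shell
# Hypothetical service principal and keytab path; substitute the
# values created by your Hadoop administrator.
PRINCIPAL="${PRINCIPAL:-ldc-service@EXAMPLE.COM}"
KEYTAB="${KEYTAB:-/etc/security/keytabs/ldc-service.keytab}"

if command -v kinit >/dev/null 2>&1 && [ -r "$KEYTAB" ]; then
  # Acquire a ticket non-interactively from the keytab, then show it.
  kinit -kt "$KEYTAB" "$PRINCIPAL" && klist
else
  echo "kinit or keytab unavailable; run this on the Kerberos-enabled agent host"
fi
```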

See Remote agent for more information.