Requirements
Pentaho Data Catalog requires specific external components and applications to operate optimally. This article provides a list of those components and applications along with details of their use and the versions Data Catalog supports.
Environment considerations
To support sound software development and deployment, maintain two separate environments:
- Development or Staging
- Production
System requirements
This section outlines the software, hardware, and access requirements you should have before you install Data Catalog.
Checklist for infrastructure requests
Perform the following tasks as needed to prepare your environment for Data Catalog:
- Request a virtual machine (VM) on Azure, on AWS, or on-premises.
- Request IDs with remote access permissions to the VM on your cloud or on-premises.
- Request necessary access to systems, applications, and data sources.
- Request VDI or VPN access for Data Catalog data engineers to enable remote access to the VM.
- Request a database user account (service account) or logins for connecting to the data sources.
- Make sure the database user account has read-only permissions for the database objects, including system catalog tables.
- Make sure that your system owner or Database Administrator (DBA) has copied or extracted any required data or files.
- Obtain an SSL certificate from a certificate authority. If required by your organization's security policy, raise an infrastructure support request for an SSL certificate. The certificate authority will give you a key file and a certificate file.
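Once the certificate authority has returned the key file and the certificate file, it can be worth confirming that the two belong together before installation. The following is a minimal sketch, assuming the `openssl` command-line tool is available and the key is not passphrase-protected. The function name `cert_matches_key` and the file names in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a certificate file and a key file form a pair by
# comparing the public key embedded in each. Not part of Data Catalog.

# cert_matches_key CERT_FILE KEY_FILE
# Returns 0 when the public keys match, 1 when they differ,
# and 2 when either file cannot be read by openssl.
cert_matches_key() {
  local cert_pub key_pub
  # Extract the public key from the certificate (PEM output).
  cert_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 2
  # Derive the public key from the private key (PEM output).
  key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 2
  [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
}
```

For example, `cert_matches_key server.crt server.key && echo "pair OK"` prints a message only when the two files match.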
Hardware requirements
Your server and network must meet the following requirements:
Category | Description |
CPU | 16 cores (minimum), 32 cores (recommended) |
RAM | 64 GB (minimum), 128 GB (recommended) |
Disk storage | 1 TB (minimum) |
Network | 1 Gbps |
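The CPU and RAM minimums above can be checked on a Linux host with a short script. This is a sketch rather than an official preflight tool: the function names are hypothetical, and the RAM threshold of 60 GiB is an assumption that allows for the memory the kernel reserves on a nominal 64 GB machine.

```shell
#!/usr/bin/env bash
# Sketch: compare this host's CPU and RAM with the documented minimums.

# check_min LABEL ACTUAL REQUIRED: prints PASS or FAIL; returns 1 on FAIL.
check_min() {
  local label=$1 actual=$2 required=$3
  if [ "$actual" -ge "$required" ]; then
    echo "PASS: $label $actual (minimum $required)"
  else
    echo "FAIL: $label $actual (minimum $required)"
    return 1
  fi
}

# kib_to_gib KIB: prints the value in GiB, rounded to the nearest integer.
kib_to_gib() {
  echo $(( ($1 + 524288) / 1048576 ))
}

# preflight_cpu_ram: checks the local host; returns 1 if any check fails.
preflight_cpu_ram() {
  local cores mem_kib mem_gib status=0
  cores=$(nproc)
  mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  mem_gib=$(kib_to_gib "$mem_kib")
  check_min "CPU cores" "$cores" 16 || status=1
  # Assumption: 60 GiB is accepted for a nominal 64 GB host, because
  # MemTotal excludes memory reserved by the kernel and firmware.
  check_min "RAM (GiB)" "$mem_gib" 60 || status=1
  return "$status"
}
```

Source the file and run `preflight_cpu_ram` on the target server to print one PASS or FAIL line per check.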
If the server is running on AWS or Azure, review the following requirements.
AWS EC2 details
An AWS EC2 virtual machine (VM) has the following requirements:
Category | Minimum Requirements | Preferred Requirements |
Size | m5.4xlarge | m5.8xlarge |
vCPU | 16 cores | 32 cores |
Memory | 64 GB | 128 GB |
Azure VM details
An Azure VM has the following requirements:
Category | Minimum Requirements | Preferred Requirements |
Size | B_16s_v2 | B_32s_v2 |
vCPU | 16 cores | 32 cores |
Memory | 64 GB | 128 GB |
Server storage requirements
The server file systems and storage must meet the following requirements:
- At least 10 GB of storage should be allocated for the root file system.
- Ample storage should be mounted in the designated Docker storage area (typically /var/lib/docker, the default on Linux servers).
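The storage requirements above can be inspected with `df`. The sketch below checks the 10 GB root file system minimum and reports the size of the file system that holds the Docker storage area. The function names are hypothetical, and the default path `/var/lib/docker` is Docker's standard data directory on Linux; pass a different path if your server is configured otherwise.

```shell
#!/usr/bin/env bash
# Sketch: report file system sizes relevant to the storage requirements.

# total_gib_from_df: reads `df -Pk` output on stdin and prints the total
# size of the first listed file system in GiB, rounded to the nearest integer.
total_gib_from_df() {
  awk 'NR == 2 { print int($2 / 1048576 + 0.5) }'
}

# fs_size_gib PATH: total size of the file system that holds PATH.
fs_size_gib() {
  df -Pk "$1" | total_gib_from_df
}

# preflight_storage [DOCKER_DIR]: checks the root file system minimum and
# reports the size of the file system behind the Docker storage area.
preflight_storage() {
  local docker_dir=${1:-/var/lib/docker} root_gib status=0
  root_gib=$(fs_size_gib /)
  if [ "$root_gib" -ge 10 ]; then
    echo "PASS: root file system is ${root_gib} GiB"
  else
    echo "FAIL: root file system is ${root_gib} GiB (minimum 10)"
    status=1
  fi
  if [ -d "$docker_dir" ]; then
    echo "INFO: $docker_dir is on a $(fs_size_gib "$docker_dir") GiB file system"
  else
    echo "INFO: $docker_dir does not exist yet"
  fi
  return "$status"
}
```

Run `preflight_storage` with no arguments to use the default Docker directory.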
Operating system requirements
Data Catalog must run on dedicated servers in a hosting environment. The hosting environment can be on-premises or in the cloud, on platforms such as Azure or AWS.
The server must run one of the following amd64 architecture Linux operating systems:
- Amazon Linux 2 (AWS only)
- CentOS 7 or 8
Linux kernel version
Version 4.0 or higher of the Linux kernel is required. For RHEL, use version 3.10.0-514 of the kernel or a higher version.
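The kernel requirement can be checked by comparing the output of `uname -r` against the documented floor. The following sketch relies on the version ordering of `sort -V` from GNU coreutils. The function names are hypothetical, and the caller decides whether the RHEL floor applies.

```shell
#!/usr/bin/env bash
# Sketch: test a kernel release string against the documented minimums.

# version_ge ACTUAL REQUIRED: returns 0 when ACTUAL sorts at or after
# REQUIRED in version order.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# kernel_meets_requirement RELEASE [IS_RHEL]
# Uses 3.10.0-514 as the floor when IS_RHEL is "yes", otherwise 4.0.
kernel_meets_requirement() {
  local release=$1 is_rhel=${2:-no}
  if [ "$is_rhel" = "yes" ]; then
    version_ge "$release" "3.10.0-514"
  else
    version_ge "$release" "4.0"
  fi
}
```

On the server, `kernel_meets_requirement "$(uname -r)"` checks the general floor, and `kernel_meets_requirement "$(uname -r)" yes` checks the RHEL floor.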
- If the server uses an XFS file system, d_type=true must be enabled.
- To verify that the ftype option is set to 1, run the xfs_info command and check the output. To format an XFS file system correctly, use the flag -n ftype=1.
- Docker must start automatically whenever the dedicated server is restarted. Enable automatic startup by running the following commands:
sudo systemctl enable docker.service
sudo systemctl enable containerd.service
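The ftype check described above can be scripted by extracting the value from the `xfs_info` output. This is a sketch with hypothetical function names; it assumes the `naming` line of the output contains a field of the form `ftype=N`.

```shell
#!/usr/bin/env bash
# Sketch: read the ftype setting from xfs_info output.

# ftype_from_xfs_info: reads xfs_info output on stdin and prints the
# ftype value, or nothing if the field is absent.
ftype_from_xfs_info() {
  sed -n 's/.*ftype=\([0-9]\).*/\1/p' | head -n 1
}

# xfs_ftype_enabled MOUNT_POINT: returns 0 when the XFS file system
# mounted at MOUNT_POINT reports ftype=1.
xfs_ftype_enabled() {
  [ "$(xfs_info "$1" | ftype_from_xfs_info)" = "1" ]
}
```

For example, `xfs_ftype_enabled /var/lib/docker || echo "ftype is not 1"` flags a file system that would need to be reformatted with `-n ftype=1`.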
Network security and firewall requirements
The network security and firewall must meet the following requirements:
- Ports 80 and 443 should be open for inbound traffic.
- The application server must have network connectivity to the database server and port.
- HTTPS connections use port 443. If desired, you can obtain an SSL certificate from a certificate authority.
User account
The server user account used for the installation must either be the root user or have appropriate permissions to run Docker. To set up Docker permissions for non-root users, see the official Docker documentation at https://docs.docker.com/engine/install/linux-postinstall/.
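Whether an account satisfies this requirement can be checked before installation. The sketch below assumes the non-root setup from the Docker post-installation guide, in which permitted users belong to the `docker` group. The function name `can_run_docker` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: decide whether an account is root or a member of the docker group.

# can_run_docker USER_ID GROUP_NAMES
# USER_ID is a numeric ID; GROUP_NAMES is a space-separated list.
# Returns 0 for root (ID 0) or when "docker" is one of the groups.
can_run_docker() {
  local uid=$1 groups=" $2 "
  [ "$uid" -eq 0 ] && return 0
  case "$groups" in
    *" docker "*) return 0 ;;
  esac
  return 1
}
```

For the current user, call `can_run_docker "$(id -u)" "$(id -nG)"`. Group changes take effect only after the user logs in again.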
Software requirements
Before you install Data Catalog, Docker must already be installed on the server and configured to start on boot. See the official Docker documentation at https://docs.docker.com/engine/install/ for instructions on installing Docker.
Name | Requirements |
Docker | Version 22.0+ |
Docker Compose | Version 2.22.0+ |
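The installed versions can be compared with the table above by parsing the output of `docker --version` and `docker compose version`. This sketch uses hypothetical function names and the version ordering of `sort -V` from GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch: extract version numbers from Docker CLI output and compare them
# with the documented minimums (Docker 22.0, Docker Compose 2.22.0).

# first_version: reads text on stdin and prints the first dotted
# version number it contains.
first_version() {
  grep -Eo '[0-9]+(\.[0-9]+)+' | head -n 1
}

# version_ge ACTUAL REQUIRED: returns 0 when ACTUAL sorts at or after
# REQUIRED in version order.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# docker_versions_ok DOCKER_TEXT COMPOSE_TEXT
# Each argument is the raw text printed by the corresponding command.
docker_versions_ok() {
  local docker_v compose_v
  docker_v=$(printf '%s\n' "$1" | first_version)
  compose_v=$(printf '%s\n' "$2" | first_version)
  [ -n "$docker_v" ] && [ -n "$compose_v" ] || return 1
  version_ge "$docker_v" "22.0" && version_ge "$compose_v" "2.22.0"
}
```

On the server, run `docker_versions_ok "$(docker --version)" "$(docker compose version)"` and check the exit status.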
Additional software
For seamless SSH connectivity and secure file transfer between your machine and the server, it is a best practice to install the following software on your machine:
- An SSH client. PuTTY, a widely used SSH client for Windows, is recommended.
- WinSCP for a graphical user interface to securely transfer files between the client and the server using SSH.
Data source connectivity
The following table contains the supported data sources and respective requirements to connect with Data Catalog.
Data source | Requirements |
AWS S3 | |
Azure Blob Storage | |
HCP | |
RDBMS | To enable Data Catalog to perform data profiling, grant read-only access to all database objects and system catalog tables. |
SMB/CIFS | |
(Optional) Client Virtual Desktop Infrastructure (VDI)
The following table contains the client’s VDI requirements.
Category | Requirements |
Server configuration | |
Disk or storage | |
Others | |