System requirements
Lumada Data Catalog runs on an edge node in a Hadoop cluster and requires specific external components and applications to operate optimally. This article lists those components and applications, along with details of their use and the versions we support. If you have questions about your particular computing environment, contact support at https://support.pentaho.com.
Processing engine
Data Catalog uses Apache Spark for profiling jobs against HDFS and Hive data. The application code is transferred to each cluster node and is executed against the data that resides on that node. The results are collected in either the Postgres repository, the Discovery Cache, or both. Data Catalog then runs Spark jobs against the resulting metadata to determine tag association suggestions. See Discovery Cache storage for details.
Data Catalog jobs are standard Spark jobs, so your organizational practices for tuning cluster operations apply to running and tuning Data Catalog jobs.
Hadoop distributions
Data Catalog must be installed on the edge node within the Hadoop cluster. The Hadoop and Hive clients installed on this edge node must have access to the Hadoop NameNode and HiveServer2.
The following table lists the supported distributions and the compatible applications for each Hadoop distribution on the edge node.
Distribution | Components | Versions | Notes |
Cloudera CDH 6.1 | Apache Spark | 2.4 | - |
Cloudera CDH 6.1 | Solr | 8.4.1 | Installed separately |
Cloudera CDH 6.1 | HDFS | 3.0.0 | - |
Cloudera CDH 6.1 | HIVE | 2.1.1 | - |
Cloudera CDH 6.1 | Postgres | 11.9 | Installed separately |
Cloudera CDH 6.1 | Navigator | - | Certification available on demand |
Cloudera CDP 7.1.3 | Apache Spark | 2.4 | - |
Cloudera CDP 7.1.3 | Solr | 8.4.1 | - |
Cloudera CDP 7.1.3 | HDFS | 3.1.1 | - |
Cloudera CDP 7.1.3 | HIVE | 3.1.3 | - |
Cloudera CDP 7.1.3 | Postgres | 11.9 | - |
Cloudera CDP 7.1.3 | Atlas | 2.0.0 | - |
HortonWorks HDP 3.1.0 | Apache Spark | 2.3.2 | - |
HortonWorks HDP 3.1.0 | Solr | 8.4.1 | Installed separately |
HortonWorks HDP 3.1.0 | HDFS | 3.1.1 | - |
HortonWorks HDP 3.1.0 | HIVE | 3.1.0 | - |
HortonWorks HDP 3.1.0 | Postgres | 11.9 | Installed separately |
HortonWorks HDP 3.1.0 | Atlas | 1.1.0 | - |
MapR 6.1.0 (MEP, MapR Ecosystem Pack) | Apache Spark | 2.4.0 | - |
MapR 6.1.0 (MEP, MapR Ecosystem Pack) | Solr | 8.4.1 | Installed separately |
MapR 6.1.0 (MEP, MapR Ecosystem Pack) | MapR-FS | 6.1.0.20180926230239.GA | - |
MapR 6.1.0 (MEP, MapR Ecosystem Pack) | HIVE | 2.3.3 | - |
MapR 6.1.0 (MEP, MapR Ecosystem Pack) | Postgres | 11.9 | Installed separately |
AWS EMR 5.30.1 | Apache Spark | 2.4.5 | - |
AWS EMR 5.30.1 | Solr | 8.4.1 | Installed separately |
AWS EMR 5.30.1 | HDFS | 2.8.5 | - |
AWS EMR 5.30.1 | HIVE | 2.3.6 | - |
AWS EMR 5.30.1 | Postgres | 11.9 | Installed separately |
Microsoft Azure HDI 4.0 | Apache Spark | 2.4.4 | - |
Microsoft Azure HDI 4.0 | Solr | 8.4.1 | Installed separately |
Microsoft Azure HDI 4.0 | HDFS | 3.1.1 | - |
Microsoft Azure HDI 4.0 | HIVE | 3.1.0 | - |
Microsoft Azure HDI 4.0 | Postgres | 11.9 | Installed separately |
Microsoft Azure HDI 4.0 | Atlas | - | NA (HDP only) |
Indexing engine
Data Catalog uses Solr to provide an integrated repository and index solution. Configuration settings for Solr are described in Apache Solr configuration.
Distribution | Supported versions | Unsupported |
Apache Solr | 8.4.1 | 5.5.4, 6.1 and 7.5 |
Lucidworks Solr | 8.4.1 (on HDP) | 5.x, 6.1 and 7.5 |
Capacity planning
The Data Catalog stores data in three locations:
Solr storage
The Solr Repository stores the indexes required for Data Catalog searches.
Discovery Cache storage
The Discovery Cache uses HDFS or S3 storage to store the metadata of profiled resources.
Repository storage
The Postgres database stores the transactional metadata.
Solr storage
Solr storage requirements are 1.5 KB per document or Data Catalog item, including data sources, resources, fields, tags, tag domains, and tag associations. You must multiply this value by the replication factor configured for the Solr collection.
If you do not have details for calculating this volume, you can estimate it by using 10% of the size of the data you intend to catalog.
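The sizing rule above can be sketched as a quick calculation. The function name and the decimal interpretation of "1.5 KB" (1,500 bytes, consistent with the worked example later in this article) are illustrative assumptions, not part of Data Catalog:

```python
def solr_storage_bytes(num_items: int, replication_factor: int) -> int:
    """Estimate Solr storage: ~1.5 KB (1,500 bytes) per Data Catalog item
    (data sources, resources, fields, tags, tag domains, tag associations),
    multiplied by the Solr collection's replication factor."""
    BYTES_PER_ITEM = 1_500  # assumption: decimal KB, as in the example estimate
    return num_items * BYTES_PER_ITEM * replication_factor

# Example: 1 million items with a replication factor of 2 -> 3 GB.
print(solr_storage_bytes(1_000_000, 2))
```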
Discovery Cache storage
The Discovery Cache stores the metadata information that Data Catalog gathers about every resource it profiles in the data lake. This metadata forms the fingerprints of the data lake and is stored in the Discovery Cache on local storage. These fingerprints are used by the tag propagation and lineage discovery algorithms. Typically, this local storage is in HDFS, but in non-Hadoop deployments, an S3 bucket or other cloud storage can be used.
The Discovery Cache URI must be owned by the Data Catalog service user with full read, write, and execute privileges. Other non-service users must not be granted write permissions to this location. As a best practice, the Discovery Cache should be periodically backed up.
The size requirements are 4 KB per column.
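A minimal sketch of this sizing rule (hypothetical helper; uses decimal units, matching the worked example later in this article):

```python
def discovery_cache_bytes(num_columns: int) -> int:
    """Estimate Discovery Cache storage: ~4 KB (4,000 bytes) of fingerprint
    metadata per profiled column."""
    return num_columns * 4_000
```

For instance, the 53 million columns in the example estimate below come out to roughly 212 GB.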
Repository storage
Data Catalog uses a PostgreSQL database for storing audit logs and transactional data, including the sample values and sample data shown in resource views. The Data Catalog-specific data is stored in a separate database, ldc_db (default). Because this database contains sample values and sample data, it should be accessible only to the Data Catalog service users.
The repository storage requirements are 2 GB of Postgres storage per 100K resources or 1M resource fields.
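This rule can be sketched as follows. The "per 100K resources or 1M resource fields" wording is ambiguous, so the helper below assumes the larger of the two drivers governs; the function is illustrative, not part of Data Catalog:

```python
def repository_storage_gb(num_resources: int, num_fields: int) -> float:
    """Estimate Postgres repository storage: 2 GB per 100K resources or per
    1M resource fields, read here as whichever driver is larger (assumption)."""
    return 2.0 * max(num_resources / 100_000, num_fields / 1_000_000)
```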
Data at rest
Data Catalog does not prescribe a method for securing data at rest on the Postgres server. Secure this data according to your organization's policies.
Data in transit
Data Catalog supports securing data in transit when its clients communicate with the Postgres server. Secure data in transit between Data Catalog and Postgres according to the guidelines in the Postgres documentation. See Secure TCP/IP Connections with SSL.
Cluster compute resources
Data Catalog performs most of its computation as Spark jobs on the cluster. Because it profiles only new and updated files, after the existing files in a data lake have been profiled, it needs resources only to manage the ingress of new data.
Hitachi Vantara suggests a memory setting of 2 GB for the Apache YARN container.
The suggested Spark memory requirements are:
- Driver memory: 2 GB.
- Executor memory: Proportional to the size of the records; estimate roughly 12 times the size of a single row of data being processed. For example, if your data contains rows that are 100 MB, the executor memory should be at least 1200 MB. Note: A row is defined as a row of data, not a single line in the file. For a JSON file that does not include line breaks, the row is the single row of data, not the entire file.
- When you run deep profiling jobs on data sets on the order of 10,000 resources, increase executor memory to as much as 12 GB.
- When you run tag discovery jobs on data sets of 5,000 resources, increase executor memory to 12 GB.
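The executor-memory guidance above can be sketched as a small helper. The function and its `large_job` parameter are illustrative assumptions, not Data Catalog settings:

```python
import math

def suggested_executor_memory_mb(max_row_size_mb: float,
                                 large_job: bool = False) -> int:
    """Suggested Spark executor memory: roughly 12x the largest row of data.
    For deep profiling or tag discovery over thousands of resources, raise
    the floor to 12 GB, per the guidance above."""
    estimate = math.ceil(max_row_size_mb * 12)
    return max(estimate, 12 * 1024) if large_job else estimate

# Example from the text: 100 MB rows -> at least 1200 MB.
print(suggested_executor_memory_mb(100))
```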
You can manage cluster resources that Data Catalog consumes for any given operation using YARN.
For information about tuning Spark jobs, see the Tuning Spark documentation at https://spark.apache.org/docs/2.3.2/tuning#tuning-spark.
Example estimate
Data Catalog stores data in three locations:
Solr storage
1.5 KB per Data Catalog item, including data sources, resources, fields, tags, tag domains, tag associations, and so on. This value must be multiplied by the replication factor configured for the Solr collection. If you do not have details for calculating this volume, consider using 10% of the size of the data you intend to catalog.
HDFS storage
4 KB per column
PostgreSQL database
2 GB per 100K resources or 1M resource fields
The following example provides an estimate of the storage you would need to set up for a 250 TB data lake. It assumes an environment with the following conditions:
- Both Hive tables and HDFS data are included in the Data Catalog.
- All HDFS data appears in a Hive table.
- Files and tables have an average of 30 columns, which are fields in Data Catalog.
- Each table has a large set of backing files, such as daily files for multiple years.
The example data lake contains 2500 tables of 100 GB each. Each table has 700 files, and each table/file has 30 fields/columns.
- Total tables and files: 2500 tables with 700 files each = 1.75M resources.
- Tables: 2500 tables x 30 columns = 75,000 columns.
- Files: 2500 tables x 700 files per table x 30 columns = 52,500,000 columns.
- Total of about 53 million columns.
- HDFS storage: 53 M x 4 KB = 212 GB.
- Solr storage: 55 M items (resources plus fields) x 1.5 KB = 82.5 GB (x Solr collection replication factor).
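The arithmetic in this example can be reproduced directly. The document rounds the column and item counts to 53 M and 55 M before multiplying, so the exact figures below differ slightly from the rounded totals above:

```python
tables = 2_500
files_per_table = 700
columns_per_resource = 30

# Resources: each table plus its backing files.
resources = tables * (1 + files_per_table)                    # 1,752,500 (~1.75M)

# Columns: 75,000 from tables + 52,500,000 from files.
total_columns = (tables * columns_per_resource
                 + tables * files_per_table * columns_per_resource)  # ~53M

# Discovery Cache (HDFS): 4 KB per column.
hdfs_gb = total_columns * 4_000 / 1e9                         # ~210 GB (doc: 212 GB)

# Solr: ~1.5 KB per item (resources + fields), before replication.
solr_items = resources + total_columns                        # ~55M items
solr_gb = solr_items * 1_500 / 1e9                            # ~81.5 GB (doc: 82.5 GB)
```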
In addition, Data Catalog produces logs that are stored on the host node. A best practice is to plan for 200 MB a day and to store several weeks of logs, for a total of 6 to 10 GB.
Data sources
Lumada Data Catalog can process data from file systems, Hive, and relational databases. The following table lists the different types of supported data sources:
Data source type | Supported versions | Notes |
Aurora (MySQL compatible) | 5.6.10 | - |
Aurora (PostgreSQL compatible) | 11.6 | - |
Cloud/Blob storage (S3) | - | - |
Cloud/Blob storage (ADL) | - | For Azure only |
Cloud/Blob storage (WASB) | - | Certification available on demand |
Cloud/Blob storage (GCP) | - | Certification available on demand |
HDFS | As supported by the distribution | - |
HIVE | As supported by the distribution | - |
Oracle | 11g Express Edition | Release 11.2.0.2.0 - 64bit production |
MSSQL | 14.30.3030 | - |
MySQL | 5.7.29 | - |
PostgreSQL | 11.9 | - |
Redshift | - | Certification available on demand |
SAP-HANA | - | Certification available on demand |
Snowflake | 3.x | - |
Sybase | - | Certification available on demand |
Teradata | - | Certification available on demand |
Data Catalog stores the connection information for these data sources, including their access URLs and the credentials for the LDC service user, in a data sources entity. This entity stores the details of the connections established between the Data Catalog web server and engine, and the sources.
Authentication
Data Catalog integrates with your existing Linux and Hadoop authentication mechanisms. When you install Data Catalog, you can select one user authentication method from the following options: SSH, Kerberos, or LDAP.
When using SSH for user authentication, the Data Catalog Application Server communicates with the host system on the listen address and port defined in the /etc/ssh/sshd_config file. The person who installs Data Catalog needs this address and port information.
If the cluster is controlled using Kerberos authentication, there are two ways to configure Data Catalog to interact with the KDC (Key Distribution Center):
- All user interactions are controlled through Kerberos.
- Only service user operations are controlled through Kerberos.
You can configure Data Catalog to interact with LDAP for user authentication.
For more information about customizing configurations for these authentication mechanisms, see Component validations.
In addition to these basic authentication integrations, Data Catalog also supports federated authentication mechanisms. The configuration settings for these additional methods are reviewed in the related security topics.
Web browsers
Data Catalog supports major versions of web browsers that are publicly available prior to the finalization of the Lumada Data Catalog release.
Browser | Supported Version | Notes |
Google Chrome | 84.0.4147.105 (Official Build) (64-bit) | Backward version compatibility is contingent upon changing the browser libraries. Contact our support team for any specific version compatibility. |
Microsoft Edge | 42.17134.1098.0 | Backward version compatibility is contingent upon changing the browser libraries. Contact our support team for any specific version compatibility. |
Mozilla Firefox | 79.0 (64-bit) | Backward version compatibility is contingent upon changing the browser libraries. Contact our support team for any specific version compatibility. |
MacOS Safari | 13.1.2 | Catalina |
Data Catalog ports
Data Catalog uses the following ports and protocols for the specified components:
Component | Protocol | Port |
Data Catalog Application server | HTTP | 8082 |
Data Catalog Application server | HTTPS | 4039 |
Data Catalog Metadata server | HTTP | 4242 |
PostgreSQL Server | TCP or TCP-over-SSL | 5432 |
Firewall considerations
If the Data Catalog users, including service users, need to access Data Catalog across a firewall, you should allow access to the following ports at the cluster IP address.
Component | Port | Use |
Data Catalog browser application | 8082 | Grants users access to the Data Catalog browser application |
Secure Data Catalog browser application (with HTTPS) | 4039 | Grants users secure access to the Data Catalog browser application |
Metadata repository REST API endpoint | 4242 | Used internally by the Data Catalog Application Server and agent components to communicate with the repository |
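A quick way to verify that these ports are reachable through a firewall is a plain TCP connect check. This is a generic sketch, not a Data Catalog tool, and `cluster.example.com` is a placeholder for your cluster's address:

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the browser application, HTTPS, and repository API ports.
for port in (8082, 4039, 4242):
    print(port, port_reachable("cluster.example.com", port, timeout=1.0))
```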
Minimum node requirements
The following sections list the minimum node requirements for your system:
Data Catalog Application Server
The Data Catalog Application Server can be installed either on a Hadoop edge node as a traditional single cluster installation, or in a non-Hadoop virtual machine outside the cluster as a distributed, multi-cluster installation.
The requirements include:
- Minimum of 20 GB of disk space, typically 100 GB. The node does not store data or metadata, just the software and logs.
- Minimum of 4 CPUs running at least 2 to 2.5 GHz, typically 8 CPUs.
- Minimum of 6 GB of RAM, typically 16 GB.
- Bonded Gigabit Ethernet or 10 Gigabit Ethernet.
- Linux operating system with the netstat and lsof commands available.
- JDK version 1.8.x.
- Java Cryptography Extension (JCE) policy files installed, available from Oracle at www.oracle.com/technetwork/java/javase/downloads/index.html.
Agent
Agents are processing engine components that can run on multiple remote clusters, and are installed on the edge nodes of a Hadoop cluster.
The requirements include:
- Minimum of 20 GB of disk space, typically 100 GB. The node does not store data or metadata, just the software and logs.
- Minimum of 4 CPUs running at least 2 to 2.5 GHz, typically 16 CPUs if running Spark in YARN client mode.
- Minimum of 6 GB of RAM, typically 64 GB if running Spark in YARN client mode.
- Bonded Gigabit Ethernet or 10 Gigabit Ethernet.
- Linux operating system with the netstat and lsof commands available.
- JDK version 1.8.x.
- Java Cryptography Extension (JCE) policy files installed, available from Oracle at www.oracle.com/technetwork/java/javase/downloads/index.html.
Data Catalog Metadata Server
The Data Catalog Metadata Server is installed close to the Solr nodes and acts as an API endpoint for profiling jobs to communicate with the metadata repository.
The requirements include:
- Minimum of 20 GB of disk space, typically 100 GB. The node does not store data or metadata, just the software and logs.
- Minimum of 4 CPUs running at least 2 to 2.5 GHz, typically 8 CPUs.
- Minimum of 6 GB of RAM, typically 16 GB.
- Bonded Gigabit Ethernet or 10 Gigabit Ethernet.
- Linux operating system with the netstat and lsof commands available.
- JDK version 1.8.x.
Solr server
These requirements assume that the operating system and other components of the node are properly provisioned in terms of memory. The following values provide a baseline for an example Solr server environment:
- OS: 4 GB minimum.
- Solr runtime: 2 GB.
- Solr index: 8 GB. Note: This index size is proportional to the number of documents in the repository. An 8 GB index accounts for approximately 1 million Solr documents. As your index size increases, you may be able to improve performance by increasing the RAM available on the Solr server.
Multi-byte support
Data Catalog handles cluster data transparently, assuming the data is stored in formats that Data Catalog supports. Data Catalog does not enforce any additional limitations beyond what Hadoop and its components enforce. However, there are locations where the configuration of your Hadoop environment should align with the data you are managing. These locations include:
- Operating system locale.
- Character set supported by Hive client and server.
- Character set supported by Solr.
The Data Catalog browser application allows users to enter multi-byte characters to annotate HDFS data. Where Data Catalog interfaces with other applications, such as Hive, Data Catalog enforces the requirements of the integrated application.