Hitachi Vantara Lumada and Pentaho Documentation

Apache Solr configuration

Lumada Data Catalog provides an integrated repository and indexing solution using Apache Solr. This solution eliminates the complexity of synchronizing a stand-alone database with Apache Lucene indexes, while providing expanded opportunities for managing repository metadata at scale and for integrating Data Catalog into Hadoop cluster failover systems.

Lumada Data Catalog is certified for Solr version 8.4.1, which can be downloaded from the Apache archives website. Solr versions later than 8.4.1 must be certified before use. Contact the Hitachi Vantara Lumada and Pentaho Support Portal for compatibility and certification requests.
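
As a quick way to compare a running Solr instance against the certified version, the following sketch classifies the version string found in a Solr system information response. It assumes the response was fetched as JSON from Solr's system info endpoint (typically `/solr/admin/info/system?wt=json`) and that the version appears under `lucene` > `solr-spec-version`. The function names and the status labels are illustrative and are not part of Data Catalog.

```python
import json

# Certified version stated in this article.
CERTIFIED_VERSION = (8, 4, 1)


def parse_version(text):
    """Turn '8.4.1' into (8, 4, 1), ignoring suffixes such as '-SNAPSHOT'."""
    core = text.split("-", 1)[0]
    parts = [int(part) for part in core.split(".")]
    # Pad short versions such as '8.4' to three components.
    return tuple((parts + [0, 0, 0])[:3])


def certification_status(system_info_json):
    """Classify the Solr version reported in a system info JSON document."""
    info = json.loads(system_info_json)
    version = parse_version(info["lucene"]["solr-spec-version"])
    if version == CERTIFIED_VERSION:
        return "certified"
    if version > CERTIFIED_VERSION:
        return "needs certification"
    return "below certified version"


# Hypothetical, trimmed-down response body used only for illustration.
sample = '{"lucene": {"solr-spec-version": "8.4.1"}}'
print(certification_status(sample))
```

A result of "needs certification" corresponds to the case above where you should contact support before using that Solr version.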

Choosing your configuration

Hadoop distributions ship different versions of Solr, and Data Catalog supports a variety of installation configurations, so you need to decide which Solr setup is correct for your environment. Based on testing across various configurations, Hitachi Vantara offers the following best practices for selecting a configuration that fits your needs and environment.

Solr on your distribution

Support for Solr varies by Hadoop distribution:

  • CDH/CDP includes Solr in the baseline of services.
  • HDP can accommodate Solr as a managed service when added to an existing HDP installation.
  • AWS-EMR and Microsoft Azure do not have any Solr integration.

Generally, if Solr is already running on your cluster, you should use your existing configuration. To support Data Catalog, add a collection to the existing setup.
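
Adding a collection to an existing SolrCloud setup is usually done through Solr's Collections API. The sketch below only builds the request URL for the `CREATE` action, so it can be reviewed before anything is sent. The host name, collection name, and configset name are hypothetical placeholders, not values required by Data Catalog, and the Collections API applies to SolrCloud mode only.

```python
from urllib.parse import urlencode


def create_collection_url(solr_base_url, name, config_name,
                          num_shards=1, replication_factor=1):
    """Build a Collections API CREATE URL for an existing SolrCloud cluster."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_name,
    }
    return f"{solr_base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Hypothetical host, collection, and configset names for illustration only.
url = create_collection_url(
    "http://solr.example.com:8983/solr", "ldc_collection", "ldc_config"
)
print(url)
```

Choose the shard count and replication factor to match your existing cluster. The defaults of 1 here are placeholders rather than a recommendation.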

Note: If you are upgrading to the current release of Data Catalog on CDH 5.x distributions, you must install Solr 8.4.1 or later on a separate node.

If your cluster is not already running Solr, we recommend that you use Apache Solr version 8.4.1, which can be downloaded from Apache.

Solr Security

Data Catalog uses Solr's integrated repository for storing catalog search-related resource metadata and aggregate data. Although Data Catalog does not store any raw data in the Solr repository, securing access to Solr should be in accordance with the policies employed for Hadoop storage at the organization level. As a general rule, if you require HDFS to be secure, then you should secure Solr as well.

Solr version 8.2.0 and later is secure by default. For Solr versions prior to version 8.2.0, take the following steps to mitigate security vulnerabilities:

  • Configure the network settings to only allow trusted traffic from the Lumada Data Catalog Application Server and Lumada Data Catalog Metadata Server to communicate with Solr, especially with the DataImportHandler.
  • Edit the solrconfig.xml file to configure all DataImportHandler usages with an invariants section that specifies an empty string for the dataConfig parameter.

Note: If you are installing the latest version of Data Catalog, you must use Solr version 8.4.1.
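
For Solr versions earlier than 8.2.0, the solrconfig.xml mitigation described above can be checked mechanically. The following sketch scans a solrconfig.xml document and lists every DataImportHandler request handler that lacks an `invariants` section with an empty `dataConfig` parameter. The sample XML and the handler names in it are made up for illustration, so adapt the check to your actual configuration.

```python
import xml.etree.ElementTree as ET


def unprotected_dih_handlers(solrconfig_xml):
    """Return names of DataImportHandler handlers lacking an empty dataConfig invariant."""
    root = ET.fromstring(solrconfig_xml)
    flagged = []
    for handler in root.iter("requestHandler"):
        if "DataImportHandler" not in handler.get("class", ""):
            continue
        protected = False
        for lst in handler.findall("lst"):
            if lst.get("name") != "invariants":
                continue
            for param in lst.findall("str"):
                # The invariant must pin dataConfig to an empty string.
                if param.get("name") == "dataConfig" and not (param.text or "").strip():
                    protected = True
        if not protected:
            flagged.append(handler.get("name"))
    return flagged


# Hypothetical configuration: the first handler is mitigated, the second is not.
SAMPLE = """
<config>
  <requestHandler name="/dataimport" class="solr.DataImportHandler">
    <lst name="invariants">
      <str name="dataConfig"></str>
    </lst>
  </requestHandler>
  <requestHandler name="/dataimport-open" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>
</config>
"""
print(unprotected_dih_handlers(SAMPLE))
```

An empty result means every DataImportHandler in the file carries the invariant. Any name that is listed still needs the `invariants` section added.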

Kerberos integration

Solr can be included under your cluster's Kerberos security umbrella. If you are protecting your HDFS with Kerberos, then Solr needs to participate in the Kerberos ecosystem. Installing Solr outside of Kerberos may cause incompatibilities and performance issues.

SSL support

Solr can be configured with SSL for secure communication between clients and nodes. For more information, see the Apache Lucene guidelines.

SolrCloud mode

Solr can be configured in standalone mode or in SolrCloud (distributed) mode. Standalone mode runs a single Solr server on a single node. You can configure it to distribute collection data across multiple servers by setting up primary and replica nodes for replication. Standalone mode is simpler to install and maintain than SolrCloud mode.

The SolrCloud mode uses ZooKeeper to handle synchronization and failover among a set of Solr servers. For enterprise systems, configuring SolrCloud offers advantages such as:

  • Metadata and index replication among servers.
  • Failover.
  • Load balancing.
  • Distributed queries.
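
To illustrate what replication and failover mean operationally, the following sketch inspects a parsed response from the Collections API `CLUSTERSTATUS` action and reports shards that currently have no active replica. It assumes the usual response layout (`cluster` > `collections` > collection name > `shards` > shard name > `replicas`). The collection, shard, and replica names in the sample are hypothetical.

```python
def shards_without_active_replica(cluster_status, collection):
    """Return the sorted names of shards that have no replica in the 'active' state."""
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    at_risk = []
    for shard_name, shard in shards.items():
        states = [replica.get("state") for replica in shard.get("replicas", {}).values()]
        if "active" not in states:
            at_risk.append(shard_name)
    return sorted(at_risk)


# Hypothetical, trimmed-down CLUSTERSTATUS payload for illustration only.
status = {
    "cluster": {
        "collections": {
            "ldc_collection": {
                "shards": {
                    "shard1": {"replicas": {
                        "core_node1": {"state": "active"},
                        "core_node2": {"state": "down"},
                    }},
                    "shard2": {"replicas": {
                        "core_node3": {"state": "recovering"},
                    }},
                }
            }
        }
    }
}
print(shards_without_active_replica(status, "ldc_collection"))
```

In this sample, shard1 still has one active replica, so queries can fail over to it, while shard2 has none and is reported.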

Compared with a single node, SolrCloud becomes more complex to install and maintain as you add Solr instances, because of the distributed nature of the system. Even so, the best practice is to use SolrCloud mode for all enterprise production deployments and for non-production environments. Added benefits include:

  • Running a SolrCloud allows you to use ZooKeeper to manage configuration changes across all Solr instances.
  • Running a single-node SolrCloud is good preparation for running multi-node configurations.

Note: If you choose not to use SolrCloud, contact the Hitachi Vantara Lumada and Pentaho Support Portal for best practices to ensure your Data Catalog data is backed up on a secondary server that acts as a failover system in case the primary Solr instance fails.

Storage

You can configure Solr or SolrCloud to store replicas either on local storage or on HDFS. Storing replicas on HDFS allows you to manage Data Catalog as another client of HDFS rather than identifying and managing storage on individual nodes. Configuring Solr to use local storage has a performance advantage over storing indexes on HDFS. The installation instructions provided support both methods.
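
The difference between the two storage choices often comes down to a few Solr startup properties. The sketch below returns the extra JVM system properties for each choice, assuming the property names documented in the Solr reference guide for running Solr on HDFS (`solr.directoryFactory`, `solr.lock.type`, `solr.hdfs.home`). Verify these names against the documentation for your Solr version. The HDFS URI shown is a hypothetical placeholder.

```python
def storage_jvm_args(storage, hdfs_home=None):
    """Return extra Solr startup properties for 'local' or 'hdfs' index storage."""
    if storage == "local":
        # Local storage uses Solr's default directory factory, so nothing is added.
        return []
    if storage == "hdfs":
        if not hdfs_home:
            raise ValueError("hdfs_home is required for HDFS storage")
        return [
            "-Dsolr.directoryFactory=HdfsDirectoryFactory",
            "-Dsolr.lock.type=hdfs",
            f"-Dsolr.hdfs.home={hdfs_home}",
        ]
    raise ValueError(f"unknown storage type: {storage!r}")


# Hypothetical NameNode address and path for illustration only.
print(storage_jvm_args("hdfs", "hdfs://namenode.example.com:8020/solr"))
```

Keeping the choice in one function makes it easier to start the same Solr installation against either storage type while you compare performance.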

Summary of the options

This article assumes Solr is installed on a separate node from Data Catalog. All references to Solr are based on this assumption. If your environment is set up differently, take that into consideration before following the guidelines for altering Solr configurations.

For all distributions, the best practice is to use Solr 8.4.1 or higher. Note that CDH and CDP require Solr 8.4.1 or higher while HDP can use Lucidworks 8.4 or higher.

For optimal performance, we recommend that the Solr shards be stored on local SSD drives rather than in HDFS. Make sure that the disk has enough space to contain the collection and that other applications are not competing for disk space. In general, local storage is preferred over HDFS storage unless you need the features of HDFS.
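
A simple pre-flight check can confirm that the local disk has room for the collection. The sketch below compares the free space on the volume holding the Solr data directory against an expected collection size plus a safety margin. The 20% headroom and the way you estimate the collection size are assumptions for illustration, not Data Catalog sizing guidance.

```python
import shutil


def has_room_for_collection(data_dir, expected_bytes, headroom=0.2):
    """Return True if the volume holding data_dir can fit the collection plus headroom."""
    free = shutil.disk_usage(data_dir).free
    return free >= expected_bytes * (1 + headroom)


# Check the current directory against a hypothetical 50 GiB collection estimate.
print(has_room_for_collection(".", 50 * 1024**3))
```

Run the check against the directory where Solr actually stores its indexes, and repeat it periodically if other applications share the same disk.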

Use this chart to determine the recommended configuration for your environment:

Setting     CDH/CDP        HDP             EMR                       Azure                     MapR
Kerberos    Same as HDFS   Same as HDFS    N/A                       N/A                       N/A
Mode        SolrCloud      SolrCloud       SolrCloud or Standalone   SolrCloud or Standalone   SolrCloud or Standalone
Storage     HDFS           HDFS or local   Filesystem                Filesystem                Filesystem
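
If you script your environment setup, the recommendations in the chart can be encoded as a small lookup. This sketch mirrors the chart values exactly. The function name and the case-insensitive matching are illustrative choices.

```python
# Recommended Solr configuration per distribution, copied from the chart above.
RECOMMENDED = {
    "CDH/CDP": {"kerberos": "Same as HDFS", "mode": "SolrCloud", "storage": "HDFS"},
    "HDP": {"kerberos": "Same as HDFS", "mode": "SolrCloud", "storage": "HDFS or local"},
    "EMR": {"kerberos": "N/A", "mode": "SolrCloud or Standalone", "storage": "Filesystem"},
    "Azure": {"kerberos": "N/A", "mode": "SolrCloud or Standalone", "storage": "Filesystem"},
    "MapR": {"kerberos": "N/A", "mode": "SolrCloud or Standalone", "storage": "Filesystem"},
}


def recommended_config(distribution):
    """Look up the chart row for a distribution name, ignoring case."""
    wanted = distribution.strip().lower()
    for name, config in RECOMMENDED.items():
        # Accept either the full column name or one part of it, such as 'CDP'.
        if wanted == name.lower() or wanted in name.lower().split("/"):
            return config
    raise KeyError(f"no recommendation for distribution: {distribution!r}")


print(recommended_config("HDP"))
```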

The following topics describe how to create a Solr collection for Data Catalog with various configurations:

When you have created and validated the Solr collection, continue with the installation of Data Catalog.