Lumada Data Catalog provides an integrated repository and indexing solution using Apache Solr™. This solution eliminates the complexity of synchronizing a stand-alone database with Apache Lucene™ indexes, while providing expanded opportunities for managing repository metadata at scale and for integrating Data Catalog into Hadoop cluster failover systems.
Lumada Data Catalog is certified for Solr version 8.4.1, which can be downloaded from the Apache archives website. Solr versions later than 8.4.1 must be certified before use. Contact the Hitachi Vantara Lumada and Pentaho Support Portal for compatibility and certification requests.
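As a rough illustration of the version rule above, the following Python sketch compares a Solr version string against the certified version. The `base_url` argument and the function names are hypothetical. The sketch assumes the version is read from the `lucene` / `solr-spec-version` field of Solr's system information endpoint. It is not part of Data Catalog.

```python
import json
import re
import urllib.request

CERTIFIED = (8, 4, 1)  # the certified Solr version named in this article


def parse_version(text):
    """Turn a string such as '8.4.1' into the tuple (8, 4, 1)."""
    match = re.match(r"\s*(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized Solr version: {text!r}")
    return tuple(int(group) for group in match.groups())


def certification_status(version_text):
    """Classify a Solr version relative to the certified version."""
    version = parse_version(version_text)
    if version == CERTIFIED:
        return "certified"
    if version > CERTIFIED:
        return "needs certification"
    return "older than certified version"


def fetch_solr_version(base_url):
    """Read the running version from Solr (base_url is a hypothetical
    address such as 'http://solr-host:8983/solr')."""
    url = f"{base_url.rstrip('/')}/admin/info/system?wt=json"
    with urllib.request.urlopen(url) as response:
        info = json.load(response)
    return info["lucene"]["solr-spec-version"]
```

For example, `certification_status("8.6.0")` returns `"needs certification"`, which is the case where you would contact support before proceeding.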
Choosing your configuration
Because Solr is available in different versions in different Hadoop distributions, and because a variety of installation configurations are supported, you have some choices to make to determine the correct Solr setup for your environment. Based on testing across various configurations, Hitachi Vantara offers the following best practices for selecting a configuration that fits your needs and environment.
Solr on your distribution
Support for Solr varies by Hadoop distribution:
- CDH/CDP includes Solr in the baseline of services.
- HDP can accommodate Solr as a managed service when added to an existing HDP installation.
- AWS-EMR and Microsoft Azure do not have any Solr integration.
Generally, if Solr is already running on your cluster, you should use your existing configuration. To support Data Catalog, add a collection to the existing setup.
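To make "add a collection" concrete, here is a minimal Python sketch that builds a request URL for the CREATE action of Solr's Collections API, which is available in SolrCloud mode. The host name, the collection name `ldc_demo`, and the shard and replica counts are placeholders, not values prescribed by Data Catalog. The distribution-specific topics listed at the end of this article describe the supported procedure, such as solrctl on CDH and CDP.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards=1, replication_factor=1):
    """Build a Collections API CREATE URL (all argument values are examples)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Hypothetical host and collection name, for illustration only.
url = create_collection_url("http://solr.example.com:8983/solr", "ldc_demo")
print(url)
```

Building the URL separately from sending it lets you review the request before it reaches a shared Solr installation.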
Data Catalog uses Solr's integrated repository for storing catalog search-related resource metadata and aggregate data. Although Data Catalog does not store any raw data in the Solr repository, securing access to Solr should be in accordance with the policies employed for Hadoop storage at the organization level. As a general rule, if you require HDFS to be secure, then you should secure Solr as well.
Solr version 8.2.0 and later is secure by default. For Solr versions prior to version 8.2.0, take the following steps to mitigate security vulnerabilities:
- Configure the network settings to only allow trusted traffic from the Lumada Data Catalog Application Server and Lumada Data Catalog Metadata Server to communicate with Solr, especially with the
- Edit the solrconfig.xml file to configure all DataImportHandler usages with an invariants section that specifies an empty string for the
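The second mitigation can be audited with a short script. The Python sketch below lists the DataImportHandler request handlers in a solrconfig.xml document that lack an invariants entry set to an empty string. The parameter name is passed in as an argument because it is cut off in the text above. The value `dataConfig` used in the example is an assumption based on Solr's public advisory for DataImportHandler, and the sample XML is invented for illustration.

```python
import xml.etree.ElementTree as ET


def dih_handlers_missing_invariant(solrconfig_xml, param_name):
    """Return names of DataImportHandler request handlers that do not pin
    param_name to an empty string in an 'invariants' section."""
    root = ET.fromstring(solrconfig_xml)
    missing = []
    for handler in root.iter("requestHandler"):
        if not handler.get("class", "").endswith("DataImportHandler"):
            continue  # not a DataImportHandler usage
        locked = False
        for section in handler.findall("lst"):
            if section.get("name") != "invariants":
                continue
            for entry in section.findall("str"):
                if entry.get("name") == param_name and not (entry.text or "").strip():
                    locked = True
        if not locked:
            missing.append(handler.get("name"))
    return missing


# Invented sample configuration: one unprotected handler, one protected.
SAMPLE = """
<config>
  <requestHandler name="/dataimport" class="solr.DataImportHandler">
    <lst name="defaults"><str name="config">db-data-config.xml</str></lst>
  </requestHandler>
  <requestHandler name="/safeimport" class="solr.DataImportHandler">
    <lst name="invariants"><str name="dataConfig"></str></lst>
  </requestHandler>
</config>
"""

# 'dataConfig' is an assumed parameter name, see the note above.
print(dih_handlers_missing_invariant(SAMPLE, "dataConfig"))  # ['/dataimport']
```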
Solr can be included under your cluster's Kerberos security umbrella. If you are protecting your HDFS with Kerberos, then Solr needs to participate in the Kerberos ecosystem. Installing Solr outside of Kerberos may cause incompatibilities and performance issues.
Solr can be configured in standalone mode or in SolrCloud (distributed) mode. Standalone mode runs a single Solr server on a single node. You can configure it to distribute collection data across multiple servers by setting up primary and replica nodes for replication. Standalone mode is simpler to install and maintain than SolrCloud.
The SolrCloud mode uses ZooKeeper to handle synchronization and failover among a set of Solr servers. For enterprise systems, configuring SolrCloud offers advantages such as:
- Metadata and index replication among servers.
- Load balancing.
- Distributed queries.
Starting from a single node, the complexity of installing and maintaining SolrCloud increases as you add Solr instances to handle the distributed nature of the system. The best practice is to use SolrCloud mode for all enterprise production deployments and for non-production environments. Added benefits include:
- Running a SolrCloud allows you to use ZooKeeper to manage configuration changes across all Solr instances.
- Running a single-node SolrCloud is good practice for running multi-node configurations.
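If you are unsure which mode an existing Solr installation runs in, the system information endpoint (`/solr/admin/info/system`) reports it in a `mode` field. The following Python sketch interprets that field from a response body that has already been fetched. The function name and the sample responses are illustrative only.

```python
import json


def solr_mode(system_info_json):
    """Map the 'mode' field of a system info response to a readable label."""
    info = json.loads(system_info_json)
    labels = {"std": "standalone", "solrcloud": "SolrCloud"}
    return labels.get(info.get("mode"), "unknown")


# Trimmed, invented response bodies for illustration.
print(solr_mode('{"mode": "solrcloud", "zkHost": "zk1:2181"}'))  # SolrCloud
print(solr_mode('{"mode": "std"}'))                              # standalone
```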
You can configure Solr or SolrCloud to store replicas either on local storage or on HDFS. Storing replicas on HDFS allows you to manage Data Catalog as another client of HDFS rather than identifying and managing storage on individual nodes. Local storage, however, has a performance advantage over storing indexes on HDFS. The installation instructions support both methods.
Summary of the options
This article assumes Solr is installed on a separate node from Data Catalog. All references to Solr are based on this assumption. If your environment is set up differently, take that into consideration before following the guidelines for altering Solr configurations.
For all distributions, the best practice is to use Solr 8.4.1 or higher. Note that CDH and CDP require Solr 8.4.1 or higher while HDP can use Lucidworks 8.4 or higher.
For optimal performance, we recommend that the Solr shards be stored on local SSD drives rather than in HDFS. Make sure that the disk has enough space to contain the collection and that other applications are not competing for disk space. In general, local storage is preferred over HDFS storage unless you need the features of HDFS.
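The disk-space advice above can be checked before you create the collection. This Python sketch compares the free space on the volume that will hold the index against an expected index size. The path, the size estimate, and the `headroom` multiplier are assumptions you would replace with your own figures. The multiplier reflects that index maintenance can temporarily need extra space.

```python
import shutil


def has_room_for_collection(index_dir, expected_index_bytes, headroom=2.0):
    """Return True if the volume holding index_dir has at least
    expected_index_bytes * headroom bytes free."""
    free_bytes = shutil.disk_usage(index_dir).free
    return free_bytes >= expected_index_bytes * headroom


# Example: check the current directory for a hypothetical 1 GiB index.
print(has_room_for_collection(".", 1 * 1024**3))
```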
Use this chart to determine the recommended configuration for your environment:
| Kerberos | Same as HDFS | Same as HDFS | N/A | N/A | N/A |
| Mode | SolrCloud | SolrCloud | SolrCloud or Standalone | SolrCloud or Standalone | SolrCloud or Standalone |
| Storage | HDFS | HDFS or local | Filesystem | Filesystem | Filesystem |
The following topics describe how to create a Solr collection for Data Catalog with various configurations:
- Solr on CDH and CDP: creating a collection with solrctl.
- Solr on HDP: creating a collection on Lucidworks or Apache Solr.
- Installing Lumada Data Catalog on MapR: creating a collection on Lucidworks or Apache Solr.
- Installing Lumada Data Catalog on Amazon EMR: installing Apache Solr and creating a collection.
- Installing Lumada Data Catalog on Azure HDInsight: installing Apache Solr and creating a collection.
When you have created and validated the Solr collection, continue with the installation of Data Catalog.