Hitachi Vantara Lumada and Pentaho Documentation

Solr on HDP

Hortonworks HDP provides two forms of Apache Solr:

  • Solr Search in the form of Lucidworks Solr
  • Ambari Infra to support Apache Atlas™ and Apache Ranger™
If you have Lucidworks Solr running on your cluster, you can use it to host the Lumada Data Catalog repository. If not, you should install and configure Apache Solr 8.4.1.

Ambari Infra is not supported for use by non-HDP components.

Lumada Data Catalog is certified for Solr version 8.4.1. Solr versions higher than 8.4.1 must be certified before use. Contact the Hitachi Vantara Lumada and Pentaho Support Portal for compatibility and certification requests. Solr 8.4.1 can be downloaded from the Apache archives website.

Choosing your configuration

After you have installed either Lucidworks or Apache Solr, you need to decide whether to run with SolrCloud and whether to use local or HDFS storage. This section summarizes how to make these decisions in an HDP environment. For a more general and complete description of the choices and trade-offs, see Apache Solr configuration.

Kerberos security

Solr can be included under your cluster's Kerberos security umbrella. If you are protecting your HDFS with Kerberos, then Solr needs to participate in the Kerberos ecosystem. Installing Solr outside of Kerberos may cause incompatibilities and performance issues. See Configure Solr for Kerberos.

SolrCloud mode

Solr can be configured in standalone mode or in SolrCloud (distributed) mode. Solr standalone runs a single Solr server on a single node. You can configure it to distribute collection data across multiple servers by setting up primary and replica nodes for replication. Standalone mode is simpler to install and maintain than SolrCloud.

The SolrCloud mode uses ZooKeeper to handle synchronization and failover among a set of Solr servers. For enterprise systems, configuring SolrCloud offers advantages such as:

  • Metadata and index replication among servers.
  • Failover.
  • Load balancing.
  • Distributed queries.

Compared to a single node, the complexity of installing and maintaining SolrCloud increases as you increase the number of Solr instances because of the distributed nature of the system. Even so, the best practice is to use SolrCloud mode for all enterprise production deployments and for non-production environments. Added benefits include:

  • Running a SolrCloud allows you to use ZooKeeper to manage configuration changes across all Solr instances.
  • Running a single-node SolrCloud provides good practice for running multiple-node configurations.
Note: If you choose not to use SolrCloud, contact the Hitachi Vantara Lumada and Pentaho Support Portal for best practices on backing up your Data Catalog data to a secondary server that can act as a failover system if the primary Solr instance fails.

Storage

You can choose to configure Solr or SolrCloud to store replicas either on local storage or on HDFS. Storing replicas on HDFS allows you to manage Data Catalog as another client of HDFS rather than identifying and managing storage on individual nodes. Configuring Solr to use local storage has a performance advantage over storing indexes on HDFS. Data Catalog supports both methods.

Requirements for the Data Catalog Solr collection

The following best practices apply to the Data Catalog Solr collection:

  • One shard

    A best practice is to use a single shard. If you use multiple shards, you must restart the Solr server whenever the collection schema changes. The server restart is required because Data Catalog changes the collection schema when custom properties are added to objects in the catalog. The benefit of using multiple shards does not outweigh the risk of the shards getting out of sync.

  • Replication factor of two

    If you are storing the collection on HDFS, the Solr index replication factor is separate from the HDFS replication factor. A replication factor of two stores two copies of the index files in two different locations. If you are using SolrCloud, your cluster should have at least two running Solr servers.

Caution: The Data Catalog service user must have full access to the collection.

Installing Solr

Follow these instructions to install Apache Solr in your Hadoop environment. If you already have Solr installed, you must either configure a service user for Solr or perform the following steps as a user with privileges to install applications on the cluster. When complete, proceed to the following task, Verify that Solr is running.
Note: To perform this installation, you must be a user with permissions to install applications on the cluster. If your cluster is Kerberized, the Solr service user must be able to authenticate from each of the servers where Solr will run.

Perform the following steps to install Solr 8.4.1:

Procedure

  1. Create a Solr principal in your Key Distribution Center (KDC).

  2. Create a keytab for the Solr user for each host where Solr will run.

    By default, the keytab is named and stored as /opt/solr/conf/solr.keytab.
  3. Create a keytab for Solr to validate HTTP requests.

  4. Copy the keytabs to all the hosts running Solr, then configure them so they are owned by the user "solr" and have read-only permissions.

    These steps are described in the Hortonworks documentation, Configure Kerberos for SolrCloud.
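The ownership and permission requirements from step 4 can be sketched as follows. A temporary file stands in for the real keytab so the commands can be tried safely; on an actual Solr host you would run chown and chmod against /opt/solr/conf/solr.keytab:

```shell
# Stand-in for the real keytab; on a Solr host you would also run:
#   chown solr:solr /opt/solr/conf/solr.keytab
keytab=$(mktemp)
chmod 400 "$keytab"       # read-only for the owning user
stat -c '%a' "$keytab"    # prints 400
rm -f "$keytab"
```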
  5. Download and expand the recommended Solr 8.4.1 installation package on the node to be the first location for Solr.

    This example uses /opt for the installation location. The location you choose must be available on each node where you run a Solr server.
    $ cd /opt
    $ wget http://archive.apache.org/dist/lucene/solr/8.4.1/solr-8.4.1.tgz
    $ tar -xf solr-8.4.1.tgz
  6. Repeat the Solr installation for each node where Solr will be running.

Verify that Solr is running

Whether you are using a new Solr installation or a supported version that is already installed, you should know how to manage the service and where it stores its data.

Procedure

  1. View the status of your Solr distribution. Enter the following command on the node where Solr is installed:

    • Lucidworks: $ /opt/lucidworks-hdpsearch/solr/bin/solr status
    • Apache Solr 8.4.1: $ /opt/solr-8.4.1/bin/solr status
  2. If it is running, stop the Solr server by entering the following command:

    $ <Solr Install Dir>/bin/solr stop -all

Configure Solr for HDFS storage

To configure Solr for HDFS storage, you must configure the storage location and generate configuration files. If you choose to use local storage, skip this section and go to Configure Solr for local storage.

Configure the HDFS storage location

Perform the following steps to configure Solr for HDFS storage:

Procedure

  1. Create a storage location on HDFS and change its ownership to the Solr service user by entering the following commands:

    $ sudo -u hdfs hadoop fs -mkdir /user/solr
    $ sudo -u hdfs hadoop fs -chown solr /user/solr
    
    Note: The default HDFS storage location is the Solr user directory, /user/solr, or alternatively /solr at the file system root.
  2. Switch the user to the Solr service user by entering the following command: $ sudo su solr

  3. Copy the default configuration files on the first Solr node using the following command:

    $ cd <Solr Install Dir>
    $ cp -r server/solr/configsets/_default server/solr/configsets/wdconfig
    
    Note: The default installation directory is /opt/solr-8.4.1.

Generate configuration files for HDFS storage

You must modify two default configuration files to configure the storage: the managed-schema file and the solrconfig.xml file.

Perform the following steps to generate configuration files for Data Catalog.

Procedure

  1. Open the copy of the managed-schema file in the <Solr Install Dir>/server/solr/configsets/wdconfig/conf/ directory with any text editor and change the code as follows:

    1. Add a _root_ field as shown in the following code:

      <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
      <field name="_root_" type="string" indexed="true" stored="false"/>
    2. Comment out the copyField source="*" entry.

    3. Locate the <fieldType name="text_general" ...> element and add the following code below the text_general definition:

      <fieldType name="text_with_special_chars" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
              <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
              <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="0" generateWordParts="1"
                  splitOnNumerics="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
              <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
          <analyzer type="query">
              <tokenizer class="solr.WhitespaceTokenizerFactory"/>
              <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
              <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
              <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="0" generateWordParts="1"
                  splitOnNumerics="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
              <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
      </fieldType>
    4. Save and close the file.

  2. Open the copy of the solrconfig.xml file in the <Solr Install Dir>/server/solr/configsets/wdconfig/conf directory with any text editor and change the code as follows:

    1. Replace the default NRTCachingDirectoryFactory with HdfsDirectoryFactory and update the URL to the HDFS location where the Solr collection is stored as shown in the following sample:

      <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
         <str name="solr.hdfs.home">hdfs://namenode:8020/user/solr</str>
         <bool name="solr.hdfs.blockcache.enabled">true</bool>
         <int name="solr.hdfs.blockcache.slab.count">1</int>
         <bool name="solr.hdfs.blockcache.direct.memory.allocation">false</bool>
         <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
         <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
         <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
         <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
         <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
         <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
      </directoryFactory>
    2. (Optional) If you have a Kerberized environment, add Kerberos settings to the directoryFactory by changing the principal to match the Solr service user and verifying that the keytab file location is correct.

      <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
         <str name="solr.hdfs.home">hdfs://namenode:8020/user/solr</str>
         <bool name="solr.hdfs.blockcache.enabled">true</bool>
         <int name="solr.hdfs.blockcache.slab.count">1</int>
         <bool name="solr.hdfs.blockcache.direct.memory.allocation">false</bool>
         <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
         <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
         <bool name="solr.hdfs.blockcache.write.enabled">false</bool>
         <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
         <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
         <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
         <bool name="solr.hdfs.security.kerberos.enabled">true</bool>
         <str name="solr.hdfs.security.kerberos.keytabfile">/opt/solr/conf/solr.keytab</str>
         <str name="solr.hdfs.security.kerberos.principal">solr/localhost.localdomain@HADOOP.COM</str>
      </directoryFactory>
    3. Set the lockType to hdfs as shown here: <lockType>hdfs</lockType>

    4. Turn off the spellcheck facility as shown here: <str name="spellcheck">off</str>

    5. Change the hard autoCommit timeout value to 15000 and the soft autoCommit timeout value to 10000.
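      For reference, after this change the commit settings in solrconfig.xml would look similar to the following sketch. The property-substitution syntax follows the Solr 8 default configuration file; only the 15000 and 10000 values come from this procedure:

      ```xml
      <autoCommit>
        <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>

      <autoSoftCommit>
        <maxTime>${solr.autoSoftCommit.maxTime:10000}</maxTime>
      </autoSoftCommit>
      ```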

  3. Save and close the file.

  4. Restart Solr.

  5. Use the following commands to create a local storage location for a Solr node that is accessible only by the Solr service user, then copy the Solr configuration files to that location as shown in the following sample:

    $ mkdir ~/ldc-solr-node
    $ chmod 700 ~/ldc-solr-node
    $ cp <Solr Install Dir>/server/solr/solr.xml ~/ldc-solr-node
    $ cp <Solr Install Dir>/server/solr/zoo.cfg ~/ldc-solr-node
    $ cp -r <Solr Install Dir>/server/solr/configsets/wdconfig ~/ldc-solr-node/
    Note: This location is used when starting the Solr server and must be available on all nodes running Solr. For HDFS, only the configuration files are stored in this location. The data files are stored in HDFS.
  6. Repeat the previous step for any additional Solr nodes.

    Note: Only the solr.xml and zoo.cfg files should be added to the additional nodes. ZooKeeper copies its configuration files to all the nodes listed in the ensemble.

Configure Solr for local storage

Perform this task to configure Solr for local storage. If you have configured HDFS storage, go to Configure Solr for Kerberos.

Procedure

  1. Switch the user to the Solr user. Enter the following command: $ sudo su solr

  2. Copy the default configuration files in the Solr installation directory using the following command: $ cp -r server/solr/configsets/_default server/solr/configsets/wdconfig

Generate configuration files for local storage

You must modify two default configuration files to configure the storage: the managed-schema file and the solrconfig.xml file.

Perform the following steps to generate the configuration files for Data Catalog:

Procedure

  1. Open the copy of the solrconfig.xml file in the server/solr/configsets/wdconfig/conf/ directory with any text editor and change the code as follows:

    1. Use the default directory factory:

      <directoryFactory name="DirectoryFactory"
          class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}">
      </directoryFactory>
      
    2. Turn off the spellcheck utility: <str name="spellcheck">off</str>

    3. Save and close the file.

  2. Open the copy of the managed-schema file in the server/solr/configsets/wdconfig/conf/ directory with any text editor and change the code as follows:

    1. Add a _root_ field as shown in the following sample:

      <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
      <field name="_root_" type="string" indexed="true" stored="false"/>
      
    2. Comment out the copyField source entry as shown in the following sample: <!-- <copyField source="*" dest="_text_"/> -->

    3. Find the <fieldType name="text_general" ...> element and add the following code below the text_general definition as shown in the following sample:

      <fieldType name="text_with_special_chars" class="solr.TextField" positionIncrementGap="100"> 
          <analyzer type="index"> 
              <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
              <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> 
              <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="0" generateWordParts="1" 
                  splitOnNumerics="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/> 
              <filter class="solr.LowerCaseFilterFactory"/> 
          </analyzer> 
          <analyzer type="query"> 
              <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
              <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/> 
              <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> 
              <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="0" generateWordParts="1" 
                  splitOnNumerics="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/> 
              <filter class="solr.LowerCaseFilterFactory"/> 
          </analyzer> 
      </fieldType>
    4. Save and close the file.

  3. Restart Solr.

  4. Create a local storage location for a Solr node that is accessible only by the owner. Copy the Solr configuration files to that location using the following commands:

    $ mkdir ~/ldc-solr-node
    $ chmod 700 ~/ldc-solr-node
    $ cp <Solr Install Dir>/server/solr/solr.xml ~/ldc-solr-node
    $ cp <Solr Install Dir>/server/solr/zoo.cfg ~/ldc-solr-node
    $ cp -r <Solr Install Dir>/server/solr/configsets/wdconfig ~/ldc-solr-node/
    Note: Choose a location that is available on all nodes where Solr is running. Use this location when starting the Solr server.
  5. Repeat the previous step for any additional Solr nodes.

    Note: Only the solr.xml and zoo.cfg files should be added to the additional nodes. ZooKeeper copies its configuration files among the nodes in the ensemble.

Configure Solr for Kerberos

Solr uses the Java Authentication and Authorization Service (JAAS) to authenticate requests. This service is configured in a JAAS login configuration file. The Kerberos details, such as the keytab name and location, and the location of the JAAS configuration file must be available to each Solr instance at startup. The steps to complete the integration include the following:

  • Create the JAAS configuration file (solr_jaas.conf).
  • Restart Solr on all hosts.
  • Update the krb5.conf file on your client computer with the Kerberos details from the Solr host so you can access the Solr admin UI.

These steps are described in the Apache Solr documentation: https://solr.apache.org/guide/8_4/kerberos-authentication-plugin.html.
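As an illustration, a minimal JAAS login configuration file for Solr typically looks like the following. The keytab path matches the default used earlier in this article; the host name and realm are placeholders you must replace with your own values:

```
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/opt/solr/conf/solr.keytab"
  storeKey=true
  useTicketCache=true
  debug=true
  principal="solr/host.example.com@EXAMPLE.COM";
};
```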

Create the Solr collection

The Data Catalog collection must be created using the configuration files that you updated in either the Generate configuration files for HDFS storage or the Generate configuration files for local storage tasks.

Perform the following steps to create a collection:

Procedure

  1. Use the following command to start the first Solr server listening on port 8983 and connected to your ZooKeeper ensemble:

    Note: You can get the ZooKeeper ensemble string from Ambari.
    $ <Solr Install Dir>/bin/solr start -cloud -p 8983 -z <zookeeper_ensemble> -m 8g -s <local_solr_storage_location>
  2. Use the same command on each of the other Solr nodes to start the additional Solr instances:

    $ <Solr Install Dir>/bin/solr start -cloud -p 8983 -z <zookeeper_ensemble> -m 8g -s <local_solr_storage_location>
  3. Upload the customized configuration files to ZooKeeper using the following command:

    $ <Solr Install Dir>/server/scripts/cloud-scripts/zkcli.sh -zkhost <zookeeper_ensemble> \
                                                               -cmd upconfig \
                                                               -confname wdconfig \
                                                               -confdir <solr_install_location>/server/solr/configsets/wdconfig/conf
  4. Create the collection using the following command:

    $ <Solr Install Dir>/bin/solr create -c wdcollection -shards 1 -replicationFactor 2 -n wdconfig -p 8983

  5. Validate that the collection is accessible to the Data Catalog service user.

    Log in to the Solr admin screen to verify that the collection was created on the Solr node port 8983 and is accessible to the Data Catalog service user.

Validate Data Catalog Solr collection compatibility

You can verify that the fieldType is installed by running the following command:

    curl 'http://localhost:8983/solr/wdcollection/schema/fieldtypes/text_with_special_chars'

If the fieldType is successfully installed, the status field in the response returns a 0 (zero).

If you receive a 404 status error that no such path exists, as in the sample message below, consult your system administrator or contact the Hitachi Vantara Lumada and Pentaho Support Portal:

    "No such path /schema/fieldtypes/text_with_special_chars"
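The verification can also be scripted. The following is a minimal sketch that extracts the status field from a Solr-style JSON response; the sample payload here is an illustrative assumption, and in practice you would pipe in the output of the curl command above:

```shell
# Extract responseHeader.status from a Solr-style JSON response.
# The sample payload is for demonstration only.
response='{"responseHeader":{"status":0,"QTime":1},"fieldType":{"name":"text_with_special_chars","class":"solr.TextField"}}'
printf '%s' "$response" | python3 -c 'import json,sys; print(json.load(sys.stdin)["responseHeader"]["status"])'
# prints 0 when the fieldType is installed
```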