Hitachi Vantara Lumada and Pentaho Documentation

Solr on CDH and CDP


These instructions assume that you are a root user installing Lumada Data Catalog on CDH in a Kerberos-controlled cluster with the recommended Solr version installed and set up in accordance with the CDH distribution. Data Catalog uses the solrctl utility to access an existing SolrCloud installation. Although CDH 6.1.1 offers an integrated Solr 7.5, Data Catalog requires Solr 8.4.1 to be installed separately.

CDP 7.1.x offers an integrated Solr 8.4.1. See the Cloudera documentation for details on its configuration.

Best practices for the Data Catalog Solr collection

The Data Catalog collection has the following best practices:

  • One shard

    A best practice is to use a single shard. If you use multiple shards, you must restart the Solr server whenever the collection schema changes. The server restart is required because Data Catalog changes the collection schema when custom properties are added to objects in the catalog. The benefit of using multiple shards does not outweigh the risk of the shards becoming out of sync.

  • Replication factor of two

    If you are storing the collection on HDFS, the Solr index replication factor is separate from the HDFS replication factor. A replication factor of two results in two copies of the index files being stored in two different locations. If you are using SolrCloud, your cluster should have at least two running servers.

Caution: The Data Catalog service user must have full access to the collection.

Creating a Solr collection on CDH

You can choose either a standalone Solr or an existing Solr deployment for Data Catalog. If you want to use an existing Solr deployment, you must complete the instructions in the following tasks to make changes needed for Data Catalog.

Prerequisites

You need the following information to create the Data Catalog collection and to configure access to that collection in the Data Catalog installation:

  • SolrCloud connection string.
  • Data Catalog configuration name. For example, wdconfig.
  • Data Catalog collection name. For example, wdcollection.
  • ZooKeeper coordination service ensemble (<ZK_ensemble>). You can get the ensemble list from the Cloudera Manager configuration for ZooKeeper or from the Solr Admin page by selecting Cloud > Tree. The ensemble may include one or multiple host names, for example:
    sfo01.acme.com:2181,sfo02.acme.com:2181,sfo03.acme.com:2181/solr
  • Data Catalog instance directory location on the Data Catalog server (<waterline_instancedir>). You can set this location inside the Data Catalog installation directory, for example: waterlinedata/wdcollection_instance_directory.

    The steps specific to the Kerberos configuration are identified; if your system is NOT secured with Kerberos, alternative steps are described.
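The ensemble value combines comma-separated host:port pairs with an optional chroot suffix (such as /solr). As a hedged illustration only (not a product utility), the following Python sketch splits a <ZK_ensemble> value into its hosts and chroot so you can verify each part:

```python
# Illustrative helper (not part of Data Catalog): split a ZooKeeper
# ensemble string into its host:port members and optional chroot path.
def parse_zk_ensemble(ensemble: str):
    hosts_part, sep, chroot = ensemble.partition("/")
    hosts = [h.strip() for h in hosts_part.split(",") if h.strip()]
    return hosts, ("/" + chroot) if sep else ""

hosts, chroot = parse_zk_ensemble(
    "sfo01.acme.com:2181,sfo02.acme.com:2181,sfo03.acme.com:2181/solr"
)
# hosts  -> three host:port entries
# chroot -> '/solr'
```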

Set Solr access to Data Catalog

In addition to the required Data Catalog service user role, you must also create a Solr service user role to access the Data Catalog collection. Perform the following steps to set up Solr access to the collection:

Procedure

  1. Verify that SolrCloud is installed and running.

  2. Verify that you can access the Solr Admin page.

    The URL will be similar to http://<myhost.example.com>:8983/solr as described in Cloudera's documentation for deploying Cloudera search.
  3. Create a Solr service user role that can access the Data Catalog collection and has the Collection=admin->action=* privilege.

    1. Log in as a user with applicable privileges to create a new role such as wd_collection_role.

      Typically, this user is the Solr service user. This user should have the Collection=admin->action=* privilege.
    2. Use the following commands to create the wd_collection_role:

      $ solrctl sentry --create-role wd_collection_role
      $ solrctl sentry --add-role-group wd_collection_role <primary group for ldcuser>
  4. Use the following commands to grant query and update privileges to the wd_collection_role role:

    $ solrctl sentry --grant-privilege wd_collection_role 'Collection=wdcollection->action=Query'
    $ solrctl sentry --grant-privilege wd_collection_role 'Collection=wdcollection->action=Update'
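The privilege strings follow the pattern Collection=<name>->action=<Action>. As a small illustrative sketch (not a product tool, and the role and collection names are only examples), this Python snippet composes the grant commands above for any role and collection so the format is easy to check:

```python
# Illustrative sketch: build the solrctl sentry grant command lines for a
# role. The privilege format Collection=<name>->action=<Action> matches
# the commands shown above.
def sentry_grant_commands(role: str, collection: str,
                          actions=("Query", "Update")):
    return [
        f"solrctl sentry --grant-privilege {role} "
        f"'Collection={collection}->action={action}'"
        for action in actions
    ]

for cmd in sentry_grant_commands("wd_collection_role", "wdcollection"):
    print(cmd)
```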

Create an instance directory

Use the following commands to create an instance directory:

$ solrctl instancedir --generate <waterline_instancedir>
$ cd <waterline_instancedir>/config/

Generating configuration files

Generate Solr configuration files for Data Catalog by modifying the schema.xml and solrconfig.xml files created in the instance directory.

Modify the schema file

Perform the following steps to modify the schema.xml file in the instance directory:

Procedure

  1. Comment out the following fields in the nested document section:

    <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
    <field name="name" type="text_general" indexed="true" stored="true"/>
    <field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
    <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
    <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
    <field name="weight" type="float" indexed="true" stored="true"/>
    <field name="price" type="float" indexed="true" stored="true"/>
    <field name="popularity" type="int" indexed="true" stored="true" />
    <field name="inStock" type="boolean" indexed="true" stored="true" />
    <field name="store" type="location" indexed="true" stored="true"/>
  2. Comment out the following fields in the Common metadata section:

    <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
    <field name="subject" type="text_general" indexed="true" stored="true"/>
    <field name="description" type="text_general" indexed="true" stored="true"/>
    <field name="comments" type="text_general" indexed="true" stored="true"/>
    <field name="author" type="text_general" indexed="true" stored="true"/>
    <field name="keywords" type="text_general" indexed="true" stored="true"/>
    <field name="category" type="text_general" indexed="true" stored="true"/>
    <field name="resourcename" type="text_general" indexed="true" stored="true"/>
    <field name="url" type="text_general" indexed="true" stored="true"/>
    <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="last_modified" type="date" indexed="true" stored="true"/>
    <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
  3. Comment out the following field in the highlighting document content section:

    <field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
  4. Comment out the following fields in the non-tokenized section:

    <field name="manu_exact" type="string" indexed="true" stored="false"/> 
    <field name="payloads" type="payloads" indexed="true" stored="true"/> 
  5. Comment out the following copyField command entries:

    <copyField source="cat" dest="text"/>
    <copyField source="name" dest="text"/>
    <copyField source="manu" dest="text"/>
    <copyField source="features" dest="text"/>
    <copyField source="includes" dest="text"/>
    <copyField source="manu" dest="manu_exact"/>
    <copyField source="price" dest="price_c"/> 
    <copyField source="title" dest="text"/> 
    <copyField source="author" dest="text"/>
    <copyField source="description" dest="text"/>
    <copyField source="keywords" dest="text"/>
    <copyField source="content" dest="text"/>
    <copyField source="content_type" dest="text"/>
    <copyField source="resourcename" dest="text"/>
    <copyField source="url" dest="text"/>
    <copyField source="author" dest="author_s"/>
  6. Add a new fieldType definition below the existing <fieldType name="text_general" ...> definition:

    <fieldType name="text_with_special_chars" class="solr.TextField" positionIncrementGap="100"> 
        <analyzer type="index"> 
            <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
            <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> 
            <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="0" generateWordParts="1" 
                splitOnNumerics="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/> 
            <filter class="solr.LowerCaseFilterFactory"/> 
        </analyzer> 
        <analyzer type="query"> 
            <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
            <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/> 
            <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> 
            <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="0" generateWordParts="1" 
                splitOnNumerics="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/> 
            <filter class="solr.LowerCaseFilterFactory"/> 
        </analyzer> 
    </fieldType>
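To see why this fieldType helps with names that contain special characters, the following Python sketch roughly approximates what the whitespace tokenizer plus word-delimiter filter (with preserveOriginal, generateWordParts, and catenateWords enabled) produce for a value such as Customer_ID. This is a simplified illustration only, not Solr's actual analysis chain:

```python
import re

# Rough, simplified approximation of the index-time analysis above:
# whitespace tokenization, word-delimiter splitting with the original
# token preserved and parts concatenated, then lowercasing.
# Not Solr's actual implementation.
def approx_analyze(text: str):
    terms = []
    for token in text.split():
        parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
        candidates = [token] + parts           # preserveOriginal="1"
        if len(parts) > 1:
            candidates.append("".join(parts))  # catenateWords="1"
        for term in candidates:
            lowered = term.lower()
            if lowered not in terms:
                terms.append(lowered)
    return terms

print(approx_analyze("Customer_ID"))
# ['customer_id', 'customer', 'id', 'customerid']
```

The original token survives intact, so queries for customer_id, customer, or id can all match.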

Modify the XML file for Solr configuration

Modify the solrconfig.xml in the instance directory as follows:

Procedure

  1. (Optional) If you are using HDFS to store Solr replicas, you must set the directoryFactory to HDFS as shown below:

    <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:org.apache.solr.core.HdfsDirectoryFactory}">
  2. Enable or add the schemaFactory class for dynamic schema REST APIs.

    Note: If you are using an older CDH version that has the default schemaFactory class definition set to ClassicIndexSchemaFactory, comment it out.
  3. Add the following definition for schemaFactory class if it is missing:

    <schemaFactory class="ManagedIndexSchemaFactory">
       <bool name="mutable">true</bool>
       <str name="managedSchemaResourceName">managed-schema</str>
    </schemaFactory>

    Note: Make sure there is only one schemaFactory class definition.
  4. Update the hard and soft autoCommit timeouts as shown in the following code:

    <autoCommit>
       <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
       <openSearcher>false</openSearcher>
    </autoCommit>
    
    <!-- softAutoCommit is like autoCommit except it causes a
         'soft' commit which only ensures that changes are visible
         but does not ensure that data is synced to disk.  This is
         faster and more near-realtime friendly than a hard commit.
    -->
    <autoSoftCommit>
       <maxTime>${solr.autoSoftCommit.maxTime:10000}</maxTime>
    </autoSoftCommit>
  5. Update the ping/healthcheck request handler properties list as shown in the following code:

    <!-- ping/healthcheck -->
    <requestHandler name="/admin/ping" class="solr.PingRequestHandler">
      <lst name="invariants">
        <str name="q">solrpingquery</str>
      </lst>
      <lst name="defaults">
        <str name="echoParams">all</str>
        <str name="df">id</str>
      </lst>
      <!-- An optional feature of the PingRequestHandler is to configure the
           handler with a "healthcheckFile" which can be used to enable/disable
           the PingRequestHandler. Relative paths are resolved against the data dir. -->
      <!-- <str name="healthcheckFile">server-enabled.txt</str> -->
    </requestHandler>
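After editing solrconfig.xml, you can sanity-check the result before uploading it to ZooKeeper. This hedged Python sketch (illustrative only, not part of the product tooling) parses the file content and confirms that it declares exactly one schemaFactory of the managed type, as the note in step 3 requires:

```python
import xml.etree.ElementTree as ET

# Illustrative check: a solrconfig.xml prepared for Data Catalog should
# declare exactly one schemaFactory, using ManagedIndexSchemaFactory.
def check_schema_factory(solrconfig_xml: str) -> bool:
    root = ET.fromstring(solrconfig_xml)
    factories = root.findall(".//schemaFactory")
    return (len(factories) == 1
            and factories[0].get("class") == "ManagedIndexSchemaFactory")

sample = """<config>
  <schemaFactory class="ManagedIndexSchemaFactory">
    <bool name="mutable">true</bool>
    <str name="managedSchemaResourceName">managed-schema</str>
  </schemaFactory>
</config>"""
print(check_schema_factory(sample))  # True
```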

Configuring Solr for Kerberos

Solr uses Java Authentication and Authorization Service (JAAS) to authenticate requests. This service is configured in a JAAS login configuration file. The Kerberos details such as the keytab name and location and the location of the JAAS configuration file need to be available to each Solr instance on startup. The steps involved in completing the integration include the following:

  • Create the JAAS configuration file (solr_jaas.conf).
  • Restart Solr on all hosts.
  • Update the krb5.conf file on your client computer with the details from the krb5.conf file on the Solr host so that you can access the Solr Admin page.

Upload the new configuration to ZooKeeper

Use the following command to associate the customized schema.xml and solrconfig.xml configuration files with the instance directory and upload that information to ZooKeeper:

$ solrctl --zk <ZK_ensemble> instancedir --create wdconfig <waterline_instancedir>

To identify the ZooKeeper ensemble when you do not have access to the Solr Administration page, use the following curl command and look for the term '-DzkHost=' in the output:

curl -q --negotiate -u : http://<Your Solr Host>:8983/solr/admin/info/system?wt=json
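The zkHost value appears inside the JVM command-line arguments reported by the system info endpoint. As an illustration only (the sample argument string below is hypothetical, and the exact JSON layout can vary by Solr version), this Python sketch extracts the ensemble from such an arguments string:

```python
import re

# Illustrative parser: pull the ZooKeeper ensemble out of a JVM
# command-line arguments string like the one reported by
# /solr/admin/info/system. The sample value is hypothetical.
def extract_zk_host(jvm_args: str):
    match = re.search(r"-DzkHost=(\S+)", jvm_args)
    return match.group(1) if match else None

args = ("-Xmx4g -DzkHost=sfo01.acme.com:2181,sfo02.acme.com:2181/solr "
        "-Dsolr.log.dir=/var/log/solr")
print(extract_zk_host(args))
# sfo01.acme.com:2181,sfo02.acme.com:2181/solr
```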

Restart Solr

Restart Solr using Cloudera Manager and verify that there are no errors on the Solr Admin page.

Create the collection

Execute the following commands to create a new collection that is accessible to the Data Catalog service user.

Procedure

  1. Use the following command to create a new collection with one shard and two replicas, and a maximum of two replicas on the same node:

    $ solrctl --zk <ZK_ensemble> collection --create wdcollection -c wdconfig -s 1 -r 2 -m 2
  2. Use the following command to validate that the collection is accessible to the Data Catalog service user:

    $ sudo su ldcuser
    $ solrctl collection --list

    The wdcollection displays in the list of collections returned by the command. The collection also displays on the Solr Administration page.

Validate the Data Catalog Solr collection compatibility

Verify that the fieldType is installed by using the following command:

curl 'http://<solr-host>:8983/solr/wdcollection/schema/fieldtypes/text_with_special_chars'

If the fieldType is not installed, the output from the curl command is a 404 error:

{
  "responseHeader": {
    "status":404,
    "QTime":5},
  "error": {
    "metadata": [
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg": "No such path /schema/fieldtypes/text_with_special_chars",
    "code":404}
}
Note: You can also verify the fieldType with your browser at this address: http://<solr-host>:8983/solr/wdcollection/schema/fieldtypes/text_with_special_chars

If you receive a 404 status error that no such path exists, such as in the sample message below, then consult your system administrator or our support team at Hitachi Vantara Lumada and Pentaho Support Portal.

"No such path /schema/fieldtypes/text_with_special_chars"
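This check can also be scripted. The following hedged Python sketch (illustrative only, not a product utility) interprets a parsed JSON response from the /schema/fieldtypes endpoint and reports whether the fieldType is present, assuming a successful response carries status 0 and a fieldType entry:

```python
# Illustrative check: given the parsed JSON response from
# /solr/wdcollection/schema/fieldtypes/text_with_special_chars,
# decide whether the fieldType was installed.
def field_type_installed(response: dict) -> bool:
    status = response.get("responseHeader", {}).get("status")
    return status == 0 and "fieldType" in response

missing = {
    "responseHeader": {"status": 404, "QTime": 5},
    "error": {"msg": "No such path /schema/fieldtypes/text_with_special_chars",
              "code": 404},
}
print(field_type_installed(missing))  # False
```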