Solr on CDH and CDP
These instructions assume that you are a root user installing Lumada Data Catalog on CDH in a Kerberos-controlled cluster with the recommended Solr version installed and set up in accordance with the CDH distribution. Data Catalog uses the solrctl utility to access an existing SolrCloud installation. Although CDH 6.1.1 includes an integrated Solr 7.5, Data Catalog requires Solr 8.4.1, which must be installed separately. CDP 7.1.x includes an integrated Solr 8.4.1. For details, see the related Cloudera documentation.
Best practices for the Data Catalog Solr collection
Follow these best practices for the Data Catalog collection:
One Shard
A best practice is to use a single shard. If you use multiple shards, you must restart the Solr server whenever the collection schema changes. The server restart is required because Data Catalog changes the collection schema when custom properties are added to objects in the catalog. The benefit of using multiple shards does not outweigh the risk of the shards becoming out of sync.
Replication factor of two
If you are storing the collection on HDFS, the Solr server index replication factor is separate from the HDFS replication factor. A replication factor of two results in two copies of the index files being stored in two different locations. If you are using SolrCloud, your cluster should have at least two running servers.
Creating a Solr collection on CDH
You can choose either a standalone Solr or an existing Solr deployment for Data Catalog. If you want to use an existing Solr deployment, you must complete the following tasks to make the changes that Data Catalog requires.
Prerequisites
You need the following information to create the Data Catalog collection and to configure access to that collection in the Data Catalog installation:
- SolrCloud connection string.
- Data Catalog configuration name, for example, ldcconfig.
- Data Catalog collection name, for example, ldccollection.
- ZooKeeper coordination service ensemble (<ZK_ensemble>). You can get the ensemble list from the Cloudera Manager configuration for ZooKeeper or from the Solr Admin page. The ensemble may include one or more host names, for example: sfo01.acme.com:2181,sfo02.acme.com:2181,sfo03.acme.com:2181/solr
- Data Catalog instance directory location on the Data Catalog server (<ldc_instancedir>). You can set this location inside the Data Catalog installation directory, for example: ldc/ldccollection_instance_directory.
The steps specific to the Kerberos configuration are identified; if your system is NOT secured with Kerberos, alternative steps are described.
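To avoid retyping these values in the commands that follow, you can collect them in shell variables first. This is a minimal sketch using the example names from the list above; every host, name, and path is a placeholder you must replace with the values for your own cluster.

```shell
# Example prerequisite values only -- substitute the names, hosts, and paths
# for your own cluster before running any of the later solrctl commands.
ZK_ENSEMBLE="sfo01.acme.com:2181,sfo02.acme.com:2181,sfo03.acme.com:2181/solr"
LDC_CONFIG="ldcconfig"
LDC_COLLECTION="ldccollection"
LDC_INSTANCEDIR="$HOME/ldc/ldccollection_instance_directory"
echo "ZooKeeper ensemble: $ZK_ENSEMBLE"
```

Later commands such as `solrctl --zk "$ZK_ENSEMBLE" ...` can then reference these variables instead of repeating literal values.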
Set Solr access to Data Catalog
Procedure
Verify that SolrCloud is installed and running.
Verify that you can access the Solr Admin page.
The URL will be similar to http://<myhost.example.com>:8983/solr, as described in Cloudera's documentation for deploying Cloudera Search.
Create a Solr service user role that can access the Data Catalog collection.
Log in as a user with privileges that allow creating a new role, such as wd_collection_role. Typically, this user is the Solr service user and should have the Collection=admin->action=* privilege.
Use the following commands to create the wd_collection_role:
$ solrctl sentry --create-role wd_collection_role
$ solrctl sentry --add-role-group wd_collection_role <primary group for ldcuser>
Use the following commands to grant query and update privileges to the wd_collection_role:
$ solrctl sentry --grant-privilege wd_collection_role 'Collection=wdcollection->action=Query'
$ solrctl sentry --grant-privilege wd_collection_role 'Collection=wdcollection->action=Update'
Create an instance directory
Use the following commands to create an instance directory and change to its config directory:
$ solrctl instancedir --generate <waterline_instancedir>
$ cd <waterline_instancedir>/config/
Generating configuration files
Generate Solr configuration files for Data Catalog by modifying the schema.xml and solrconfig.xml files created in the instance directory.
Modify the schema file
Modify the schema.xml file in the instance directory:
Procedure
Comment out the following fields in the nested document section:
<field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
<field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
<field name="weight" type="float" indexed="true" stored="true"/>
<field name="price" type="float" indexed="true" stored="true"/>
<field name="popularity" type="int" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="true"/>
<field name="store" type="location" indexed="true" stored="true"/>
Comment out the following fields in the Common metadata section:
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="subject" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="comments" type="text_general" indexed="true" stored="true"/>
<field name="author" type="text_general" indexed="true" stored="true"/>
<field name="keywords" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true"/>
<field name="resourcename" type="text_general" indexed="true" stored="true"/>
<field name="url" type="text_general" indexed="true" stored="true"/>
<field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
Comment out the following field in the highlighting document content section:
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>
Comment out the following fields in the non-tokenized section:
<field name="manu_exact" type="string" indexed="true" stored="false"/>
<field name="payloads" type="payloads" indexed="true" stored="true"/>
Comment out the following copyField command entries:
<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="manu" dest="text"/>
<copyField source="features" dest="text"/>
<copyField source="includes" dest="text"/>
<copyField source="manu" dest="manu_exact"/>
<copyField source="price" dest="price_c"/>
<copyField source="title" dest="text"/>
<copyField source="author" dest="text"/>
<copyField source="description" dest="text"/>
<copyField source="keywords" dest="text"/>
<copyField source="content" dest="text"/>
<copyField source="content_type" dest="text"/>
<copyField source="resourcename" dest="text"/>
<copyField source="url" dest="text"/>
<copyField source="author" dest="author_s"/>
Add a new fieldType definition below the existing <fieldType name="text_general" ...> definition:
<fieldType name="text_with_special_chars" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="0" generateWordParts="1" splitOnNumerics="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="0" generateWordParts="1" splitOnNumerics="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
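Hand-editing schema.xml is error-prone, so it can help to confirm that the fieldType snippet is well-formed XML before pasting it in. This is an optional sketch that assumes python3 is available on the PATH; the temporary file path is arbitrary.

```shell
# Optional sanity check: write the new fieldType to a temporary file and
# confirm it parses as well-formed XML before copying it into schema.xml.
cat > /tmp/text_with_special_chars.xml <<'EOF'
<fieldType name="text_with_special_chars" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="0" generateWordParts="1" splitOnNumerics="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="0" generateWordParts="1" splitOnNumerics="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
EOF
python3 - <<'EOF'
import xml.etree.ElementTree as ET
root = ET.parse("/tmp/text_with_special_chars.xml").getroot()
assert root.get("name") == "text_with_special_chars"
print("fieldType snippet is well-formed XML")
EOF
```

A parse failure here points at a quoting or nesting mistake that would otherwise surface only when Solr loads the schema.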
Modify the XML file for Solr configuration
Modify the solrconfig.xml file in the instance directory as follows:
Procedure
(Optional) If you are using HDFS to store Solr replicas, you must set the directoryFactory to HDFS as shown below:
<directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:org.apache.solr.core.HdfsDirectoryFactory}">
Enable or add the schemaFactory class for dynamic schema REST APIs.
Note: If you are using an older CDH version that has the default schemaFactory class definition set to ClassicIndexSchemaFactory, comment it out.
Add the following definition for the schemaFactory class if it is missing:
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>
Note: Make sure there is only one schemaFactory class definition.
Update the hard and soft autoCommit timeouts as shown in the following code:
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<!-- softAutoCommit is like autoCommit except it causes a
     'soft' commit which only ensures that changes are visible
     but does not ensure that data is synced to disk. This is
     faster and more near-realtime friendly than a hard commit. -->
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:10000}</maxTime>
</autoSoftCommit>
Update the ping/healthcheck request handler properties list as shown in the following code:
<!-- ping/healthcheck -->
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">solrpingquery</str>
  </lst>
  <lst name="defaults">
    <str name="echoParams">all</str>
    <str name="df">id</str>
  </lst>
  <!-- An optional feature of the PingRequestHandler is to configure the
       handler with a "healthcheckFile" which can be used to enable/disable
       the PingRequestHandler. relative paths are resolved against the data dir -->
  <!-- <str name="healthcheckFile">server-enabled.txt</str> -->
</requestHandler>
Configuring Solr for Kerberos
Solr uses Java Authentication and Authorization Service (JAAS) to authenticate requests. This service is configured in a JAAS login configuration file. The Kerberos details such as the keytab name and location and the location of the JAAS configuration file need to be available to each Solr instance on startup. The steps involved in completing the integration include the following:
- Create the JAAS configuration file (solr_jaas.conf).
- Restart Solr on all hosts.
- Update the krb5.conf file on your client computer with the details from the krb5.conf file generated on the Solr host so that you can access the Solr Admin page.
Upload the new configuration to ZooKeeper
Use the following command to associate the schema.xml and solrconfig.xml configuration files you have customized with the instance directory and upload that information to ZooKeeper:
$ solrctl --zk <ZK_ensemble> instancedir --create wdconfig <waterline_instancedir>
To identify the ZooKeeper ensemble when you do not have access to the Solr Admin page, use the following curl command and look for the term '-DzkHost=' in the output:
curl -q --negotiate -u : http://<Your Solr Host>:8983/solr/admin/info/system?wt=json
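Rather than scanning the JSON by eye, the ensemble can be pulled out programmatically. This sketch works against a stubbed response; the JSON path used (jvm.jmx.commandLineArgs) reflects the usual layout of Solr's /admin/info/system output, but verify it against your Solr version before relying on it.

```shell
# Stubbed /admin/info/system response; pipe the real curl output instead.
RESPONSE='{"jvm":{"jmx":{"commandLineArgs":["-Xmx4g","-DzkHost=sfo01.acme.com:2181,sfo02.acme.com:2181/solr"]}}}'
# Find the -DzkHost= JVM argument and strip the flag prefix.
ZK_HOST=$(echo "$RESPONSE" | python3 -c '
import json, sys
args = json.load(sys.stdin)["jvm"]["jmx"]["commandLineArgs"]
zk = next(a for a in args if a.startswith("-DzkHost="))
print(zk.split("=", 1)[1])
')
echo "$ZK_HOST"
```

The printed value is the <ZK_ensemble> string used by the solrctl commands in this procedure.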
Restart Solr
Restart Solr using Cloudera Manager and ensure there are no errors on the Solr Admin page.
Create the collection
Procedure
Use the following command to create a new collection with one shard (-s 1), two replicas (-r 2), and a maximum of two shards per node (-m 2):
$ solrctl --zk <ZK_ensemble> collection --create wdcollection -c wdconfig -s 1 -r 2 -m 2
Use the following command to validate that the collection is accessible to the Data Catalog service user:
$ sudo su ldcuser
$ solrctl collection --list
The wdcollection collection appears in the list returned by the command. The collection also appears on the Solr Admin page.
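This check can also be scripted. The sketch below greps a stubbed listing; on the cluster you would pipe the output of solrctl collection --list itself, and the second collection name shown is a made-up placeholder.

```shell
# Stubbed `solrctl collection --list` output; on the cluster, pipe the real
# command output to grep instead. "other_collection" is a placeholder.
LIST='wdcollection (1)
other_collection (2)'
echo "$LIST" | grep -q '^wdcollection' && echo "wdcollection is present"
```

A non-zero grep exit status means the collection was not created or is not visible to the current user.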
Validate the Data Catalog Solr collection compatibility
Verify that the fieldType is installed by using the following command:
curl 'http://<solr-host>:8983/solr/wdcollection/schema/fieldtypes/text_with_special_chars'
If the fieldType is installed, the command returns its definition. If you instead receive a 404 status error stating that no such path exists, as in the following sample output, consult your system administrator or our support team at the Hitachi Vantara Lumada and Pentaho Support Portal.
{
  "responseHeader": {
    "status": 404,
    "QTime": 5},
  "error": {
    "metadata": [
      "error-class", "org.apache.solr.common.SolrException",
      "root-error-class", "org.apache.solr.common.SolrException"],
    "msg": "No such path /schema/fieldtypes/text_with_special_chars",
    "code": 404}
}
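The success and failure cases can be told apart programmatically from the responseHeader status: 0 means the fieldType definition came back, 404 means the path does not exist. This sketch parses a stubbed success response; pipe the real curl output instead, and note that the fieldType body shown is an abbreviated assumption of what Solr returns.

```shell
# Stubbed success response from the fieldtypes endpoint (abbreviated); pipe
# the real curl output instead. status 0 = installed, 404 = missing.
SAMPLE='{"responseHeader":{"status":0,"QTime":1},"fieldType":{"name":"text_with_special_chars","class":"solr.TextField"}}'
FT_STATUS=$(echo "$SAMPLE" | python3 -c '
import json, sys
r = json.load(sys.stdin)
print("installed" if r["responseHeader"]["status"] == 0 else "missing")
')
echo "$FT_STATUS"
```

Embedding a check like this in an install script lets you fail fast before starting the Data Catalog services.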