Hitachi Vantara Lumada and Pentaho Documentation

Search ranking with query boosting

The search engine in Lumada Data Catalog lets you control the emphasis given to filters in search queries.

By default, search results are ordered by hit strength, or score. The higher the score, the higher the result appears in the search results. The search engine calculates this score with the Apache Lucene scoring model, applying the following conventions:

  • A token matched in the Name field receives a higher score than a token that matches in the Description field, which in turn has a higher score than a token that matches in the Comment field.

    For example, if you search for "Region", all the documents containing "Region" in their name appear higher in the search results than the documents that contain "Region" in their description. The documents with "Region" in the comments alone appear last in the search results.

  • For a search with multiple search terms, such as words or quoted phrases separated by spaces, a result matching two search terms receives a higher score than a result matching a single search term. A quoted phrase is matched as a whole; the individual words in the phrase are not searched independently.

    For example, if you search for "Customer" and "New York", search results containing both "Customer" and "New York" receive a higher score than results containing either "Customer" or "New York" search terms alone.

    However, documents containing only "New" or only "York" do not match the search criteria and do not appear in the results.

  • You can also use Relevance strength to influence the scoring of custom properties.

    For example, you can configure a custom property "LDC ID" to have a higher score than the default property Name, or weigh a custom property "Folio Number" lower than the defaults Name and Synonyms, but higher than the default Description. For more information, see Score boosting.

  • You can use Popularity rating and Usage level to resolve ties in scoring. For example, if two results have the same score, the result with the higher popularity rating appears before the result with the lower popularity rating.
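
The tie-breaking convention above can be sketched as a two-key sort. This is only an illustration, not Data Catalog's implementation: the Result class, its popularity field, and the sample values are hypothetical, and the way Data Catalog combines Popularity rating with Usage level is not described here.

```python
from dataclasses import dataclass


@dataclass
class Result:
    name: str
    score: float       # relevance score from the search engine
    popularity: float  # hypothetical stand-in for the Popularity rating


def rank(results):
    """Order by score (highest first); break ties by popularity (highest first)."""
    return sorted(results, key=lambda r: (-r.score, -r.popularity))


results = [
    Result("orders.csv", 2.5, 1.0),
    Result("customers.csv", 2.5, 4.0),  # same score as orders.csv, more popular
    Result("regions.csv", 7.1, 0.0),
]
ranked_names = [r.name for r in rank(results)]
print(ranked_names)
```

Popularity only matters when the scores are equal: regions.csv stays first despite its popularity of 0 because its score is strictly higher.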

Apache Lucene scoring model

Ranking of documents depends on relevance, which refers to how well a document or set of documents contains the information you need. Scores are computed and assigned to each document resulting from a query. Scoring depends on the way documents are indexed.

The following factors affect the Apache® Lucene® Scoring model:

  • Term frequency

    The measure of how often a keyword appears in a document. The document scores higher the more frequently the term appears.

  • Inverse document frequency

    The measure of how often a keyword appears across the index. The rarer the term is across the index, the higher its contribution toward the document score.

  • Coordination factor

    The measure of how many of the terms in the query appear in the document. The higher the coordination factor, the higher the document score.

  • Field length

    The measure of the importance of a term with respect to the total number of terms in a field. This value penalizes documents with longer field values.
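
To make the four factors concrete, here is a minimal sketch for a single field, loosely modeled on Lucene's classic TF-IDF similarity. It is a simplification: the function name classic_score and the toy index are hypothetical, query normalization and boosts are omitted, and the similarity that a given Lucene or Solr version actually uses (for example, BM25 in later releases) may differ.

```python
import math


def classic_score(query_terms, doc_terms, index):
    """Toy single-field score combining tf, idf, coordination, and field length."""
    num_docs = len(index)
    matched = [t for t in query_terms if t in doc_terms]
    if not matched:
        return 0.0
    # Coordination factor: fraction of query terms found in the document.
    coord = len(matched) / len(query_terms)
    # Field length norm: longer fields are penalized.
    length_norm = 1.0 / math.sqrt(len(doc_terms))
    total = 0.0
    for term in matched:
        # Term frequency: more occurrences raise the score, with damping.
        tf = math.sqrt(doc_terms.count(term))
        # Inverse document frequency: terms that are rare in the index count more.
        doc_freq = sum(1 for d in index if term in d)
        idf = 1.0 + math.log(num_docs / (doc_freq + 1))
        total += tf * idf * idf * length_norm
    return coord * total


# Hypothetical index: each document is the token list of one field.
index = [
    ["employee", "training", "records"],
    ["employee", "bonus"],
    ["region", "sales", "by", "quarter", "and", "employee", "count"],
]
scores = [classic_score(["employee", "bonus"], doc, index) for doc in index]
for doc, score in zip(index, scores):
    print(round(score, 4), doc)
```

In this toy index the second document ranks first because it matches both query terms and contains the rarer term "bonus". The third document ranks last: it matches only "employee", and its longer field lowers the length norm.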

Lucene scoring formula

To learn more about the Apache Lucene scoring function, see Class Similarity.

Score boosting

You can use Apache® Lucene® to boost document scores in the following ways:

  • Index-time boosting

    You can apply index-time boosting either at the field or document level when you add documents.

  • Query-time boosting

    You can apply query-time boosting at the field level when you create a query.

Data Catalog uses query-time boosting at the field level to boost scores for documents whenever you trigger a query.
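
As a rough illustration of query-time field boosting in general, the sketch below builds the parameters for a Solr query that uses the DisMax-style qf (query fields) parameter, where each field is written as field^boost. Whether Data Catalog issues its queries this way is an assumption; the helper name build_boosted_params and the sample weights are hypothetical.

```python
from urllib.parse import urlencode


def build_boosted_params(search_phrase, field_weights):
    """Return Solr request parameters with one boost per searched field."""
    # qf lists the fields to search, each followed by ^ and its boost.
    qf = " ".join(f"{field}^{weight}" for field, weight in field_weights.items())
    return {"defType": "edismax", "q": search_phrase, "qf": qf}


# Hypothetical weights: favor the custom property "employee" over the defaults.
weights = {"name": 1.0, "description": 1.0, "employee": 10.0}
params = build_boosted_params("Employee", weights)
query_string = urlencode(params)  # URL-encoded form for an HTTP request
print(params["qf"])
```

Because the boosts travel with the query rather than with the indexed documents, changing a weight takes effect without reindexing.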

To attach or manipulate scores for a resource search query or field search query, you can modify the boostConf.json configuration file found in WLD Install Dir/app-server/conf. A default boostConf.json file looks like the following:

Default boostConf.json file

This configuration file lists the default Resource properties, Field properties and Tag properties. Each property has a weight value associated with it, and the default value is 1. You can manipulate these weight values to either boost or suppress the search index score.

Note: These weights are floating point values that can be negative (-) or positive (+).

For example, for the search query "Employee", you may want all resources whose default property reported_names contains "Employee" to appear higher in the search results. To do so, set the weight of the default property reported_names to a higher value, such as 10. Ideally, this value boosts the score of the property enough to list resources with "Employee" in reported_names higher in the search results.

Score computation also uses the other attributes described in Apache Lucene scoring model, above, so a single weight change may not produce the expected ranking. Arriving at an optimum value is an iterative, trial-and-error process called the tuning phase. See Tuning scores for details.

You also can apply score boosting to Custom Properties, as shown in the following example.

Custom properties

Tuning scores

Lumada Data Catalog search needs to iterate multiple times to arrive at the optimum score for a resource because Apache® Lucene® score computing considers multiple attributes, such as Term frequency, Inverse document frequency, Coordination factor, and Field length. In this one-time process, Data Catalog "learns" and adapts future searches that add value to your business.

Two configuration properties in the Application Server's configuration.json file control this tuning phase:

  • ldc.metadata.solrquery.boost.debug

    This property enables the Apache Solr debug mode (Tuning phase) when set to true. The default value is false.

  • ldc.metadata.boostsearch.result.log.limit

    This property controls the number of Solr documents included in the tuning phase. The default value is 10.

Note: Remember to restart the Application Server whenever you update the configuration.json file.

After you enable tuning, tune Data Catalog to boost the score value of the description resource property.

For example, you can change the description of the T4.csv and T5.csv resources to include the word "Employee" at different places in their description fields. You can also add the value "Employee" to the custom property employee on the resource TrainingRec.csv.

With the default weight value of (+)1, a search for the keyword "Employee" lists T4.csv as the first resource. If you alter the weight values to favor the value "Employee" of the custom property employee, the same search for "Employee" now lists TrainingRec.csv as the first resource.

Boosting example

In the tuning phase, with the ldc.metadata.solrquery.boost.debug property set to true, the wd-ui.log file lists the score computation details for the number of documents specified by the ldc.metadata.boostsearch.result.log.limit property.

The following is sample output for a search query after altering the weight values of the resource properties:

com.ldc.web.service.audit.AuditingServiceImpl - Success: Basic Search: SEARCH INDEX {"searchPhrase":"Employee","facetSelections":null,"entityScope":["data_resource"]}
com.ldc.web.service.search.SearchServiceImpl - ***Saved MSR to session
com.ldc.metadataservice.SolrDocumentResultSet - --------------------------------------------------------------
com.ldc.metadataservice.SolrDocumentResultSet -  Boost scores Configured
com.ldc.metadataservice.SolrDocumentResultSet - --------------------------------------------------------------
com.ldc.metadataservice.SolrDocumentResultSet - scores for Resource
com.ldc.metadataservice.SolrDocumentResultSet - name --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - description --- -10.0
com.ldc.metadataservice.SolrDocumentResultSet - imported_resource_description --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - resource_path --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - file_format --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - resource_field_names --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - resource_field_display_names --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - resource_field_descriptions --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - reported_names --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - reported_display_names --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - reported_descriptions --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - imported_field_descriptions --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - resource_tags --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - resource_field_tags --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - conversation_descriptions --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - review_descriptions --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - employee --- 10.0
com.ldc.metadataservice.SolrDocumentResultSet - func_domain --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - influx_frequency --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - info_category --- -1.0
com.ldc.metadataservice.SolrDocumentResultSet - scores for Field
com.ldc.metadataservice.SolrDocumentResultSet - name --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - displayName --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - description --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - reported_name --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - reported_display_name --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - reported_description --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - imported_field_description --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - field_path --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - resource_path --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - resource_tags --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - resource_field_tags --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - scores for Tags
com.ldc.metadataservice.SolrDocumentResultSet - name --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - tag_name --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - description --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - synonyms --- 1.0
com.ldc.metadataservice.SolrDocumentResultSet - ---------------------------------------------------------------
com.ldc.metadataservice.SolrDocumentResultSet - Search Results
com.ldc.metadataservice.SolrDocumentResultSet - ---------------------------------------------------------------
com.ldc.metadataservice.SolrDocumentResultSet - Total number of Search documents found: 19, Start: 0, 
                                                          Max document score obtained for search: 18.020535,
com.ldc.metadataservice.SolrDocumentResultSet - Printing TOP 10 search result documents info


#============
#DOCUMENT #1 |
#============
12 Sep 2019 02:12:24.199 [qtp1156060786-122] INFO  com.ldc.metadataservice.SolrDocumentResultSet - SolrDocument{type=resource_header, id=0c9e2452e536640debe0b0a2bf39e2c3#0000, 
avg_rating=0.0, execution_status=Profiled, execution_status_facet=Profiled, resource_state=AVAILABLE, data_set_member=false, name=TrainingRec.csv, source=raf43856f5ee8e4bca, row_count=100, 
field_count=7, file_size=4963, file_format=text/csv, resource_type=resource_type_hdfs_file, resource_path=/user/ldcsvc/Joe/DO_Demo/EmpSkill_Analytics/Data/TrainingRec.csv, 
resource_parent_path=/user/ldcsvc/Joe/DO_Demo/EmpSkill_Analytics/Data, resource_owner=ldcsvc, time_of_resource_access=1561662683287, time_of_resource_change=1561662683704, 
resource_origin=[raf43856f5ee8e4bca], sensitivity=HIGH, file_format_facet=text/csv, resource_type_facet=resource_type_hdfs_file, resource_origin_facet=[raf43856f5ee8e4bca], 
source_facet=raf43856f5ee8e4bca, employee=Employee, employee__facet=Employee, func_domain=Marketing, func_domain__facet=Marketing, influx_frequency=Monthly, influx_frequency__facet=Monthly, 
info_category=Managed Investments, info_category__facet=Managed Investments, resource_field_names=[duration, city, rating, proj, employee, startdate, idos], 
resource_field_tags=[ra1570097de94f4910, ra6b8c6a3e1d464c0f, rae7fcf5a868b14a71, 30eeb7ea-ce8e-4365-af44-32c815b806b0, ra9510c762afea4310], 
resource_field_tag_facets=[ra1570097de94f4910, ra6b8c6a3e1d464c0f, rae7fcf5a868b14a71, 30eeb7ea-ce8e-4365-af44-32c815b806b0, ra9510c762afea4310], 
resource_field_tag_states=[SUGGESTED], resource_field_tag_states_facet=[SUGGESTED], virtual_folders=[rab3af9e40d323470c, ra2e0c90897b354a6a], 
virtual_folders_facet=[rab3af9e40d323470c, ra2e0c90897b354a6a], score=18.020535} <<== NOTE THE SCORE

#============
#DOCUMENT #2 |
#============
12 Sep 2019 02:12:24.199 [qtp1156060786-122] INFO  com.ldc.metadataservice.SolrDocumentResultSet - SolrDocument{type=resource_header, id=1f659c03abb92fd9f18739021052fde7#0000, 
avg_rating=0.0, execution_status=Profiled, execution_status_facet=Profiled, resource_state=AVAILABLE, data_set_member=false, name=prospect.csv, source=rab00e916a825f433f, row_count=4145, 
field_count=32, file_size=1585662, file_format=text/csv, resource_type=resource_type_hdfs_file, resource_path=/user/QBST/QBR/demo-data/pub/finance/businessbanking/prospect.csv, 
resource_parent_path=/user/QBST/QBR/demo-data/pub/finance/businessbanking, resource_owner=wlddev, time_of_resource_access=1565735059443, time_of_resource_change=1565735059469, 
resource_origin=[rab00e916a825f433f], sensitivity=MEDIUM, file_format_facet=text/csv, resource_type_facet=resource_type_hdfs_file, resource_origin_facet=[rab00e916a825f433f], 
source_facet=rab00e916a825f433f, resource_field_names=[SIC Code, Address 1: Postal Code, Owner, Banking Division, First Name, Product, Middle Name, Birthday, Gender, E-mail Address 1, 
Salutation, Annual Revenue, Telephone 1, Address 1: Country, Address 1: Latitude, Customer Name, No. of Employees, Status, Territory, Prospect Number, Address 1: Longitude, Created On, 
Address 1: State/Province, Net Income, Address 1: City, Status Reason, Last Name, Fax, Address 1: Line 3, Mobile Phone, Address 1: Line 1, Address 1: Line 2], 
resource_field_tags=[419b10b9-b168-449e-9f2b-eb094ad4c056, ra3e1ae03f01d44d71, 49a57862-d058-478f-9517-6e70eb03d03a, rab438dfd5bb7b4aaf, 2cb6eb58-1692-4d4a-bb01-d1bc92c0e6ef, 
b13482f9-7027-4d97-a4b9-ea79003e3272, 920783df-e304-43aa-84ff-026a6a1f4a39, 734d6e34-69dc-43ea-bdf8-a86f8b9e42c5], 
resource_field_tag_facets=[419b10b9-b168-449e-9f2b-eb094ad4c056, ra3e1ae03f01d44d71, 49a57862-d058-478f-9517-6e70eb03d03a, rab438dfd5bb7b4aaf, 2cb6eb58-1692-4d4a-bb01-d1bc92c0e6ef, 
b13482f9-7027-4d97-a4b9-ea79003e3272, 920783df-e304-43aa-84ff-026a6a1f4a39, 734d6e34-69dc-43ea-bdf8-a86f8b9e42c5], resource_field_tag_states=[SUGGESTED], resource_field_tag_states_facet=[SUGGESTED], 
virtual_folders=[ra65c787de81ab4076], virtual_folders_facet=[ra65c787de81ab4076], score=2.2761066} <<== NOTE THE SCORE

#============
#DOCUMENT #3 |
#============
12 Sep 2019 02:12:24.199 [qtp1156060786-122] INFO  com.ldc.metadataservice.SolrDocumentResultSet - SolrDocument{type=resource_header, id=3a4d5195f7a272d324ae2e2f569547d1#0000, 
avg_rating=0.0, execution_status=Profiled, execution_status_facet=Profiled, resource_state=AVAILABLE, data_set_member=false, name=HQ_EMPS.txt, source=rab00e916a825f433f, row_count=31, field_count=11, 
file_size=2769, file_format=text/tsv, resource_type=resource_type_hdfs_file, resource_path=/user/QBST/QBR/demo-data/pub/hr/HQ_EMPS.txt, resource_parent_path=/user/QBST/QBR/demo-data/pub/hr, 
resource_owner=wlddev, time_of_resource_access=1565735065086, time_of_resource_change=1565735065095, resource_origin=[rab00e916a825f433f], sensitivity=HIGH, file_format_facet=text/tsv, 
resource_type_facet=resource_type_hdfs_file, resource_origin_facet=[rab00e916a825f433f], source_facet=rab00e916a825f433f, resource_field_names=[STATUS, TERMINATION_DATE, EMPLOYEE_ID, 
HIRE_DATE, RETURN_DATE, BIRTH_DATE, TITLE_OF_CORTESY, TITLE, LAST_NAME, FIRST_NAME, SSN], resource_field_tags=[419b10b9-b168-449e-9f2b-eb094ad4c056, rae7fcf5a868b14a71, rab438dfd5bb7b4aaf, 
49a57862-d058-478f-9517-6e70eb03d03a, ra3e1ae03f01d44d71, 29576c0a-3f39-4766-8495-3431a5e2144d, b328a5a7-5374-486b-884d-5b0e2d10fa28, 734d6e34-69dc-43ea-bdf8-a86f8b9e42c5], 
resource_field_tag_facets=[419b10b9-b168-449e-9f2b-eb094ad4c056, rae7fcf5a868b14a71, rab438dfd5bb7b4aaf, 49a57862-d058-478f-9517-6e70eb03d03a, ra3e1ae03f01d44d71, 29576c0a-3f39-4766-8495-3431a5e2144d, 
b328a5a7-5374-486b-884d-5b0e2d10fa28, 734d6e34-69dc-43ea-bdf8-a86f8b9e42c5], resource_field_tag_states=[SUGGESTED], resource_field_tag_states_facet=[SUGGESTED], virtual_folders=[ra65c787de81ab4076], 
virtual_folders_facet=[ra65c787de81ab4076], score=2.2761066} <<== NOTE THE SCORE

#============
#DOCUMENT #4 |
#============
12 Sep 2019 02:12:24.199 [qtp1156060786-122] INFO  com.ldc.metadataservice.SolrDocumentResultSet - SolrDocument{type=resource_header, id=3a9f49664f8b7818dc40d806ab41659a#0000, 
avg_rating=0.0, execution_status=Profiled, execution_status_facet=Profiled, resource_state=AVAILABLE, data_set_member=false, name=W_BONUSES.txt, source=rab00e916a825f433f, row_count=30, 
field_count=2, file_size=330, file_format=text/tsv, resource_type=resource_type_hdfs_file, resource_path=/user/QBST/QBR/demo-data/pub/hr/W_BONUSES.txt, resource_parent_path=/user/QBST/QBR/demo-data/pub/hr, 
resource_owner=wlddev, time_of_resource_access=1565735065150, time_of_resource_change=1565735065565, resource_origin=[rab00e916a825f433f], sensitivity=HIGH, file_format_facet=text/tsv, 
resource_type_facet=resource_type_hdfs_file, resource_origin_facet=[rab00e916a825f433f], source_facet=rab00e916a825f433f, resource_field_names=[EMPLOYEE_ID, EMPLOYEE_BONUS], 
resource_field_tags=[rae7fcf5a868b14a71], resource_field_tag_facets=[rae7fcf5a868b14a71], resource_field_tag_states=[SUGGESTED], resource_field_tag_states_facet=[SUGGESTED], 
virtual_folders=[ra65c787de81ab4076], virtual_folders_facet=[ra65c787de81ab4076], score=2.2761066} <<== NOTE THE SCORE

The image below highlights the difference in the scores computed in each scenario, before query boosting and after query boosting.

Before and after query boost
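
When comparing tuning runs, it can help to pull just the resource name and score out of each logged SolrDocument entry. The following is a minimal sketch that assumes the entries keep the layout shown in the sample output above (a name= attribute followed by a comma, and score= as the last attribute before the closing brace); the function name extract_scores and the shortened sample text are hypothetical.

```python
import re

# One logged document: everything from "SolrDocument{" up to "score=<number>}".
DOC_PATTERN = re.compile(
    r"SolrDocument\{(?P<body>.*?)score=(?P<score>[-+0-9.eE]+)\}", re.DOTALL
)
# The resource name; \b keeps attributes such as resource_field_names from matching.
NAME_PATTERN = re.compile(r"\bname=([^,]+),")


def extract_scores(log_text):
    """Return (name, score) pairs in the order the documents were logged."""
    rows = []
    for match in DOC_PATTERN.finditer(log_text):
        name = NAME_PATTERN.search(match.group("body"))
        rows.append(
            (name.group(1).strip() if name else None, float(match.group("score")))
        )
    return rows


# Shortened sample in the same shape as the wd-ui.log output above.
sample = (
    "INFO x - SolrDocument{type=resource_header, data_set_member=false, "
    "name=TrainingRec.csv, row_count=100,\n"
    "resource_field_names=[duration, city], score=18.020535} <<== NOTE THE SCORE\n"
    "INFO x - SolrDocument{type=resource_header, name=prospect.csv, score=2.2761066}\n"
)
rows = extract_scores(sample)
for name, score in rows:
    print(f"{score:>10.4f}  {name}")
```

Running the same extraction on the log before and after a weight change gives two short lists that are easier to compare than the full debug output.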

Once the search results are satisfactory for your business use case, set the ldc.metadata.solrquery.boost.debug property to false in the Application Server's configuration.json file and restart the Application Server. Your search results are now customized to your boost values.

Incremental Apache Lucene faceting

The combination of Apache® Solr faceting and incremental Apache Lucene® faceting enhances the speed of Lumada Data Catalog searches.

For users with Native access to resources, Data Catalog search retrieves Lucene facets for search results as they are reported, as opposed to a two-phase approach in which the entire catalog is queried for resources followed by a separate facets query on the search results. However, for users with Metadata access, a single query to Solr facets improves search performance by skipping list permission checks on the resources.

The initial page retrieves and displays approximately 20% of the results while the remaining results continue to load in the background. The numbers in the facets change in real time, based on the permission check progress. A progress bar at the top of the Search page indicates the permission check progress.

The search continues in the background as long as you stay on the Search page. Navigating away from the Search page, which includes drilling into a resource from the search results or using the browser's Back button, halts the current search.

Caution: If you navigate away from the Search page, the results from the previous search are lost.