Hitachi Vantara Lumada and Pentaho Documentation

Searching Data Catalog

You can run a keyword search across the metadata in Lumada Data Catalog in three ways:

  • Basic Search

    You can use the search box on the main toolbar to trigger a global keyword search through all the resources in the cluster.

  • Saved Search

    You can reuse one of your five most recent searches. They display when you click in the Search data catalog box, and you can select an entry from the list to run it again.

  • Advanced Search

    You can select Advanced Search, which displays after the saved searches when you click in the Search data catalog box.

Search menu overview

Searching with keywords

When you enter search keywords, Lumada Data Catalog performs two separate searches and combines the results:

  • A search is performed for full or partial path names, such as /user/hudson/analysis/trend-2016.csv or analysis.
  • A search is performed for all other metadata and sample data, such as the names of files, fields, tables, tags, and origins, and the content of tag descriptions and origin descriptions.

Entering keywords into the Search data catalog box returns matching resources or fields, which are determined by the following attributes:

  • Case sensitivity

    Searches are case-insensitive except for path names. To find resources by their location in HDFS or S3, use the case represented in the file system.

  • Wildcard characters

    When you type a word in the search box, Data Catalog’s search engine checks the query for the asterisk (*) wildcard character. The results depend on the wildcard’s position relative to the keyword. If you enter multiple words in the search box, each word is treated as an independent search term. The following table shows examples:

    Search Text | Description | Finds
    foo | Strict equals search | foo
    ^foo* or foo* | Starts with “foo”, followed by any number of characters | foo, food, foodbar
    *bar$ or *bar | Ends with “bar”, preceded by any number of characters | Escobar, Zanzibar, foodbar, bar
    food bar | OR search on each word | food, bar, foodbar, food_bar
    “food bar” | Strict multi-word equals search | food bar

  • Quotes

    If you enclose keywords or phrases in double quotation marks, Data Catalog searches for the exact phrase.

  • Special characters

    If a resource or tag name contains special characters, such as $, @, or &, you must escape them with a backslash (\). For example, if your tag name is finance@USA$, then enter nce@USA\$ in the search box to find it.

    If a resource or tag name contains multiple special characters, place the backslash (\) in front of any one of them. For example, if your tag name is Park@Avenue# and the tag is associated with a resource at the resource or field level, searching for either Park\@Avenue# or Park@Avenue\# is valid.

    However, if a resource or tag name contains a hyphen (-), you must escape the hyphen with a forward slash (/). For example, searching for Q1-2016 returns no results, but searching for Q1/-2016 returns Q1 2016 and Q1-2016.

    If you are using a field search, you do not need to use slashes to escape special characters.

    Note

    When using a Resource search, combine special characters with text.
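
The escaping rules for resource searches can be sketched as a small helper. This is an illustration only, not part of Data Catalog: the function name `escape_keyword` and the set of backslash-escaped characters are assumptions drawn from the examples above ($, @, &, #). The sketch escapes every special character it finds, whereas the text above only confirms that escaping one of them is valid.

```python
# Hypothetical helper: this character set is an assumption based on the
# examples in the text ($, @, &, #); extend it to suit your own names.
BACKSLASH_ESCAPED = set("$@&#")


def escape_keyword(keyword: str) -> str:
    """Escape special characters in a resource or tag name for a resource search."""
    escaped = []
    for ch in keyword:
        if ch in BACKSLASH_ESCAPED:
            escaped.append("\\" + ch)  # special characters take a backslash
        elif ch == "-":
            escaped.append("/" + ch)   # hyphens take a forward slash
        else:
            escaped.append(ch)
    return "".join(escaped)


print(escape_keyword("Q1-2016"))       # Q1/-2016
print(escape_keyword("Park@Avenue#"))  # Park\@Avenue\#
```

For a field search, pass the keyword through unchanged, because field searches do not require escaping.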

Path name searches

Path name searches can match on any part of a path name and are case-sensitive.

Lumada Data Catalog compares the search text to its list of all the path names for all the resources in the catalog. The search is performed as a string comparison where a file, table, or folder is returned when the keyword matches any part of the resource's path.

For example, part matches the file /data/transactions/part-r-00000/data. The keyword actions would match the same file. However, Part would not match the file because while path name searches can match any part of a path containing the search term, they are also case sensitive.
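
The behavior described above amounts to a case-sensitive substring test. The following minimal sketch uses a hypothetical function name, `path_matches`, to reproduce the three cases from the example. It models only the matching rule, not how Data Catalog stores or indexes paths.

```python
def path_matches(keyword: str, path: str) -> bool:
    """Case-sensitive match of the keyword against any part of the path."""
    return keyword in path


path = "/data/transactions/part-r-00000/data"
print(path_matches("part", path))     # True
print(path_matches("actions", path))  # True
print(path_matches("Part", path))     # False: path searches are case-sensitive
```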

Other metadata searches

Lumada Data Catalog compares the search text to the names of files, fields, tables, tags, and origins, and the content of tag descriptions and origin descriptions.

When building the keyword search indexes for these items, Data Catalog splits each metadata value into tokens. A keyword matches only when it equals a complete token in the index. White space and characters such as single and double quotation marks, question marks, parentheses, carets (^), pound signs (#), colons, periods, hyphens, and commas indicate the end of a word and are otherwise ignored. Words that include underscores are not split at the underscore.

The search is not case sensitive.

For example, risk matches the field name Risk Band. The keyword RISK has the same match behavior as risk, except in path names, which are case-sensitive.

However, risk would not match the field Risk_Band because the token is considered the entire phrase risk_band. To find matches that include the keyword somewhere in the tokenized name, you can use the asterisk wildcard character before, after, or both before and after the keyword. The keyword risk* would match the field Risk_Band.

Keyword searches (other than path names):

  • Are case-insensitive
  • Match complete words
  • Accept the asterisk wildcard character to indicate any preceding or following characters

Characters such as the plus sign (+), minus sign (-), ampersand (&), vertical bar (|), exclamation mark (!), caret (^), tilde (~), colon (:), and other special characters are treated as delimiters and are ignored in the search.

If you enter risk-band, the search behaves the same as if you entered risk band.
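
The token rules in this section can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not the actual Solr analysis that Data Catalog uses: it treats every non-word character as a delimiter (a simplification of the delimiter lists above), keeps underscores inside tokens, lowercases everything, and uses shell-style wildcard matching for the asterisk. The function names are hypothetical.

```python
import re
from fnmatch import fnmatchcase


def tokenize(value: str) -> list:
    """Split a metadata value into lowercase tokens; underscores stay in the token."""
    return [t for t in re.split(r"[^\w]+", value.lower()) if t]


def query_terms(query: str) -> list:
    """Split a query the same way, but keep the asterisk wildcard."""
    return [t for t in re.split(r"[^\w*]+", query.lower()) if t]


def keyword_matches(query: str, value: str) -> bool:
    """True if any query term matches a complete token (OR across terms)."""
    tokens = tokenize(value)
    return any(fnmatchcase(tok, term) for term in query_terms(query) for tok in tokens)


print(keyword_matches("risk", "Risk Band"))       # True
print(keyword_matches("risk", "Risk_Band"))       # False: the token is risk_band
print(keyword_matches("risk*", "Risk_Band"))      # True
print(keyword_matches("risk-band", "Risk Band"))  # True: same as "risk band"
```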

Refining search results

If you do a keyword search from the toolbar or from the Advanced Search page, it returns results from the entire cluster, including files and fields that directly match the search criteria. You can use the keyword search and facets in the left pane of the search results to further refine these results.

You may notice that global search results return matched files and all of the fields in those files. When you refine the results, only the fields that directly match the refinement remain in the results. Here's an example of how this works:

You enter restaurant in the toolbar search. The search results show:

  • The files that have restaurant in their name or a file-level tag or tag description.
  • All the fields in the matched files.
  • The fields that match restaurant in their name, a field-level tag or tag description, or the sample data in that field.

If you refine the search results by entering cuisine in the keyword search in the left pane, the middle pane changes to show:

  • Only the files that have both restaurant and cuisine in the file name or file-level tag name or descriptions.
  • Only the fields that have both restaurant and cuisine in the field name, field-level tag or tag description, or sample data.

Unlike the original global search, no fields show simply because they were associated with a matched file. For example, if the global search results on restaurant matched a file inspections.csv with detailed address information, including a field with the tag US State, then all of the fields in the file inspections.csv appear in the global search results. When the results are refined by the keyword cuisine, the files and fields that now show directly match the keyword cuisine. The fields in the inspections.csv file that do not match cuisine directly are not shown.
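
The difference between the global search and the refined search can be illustrated with a toy model. Everything in this sketch is hypothetical: the `FileInfo` and `FieldInfo` classes, the field names, and the use of simple substring matching (sample data and tag descriptions are left out). It shows only which fields appear in each case.

```python
from dataclasses import dataclass


@dataclass
class FieldInfo:
    name: str
    tags: list


@dataclass
class FileInfo:
    name: str
    tags: list
    fields: list


def has_keyword(keyword, name, tags):
    """Simplified match: keyword appears in the name or in any tag."""
    keyword = keyword.lower()
    return any(keyword in text.lower() for text in [name, *tags])


def global_search(keyword, files):
    """Fields shown by a global search: direct hits plus all fields of matched files."""
    shown = []
    for f in files:
        file_hit = has_keyword(keyword, f.name, f.tags)
        for fld in f.fields:
            if file_hit or has_keyword(keyword, fld.name, fld.tags):
                shown.append(fld.name)
    return shown


def refined_search(keywords, files):
    """Fields shown after refinement: only fields that match every keyword directly."""
    return [fld.name for f in files for fld in f.fields
            if all(has_keyword(k, fld.name, fld.tags) for k in keywords)]


inspections = FileInfo(
    name="inspections.csv",
    tags=["restaurant"],
    fields=[FieldInfo("state", ["US State"]), FieldInfo("restaurant_cuisine", [])],
)
print(global_search("restaurant", [inspections]))
# ['state', 'restaurant_cuisine']
print(refined_search(["restaurant", "cuisine"], [inspections]))
# ['restaurant_cuisine']
```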

Search details

Lumada Data Catalog search results are organized into self-contained panels to maximize resource insight at a single glance. Each result panel contains key information organized for easy viewing, such as path details, description, and type.

Example search result

  • Sensitivity

    Icon indicates sensitivity of the returned data.

  • File metadata

    Contains the following file metadata parameters:

    • File Size
    • Fields
    • Records
    • Origin(s)
    • Last Modified timestamp
  • State

    Shows resources that are available for browsing, resources that are marked for deletion in Solr, and resources that are no longer available for processing.

  • Description

    Plain text describing the resource, if available.

  • Resource Tags

    Number of overflow tags that you can view by clicking the numeric link.

  • Resource type and path

    List of file type and the path to the file location.

Search result details also indicate resource popularity metrics like average overall rating with total ratings and total posts.

Basic search

A basic search performs a global keyword search that lists the total number of results for the search term or terms entered, and groups those results in tabs for Resources facets and Fields facets.

Example of search results facets

  • Resources

    List of resources that match the search term (including resource name, path, fields, tags or tag associations)

  • Fields

    List of fields that match the search term (including resource path, fields, tags or tag associations)

Lumada Data Catalog provides built-in facets for Resources and for Fields that can further filter the search results.

View search results in the Resource tab

Use the Resource tab in the search results to see the list of built-in resource facets on the search results:

Example of Resource tab in the search results

Procedure

  1. Click the Open facet settings (gear icon) in the upper-left corner.

    The built-in facets appear as Available Facets in the Facets Settings dialog box. By default, the search results show all the built-in facets.
  2. To limit the search results to a chosen set of facets, select the check boxes next to the facets and use the right arrow button to move the selected facets from the list of Available Facets to the list of Visible Facets.

  3. (Optional) Select a facet and use the up or down arrow buttons to change the order in which the Visible Facets appear on the search results page.

  4. Click OK to show the facets in the search results.

    Note

    Only the facets that have values display on the facets pane in the search results. Empty facets, even if selected in the Visible Facets list, do not display.

View search results in the Fields tab

Use the Fields tab in the search results to see the list of built-in facets on the search results:

Example of Fields tab in the search results

Procedure

  1. Click the Open facet settings (gear icon) in the upper-left corner.

    The built-in facets appear as Available Facets in the Facets Settings dialog box. By default, the search results show all the built-in facets.
  2. To limit the search results to a chosen set of facets, select the check boxes next to the facets and use the right arrow button to move selected facets from the list of Available Facets to the list of Visible Facets.

  3. (Optional) Select a facet and use the up or down arrow buttons to change the order in which the Visible Facets appear on the search results page.

  4. Click OK to show the facets in the search results.

    Note

    Only the facets that have values display on the facets pane in the search results. Empty facets, even if selected in the Visible Facets list, do not display.

Advanced search

Attention

The Including/Excluding tag(s) fields are available in version 6.1 and later.

As in a basic search, you can use keywords in an advanced search of Data Catalog. However, instead of only filtering the results after the search, you can apply filters before searching to limit the search itself. Search results are bound by your user access control permissions.

To perform an advanced search, click in the Search data catalog box and then click Go to Advanced Search.

Example of advanced search

Enter a keyword or keywords, then define the filters that you want to apply for your search. For example, in the Resources tab you can limit your search to the virtual folder BankRetail, or in the Fields tab you can limit your search to the string data type. After selecting the desired Entity type, click Apply filters and search.

Note

You cannot use both Resources and Fields within the same search query. Use a separate query for each Entity type to display the applicable results.

Search using facets

Lumada Data Catalog crawls the data cluster to discover information from files, Hive tables, and fields inside of files and tables. It groups that information in facets to make it easy for you to search for files or tables with specific characteristics. The facets are categorized as follows:

  • File format

    Search for a specific file format.

  • Resource type

    Search through a specific resource type.

  • Data source

    Search in a specific data source.

  • Virtual folder

    Search in a specific virtual folder.

  • Processing status

    Search within a specific resource status (for example, search only profiled resources).

Selecting more than one value inside the same facet includes files that match any of those values (OR). Selecting values in more than one facet includes only files that match the selections in every facet (AND). If keywords are also specified, the search results match both the keywords and the facet choices.
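
The OR-within-a-facet, AND-across-facets rule can be sketched as follows. The function name and the dictionary layout are assumptions made for illustration: each resource maps facet names to a set of values, and each selection maps facet names to the set of values chosen in that facet.

```python
def matches_facets(resource: dict, selections: dict) -> bool:
    """OR within a facet, AND across facets."""
    for facet, chosen in selections.items():
        if not chosen:
            continue  # nothing selected in this facet, so it does not filter
        values = resource.get(facet, set())
        if not (set(values) & set(chosen)):
            return False  # no selected value matched in this facet
    return True


resource = {"file_format": {"CSV"}, "data_source": {"HDFS"}}
print(matches_facets(resource, {"file_format": {"CSV", "JSON"}, "data_source": {"HDFS"}}))  # True
print(matches_facets(resource, {"file_format": {"CSV", "JSON"}, "data_source": {"Hive"}}))  # False
```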

Note

Data Catalog builds its search results from information collected during a batch process on the cluster called profiling. If files are not already profiled, you will not see results from those files when you search.

By default, Data Catalog provides the following facets:

Resource Level Facets

Facet | Description | Notes
File format | Data format of the file content. | Data Catalog profiles sequence files; however, the file content type is marked as the format in which each record is formatted (Avro, JSON, delimited text, or XML).
Resource type | Data source type (HDFS/Hive). | If files have not been profiled, their source is identified as UNKNOWN.
Resource size | Size of the resource. | Size facet ranges are inclusive of the start value and exclusive of the end value. For example, the range 1 MB - 1 GB includes 1 MB files up to 999 MB files.
Resource origin | All files marked with the selected origin and any files with lineage relationships that lead to a file marked with the selected origin. | Results include files with confirmed (accepted) lineage relationships and relationships suggested by Data Catalog.
Data source | The parent data source where the resource is located. | HDFS, Hive, MySQL, etc.
Virtual folder | The virtual folder where the resource belongs. | The virtual folder is assigned to a user by the administrator and can map to any data source.
Resource tag | The resource tags associated with the resource. |
Resource tag association state | Lists the resource tag association state of the resources matching the search term. |
Field tag | The field tags associated with the resource. |
Field tag association state | Lists the field tag association state of the resources matching the search term, along with the number Accepted, Rejected, and Suggested. |
Processing status | Outcome of profiling. | Folders appear as processed or unprocessed. Files and tables appear as profiled if most or all of the data profiled successfully. If profiling was attempted but not successful, files and tables are marked as 'profile failed'. Files or tables with data formats that Data Catalog does not support are marked as recognized or unrecognized based on the format.
Sensitivity | Computed metadata attribute that identifies the sensitivity of the resource. | Sensitivity is based on the highest sensitivity level of any tag (field or resource) associated or suggested on the resource.
Resource state | State of the resource, such as 'Available'. |

Field-Level Facets

Facet | Description | Notes
Data type | Data type for field value as formatted in the file. | Many file format types specify only String data types. This search does not use Data Catalog discovered type results. For example, for a JSON formatted file, you may see strings, integers, decimals, or Boolean values here but not dates, depending on the type information that is present in the JSON file.
Field tag | The field tags associated with the resource. |
Field tag association state | Lists the field tag association state of the resources matching the search term, along with the number Accepted, Rejected, and Suggested. |
Cardinality | The number of unique values in a column. | Affected by whether or not a file was fully profiled or sampled.
Selectivity | Whether the resource is repetitive or unique. |
Density | The number of non-null values in a column. | Affected by whether or not a file was fully profiled or sampled.
Data source | The parent data source where the resource is located. | HDFS, Hive, MySQL, etc.
Virtual folder | The virtual folder where the resource belongs. | The virtual folder is assigned to a user by the administrator and can map to any data source.
Sensitivity | Computed metadata attribute that identifies the sensitivity of the resource. | Sensitivity is based on the highest sensitivity level of any tag (field or resource) associated or suggested on the resource.
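
Two of the definitions in these tables are easy to make concrete: the half-open size ranges and the cardinality and density counts. The sketch below uses hypothetical function names and makes two assumptions that the tables do not state: null values are not counted toward cardinality, and 1 MB means 1,000,000 bytes.

```python
def in_size_range(size: int, start: int, end: int) -> bool:
    """Size ranges include the start value and exclude the end value."""
    return start <= size < end


def column_stats(values: list) -> dict:
    """Cardinality (unique values) and density (non-null values) of a column."""
    non_null = [v for v in values if v is not None]
    return {"cardinality": len(set(non_null)), "density": len(non_null)}


MB, GB = 1_000_000, 1_000_000_000
print(in_size_range(1 * MB, 1 * MB, 1 * GB))  # True: the start value is included
print(in_size_range(1 * GB, 1 * MB, 1 * GB))  # False: the end value is excluded
print(column_stats(["CA", "NY", None, "CA"]))  # {'cardinality': 2, 'density': 3}
```

Because both counts depend on the values that were actually read, a sampled file can report lower numbers than a fully profiled one.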

In addition to using keywords and facets, you can also apply tag-based filters to include or exclude tags and tag children to perform conjunctive and disjunctive searches in an advanced search.

Attention

The Including/Excluding tag(s) fields are available in version 6.1 and later.

  • Including tag(s)

    Enter the tag names that you want to include in your search. Only resources and fields with the selected tags are fetched. When Include child tags is selected, resources and fields with their child tags are also returned.

    Note

    Business entities are blocked from the Include child tags feature. If a business entity tag is selected, search results will not include its children.

  • Excluding tag(s)

    Enter the tag names that you want to exclude from your search. All resources and fields are fetched except those with the excluded tags. When Exclude child tags is selected, resources and fields with their child tags are excluded as well.

For example, when you search for the keyword "Personnel Info" and include the tag US_State, the search results are limited to resources matching the keyword and having the tag (suggested or accepted) US_State. By including or excluding child tags, individual states tagged with US_State can also be filtered.

As with a global search, you can use the available facets to further filter the advanced search results.
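
The tag filters can be modeled roughly as set operations. This is a sketch under several assumptions, not the product's implementation: the tag hierarchy `CHILD_TAGS` and its child tag names are invented for illustration, only one level of children is expanded, a resource passes the include filter if it carries any included tag, and business entity tags are not modeled. Exclusion is checked first because Excluding Tag(s) takes precedence when the two fields contradict each other.

```python
# Hypothetical tag hierarchy: parent tag -> child tags.
CHILD_TAGS = {"US_State": ["California", "Texas"]}


def expand(tags, with_children):
    """Return the tags, plus their direct children when requested."""
    expanded = set(tags)
    if with_children:
        for tag in tags:
            expanded.update(CHILD_TAGS.get(tag, []))
    return expanded


def passes_tag_filters(resource_tags, including=(), excluding=(),
                       include_children=False, exclude_children=False):
    """Apply the Including and Excluding tag filters to one resource."""
    resource_tags = set(resource_tags)
    if resource_tags & expand(excluding, exclude_children):
        return False  # exclusion takes precedence over inclusion
    included = expand(including, include_children)
    return not included or bool(resource_tags & included)


print(passes_tag_filters({"California"}, including=["US_State"]))  # False
print(passes_tag_filters({"California"}, including=["US_State"],
                         include_children=True))                   # True
print(passes_tag_filters({"US_State", "Texas"}, including=["US_State"],
                         excluding=["Texas"]))                     # False
```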

Search using Advanced Search

The options that you use when performing an Advanced Search determine the scope of your returned results.

Attention

The "Including tags" and "Excluding tags" features are available in version 6.1 and later.

Perform the following steps to search using Advanced Search.

Procedure

  1. Click in the Search data catalog field, and then click Go to Advanced Search.

    The Advanced Search Form page opens.
  2. Enter your search term or terms in the Keywords field.

  3. Select the Entity type that you want to search:

    • Resources: to search resources.
    • Fields: to search fields.
  4. (Optional) Enter a tag or tags in the Including Tag(s) field from the dropdown menu, and select the Include child tags check box if you want to include child tags in your search.

    Selected included tags appear on the Advanced Search Form page.
  5. (Optional) Enter a tag or tags in the Excluding Tag(s) field from the dropdown menu, and select the Exclude child tags check box if you want to exclude child tags from your search.

    Selected excluded tags appear on the Advanced Search Form page.

    Note

    If the Including Tag(s) and Excluding Tag(s) fields contradict each other, then Excluding Tag(s) takes precedence.
  6. (Optional) Depending on your selected Entity type, apply facets:

    Resources: You can search any or all of these resource facets:
    • Data source
    • Virtual folder
    • Resource Type
    • File format
    • Processing status
    Fields: You can search any or all of these field facets:
    • Data source
    • Virtual folder
    • Data Type
    • Field tag association state
  7. Click Apply filters and search.

Results

A list of resources or fields matching your search criteria is returned. Depending on your permissions, some data may be unavailable for viewing.

Note

You cannot switch between the Resources tab and the Fields tab on the search results list page. You must perform a separate query to display the results.

Using a custom search query

Attention

This feature is available in version 6.1 and later.

You can write expressions in the Lumada Data Catalog (LDC) search language to query the searchable properties of resources or fields. Each expression must specify the property or properties to compare and the operator to apply. After you enter the filter string, Data Catalog validates it and then runs the search. The search results displayed depend on your access control permissions.

To perform a custom search, click in the Search data catalog field, click Go to Advanced Search, and then select the Custom Search Query tab.

Custom Query page

LDC search language

The following table lists the custom query operators in the LDC search language.

Operator | Description
eq | Equal to
ne | Not equal to
co | Contains
sw | Starts with
ew | Ends with
gt | Greater than (supports date-in-string formats)
ge | Greater than or equal to (supports date-in-string formats)
lt | Less than (supports date-in-string formats)
le | Less than or equal to (supports date-in-string formats)
OR | Logical OR conjunction between two filters, matches if either contains the criteria
AND | Logical AND conjunction between two filters, matches if both contain the criteria
not | Negation of the eq, ne, co, gt, ge, lt, le, OR, and AND operators
is null | Empty
is not null | Not empty

You should observe these rules when using Custom Search Query:

  • Only searchable properties can be queried. Searches can have a mix of properties. Searches on string properties are case-sensitive and match on the exact value of the field. Searches on properties of type text_general or text_with_special_chars are case-insensitive and match on the terms that Solr generates for the field value.
  • Ranges can be given for numbers and time variables with a combination of greater than and less than operators.
  • Multiple statements can be given with a combination of AND and OR operators.
  • Statements can be grouped using parentheses ().
  • Provide file sizes in bytes.
  • The following date-in-string formats are supported:
    • dd/MM/yyyy hh:mm:ss
    • dd-MM-yyyy hh:mm:ss
    • dd-MM-yyyy
    • dd/MM/yyyy

Custom query examples

Syntax examples of custom queries and their meanings are provided below.

Syntax example | Meaning
name eq "data.json" | Find a name that equals data.json.
name co "json" | Find a name that contains json.
name sw "s" | Find a name that starts with s.
time_of_creation gt "12-03-2020" | Find a time_of_creation that is greater than the date 12-03-2020.
time_of_creation ge "12-03-2020" | Find a time_of_creation that is greater than or equal to the date 12-03-2020.
file_size lt 1700000 | Find a file_size that is less than 1.7 MB.
file_size le 1700000 | Find a file_size that is less than or equal to 1.7 MB.
file_size gt 1700000 and file_size lt 2000000 | Fetch records with a file_size range of 1.7 MB to 2.0 MB.
file_type eq "CSV" and (name co "example" or name co "myfile") | Fetch data with file type equal to csv and with the name containing example or myfile.

Note

All returned file sizes are given in bytes.
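
If you assemble query strings in a script before pasting them into the Custom Query text box, a few helpers can keep the units and date format consistent. The sketch below is an assumption-laden illustration: the function names are hypothetical, 1 MB is taken as 1,000,000 bytes to agree with the examples above, dates use the dd-MM-yyyy form, and values containing a double quotation mark are rejected because the escaping rules are not described here.

```python
from datetime import date

COMPARISON_OPERATORS = {"eq", "ne", "co", "sw", "ew", "gt", "ge", "lt", "le"}


def string_filter(prop: str, operator: str, value: str) -> str:
    """Build a filter that compares a property with a quoted string."""
    if operator not in COMPARISON_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    if '"' in value:
        raise ValueError("quote escaping is not covered by this sketch")
    return f'{prop} {operator} "{value}"'


def date_filter(prop: str, operator: str, day: date) -> str:
    """Build a date comparison using the dd-MM-yyyy date-in-string format."""
    return string_filter(prop, operator, day.strftime("%d-%m-%Y"))


def size_range_filter(min_mb: float, max_mb: float) -> str:
    """Build a file_size range filter; sizes must be provided in bytes."""
    low = round(min_mb * 1_000_000)
    high = round(max_mb * 1_000_000)
    return f"file_size gt {low} and file_size lt {high}"


print(string_filter("name", "co", "json"))
# name co "json"
print(date_filter("time_of_creation", "gt", date(2020, 3, 12)))
# time_of_creation gt "12-03-2020"
print(size_range_filter(1.7, 2.0))
# file_size gt 1700000 and file_size lt 2000000
```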

Search using a custom query

Use the LDC search language when creating your queries.

Perform the following steps to search using a custom query:

Procedure

  1. Click in the Search data catalog field, and then click Go to Advanced Search.

    The Advanced Search Form page opens.
  2. Select the Custom Search Query tab.

    The Custom Query page opens.
  3. Choose the Entity type that you want to search:

    • Resources: to search resources.
    • Fields: to search fields.
  4. Enter your query syntax in the Custom Query text box.

    If necessary, click Reset to clear any incorrect entry and reset the query text box.
  5. Click Run Search.

    The validity of the filter string is checked and then the search is performed.

Results

A list of resources or fields matching your search criteria is returned. Depending on your permissions, some data may be unavailable for viewing.

Note

You cannot switch between the Resources tab and the Fields tab on the search results list page. You must perform a separate query to display the results.

Filtering search results by resource and by field

By default, resources that match the search criteria themselves or contain fields that match the search criteria appear in search results. By clicking the Fields tab, you can change this view to the list of fields that match the search criteria.

Note

You cannot switch between the Resources and Fields tabs on an Advanced Search result. Advanced Search results are fetched as a pre-filtered search, either for resources or for fields, so only one of the two tabs applies to these results.

Example of resource field facets

Long facets

If there are more than five facets in any category, you can click View More to display all the facets in a separate dialog box. These are referred to as long facets.

Example of long facets

If you select multiple facets in the same category, the resulting filtered search list is an OR filter of the selected facets. If you select multiple facets from different categories, the result is AND filtering.

For example, if you select facets US_State or US_City from the Field Tags category, the resulting list is the OR filtering displaying resources with field tags US_State or US_City. If you also select the facet Accepted from the Field Tag Association State category, the resulting list displays only resources that have field tags US_State or US_City in the Accepted Field Tag Association State.

Sorting search results

Search results can also be sorted by Relevance, Name, and Rating. For single field searches only, you can sort by Confidence.

Options for sorting search results

Export your findings to a CSV file

You can export your findings as a report to a CSV file to analyze offline or share with others. The process of exporting includes two parts: you first generate a report of your findings, then download the report to a CSV file. You can generate reports from virtual folders, single resource view details, search results, or the Data Rationalization dashboard.

Attention

This feature is available in version 6.1 and later.

Perform the following steps to export your findings to a CSV file.

Procedure

  1. Click Export as CSV or Export Table as CSV to start generating the export data.

    The Export CSV Settings dialog box displays.
  2. Select which properties to export for each resource. Click Select All to include all the properties listed.

  3. Click Export to generate the data.

    After the CSV data values are successfully generated, a confirmation message appears in the header with an exports link.
  4. If you are ready to download the generated information at this point, click exports in the header message.

    The Exports page opens. This page provides a summary of your exported reports, including the report name, the report type (where it was generated from), the generation interval, and the report size. Any report listed here is automatically deleted within seven days from the time the report is generated.

    Note

    If you want to wait until later to download the generated CSV data, you can access the Exports page through the Exports option in your User Profile menu.
  5. From the reports table, click More actions, and then select Download report.

    The generated CSV file is downloaded to the location specified for the Path to exports configuration property during your installation of Data Catalog. See Managing configurations if you need to reconfigure the Path to exports property to a different location.
  6. (Optional) To delete a report, click More actions, and then select Delete report.

Results

After the report is downloaded, you can access the CSV file offline from the Path to exports location and share it with others.
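
For offline analysis, a downloaded report can be read with any CSV tool. The following minimal Python sketch lists the columns and counts the rows of an export. The function name is hypothetical, UTF-8 encoding is an assumption, and the columns you see depend on the properties selected in the Export CSV Settings dialog box.

```python
import csv


def summarize_export(csv_path: str):
    """Return the column names and the number of data rows in an exported report."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)  # each row is a dict keyed by column name
        return reader.fieldnames or [], len(rows)
```

Call `summarize_export` with the full path of a CSV file in your Path to exports location to get a `(columns, row_count)` pair.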

Customized resource facets tutorial

Custom resource properties along with search dimensions form custom facets. The following tutorial is intended for users who want to use custom facets to search Lumada Data Catalog data. The search dimensions set for a user show up as custom facets in the search results pane.

Note

Custom facets only apply to custom roles. Because search dimensions and custom facets cannot be defined for the default Data Catalog roles, custom facets do not show up in the search results facets pane for default roles.

For example, your administrator has granted you the custom analyst role NorCal_Analyst for processing claims in the northern California region. Your administrator has set up the following conditions:

  • A Claims custom property group.
  • Custom properties called Claim_Status, Claims_Region, and Claim_Code in the Claims custom property group.
  • The custom properties are limited to pre-filter the values Open and Pending for Claim Status and NorCal for Claims Region.

The following image shows where the pre-filters and custom facets appear on the page.

Example of custom facets in search results

Procedure

  1. If you search for Customer, the search results list the resources that match the keyword, limited to the pre-filter values for the search dimensions in this example. The results are filtered to resources that have the custom property value NorCal for Claims Region and Open for Claim Status.

  2. If you choose not to apply the pre-filters to your search results, perform the following steps:

    Example of turning off pre-filter facets
    1. Click the Open facet settings (gear icon) in the upper-left corner of the Resource tab to open the Facet Settings dialog box.

    2. Select No for Apply pre-filter values? in the Facet Settings dialog box.

    3. Click OK.

    The search results refresh to list resources that previously did not show in the pre-filtered search dimensions results. You can also reset the pre-filter by selecting Yes for the Apply pre-filter values? option.
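
The effect of the Apply pre-filter values? option can be pictured with a short sketch. It is an illustration only: the resource dictionaries and the `visible_resources` function are hypothetical, the pre-filter values are taken from the NorCal_Analyst example above, and the resources passed in are assumed to have already matched the search keyword.

```python
# Pre-filter values from the example above; property names follow the text.
PRE_FILTERS = {"Claim_Status": {"Open", "Pending"}, "Claims_Region": {"NorCal"}}


def visible_resources(resources, apply_prefilters=True):
    """Return the resources shown with the pre-filters switched on or off."""
    if not apply_prefilters:
        return list(resources)
    return [r for r in resources
            if all(r["properties"].get(prop) in allowed
                   for prop, allowed in PRE_FILTERS.items())]


matches = [
    {"name": "customer_claims_a", "properties": {"Claim_Status": "Open", "Claims_Region": "NorCal"}},
    {"name": "customer_claims_b", "properties": {"Claim_Status": "Closed", "Claims_Region": "NorCal"}},
    {"name": "customer_claims_c", "properties": {"Claim_Status": "Open", "Claims_Region": "SoCal"}},
]
print([r["name"] for r in visible_resources(matches)])
# ['customer_claims_a']
print(len(visible_resources(matches, apply_prefilters=False)))
# 3
```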