Tagging resources and fields

If your user profile is granted the Associate Business Terms permission, you can use Data Catalog to identify data by associating a business term with a specific folder, file, table, or field. You can associate any number of terms with an item. After a term is used to mark an item, you can use the term name to search for the item. You can also select terms to help you find items associated with those terms.

Contact your Data Catalog administrator for access to the following permissions:

  • To view business terms, your user profile requires the View Business Terms permission.
  • To create new terms, your user profile requires the Manage Business Terms permission.

Term association confidence cutoff for fields

When Data Catalog suggests a term association for a field, it assigns the association a score, or weight, during term propagation. The weight is expressed as a confidence percentage, with higher scores indicating a closer match to the criteria used to propagate the term. A confidence of 100% indicates a strong match to the association criteria. The confidence calculation depends on the following dimensions:

  • Overlapping values
  • Overlapping tokens (individual words)
  • Overlapping patterns
  • Matching field names
  • Other matching terms
  • Matching numeric range
  • Matches on numeric properties
  • Quantile, standard deviation, cardinality, boundaries, and mean

Each dimension contributes a different amount to the overall weight. The overall weight is calculated to emphasize high-quality matches against field data and to de-emphasize low-quality matches. Terms propagate better on some data than on other data. For example, terms on free-text fields, such as social media messages, do not propagate well. Terms on product descriptions or other standardized text or codes propagate more smoothly.
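As a rough illustration of how differently weighted dimensions can combine into a single percentage, the following sketch computes a weighted sum. The dimension keys, the weight values, and the `confidence_percent` function are all hypothetical; the actual weights and formula used by Data Catalog are not described here.

```python
# Hypothetical weights for illustration only; they are not Data Catalog's
# real values. They sum to 1.0 so that a perfect match on every dimension
# yields 100%.
WEIGHTS = {
    "overlapping_values": 0.4,
    "overlapping_tokens": 0.2,
    "overlapping_patterns": 0.2,
    "matching_field_names": 0.1,
    "numeric_properties": 0.1,
}


def confidence_percent(scores):
    """Combine per-dimension scores (each 0.0 to 1.0) into a 0-100 percentage.

    Dimensions missing from `scores` are treated as 0.0 (no match).
    """
    total = sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())
    return round(100 * total, 1)


# A field whose values and name match, but nothing else does.
print(confidence_percent({"overlapping_values": 1.0, "matching_field_names": 1.0}))
```

In a scheme like this, a heavily weighted dimension such as value overlap dominates the result, which is one way to emphasize high-quality matches over weak ones.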

Managing terms

The following sections help you to manage tagging by identifying the correct business term to use for a data type. The content also shows you how to accept or reject suggested term associations and how to tune term properties for efficient term propagation and association.

View existing terms and term associations

The Business Glossary shows all glossaries and terms for which you have access based on your role.

To see the associations that contribute to discovery or seed terms and rejected term associations, select the term and look at the Key Metrics card on the Summary tab.

The counts under Associated Data Elements for a term indicate its number of suggested, accepted, rejected, and seeded term associations.

Search for terms using Advanced Search

You can perform a search for terms using Advanced Search. This search can be helpful when you are tuning term properties for efficient term propagation.

Perform the following steps to search for terms using Advanced Search.

Procedure

  1. Click in the search box and then click Advanced Search.

    If the search box is not visible, click Search in the left menu bar.

    The Advanced Search page opens.
  2. Select the Entity type that you want to search.

    • Resources to search resources.
    • Fields to search fields.
  3. (Optional) Click in the Including Term(s) field and enter a term or select one from the drop-down list that displays. You can enter multiple terms. Select the Include child terms check box if you want to include child terms in your search.

    The selected terms appear in the Including Term(s) field.
  4. (Optional) Click in the Excluding Term(s) field and enter a term or select one from the drop-down list that displays. You can enter multiple terms. Select the Exclude child terms check box if you want to exclude child terms from your search.

    Note: If the Including Term(s) and Excluding Term(s) fields contradict each other, then Excluding Term(s) takes precedence.
    Selected excluded terms appear on the Advanced Search Form page.
  5. (Optional) Apply facets as needed, such as specifying a resource type or data type.

  6. Click Apply filters and search.

    A list of resource terms or field terms matching your search criteria is returned. The results of a single-term search are displayed by decreasing relevance by default, but you can sort them in the following ways: decreasing or increasing relevance, ascending or descending name order, and decreasing or increasing average rating.
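The interaction between the Including Term(s) and Excluding Term(s) fields can be sketched as a small filter. This is only a model of the documented behavior, not Data Catalog code: the function names and the sample term paths are hypothetical, child terms are assumed to be written with the dot-delimited hierarchy, and an item is assumed to qualify if it carries any one of the included terms.

```python
def is_term_or_child(term, parent):
    """True if term equals parent or sits below it in a dot-delimited hierarchy."""
    return term == parent or term.startswith(parent + ".")


def item_matches(item_terms, including=(), excluding=(),
                 include_children=False, exclude_children=False):
    """Decide whether an item tagged with item_terms passes the term filters."""
    def hit(term, wanted, with_children):
        if with_children:
            return is_term_or_child(term, wanted)
        return term == wanted

    # Exclusion is checked first, so it wins when the two lists contradict.
    if any(hit(t, x, exclude_children) for t in item_terms for x in excluding):
        return False
    if not including:
        return True
    # Assumption: an item qualifies if it carries ANY of the included terms.
    return any(hit(t, i, include_children) for t in item_terms for i in including)


# The same term in both lists: the exclusion takes precedence.
print(item_matches(["PII"], including=["PII"], excluding=["PII"]))
```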

Search for terms using Business Glossary

Perform the following steps to search for terms in the catalog using the Business Glossary.

Procedure

  1. Click Glossary in the left navigation menu.

    The Business Glossary page opens.
  2. On the Business Glossary page, enter your search in the Find box and press Enter.

Results

The search returns resources with the selected term and any child terms.

Search for terms using global search

Perform the following steps to search for terms in the Manage Glossary page:

Procedure

  1. On the Home page, enter a term to search for in the Search data catalog box and press Enter or click Search.

    The search results appear.
  2. Click the name of the term to select it.

Results

The term's glossary appears with the selected term highlighted. From here, you can manage the terms.

Searching in nested terms

While using the Business Glossary page, you can refine your search within nested terms. If a glossary contains a nested term, a Search icon (magnifying glass) appears to the right of the nested term displayed in the Business Glossary page.

To confine your search to that nested term, click the Search icon and enter a search term or phrase into the Search Glossary box.

Tagging a resource

You can associate an existing term with a folder, file, or table. If your user profile is configured with the Manage Business Terms permission, you can create a new term to associate with a resource. You need to have permission to access the glossary in which you want to add the term.

Note: Unlike field terms, resource terms do not participate in term propagation.

You can tag folders, member resources, files, views, or tables from the browser or from the search results view as described in the following sections:

Tag a member resource

Perform the following steps to tag a member resource.

Procedure

  1. From the browse or search results view, locate then select the member resource that you want to tag.

  2. Click the More actions icon and select Add term from the drop-down menu.

    Optionally, from the field view level for a collection, you can select +Add Term Association to open the Add a Term dialog box.
  3. In the Add a Term dialog box, select the action used to add the term.

    • Add a Term: Adds an existing term to the member resource.
    • Create a new term: Adds a new term to the member resource.
  4. Enter the term name in the Term name field.

    If you choose to create a new term, enter the term name in the New term name field, and optionally a term description in the Term description field.

    Note: Term names can be up to 256 characters long and can contain any character except the dot (.), which is used to denote term hierarchy. Term descriptions can have up to 512 characters.
  5. Click Add.

    The member resource is tagged.
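If you prepare term names outside the product, for example in a spreadsheet, a quick pre-check against the documented limits can catch problems early. The following is a minimal sketch; `validate_term` is a hypothetical helper, not part of Data Catalog, and it checks only the limits stated in the note above.

```python
MAX_NAME_LENGTH = 256         # documented limit for term names
MAX_DESCRIPTION_LENGTH = 512  # documented limit for term descriptions


def validate_term(name, description=""):
    """Return a list of problems found; an empty list means the term looks valid."""
    problems = []
    if not name:
        problems.append("name is empty")
    if len(name) > MAX_NAME_LENGTH:
        problems.append("name is longer than 256 characters")
    if "." in name:
        # The dot is reserved for denoting term hierarchy.
        problems.append("name contains a dot (.)")
    if len(description) > MAX_DESCRIPTION_LENGTH:
        problems.append("description is longer than 512 characters")
    return problems


print(validate_term("Customer ID"))
print(validate_term("Customer.ID"))
```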

Tag a file or table

Perform the following steps to tag a file or table.

Procedure

  1. From the browse or search results view, locate then select the file or table that you want to tag.

  2. On the tab named for the selected resource (for example, report.csv) in the field-level view, click the More actions icon and select Add term from the drop-down menu.

    The Add a Term dialog box opens.
  3. In the Add a Term dialog box, select the action used to add the term.

    • Add a Term: Adds an existing term to the file or table.
    • Create a new term: Adds a new term to the file or table.
  4. Enter the term name in the Term name field.

    If you choose to create a new term, enter the term name in the New term name field, and optionally a term description in the Term description field.

    Note: Term names can be up to 256 characters long and can contain any character except the dot (.), which is used to denote term hierarchy. Term descriptions can have up to 512 characters.
  5. Click Add.

    The file or table is tagged.

Tag a field

Perform the following steps to tag a field in the Resource Detail view.

Procedure

  1. From the browse or search results view, locate then select the resource that contains the field you want to tag.

  2. In the Field Properties pane, click +Add Term Association for the selected resource.

    The Add a Term dialog box opens.
  3. In the Add a Term dialog box, select the action used to add the term.

    • Add a Term: Adds an existing term to the field.
    • Create a new term: Adds a new term to the field.
  4. Enter the term name in the Term name field.

    If you choose to create a new term, enter the term name in the New term name field, and optionally a term description in the Term description field.

    Note: Term names can be up to 256 characters long and can contain any character except the dot (.), which is used to denote term hierarchy. Term descriptions can have up to 512 characters.
  5. Click Add.

    The field is tagged.

Tag unstructured data

You can use a business term to tag your unstructured data using a regular expression.

Note: For tagging unstructured data, you should run a Data Profiling job on the data instead of the Business Term Discovery job.

Perform the following steps to tag a term in unstructured data using a regular expression:

Procedure

  1. Click Glossary in the left navigation menu.

    The Business Glossary page opens.
  2. Select the term you want to use in the glossary list or click Add New to add a new term.

  3. Click the Settings tab.

  4. Click the Identify Term By field and select Regular Expression.

    The Regular Expression field displays.
  5. Enter the regular expression you want to use in your search in the Regular Expression field.

    Do not use ^ for “starts with” or $ for “ends with” in the regular expression, or Data Catalog will fail to find most of the mentions of the term.

    Note: The regular expression you enter in the Regular Expression field must adhere to the regular expression logic enforced by the Java regular expression engine.
  6. In the Test Data field, enter data that matches the regular expression and click Test.

  7. Scroll down to Discovery Setting and Asset Type and select the Unstructured check box.

    Note: If you do not select the Unstructured check box, Data Catalog will not run the job for the term.

    You can also select the Structured check box if you would like to find the term in structured data.

  8. (Optional) Under Scan Documents, you can select Stop after [ ] matches and specify a number to stop the scan after a specified number of regular expression-based matches.

  9. Click Apply Changes.

  10. Open the Data Canvas and run the Data Profiling Combo job on your data, or the Data Profiling job if the Format Discovery and Schema Discovery jobs already ran on the data.

Results

If after using these steps you find that unstructured data is not being tagged, verify the following actions:
  • The Unstructured check box is selected for the data.
  • A Data Profiling or Data Profiling Combo job ran for the data specified.
  • The regular expression adheres to Java regular expression logic.
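The following sketch illustrates why anchors defeat matching in unstructured text, and what a match limit like the Stop after [ ] matches option does. It uses Python's `re` module as a stand-in for the Java regular expression engine, and the pattern uses only constructs (`\b`, `\d`, `{n}`) that behave the same way in both. The `ORD-NNNNNN` order-number format, the sample text, and the `find_mentions` helper are hypothetical.

```python
import re
from itertools import islice


def find_mentions(pattern, text, stop_after=None):
    """Return matches of pattern anywhere in text, optionally capped at stop_after."""
    matches = (m.group(0) for m in re.finditer(pattern, text))
    return list(islice(matches, stop_after))


text = "Refund issued for ORD-123456. See also ORD-654321 and ORD-111111."

# Unanchored pattern: finds every mention, wherever it appears in the text.
print(find_mentions(r"\bORD-\d{6}\b", text))

# Anchored with ^: only a mention at the very start of the text could match.
print(find_mentions(r"^ORD-\d{6}", text))

# Capped scan, similar in spirit to stopping after a set number of matches.
print(find_mentions(r"\bORD-\d{6}\b", text, stop_after=2))
```

Because mentions in a document rarely sit at the very beginning or end of the text, an anchored pattern misses almost all of them.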

Tagging collections

Collections are a set of files with a similar schema and format. When files are grouped as a collection, you can manage the terms for that set of files from the collection as the single representation of all the data in all the files.

For terms assigned to individual files before they become part of the collection, do the following:

  • Add the accepted term associations found in files to the collection as suggested terms (unless they are already part of the collection as accepted terms).
  • Treat as rejected from the collection any term associations that were rejected.

When a file is part of a collection, Lumada Data Catalog no longer suggests term associations for the individual collection members. However, any existing accepted terms in the collection members continue to be considered in term propagation for that term. You should manage terms for all files from the collection rather than make new term associations in the individual files.

Note: While collections can be tagged at resource level and at field level, the member resources cannot be tagged once they have been identified as collection members, and any existing term associations are aggregated to the collection root.

Tag a collection

Perform the following steps to tag a collection.

Procedure

  1. Navigate to the top level folder of the collection you want to tag and click the arrow to expand the folder.

    All the fields will be shown as an ordered list.
  2. Select any field, and on the Summary tab, click the Add Term link.

  3. In the Add a Term dialog box, select Create a new term.

  4. Fill in the fields, including New term name, then click Add.

    The collection is now tagged with the name you entered.

Accepting or rejecting suggested term associations

You can accept or reject suggested term associations in Lumada Data Catalog.

  • Double-click a suggested term association in a resource field or in field-level search results to accept it.
  • Single click to open the Association window where you can select the More actions icon to display the drop-down menu to accept or reject the term association.

Accept a term association

Perform the following steps to accept a term association.

Procedure

  1. Navigate to a resource field or perform a field-level search for a term.

  2. Click the Glossary tab.

  3. Click the More Actions menu at the end of the row for the business term, and select Accept Association from the drop-down menu.

Results

The term association is accepted and the Status column updates to display ACCEPTED.

Reject a term association

Perform the following steps to reject a term association.

Procedure

  1. Navigate to a resource field or perform a field-level search for a term.

  2. Click the Glossary tab.

  3. Click the More Actions menu at the end of the row for the business term and select Reject Association from the drop-down menu.

Results

The term association is rejected and the Status column updates to display REJECTED.

Remove a term association

Perform the following steps to remove a rejected term association.
Note: You can only remove a rejected term association. You cannot remove a term association that was accepted or suggested.

Procedure

  1. Navigate to a resource field or perform a field-level search for a term.

  2. Click the Glossary tab.

  3. Click the More Actions menu at the end of the row for the business term and select Remove Association from the drop-down menu.

Results

The term association is removed.

Change the data used as the seed for term discovery

When a term uses the Value method of automatic tagging, Data Catalog suggests term associations based primarily on how well field values match the values of the fields tagged and accepted, or the quality of the seed value of the term association. The first term associated with a field is automatically used for term discovery; if you want to include additional fields in the "seed" data and metadata for term discovery, you can mark that term association as a seeded term association.

Terms for some data propagate more precisely than other data. For example, if you tag a field that contains a product code made up of letters, numbers, and punctuation, such as “CSP-2201A”, Data Catalog can precisely identify other fields with similarly constructed data. However, if you tag a field that contains free text (such as the text field in a social media feed) or numeric values (such as rainfall depth values), Data Catalog may find false positives when attempting to match the data. Consider defining a business rule or regular expression for a term when you want to tag data with a specific text pattern.
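One way to see why structured codes propagate more precisely than free text is to reduce each value to its "shape" and compare shapes between the seed field and a candidate field. The sketch below is only an illustration of that idea. It is not Data Catalog's matching algorithm, and the `shape` and `shape_overlap` functions and all sample values other than “CSP-2201A” are hypothetical.

```python
def shape(value):
    """Reduce a value to its pattern: letters become A, digits become 9."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)  # punctuation and spaces are kept as-is
    return "".join(out)


def shape_overlap(seed_values, candidate_values):
    """Fraction of candidate values whose shape also occurs in the seed values."""
    if not candidate_values:
        return 0.0
    seed_shapes = {shape(v) for v in seed_values}
    hits = sum(1 for v in candidate_values if shape(v) in seed_shapes)
    return hits / len(candidate_values)


print(shape("CSP-2201A"))

seed = ["CSP-2201A", "ABX-1000C"]
print(shape_overlap(seed, ["QRT-5555Z", "MNO-0042B"]))     # product codes
print(shape_overlap(seed, ["great product", "late again"]))  # free text
```

Product codes collapse to a single shape, so other fields with the same construction stand out clearly. Free text produces a different shape for nearly every value, which gives a matcher little to work with.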

Follow the steps below to add or remove a term association as a seed:

Procedure

  1. Click Glossary in the left navigation menu and use the left navigation tree to locate the term.

    On the Summary tab, icons in the Key Metrics card indicate the suggested, accepted, rejected and seed associations.
  2. On the Summary tab, click View All on the Business Terms card.

    Data that is tagged as seed data is marked as Enabled in the Seed column.
  3. Evaluate the seed term associations to make sure they form a representative set of data to use for tagging.

  4. Stop a seed term from participating in term discovery by clicking the Actions icon at the end of the business term row and selecting Do not use as seed.

  5. Add a term association as a seed by opening the term association and selecting Use field data in Term discovery.

Next steps

Rerun term discovery.

Change the confidence cutoff for a term

If a field matches with a score below the threshold value or confidence cutoff, then the term association does not appear in the catalog.

Perform the following steps to change the confidence cutoff for a term.

Procedure

  1. Click Glossary in the left navigation menu and select the term for which you want to change the confidence cutoff.

  2. In the Settings tab, scroll down to the Confidence Lower Limit and Confidence Upper Limit fields.

  3. Set the values to the scores that Data Catalog will use as the minimum and maximum confidence cutoffs for term association suggestions.

  4. Click Apply Changes to save.

Next steps

If you want only changes in the data and terms to be updated, rerun the business term discovery job with the Incremental Profiling check box selected.

If you want to clear the existing suggestions and start from scratch, rerun the business term discovery job with the Incremental Profiling check box unselected.

Create a value term

Perform the following steps to create a value term. If it is the first term created in a glossary, by default it is created as a parent term.
Note: The permissions for the glossary control which users will see the term.

Procedure

  1. Click Glossary in the left navigation menu.

    The Business Glossary page opens.
  2. Click Add New and select Term.

    The Create Business Term dialog box displays.
  3. Enter a name for the term.

    • If you did not already select a glossary, click the arrow in the Glossary field and select the glossary you want to contain the term.
    • Select the parent term if desired.
    If you do not specify a parent term in the term settings, the term is created as a parent term.
  4. Click Create.

Next steps

  • As a best practice, you should add a term description after the term is created.
  • Run the Business Term Discovery job, which can propagate and create term suggestions.

Create a regular expression term

By default, terms are created as Value terms. Use this procedure to change an existing term to a Regular expression term. Term associations will be suggested based on the number of values in the field that match the regular expression.

When creating your glossaries and terms, the built-in terms may provide good examples of what you can do using regular expressions and length limits. All terms configured with regular expression rules are listed in the Rules tab of the Data Canvas view of the term. You can review the regular expressions for built-in terms, and you can disable built-in terms from propagating. However, you cannot edit the regular expression for a built-in term.

Perform the following steps to create a regular expression term.

Procedure

  1. Click Glossary in the left navigation menu and select the term you want to modify.

  2. On the Details tab, select Regular Expression from the drop-down list in the Identify Term By field.

  3. Enter the regular expression in the Regular Expression field.

  4. Enter test data to validate that the regular expression matches the data as you expect.

  5. In the Min Length and Max Length fields, enter the minimum and maximum number of characters against which Data Catalog should apply this expression.

    These values help Data Catalog optimize processing so that it doesn't spend time on data that is not likely to match the regular expression.
  6. In the Min Confidence field, set the minimum threshold value for the term.

    This is a threshold value to indicate how many of the field's values need to match the pattern for the term to be associated with the field. For example, if you expect each value in a field to match, set the cutoff at 90% or higher. If you want Data Catalog to suggest a term if the field has any values that match the regular expression, set the Min Confidence percentage very low.
  7. Click Apply Changes.
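The relationship between the regular expression, the length limits, and Min Confidence can be sketched as follows. This is a simplified model under stated assumptions, not the product's implementation: values outside the length range are skipped without being tested but still count toward the total, each value must match the pattern in full, and Python's `re` stands in for the Java engine. The function names and sample values are hypothetical.

```python
import re


def match_percent(pattern, values, min_len, max_len):
    """Percentage of field values that fully match the pattern."""
    if not values:
        return 0.0
    compiled = re.compile(pattern)
    hits = 0
    for value in values:
        # Skip values that cannot match because of their length.
        if not (min_len <= len(value) <= max_len):
            continue
        if compiled.fullmatch(value):
            hits += 1
    return 100.0 * hits / len(values)


def should_suggest(pattern, values, min_len, max_len, min_confidence):
    """True if enough values match for the term to be suggested."""
    return match_percent(pattern, values, min_len, max_len) >= min_confidence


zip_pattern = r"\d{5}(-\d{4})?"  # NNNNN or NNNNN-NNNN
field_values = ["94105", "10001-0001", "n/a", "30301"]

print(match_percent(zip_pattern, field_values, 5, 10))
print(should_suggest(zip_pattern, field_values, 5, 10, min_confidence=90))
print(should_suggest(zip_pattern, field_values, 5, 10, min_confidence=10))
```

With a high Min Confidence, a single stray value such as "n/a" in a small sample is enough to suppress the suggestion. A low setting suggests the term as soon as a few values match.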

Next steps

Run the Business Term Discovery job, which can propagate and create term suggestions.

Controlling automated discovery and learning

You can stop term association changes from impacting the algorithm for term discovery in Data Catalog. When learning is turned off, accepting and rejecting term associations no longer has an impact on how value terms are evaluated. Consider turning off learning when you are satisfied that terms are added to new data appropriately.

Turn automated discovery and learning off or on

Follow the steps below to control automated discovery and learning. By default, both settings are turned on when a term is created.

Note: You must have permission to manage business terms to proceed.

Procedure

  1. Click Glossary in the left navigation menu and select the term you want to manage.

    The business term summary is displayed.
  2. Click the Details tab to view the settings for Automated Discovery and Keep Learning.

  3. Click a setting to change it.

    • Switch the Automated Discovery option to the right to allow Data Catalog to suggest term associations, or switch it to the left to turn it off.
    • Switch the Keep Learning option to the right to improve automated tagging with analysis, or switch it to the left to turn it off.
  4. Click Apply Changes to save the updated settings or click Cancel to discard your changes.

Delete a term

Use the following steps to delete a term:

Procedure

  1. Click Glossary in the left navigation menu and select the term that you want to delete.

  2. Click Actions under the left navigation tree and then select Remove.

    A confirmation box appears.
  3. In the Please Confirm field, type yes and click Confirm.

Built-in terms

In addition to terms that you can add, Lumada Data Catalog has a set of predefined terms. These terms fall into two categories:

  • Regular expressions

    Data, such as United States ZIP codes and phone numbers, are tagged by matching data with regular expressions for these values.

  • Reference data

    Field data, such as countries, the names of US states, and first and last names, are tagged by matching the signature of known data. This reference data is static. You cannot include data from seed fields to alter the term algorithm of the built-in terms.

Built-in terms are predefined and propagated when you run a term job after collecting discovery metadata for catalog resources. Built-in terms cannot be changed. If you do not want to use the built-in terms provided for tagging your data, you can turn off automatic term propagation for these terms. See Turn automated discovery and learning off or on for details.

Suggested terms are associated with fields that match the reference data and patterns, such as the data in the following list.

Suggested terms

  • Countries: full name
  • Countries: 3-letter abbreviation
  • Email address
  • First Name
  • Global City
  • IP Address
  • Last Name
  • Major Credit Card Number
  • Occupation
  • People Names
  • Salutation
  • US Address
  • US City
  • US County
  • US Phone Number
  • US Social Security Number, Numeric
  • US Social Security Number, Delimited
  • US State Abbreviation
  • US States
  • US ZIP Code: NNNNN and NNNNN-NNNN

Regex use case: National Identifiers

An excellent use case for regular expression (regex) terms is discovering national identifiers. In this example, the requirement is to discover national identifiers, such as passport or national ID numbers, for a country or set of countries. Data Catalog can help discover such fields with the help of regex terms.

Follow the steps below to discover fields using regex terms.

Procedure

  1. Create a glossary named National_Identifiers.

  2. Create a regex term for each national identifier or passport format that needs to be discovered, defining a valid regular expression for each.

  3. Run term discovery on the data.

Results

Data Catalog's term discovery now identifies the national identifier fields and suggests the term associations.
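Before creating the terms, it can help to try candidate expressions against sample data. The sketch below models the idea of one regex per identifier type and reports which fields would qualify. It is not how Data Catalog runs term discovery. The term names, the passport format (two letters followed by seven digits), the 90% threshold, and the sample data are all hypothetical, and Python's `re` stands in for the Java engine.

```python
import re

# Hypothetical regex terms for a National_Identifiers glossary.
NATIONAL_ID_PATTERNS = {
    "US SSN, Delimited": r"\d{3}-\d{2}-\d{4}",         # NNN-NN-NNNN
    "Example Passport (hypothetical)": r"[A-Z]{2}\d{7}",
}


def discover(fields, min_confidence=90.0):
    """Map each field name to the terms matched by enough of its values."""
    suggestions = {}
    for field_name, values in fields.items():
        for term, pattern in NATIONAL_ID_PATTERNS.items():
            hits = sum(1 for v in values if re.fullmatch(pattern, v))
            percent = 100.0 * hits / len(values) if values else 0.0
            if percent >= min_confidence:
                suggestions.setdefault(field_name, []).append(term)
    return suggestions


sample_fields = {
    "ssn": ["123-45-6789", "987-65-4321"],
    "doc_no": ["AB1234567", "CD7654321"],
    "notes": ["call back", "n/a"],
}
print(discover(sample_fields))
```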