Hitachi Vantara Lumada and Pentaho Documentation

Getting started with business terms and term propagation

Business terms, or simply terms, are business tags that you attach to Lumada Data Catalog data units (virtual folders) and resources (tables, files, and fields).

You can use terms to identify sets of data so you can use the term names as search terms. Data Catalog can assist you in tagging the data in your catalog. When you tag a field in a data resource, Data Catalog automatically propagates that term through your data environment and identifies associations with similar fields in other files and tables that match the data you tagged. This process, called business term discovery, uses data and metadata about the fields you tag to identify matching fields. Note that terms on resources (files and tables) are not propagated, whereas terms on fields are propagated.

Terms can also be created with regular expression rules to match data without requiring an initial term association.

Business glossary

A single resource or field can be tagged by multiple terms associated with different business divisions. Data Catalog provides a way to organize and separate the terms grouped by business divisions in the form of business glossaries. Glossaries also enforce role-based access control to terms and glossaries. The administrator can limit the visibility of terms within a glossary to a specified set of users, depending on those users' Data Catalog roles.

Learn more about roles and terms in Managing business glossaries.

Business terms

In Data Catalog, you can put business terms on any kind of data asset or resource, or any field inside a resource. The level (resource-level or field-level) that you associate with a term determines how that term behaves. For more information on business terms, see Create a new business term.

Resource-level terms

Terms associated with a resource are often isolated tags for the resource. Use these terms as search terms to find files, tables, or directories that have been individually tagged. These terms are part of the metadata for the resource. They can be used to enrich the metadata for the resource, such as identifying or categorizing file contents, or to drive other types of processing, such as access control or processing stages.

For information on managing resource terms, see Tagging resources and fields.

Field-level terms

Terms associated with fields in a given resource are referred to as field-level terms. These terms can be propagated across the catalog.

The field-level term types include:

  • Built-in terms

    Data Catalog provides pre-built terms for common data patterns, such as credit card numbers and address components.

  • Custom terms or User-created terms

    You can create field-level terms to identify data patterns across your catalog. There are two ways to create your own field-level terms:

    • Value terms

      When you manually tag a field, the data and metadata for the field contribute to a rule that Data Catalog uses to suggest similar data to mark with the same term.

    • Regular expression (Regex) terms

      You can specify or define a term with a regular expression that describes the data you want Data Catalog to mark with the term.

  • Reference terms

    You can use the data from built-in terms or other custom-created terms as seeds for new terms. Such built-in terms and custom terms become reference terms. For more information, see Reference terms.

  • Business entity

    A business entity is a group of terms with context and has its own specifications for use. For more information, see Business entities.

To learn more about managing field-level terms, see Tagging resources and fields.

Tour the Business Glossary page

The Business Glossary page displays the glossaries available to you in the left panel. When you select a glossary or term in this panel, you can view summary information, details, and associated rules for the glossary or term on the page. Select the corresponding button for the view you want.

  • Summary view

    In the Summary view for a glossary or term, you can view the following information. Some information is visible only for a glossary or only for a term.

    • Term name or Glossary name

      Name of the term or glossary in your catalog. Select the pencil icon if you want to edit this name. For terms, you can select the flag icon to set the sensitivity of the term.

    • Description

      Add or edit the description of the glossary or term. For example, you may want to describe the purpose of the specified term for your users.

    • Statistics (for a glossary)

      This section provides the total number of terms, business entities, and data elements in the glossary.

    • Key Metrics

      This panel provides important information at a glance, including the following:

      • Sensitivity (for a term)

        Indicates the sensitivity level of the term.

      • Associated Data Elements

        Includes the number of each type of data element association that can be made within the glossary. You can drill down to view the associated data elements by clicking their icons.

        • Blue question mark icon: Suggested associations
        • Green check mark icon: Accepted associations
        • Red X icon: Rejected associations
        • Scope icon: Seeded associations

    • Status

      This section indicates the current status of the selected glossary or term.

      • Last Update

        A timestamp indicating when the glossary or term was last updated.

      • Business Entity (building icon, for a parent term)

        Allow users to create a business entity using this term. Select the check box to turn on this permission. Clear the check box to turn off this permission.

      • Anchor (anchor icon, for a child term)

        Allow users to set the term as an anchor term. Select the check box to turn on this permission. Clear the check box to turn off this permission.

    • Style (for a glossary)

      Displays the icon and color associated with the glossary, if any. Select Change to edit the associated icon or color.

    • Glossary Properties (for a term)

      Displays the associated glossary and its properties.

      • Glossary

        The name of the glossary.

      • Icon & Color

        The icon and color used to indicate terms in the glossary.

    • Properties

      Displays the properties of the glossary or term.

      • Created By

        The username of the user who created the item.

      • Reference Term (for term)

        An option to create a reference term.

    • Workflow

      • Author

        The user who initiated the workflow on the current asset.

      • Summary

        Summary of the workflow.

      • Assignee

        The user to whom the workflow is currently assigned.

      • Status

        The status of the workflow (Draft, In Review, To Approval, or Approved).

      • View Workflow Details

        Click to view the workflow details.

      • View Changes

        Click to view the changes.

  • Settings tab

    On the Settings tab, you can see more information about the glossary or term.

    For a glossary, you see a table containing all the terms in the glossary. From here, you can select the columns of information to display about the terms, show filters for viewing term information, and add or delete terms.

    For a term, you can:

    • Turn Automated Discovery, Keep Learning, Free Text (only for structured data), and Anonymous Data (only for unstructured data) on or off.
    • Choose whether to identify a term by a Value or a Regular Expression, and see data about the fields associated with the term.
    • Filter the Seed Field and Rejected Field items and select the columns of information to display. Additionally, you can perform operations like Do not use as Seed for Seed Field items (only for Identify Term by Value) and Use in Learning or Do not Use in Learning for Rejected Field items.
    • Select the Asset Type: Structured, Unstructured, or both (only for Identify Term by Regular Expression).
    • Set a range of confidence percentages under the Discovery Setting. Suggestions with a lower confidence percentage than the Confidence Lower Limit value are ignored by automatic discovery. Suggestions with a higher confidence percentage than the Confidence Upper Limit value are automatically accepted.
  • Workflow tab

    On the Workflow tab, you can see Active Workflows and Workflow history.

  • Associations tab

    On the Associations tab, you can view associations of business terms (only accepted or suggested) with data assets and perform operations like Use as Seed or Remove as Seed, and Do not Use In Learning.

    You can also add, view, and delete associations of business terms with another business term. For more information, see Managing associations.
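
The Discovery Setting confidence limits described under the Settings tab split suggestions into three outcomes. The following is a minimal sketch of that rule, not Data Catalog code: the function name and the example limit values (40 and 90) are assumptions chosen for illustration.

```python
def classify_suggestion(score, lower_limit, upper_limit):
    """Classify a suggestion by its confidence percentage.

    Hypothetical helper: scores below the lower limit are ignored,
    scores above the upper limit are accepted automatically, and
    everything in between stays a suggestion for manual review.
    """
    if score < lower_limit:
        return "ignored"
    if score > upper_limit:
        return "auto-accepted"
    return "suggested"


# Example limits are assumptions, not product defaults.
for score in (35, 70, 95):
    print(score, classify_suggestion(score, lower_limit=40, upper_limit=90))
```

A score equal to either limit stays a suggestion here, because the text describes only "lower than" and "higher than" comparisons.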

Built-in terms

In addition to terms you add, Data Catalog has a set of predefined terms that it propagates throughout the cluster. These terms fall into two categories:

  • Regular expressions

    Data, such as United States phone numbers and zip codes, are tagged by matching data with regular expressions that describe data typical for these values.

  • Reference data

    Field data, such as countries, states of the United States, and first and last names, are tagged by matching the signature of known data. The reference data is static; you cannot include data from seed fields to alter the term algorithm of the built-in terms.

The built-in terms cannot be changed. If you don't want to use the provided built-in terms for tagging your data, you can turn off automatic term propagation for these terms.

Built-in terms are propagated when you run a term job after collecting discovery metadata for catalog resources. See Managing jobs to manage value terms.

Reference terms

Reference terms are terms whose seed data or regular expression definition is used by other terms, called referring terms, to seed their own term discovery. Reference terms are indicated in Data Catalog by an arrow. Only terms with seeds or regular expression definitions are considered reference terms.

There are a few points to remember when managing reference terms. They can be from the same glossary or can be from different glossaries, such as built-in term glossaries. Additionally, they can have a many-to-one relationship where one reference term is used by many referring terms. Lastly, as a best practice, you should avoid cyclic term references because Data Catalog does not check for them.

Reference terms and business entities

Reference terms are used in Business entities, but are not limited to just business entities. Non-business entity member terms can assign other qualifying terms as reference. Suggestions for term associations vary:

  • If reference terms are defined for business entity members and a successful term discovery is performed, then only the associations for the business entity member (referred) term are suggested for fields. The original reference term suggestions are removed for such fields.
  • Reference terms that are part of a business entity have an increased confidence percentage if Data Catalog has discovered the context of the business entity.
  • When reference terms are defined for terms that do not belong to a business entity, both term associations (one for the referring term and one for the original reference term) are shown.

Value terms and seed data

When you tag a field in a file or table, Data Catalog adds the same term to other fields in files and tables that match the data you tagged. It uses the first term association created to seed the term discovery process. It uses data and metadata about the field you tag to identify matching fields, then suggests the same term for those fields.

The suggested term appears as a dot-outlined button which you can click to accept or reject.

The score, shown as a percentage, that appears with the suggested term indicates how closely this field matches the data and metadata of the original field.

Data Catalog compares the seed field's metadata and sample data with the metadata and sample data for all other fields included in the catalog. When it finds other fields with similar metadata or data, it calculates a score to represent how closely a new field matches the tagged field. For example, a field with the same name and data type might score 90%, while a field with the same data type and overlapping values but a different field name might score only 75%. A field that is a complete copy would score 100%. Data Catalog shows suggested terms only for fields that score above a calculated threshold, known as the confidence cutoff.

Note: In Search, suggested terms are treated the same as accepted terms. When you search on a term, you see all the fields that contain accepted or suggested terms. However, only accepted terms are exported to external applications such as Apache Atlas.

Value term scoring components

Term association scoring includes weighted contributions of major and minor scoring dimensions. For a suggested field to appear, the fields must match on one or more of the following major criteria:

  • Overlapping values.
  • Overlapping tokens from the values, which are individual words in the field values that may match, even if the complete strings do not match.
  • Overlapping patterns, such as most of the sample values consisting of two words as in a person's name, or the sample data includes text formatted as dates.
  • Matching field names.
  • Standard deviation.
  • Regular expression counts.
  • Other matching terms.

At least one of these major scoring dimensions must match for a field to be suggested.

As a minor scoring dimension, a suggested field may appear if the field match is less than 5% for each matching area. For example, the fields may contain matches on numeric properties such as quantities, cardinality, boundaries, or mean.
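
To make two of the major criteria concrete, the following sketch measures overlapping values and overlapping tokens between a seed field and a candidate field. It only illustrates the idea of overlap. The ratio used here (shared distinct items divided by the candidate's distinct items), the function names, and the sample data are assumptions, and Data Catalog's actual scoring and weighting are not documented in this article.

```python
def overlap_ratio(seed_items, candidate_items):
    """Share of the candidate's distinct items that also appear in the seed."""
    seed, candidate = set(seed_items), set(candidate_items)
    if not candidate:
        return 0.0
    return len(seed & candidate) / len(candidate)


def tokens(values):
    """Split each value into lowercase word tokens."""
    return [tok for value in values for tok in value.lower().split()]


# Hypothetical sample values for a seed field and a candidate field.
seed_values = ["Acme Anvil", "Acme Rocket", "Road Runner Trap"]
candidate_values = ["Acme Magnet", "Acme Rocket"]

# Complete strings: only "Acme Rocket" is shared.
value_overlap = overlap_ratio(seed_values, candidate_values)

# Individual words: "acme" and "rocket" are shared, "magnet" is not.
token_overlap = overlap_ratio(tokens(seed_values), tokens(candidate_values))

print(f"value overlap: {value_overlap:.2f}, token overlap: {token_overlap:.2f}")
```

In this example the token overlap is higher than the value overlap, which shows how individual words can match even when the complete strings do not.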

There may be cases where you have anonymous values and term associations. In such cases, Data Catalog looks for overlapping data, overlapping tokens, and data patterns to match fields. However, to gain quality results, Data Catalog filters out common words and numbers that would match too often. For example, words such as "a", "an", and "the" are removed from the list of tokens. Also, individual letters ("a", "b", "c") and integers with 3 or fewer digits are removed.
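
The token filtering described above can be sketched as follows. This is an illustrative approximation only: the stop-word set contains just the three words named in the text, and the function names are hypothetical.

```python
# Only the words named in the text; the full list is not documented here.
STOP_WORDS = {"a", "an", "the"}


def keep_token(token):
    """Return True if the token is distinctive enough to use for matching."""
    t = token.lower()
    if t in STOP_WORDS:
        return False
    if len(t) == 1 and t.isalpha():      # individual letters such as "a", "b", "c"
        return False
    if t.isdigit() and len(t) <= 3:      # integers with 3 or fewer digits
        return False
    return True


def filter_tokens(tokens):
    """Drop common words, single letters, and short integers."""
    return [t for t in tokens if keep_token(t)]


print(filter_tokens(["The", "order", "42", "b", "2201", "an", "Anvil"]))
# prints ['order', '2201', 'Anvil']
```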

If the field data in either the seed or the field to be matched includes many or mostly non-distinct values, Data Catalog is unlikely to make correct term associations.

Value tagging seed fields

Data Catalog uses the sample data values from the fields from specific term associations to add to the term discovery process. The data from those fields is used when Data Catalog calculates the term association scoring. When you use a term for the first time to mark a field, Data Catalog sets that field to be a seed: the data in this field contributes to term discovery. You can add additional fields to use in term discovery. You can also replace the original seed with other term associations if you find better seed fields.

Term associations marked for use as seeds in term discovery are indicated by the dot next to the term name, as shown in the following image.

Term association with seed

As a best practice, choose good seed terms. Make sure the data in the field is comprehensive and representative of the data you want tagged. If needed, replace a seed term if the field has any of the following conditions:

  • Null values.
  • Incorrect or non-representative values.
  • A subset of the data you would expect to be tagged.

Best practices for value tagging

Field data works well when field values are distinct terms and when sample data is a good representation of the full set of data. Strive for relatively low cardinality lists of repeated values such as product names or marital status values. Similar or overlapping data works well, too.

There are cases where discovery may not work well for terms:

  • Numeric codes that are not distinct patterns. For example, 9-digit account numbers may be confused for US Social Security numbers.
  • Numeric values such as temperatures, financial values, and counts.
  • Cases where the term name requires contextual information. For example, first and last names may work well for tagging, but term discovery cannot distinguish between customer names and vendor names.

Curating value terms

The first step for tagging your catalog is to manually add terms to fields in the data you handle frequently.

Here is a process guideline:

Procedure

  1. Browse to data you use.

  2. Add terms to the fields.

    Use names that match how you would expect users to identify the data. You might consider whether a particular field is a good representative of the data you want Data Catalog to tag.
  3. Run batch term processing.

    This batch job is run by an administrator, typically on a daily schedule. It applies new and updated terms to existing data and applies all terms to new data. During times of high tagging activity, you might want to run this job more frequently.
  4. Search on a specific term to review how it was propagated.

    You can use the Advanced Search, Glossary, or Catalog view to see fields tagged with a specific term.
  5. Review the term associations and use the profile information for the field to validate if the term is correct.

  6. Accept correct term associations and reject incorrect term associations.

    Accepting and rejecting terms changes the term discovery algorithm for the term. To best train Data Catalog to perform the optimal term propagation, accept or reject one or two term associations at a time. Exhaustively removing all incorrect term associations is not as effective for improving the term discovery algorithms.
    Note: To be able to accept or reject term associations for most resources, users need to be assigned the Associate Business Terms permission.
  7. Rerun term processing.

Next steps

Repeat the curation process if needed.

Caution: Terms for some data propagate more precisely than for other data.

If you tag a field that contains a product code made up of letters, numbers, and punctuation, such as “CSP-2201A,” Data Catalog can precisely identify other fields with similarly constructed data. However, if you tag a field that contains free text, such as a text field in a social media feed, or numeric values, such as rainfall depth values, Data Catalog may find false positives when attempting to match the data. See Term tuning scenarios for more information about troubleshooting terms.

As a best practice, consider defining a regular expression for a term when you want to tag data that has a specific text pattern.

Tuning your term associations

In Data Catalog, the method of linking the created seed terms with similar fields in other resources is called term association. After defining the seed term and running the term discovery process, Data Catalog identifies the fields that match the seed term as term associations. Term associations also include a matching score, which determines the percentage match to the seed data. If the value is 90%, then this term association is a 90% match to the seed term and indicates a high probability of data similarity between the seed field and the associated field.

You have several methods for tuning the tagging process.

Accept or reject term associations

When a term association is accepted or rejected, Data Catalog determines which component of the data or metadata matched (or didn't match) and adjusts how that component contributes to the overall score.

Each acceptance of a term incrementally increases the weight of the score component while each rejection of a term incrementally lowers the weight of the score component. When there are three more rejections for a term than there are acceptances, the affected component of the score is removed completely.
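
The following sketch models how a single score component might respond to feedback. The class name, the starting weight, and the step size of 0.1 are assumptions made for illustration. Only the rule that a component is removed once rejections exceed acceptances by three comes from the description above.

```python
class ScoreComponent:
    """Hypothetical model of one scoring component's weight."""

    STEP = 0.1  # assumed increment; the real step size is not documented

    def __init__(self, name, weight=1.0):
        self.name = name
        self.weight = weight
        self.accepts = 0
        self.rejects = 0
        self.removed = False

    def accept(self):
        """An accepted association raises the component's weight."""
        self.accepts += 1
        if not self.removed:
            self.weight += self.STEP

    def reject(self):
        """A rejected association lowers the weight, and may remove the component."""
        self.rejects += 1
        if self.removed:
            return
        self.weight = max(0.0, self.weight - self.STEP)
        # Three more rejections than acceptances removes the component.
        if self.rejects - self.accepts >= 3:
            self.removed = True
            self.weight = 0.0


component = ScoreComponent("field name match")
component.accept()
for _ in range(4):
    component.reject()
print(component.name, "removed:", component.removed)
```

With one acceptance on record, the component is removed on the fourth rejection rather than the third.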

Setting fields that contribute to data discovery

The first manual term association is automatically used to seed term discovery. There may be times when you want to add or change which field is used as the seed.

For example, if a field includes a high volume of null values, the term discovery process includes the null as part of the expected data. If you don't want to include null values in the data that Data Catalog identifies, then replace the field containing null values with a field that has a cleaner version of the data.

Consider modifying the seed term association in the following cases:

  • Exclude an association when the field includes data that is dirty or unrepresentative. For example, if most data values are null, exclude such fields from seeding the term discovery.
  • Include more associations to include additional values. For example, you may have claim numbers from different zone offices that adhere to different patterns, but still need to be identified as one business term. This type of field requires term associations from both resources to be marked as seed data for term discovery.

Change the confidence cutoff of the term associations

Each term has a confidence cutoff. If a field matches with a score below the threshold value or confidence cutoff, then the term association does not appear in the catalog.

The confidence cutoff is set based on a set of scoring components applicable for seed fields. For example, if the seed data is classified as anonymous, Data Catalog sets the confidence cutoff high. For string values with mostly fixed lengths, such as alpha-numeric codes, the confidence cutoff is set low.

By default, Data Catalog determines the confidence cutoff. The default term association confidence cutoff is set to 40% for low, 60% for normal, and 80% for high. The default term association confidence cutoff can be configured using the following Discovery configuration properties on the local agent:

  • Min low default score for value tags
  • Min normal default score for value tags
  • Min high default score for value tags
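
As a rough illustration of how the default cutoffs act as a filter, the sketch below keeps only associations that score at or above the cutoff for a given level. The dictionary keys, function name, and field names are hypothetical. The values 40, 60, and 80 are the defaults stated above.

```python
# Default cutoffs from the text; the keys are illustrative labels,
# not the actual configuration property names.
DEFAULT_CUTOFFS = {"low": 40, "normal": 60, "high": 80}


def visible_associations(scores, level="normal", cutoffs=DEFAULT_CUTOFFS):
    """Return the associations whose score is not below the cutoff."""
    cutoff = cutoffs[level]
    return {field: score for field, score in scores.items() if score >= cutoff}


# Hypothetical match scores (percentages) for three fields.
scores = {"orders.cust_email": 92, "leads.email": 65, "hr.notes": 45}

for level in ("low", "normal", "high"):
    print(level, sorted(visible_associations(scores, level)))
```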

Term tuning scenarios

Refer to the following scenarios to troubleshoot issues with term tuning.

False positive term associations

False positive term associations occur when a term is applied to fields that do not match the original field.

False positive term associations occur because Data Catalog is finding connections between the suggested fields and the seed field that are not what was intended when the original field was tagged. The most probable cause is that elements of the fields' data or metadata are matching for criteria that they are not intended to match. Data Catalog needs input from you to determine which data or metadata components correspond to the intended matches between the seed field and the correct field matches.

To resolve false positive term associations, use the following solutions in the order provided:

  1. Reject individual term associations.

    Reject one or two of the incorrect term associations and re-run the term propagation job. Rejecting term associations changes the term algorithm scoring. Data Catalog reviews the matching components identified for the rejected term association, determines which components scored the highest on this "bad" association, and lowers the weight in the term discovery algorithm for the high-scoring components.

    The next term job removes all the suggested terms for this term and reapplies them based on the updated algorithm. As a best practice, iterate through this process more than once: it takes three rejections of the same component before Data Catalog removes this component from the scoring entirely. See Accepting or rejecting suggested term associations for more information.

  2. Validate the sample data.

    Review the sample data for the field used as a seed field. The sample data may reveal that the term has null or incorrect values, which can distort the sample data so that it is no longer representative of the "good" data in the field. The distorted data may cause the wrong fields to be suggested.

    For example, if your seed data includes customer email addresses and many of the records are empty or have a default value of "not provided", then the sample data set for the field will show "null" or "not provided" as frequent values. Data Catalog tries to match fields that also include these values. If other optional fields such as "age" and "marital status" also use "not provided" when there is no value, Data Catalog may see matches against this other data.

    For this issue, identify another field to use as a seed for term association. For example, find one of the suggested terms that is correct and use it as a second seed or use it to replace the primary seed. The next term job will remove all the suggested terms for this term and reapply them based on the updated algorithm that uses the new sample data. See Setting fields that contribute to data discovery for more information.

  3. Adjust the confidence cutoff.

    If you find that term associations with lower scores tend to be incorrect, you can adjust the confidence cutoff for the term so that term associations that score less than the threshold value are not added. Data Catalog sets the initial confidence cutoff based on the distribution of term association scores across the catalog.

    Note: If you set the confidence cutoff above 50%, term association matches must include more than one "major" component to successfully show up as matched.

    If you notice that the false positive associations for a given term often occur for fields that are also correctly tagged with a second term, consider increasing the term weight for the correct term. This adjustment takes advantage of the Data Catalog scoring rule for two terms on the same field. If the scores for the two terms differ by more than 20 points, the lower-scoring term association is dropped. See Change the confidence cutoff of the term associations for more information.
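
The scoring rule for two terms on the same field, mentioned in step 3, can be sketched as follows. The function name and the example terms are hypothetical. The 20-point gap is the value stated above.

```python
def resolve_two_terms(term_a, score_a, term_b, score_b, max_gap=20):
    """Return the terms that remain on a field suggested for two terms.

    If the scores differ by more than max_gap points, only the
    higher-scoring term is kept. Otherwise both terms remain.
    """
    if abs(score_a - score_b) > max_gap:
        return [term_a] if score_a > score_b else [term_b]
    return [term_a, term_b]


print(resolve_two_terms("Customer ID", 85, "Account Number", 60))
# prints ['Customer ID']
print(resolve_two_terms("Customer ID", 85, "Account Number", 70))
# prints ['Customer ID', 'Account Number']
```

This is why raising the score of the correct term can push a false positive off the field: once the gap exceeds 20 points, the lower-scoring association is dropped.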

False negative term associations

False negative term associations occur when a term is not applied to fields where the data does match the original field data. The field data or metadata does not reflect the right data or metadata characteristics. Data Catalog does not recognize the correlation that you can see between the seed field and the expected matching fields. For example, you can see that the field data matches, but there's no match based on data values and other major scoring components. This situation is common when fields contain free text values that are mostly unique, such as notes, titles, messages, or other free text that is not related by keywords.

The problem could be with the seed field itself or with the fields that are not matching properly. Here are the most common reasons why false negatives occur:

  • No seed is set for term discovery.

    Check to make sure that a seed exists for the term. If there is no seed term association, then you won't see any suggested term associations for the term.

  • Incomplete seed data.

    Data Catalog collects sample values and compares data from each field in the catalog using algorithms, which analyze whether field data overlaps with the seed data for the term. If Data Catalog determines that data is not overlapping, this component of the term score is eliminated. If a field lacks matches on any other of the major components, the term won't be associated with the field.

To resolve false negative term associations, accept additional suggested term associations. Accepting more term associations reinforces the positive elements of Data Catalog's term discovery algorithm. For example, if a term association is accepted for a field with a different field name, Data Catalog adds the additional field name to the list of matching field names to look for on the next term run.

  1. Replace the field used as the seed field.

    You might review the sample data for the field being used as a seed field. Null or incorrect values can distort the sample data so that it is no longer representative of the "good" data in the field. The distorted data may cause Data Catalog not to match fields based on the sample data or the distribution of values in the sample data. For example, if your seed data includes customer email addresses and many of the records are empty or have a default value of "not provided", then the sample data set for the field will show "null" or "not provided" as frequent values. Data Catalog tries to match fields that also include these values and may fail to match any fields.

  2. Add additional seed fields.

    You can select more than one term association to use in term propagation. Data Catalog builds the sample data for the seed by aggregating the term signature from each sample, which may help when data is not necessarily unique across fields. Increasing the pool of sample values allows Data Catalog to find matches between the seed and other fields.

  3. Create your own reference data to use as a seed field.

    If you do not have good examples of the data you are trying to find, make a dummy file with data that matches the values, tokens, or pattern of the data you want to match and use that field as the seed for the term.

High confidence cutoff

By default, Data Catalog determines the confidence cutoff for a term. If that score is too high, Data Catalog can miss some possible matches.

To resolve high confidence cutoff issues, change the confidence cutoff for the term. Lowering the confidence cutoff for a term allows Data Catalog to show additional term associations. See Change the confidence cutoff of the term associations for more information.

Regular expression terms

Data Catalog supports automatic tagging using the regular expression to match field values. Term associations are suggested based on the number of values in the field that match the regular expression.

Important: Data Catalog uses the JDK regular expression (regex) engine to process regular expressions. Make sure your regular expressions conform to JDK regex syntax so Data Catalog can process them accurately.

As an example, if your company uses a specific code to keep track of parts, you could create a regular expression term for the part ID by using a regular expression for the term association.

Regular expression tagging rules include a threshold value (known as the confidence cutoff) to indicate how many of a field's values need to match to tag the field. By default, the threshold is set to 80%. If you want to consider cases where fewer instances match the regular expression, then lower the confidence cutoff for the term.
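
The threshold rule can be sketched as the fraction of a field's values that match the pattern, compared against the cutoff. This sketch uses Python's re module rather than the JDK regex engine. The part ID pattern, the sample values, the use of whole-value matching, and the inclusive comparison are all assumptions made for illustration.

```python
import re


def match_fraction(pattern, values):
    """Fraction of values that match the pattern in full."""
    if not values:
        return 0.0
    regex = re.compile(pattern)
    hits = sum(1 for value in values if regex.fullmatch(value))
    return hits / len(values)


def should_suggest(pattern, values, cutoff=0.80):
    """Suggest the term when enough of the field's values match."""
    return match_fraction(pattern, values) >= cutoff


# Hypothetical part ID format: three letters, a hyphen, four digits, one letter.
PART_ID = r"[A-Z]{3}-\d{4}[A-Z]"
field_values = ["CSP-2201A", "CSP-2202B", "XYZ-0001C", "CSP-2203A", "QRS-1000D", "unknown"]

print(round(match_fraction(PART_ID, field_values), 2))
print(should_suggest(PART_ID, field_values))
print(should_suggest(PART_ID, field_values, cutoff=0.90))
```

Five of the six values match, so the field clears the default 80% cutoff but would not clear a stricter 90% cutoff.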

When Data Catalog considers regular expression terms for propagating, it considers term name matches against the field name as part of the scoring. The field name alone is not enough to associate a regular expression term with a field, but a matching partial or complete field name can add to the overall score of the term association.

Some of the built-in terms use regular expressions to identify field data, such as phone numbers, credit card numbers, email addresses, and IP addresses.

Using regular expression terms and value terms

When you know all the possible values or value patterns that you want tagged with a selected term, use a regular expression term. Although regular expression terms match a specific pattern, you can increase flexibility by lowering the confidence cutoff so that the term also matches fields that include some data that does not fit the pattern. For example, if your data turns out to be “dirtier” than you expect, consider lowering the confidence cutoff for the term.

When values are not exact or when many values can be acceptable for a field, use a value term. Value tagging uses both sample data and field metadata to score matches. The value tagging algorithm can identify matching fields with a broad range of similarities and can evolve as your data changes.

As a best practice, use value tagging first. If a value term does not perform well after basic tuning, then consider creating a regular expression term.

Evaluating regular expressions against sample data

When Data Catalog performs the discovery profiling for data, it matches all regular expression term patterns against all the data in each field. When regular expression terms are added after discovery metadata exists, the tagging job uses 2000 sample values to determine whether regular expression terms match data.

For most cases, using sample data is a useful substitute for the full data set. The sample values represent the most frequent values in the data. If the most frequent values match the regular expression pattern, then it is likely that the full set of data also matches the regular expression pattern.

The sample data may not be a suitable substitute for the full data set when the data cardinality is very high and the values vary widely. Under these conditions, a sample of data values is unlikely to represent the entire data set. If you see these conditions, consider re-evaluating the regular expression terms against all the data in the catalog.
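To see why a sample can mislead on high-cardinality data, consider the following Python sketch. It is a toy model and not Data Catalog's sampling logic: the helper names, the sample size of 3, and the data are invented for illustration, and the sample is assumed to hold the most frequent distinct values.

```python
import re
from collections import Counter

def most_frequent_sample(values, size):
    """Return up to `size` distinct values, most frequent first."""
    return [value for value, _ in Counter(values).most_common(size)]

def match_ratio(values, pattern):
    """Fraction of values that fully match the pattern."""
    regex = re.compile(pattern)
    return sum(1 for v in values if regex.fullmatch(v)) / len(values)

FIVE_DIGITS = r"\d{5}"

# Low variety: a few well-formed values repeat many times.
low_cardinality = ["10001"] * 50 + ["94105"] * 30 + ["60601"] * 20
# High variety: every value is distinct and half are malformed.
high_cardinality = [f"{n:05d}" for n in range(50)] + [f"X{n}" for n in range(50)]

for name, data in [("low", low_cardinality), ("high", high_cardinality)]:
    sample = most_frequent_sample(data, 3)
    # Print the match ratio on the sample, then on the full data set.
    print(name, match_ratio(sample, FIVE_DIGITS), match_ratio(data, FIVE_DIGITS))
```

In the low-variety case the sample and the full data agree. In the high-variety case the small sample matches completely while only half of the full data does, which is the situation where re-evaluating against all the data is worthwhile.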

Defining regular expressions

If you need to know more about regular expressions, explore an online tutorial such as regexone.com. You can also examine the regular expressions specified for the built-in terms. When putting together your own tagging rules, the built-in terms may provide good examples of what you can do using regular expressions and length limits.

Note: You can review the regular expressions for built-in terms and you can disable built-in terms from propagating. However, you cannot edit the regular expression for a built-in term.

Some basic notations include the following:

  • Match the beginning of the data value: ^
  • Match the end of the data value: $
  • Match any digit: \d or [0-9]
  • Match any letter: [a-zA-Z]
  • Repeat a pattern: {n}, where “n” is the number of repetitions
  • Indicate what must not match at the current position: (?!<pattern>)
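The notations above can be tried out with a few lines of code. This sketch uses Python's `re` module rather than the JDK engine; for these basic constructs the two behave the same way. The patterns and sample values are made up for illustration.

```python
import re

# Each entry pairs a pattern with a value it matches and a value it rejects.
examples = [
    (r"^AB",            "AB-123", "XAB-123"),  # ^ anchors the start of the value
    (r"99$",            "AB-199", "AB-991"),   # $ anchors the end of the value
    (r"\d",             "room 7", "room"),     # \d matches any digit
    (r"[a-zA-Z]",       "42b",    "42"),       # [a-zA-Z] matches any letter
    (r"^\d{4}$",        "2024",   "202"),      # {4} repeats the digit four times
    (r"^(?!000)\d{3}$", "123",    "000"),      # (?!000) rejects a leading 000
]

def contains_match(pattern, value):
    """True when the pattern is found anywhere in the value."""
    return re.search(pattern, value) is not None

for pattern, good, bad in examples:
    assert contains_match(pattern, good)
    assert not contains_match(pattern, bad)
    print(f"{pattern!r}: matches {good!r}, rejects {bad!r}")
```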

Regular expression term use case: Discover national identifiers

An excellent use case for regular expression (regex) terms is discovering national identifiers. For example, if you are required to identify and discover national identifiers such as passport or national ID numbers for a country or set of countries, Data Catalog can use regex terms to help discover such fields.

Follow the steps below to discover fields using regex terms.

Procedure

  1. Create a glossary named National_Identifiers.

  2. Create regex terms for the national identifiers and/or passport IDs that need to be discovered by defining a valid regex for each.

  3. Run term discovery on the data.

    Learn how to run a discovery job in Managing jobs.

Results

Data Catalog's term discovery now identifies the national identifier fields and suggests the term associations.
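As an illustration of the kind of regex such a term might carry, the sketch below tests a commonly used pattern for hyphenated US Social Security numbers against two sample columns. It is an assumption-laden example: the pattern is not taken from Data Catalog's built-in terms, the column names and values are invented, the 80% cutoff mirrors the default described earlier, and Python's `re` module stands in for the JDK engine.

```python
import re

# Illustrative pattern for hyphenated US Social Security numbers.
# Lookaheads exclude area 000, 666, and 900-999, group 00, and serial 0000.
US_SSN = re.compile(r"^(?!000|666|9\d{2})\d{3}-(?!00)\d{2}-(?!0000)\d{4}$")

def share_matching(values, regex):
    """Fraction of values that match the compiled regex."""
    return sum(1 for v in values if regex.match(v)) / len(values)

# Hypothetical sample values for two fields.
columns = {
    "ssn":   ["123-45-6789", "078-05-1120", "219-09-9999", "001-01-0001", "n/a"],
    "phone": ["555-867-5309", "555-010-0000", "555-123-4567"],
}

CUTOFF = 0.80  # assumed confidence cutoff
suggested = [name for name, values in columns.items()
             if share_matching(values, US_SSN) >= CUTOFF]
print(suggested)
```

Only the `ssn` column reaches the cutoff; the phone numbers fail because their middle group has three digits instead of two.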

Thesaurus

Data Catalog term discovery jobs use name matching as one criterion for suggesting field-term associations. Term discovery compares the name of a candidate field with the field names of already-curated fields, the field names associated with the term, and the term name itself. In addition to direct word matching, special rules are applied from a thesaurus recorded in the file thesaurus.json.

The thesaurus is a collection of rules for synonyms, antonyms, common words, and exceptions. A Data Catalog service user with administrative privileges can customize the thesaurus to fine-tune the data discovery process, by providing words that the discovery process should find and words it should ignore.

The thesaurus file, thesaurus.json, is in the <LDC-HOME>/agent/conf directory, and is organized as a series of JSON entries, using standard JSON syntax.

Each entry contains:

  • Name
  • List of values
  • Attributes

Attributes can contain the following elements:

  • Category: type of JSON entry, such as synonyms, antonyms, common words, and exceptions.
  • Visible: not currently used
  • Enable: not currently used

All thesaurus entries are normalized as follows:

  • All separators from the list defined in the ldc.discovery.tag.name_matching_separators configuration property, such as "defaultValue": "(){}[]:;_ .,/", are removed.
  • All entries are treated as lowercase.
  • All non-alpha characters from the end of the name are removed.

Normalization rules apply to all thesaurus names and values.
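The three normalization rules can be sketched in a few lines of Python. This is a rough approximation for illustration only: the function name is hypothetical, the separator list is the default value quoted above, and applying the rules in the listed order is an assumption.

```python
# Default separator list quoted for ldc.discovery.tag.name_matching_separators.
SEPARATORS = "(){}[]:;_ .,/"

def normalize(name, separators=SEPARATORS):
    """Approximate the thesaurus normalization rules for one name."""
    # 1. Remove every separator character.
    stripped = "".join(ch for ch in name if ch not in separators)
    # 2. Treat the entry as lowercase.
    lowered = stripped.lower()
    # 3. Remove non-alphabetic characters from the end of the name.
    end = len(lowered)
    while end > 0 and not lowered[end - 1].isalpha():
        end -= 1
    return lowered[:end]

print(normalize("First_Name"))      # firstname
print(normalize("Given Name (2)"))  # givenname
print(normalize("ADDR.LINE_01"))    # addrline
```

Under these assumptions, "First_Name" and "firstname" normalize to the same string, so either spelling in the thesaurus would refer to the same name.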

  • Synonyms

    Synonyms are different names that the discovery process matches, including group name and group values.

    In the following example, "firstnames" matches "firstname", "givenname", "forename", and "fname".

    "firstnames":
        { "value": "firstname,givenname,forename,fname", "category": "SYNONYMS",
          "visible": true, "enable": true },
  • Antonyms

    Antonyms are a collection of opposites. Antonyms are names that do not match even if the learning algorithm determines they are synonyms, or even if you accept term associations with fields containing those words. In the following example, "tea" will not match "coffee", "milk", or "soda".

    "tea":
        { "value": "coffee,milk,soda", "category": "ANTONYMS",
          "visible": true, "enable": true },
  • Commons

    Commons are names without context that can mean anything, such as "field", "code", or "number". When matching complex names, Data Catalog cannot match names that are on the Commons list. In the following example, Data Catalog does not match "state", "states", "address", or "location".

    "common_geo":
        { "value": "state,states,address,location", "category": "COMMON",
          "visible": true, "enable": true },
  • Exceptions

    Exceptions are names that are ignored for term discovery. In the following example, Data Catalog treats all the names in the value entry as exceptions, including "count", "field", and "filler", so they are ignored.

    "exception_common_fields":
        { "value": "count,field,filler,col,name,status,code,colname,sysname,c,colz,cols,desc,description,num,number,dt,comment,type,owner",
          "category": "EXCEPTIONS", "visible": true, "enable": true },

    Exceptions are normally ambiguous names that can be used as field names. Not having exceptions for those names may produce a high volume of false positives for attribute analysis. Exceptions differ from Commons in that Data Catalog never matches a name in the Exceptions list, but it does match a name in the Commons list if the name is part of a complex name. For example, if "id" is on the Exceptions list, then Data Catalog does not match it in "customer.id", but it does match it there if "id" is on the Commons list instead.

    For display purposes, the thesaurus can have multiple groups of exceptions, such as "exception_common_fields" and "exception_ids". Data Catalog treats all exception lists as one list, and the grouping is not used in the algorithms. Separation into multiple groups is for user convenience only.
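To make the entry layout concrete, the following Python sketch parses a small thesaurus and groups the entries by category. Several things here are assumptions rather than documented behavior: the entries are wrapped in a single top-level JSON object, the values listed for "exception_ids" are invented, the exception list is shortened, and the function names are hypothetical.

```python
import json

# Shortened thesaurus; the "exception_ids" values are hypothetical.
THESAURUS_JSON = """
{
  "firstnames": {"value": "firstname,givenname,forename,fname",
                 "category": "SYNONYMS", "visible": true, "enable": true},
  "tea": {"value": "coffee,milk,soda",
          "category": "ANTONYMS", "visible": true, "enable": true},
  "common_geo": {"value": "state,states,address,location",
                 "category": "COMMON", "visible": true, "enable": true},
  "exception_common_fields": {"value": "count,field,filler",
                              "category": "EXCEPTIONS", "visible": true, "enable": true},
  "exception_ids": {"value": "id,key",
                    "category": "EXCEPTIONS", "visible": true, "enable": true}
}
"""

def split_values(entry):
    """Split the comma-separated value string into a list of names."""
    return [v.strip() for v in entry["value"].split(",") if v.strip()]

def load_thesaurus(text):
    """Group entries by category: {category: {entry name: [values]}}."""
    by_category = {}
    for name, entry in json.loads(text).items():
        by_category.setdefault(entry["category"], {})[name] = split_values(entry)
    return by_category

thesaurus = load_thesaurus(THESAURUS_JSON)

# All exception groups collapse into one set; the group names are only labels.
exceptions = set()
for values in thesaurus["EXCEPTIONS"].values():
    exceptions.update(values)

print(sorted(thesaurus))
print(sorted(exceptions))
```

Merging every EXCEPTIONS group into one set mirrors the statement that Data Catalog treats all exception lists as a single list.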