Hitachi Vantara Lumada and Pentaho Documentation

Tags and tag propagation

Tags are business labels that are attached to Lumada Data Catalog data units (Virtual Folders, Datasets) and resources (tables, files, and fields).

Tags identify data sets so you can use the tag names as search terms. Before you are discouraged ("I have millions of files and fields! Tagging sounds like a lot of work!") you should know that Lumada Data Catalog helps you put tags on the data in your catalog.

When you tag a field in a data resource, Data Catalog automatically propagates that tag through your data lake and identifies associations with similar fields in other files and tables that match the data you tagged. The process of propagating tags across the catalog is called tag discovery. It uses data and metadata about the fields you tag to identify matching fields.

It is important to note that tags on resources (files and tables) are not propagated, whereas tags on fields are propagated.

Tags can also be created with regular expression rules to match data without requiring an initial tag association.

Tag domain

A single resource or field can be tagged with multiple tags associated with different business units. Data Catalog organizes and separates tags by business unit using Tag Domains. Tag Domains also enforce role-based access control for tags: the administrator can limit the visibility of tags within a domain to a certain set of users depending on their Data Catalog role.

Only administrators can create a tag domain and assign it to roles. A steward can create tags and tag associations, while an analyst can only associate existing tags in the tag domain assigned to their role. Refer to the role functionality information in Managing tag domains.

Tags

Data Catalog lets you put tags on any kind of data asset or "resource", or on any field inside a resource. The level at which you associate a tag determines how the tag behaves:

  • Resource-level tags
  • Field-level tags

Resource-level tags

Tags associated with a resource are "one-off" labels for the resource. Use these tags as search terms to find files, tables, or directories that have been individually tagged. These tags are part of the metadata for the resource. They can be used to enrich the metadata for the resource such as identifying or categorizing file contents, or to drive other processing such as access control or processing stages.

For information on managing resource tags refer to Tagging resources and fields in the User Guide.

Field-level tags

Tags associated with fields in a given resource are referred to as field-level tags. These tags can be propagated across the catalog.

The field-level tag types are as follows:

  • Built-in tags

    Data Catalog provides pre-built tags for common data patterns, such as credit card numbers and address components.

  • Custom tags or User-created tags

    You can create field-level tags to identify data patterns across your catalog. There are two ways to create your own field-level tags:

    • Value tags: When you manually tag a field, the data and metadata for the field contribute to a rule for Data Catalog to suggest similar data to be tagged with the same tag.
    • Regular expression (Regex) tags: You can define a tag with a regular expression that describes the data you want Data Catalog to mark with the tag.
  • Reference tags

    You can use the data from Built-in tags or other custom created tags as seeds for new tags. Such Built-in tags and custom tags become reference tags. For more details, refer to Reference tags.

  • Business entity

    A business entity is a group of tags with context. This is a special case of tags. For more details, refer to Business entities.

For information on managing field tags refer to Tagging resources and fields.

Built-in tags

In addition to tags you add, Data Catalog has a set of predefined tags that it propagates throughout the cluster. These tags fall into two categories:

  • Regular expressions

    Data, such as United States phone numbers and ZIP Codes, are tagged by matching data with regular expressions that describe data typical for these values.

  • Reference data

    Field data such as countries, states of the United States, and first and last names are tagged by matching the signature of known data. The reference data is static: you cannot include data from seed fields to alter the tag algorithm of the built-in tags.

The built-in tags cannot be changed. If you don't want to use the provided built-in tags for tagging your data, you can turn off automatic tag propagation for these tags.

Built-in tags are propagated when you run a tag job after collecting discovery metadata for catalog resources. See Managing jobs to manage value tags.

Reference tags

A reference tag is a tag whose seed data or regular expression definition is reused by other tags (the referring tags) to seed their own tag discovery. Reference tags are indicated in the Lumada Data Catalog user interface by an arrow angled northeast.

Only tags with seeds or regular expression definitions can be used as reference tags.

Reference tags are used in Business entities, but are not limited to business entities. Tags that are not business entity members can also assign other qualifying tags as references. However, the manner in which tag associations are suggested varies:

  • When reference tags are defined for business entity members, after successful tag discovery only the associations for the business entity member (referred) tag are suggested for fields. The original reference tag suggestions are removed for such fields.
  • Reference tags that are part of a business entity have an increased confidence percentage if Data Catalog has discovered the context of the business entity.
  • When reference tags are defined for tags that do not belong to a business entity, both tag associations (one for the referring tag and one for the original reference tag) are shown.

Points to Remember

  • Reference tags can be from the same domain or can be from different tag domains, such as built-in tag domains.
  • A reference tag can be a value tag (has a seed), or a regular expression tag.
  • Many-to-one relationships (many referring tags pointing to one reference tag) are allowed.
  • Avoid cyclic tag references because Lumada Data Catalog does not check for them.

Value tags and seed data

When you tag a field in a file or table, Data Catalog automatically adds the same tag to other fields in files and tables that match the data you tagged. It uses the first tag association created to "seed" the tag discovery process. It uses data and metadata about the field you tag to identify matching fields then suggests the same tag for those fields.

The suggested tag appears as a dot-outlined button which you can click to accept or reject.

Tag discovery

The score (generally as a percentage) associated with a suggested tag indicates how closely this field matches the data and metadata of the original field.

Data Catalog compares the seed field's metadata and sample data with the metadata and sample data for all other fields included in the catalog. When it finds other fields with similar metadata or data, it calculates a score to represent how closely a new field matches the tagged field. For example, a field with the same name and data type might score 90 percent, while a field with the same data type and overlapping values but a different field name might score only 75 percent. A field that is a complete copy would score 100 percent. Data Catalog shows only suggested tags for fields that match over a calculated threshold.

Note: In Search, suggested tags are treated the same as accepted tags. When you search on a tag, you see all the fields that have accepted or suggested tags. However, only accepted tags are exported to external applications such as Cloudera Navigator or Apache Atlas.

Value tag scoring components

Tag association scoring includes weighted contributions for the following characteristics:

  • Major scoring dimension: Must have at least one match for a field to be suggested.
    • Overlapping values
    • Overlapping “tokens” from the values (individual words in the field values that may match, even if the complete strings do not match)
    • Overlapping patterns (for example, most of the sample values consist of two words as in a person's name, or the sample data includes text formatted as dates)
    • Matching field names
    • Standard deviation
    • (Regular expression counts)
    • Other matching tags
  • Minor dimension: Match is less than 5 percent for each matching area.
    • Matches on numeric properties such as quantiles, cardinality, boundaries, mean

Note: Anonymous values and tag associations are handled as follows:

Data Catalog looks for overlapping data, overlapping tokens, and data patterns to match fields. However, Data Catalog filters out words and numbers that would match too often. For example, words such as "a", "an", and "the" are removed from the list of tokens; individual letters ("a", "b", "c") are removed; and integers with 3 or fewer digits are removed.

If the field data in either the seed or the field to be matched includes many or mostly non-distinct values, Data Catalog is unlikely to make correct tag associations.
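
The token filtering described in the note above can be pictured with a short Python sketch. This is only an illustration of the stated rules, not Data Catalog's implementation: the function name `candidate_tokens`, the stop-word list, and the tokenizing expression are all assumptions.

```python
import re

# Hypothetical stop-word list; the product's actual list is not documented here.
STOP_WORDS = {"a", "an", "the"}


def candidate_tokens(value: str) -> list[str]:
    """Split a field value into tokens and drop tokens that would match too often."""
    tokens = re.findall(r"[A-Za-z0-9]+", value.lower())
    kept = []
    for token in tokens:
        if token in STOP_WORDS:
            continue  # common words such as "a", "an", "the"
        if len(token) == 1 and token.isalpha():
            continue  # individual letters
        if token.isdigit() and len(token) <= 3:
            continue  # integers with 3 or fewer digits
        kept.append(token)
    return kept


print(candidate_tokens("The Acme Widget b 42 20210"))
```

In this sketch only "acme", "widget", and "20210" survive, which shows why fields made up mostly of short numbers or single letters leave little for token matching to work with.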

Value tagging seed fields

Data Catalog feeds the sample data values from the fields in seed tag associations into the tag discovery process. The data from those fields is used when Data Catalog calculates the tag association scoring. When you use a tag for the first time to mark a field, Data Catalog automatically sets that field to be a seed: the data in this field contributes to tag discovery. You can add more fields to be used in tag discovery, and you can replace the original seed with other tag associations if you find better seed fields.

You can tell that a tag association is marked for use as seed in tag discovery when you see the dot next to the tag name, as shown below.

Tag association with seed

It is important to choose good seed tags. Make sure the data in the field is representative of the data you want tagged. Consider replacing a seed field if the field has any of the following conditions:

  • Null values
  • Incorrect or not representative values
  • A subset of the data you would expect to be tagged

When value tagging works well and when it does not

Field data works well when field values are distinct terms and when sample data is a good representation of the full set of data: relatively low cardinality lists of repeated values such as product names or marital status values. Similar or overlapping data works well too.

There are cases where discovery may not work well for tags:

  • Numeric codes that are not distinct patterns. For example, 9-digit account numbers may be confused with US Social Security numbers.
  • Numeric values such as temperatures, financial values, and counts.
  • Cases where the tag name requires contextual information. For example, first and last names tag well; however, tag discovery cannot distinguish customer names from vendor names.

Curating value tags

The first step for tagging your catalog is to manually add tags to fields in data you are familiar with.

Here is a process guideline:

Procedure

  1. Browse to data sets you are familiar with.

  2. Put tags on fields.

    Use names that match how you would expect users to identify the data. You might consider whether this particular field is a good representative of the data you want Data Catalog to tag.
  3. Run batch tag processing.

    This is a batch job run by an administrator, typically run on a daily schedule. It applies new and updated tags to existing data; it also applies all tags to new data. During times of high tagging activity, you might want to run this job more frequently.
  4. Search on a specific tag to review how it was propagated.

    You can use the Advanced Search, Glossary, or Catalog view to see fields tagged with a specific tag.
  5. Review the tag associations and use the profile information for the field to validate if the tag is correct.

  6. Accept correct tag associations, and reject incorrect tag associations.

    These actions change the tag discovery algorithm for the tag. To best train Data Catalog to perform the right tag propagation, accept or reject one or two tag associations at a time; exhaustively removing all incorrect tag associations is not as effective for improving the tag discovery algorithms. To accept or reject tag associations, users need at least a Steward role for the resources and tags involved.
  7. Rerun tag processing.

Next steps

Repeat the curation process if needed.

Caution: Tags for some data propagate more precisely than other data.

If you tag a field that contains a product code made up of letters, numbers, and punctuation, such as “CSP-2201A”, Data Catalog can precisely identify other fields with similarly constructed data. However, if you tag a field that contains free text (such as a text field in a social media feed) or numeric values (such as rainfall depth values), Data Catalog may find false positives when attempting to match the data.

Consider defining a regular expression for a tag when you want to tag data with a specific text pattern.

Managing tag associations using value tag tuning

Data Catalog uses the term tag association for the link between a tag and a field. After you define the "seed" tag association and run the tag discovery process, Data Catalog identifies fields in other resources that match the seed and suggests tag associations for them. Each tag association has a score that expresses the percentage match to the seed data. For example, a score of 90% means that the tag association is a 90 percent match to the seed and that the data in the seed field and the associated field is very likely similar.

The controls available for tuning the tag process are as follows.

Accept or reject tag associations

When a tag association is accepted or rejected, Data Catalog determines which component of the data or metadata matched (or didn't match) and adjusts how that component contributes to the overall score. Each "accept" incrementally increases the weight of the score component while each "reject" incrementally lowers the weight of the score component. When there are 3 more rejects for a tag than accepts, the affected component of the score is removed completely.
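
As a rough mental model of this feedback loop, the following Python sketch tracks accepts and rejects for one scoring component. Only the "3 more rejects than accepts" removal rule comes from the text above; the class name `ScoreComponent`, the step size, and the linear weight formula are hypothetical, and the real algorithm is not documented here.

```python
from dataclasses import dataclass

STEP = 0.1         # hypothetical size of one accept/reject adjustment
REMOVE_MARGIN = 3  # from the text: 3 more rejects than accepts removes the component


@dataclass
class ScoreComponent:
    """Feedback state for one scoring component (for example, matching field names)."""
    name: str
    base_weight: float = 1.0
    accepts: int = 0
    rejects: int = 0

    def accept(self) -> None:
        self.accepts += 1

    def reject(self) -> None:
        self.rejects += 1

    @property
    def weight(self) -> float:
        if self.rejects - self.accepts >= REMOVE_MARGIN:
            return 0.0  # component no longer contributes to the score
        return self.base_weight + STEP * (self.accepts - self.rejects)


field_name = ScoreComponent("matching field names")
field_name.accept()      # one accept raises the weight
for _ in range(4):
    field_name.reject()  # rejects now exceed accepts by 3
print(field_name.weight)
```

Whether a removed component can return after later accepts is not stated in this document; the sketch simply recomputes the weight from the current counts.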

Set which fields contribute to data discovery

The first manual tag association is automatically used to “seed” tag discovery. There may be times when you want to add or change which field is used as the seed.

For example, if a field includes a high volume of null values, the tag discovery process includes the null as part of the expected data; if you don't want to include null values in the data that Data Catalog identifies, it would improve tag discovery to replace the field containing null values with a field that has a cleaner version of the data.

Consider modifying the seed tag association in these cases:

  • Exclude an association when the field includes data that is dirty or unrepresentative. For example, if most data values are null, exclude such fields from seeding the tag discovery.
  • Include additional associations to include additional values. For example, claim numbers from different zone offices that may adhere to different patterns but still need to be identified as one business tag will require associations from both resources to be marked as seed data for tag discovery.

Change the confidence cutoff of the tag associations

Each tag has a confidence cutoff. If a field matches with a score below the threshold value, then the tag association is not shown in the catalog. By default, Data Catalog determines the confidence cutoff. The confidence cutoff is set based on which of the scoring components are appropriate for seed fields.

For example, if the seed data is classified as "anonymous", Data Catalog sets the confidence cutoff to "high". For string values with mostly fixed lengths (such as alpha-numeric codes), the confidence cutoff is set to "low".

The default tag association confidence cutoff is set to 40% for low, 60% for normal, and 80% for high. These defaults can be configured in conf/configuration.json.

Note: Confidence cutoff cannot be 0.
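
To make the effect of the cutoff levels concrete, here is a minimal Python sketch that filters scored associations using the default percentages quoted above. The dictionary `CUTOFFS`, the function `visible_suggestions`, and the field names are illustrative assumptions; the actual keys in conf/configuration.json are not shown in this document and are not reproduced here.

```python
# Default cutoff percentages quoted in the text; the mapping itself is illustrative.
CUTOFFS = {"low": 40, "normal": 60, "high": 80}


def visible_suggestions(scores: dict[str, float], level: str = "normal") -> dict[str, float]:
    """Keep only the tag associations whose score is not below the cutoff."""
    cutoff = CUTOFFS[level]
    if cutoff <= 0:
        raise ValueError("confidence cutoff cannot be 0")
    return {field: score for field, score in scores.items() if score >= cutoff}


# Hypothetical field names and scores.
scores = {"cust.email": 90, "vendor.contact": 75, "orders.note": 55}
print(visible_suggestions(scores, "high"))
print(visible_suggestions(scores, "low"))
```

With the "high" level only the 90 percent association remains visible, while the "low" level shows all three.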

Value tag tuning scenarios

Refer to the following scenarios to troubleshoot issues with tag tuning.

False positive tag associations

False positive tag associations are when a tag is applied to fields that don't match the original field.

False positive tag associations occur because Data Catalog finds connections between the suggested fields and the seed field that you did not intend when you tagged the original field. There are many possible reasons for this, but the underlying cause is that some elements of the fields' data or metadata match even though you did not intend them to. Data Catalog needs input from you to determine which data or metadata components correspond to the intended matches between the seed field and the correct field matches.

To resolve false positive tag associations, use the following solutions, which are listed in the order you should try them:

  1. Reject individual tag associations: Reject one or two of the incorrect tag associations and re-run the tag propagation job. Rejecting tag associations changes the tag algorithm scoring. Data Catalog reviews the matching components identified for the rejected tag association, determines which components scored the highest on this "bad" association, and lowers the weight of those high-scoring components in the tag discovery algorithm.

    The next tag job removes all the suggested tags for this tag and reapplies them based on the updated algorithm. It's likely that you'll have to iterate through this process more than once: it takes 3 rejects of the same component before Data Catalog removes this component from the scoring entirely. Refer to Accepting or rejecting suggested tag associations in User Guide.

  2. Validate the sample data: Review the sample data for the field being used as a seed field. If the sample data shows that the tag has null or incorrect values, this can distort the sample data so that it is no longer representative of the "good" data in the field. The distorted data may cause the wrong fields to be suggested.

    For example, if your seed data includes customer email addresses and many of the records are empty or have a default value "not provided", the sample data set for the field will show "null" or "not provided" as being frequent values. Data Catalog will try to match fields that also include these values. If other optional fields such as "age" and "marital status" also use "not provided" when there is no value, Data Catalog may see matches against this other data. If this is an issue, identify another field to use as a seed for tag association: for example, find one of the suggested tags that is correct and use it as a second seed or use it to replace the primary seed. The next tag job will remove all the suggested tags for this tag and reapply them based on the updated algorithm that uses the new sample data. Refer to Change the seed tag association.

  3. Adjust the confidence cutoff: If you find that tag associations with higher scores tend to be correct and tag associations with lower scores tend to be incorrect, you can adjust the confidence cutoff for the tag so that tag associations that score less than the threshold value are not added. Data Catalog sets the initial confidence cutoff based on the distribution of tag association scores across the catalog.
    Note: If you set the confidence cutoff above 50%, tag association matches must include more than one "major" component to successfully show up as matched.

    If you notice that the false positive associations for a given tag often occur for fields that are also correctly tagged with a second tag, consider increasing the tag weight for the correct tag. This adjustment takes advantage of the Data Catalog scoring rule for two tags on the same field. If the scores for the two tags differ by more than 20 points, the lower scoring tag association is dropped. Refer to Change the confidence cutoff of the tag associations.
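
The two-tags-on-one-field rule mentioned above can be sketched in a few lines of Python. The 20-point margin comes from the text; the function name `resolve_pair` and the tag names are hypothetical, and the sketch covers only the two-tag case that the text describes.

```python
def resolve_pair(
    first: tuple[str, float],
    second: tuple[str, float],
    margin: float = 20.0,
) -> list[tuple[str, float]]:
    """Drop the lower-scoring tag when the two scores differ by more than the margin."""
    high, low = sorted([first, second], key=lambda tag: tag[1], reverse=True)
    if high[1] - low[1] > margin:
        return [high]
    return [high, low]


print(resolve_pair(("email", 85), ("username", 60)))  # difference of 25
print(resolve_pair(("email", 85), ("username", 65)))  # difference of exactly 20
```

In the first call only the "email" association survives; in the second call both remain, because the difference does not exceed 20 points. Raising the score of the correct tag is therefore one way to push a competing false positive out.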

False negative tag associations

False negative tag associations are when a tag is NOT applied to fields whose data does match the original field data.

False negative tag associations occur because the matches you expect do not show up as matches among the components of the fields' data or metadata. Data Catalog isn't "seeing" the correlation that you can see between the seed field and the expected matching fields. This is caused when the field data or metadata doesn't reflect the right data or metadata characteristics: the problem could be with the seed field itself or with the fields that aren't matching properly. There are a number of reasons this can occur:

  • No seed is set for tag discovery. Check to make sure that a seed exists for the tag. If there is no seed tag association, then you won't see any suggested tag associations for the tag.
  • Incomplete seed data. Data Catalog collects sample values and does comparisons for data from each field in the catalog using proprietary algorithms. It uses these tools to analyze whether field data overlaps with the seed data for the tag. If Data Catalog determines that data doesn't overlap, this component of the tag score is eliminated. If a field doesn't match on any other of the major components, the tag won't be associated with the field.

    What would a situation look like where Data Catalog doesn't tag the field, but we (humans) know that it should match? For example, you can tell that the field data matches, but there's no match based on data values and other major scoring components also don't match. Typically, these fields would be free text values that are mostly unique: notes, titles, messages, other free text that aren't related by key words.

To resolve false negative tag associations, accept additional suggested tag associations. Accepting more tag associations reinforces the positive elements of Data Catalog's tag discovery algorithm. For example, if a tag association is accepted for a field with a different field name, Data Catalog adds the additional field name to the list of matching field names to look for on the next tag run.

  1. Replace the field used as the seed field. You might review the sample data for the field being used as a seed field. If the sample data shows that the tag has null or incorrect values, this can distort the sample data so that it is no longer representative of the "good" data in the field. The distorted data may cause Data Catalog not to match fields based on the sample data or the distribution of values in the sample data. For example, if your seed data includes customer email addresses and many of the records are empty or have a default value "not provided", the sample data set for the field will show "null" or "not provided" as being frequent values. Data Catalog will try to match fields that also include these values and may fail to match any fields.
  2. Add additional seed fields. You can select more than one tag association to be used in tag propagation. Data Catalog builds the sample data for the seed by aggregating the tag signature from each sample. This may help when data is not necessarily unique across fields, so increasing the pool of sample values might allow Data Catalog to find matches between the seed and other fields.
  3. Create "reference data" to use as a seed field. If you don't have good examples of the data you are trying to find, make a dummy file with data that matches the values, tokens, or pattern of the data you want to match and use that field as the seed for the tag.

High confidence cutoff

By default, Data Catalog determines the confidence cutoff for a tag. If that score is too high, Data Catalog can miss some possible matches.

To resolve high confidence cutoff issues, change the confidence cutoff for the tag. Lowering the confidence cutoff for a tag allows Data Catalog to show additional tag associations. Refer to Change the confidence cutoff of the tag associations.

Regular expression tags

If you have data that can be described using a regular expression, Data Catalog supports automatic tagging using the regular expression to match field values. Tag associations are suggested based on the number of values in the field that match the regular expression.

As an example, if your company uses a specific code to keep track of parts, you could create a REGEX tag for the part ID by using a regular expression for the tag association.

Regular expression tagging rules include a threshold value (the confidence cutoff) to indicate how many of a field's values need to match for the field to be tagged. By default, the threshold is set for "high" (80% by default) meaning that 80% of the values in the field must match the pattern for the tag to be associated with the field. If you want to consider cases where fewer instances match the regular expression, you should lower the confidence cutoff for the tag.
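
The threshold check can be approximated with Python's standard `re` module. This is a sketch under assumptions: the part ID pattern is a made-up format modeled on the "CSP-2201A" example, the function name `regex_match_percentage` is hypothetical, and whether Data Catalog requires a full match of each value or counts empty values is not stated in this document.

```python
import re

# Hypothetical part ID format: three letters, a hyphen, four digits, one letter.
PART_ID_PATTERN = r"[A-Z]{3}-\d{4}[A-Z]"
HIGH_CUTOFF = 80.0  # the default "high" threshold quoted in the text


def regex_match_percentage(values: list[str], pattern: str) -> float:
    """Percentage of values that match the whole pattern."""
    if not values:
        return 0.0
    compiled = re.compile(pattern)
    hits = sum(1 for value in values if compiled.fullmatch(value))
    return 100.0 * hits / len(values)


values = ["CSP-2201A", "CSP-2202B", "XTR-0001C", "CSP-2203A", "n/a"]
percentage = regex_match_percentage(values, PART_ID_PATTERN)
print(percentage, percentage >= HIGH_CUTOFF)
```

Four of the five values match, so the field sits exactly at the 80 percent threshold. One more dirty value would drop it below, which is the situation where lowering the cutoff for the tag helps.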

When Data Catalog considers regular expression tags for propagating, it considers tag name matches against the field name as part of the scoring. The field name alone is not enough to associate a regular expression tag with a field, but a matching partial or complete field name can add to the overall score of the tag association.

Some of the built-in tags use regular expressions to identify field data: phone number, credit card numbers, email addresses, IP addresses.

When to use regular expression tags over value tags

When you know all the possible values or value patterns that you want tagged with a particular tag, use a regular expression tag. Although regular expression tags match a specific pattern, you can mitigate that rigidity by lowering the confidence cutoff to match fields that include data that does not match the pattern. That is, if your data turns out to be “dirtier” than you expect, consider lowering the threshold acceptable for the tag.

When values are not exact or when many values can be acceptable for a field, use a value tag. Value tagging uses field metadata in addition to sample data to score matches. The value tagging algorithm can identify matching fields with a broad range of similarities and can evolve as your data changes. Use value tagging first; if a value tag does not perform well enough after basic tuning, then consider creating a regular expression tag.

Evaluating regular expressions against sample data

When Data Catalog performs the discovery profiling for catalog data, it matches all regular expression tag patterns against all the data in each field. When regular expression tags are added after discovery metadata exists, the tagging job uses the 2000 sample values to determine whether regular expression tags match data.

For most cases, using sample data is a useful substitute for the full data set: the sample values represent the most frequent values in the data; if the most frequent values match the regular expression pattern, then it is likely that the full set of data would also match the regular expression pattern.

The sample data may not be an appropriate substitute for the full set of data when the data cardinality is very high and there is a lot of variety across the data. In these conditions, it is difficult for a random sample of data values to produce a representative sample of the entire data set. If you see these conditions, consider re-evaluating the regular expression tags against all the data in the catalog.
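
The gap between a frequency-based sample and the full data can be illustrated with a small Python sketch. It assumes the sample holds distinct values ordered by frequency, which is one reading of "most frequent values"; the helper names and the tiny sample size are for illustration only and do not reflect how Data Catalog builds its 2000-value sample.

```python
import re
from collections import Counter


def most_frequent_sample(values: list[str], sample_size: int = 2000) -> list[str]:
    """Return up to sample_size distinct values, most frequent first."""
    return [value for value, _count in Counter(values).most_common(sample_size)]


def match_percentage(values: list[str], pattern: str) -> float:
    """Percentage of values that match the whole pattern."""
    if not values:
        return 0.0
    hits = sum(1 for value in values if re.fullmatch(pattern, value))
    return 100.0 * hits / len(values)


values = ["A-1"] * 5 + ["A-2"] * 4 + ["x", "y", "z"]
sample = most_frequent_sample(values, sample_size=2)
print(match_percentage(sample, r"A-\d"))  # evaluated on the sample
print(match_percentage(values, r"A-\d"))  # evaluated on all values
```

Here the sample reports 100 percent while the full data reports 75 percent, because the rare non-matching values never reach the sample. The more variety there is in the tail of the data, the larger this gap can become.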

Defining regular expressions

Not familiar with “Regular expressions”? Try an online tutorial such as regexone.com. You can also look at the regular expressions specified for some of the built-in tags.

NoteWhen putting together your own tagging rules, the built-in tags may provide good examples of what you can do using regular expressions and length limits. You can review the regular expressions for built-in tags, and you can disable built-in tags from propagating; however, you cannot edit the regular expression for a built-in tag.

Some basic notations are:

  • Match the beginning of the data value: ^
  • Match the end of a string: $
  • Match any digit: \d or [0-9]. Match any letter: [a-zA-Z]
  • Repetition of a pattern {n} where “n” is the number of repetitions
  • Indication of what does not match: (?!<pattern>)
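
The notations above can be tried out in Python's `re` module, where the digit class is written `\d`. The patterns below are illustrative examples, not the definitions of any built-in tag, and Data Catalog's regular expression dialect may differ in details from Python's.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Return True when the pattern is found in the value."""
    return re.search(pattern, value) is not None


# ^ and $ anchor the pattern to the start and end of the value;
# \d{5} means exactly five digits.
FIVE_DIGITS = r"^\d{5}$"
print(matches(FIVE_DIGITS, "94105"))   # five digits
print(matches(FIVE_DIGITS, "941056"))  # six digits

# [a-zA-Z]{3} means exactly three letters.
THREE_LETTERS = r"^[a-zA-Z]{3}$"
print(matches(THREE_LETTERS, "abc"))

# (?!0{5}) is a negative lookahead: the value must not be five zeros.
NOT_ALL_ZEROS = r"^(?!0{5})\d{5}$"
print(matches(NOT_ALL_ZEROS, "00000"))
print(matches(NOT_ALL_ZEROS, "00001"))
```

Without the anchors, `\d{5}` would also be found inside longer values such as "941056", which is usually not what a tagging rule intends.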

Regex use case to discover national identifiers

An excellent use case for regex tags is discovering national identifiers. Say there is a requirement for identifying and discovering national identifiers like passport or national ID for a country or set of countries. Data Catalog can help discover such fields with the help of regex tags.

Follow the steps below to discover fields using regex tags.

Procedure

  1. Create a Tag domain named National_Identifiers.

  2. Create regex tags for the national identifier and/or passport that needs to be discovered by defining the valid regex.

  3. Run Tag discovery on the data.

Results

Data Catalog's tag discovery now identifies the national identifier fields and suggests the tag associations.
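
As a way to prototype such patterns before creating the tags, the following Python sketch scans a few fields with one pattern per identifier. Everything in it is an assumption made for illustration: the tag names, the field names, the sample values, and both patterns (the passport format in particular is made up). Real identifier formats vary by country and should be taken from the issuing authority.

```python
import re

# Hypothetical regex tags for the National_Identifiers tag domain.
NATIONAL_IDENTIFIER_TAGS = {
    "US_SSN": r"\d{3}-\d{2}-\d{4}",     # common display format, not a validity check
    "Passport_Number": r"[A-Z]\d{8}",   # made-up format for illustration
}


def suggest_tags(fields: dict[str, list[str]], cutoff: float = 80.0) -> dict[str, list[str]]:
    """Suggest a tag for each field where enough values match the tag's pattern."""
    suggestions: dict[str, list[str]] = {}
    for field_name, values in fields.items():
        for tag, pattern in NATIONAL_IDENTIFIER_TAGS.items():
            hits = sum(1 for value in values if re.fullmatch(pattern, value))
            if values and 100.0 * hits / len(values) >= cutoff:
                suggestions.setdefault(field_name, []).append(tag)
    return suggestions


fields = {
    "ssn": ["123-45-6789", "987-65-4321"],
    "doc_no": ["A12345678", "B87654321"],
    "city": ["Oslo", "Lima"],
}
print(suggest_tags(fields))
```

In this sketch the "ssn" and "doc_no" fields each receive one suggestion and "city" receives none.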