

Hitachi Vantara Lumada and Pentaho Documentation

Collections


A collection is a virtual representation of a group of files that are organized as a directory hierarchy in a data source. Collections are useful for presenting a unified view of repetitive data structures.

For example, your system might perform daily scans and store the results in text files named date_scan.log, where the date uses a dd-mmm-yyyy format. Instead of analyzing the individual files for business value metadata, you can work with all the log files in a single collection view. The metadata is automatically aggregated during the schema discovery process for easier analysis. When you add files to any of the directories that are part of the collection, you must run schema and profile discovery again to reflect the newly added data in the collection.

Collections are not created assets but inferred assets. Collections are recognized during Data Catalog's schema discovery process based on the following requirements:

  • Three or more files must exist in a directory or in each subdirectory.
    Note: The number of files needed to recognize a collection is set by the configuration property ldc.discovery.smallest.collection.size.
  • The files must be the same file type.
  • The files must have the same schema.
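
The three requirements above can be read as a simple predicate over the files in one directory. The following Python sketch is only an illustration of that rule, not Data Catalog's actual discovery logic: the `FileInfo` type, the `qualifies_as_collection` function, and the idea of comparing schemas as ordered tuples of field names are all assumptions made for this example. The default of 3 stands in for the value of ldc.discovery.smallest.collection.size.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Assumed stand-in for ldc.discovery.smallest.collection.size ("three or more files").
SMALLEST_COLLECTION_SIZE = 3


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical description of one file found in a directory."""
    name: str
    file_type: str            # for example "log" or "csv"
    schema: Tuple[str, ...]   # ordered field names; an assumed schema model


def qualifies_as_collection(
    files: Sequence[FileInfo],
    min_size: int = SMALLEST_COLLECTION_SIZE,
) -> bool:
    """Return True when the files satisfy all three stated requirements."""
    # Requirement 1: enough files in the directory.
    if len(files) < min_size:
        return False
    first = files[0]
    # Requirements 2 and 3: every file shares the same type and the same schema.
    return all(
        f.file_type == first.file_type and f.schema == first.schema
        for f in files
    )
```

Under these assumptions, three daily scan logs with identical fields would qualify, while the same directory with only two logs, or with one file of a different type, would not.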

Collections are designated by the Collections icon.

Collection hierarchy

When a folder becomes a collection, the files inside the folder no longer appear individually in search results. Instead, search results show a single representation of all the files. The collection can be made up of files in a single folder or files in many folders all under a single top-level folder. The following diagram depicts an example resource structure hierarchy that Lumada Data Catalog identified as a collection.

Example collection structure hierarchy

Viewing collections

You can view collections by clicking Browse and then Virtual Folders on the Data Catalog menu. The contents of the virtual folder display with corresponding icons for members or collections. Click a Collections icon to open the collection details page for that collection.

Collection details page showing the aggregate of individual member record counts

Resources that are part of a collection do not appear individually in the Browse menu. To browse the members of a collection, click the More actions icon on the collection and select View as list from the menu that displays. The individual files in the list are the members of the collection.

View of collection members

Although collections are virtual resources, they can be browsed and profiled like any other resource in Data Catalog. The single resource view of a collection is similar to any other resource in Data Catalog with the exceptions noted below:

  • The total number of records shown for a collection is an aggregate of the number of records of the individual member resources of the collection.
  • Field tags can be applied only at the collection level. Once a resource is identified as a collection member, you cannot tag fields separately and the members do not individually participate in the tag discovery. This condition also applies if a collection member is a member of a dataset.
  • All tags, including any existing tags on the member resources, are aggregated at the collection root, which is the cumulative representative view of the resources as a collection.
  • Collection members can be added as dataset members, but collections cannot be added as dataset members because they are a cumulative representation of a group of resources.
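
The aggregation described in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Data Catalog is implemented: the `Member` type and the `summarize_collection` function are hypothetical, and tag aggregation is modeled here as a plain set union of the members' tags.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set, Tuple


@dataclass
class Member:
    """Hypothetical view of one member resource of a collection."""
    name: str
    record_count: int
    tags: Set[str] = field(default_factory=set)


def summarize_collection(members: Iterable[Member]) -> Tuple[int, Set[str]]:
    """Return (total record count, tags aggregated at the collection root)."""
    total_records = 0
    root_tags: Set[str] = set()
    for member in members:
        # The collection's record count is the sum over its members.
        total_records += member.record_count
        # Assumed model: tags on members roll up to the collection root as a union.
        root_tags |= member.tags
    return total_records, root_tags
```

For example, members with 100, 250, and 75 records would give the collection a total of 425 records, and any tags already present on those members would appear once at the collection root.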