Hitachi Vantara Lumada and Pentaho Documentation

Manage collections

You can use a collection to represent a unified view of repetitive data structures. A collection is a virtual representation of a group of files that are organized as a directory hierarchy in a data source.

For example, your system might perform daily scans and store the results in text files named date_scan.log, where the date uses a dd-mmm-yyyy format. Instead of analyzing the individual files for business value metadata, you can explore a single collection view that represents all the log files, with the metadata automatically aggregated during the schema discovery process for easier analysis. Collections are inferred assets, not created assets.

Note: When you add files to any directory that is part of the collection, you must run the Data Profiling Combo job so that the collection reflects the newly added data. If you have a large amount of data, you may want to run the Format Discovery, Schema Discovery, and Data Profiling jobs separately so that errors can be caught and corrected without delaying the other processing jobs.

Collections are recognized during Data Catalog's schema discovery process based on the following requirements:

  • Three or more files must exist in a directory or in each subdirectory.
    Note: The number of files needed to recognize a collection is set by the configuration property ldc.discovery.smallest.collection.size.
  • The files must be the same file type.
  • The files must have the same schema.
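
The three requirements above can be sketched as a simple check. This is a minimal illustration only, not Data Catalog's actual discovery logic: the FileInfo type, the qualifies_as_collection function, and the way a schema is modeled (an ordered tuple of field names) are all hypothetical, and the min_size argument merely stands in for the value that ldc.discovery.smallest.collection.size controls.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical description of one file in a directory."""
    name: str
    file_type: str            # for example "log" or "csv"
    schema: Tuple[str, ...]   # assumed model: ordered field names


def qualifies_as_collection(files: Sequence[FileInfo], min_size: int = 3) -> bool:
    """Return True if the files meet the three documented requirements.

    min_size defaults to 3 to mirror "three or more files"; treating it as
    the configured smallest collection size is an assumption of this sketch.
    """
    # Requirement 1: enough files in the directory.
    if len(files) < min_size:
        return False
    # Requirement 2: every file has the same file type.
    same_type = len({f.file_type for f in files}) == 1
    # Requirement 3: every file has the same schema.
    same_schema = len({f.schema for f in files}) == 1
    return same_type and same_schema
```

For instance, three date_scan.log files that share one file type and one schema would pass this check, while two such files, or three files where one has a different schema, would not.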

If you wish, you can skip the discovery of collections by specifying -skipCollectionDiscovery true in the Enter Parameters field when running Schema Discovery. You can also run a Collection Discovery job by itself.

Collection hierarchy

When a folder becomes a collection, the files inside the folder no longer appear individually in your search results. When you search your catalog, collection results are shown as a single representation of all the files. The collection can be made up of files in a single folder or files in many folders all under a single top-level folder, as shown in the following example of the resource structure hierarchy that Lumada Data Catalog identified as a collection.

Figure: Example collection structure hierarchy

Data Catalog runs collection discovery internally after identifying a set of resources as a collection at the HDFS schema discovery stage. Data Catalog also identifies multi-level collections in the collection discovery process. When you add files to one of the directories identified as part of the collection, you must run the collection schema and profile discovery to reflect the newly added data in the collection.

You can use the Summary view of a collection as you would use any other resource in Data Catalog, with the following exceptions:

  • The total number of records shown for a collection is an aggregate of the number of records of the individual member resources of the collection.
  • The field tags can be applied only at the collection level. After a resource is identified as a collection member, its fields cannot be tagged separately, and the members do not individually contribute to business term discovery.

View a collection

Although collections are virtual resources, you can explore and profile collections like any other resource in Data Catalog. The view in the Details tab of a collection is similar to other resources in Data Catalog with the following exceptions:
  • The total number of records shown for a collection is an aggregate of the number of records of the individual member resources of the collection.
  • Business terms can be applied only at the collection level. After a resource is identified as a collection member, you cannot tag it with business terms separately, and the members do not individually participate in business term discovery.
  • All business terms, including any existing terms on the resources, are gathered at the collection root, which is the aggregated view that represents the resources as a collection.
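
The aggregation described in these exceptions can be pictured with a short sketch. It is illustrative only and does not reflect how Data Catalog stores or computes these values: the Member type and the summarize_collection function are hypothetical names, and business terms are modeled simply as a set of strings.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Member:
    """Hypothetical member resource of a collection."""
    path: str
    record_count: int
    business_terms: FrozenSet[str] = frozenset()


def summarize_collection(members: Iterable[Member]) -> Tuple[int, FrozenSet[str]]:
    """Return (total records, business terms gathered at the collection root)."""
    members = list(members)
    # The collection's record count is the sum over its member resources.
    total_records = sum(m.record_count for m in members)
    # Terms already on member resources are gathered into one set at the root.
    root_terms = frozenset().union(*(m.business_terms for m in members))
    return total_records, root_terms
```

Under these assumptions, a collection whose members hold 100, 250, and 50 records would show 400 records, and any terms already present on the members would appear once on the collection view.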

Perform the following steps to view the list of files in a collection:

Procedure

  1. Click Data Canvas in the left navigation menu.

    The Explore Your Data page opens.
  2. Navigate to a collection through virtual folders in Explore Your Data.

    The contents of the virtual folder display with corresponding icons for members or collections.

  3. Click the collection to highlight it.

  4. Click the Details tab.

    Resources that are part of a collection appear as Contained Items.