
Hitachi Vantara Lumada and Pentaho Documentation

MongoDB onboarding and profiling example video

You can onboard MongoDB data by connecting it to a Data Catalog data source, then triggering a Data Catalog profiling job. The following video shows how to set up a MongoDB data source and profile it in Data Catalog:

Steps used in MongoDB onboarding example video

You can use these instructions to follow along in the MongoDB example video.

Adding a data source

If your role has the Manage Data Sources privilege, perform the following steps to create data source definitions.

Specify MongoDB data source identifiers

Perform the following steps to identify your MongoDB data source within Data Catalog:

Procedure

  1. Click Management in the left toolbar of the navigation pane.

    The Manage Your Environment page opens.
  2. Click Data Source then Add Data Source, or Add New then Add Data Source.

    The Create Data Source page opens.
  3. Specify the following basic information for the connection to your data source:

    Data Source Name: Specify the name of your data source. This name is used in the Data Catalog interface. It should be something your Data Catalog users recognize.

    Note: Names must start with a letter and must contain only letters, digits, and underscores. White spaces in names are not supported.
    Description (Optional): Specify a description of your data source.
    Agent: Select the Data Catalog agent that will service your data source. This agent is responsible for triggering and managing profiling jobs in Data Catalog for this data source.
    Data Source Type: Select the database type of your source. You are then prompted to specify additional connection information based on the file system or database type you are trying to access.
  4. Specify the following additional connection information based on the MongoDB resource you are trying to access:

    Configuration Method: Select URI as the configuration method.
    Source Path: Enter the MongoDB database path. For example, the default database path for MongoDB is /data/db.
    URL: Enter the MongoDB server URL, for example, mongodb://localhost:27017.
    Username and Password: Enter the username and password used to connect to the MongoDB server.
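The naming rule and URL format above can be checked before you submit the form. The following is a minimal Python sketch of those two checks; the helper names are illustrative, not Data Catalog APIs:

```python
import re
from urllib.parse import urlparse

# The documented naming rule: names must start with a letter and contain
# only letters, digits, and underscores (no white space).
NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

def is_valid_source_name(name: str) -> bool:
    """Hypothetical helper mirroring Data Catalog's name validation."""
    return bool(NAME_RE.match(name))

def is_mongodb_uri(url: str) -> bool:
    """MongoDB connection strings use the mongodb:// (or mongodb+srv://) scheme."""
    return urlparse(url).scheme in ("mongodb", "mongodb+srv")

print(is_valid_source_name("sales_mongo_01"))       # True
print(is_valid_source_name("1st source"))           # False: leading digit and a space
print(is_mongodb_uri("mongodb://localhost:27017"))  # True
```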

Test and add your data source

After you have specified the detailed information according to your data source type, test the connection to the data source and add the data source.
Note: Every time you add a data source, Data Catalog automatically creates its corresponding root virtual folder in the repository.

Procedure

  1. Click Test Connection to test your connection to the specified data source.

    If you are testing a MySQL connector and you get the following error, it means you need a more recent MySQL connector library:
    java.sql.SQLException: Client does not support authentication protocol requested by server. plugin type was = 'caching_sha2_password'

    1. Go to the MySQL :: Download Connector/J page and select the Platform Independent option.
    2. Download the compressed (.zip) file, copy it to /opt/ldc/agent/ext (where /opt/ldc/agent is your agent install directory), and unpack the file.
  2. (Optional) Enter a Note for any information you need to share with others who might access this data source.

  3. Click Create Data Source to establish your data source connection.
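The connector installation in step 1 amounts to copying the downloaded archive into the agent's ext directory and unpacking it. The following Python sketch models those steps; it uses temporary stand-in directories in place of the real /opt/ldc/agent path, and the archive contents are placeholders, so it can be run safely anywhere:

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

# Stand-in for /opt/ldc/agent so the sketch runs without a real agent install.
agent_home = Path(tempfile.mkdtemp()) / "agent"
ext_dir = agent_home / "ext"
ext_dir.mkdir(parents=True)

# Stand-in for the Platform Independent zip downloaded from dev.mysql.com;
# the real archive contains mysql-connector-j-<version>.jar.
download = Path(tempfile.mkdtemp()) / "mysql-connector-j.zip"
with zipfile.ZipFile(download, "w") as zf:
    zf.writestr("mysql-connector-j.jar", b"")

# The documented steps: copy the zip into <agent>/ext and unpack it there.
shutil.copy(download, ext_dir)
with zipfile.ZipFile(ext_dir / "mysql-connector-j.zip") as zf:
    zf.extractall(ext_dir)

print(sorted(p.name for p in ext_dir.iterdir()))
# prints ['mysql-connector-j.jar', 'mysql-connector-j.zip']
```

After the jar is in place, retest the connection from the Create Data Source page.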

Next steps

You can also update the settings for existing data sources, create virtual folders, update existing virtual folders for data sources, and delete data sources.

Job sequences

Job sequences are predefined series of jobs in Lumada Data Catalog that users with job execution privileges can run.

Note: These sequences execute with predefined parameters in Data Catalog and cannot be overridden by the user with the Sequence option.

Trigger a sequence job

Follow the steps below to run a sequence job for a specific resource.

Procedure

  1. Click Data Canvas in the left navigation menu.

    The Explore Your Data page opens.
  2. Use the Navigation pane to drill down to the resource.

  3. Click More actions and then select Process from the menu that displays.

    The Process Selected Items page opens.
  4. Click the sequence that you want to use.

    Select Template: A template is a custom definition for a given process with a custom set of parameters.
    Format Discovery: Identifies the format of data resources, marking the resources that can be further processed.
    Schema Discovery: Applies format-specific algorithms to determine the structure of the data in each resource, producing a list of columns or fields for each resource’s catalog entry.
    Collection Discovery: Discovers collections of data elements that share the same schema.
    Data Profiling: Applies data-specific logic to compute field-level statistics and patterns for each resource as unique fingerprints of the data.
    Data Profiling Combo: Starts a combined sequence of processes to profile your data, executing the format discovery, schema discovery, and data profiling processes.
    Business Term Discovery: Compares and analyzes the computed fingerprints with any defined or seeded label signatures to discover possible matches.

    Note: Users must have Run Term Discovery permissions to run this job.

    Lineage Discovery: Shows relationships among resources in the form of a lineage graph. Data lineage identifies copies of the same data, merges between resources, and the horizontal and vertical subsets of these resources.
    Data Rationalization: Finds redundant data copies and overlaps.

    The sequence page opens.
  5. Based on the resource, follow the workflow for the sequence.

  6. Click Incremental Profiling if you want to use incremental processing.

    Note: When you select Fast profiling mode in the Sequence flow, the default values for sample-splits and sample-rows are used, as defined in the Agent component's configuration.
  7. In the Enter Parameters field, enter any command line parameters for the sequence.

  8. Click Start Now.

Results

The job is submitted to the Data Catalog processing engine.
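The Data Profiling Combo sequence described above chains three of the other sequences in a fixed order. The following sketch models only that ordering; the functions are placeholders, not Data Catalog APIs, and the returned values are stand-ins for catalog metadata:

```python
# Illustrative only: models the order in which the Data Profiling Combo
# runs the discovery processes described above.
def format_discovery(resource: str) -> dict:
    # Identify the resource's format so later steps know how to read it.
    return {"resource": resource, "format": "csv"}  # example format

def schema_discovery(entry: dict) -> dict:
    # Derive the column/field list for the resource's catalog entry.
    entry["schema"] = ["id", "name"]  # placeholder field list
    return entry

def data_profiling(entry: dict) -> dict:
    # Compute field-level statistics and patterns (the data's "fingerprint").
    entry["profile"] = {field: "stats" for field in entry["schema"]}
    return entry

def data_profiling_combo(resource: str) -> dict:
    # Combo = format discovery, then schema discovery, then data profiling.
    return data_profiling(schema_discovery(format_discovery(resource)))

entry = data_profiling_combo("/data/db/orders")
print(sorted(entry))  # prints ['format', 'profile', 'resource', 'schema']
```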