Now that you have successfully installed Lumada Data Catalog, you are ready to start building your data catalog.
The Home page is the landing page for a user after login. This page is customized for the logged-in user. The Welcome header identifies and welcomes the user and displays the last login information or, if this is the user's first login, welcomes them to Lumada Data Catalog.
Your home screen will look like this:
Navigating the Home page
The Home page is customized for the logged-in user and provides multiple access points for exploring the catalog.
Main navigation dashboard
The main navigation dashboard is available throughout the product for quick navigation to its different parts. The navigation links provide the following functionality:
Allows the user to browse Data Catalog assets like virtual folders, datasets and data objects.
Provides access to the Tags and Tag Domain management.
Helps manage the various assets and functions.
The operational dashboard that summarizes the data catalog information. This is your one-stop shop for information such as:
- Processed/unprocessed data assets
- Sensitive/non-sensitive resources and fields
- Number of tag associations identified
- Number of users that have accessed the catalog
- Total searches performed
- Metadata objects generated using Data Catalog
- Community contribution
Lists the bookmarked resources for easy access.
Enables keyword searches through the Data Catalog.
Alerts the user to important events that require attention, such as job completion, a mention of the user in a post, or a request for resource access.
Provides a secondary credential validation menu for JDBC sources, and job activity monitoring for jobs triggered by that user.
Getting Started card
The Getting Started card provides links to different entry points in Data Catalog to help users get started with cataloging functions based on their role. The blocks are tabulated by the business functions (business user, business analyst, and data steward) that users may perform.
Data assets are organized by glossary domains and tags.
Browse Glossary maps to the corresponding item in the main navigation menu. Here you can view the Data Catalog by domain to locate domains of interest.
Files, tables, and other data resources are organized in virtual folders.
Browse Folders maps to the corresponding item in the main navigation menu. Here you can browse resources in attached data sources. You can also find data assets using keyword search and advanced search.
Read Documentation & Watch Videos
This link maps to the documentation where one can learn about Data Catalog Assets and Features, watch video tutorials, and browse other “How to” topics.
Data assets are tagged at the resource level and the field level. These tags may be provided by authorized users, suggested by the catalog's discovery engine, or both.
Manage Tags maps to the corresponding item in the main navigation menu. Start here to manage tags and tagged items.
Manage Data Objects
Data objects are assets formed by joining several other assets.
Manage Data Objects maps to the corresponding item in the main navigation menu. Data objects represent the most curated and studied data. Some of these objects can be key to jump-starting your project.
A data steward's primary task is to continuously curate the processed data flowing into Data Catalog. There are many types of data assets: files, HIVE tables, tables, collections, datasets, and data objects. Curation can be done along multiple dimensions, such as rich descriptions, discussions, and tagging. To curate data, use Browse or Search to navigate to the data asset of interest.
My Information card
The My Information card lists user information like roles assigned to that user and any pre-filters defined for their search dimensions.
Continue Working card
The Continue Working card lists the number of new and unread notifications, the last five searches performed, and up to five of the resources the user worked on during the last login.
Planning your data catalog
With tens or even hundreds of data sources in your enterprise data lake catering to the curiosities and needs of thousands of users, it is helpful to plan your data catalog before building it with Data Catalog.
The following steps are just a beginner's guideline for a data catalog plan:
Plan data sources to be added along with their path details
One of the ways Data Catalog achieves data security is by allowing control over virtual folder designation based on roles. Although Data Catalog does not allow data sources with overlapping paths, virtual folder paths can overlap through the use of include and exclude patterns.
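As a planning aid, the following minimal sketch checks a list of planned data source paths for overlaps and previews which resources a pair of include and exclude patterns would select. It is an offline illustration only, not a Data Catalog API: the source names and paths are hypothetical, and the patterns use Python's shell-style globbing, which may differ from the pattern syntax that Data Catalog accepts.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def paths_overlap(a: str, b: str) -> bool:
    """Return True if the two paths are equal or one contains the other."""
    pa, pb = PurePosixPath(a), PurePosixPath(b)
    return pa == pb or pa in pb.parents or pb in pa.parents


def find_overlaps(source_paths: dict[str, str]) -> list[tuple[str, str]]:
    """List pairs of planned data sources whose root paths overlap."""
    names = sorted(source_paths)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if paths_overlap(source_paths[x], source_paths[y])
    ]


def in_virtual_folder(resource: str, include: list[str], exclude: list[str]) -> bool:
    """A resource is selected if it matches an include pattern and no exclude pattern."""
    included = any(fnmatchcase(resource, p) for p in include)
    excluded = any(fnmatchcase(resource, p) for p in exclude)
    return included and not excluded


# Hypothetical plan: names and paths are placeholders for illustration.
plan = {
    "hdfs_sales": "/data/sales",
    "hdfs_sales_2020": "/data/sales/2020",
    "hdfs_sales_archive": "/data/sales_archive",
    "hdfs_hr": "/data/hr",
}
overlaps = find_overlaps(plan)
```

Comparing path components rather than string prefixes keeps sibling paths such as /data/sales and /data/sales_archive from being reported as overlapping.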
Plan custom roles
The default Data Catalog roles (Administrator, Steward, Analyst, and Guest) have well-defined behavior, with access to all virtual folders in the data catalog. To assign virtual folders to a role, you can emulate the default roles in custom roles with finer control over role-based accessibility and job management. To plan your custom roles, refer to the Role Permissions matrix; see Administrator role permissions. You may also want to plan how many admin users will be required to manage your data lake. Based on your organizational structure, you may want an admin for each business function, division, or region.
Plan tag domains and tag glossary
Tags are labels that users can attach to data or resources to mark a particular data pattern or resource that contributes to their business value. These tags can then be automatically propagated throughout the data lake to identify similar data patterns. Tags can be grouped into tag domains with a multi-level hierarchy and assigned to roles for performing data analysis. The actions that can be performed with tags differ by role, ranging from admins, who have the most control, to guest users, who can only view tags and tag suggestions. Tag domains also help separate business domains: users in one business function cannot see tags from another business function unless those tags are assigned to them.
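When planning tag domains on paper, it can help to model the hierarchy and role assignments before creating them in the product. The sketch below is one hypothetical way to do that in Python. The class, field names, domains, roles, and tags are all assumptions made for illustration and do not reflect Data Catalog's internal model or API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TagDomain:
    """Planning model: a tag domain, the roles it is assigned to, and its tag hierarchy."""
    name: str
    roles: set[str] = field(default_factory=set)
    # Maps each tag to its parent tag; top-level tags map to None.
    tags: dict[str, Optional[str]] = field(default_factory=dict)

    def tag_path(self, tag: str) -> str:
        """Return the full hierarchy path of a tag, for example 'Finance/PII/SSN'."""
        parts = []
        current: Optional[str] = tag
        while current is not None:
            parts.append(current)
            current = self.tags[current]
        return "/".join([self.name] + parts[::-1])


def visible_tags(domains: list[TagDomain], role: str) -> list[str]:
    """List the tag paths a role can see: only tags in domains assigned to that role."""
    return sorted(
        domain.tag_path(tag)
        for domain in domains
        if role in domain.roles
        for tag in domain.tags
    )


# Hypothetical plan with two business functions.
domains = [
    TagDomain("Finance", {"finance_steward"}, {"PII": None, "SSN": "PII"}),
    TagDomain("HR", {"hr_analyst"}, {"Employee": None}),
]
finance_view = visible_tags(domains, "finance_steward")
```

Listing what each planned role would see makes gaps in the separation between business functions easy to spot before any domain is created.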
Plan job management functions
Plan which roles will be allowed to run jobs. Data Catalog lets any role with job profiling enabled run profiling jobs as basic sequences with default attributes. However, for resources that require custom system or Data Catalog attributes, an admin user may need to create job templates that apply these custom attributes when profiling such resources.
A basic plan that follows the above guidelines will help you build your data catalog efficiently.
You are now ready to start building your data catalog.
Building your data catalog
The following are the general steps for building your data catalog.
Step 1: Add data sources
This is the first step towards building your data catalog. Here you connect the different data sources in your data lake (HDFS, HIVE, Oracle, Teradata, Redshift, Snowflake, etc.), whether on premises or hosted in the cloud.
Refer to Manage data sources for details on how to add data sources to Lumada Data Catalog.
Step 2: Create virtual folders
Data Catalog allows admins to create virtual folders from groups of resources belonging to the same data source. These can then be delegated to the target role group for data analysis and cataloging. This is one way Data Catalog applies its own authorization layer over the native layer using role-based access control (RBAC).
Refer to Manage virtual folders in the Administration Guide for details on creating virtual folders.
Step 3: Add tag domains and tag glossary
Depending on the function carried out by the different admins, you may want to create functional tag domains and any parent tags that will be commonly used. These can also be created by the regional admins.
Refer to Managing tag domains in the Administration Guide for tag domain and tag management. The Tags and tag propagation section under Data Catalog Assets and Features in the Getting Started section describes tags, the different types of tags, and tag propagation.
Step 4: Add custom roles
If the default roles in Data Catalog are too rigid for your application, create custom roles with role-based access control and functionality defined to suit your needs. Examples include a regional administrator role that is not permitted to run jobs, or a divisional data steward role with metadata-level resource access.
Refer to Managing roles in the Administration Guide for more information on creating and managing custom roles.
Step 5: Add users
Once you have set up the roles and their virtual folders, it is time to add the users. You may choose to add at least the divisional or departmental admins and delegate their share of the virtual folders. These admins can then add their own stewards and analysts.
Refer to Manage users in the Administration Guide for details about managing users.
Step 6: Profile data
If you decide to delegate job profiling to the functional admins, you will need to create job templates with custom Data Catalog or system parameters. If you decide to perform the initial profiling of your data lake as the Data Catalog service user, this is the best opportunity to do so. This way, Lumada Data Catalog is fully functional for users to start their data analysis and cataloging functions.
Step 7: Run tag discovery
Assuming you have defined seed tags and regex tags for data that needs to be identified in the catalog, now is the time to run the initial tag discovery job. The tag discovery job looks for data that matches the built-in tags, user-defined value tags, and regex tags, and populates the catalog glossary with the discovered suggestions.
Step 8: Tag curation and learning
Once the catalog glossary is populated with the discovered tag suggestions, any user (typically one with an analyst role) can curate these tags by accepting desired tags or rejecting undesired ones. Depending on how the user-defined value tags and regex tags are defined, the discovered suggestions may include false positives, and expected matches may be missed (false negatives). In such cases, the data stewards update the regex definitions or reassign seed values and re-run Step 7 (Run tag discovery). Steps 7 and 8 are interactive: tag curation requires users to modify tag definitions and re-run tag discovery, which helps Data Catalog learn tag discovery.
These steps can be repeated multiple times until the user is satisfied with the discovered tag quality.
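Before updating a regex tag definition and re-running tag discovery, a steward may find it useful to test the candidate expression offline against a few labeled sample values. The sketch below does this with Python's re module. It is an assumption-laden illustration: the sample values and patterns are hypothetical, whole-value matching is assumed, and Data Catalog's regex dialect and matching semantics may differ from Python's.

```python
import re


def evaluate_regex(pattern: str, samples: dict[str, bool]) -> dict[str, list[str]]:
    """Compare a candidate regex against labeled samples.

    `samples` maps a value to True if the tag should match it, False otherwise.
    Returns the values that were wrongly matched and wrongly missed.
    """
    compiled = re.compile(pattern)
    false_positives = []
    false_negatives = []
    for value, expected in samples.items():
        matched = compiled.fullmatch(value) is not None  # assumes whole-value matching
        if matched and not expected:
            false_positives.append(value)
        elif expected and not matched:
            false_negatives.append(value)
    return {"false_positives": false_positives, "false_negatives": false_negatives}


# Hypothetical labeled samples for a nine-digit identifier tag.
samples = {
    "123-45-6789": True,
    "123456789": True,
    "2021-03-04": False,   # a date, should not be tagged
    "12-345-6789": False,  # wrong digit grouping
}

loose = evaluate_regex(r"\d+-\d+-\d+", samples)
tight = evaluate_regex(r"\d{3}-?\d{2}-?\d{4}", samples)
```

In this example the loose pattern wrongly matches the date and the misgrouped value and misses the undashed identifier, while the tighter pattern classifies all four samples as labeled. A check like this narrows the definition before the next discovery run but does not replace curating the actual suggestions.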