Advanced settings for connecting to Google Dataproc
This article explains advanced settings for configuring the Pentaho Server to connect to Google Dataproc.
Before you begin
Procedure
Check the Components Reference to verify that your Pentaho version supports your version of Google Dataproc.
Prepare to use Google Dataproc by performing the following tasks:
Obtain the required credentials for a Google account and access to the Google Cloud Console.
Obtain the required credentials for Google Cloud Platform, Compute Engine, and Google Dataproc from your system administrator.
Contact your Hadoop administrator to obtain the connection information for the cluster and services that you intend to use. Some of this information may be available from a cluster management tool. You also need to supply some of this information to users after you are finished.
Create a Dataproc cluster
You can create a Dataproc cluster using several different methods. For more information on setting up your cluster, see the Google Cloud Documentation.
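For example, one way to create a basic cluster is from the command line with the gcloud CLI. The cluster name, region, and zone shown here are placeholders; substitute your own values:
$ gcloud dataproc clusters create example-cluster --region=us-central1 --zone=us-central1-a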
Install the Google Cloud SDK on your local machine
Use the Google Cloud Documentation to learn how to install the Google Cloud SDK on your supported platform.
- For Linux machines, see https://cloud.google.com/sdk/docs/downloads-interactive#linux
- For Windows machines, see https://cloud.google.com/sdk/docs/downloads-interactive#windows
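After installation, initialize the SDK to authenticate and select a default project:
$ gcloud init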
Set command variables
Perform the following steps to set command variables.
Procedure
Export the project, hostname, and zone variables using the following example:
$ export PROJECT=project;export HOSTNAME=hostname;export ZONE=zone
Set the PROJECT variable to your Google Cloud project ID.
Set the HOSTNAME variable to the name of the master node in your Dataproc cluster.
Note: The master name ends with an -m suffix.
Set the ZONE variable to the zone of the instances in your Dataproc cluster.
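For example, with placeholder values (substitute your own project ID, master node name, and zone):
$ export PROJECT=my-project;export HOSTNAME=my-cluster-m;export ZONE=us-central1-a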
Set up a Google Compute Engine instance for PDI
Perform the following procedures to set up a PDI client instance in Google Compute Engine and use it as a client for your Dataproc cluster.
Procedure
In the Google Cloud Platform dashboard, navigate to the Compute Engine console.
Navigate from the menu to the VM instances page, and click Create Instance.
Expand the networking settings, and in the Network Tags text box, enter vnc-server.
Install and update a working VNC service for the remote user interface.
Log in to the instance using SSH.
Use a locally installed SSH client command line to access the remote client instance using its external IP address.
Note: The console displays the external IP.
Alternatively, use the Compute Engine list of active virtual machines and select SSH from the list next to the virtual machine you want to use.
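For example, assuming your client instance is named pdi-client (a placeholder) and that the ZONE variable set earlier is still exported, you could open an SSH session with the gcloud CLI:
$ gcloud compute ssh pdi-client --zone=$ZONE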
Update the operating system on the virtual machine.
Install Gnome and VNC.
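As a minimal sketch, on a Debian-based image the update and installation might look like the following; package names vary by distribution and by the VNC server you choose:
$ sudo apt-get update && sudo apt-get -y upgrade
$ sudo apt-get install -y task-gnome-desktop tigervnc-standalone-server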
Create an SSH tunnel from your VNC client machine.
Connect to the VNC.
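For example, assuming the VNC server listens on display :1 (port 5901), you could tunnel that port from your local machine and then point your VNC viewer at localhost:5901; replace <user> and <external-ip> with your own values:
$ ssh -L 5901:localhost:5901 <user>@<external-ip>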
(Optional) Configure and log in to Kerberos on your client instance.
If you are using Kerberos, the VM instance running PDI in GCE must be configured for Kerberos to work with a Kerberos-enabled Dataproc cluster, and the client machine must be authenticated with the Kerberos controller.
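As a minimal sketch, once /etc/krb5.conf on the instance points to your realm's key distribution center, you can authenticate with kinit; the principal shown is a placeholder:
$ kinit <username>@YOUR.REALM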
Edit configuration files for users
Your cluster administrator must download configuration files from the cluster for the applications your teams are using, and then edit them to include Pentaho-specific and user-specific parameters. After editing, provide these modified files to the applicable users, who must copy them into their directory: <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name>.
When a named connection is created, the <user-defined connection name> directory is also created, and PDI copies these configuration files into that directory. The cluster administrator must provide users with the name to assign to the named connection so that PDI can copy the modified files into the correct directory (see the copy example after the following file list).
The following files must be provided to your users:
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- yarn-site.xml
- hive-site.xml
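As a minimal sketch, assuming a named connection called dataproc-cluster (a placeholder name) and that the edited files are in the current directory, each user could copy them as follows:
$ cp core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml hive-site.xml ~/.pentaho/metastore/pentaho/NamedCluster/Configs/dataproc-cluster/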
Edit the XML file for MapReduce
Perform the following steps to edit the mapred-site.xml file.
Procedure
Navigate to the <username>/.pentaho/metastore/pentaho/NamedCluster/Configs/<user-defined connection name> directory and open the mapred-site.xml file.
Add the following value for the parameter:
Parameter: mapreduce.app-submission.cross-platform
Note: This property is only needed to run MapReduce jobs on Windows platforms.
Value:
<property>
  <name>mapreduce.app-submission.cross-platform</name>
  <value>true</value>
</property>
Save and close the file.
Connect to a Hadoop cluster with the PDI client
After you have set up the Pentaho Server to connect to a cluster, you must configure and test the connection to the cluster. For more information about setting up the connection, see Connecting to a Hadoop cluster with the PDI client.
Connect other Pentaho components to Dataproc
The following sections explain how to create and test a connection to the cluster in the Pentaho Server, Pentaho Report Designer, and Pentaho Metadata Editor. Creating and testing a connection to the cluster in these components includes the following tasks:
- Install a driver for the Pentaho Server. For instructions, see Set up the Pentaho Server to connect to a Hadoop cluster.
- Create and test the cluster connections.
Create and test connections
For each Pentaho component, create and test the connection as described in the following list.
Pentaho Server for DI
Create a transformation in the PDI client and run it remotely.
Pentaho Server for BA
Create a connection to the cluster in the Data Source Wizard.
PME
Create a connection to the cluster in PME.
PRD
Create a connection to the cluster in PRD.
After you have connected to the cluster and its services properly, provide the connection information to users who need access to the cluster and its services. Those users can only access the cluster on machines that are properly configured to connect to the cluster.
To connect, users need the following information:
- Hadoop distribution and version of the cluster
- Hostnames, IP addresses, and port numbers for HDFS, JobTracker, ZooKeeper, and Hive2/Impala
- Oozie URL (if used)
Users also require permissions to access the directories they need on HDFS, such as their home directory and any other required directories.
They might also need more information depending on the job entries, transformation steps, and services they use. For a detailed list of information that your users need to use supported Hadoop services, see Hadoop connection and access information list.