Hitachi Vantara Lumada and Pentaho Documentation

Using a Job Entry to Load Data into Hive

To follow along with this tutorial, you will need:
  • Hadoop
  • Pentaho Data Integration
  • Hive

PDI jobs can be used to put files into Hive from many different sources. This tutorial shows you how to use a PDI job to load a sample data file into a Hive table.

Note: A Hive table can be defined over external data. Using the external option, you could define a Hive table that uses the HDFS directory containing the parsed file. For this tutorial, we chose not to use the external option, to demonstrate how easily files can be added to non-external Hive tables.
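For reference only, such an external table might be declared like this (a hypothetical sketch; the table name weblogs_ext is an assumption, and the column list is abbreviated to the first two columns):

```sql
-- Hypothetical sketch (not used in this tutorial): an external table over
-- the HDFS directory that holds the parsed file. Dropping an external table
-- removes only the table metadata; the files in HDFS are left in place.
create external table weblogs_ext (
  client_ip         string,
  full_request_date string
  -- ... remaining columns as in the weblogs table created in step 1 ...
)
row format delimited
fields terminated by '\t'
location '/user/pdi/weblogs/parse';
```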

If they are not already running, start Hadoop, PDI, and the Hive server. Unzip the sample data file and put weblogs_parse.txt in a convenient location.

Place the file in the /user/pdi/weblogs/parse directory of HDFS using these three commands:

hadoop fs -mkdir /user/pdi/weblogs
hadoop fs -mkdir /user/pdi/weblogs/parse
hadoop fs -put weblogs_parse.txt /user/pdi/weblogs/parse/part-00000

If you previously completed the Using Pentaho MapReduce to Parse Weblog Data tutorial, the necessary files will already be in the proper directory.

  1. Create a Hive Table.
    1. Open the Hive shell by entering 'hive' at the command line.
    2. Create a table in Hive for the sample data by entering:
      create table weblogs (
      client_ip           string,
      full_request_date   string,
      day                 string,
      month               string,
      month_num           int,
      year                string,
      hour                string,
      minute              string,
      second              string,
      timezone            string,
      http_verb           string,
      uri                 string,
      http_status_code    string,
      bytes_returned      string,
      referrer            string,
      user_agent          string)
      row format delimited
      fields terminated by '\t';
    3. Close the Hive shell by entering 'quit;'.
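If you want to confirm the table definition before loading data, you can check it from the same Hive shell; for example:

```sql
-- Optional check: list the tables in the current database and
-- show the column names and types of the weblogs table.
show tables;
describe weblogs;
```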
  2. Create a new Job to load the sample data into a Hive table by selecting File > New > Job.
  3. Add a Start job entry to the canvas. From the Design palette on the left, under the General folder, drag a Start job entry onto the canvas.
  4. Add a Hadoop Copy Files job entry to the canvas. From the Design palette, under the Big Data folder, drag a Hadoop Copy Files job entry onto the canvas.
  5. Connect the two job entries by hovering over the Start entry, selecting the output connector, and dragging the connector arrow to the Hadoop Copy Files entry.
  6. Enter the source and destination information within the properties of the Hadoop Copy Files entry by double-clicking it.
    1. For File/Folder source(s), enter hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/parse, where NAMENODE and PORT reflect your Hadoop destination.
    2. For File/Folder destination(s), enter hdfs://<NAMENODE>:<PORT>/user/hive/warehouse/weblogs.
    3. For Wildcard (RegExp), enter part-.* (a regular expression that matches files such as part-00000).
    4. Click the Add button to add the entries to the list of files to copy.

    When you are done, the Hadoop Copy Files window lists the source and destination you added (your file paths may differ).

    Click OK to close the window.
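For comparison, the copy this job entry performs is equivalent to the following command-line sketch (run it only if you prefer the shell over PDI, substituting your own paths):

```
hadoop fs -cp /user/pdi/weblogs/parse/part-* /user/hive/warehouse/weblogs
```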

  7. Save the job by selecting Save as from the File menu. Enter load_hive.kjb as the file name within a folder of your choice.
  8. Run the job by clicking the green Run button on the job toolbar, or by selecting Action > Run from the menu. The Execute a job window opens. Click Launch.

    An Execution Results panel opens at the bottom of the Spoon interface and displays the progress of the job as it runs. After a few seconds the job finishes successfully.

    If any errors occur, the job entry that failed is highlighted in red, and you can use the Logging tab to view the error messages.

  9. Verify the data was loaded by querying Hive.
    1. Open the Hive shell from the command line by entering hive.
    2. Enter this query to verify that the data was loaded correctly into Hive:
      select * from weblogs limit 10;
      Ten rows of data are returned.
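As a further check, you can count the loaded rows from the Hive shell (the exact count depends on the sample file, so no specific number is assumed here):

```sql
-- Count all rows loaded into the weblogs table.
select count(*) from weblogs;
```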