Configuration Basics
DataWave Ingest is largely driven by configuration. Below we’ll use DataWave’s example tvmaze data type (myjson config) to examine the basic settings used to establish a data type within the ingest framework.
Step 1: Define the Data Type
To ingest data in Step 5 below, we'll use the myjson-ingest-config.xml example file, which defines our tvmaze data type and contains all of the configuration settings required to ingest the raw JSON from tvmaze.com.
The settings below are a subset of these and represent the core settings for any data type. They are replicated here only for demonstration purposes. Please see the configuration documentation for full descriptions of the properties.
Deploy directory: $DATAWAVE_INGEST_HOME/config/
<configuration>
<property>
<name>data.name</name>
<value>myjson</value>
</property>
<property>
<name>myjson.output.name</name>
<value>tvmaze</value>
</property>
<property>
<name>myjson.data.category.date</name>
<value>PREMIERED</value>
</property>
<property>
<name>file.input.format</name>
<value>datawave.ingest.json.mr.input.JsonInputFormat</value>
</property>
<property>
<name>myjson.reader.class</name>
<value>datawave.ingest.json.mr.input.JsonRecordReader</value>
</property>
<property>
<name>myjson.ingest.helper.class</name>
<value>datawave.ingest.json.config.helper.JsonIngestHelper</value>
</property>
<property>
<name>myjson.handler.classes</name>
<value>datawave.ingest.json.mr.handler.ContentJsonColumnBasedHandler</value>
</property>
<property>
<name>myjson.data.category.marking.default</name>
<value>PRIVATE|(BAR&amp;FOO)</value>
</property>
<property>
<name>myjson.data.category.index</name>
<value>NAME,ID,EMBEDDED_CAST_CHARACTER_NAME,...</value>
</property>
<property>
<name>myjson.data.category.index.reverse</name>
<value>NAME,NETWORK_NAME,OFFICIALSITE,URL</value>
</property>
...
<!--
The remaining settings in the linked config file are beyond the scope of this example
-->
...
</configuration>
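Note that "deploying" the file amounts to placing it in the directory noted above. A minimal sketch, assuming the file was authored in the current working directory (the xmllint check is an optional sanity step, not a DataWave requirement):

# Optional: sanity-check the XML before deploying
xmllint --noout myjson-ingest-config.xml

# Stage the config in the ingest deploy directory
cp myjson-ingest-config.xml "$DATAWAVE_INGEST_HOME/config/"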
Step 2: Register the Data Type
The config file, ingest-config.xml, is used to register all of our data types and to define a few global settings. Properties of interest are replicated here only for demonstration purposes.
Deploy directory: $DATAWAVE_INGEST_HOME/config/
<configuration>
<property>
<name>ingest.data.types</name>
<value>myjson,...</value>
<description>
Comma-delimited list of data types to be processed by the system. The
{data.name} value for your data type MUST appear in this list in order for
it to be processed
</description>
</property>
...
...
<property>
<name>event.discard.interval</name>
<value>0</value>
<description>
Per the DataWave data model, each data object (a.k.a. "event") has an
associated date, and this date is used to determine the object's YYYYMMDD shard
partition within the primary data table.
With that in mind, the value of this property is defined in milliseconds and
denotes that an object having a date prior to (NOW - event.discard.interval)
should be discarded.
E.g., use a value of (1000 x 60 x 60 x 24) to automatically discard
objects more than 24 hrs old
- A value of 0 disables this check
- A data type can override this value with its own
"{data.name}.event.discard.interval" property
</description>
</property>
...
<!--
The remaining settings in this file are beyond the scope of this example,
and will be covered later
-->
...
</configuration>
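The event.discard.interval arithmetic described above is easy to get wrong by a factor of 1,000, since the property is expressed in milliseconds. A quick sketch for computing candidate values on the command line:

# event.discard.interval is expressed in milliseconds
echo $(( 1000 * 60 * 60 * 24 ))      # 86400000   -> discard objects older than 1 day
echo $(( 1000 * 60 * 60 * 24 * 30 )) # 2592000000 -> discard objects older than ~30 days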
Step 3: Additional Considerations
3.1: The ‘all’ Data Type
The all-config.xml file represents a globally-recognized data type that may be leveraged to configure settings that we wish to have applied to all data types.
For example, the all.handler.classes property below will enable edge creation and date indexing for any data types configured to take advantage of those behaviors. That is, these particular classes require additional configuration in order to have any effect during ingest, whereas other handler classes may have no such requirement.
Deploy directory: $DATAWAVE_INGEST_HOME/config/
<configuration>
...
...
<property>
<name>all.handler.classes</name>
<value>
datawave.ingest.mapreduce.handler.edge.ProtobufEdgeDataTypeHandler,
datawave.ingest.mapreduce.handler.dateindex.DateIndexDataTypeHandler
</value>
<description>
Comma-delimited list of data type handlers to be utilized by all
registered types, in *addition* to any distinct handlers set by
individual data types via their own *-config.xml files.
</description>
</property>
...
...
<property>
<name>all.ingest.policy.enforcer.class</name>
<value>datawave.policy.IngestPolicyEnforcer$NoOpIngestPolicyEnforcer</value>
<description>
Name of the class to use for record-level (or per-data object) policy
enforcement on all incoming records
</description>
</property>
...
<!--
The remaining settings in this file are beyond the scope of this example,
and will be covered later
-->
...
</configuration>
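Because all.handler.classes augments rather than replaces each type's own handler list, it can be helpful when debugging to view both lists side by side. A quick sketch against the deployed files (file names per the deploy directories shown above):

# Handlers applied to every registered data type...
grep -A 4 'all.handler.classes' "$DATAWAVE_INGEST_HOME/config/all-config.xml"

# ...plus the handlers configured for the 'myjson' type specifically
grep -A 2 'myjson.handler.classes' "$DATAWAVE_INGEST_HOME/config/myjson-ingest-config.xml"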
3.2: Edge Definitions
The example config, edge-definitions.xml, is used to define various edges, all of which are based on our example data types. Note that several edge types are defined for our tvmaze/myjson data type. Performing edge-related queries will be covered later in the tour.
Deploy directory: $DATAWAVE_INGEST_HOME/config/
The ProtobufEdgeDataTypeHandler class, as mentioned above, is responsible for generating graph edges from incoming data objects. Thus, it leverages this config file to determine how and when to create edge key/value pairs for a given data type.
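We won't dissect the edge configuration itself here, but a quick grep of the deployed file is enough to confirm which entries reference our data type (a rough sketch; the exact element names in the file may differ):

# Show lines in the edge config that mention our tvmaze/myjson data type
grep -in -e 'myjson' -e 'tvmaze' "$DATAWAVE_INGEST_HOME/config/edge-definitions.xml"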
In Steps 1, 2, and 3, we looked at a few of the most important configuration properties for establishing a data type within the DataWave Ingest framework:
- data.name, and optionally {data.name}.output.name, will uniquely identify a data type and its associated raw data feed. These two properties can be leveraged to split a single data type over multiple data feeds, if needed (see the sketch after this list)
- {data.name}.reader.class, {data.name}.ingest.helper.class, {data.name}.handler.classes, and all.handler.classes together define the processing pipeline for a given data type, transforming raw input into data objects comprised of Accumulo key/value pairs
- The all-config.xml file can be used to control various ingest behaviors for all registered data types
- Graph edges in DataWave are composed of Normalized Field Value pairs known to exist within the Data Objects of DataWave's registered data types. Thus, edge types are defined via configuration on a per-data type basis. Given a single data object as input, DataWave Ingest may emit zero or more edges based on this configuration.
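To illustrate the first bullet, a second feed for the same logical type would get its own data.name but reuse the original output name, so that both feeds land in Accumulo as tvmaze data. A minimal sketch; the 'myjson2' name and config file are hypothetical:

# Hypothetical second feed for the same logical type: a new data.name,
# but the same output name, so both feeds merge under 'tvmaze'
cat > myjson2-ingest-config.xml <<'EOF'
<configuration>
  <property>
    <name>data.name</name>
    <value>myjson2</value>
  </property>
  <property>
    <name>myjson2.output.name</name>
    <value>tvmaze</value>
  </property>
  <!-- remaining myjson2.* properties would mirror the originals -->
</configuration>
EOF

The new name would also need to be appended to ingest.data.types in ingest-config.xml (Step 2) before the feed would be processed.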
Load Some New Data
Lastly, we'll fetch some new data via the TVMAZE-API service and load it into DataWave.
Step 4: Stage the Raw Data
Here we use the quickstart's ingest-tv-shows.sh script to fetch additional TV shows along with cast member information. If you wish, the default list of shows to download may be overridden by passing your own comma-delimited list of show names via the --shows argument. Use the --help option for more information.
The script simply iterates over the specified list of TV show names, invokes http://api.tvmaze.com/singlesearch/shows?q=SHOWNAME&embed=cast for each show, and appends each search result to the output file specified by --outfile.
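For instance, a hypothetical invocation overriding the default show list might look like this (the show names here are arbitrary, and the flag ordering is an assumption):

# Download two specific shows without ingesting them
./ingest-tv-shows.sh --download-only \
    --shows "Veep,Breaking Bad" \
    --outfile ~/my-shows.json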
$ cd DW_SOURCE/contrib/datawave-quickstart/bin/services/datawave/ingest-examples
# We'll use the --download-only flag to write the data to a local file only,
# to avoid ingesting the data automatically...
$ ./ingest-tv-shows.sh --download-only --outfile ~/more-tv-shows.json
[DW-INFO] - Writing json records to /home/me/more-tv-shows.json
[DW-INFO] - Downloading show data: 'Veep'
[DW-INFO] - Downloading show data: 'Game of Thrones'
[DW-INFO] - Downloading show data: 'I Love Lucy'
[DW-INFO] - Downloading show data: 'Breaking Bad'
[DW-INFO] - Downloading show data: 'Malcom in the Middle'
[DW-INFO] - Downloading show data: 'The Simpsons'
[DW-INFO] - Downloading show data: 'Sneaky Pete'
[DW-INFO] - Downloading show data: 'King of the Hill'
[DW-INFO] - Downloading show data: 'Threes Company'
[DW-INFO] - Downloading show data: 'The Andy Griffith Show'
[DW-INFO] - Downloading show data: 'Matlock'
[DW-INFO] - Downloading show data: 'North and South'
[DW-INFO] - Downloading show data: 'MASH'
[DW-INFO] - Data download is complete
...
Step 5: Run the Ingest M/R Job
All that's left now is to write the raw data to HDFS and then invoke DataWave's IngestJob class with the appropriate parameters. To accomplish both, we'll simply invoke a quickstart utility function, datawaveIngestJson.
This quickstart function also displays the actual bash commands required to ingest the file, as shown below.
The key thing to note here is that DataWave's live-ingest.sh script is used, which passes arguments to the IngestJob class in order to enable Live Ingest mode for our data.
# Quickstart utility function...
$ datawaveIngestJson ~/more-tv-shows.json
[DW-INFO] - Initiating DataWave Ingest job for '~/more-tv-shows.json'
[DW-INFO] - Loading raw data into HDFS:
# Here's the command to write the raw data to HDFS...
hdfs dfs -copyFromLocal ~/more-tv-shows.json /Ingest/myjson
[DW-INFO] - Submitting M/R job:
# And here's the command to submit the ingest job. The "live-ingest.sh" script
# passes the parameters below and others to DataWave's IngestJob class...
$DATAWAVE_INGEST_HOME/bin/ingest/live-ingest.sh /Ingest/myjson/more-tv-shows.json 1 \
-inputFormat datawave.ingest.json.mr.input.JsonInputFormat \
-data.name.override=myjson
...
...
# Job log is written to the console and also to QUICKSTART_DIR/data/hadoop/yarn/log/
...
...
18/04/01 19:45:43 INFO mapreduce.Job: Running job: job_1523209920746_0004
18/04/01 19:45:48 INFO mapreduce.Job: Job job_1523209920746_0004 running in uber mode : false
18/04/01 19:45:48 INFO mapreduce.Job: map 0% reduce 0%
18/04/01 19:45:55 INFO mapreduce.Job: map 100% reduce 0%
18/04/01 19:45:55 INFO mapreduce.Job: Job job_1523209920746_0004 completed successfully
18/04/01 19:45:55 INFO mapreduce.Job: Counters: 43
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=188556
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=394045
HDFS: Number of bytes written=0
HDFS: Number of read operations=3
HDFS: Number of large read operations=0
HDFS: Number of write operations=0
Job Counters
Launched map tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=8842
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=4421
Total vcore-seconds taken by all map tasks=4421
Total megabyte-seconds taken by all map tasks=9054208
Map-Reduce Framework
Map input records=13
Map output records=19584
Input split bytes=119
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=128
CPU time spent (ms)=9650
Physical memory (bytes) snapshot=311042048
Virtual memory (bytes) snapshot=2908852224
Total committed heap usage (bytes)=253427712
Content Index Counters
Document synonyms processed=24
Document tokens processed=2083
EVENTS_PROCESSED
MYJSON=13
FILE_NAME
hdfs://localhost:9000/Ingest/myjson/more-tv-shows.json=1
LINE_BYTES
MAX=64184
MIN=4686
TOTAL=250260
ROWS_CREATED
ContentJsonColumnBasedHandler=12231
DateIndexDataTypeHandler=0
ProtobufEdgeDataTypeHandler=8672
datawave.ingest.metric.IngestOutput
DUPLICATE_VALUE=1563
MERGED_VALUE=9104
ROWS_CREATED=20903
File Input Format Counters
Bytes Read=393926
File Output Format Counters
Bytes Written=0
18/04/01 19:45:55 INFO datawave.ingest: Counters: 43
# (Same counter output as above, repeated under the datawave.ingest logger)
...
...
...
# Use the Accumulo shell to verify the new TV shows were loaded into
# the primary data table (i.e., the "shard" table in the quickstart env)...
$ ashell
Shell - Apache Accumulo Interactive Shell
-
- version: 1.8.1
- instance name: my-instance-01
- instance id: cc3e8158-a94a-4f2e-af9e-d1014b5d1912
-
- type 'help' for a list of available commands
-
root@my-instance-01> grep Veep -t datawave.shard
In Steps 4 and 5, we acquired some new raw data via the TVMAZE-API service and ingested it via MapReduce:
- We loaded the raw data into HDFS with a simple copy command…
hdfs dfs -copyFromLocal ~/more-tv-shows.json /Ingest/myjson
- Finally, we used DataWave's live-ingest.sh script to kick off the MapReduce job. This script passes the -outputMutations and -mapOnly flags to the IngestJob class to enable "live" mode…

# Args: the raw file in HDFS, $NUM_SHARDS (i.e., the number of reducers),
# an override of IngestJob's default input format, and an override that
# forces the job to use our 'myjson' config
$DATAWAVE_INGEST_HOME/bin/ingest/live-ingest.sh \
    /Ingest/myjson/more-tv-shows.json \
    1 \
    -inputFormat datawave.ingest.json.mr.input.JsonInputFormat \
    -data.name.override=myjson