## Data Type Configuration
Here, Data Type denotes a distinct flow of data into the system in which all the raw input arrives in the same binary format and conforms to some well-defined information schema (e.g., an XML dataset where all the input files conform to the same XSD).
### Configuration Files

- File Name: {Data Type}-config.xml (a minimal skeleton is sketched after this list)
  - The only requirement for the file name is that it must end with “-config.xml”
  - Example file: myjson-ingest-config.xml
- Edge definitions for the data type, if any, should be defined in a distinct, global config file
  - Example file: edge-definitions.xml
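A data type config is a Hadoop-style configuration XML file. The following is a minimal sketch for the hypothetical myjson feed named above; the com.example helper class is a placeholder, and the available property names are covered in the Properties section below.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>

    <!-- Unique identifier for this data type and its feed -->
    <property>
        <name>data.name</name>
        <value>myjson</value>
    </property>

    <!-- Most remaining properties are prefixed with the data.name value -->
    <property>
        <name>myjson.ingest.helper.class</name>
        <!-- Hypothetical IngestHelperInterface implementation for this feed -->
        <value>com.example.ingest.MyJsonIngestHelper</value>
    </property>

</configuration>
```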
### Properties
In practice, the settings available to a given data type may originate from any number of specialized classes throughout the ingest API, each class establishing its own set of configurable behaviors for various ingest-related purposes. Thus, the properties below are a relatively small subset of all those possible, but they represent core settings that will be common across most, if not all, of your data types.
| Property Name | Description |
|---|---|
| data.name | This value is effectively an identifier for both the data type and its associated data feed. Therefore, the value must be unique across all data type configs. Unless a (data.name).output.name value is specified, this value will also be used as the Data Type (DT) identifier (and Column Family prefix) for all associated objects in the primary data table. If so, then it may be leveraged by query clients for data filtering purposes. Note that "(data.name)." must be used as a prefix for (most of) the data type's remaining property names |
| (data.name).output.name | This value will be used to identify the Data Type (DT) and to establish the Column Family prefix for all associated data objects in Accumulo. Thus, query clients may leverage this value for data filtering. Unlike data.name, this value does not have to be unique across all configs. For example, we might find later on that there is enrichment info related to our original data feed that we'd like to incorporate. Rather than modify the original data feed and its config, we may opt to establish a new feed with its own distinct config. If so, we may find it beneficial to reuse the *.output.name value in the new feed's configuration. Using the same output name for the new feed allows its data objects in Accumulo to be merged into and collocated with the corresponding data objects from the original feed, provided that both are utilizing compatible sharding and UID creation strategies for their respective objects |
| (data.name).data.category.date | A known field name within the data to be used, if present, for the given object's shard row date (a.k.a. "event date"), thus affecting the object's partition assignment within the primary data table |
| (data.name).data.category.date.formats | Known/valid date format(s) for the field identified by (data.name).data.category.date. Comma-delimited, if more than one. Examples: yyyy-MM-dd, yyyy-MM-dd'T'HH:mm:ss'Z', yyyy-MM-dd HH:mm:ss |
| file.input.format | Fully-qualified class name of the Hadoop MapReduce InputFormat implementation to use |
| (data.name).reader.class | Fully-qualified name of the class implementing Hadoop MapReduce RecordReader and extending datawave.ingest.input.reader.EventRecordReader. As such, this class presents raw data objects (in the form of datawave.ingest.data.RawRecordContainer) as input to the DataWave Ingest mapper, datawave.ingest.mapreduce.EventMapper |
| (data.name).ingest.helper.class | Fully-qualified name of a class implementing datawave.ingest.data.config.ingest.IngestHelperInterface, for parsing/extracting field name/value pairs from a single raw data object |
| (data.name).handler.classes | Comma-delimited list of classes that will process each data object in order to produce Accumulo key/value pairs in accordance with DataWave's data model. These classes implement datawave.ingest.mapreduce.handler.DataTypeHandler. Typically, a data type will configure at least one concrete class here that is derived from datawave.ingest.mapreduce.handler.shard.ShardedDataTypeHandler, a specialized DataTypeHandler abstraction tailored for ingest into the DataWave data model |
| (data.name).data.category.index | Comma-delimited list of field names that we want to have forward indexed in order to make them searchable via the query API |
| (data.name).data.category.index.reverse | Comma-delimited list of field names that we want to have reverse indexed in order to make them searchable via leading wildcards |
| (data.name).data.category.marking.default | The default behavior of DataWave is to interpret this value as the exact Accumulo visibility expression to be applied to each object and data field during ingest. This is due to DataWave's default MarkingsHelper implementation, datawave.ingest.data.config.MarkingsHelper.NoOp. Example value: PRIVATE\|(BAR&FOO). Thus, security marking behavior is API-driven and may be overridden as needed by implementing a specialized datawave.ingest.data.config.MarkingsHelper, which can then be injected at runtime via the datawave.ingest.config.IngestConfigurationFactory service loader |
| (data.name).(FieldName).data.field.marking | This property may be used to apply distinct security markings to specific fields as needed, overriding the (data.name).data.category.marking.default property for the given field. That is, the configured value here will be used to assign the appropriate security marking to the "FieldName" field |
| (data.name).(FieldName).data.field.type.class | Fully-qualified class name of the DataWave type to be used to interpret and normalize "FieldName" values. Example types are datawave.data.type.DateType, datawave.data.type.NumberType, datawave.data.type.GeoType, etc. |
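To illustrate the field-specific properties above, here is a hedged sketch for the same hypothetical myjson feed, where EVENT_COUNT and SENSITIVE_FIELD are illustrative field names:

```xml
<!-- Normalize EVENT_COUNT values numerically (field name is illustrative) -->
<property>
    <name>myjson.EVENT_COUNT.data.field.type.class</name>
    <value>datawave.data.type.NumberType</value>
</property>

<!-- Override the default marking for one field (field name and expression are illustrative) -->
<property>
    <name>myjson.SENSITIVE_FIELD.data.field.marking</name>
    <value>PRIVATE</value>
</property>
```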
## Flag Maker Configuration

### Configuration Files

- File Name: flag-maker-{Flow Name}.xml
  - This file contains configuration settings for a single Flag Maker process and its associated data types
  - The above file name format is only a recommendation; the file name itself is unimportant and can be whatever you'd like

Examples in the DataWave project include two Flag Maker configs and two sets of accompanying bash scripts, demonstrating the bulk and live ingest data flows respectively. However, new configs and scripts can be created as needed. Generally speaking, there is no upper bound on the number of Flag Maker processes that DataWave Ingest can support.
### Scripts

- {Flow Name}-ingest-server.sh – monitors the number of running jobs and the existing marker files for the flow, and calls {Flow Name}-execute.sh if more jobs can be supported
- {Flow Name}-execute.sh – runs the {Flow Name}-ingest.sh command from the first line in the flag file
- {Flow Name}-ingest.sh – starts the MapReduce job
### Classes and Interfaces
- FlagMaker.java
- FlagMakerConfig.java
- FlagDataTypeConfig.java
- FlagDistributor.java
### Properties
The following properties configure the Flag Maker process itself (see FlagMakerConfig):

| Property Name | Description |
|---|---|
| baseHDFSDir | Base HDFS directory under which the Flag Maker looks for input files |
| datawaveHome | DataWave Ingest installation directory, e.g., /path/to/datawave-ingest/current |
| distributorType | One of "simple", "date", or "folderdate". See the SimpleFlagDistributor, DateFlagDistributor, and DateFolderFlagDistributor classes respectively |
| filePattern | Regex for the files to be added to the file list (ignore “.” files, etc.) |
| flagFileDirectory | Local directory on the ingest master host in which to put flag files |
| hdfs | Ingest NameNode URI, e.g., hdfs://ingest.namenode.host:9000 |
| setFlagFileTimestamp | If set to true, then the timestamp on each flag file will be set to the timestamp of the last file contained therein |
| sleepMilliSecs | Wait this long (ms) before making another flag file. Defaults to 15 seconds |
| socketPort | Port on which this Flag Maker will listen for a shutdown command |
| timeoutMilliSecs | Stop appending to the input file list after this time (ms). Defaults to 5 minutes. That is, if there is any data to be processed, then a flag file must be created within this time period regardless of other considerations |
| useFolderTimestamp | If set to true, use the folder date for the file timestamp instead of the actual file timestamp |
The following properties are configured per registered data type (see FlagDataTypeConfig):

| Property Name | Description |
|---|---|
| dataName | Unique data.name identifier of the registered data type (see Data Type Configuration above) |
| distributionArgs | Allows arguments to be passed to the FlagDistributor instance. Defaults to "none" |
| extraIngestArgs | Extra arguments to pass to the ingest process |
| fileListMarker | Marker to aid flag file parsing. Denotes that a list of input files for the MR job will follow immediately, one input file per line |
| folder | Folder under baseHDFSDir in which to look for files of type dataName. The folder is treated as an absolute path if it leads with a slash; otherwise it is relative to the HDFS base dir |
| ingestPool | Used in the naming of flag files; may also identify the YARN scheduler queue to use |
| inputFormat | Input format to use for the job. Defaults to datawave.ingest.input.reader.event.EventSequenceFileInputFormat |
| lifo | Should we process the data LIFO (true) or FIFO (false)? Defaults to false, i.e., FIFO. Ordering is based on the file date, within a bucket |
| maxFlags | Maximum number of blocks (mappers) per job. Allows you to override the maximum files/mappers/blocks for the given data type |
| reducers | Number of reducers to use for ingest jobs |
| script | The script used to launch the ingest job. Forms the basis of the command that will be written to the flag file |
| timeoutMilliSecs | Overrides the parent Flag Maker's timeoutMilliSecs setting |
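Tying the two tables together, here is a minimal sketch of a Flag Maker config. It assumes, as in the example configs shipped with the DataWave project, that the XML element names mirror the property names above and that each data type's settings are nested in their own flagCfg element; all paths and values are illustrative only.

```xml
<flagMakerConfig>

    <!-- Process-level settings (first table above) -->
    <baseHDFSDir>/data/myflow</baseHDFSDir>
    <datawaveHome>/opt/datawave-ingest/current</datawaveHome>
    <distributorType>simple</distributorType>
    <flagFileDirectory>/srv/data/flags</flagFileDirectory>
    <hdfs>hdfs://ingest.namenode.host:9000</hdfs>
    <socketPort>22222</socketPort>

    <!-- Per-data-type settings (second table above); one flagCfg per registered type -->
    <flagCfg>
        <dataName>myjson</dataName>
        <folder>myjson</folder>
        <ingestPool>bulk</ingestPool>
        <maxFlags>10</maxFlags>
        <reducers>10</reducers>
        <script>bin/ingest/bulk-ingest.sh</script>
    </flagCfg>

</flagMakerConfig>
```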
## Bulk Loader Configuration

### Usage
Java class: datawave.ingest.mapreduce.job.BulkIngestMapFileLoader
```
*.BulkIngestMapFileLoader hdfsWorkDir jobDirPattern instanceName zooKeepers username password \
    [-sleepTime sleepTime] \
    [-majcThreshold threshold] \
    [-majcCheckInterval count] \
    [-majcDelay majcDelay] \
    [-seqFileHdfs seqFileSystemUri] \
    [-srcHdfs srcFileSystemURI] \
    [-destHdfs destFileSystemURI] \
    [-jt jobTracker] \
    [-shutdownPort portNum] \
    confFile [{confFile}]
```
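For example, a hypothetical invocation against a small warehouse cluster might look like the following; the class path variable, host names, credentials, and option values are all placeholders:

```bash
java -cp "$DATAWAVE_INGEST_CLASSPATH" datawave.ingest.mapreduce.job.BulkIngestMapFileLoader \
    /data/BulkIngest 'job_*' warehouse-instance zoo1:2181,zoo2:2181 ingest-user ingest-passwd \
    -sleepTime 5000 \
    -majcThreshold 32 \
    -srcHdfs hdfs://warehouse.namenode.host:9000 \
    -destHdfs hdfs://warehouse.namenode.host:9000 \
    myjson-ingest-config.xml edge-definitions.xml
```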
### Properties
| Property Name | Description |
|---|---|
| confFile | One or more *-config.xml data type config files (see Data Type Configuration above) |
| destHdfs | Destination file system URI (Warehouse) |
| hdfsWorkDir | Directory in HDFS to watch for completed job directories |
| instanceName | Accumulo instance name |
| jobDirPattern | Pattern for dirs in hdfsWorkDir to check for the complete file marker |
| jt | Job tracker node |
| majcCheckInterval | Number of bulk loads to process before rechecking majcThreshold and majcDelay |
| majcDelay | Amount of time (ms) to wait between bringing map files online |
| majcThreshold | Max number of major compactions allowed before waiting |
| maxDirectories | Max number of directories |
| numAssignThreads | Number of bulk import assignment threads (default 4) |
| numHdfsThreads | Number of threads to use for concurrent HDFS operations (default 1) |
| numThreads | Number of bulk import threads (default 8) |
| password | Accumulo password |
| seqFileHdfs | Sequence file system URI (Ingest) |
| shutdownPort | Port on which the loader listens for a shutdown command |
| sleepTime | Amount of time (ms) to sleep between checks for map files |
| srcHdfs | Source file system URI (Warehouse) |
| username | Accumulo username |
| zooKeepers | Comma-delimited list of ZooKeeper host:port pairs |