Data Type Configuration

Here, Data Type denotes a distinct flow of data into the system in which all the raw input arrives in the same binary format and conforms to some well-defined information schema (e.g., an XML dataset where all the input files conform to the same XSD).

Configuration Files

  • File Name: {Data Type}-config.xml
    • The only requirement for the file name is that it must end with “-config.xml”
    • Example file: myjson-ingest-config.xml
  • Edge definitions for the data type, if any, should be defined in a distinct, global config file

Properties

In practice, the settings available to a given data type may originate from any number of specialized classes throughout the ingest API, each class establishing its own set of configurable behaviors for various ingest-related purposes. Thus, the properties below are a relatively small subset of all those possible, but they represent core settings that will be common across most, if not all, of your data types.

Data Type Properties

  • data.name – This value is effectively an identifier for both the data type and its associated data feed. Therefore, the value must be unique across all data type configs.

    Unless a (data.name).output.name value is specified, this value will also be used as the Data Type (DT) identifier (and Column Family prefix) for all associated objects in the primary data table. If so, then it may be leveraged by query clients for data filtering purposes.

    Note that '(data.name).' must be used as a prefix for (most of) the data type's remaining property names.

  • (data.name).output.name – This value will be used to identify the Data Type (DT) and to establish the Column Family prefix for all associated data objects in Accumulo. Thus, query clients may leverage this value for data filtering.

    Unlike data.name, this value does not have to be unique across all configs.

    For example, we might find later on that there is enrichment info related to our original data feed that we'd like to incorporate. Rather than modify the original data feed and its config, we may opt to establish a new feed with its own distinct config. If so, we may find it beneficial to reuse the *.output.name value in the new feed's configuration.

    Using the same output name for the new feed allows its data objects in Accumulo to be merged into and collocated with the corresponding data objects from the original feed, provided that both are utilizing compatible sharding and UID creation strategies for their respective objects.

  • (data.name).data.category.date – A known field name within the data to be used, if present, for the given object's shard row date (a.k.a. "event date"), thus affecting the object's partition assignment within the primary data table.

  • (data.name).data.category.date.formats – Known/valid date format(s) for the field identified by (data.name).data.category.date. Comma-delimited, if more than one.

    Examples: yyyy-MM-dd, yyyy-MM-dd'T'HH:mm:ss'Z', yyyy-MM-dd HH:mm:ss

  • file.input.format – Fully-qualified class name of the Hadoop MapReduce InputFormat implementation.

  • (data.name).reader.class – Fully-qualified name of the class implementing Hadoop MapReduce RecordReader and extending datawave.ingest.input.reader.EventRecordReader. As such, this class presents raw data objects (in the form of datawave.ingest.data.RawRecordContainer) as input to the DataWave Ingest mapper, datawave.ingest.mapreduce.EventMapper.

  • (data.name).ingest.helper.class – Fully-qualified name of the class implementing datawave.ingest.data.config.ingest.IngestHelperInterface, used for parsing/extracting field name/value pairs from a single raw data object.

  • (data.name).handler.classes – Comma-delimited list of classes that will process each data object in order to produce Accumulo key/value pairs in accordance with DataWave's data model. These classes implement datawave.ingest.mapreduce.handler.DataTypeHandler.

    Typically, a data type will configure at least one concrete class here that is derived from datawave.ingest.mapreduce.handler.shard.ShardedDataTypeHandler, which is a specialized DataTypeHandler abstraction tailored for ingest into the DataWave data model.

  • (data.name).data.category.index – Comma-delimited list of field names that we want to have forward indexed in order to make them searchable via the query API.

  • (data.name).data.category.index.reverse – Comma-delimited list of field names that we want to have reverse indexed in order to make them searchable via leading wildcards.

  • (data.name).data.category.marking.default – The default behavior of DataWave is to interpret this value as the exact Accumulo visibility expression to be applied to each object and data field during ingest. This is due to DataWave's default MarkingsHelper implementation, datawave.ingest.data.config.MarkingsHelper.NoOp.

    Example value: PRIVATE|(BAR&FOO)

    Thus, security marking behavior is API-driven and may be overridden as needed by implementing a specialized datawave.ingest.data.config.MarkingsHelper, which can then be injected at runtime via the datawave.ingest.config.IngestConfigurationFactory service loader.

  • (data.name).(FieldName).data.field.marking – This property may be used to apply distinct security markings to specific fields as needed, overriding the (data.name).data.category.marking.default property for the given field. That is, the configured value here will be used to assign the appropriate security marking to the "FieldName" field.

  • (data.name).(FieldName).data.field.type.class – Fully-qualified class name of the DataWave type to be used to interpret and normalize "FieldName" values. Example types are datawave.data.type.DateType, datawave.data.type.NumberType, datawave.data.type.GeoType, etc.
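
To tie these properties together, below is a minimal sketch of what a {Data Type}-config.xml might look like for a hypothetical "myjson" feed. It assumes the standard Hadoop configuration XML layout (name/value property elements); the com.example.* class names, field names, and values are placeholders only and must be replaced with your own implementations and settings.

    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
        <!-- Unique identifier for this data type and its feed -->
        <property>
            <name>data.name</name>
            <value>myjson</value>
        </property>
        <!-- Optional: DT identifier / Column Family prefix in Accumulo (may be shared across feeds) -->
        <property>
            <name>myjson.output.name</name>
            <value>myjson</value>
        </property>
        <!-- Field supplying the shard row ("event") date, and its accepted formats -->
        <property>
            <name>myjson.data.category.date</name>
            <value>EVENT_DATE</value>
        </property>
        <property>
            <name>myjson.data.category.date.formats</name>
            <value>yyyy-MM-dd,yyyy-MM-dd'T'HH:mm:ss'Z'</value>
        </property>
        <!-- Placeholder classes: substitute your own InputFormat, RecordReader, IngestHelper, and handler(s) -->
        <property>
            <name>file.input.format</name>
            <value>com.example.ingest.MyJsonInputFormat</value>
        </property>
        <property>
            <name>myjson.reader.class</name>
            <value>com.example.ingest.MyJsonRecordReader</value>
        </property>
        <property>
            <name>myjson.ingest.helper.class</name>
            <value>com.example.ingest.MyJsonIngestHelper</value>
        </property>
        <property>
            <name>myjson.handler.classes</name>
            <value>com.example.ingest.MyJsonShardedDataTypeHandler</value>
        </property>
        <!-- Indexed fields and default security marking -->
        <property>
            <name>myjson.data.category.index</name>
            <value>EVENT_DATE,USER_NAME</value>
        </property>
        <property>
            <name>myjson.data.category.index.reverse</name>
            <value>USER_NAME</value>
        </property>
        <property>
            <name>myjson.data.category.marking.default</name>
            <value>PRIVATE|(BAR&amp;FOO)</value>
        </property>
    </configuration>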

Flag Maker Configuration

Configuration Files

File Name: flag-maker-{Flow Name}.xml

This file contains configuration settings for a single Flag Maker process and its associated data types. The file name format above is only a recommendation; the name itself is unimportant and can be whatever you’d like.

Examples in the DataWave project include two Flag Maker configs and two sets of accompanying bash scripts. These demonstrate bulk ingest and live ingest data flows respectively. However, new configs and scripts can be created as needed. Generally speaking, there is no upper bound on the number of Flag Maker processes that DataWave Ingest can support.

Scripts

  • {Flow Name}-ingest-server.sh – regulates the number of running jobs and existing marker files for the flow, calling {Flow Name}-execute.sh if more jobs can be supported
  • {Flow Name}-execute.sh – runs the {Flow Name}-ingest.sh command from the first line of the flag file
  • {Flow Name}-ingest.sh – starts the MapReduce job

Classes and Interfaces

  • FlagMaker.java
  • FlagMakerConfig.java
  • FlagDataTypeConfig.java
  • FlagDistributor.java

Properties

Flag Maker Instance Properties

  • baseHDFSDir – Base HDFS directory
  • datawaveHome – /path/to/datawave-ingest/current
  • distributorType – One of "simple", "date", or "folderdate". See the SimpleFlagDistributor, DateFlagDistributor, and DateFolderFlagDistributor classes, respectively
  • filePattern – Regex of files to be added to the file list (ignore “.” files, etc)
  • flagFileDirectory – Local directory on the ingest master host in which to put flag files
  • hdfs – Ingest namenode URI. E.g., hdfs://ingest.namenode.host:9000
  • setFlagFileTimestamp – If set to true, then the timestamp on flag files will be set to the last timestamp of the file contained therein
  • sleepMilliSecs – Wait this long before making another flag file. Defaults to 15 seconds
  • socketPort – Port on which this flag maker will listen for a shutdown command
  • timeoutMilliSecs – Stop appending to the input file list after this time. Defaults to 5 minutes. That is, if there is any data to be processed, then a flag file must be created within this time period regardless of other considerations
  • useFolderTimestamp – If set to true, use the folder timestamp instead of the actual file timestamp

Flag Maker Data Type Properties

  • dataName – Unique data.name identifier for the registered data type
  • distributionArgs – Allows arguments to be passed to the FlagDistributor instance. Defaults to "none"
  • extraIngestArgs – Extra arguments to pass to the ingest process
  • fileListMarker – Marker to aid flag file parsing. Denotes that a list of input files for the MR job will follow immediately, one input file per line
  • folder – Folder under baseHDFSDir in which to look for files of type dataName. The folder is treated as an absolute path if it begins with a slash, otherwise it is relative to the HDFS base dir
  • ingestPool – Used in the naming of flag files and may also identify the YARN scheduler queue to use
  • inputFormat – Input format to use for the job. Defaults to datawave.ingest.input.reader.event.EventSequenceFileInputFormat
  • lifo – Should we process the data LIFO (true) or FIFO (false)? Defaults to false, i.e., FIFO. This is based on the file date, within a bucket
  • maxFlags – Maximum number of blocks (mappers) per job. Allows you to override the maximum files/mappers/blocks for the given data type
  • reducers – Number of reducers to use for ingest jobs
  • script – The script used to launch the ingest job. Forms the basis of the command that will be written to the flag file
  • timeoutMilliSecs – Overrides the parent flag maker's timeoutMilliSecs setting
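
As a rough illustration, a flag-maker-{Flow Name}.xml might look something like the sketch below, in which the XML element names simply mirror the property names from the two tables above and each registered data type gets its own flagCfg block. The exact element names and nesting are dictated by the FlagMakerConfig and FlagDataTypeConfig classes, and all paths, hosts, ports, and the "myjson" data type shown here are placeholders, so treat the example configs shipped with DataWave as the authoritative reference.

    <flagMakerConfig>
        <!-- Flag Maker instance settings (see Flag Maker Instance Properties) -->
        <hdfs>hdfs://ingest.namenode.host:9000</hdfs>
        <datawaveHome>/path/to/datawave-ingest/current</datawaveHome>
        <baseHDFSDir>/data/Input</baseHDFSDir>
        <flagFileDirectory>/srv/data/datawave/flags</flagFileDirectory>
        <filePattern>.*\.json</filePattern>
        <distributorType>simple</distributorType>
        <socketPort>22222</socketPort>
        <timeoutMilliSecs>300000</timeoutMilliSecs>
        <sleepMilliSecs>15000</sleepMilliSecs>
        <setFlagFileTimestamp>true</setFlagFileTimestamp>

        <!-- One block per registered data type (see Flag Maker Data Type Properties) -->
        <flagCfg>
            <dataName>myjson</dataName>
            <folder>myjson</folder>
            <ingestPool>live</ingestPool>
            <inputFormat>datawave.ingest.input.reader.event.EventSequenceFileInputFormat</inputFormat>
            <maxFlags>10</maxFlags>
            <reducers>10</reducers>
            <lifo>false</lifo>
            <script>bin/ingest/live-ingest.sh</script>
        </flagCfg>
    </flagMakerConfig>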

Bulk Loader Configuration

Usage

Java class: datawave.ingest.mapreduce.job.BulkIngestMapFileLoader

*.BulkIngestMapFileLoader hdfsWorkDir jobDirPattern instanceName zooKeepers username password \
   [-sleepTime sleepTime] \
   [-majcThreshold threshold] \
   [-majcCheckInterval count] \
   [-majcDelay majcDelay] \
   [-seqFileHdfs seqFileSystemUri] \
   [-srcHdfs srcFileSystemURI] \
   [-destHdfs destFileSystemURI] \
   [-jt jobTracker] \
   [-shutdownPort portNum] \
   confFile [{confFile}]

Properties

Bulk Loader Arguments

  • confFile – One or more *-config.xml files, listed at the end of the command line
  • destHdfs – Destination file system URI (Warehouse)
  • hdfsWorkDir – Directory in HDFS to watch
  • instanceName – Accumulo instance name
  • jobDirPattern – Pattern for dirs in hdfsWorkDir to check for the complete file marker
  • jt – Job tracker node
  • majcCheckInterval – Number of bulk loads to process before rechecking majcThreshold and majcDelay
  • majcDelay – Amount of time (ms) to wait between bringing map files online
  • majcThreshold – Max number of compactions allowed before waiting
  • maxDirectories – Max number of directories
  • numAssignThreads – Number of bulk import assignment threads (default 4)
  • numHdfsThreads – Number of threads to use for concurrent HDFS operations (default 1)
  • numThreads – Number of bulk import threads (default 8)
  • password – Accumulo user password
  • seqFileHdfs – Sequence file system URI (Ingest)
  • shutdownPort – Shutdown port
  • sleepTime – Amount of time (ms) to sleep between checks for map files
  • srcHdfs – Source file system URI (Warehouse)
  • username – Accumulo username
  • zooKeepers – ZooKeeper hosts for the Accumulo instance
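
For reference, a hypothetical invocation following the usage synopsis above might look like the following. All host names, paths, and credentials here are placeholders, and since BulkIngestMapFileLoader is a Java main class it must be launched with the DataWave Ingest classpath available.

    datawave.ingest.mapreduce.job.BulkIngestMapFileLoader \
       /data/BulkIngest "job_*" my-instance zoo1:2181,zoo2:2181,zoo3:2181 ingest-user ingest-pass \
       -sleepTime 30000 \
       -majcThreshold 32 \
       -majcCheckInterval 5 \
       -majcDelay 60000 \
       -srcHdfs hdfs://warehouse.namenode.host:9000 \
       -destHdfs hdfs://warehouse.namenode.host:9000 \
       -shutdownPort 24100 \
       /path/to/myjson-ingest-config.xml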