## Data Type Configuration
Here, Data Type denotes a distinct flow of data into the system in which all the raw input arrives in the same binary format and conforms to some well-defined information schema (e.g., an XML dataset where all the input files conform to the same XSD).
### Configuration Files

- File Name: {Data Type}-config.xml (a minimal skeleton is sketched after this list)
  - The only requirement for the file name is that it must end with “-config.xml”
  - Example file: myjson-ingest-config.xml
- Edge definitions for the data type, if any, should be defined in a distinct, global config file
  - Example file: edge-definitions.xml
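A data type config is a Hadoop-style configuration XML file. The following is a minimal sketch for the hypothetical myjson feed named above; the com.example helper class is a placeholder, and the available property names are covered in the Properties section below.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<configuration>

    <!-- Unique identifier for this data type and its feed -->
    <property>
        <name>data.name</name>
        <value>myjson</value>
    </property>

    <!-- Most remaining properties are prefixed with the data.name value -->
    <property>
        <name>myjson.ingest.helper.class</name>
        <!-- Hypothetical IngestHelperInterface implementation for this feed -->
        <value>com.example.ingest.MyJsonIngestHelper</value>
    </property>

</configuration>
```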
### Properties
In practice, the settings available to a given data type may originate from any number of specialized classes throughout the ingest API, each class establishing its own set of configurable behaviors for various ingest-related purposes. Thus, the properties below are a relatively small subset of all those possible, but they represent core settings that will be common across most, if not all, of your data types.
| Property Name | Description |
|---|---|
| data.name | This value is effectively an identifier for both the data type and its associated data feed. Therefore, the value must be unique across all data type configs. Unless a (data.name).output.name value is specified, this value will also be used as the Data Type (DT) identifier (and Column Family prefix) for all associated objects in the primary data table. If so, then it may be leveraged by query clients for data filtering purposes. Note that "(data.name)." must be used as a prefix for (most of) the data type's remaining property names |
| (data.name).output.name | This value will be used to identify the Data Type (DT) and to establish the Column Family prefix for all associated data objects in Accumulo. Thus, query clients may leverage this value for data filtering. Unlike data.name, this value does not have to be unique across all configs. For example, we might find later on that there is enrichment info related to our original data feed that we'd like to incorporate. Rather than modify the original data feed and its config, we may opt to establish a new feed with its own distinct config. If so, we may find it beneficial to reuse the *.output.name value in the new feed's configuration. Using the same output name for the new feed allows its data objects in Accumulo to be merged into and collocated with the corresponding data objects from the original feed, provided that both are utilizing compatible sharding and UID creation strategies for their respective objects |
| (data.name).data.category.date | A known field name within the data to be used, if present, for the given object's shard row date (a.k.a. "event date"), thus affecting the object's partition assignment within the primary data table |
| (data.name).data.category.date.formats | Known/valid date format(s) for the field identified by (data.name).data.category.date. Comma-delimited, if more than one. Examples: yyyy-MM-dd, yyyy-MM-dd'T'HH:mm:ss'Z', yyyy-MM-dd HH:mm:ss |
| file.input.format | Fully-qualified class name of the Hadoop MapReduce InputFormat implementation to use |
| (data.name).reader.class | Fully-qualified name of the class implementing Hadoop MapReduce RecordReader and extending datawave.ingest.input.reader.EventRecordReader. As such, this class presents raw data objects (in the form of datawave.ingest.data.RawRecordContainer) as input to the DataWave Ingest mapper, datawave.ingest.mapreduce.EventMapper |
| (data.name).ingest.helper.class | Fully-qualified name of a class implementing datawave.ingest.data.config.ingest.IngestHelperInterface, for parsing/extracting field name/value pairs from a single raw data object |
| (data.name).handler.classes | Comma-delimited list of classes that will process each data object in order to produce Accumulo key/value pairs in accordance with DataWave's data model. These classes implement datawave.ingest.mapreduce.handler.DataTypeHandler. Typically, a data type will configure at least one concrete class here that is derived from datawave.ingest.mapreduce.handler.shard.ShardedDataTypeHandler, a specialized DataTypeHandler abstraction tailored for ingest into the DataWave data model |
| (data.name).data.category.index | Comma-delimited list of field names that we want to have forward indexed in order to make them searchable via the query API |
| (data.name).data.category.index.reverse | Comma-delimited list of field names that we want to have reverse indexed in order to make them searchable via leading wildcards |
| (data.name).data.category.marking.default | The default behavior of DataWave is to interpret this value as the exact Accumulo visibility expression to be applied to each object and data field during ingest. This is due to DataWave's default MarkingsHelper implementation, datawave.ingest.data.config.MarkingsHelper.NoOp. Example value: PRIVATE\|(BAR&FOO). Thus, security marking behavior is API-driven and may be overridden as needed by implementing a specialized datawave.ingest.data.config.MarkingsHelper, which can then be injected at runtime via the datawave.ingest.config.IngestConfigurationFactory service loader |
| (data.name).(FieldName).data.field.marking | This property may be used to apply distinct security markings to specific fields as needed, overriding the (data.name).data.category.marking.default property for the given field. That is, the configured value here will be used to assign the appropriate security marking to the "FieldName" field |
| (data.name).(FieldName).data.field.type.class | Fully-qualified class name of the DataWave type to be used to interpret and normalize "FieldName" values. Example types are datawave.data.type.DateType, datawave.data.type.NumberType, datawave.data.type.GeoType, etc. |
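To illustrate the field-specific properties above, here is a hedged sketch for the same hypothetical myjson feed, where EVENT_COUNT and SENSITIVE_FIELD are illustrative field names:

```xml
<!-- Normalize EVENT_COUNT values numerically (field name is illustrative) -->
<property>
    <name>myjson.EVENT_COUNT.data.field.type.class</name>
    <value>datawave.data.type.NumberType</value>
</property>

<!-- Override the default marking for one field (field name and expression are illustrative) -->
<property>
    <name>myjson.SENSITIVE_FIELD.data.field.marking</name>
    <value>PRIVATE</value>
</property>
```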
## Flag Maker Configuration

### Configuration Files

- File Name: flag-maker-{Flow Name}.xml
  - This file contains configuration settings for a single Flag Maker process and its associated data types
  - The above file name format is only a recommendation; the file name itself is unimportant and can be whatever you'd like

Examples in the DataWave project include two Flag Maker configs and two sets of accompanying bash scripts, demonstrating the bulk and live ingest data flows respectively. However, new configs and scripts can be created as needed. Generally speaking, there is no upper bound on the number of Flag Maker processes that DataWave Ingest can support.
### Scripts

- {Flow Name}-ingest-server.sh – monitors the number of running jobs and the existing marker files for the flow, and calls {Flow Name}-execute.sh if more jobs can be supported
- {Flow Name}-execute.sh – runs the {Flow Name}-ingest.sh command from the first line in the flag file
- {Flow Name}-ingest.sh – starts the MapReduce job
### Classes and Interfaces
- FlagMaker.java
- FlagMakerConfig.java
- FlagDataTypeConfig.java
- FlagDistributor.java
### Properties
The following properties configure the Flag Maker process itself (see FlagMakerConfig):

| Property Name | Description |
|---|---|
| baseHDFSDir | Base HDFS directory under which the Flag Maker looks for input files |
| datawaveHome | DataWave Ingest installation directory, e.g., /path/to/datawave-ingest/current |
| distributorType | One of "simple", "date", or "folderdate". See the SimpleFlagDistributor, DateFlagDistributor, and DateFolderFlagDistributor classes respectively |
| filePattern | Regex for the files to be added to the file list (ignore “.” files, etc.) |
| flagFileDirectory | Local directory on the ingest master host in which to put flag files |
| hdfs | Ingest NameNode URI, e.g., hdfs://ingest.namenode.host:9000 |
| setFlagFileTimestamp | If set to true, then the timestamp on each flag file will be set to the timestamp of the last file contained therein |
| sleepMilliSecs | Wait this long (ms) before making another flag file. Defaults to 15 seconds |
| socketPort | Port on which this Flag Maker will listen for a shutdown command |
| timeoutMilliSecs | Stop appending to the input file list after this time (ms). Defaults to 5 minutes. That is, if there is any data to be processed, then a flag file must be created within this time period regardless of other considerations |
| useFolderTimestamp | If set to true, use the folder date for the file timestamp instead of the actual file timestamp |
The following properties are configured per registered data type (see FlagDataTypeConfig):

| Property Name | Description |
|---|---|
| dataName | Unique data.name identifier of the registered data type (see Data Type Configuration above) |
| distributionArgs | Allows arguments to be passed to the FlagDistributor instance. Defaults to "none" |
| extraIngestArgs | Extra arguments to pass to the ingest process |
| fileListMarker | Marker to aid flag file parsing. Denotes that a list of input files for the MR job will follow immediately, one input file per line |
| folder | Folder under baseHDFSDir in which to look for files of type dataName. The folder is treated as an absolute path if it leads with a slash; otherwise it is relative to the HDFS base dir |
| ingestPool | Used in the naming of flag files; may also identify the YARN scheduler queue to use |
| inputFormat | Input format to use for the job. Defaults to datawave.ingest.input.reader.event.EventSequenceFileInputFormat |
| lifo | Should we process the data LIFO (true) or FIFO (false)? Defaults to false, i.e., FIFO. Ordering is based on the file date, within a bucket |
| maxFlags | Maximum number of blocks (mappers) per job. Allows you to override the maximum files/mappers/blocks for the given data type |
| reducers | Number of reducers to use for ingest jobs |
| script | The script used to launch the ingest job. Forms the basis of the command that will be written to the flag file |
| timeoutMilliSecs | Overrides the parent Flag Maker's timeoutMilliSecs setting |
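Tying the two tables together, here is a minimal sketch of a Flag Maker config. It assumes, as in the example configs shipped with the DataWave project, that the XML element names mirror the property names above and that each data type's settings are nested in their own flagCfg element; all paths and values are illustrative only.

```xml
<flagMakerConfig>

    <!-- Process-level settings (first table above) -->
    <baseHDFSDir>/data/myflow</baseHDFSDir>
    <datawaveHome>/opt/datawave-ingest/current</datawaveHome>
    <distributorType>simple</distributorType>
    <flagFileDirectory>/srv/data/flags</flagFileDirectory>
    <hdfs>hdfs://ingest.namenode.host:9000</hdfs>
    <socketPort>22222</socketPort>

    <!-- Per-data-type settings (second table above); one flagCfg per registered type -->
    <flagCfg>
        <dataName>myjson</dataName>
        <folder>myjson</folder>
        <ingestPool>bulk</ingestPool>
        <maxFlags>10</maxFlags>
        <reducers>10</reducers>
        <script>bin/ingest/bulk-ingest.sh</script>
    </flagCfg>

</flagMakerConfig>
```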
## Bulk Loader Configuration

### Usage
Java class: datawave.ingest.mapreduce.job.BulkIngestMapFileLoader
```
*.BulkIngestMapFileLoader hdfsWorkDir jobDirPattern instanceName zooKeepers username password \
    [-sleepTime sleepTime] \
    [-majcThreshold threshold] \
    [-majcCheckInterval count] \
    [-majcDelay majcDelay] \
    [-seqFileHdfs seqFileSystemUri] \
    [-srcHdfs srcFileSystemURI] \
    [-destHdfs destFileSystemURI] \
    [-jt jobTracker] \
    [-shutdownPort portNum] \
    confFile [{confFile}]
```
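For example, a hypothetical invocation against a small warehouse cluster might look like the following; the class path variable, host names, credentials, and option values are all placeholders:

```bash
java -cp "$DATAWAVE_INGEST_CLASSPATH" datawave.ingest.mapreduce.job.BulkIngestMapFileLoader \
    /data/BulkIngest 'job_*' warehouse-instance zoo1:2181,zoo2:2181 ingest-user ingest-passwd \
    -sleepTime 5000 \
    -majcThreshold 32 \
    -srcHdfs hdfs://warehouse.namenode.host:9000 \
    -destHdfs hdfs://warehouse.namenode.host:9000 \
    myjson-ingest-config.xml edge-definitions.xml
```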
### Properties
| Property Name | Description |
|---|---|
| confFile | One or more *-config.xml data type config files (see Data Type Configuration above) |
| destHdfs | Destination file system URI (Warehouse) |
| hdfsWorkDir | Directory in HDFS to watch for completed job directories |
| instanceName | Accumulo instance name |
| jobDirPattern | Pattern for dirs in hdfsWorkDir to check for the complete file marker |
| jt | Job tracker node |
| majcCheckInterval | Number of bulk loads to process before rechecking majcThreshold and majcDelay |
| majcDelay | Amount of time (ms) to wait between bringing map files online |
| majcThreshold | Max number of major compactions allowed before waiting |
| maxDirectories | Max number of directories |
| numAssignThreads | Number of bulk import assignment threads (default 4) |
| numHdfsThreads | Number of threads to use for concurrent HDFS operations (default 1) |
| numThreads | Number of bulk import threads (default 8) |
| password | Accumulo password |
| seqFileHdfs | Sequence file system URI (Ingest) |
| shutdownPort | Port on which the loader listens for a shutdown command |
| sleepTime | Amount of time (ms) to sleep between checks for map files |
| srcHdfs | Source file system URI (Warehouse) |
| username | Accumulo username |
| zooKeepers | Comma-delimited list of ZooKeeper host:port pairs |