DataWave utilizes the Accumulo table schemas described below as the basis for its ingest and query components.

Primary Data Table

The primary data table uses a sharded approach and can be described as an intra-day hash partitioned table where fields in a data object are stored collocated in a single partition. The Shard ID is a function of the UID and therefore should be reproducible given the same object ingested at different points in time. This enables de-duplication of objects when they are re-ingested. The Data Type is a user-defined category of the data that will typically be specified at query time, which allows for further reduction in the amount of data to be searched.
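As a rough illustration of why a UID-derived Shard ID is reproducible, the sketch below derives a YYYYMMDD_N identifier from an object's date and UID. The function name `shard_id`, the `num_shards` parameter, and the use of CRC32 as the hash are assumptions made for this example; they are not DataWave's actual partitioning function.

```python
import zlib
from datetime import date

def shard_id(event_date: date, uid: str, num_shards: int = 10) -> str:
    """Return a YYYYMMDD_N shard ID (hypothetical hashing scheme)."""
    # CRC32 is deterministic across runs, unlike Python's built-in hash() for str.
    partition = zlib.crc32(uid.encode("utf-8")) % num_shards
    return f"{event_date:%Y%m%d}_{partition}"

# Ingesting the same object twice yields the same shard, so the second
# write lands on the same partition and can be de-duplicated there.
first = shard_id(date(2024, 1, 15), "example-uid-1")
second = shard_id(date(2024, 1, 15), "example-uid-1")
print(first, first == second)
```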

The primary data table also contains an in-partition index, which we call the Field Index, and we denote the K,V pairs that are in the field index with a leading ‘fi’ in the column family. The field index is used by custom Accumulo iterators at query time to find data objects in the partition.

Optionally, if the table is used to store documents, then the original document or different views of the document can be stored in the ‘d’ column family. Typically this column family would be set up as its own locality group. An example of different views of a document could be .txt and .html versions of the original document.
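The Terms and Definitions table describes Document Content (DC) as gzip compressed, then base64 encoded. A minimal sketch of that round trip follows; the function names are hypothetical and the sample document is invented.

```python
import base64
import gzip

def encode_document_content(raw: bytes) -> bytes:
    """Gzip-compress the raw view, then base64-encode the result."""
    return base64.b64encode(gzip.compress(raw))

def decode_document_content(value: bytes) -> bytes:
    """Reverse the encoding: base64-decode, then gunzip."""
    return gzip.decompress(base64.b64decode(value))

html_view = b"<html><body>Hello, DataWave</body></html>"
stored = encode_document_content(html_view)
print(decode_document_content(stored) == html_view)  # True
```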

To enable phrase queries on documents, the ‘tf’ column family contains a protocol buffer (PB) in the value that is a list of word offsets for the term in the document. This column family could also be stored in a separate locality group.
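To see why per-term word offsets are enough to answer a phrase query, consider the following sketch. It models the offsets that would be carried in the tf values as plain Python lists. The `term_offsets` dictionary and the `contains_phrase` function are illustrative names, and decoding of the protocol buffer is omitted.

```python
from typing import Dict, List

def contains_phrase(term_offsets: Dict[str, List[int]], phrase: List[str]) -> bool:
    """Return True if the terms occur at consecutive word offsets."""
    if not phrase or any(term not in term_offsets for term in phrase):
        return False
    # Offsets of every later term, as sets for fast membership checks.
    later = [set(term_offsets[term]) for term in phrase[1:]]
    # The phrase matches if, for some start offset of the first term,
    # the i-th following term appears exactly i positions later.
    return any(
        all(start + i + 1 in offsets for i, offsets in enumerate(later))
        for start in term_offsets[phrase[0]]
    )

# Word offsets for the document "the quick brown fox jumps over the lazy dog".
offsets = {"the": [0, 6], "quick": [1], "brown": [2], "fox": [3], "lazy": [7]}
print(contains_phrase(offsets, ["quick", "brown", "fox"]))  # True
print(contains_phrase(offsets, ["the", "lazy"]))            # True
print(contains_phrase(offsets, ["brown", "quick"]))         # False
```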

Primary Data Table Layout
| Row | Column Family | Column Qualifier | Value | Purpose |
| --- | --- | --- | --- | --- |
| SHARD | DT + NB + UID | NFN + NB + FV | NULL | Represents a name/value pair for a given field (NFN) within the Data Object (DO) |
| SHARD | fi + NB + NFN | NFV + NB + DT + NB + UID | NULL | The fi (field index) column provides an in-partition index of field names and associated values for Data Objects that exist within the given shard |
| SHARD | d | DT + NB + UID + NB + DVN | DC | (Optional) The d (document) column provides a named view of raw content, typically the raw source input that generated the Data Object |
| SHARD | tf | DT + NB + UID + NT + NFN | TFPB | (Optional) The tf (term frequency) column enables phrase queries by mapping a Data Object's tokenized terms to the positions (word offsets) of those terms within the source document |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.
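The layout above can be made concrete by assembling the key components for one field of one Data Object. This sketch uses plain tuples rather than Accumulo's key class, and the sample shard, data type, UID, and field values are invented for illustration.

```python
NB = "\x00"  # null byte delimiter between key components

def event_key(shard, dt, uid, nfn, fv):
    """(row, column family, column qualifier) for a name/value entry."""
    return (shard, f"{dt}{NB}{uid}", f"{nfn}{NB}{fv}")

def field_index_key(shard, dt, uid, nfn, nfv):
    """(row, column family, column qualifier) for an 'fi' entry."""
    return (shard, f"fi{NB}{nfn}", f"{nfv}{NB}{dt}{NB}{uid}")

def document_key(shard, dt, uid, dvn):
    """(row, column family, column qualifier) for a 'd' entry."""
    return (shard, "d", f"{dt}{NB}{uid}{NB}{dvn}")

# Hypothetical sample values.
shard, dt, uid = "20240115_3", "csv", "uid-0001"
for key in (
    event_key(shard, dt, uid, "COLOR", "Red"),
    field_index_key(shard, dt, uid, "COLOR", "red"),
    document_key(shard, dt, uid, "html"),
):
    # Show the null bytes as a visible escape sequence.
    print([part.replace(NB, "\\x00") for part in key])
```

All three keys share the same row, which is what keeps every entry for a Data Object in a single partition.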

Global Index Tables

The forward and reverse index tables serve as global indexes mapping terms to partitions. The index maps an NFN:NFV pair to a category of data within the partitions of the primary data table. The Uid.List Protocol Buffer (ULPB) object contains the number of occurrences of the NFN:NFV pair in a category of data in the partition. Additionally, the ULPB may contain the UIDs of the objects that contain the NFN:NFV pair. We say “may contain” because there is an upper limit on the number of UIDs in the ULPB.
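The “may contain” behavior can be modeled as a counter paired with a capped UID list. The class below is a simplified stand-in for the ULPB, not its actual protocol buffer definition. The `max_uids` limit of 3 and the choice to drop the whole UID list once the limit is exceeded are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class UidListSketch:
    """Simplified stand-in for a Uid.List value (hypothetical structure)."""
    max_uids: int = 3
    count: int = 0
    # None means the UID list was dropped because the limit was exceeded.
    uids: Optional[List[str]] = field(default_factory=list)

    def add(self, uid: str) -> None:
        self.count += 1  # the count is always maintained
        if self.uids is None:
            return
        if len(self.uids) < self.max_uids:
            self.uids.append(uid)
        else:
            self.uids = None  # too many UIDs: keep only the count

entry = UidListSketch()
for n in range(5):
    entry.add(f"uid-{n}")
print(entry.count, entry.uids)  # 5 None
```

When the UID list is absent, a query can still use the count and the shard in the column qualifier to decide which partition to scan.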

Global Forward Index Table Layout
| Row | Column Family | Column Qualifier | Value | Purpose |
| --- | --- | --- | --- | --- |
| NFV | NFN | SHARD + NB + DT | ULPB | Maps the NFV:NFN pair to a given shard partition within the Primary Data Table (PDT), and optionally to a bounded list of Data Object UIDs within that partition |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.

NFVs that are indexed within the global reverse index table can be searched using leading wildcards. To make this possible, the index is created by simply reversing the characters in the NFV, which turns a leading-wildcard match into a prefix match on the reversed value.

Global Reverse Index Table Layout
| Row | Column Family | Column Qualifier | Value | Purpose |
| --- | --- | --- | --- | --- |
| RNFV | NFN | SHARD + NB + DT | ULPB | Maps the RNFV:NFN pair to a given shard partition within the Primary Data Table (PDT), and optionally to a bounded list of Data Object UIDs within that partition |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.
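The following sketch shows how reversing values lets a leading-wildcard pattern be answered with a prefix scan over sorted rows. It uses an in-memory sorted list in place of an Accumulo table, and the function names and sample values are hypothetical.

```python
from bisect import bisect_left
from typing import List

def reverse(value: str) -> str:
    return value[::-1]

def leading_wildcard_lookup(sorted_reversed_rows: List[str], pattern: str) -> List[str]:
    """Resolve a pattern such as '*berry' with a prefix scan over reversed rows."""
    if not pattern.startswith("*"):
        raise ValueError("expected a leading wildcard")
    prefix = reverse(pattern[1:])          # '*berry' -> 'yrreb'
    start = bisect_left(sorted_reversed_rows, prefix)
    matches = []
    for row in sorted_reversed_rows[start:]:
        if not row.startswith(prefix):
            break                          # sorted order: no later row can match
        matches.append(reverse(row))       # restore the original NFV
    return matches

values = ["blueberry", "cherry", "strawberry", "banana"]
rows = sorted(reverse(v) for v in values)  # row order of the reverse index
print(leading_wildcard_lookup(rows, "*berry"))  # ['blueberry', 'strawberry']
```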

Data Dictionary Table

The data dictionary table contains metadata about the data stored in other tables and is used primarily for query planning purposes. For example, this includes information about whether a particular field is indexed, about specific type-normalization performed on a field’s values, etc. The structure of the table is as follows.

Data Dictionary Table Layout
| Row | Column Family | Column Qualifier | Value | Purpose |
| --- | --- | --- | --- | --- |
| NFN | e | DT | NULL | The e column family denotes that NFN is a field that exists within objects of the given Data Type (DT) within the Primary Data Table (PDT). Note: The absence of an e entry for a given NFN and the presence of i and/or ri entries for the NFN denotes that it is an index-only field, meaning that no such field exists within the PDT's Data Objects for the given DT |
| NFN | i | DT + NB + YYYYMMDD | Integer Count | The i column family denotes that NFN exists in both the Global Forward Index (GFIDX) table and in the Primary Data Table's (PDT) fi column. The value denotes that 'Integer Count' instances of the NFN exist for the given date. Note: The absence of an e entry for a given NFN and the presence of i and/or ri entries for the NFN denotes that it is an index-only field, meaning that no such field exists within the PDT's Data Objects for the given DT |
| NFN | ri | DT + NB + YYYYMMDD | Integer Count | The ri column family denotes that NFN exists in the Global Reverse Index (GRIDX) table, and the value denotes that 'Integer Count' instances of the NFN exist for the given date |
| NFN | f | DT + NB + YYYYMMDD | Integer Count | Similar to the i column, the f (frequency) column family conveys the 'Integer Count' times that instances of NFN were seen on the given date. These entries are recorded for non-indexed fields as well |
| NFN | t | DT + NB + NCN | NULL | The t (type) column denotes that values of the given field are normalized for indexing purposes using the specified Java class |
| NFN | desc | DT | Text Description | The desc column family may be used to supply a text description of the NFN within type DT |
| NFN | tf | DT | NULL | The tf column family denotes that NFN is enabled for phrase queries, i.e., that it has tf (term frequency) column entries in the Primary Data Table (PDT) |
| ETYPE + / + EREL | edge | EAV[0] | EPB0 | The edge column family denotes that one or more Edge Table keys were formed using the given Row and CQ identifiers and were derived from field names in the Primary Data Table (PDT) that are given in the Protocol Buffer value. These dictionary entries allow an Edge Table key to be mapped back to the source Data Object from which it was derived (e.g., via the EdgeEventQuery query logic) |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.
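The index-only rule noted in the e and i rows can be expressed as a small set computation. The sketch below assumes the dictionary entries have already been read and reduced to (NFN, column family, DT) tuples, with the DT extracted from the column qualifier. The function name and sample entries are invented for illustration.

```python
from typing import Iterable, Set, Tuple

# Each entry is (row NFN, column family, data type); sample data is invented.
Entry = Tuple[str, str, str]

def index_only_fields(entries: Iterable[Entry], dt: str) -> Set[str]:
    """Fields with i and/or ri entries but no e entry for the data type."""
    existing, indexed = set(), set()
    for nfn, cf, entry_dt in entries:
        if entry_dt != dt:
            continue
        if cf == "e":
            existing.add(nfn)
        elif cf in ("i", "ri"):
            indexed.add(nfn)
    return indexed - existing

dictionary = [
    ("COLOR", "e", "csv"),
    ("COLOR", "i", "csv"),
    ("BODY_TOKEN", "i", "csv"),
    ("BODY_TOKEN", "ri", "csv"),
    ("SIZE", "e", "csv"),
]
print(index_only_fields(dictionary, "csv"))  # {'BODY_TOKEN'}
```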

Edge Table

The edge table may represent one or more graphs, any of which may be composed of unidirectional and bidirectional edges depending on the user’s needs. A single edge key represents a unidirectional pair of Normalized Field Value vertices, which may be thought of as a source vertex and a sink vertex respectively. Thus, bidirectional edges are generated by simply creating a second key having the original source, sink, and other attributes reversed. Additional information may be encoded into an edge key as well, such as the relationship between the two vertices and the type of the edge.

Edge Table Layout
| Row | Column Family | Column Qualifier | Value | Purpose |
| --- | --- | --- | --- | --- |
| ENFV1 + NB + ENFV2 | ETYPE + / + EREL | YYYYMMDD + / + EAV | EPB1 | A unidirectional pair of vertices that expresses some relationship between a pair of Field Values originating from a Data Object, where the CQ and value may contain additional metadata about the related activity |
| ENFV1 | STATS + NB + ACTIVITY + ETYPE | YYYYMMDD + NB + EAV | EPB2 | Provides activity stats for ENFV1 for the given day |
| ENFV1 | STATS + NB + DURATION + ETYPE | YYYYMMDD + NB + EAV | EPB3 | Provides duration stats for ENFV1 for the given day |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.
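A bidirectional edge can be sketched as two unidirectional keys built from the first row of the layout above. The function names and sample values are hypothetical, and reversing the two '-'-delimited halves of EREL for the second key is an assumption about what "other attributes reversed" means in practice.

```python
NB = "\x00"  # null byte delimiter between the two vertices in the row

def edge_key(source, sink, etype, erel, yyyymmdd, eav):
    """(row, column family, column qualifier) for one unidirectional edge."""
    return (f"{source}{NB}{sink}", f"{etype}/{erel}", f"{yyyymmdd}/{eav}")

def bidirectional_edge_keys(source, sink, etype, erel, yyyymmdd, eav):
    """Two unidirectional keys: the original and its reverse."""
    # EREL holds two '-'-delimited values, one per vertex, so it is reversed too
    # (an assumption made for this sketch).
    reversed_erel = "-".join(reversed(erel.split("-")))
    return [
        edge_key(source, sink, etype, erel, yyyymmdd, eav),
        edge_key(sink, source, etype, reversed_erel, yyyymmdd, eav),
    ]

# Hypothetical sample values.
for key in bidirectional_edge_keys("alice", "bob", "COMMS", "FROM-TO", "20240115", "email"):
    print([part.replace(NB, "\\x00") for part in key])
```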

Date Index Table

By design, the primary data table permits at most one YYYYMMDD value to be encoded within the assigned row partition (i.e., Shard ID) of a given data object, and, by default, this date will serve as the basis for the date range criteria of any query that targets the object. However, a given data object may contain any number of date-related fields, any of which may be important to a user at query time for filtering purposes.

Indexing may be configured for such fields at ingest time. If date indexing is configured for a field, then its values along with the field name itself will be mapped by entries within this table to the partitions in the primary data table where the source objects are stored. Query clients can then leverage this table to enable date range filtering based on these dates, rather than on the dates encoded within the Shard IDs of the stored objects.

Note that the Date Type Name (DTN) column family identifier here is typically leveraged to provide semantic grouping of distinct field names from disparate datasets. For example, a DTN of “SALE_DATE” might be used to group the values of semantically equivalent fields such as “PURCHASE_DATE”, “RECEIPT_DATE”, “DATE_OF_SALE”, “DATE_PURCHASED”, etc.

Date Index Table Layout
| Row | Column Family | Column Qualifier | Value | Purpose |
| --- | --- | --- | --- | --- |
| YYYYMMDD1 | DTN | YYYYMMDD2 + NB + DT + NB + NFN | BitSet denoting the date's presence within specific shard partitions. E.g., bit 0 is YYYYMMDD_0, and so on | Maps the date field (given by NFN) and its value (given by YYYYMMDD1) to the SHARD partition(s) specified by the YYYYMMDD2 and BitSet values |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.
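The mapping from BitSet value to shard partitions can be sketched as follows. The bit set is modeled here as a non-negative Python integer whose bit k corresponds to partition k; how the value is actually serialized in the table is not covered, and the function name is hypothetical.

```python
from typing import List

def shards_from_bitset(yyyymmdd: str, bits: int) -> List[str]:
    """Expand a bit set into shard IDs: bit k set -> YYYYMMDD_k."""
    if bits < 0:
        raise ValueError("bit set must be non-negative")
    shard_ids = []
    k = 0
    while bits:
        if bits & 1:
            shard_ids.append(f"{yyyymmdd}_{k}")
        bits >>= 1
        k += 1
    return shard_ids

# Bits 0, 2, and 5 are set.
print(shards_from_bitset("20240115", 0b100101))
# ['20240115_0', '20240115_2', '20240115_5']
```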

Load Dates Table

The load dates table tracks the dates on which specific field names were loaded into specific tables via DataWave Ingest. This information may be leveraged internally for the purposes of query optimization, load date-based filtering for queries, etc.

Load Dates Table Layout
| Row | Column Family | Column Qualifier | Value | Purpose |
| --- | --- | --- | --- | --- |
| NFN | FIELD_NAME + NB + Table Name | YYYYMMDD + NB + DT | Long integer encoded via Accumulo SummingCombiner (VARLEN) | Denotes that NFN from the given DT was loaded into the specified table on the given date |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.

Other Tables

Ingest Error Tables

The layouts associated with the four ingest error tables are identical to those listed above for the Primary Data Table, Global Index Tables, and Data Dictionary Table. The only difference is that the respective error tables here are meant to capture Data Objects that failed to be fully loaded during ingest due to one or more processing errors.

That is, these tables are intended to capture all successfully-processed NFN:FV pairs from their respective Data Objects, just as they would have appeared in the normal schema, along with supplemental key/value pairs related to the errors themselves. Since schema descriptions for the four primary data tables apply here as well, we describe below only the specific entries used to convey information about the error(s).

Note: The fields described below will supplement each Data Object persisted in the Ingest Errors data table.

Note: The date value encoded within the Shard ID of an object here is based on the object's LOAD_DATE.

| Row | Column Family | Column Qualifier | Value | Purpose |
| --- | --- | --- | --- | --- |
| SHARD | DT + NB + UID | ERROR + NB + FV | NULL | Denotes the error category, where NFN = ERROR and FV is one of datawave.ingest.data.RawDataErrorNames |
| SHARD | DT + NB + UID | JOB_ID + NB + FV | NULL | Identifies the job, where NFN = JOB_ID and FV is the MapReduce Job ID |
| SHARD | DT + NB + UID | JOB_NAME + NB + FV | NULL | Identifies the job, where NFN = JOB_NAME and FV is the MapReduce Job Name |
| SHARD | D | DT + NB + UID + NB + EVENT | DC | Conveys the raw content of the actual data object that failed via DC, using EVENT as the DVN |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.

Query Metrics Tables

The layouts associated with the four query metrics tables are identical to those listed above for the Primary Data Table, Global Index Tables, and Data Dictionary Table. The only difference here is that the respective query metrics tables are intended to persist information associated with user queries exclusively. They can be leveraged by users to gain insight into their own queries, and by administrators to gain insight into active and historical queries. Since schema descriptions for the primary data tables apply here as well, we describe below only the specific NFN and FV components that are used to represent a query metrics Data Object.

Query Metrics Schema

Note: The Data Type (DT) portion of all respective keys is denoted by the literal 'querymetrics'.

Note: The YYYYMMDD shard date for a query metrics object identifies the creation date of the query.

| NFN | FV |
| --- | --- |
| AUTHORIZATIONS | User-specified list of Accumulo auths |
| BEGIN_DATE | User-specified date range start value |
| END_DATE | User-specified date range end value |
| PARAMETERS | User-specified query parameters |
| QUERY | User-specified query expression |
| QUERY_LOGIC | User-specified query logic name |
| QUERY_NAME | User-specified name assigned to the query |
| CREATE_DATE | Query creation date/time (YYYYMMDD hhmmss) |
| DOC_RANGES | Integer count of document-based table ranges generated by the query |
| FI_RANGES | Integer count of Field Index based table ranges generated by the query |
| HOST | Host name from which the query originated |
| LAST_UPDATED | Date/time of most recent update to the query (YYYYMMDD hhmmss) |
| LIFECYCLE | One of datawave.webservice.query.metric.BaseQueryMetric.Lifecycle |
| NEXT_COUNT | Number of times user requested a page of results, or invoked the /Query/{query id}/next endpoint, for the query |
| NUM_UPDATES | Integer count denoting the number of updates that occurred to the query |
| PAGE_METRICS.[n] | Metrics metadata for a single page of results, where n is an integer denoting the nth page |
| PLAN | The final (actual) query expression to be evaluated by DataWave |
| QUERY_ID | UID generated by the system upon query creation, used in subsequent API calls |
| QUERY_TYPE | Name of the internal Java class that was employed to encapsulate the query and its state |
| CREATE_TIME | Amount of time elapsed (in ms) for the creation phase of the query |
| ELAPSED_TIME | Total amount of time elapsed (in ms) between creation and the query's current state |
| SEEK_COUNT | Number of Accumulo Iterator seeks required by the query |
| SOURCE_COUNT | Number of Accumulo Iterator sources required by the query |
| USER | User name associated with the query |
| USER_DN | User's distinguished name from client certificate |

Terms and Definitions

Data Model Terms and Definitions
| Name | Definition | ID |
| --- | --- | --- |
| Column Family | The portion of the Accumulo key representing the column family | CF |
| Column Qualifier | The portion of the Accumulo key representing the column qualifier | CQ |
| Document Column Family | Column family for document content. Literal string with the value of 'd' | D |
| Document Content | Raw document content for the 'd' column entry. Gzip compressed, then base64 encoded | DC |
| Data Object | A real-life object for which field names and field values can be derived. Typically, use of this term will denote an object stored within DataWave's primary data table | DO |
| Data Type | An identifier that describes some category or facet of the Data Objects associated with it | DT |
| Date Type Name | A user-specified identifier used as the CF within the Date Index table. E.g., this value may be used as a query parameter to indicate to the Query API that it should leverage the Date Index for date range filtering rather than the default, shard-based date range | DTN |
| Document View Name | Identifier within the 'd' column qualifier, which categorizes the content | DVN |
| Edge Attribute Vector | Three optional, user-supplied attributes which can be appended to the edge key's CQ. These attributes are typically delimited by '/' | EAV |
| EAV Element #1 | The first element of the Edge Attribute Vector (EAV). Typically used to denote a subcategory or facet of the given Edge Type (ETYPE) value. This element is required (typically set in configuration), whereas the remaining EAV elements are optional | EAV[0] |
| Source Edge Vertex (NFV) | A Field Value processed by a Normalizer that represents the SOURCE vertex in a unidirectional edge key | ENFV1 |
| Sink Edge Vertex (NFV) | A Field Value processed by a Normalizer that represents the SINK vertex in a unidirectional edge key | ENFV2 |
| Edge Metadata Protocol Buffer | Protocol Buffer denoting the Data Object (DO) field mappings that were used in the creation of one or more Edge Table keys | EPB0 |
| Edge Table Protocol Buffer | Protocol Buffer containing one LONG which is the count of the edge for the day and an INT32 bitmask for hour of the day | EPB1 |
| Edge STATS‑ACTIVITY Protocol Buffer | Protocol Buffer containing one LONG[24] which is the count per hour for each hour of the day | EPB2 |
| Edge STATS‑DURATION Protocol Buffer | Protocol Buffer containing histogram as LONG[7]: [ <10 sec, 10-30 sec, 30-60 sec, 1-5 min, 5-10 min, 10-30 min, >30 min ] | EPB3 |
| Edge Relationship | Two user-provided string values delimited by '-' used as part of the edge key CF, for describing the relationship between the respective SOURCE and SINK vertices | EREL |
| Edge Type | User-provided string value used as part of the edge key CF, for categorizing edges | ETYPE |
| Field Index Column Family | A column family prefix for field index keys. Literal string with the value of 'fi' | FI |
| Field Name | The name of a field in a Data Object | FN |
| Field Value | The raw value associated with a particular Field Name | FV |
| Normalizer | An object that transforms data so that it can be sorted lexicographically | N |
| Null Byte Character | A character with the hex value of \x00, used as a delimiter | NB |
| Normalizer Class Name | The fully-qualified name of the Normalizer (N) Java class used for indexing purposes | NCN |
| Normalized Field Name | A Field Name (FN) transformed to be compatible with the query code | NFN |
| Normalized Field Value | A Field Value (FV) that has been processed/transformed by a Normalizer (N) | NFV |
| Normalized Term | An individual term within a document transformed to be compatible with the query code | NT |
| Null Value | No value is assigned for keys of this type | NULL |
| Protocol Buffer | Google Protocol Buffer | PB |
| Reversed Normalized Field Value | A Field Value (FV) that has been processed/transformed by a Normalizer (N) and then reversed | RNFV |
| Shard ID | The partition identifier in the form of YYYYMMDD_N, where N is an integer denoting a specific sub-partition within the given day | SHARD |
| Term Frequency Column Family | Column family for term frequency keys. Literal string with the value of 'tf' | TF |
| Term Frequency Protocol Buffer | Google Protocol Buffer containing a list of word offsets for the term in the document | TFPB |
| Unique Identifier | An internal identifier generated for the Data Object that is typically a hash of the raw source data. Guaranteed to be deterministic and unique within a given partition and Data Type | UID |
| Uid.List Protocol Buffer | A protocol buffer object that contains the number of occurrences of the field name/value pair, along with a bounded list of UIDs | ULPB |
| Vertex | A single node within a graph | VERT |
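The EPB3 definition above lists seven duration buckets. The sketch below shows one way such a histogram could be filled from durations given in seconds. The function names are hypothetical, and the treatment of values that fall exactly on a bucket boundary is an assumption, since the definition does not specify it.

```python
from bisect import bisect_right
from typing import List

# Upper bounds (in seconds) of the first six buckets; the seventh is open-ended.
# Buckets: <10 sec, 10-30 sec, 30-60 sec, 1-5 min, 5-10 min, 10-30 min, >30 min
BOUNDS = [10, 30, 60, 300, 600, 1800]

def bucket_index(duration_seconds: float) -> int:
    """Index into the LONG[7] histogram for one duration."""
    # Assumption: a value on a boundary (e.g. exactly 10 s) goes to the higher bucket.
    return bisect_right(BOUNDS, duration_seconds)

def build_histogram(durations: List[float]) -> List[int]:
    """Count how many durations fall into each of the seven buckets."""
    histogram = [0] * 7
    for d in durations:
        histogram[bucket_index(d)] += 1
    return histogram

print(build_histogram([3, 12, 45, 45, 400, 5000]))  # [1, 1, 2, 0, 1, 0, 1]
```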