DataWave 7.x Docs - DataWave Data Model

DataWave uses the Accumulo table schemas described below as the basis for its ingest and query components.

Primary Data Table

The primary data table uses a sharded approach and can be described as an intra-day hash partitioned table where fields in a data object are stored collocated in a single partition. The Shard ID is a function of the UID and therefore should be reproducible given the same object ingested at different points in time. This enables de-duplication of objects when they are re-ingested. The Data Type is a user defined category of the data that will typically be used at query time. The Data Type allows for further reduction in the amount of data to be searched.
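As a rough illustration of why a UID-derived Shard ID supports de-duplication, the sketch below derives the partition number from a hash of the UID. The hash function (CRC32), the `num_shards` parameter, and the function name are assumptions made for illustration only; DataWave's actual partitioning function may differ.

```python
import zlib


def shard_id(event_date: str, uid: str, num_shards: int = 10) -> str:
    """Return a Shard ID of the form YYYYMMDD_N for a data object.

    Hypothetical: CRC32 stands in for whatever hash DataWave really uses.
    """
    partition = zlib.crc32(uid.encode("utf-8")) % num_shards
    return f"{event_date}_{partition}"


# The same date and UID always map to the same partition, so re-ingesting
# an object produces the same row, which is what makes de-duplication possible.
first = shard_id("20240101", "uid.abc.123")
second = shard_id("20240101", "uid.abc.123")
print(first == second)  # True
```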

The primary data table also contains an in-partition index, which we call the Field Index, and we denote the K,V pairs that are in the field index with a leading ‘fi’ in the column family. The field index is used by custom Accumulo iterators at query time to find data objects in the partition.

Optionally, if the table is used to store documents, then the original document or different views of the document can be stored in the ‘d’ column family. Typically this column family would be set up as its own locality group. An example of different views of a document could be .txt and .html versions of the original document.
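The definitions at the end of this page describe the stored document content as gzip compressed and then base64 encoded. A minimal sketch of that round trip, using only the Python standard library, might look like the following. The function names and the sample JSON are illustrative assumptions, not part of DataWave.

```python
import base64
import gzip
import json


def encode_document(raw: bytes) -> bytes:
    """Gzip the raw content, then base64-encode the compressed bytes."""
    return base64.b64encode(gzip.compress(raw))


def decode_document(value: bytes) -> bytes:
    """Reverse the encoding: base64-decode, then gunzip."""
    return gzip.decompress(base64.b64decode(value))


# Illustrative raw source for the 'JSON' view of a document.
source = json.dumps({"MAKE": "Ford", "MODEL": "Mustang", "YEAR": 1990}).encode("utf-8")
value = encode_document(source)
assert decode_document(value) == source
```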

To enable phrase queries on documents, the ‘tf’ column family contains a protocol buffer (PB) in the value that is a list of word offsets for the term in the document. This column family can also be stored in its own locality group.
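To see how word offsets make phrase queries possible, consider the sketch below. It treats the offsets for each term as a plain Python list rather than a protocol buffer, and it assumes zero-based offsets; both are simplifications for illustration. A phrase matches when its terms appear at consecutive offsets in order.

```python
def phrase_matches(offsets_by_term: dict[str, list[int]], phrase: list[str]) -> bool:
    """Return True if the phrase's terms occur at consecutive word offsets."""
    if not phrase:
        return False
    # Offsets of every term after the first, as sets for fast lookup.
    later = [set(offsets_by_term.get(term, [])) for term in phrase[1:]]
    for start in offsets_by_term.get(phrase[0], []):
        if all(start + i + 1 in offsets for i, offsets in enumerate(later)):
            return True
    return False


# Offsets for a DESCRIPTION of "1990 Ford Mustang (Red)" after tokenization.
description_offsets = {"1990": [0], "ford": [1], "mustang": [2], "red": [3]}
print(phrase_matches(description_offsets, ["ford", "mustang"]))  # True
print(phrase_matches(description_offsets, ["mustang", "ford"]))  # False
```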

Primary Data Table Layout
E.g.	Row	Column Family	Column Qualifier	Value	Purpose
20240101_6 cars\x00uid.abc.123:DESCRIPTION\x001990 Ford Mustang (Red) 20240101_6 cars\x00uid.abc.123:MAKE\x00Ford 20240101_6 cars\x00uid.abc.123:MODEL\x00Mustang 20240101_6 cars\x00uid.abc.123:YEAR\x001990	SHARD	DT + NB + UID	NFN + NB + FV	NULL	Represents a name/value pair for a given field (NFN) within the Data Object (DO)
20240101_6 fi\x00MAKE:ford\x00cars\x00uid.abc.123 20240101_6 fi\x00MODEL:mustang\x00cars\x00uid.abc.123	SHARD	fi + NB + NFN	NFV + NB + DT + NB + UID	NULL	The fi (field index) column provides an in-partition index of field names and associated values for Data Objects that exist within the given shard
20240101_6 d:cars\x00uid.abc.123\x00JSON [Base64 encoded, gzipped json]	SHARD	d	DT + NB + UID + NB + DVN	DC	(Optional) The d (document) column provides a named view of raw content, typically the raw source input that generated the Data Object
20240101_6 tf:cars\x00uid.abc.123\x001990\x00DESCRIPTION [Protobuf of '1990' offsets] 20240101_6 tf:cars\x00uid.abc.123\x00ford\x00DESCRIPTION [Protobuf of 'ford' offsets] 20240101_6 tf:cars\x00uid.abc.123\x00mustang\x00DESCRIPTION [Protobuf of 'mustang' offsets] 20240101_6 tf:cars\x00uid.abc.123\x00red\x00DESCRIPTION [Protobuf of 'red' offsets]	SHARD	tf	DT + NB + UID + NB + NT + NB + NFN	TFPB	(Optional) The tf (term frequency) column enables phrase queries by mapping a Data Object's tokenized terms to the positions (word offsets) of those terms within the source document
+ symbol used only as a visual delimiter above. It does not appear in the actual data
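The key layouts in the table above can be assembled as in the following sketch, which returns each key as a `(row, column family, column qualifier)` tuple of strings. The helper names are hypothetical, and normalized values (such as the lowercased `ford`) are passed in rather than computed, since the normalizers themselves are outside the scope of this sketch.

```python
NB = "\x00"  # null byte delimiter


def event_key(shard, data_type, uid, field, value):
    """Name/value pair for a field within a Data Object."""
    return (shard, f"{data_type}{NB}{uid}", f"{field}{NB}{value}")


def field_index_key(shard, data_type, uid, field, normalized_value):
    """In-partition field index entry."""
    return (shard, f"fi{NB}{field}", f"{normalized_value}{NB}{data_type}{NB}{uid}")


def document_key(shard, data_type, uid, view_name):
    """Named view of the raw document content."""
    return (shard, "d", f"{data_type}{NB}{uid}{NB}{view_name}")


def term_frequency_key(shard, data_type, uid, term, field):
    """Term frequency entry for a tokenized term within a field."""
    return (shard, "tf", f"{data_type}{NB}{uid}{NB}{term}{NB}{field}")


for key in (
    event_key("20240101_6", "cars", "uid.abc.123", "MAKE", "Ford"),
    field_index_key("20240101_6", "cars", "uid.abc.123", "MAKE", "ford"),
    document_key("20240101_6", "cars", "uid.abc.123", "JSON"),
    term_frequency_key("20240101_6", "cars", "uid.abc.123", "ford", "DESCRIPTION"),
):
    print(repr(key))
```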

Global Index Tables

The forward and reverse index tables serve as global indexes mapping terms to partitions. The index maps a NFN:NFV pair to a category of data within the partitions of the primary data table. The Uid.List Protocol Buffer (ULPB) object contains the number of occurrences of the NFN:NFV pair in a category of data in the partition. Additionally, the ULPB may contain the UIDs of the objects that contain the NFN:NFV. We say “may contain” because there is an upper limit on the number of UIDs in the ULPB.
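The "may contain" behavior can be pictured with a small bounded container. The class below is a hypothetical stand-in for the Uid.List protocol buffer, not its real definition: the `max_uids` limit, the `truncated` flag, and the choice to discard the UID list once the limit is exceeded are all assumptions made for illustration. The occurrence count is always maintained.

```python
from dataclasses import dataclass, field


@dataclass
class BoundedUidList:
    """Hypothetical model of a count plus a bounded list of UIDs."""

    max_uids: int = 3
    count: int = 0
    uids: list[str] = field(default_factory=list)
    truncated: bool = False

    def add(self, uid: str) -> None:
        self.count += 1  # the count is always accurate
        if self.truncated:
            return
        if len(self.uids) < self.max_uids:
            self.uids.append(uid)
        else:
            # Limit exceeded: keep only the count from here on (assumed behavior).
            self.uids.clear()
            self.truncated = True


entry = BoundedUidList(max_uids=3)
for n in range(5):
    entry.add(f"uid.abc.{n}")
print(entry.count, entry.uids, entry.truncated)  # 5 [] True
```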

Global Forward Index Table Layout
E.g.	Row	Column Family	Column Qualifier	Value	Purpose
ford MAKE:20240101_6\x00cars [Protobuf list of uid's including 'uid.abc.123'] mustang MODEL:20240101_6\x00cars [Protobuf list of uid's including 'uid.abc.123']	NFV	NFN	SHARD + NB + DT	ULPB	Maps the NFV:NFN pair to a given shard partition within the Primary Data Table (PDT), and optionally to a bounded list of Data Object UIDs within that partition
+ symbol used only as a visual delimiter above. It does not appear in the actual data

NFVs that are indexed within the global reverse index table can be searched using leading wildcards. To support this, the index is created by simply reversing the characters in the NFV…

Global Reverse Index Table Layout
E.g.	Row	Column Family	Column Qualifier	Value	Purpose
drof MAKE:20240101_6\x00cars [Protobuf list of uid's including 'uid.abc.123'] gnatsum MODEL:20240101_6\x00cars [Protobuf list of uid's including 'uid.abc.123']	RNFV	NFN	SHARD + NB + DT	ULPB	Maps the RNFV:NFN pair to a given shard partition within the Primary Data Table (PDT), and optionally to a bounded list of Data Object UIDs within that partition
+ symbol used only as a visual delimiter above. It does not appear in the actual data
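The sketch below shows why reversing the value helps with leading wildcards: a pattern such as `*tang` becomes a prefix scan for `gnat` over the reversed rows. The helper names are hypothetical, and the in-memory list of rows merely stands in for a sorted table scan.

```python
NB = "\x00"  # null byte delimiter


def forward_index_key(nfv, field, shard, data_type):
    """(row, column family, column qualifier) for the forward index."""
    return (nfv, field, f"{shard}{NB}{data_type}")


def reverse_index_key(nfv, field, shard, data_type):
    """Same layout as the forward index, with the value reversed in the row."""
    return (nfv[::-1], field, f"{shard}{NB}{data_type}")


def reversed_prefix(pattern: str) -> str:
    """Turn a leading-wildcard pattern into a reverse-index row prefix."""
    if not pattern.startswith("*") or "*" in pattern[1:]:
        raise ValueError("expected a single leading wildcard, e.g. '*tang'")
    return pattern[1:][::-1]


rows = [
    reverse_index_key("ford", "MAKE", "20240101_6", "cars")[0],
    reverse_index_key("mustang", "MODEL", "20240101_6", "cars")[0],
]
prefix = reversed_prefix("*tang")
matches = [row[::-1] for row in rows if row.startswith(prefix)]
print(prefix, matches)  # gnat ['mustang']
```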

Data Dictionary Table

The data dictionary table contains metadata about the data stored in other tables and is used primarily for query planning purposes. For example, this includes information about whether a particular field is indexed, about specific type-normalization performed on a field’s values, etc. The structure of the table is as follows.

Data Dictionary Table Layout
E.g.	Row	Column Family	Column Qualifier	Value	Purpose
DESCRIPTION e:cars MAKE e:cars MODEL e:cars	NFN	e	DT	NULL	The e column family denotes that NFN is a field that exists within objects of the given Data Type (DT) within the Primary Data Table (PDT). Note: The absence of an e entry for a given NFN and the presence of i and/or ri entries for the NFN denotes that it is an index-only field, meaning that no such field exists within the PDT's Data Objects for the given DT
MAKE i:cars\x0020240101 \x01 MAKE i:cars\x0020240102 \x0E MODEL i:cars\x0020240101 \x01 MODEL i:cars\x0020240102 \x0E	NFN	i	DT + NB + YYYYMMDD	Integer Count	The i column family denotes that NFN exists in both the Global Forward Index (GFIDX) table and in the Primary Data Table's (PDT) fi column. The value denotes that 'Integer Count' instances of the NFN exist for the given date. Note: The absence of an e entry for a given NFN and the presence of i and/or ri entries for the NFN denotes that it is an index-only field, meaning that no such field exists within the PDT's Data Objects for the given DT
MODEL ri:cars\x0020240101 \x01 MODEL ri:cars\x0020240102 \x0E	NFN	ri	DT + NB + YYYYMMDD	Integer Count	The ri column family denotes that NFN exists in the Global Reverse Index (GRIDX) table, and the value denotes that 'Integer Count' instances of the NFN exist for the given date
MAKE f:cars\x0020240101 \x01 MODEL f:cars\x0020240101 \x01 NON_INDEXED_FIELD f:cars\x0020240101 \x01	NFN	f	DT + NB + YYYYMMDD	Integer Count	Similar to the i column, the f (frequency) column family conveys the 'Integer Count' times that instances of NFN were seen on the given date. These entries are recorded for non-indexed fields as well
MAKE t:cars\x00datawave.data.type.LcNoDiacriticsType YEAR t:cars\x00datawave.data.type.NumberType	NFN	t	DT + NB + NCN	NULL	The t (type) column denotes that values of the given field are normalized for indexing purposes using the specified Java class
MAKE desc:cars Denotes the car's make/manufacturer name MODEL desc:cars Denotes the car's model name	NFN	desc	DT	Text Description	The desc column family may be used to supply a text description of the NFN within type DT
DESCRIPTION tf:cars	NFN	tf	DT	NULL	The tf column family denotes that NFN is enabled for phrase queries, i.e., that it has tf (term frequency) column entries in the Primary Data Table (PDT)
CARS/MODEL‑MAKE edge:SALES [Protobuf containing field name mapping] CARS/MODEL‑YEAR edge:SALES [Protobuf containing field name mapping]	ETYPE + / + EREL	edge	EAV[0]	EPB₀	The edge column family denotes that one or more Edge Table keys were formed using the given Row and CQ identifiers and were derived from field names in the Primary Data Table (PDT) that are given in the Protocol Buffer value. These dictionary entries allow an Edge Table key to be mapped back to the source Data Object from which it was derived (e.g., via the EdgeEventQuery query logic)
+ symbol used only as a visual delimiter above. It does not appear in the actual data
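The index-only rule noted in the table (no e entry, but i and/or ri entries present) can be expressed as a simple set difference. The sketch below works on `(row, column family, column qualifier)` tuples held in memory; the function name and the `MAKE_TOKEN` field are hypothetical and used only to demonstrate the rule.

```python
NB = "\x00"  # null byte delimiter


def index_only_fields(entries, data_type):
    """Return field names that are indexed but have no 'e' entry for data_type."""
    existing, indexed = set(), set()
    for field_name, cf, cq in entries:
        # The Data Type is the first NB-delimited component of the qualifier.
        if cq.split(NB, 1)[0] != data_type:
            continue
        if cf == "e":
            existing.add(field_name)
        elif cf in ("i", "ri"):
            indexed.add(field_name)
    return indexed - existing


entries = [
    ("MAKE", "e", "cars"),
    ("MAKE", "i", f"cars{NB}20240101"),
    ("MODEL", "e", "cars"),
    ("MODEL", "ri", f"cars{NB}20240101"),
    ("MAKE_TOKEN", "i", f"cars{NB}20240101"),  # hypothetical index-only field
]
print(index_only_fields(entries, "cars"))  # {'MAKE_TOKEN'}
```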

Edge Table

The edge table may represent one or more graphs, any of which may be composed of unidirectional and bidirectional edges depending on the user’s needs. A single edge key represents a unidirectional pair of Normalized Field Value vertices, which may be thought of as a source vertex and a sink vertex respectively. Thus, bidirectional edges are generated by simply creating a second key having the original source, sink, and other attributes reversed. Additional information may be encoded into an edge key as well, such as the relationship between the two vertices, the type of the edge, and others.

Edge Table Layout
E.g.	Row	Column Family	Column Qualifier	Value	Purpose
mustang\x00ford CARS/MODEL‑MAKE:20240101/SALES/RegionA/Dealership1 [Protobuf histogram object] mustang\x001990 CARS/MODEL‑YEAR:20240101/SALES/RegionA/Dealership1 [Protobuf histogram object]	ENFV₁ + NB + ENFV₂	ETYPE + / + EREL	YYYYMMDD + / + EAV	EPB₁	A unidirectional pair of vertices that expresses some relationship between a pair of Field Values originating from a Data Object, where the CQ and value may contain additional metadata about the related activity
	ENFV₁	STATS + NB + ACTIVITY + ETYPE	YYYYMMDD + NB + EAV	EPB₂	Provides activity stats for ENFV₁ for the given day
	ENFV₁	STATS + NB + DURATION + ETYPE	YYYYMMDD + NB + EAV	EPB₃	Provides duration stats for ENFV₁ for the given day
+ symbol used only as a visual delimiter above. It does not appear in the actual data
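A bidirectional edge can be sketched as two unidirectional keys. In the example below, the second key swaps the source and sink vertices and the two halves of the relationship; exactly which other attributes DataWave reverses is not specified here, so that choice is an assumption. The helper names are hypothetical, and a plain ASCII hyphen is used as the relationship delimiter.

```python
NB = "\x00"  # null byte delimiter


def edge_key(source, sink, edge_type, relationship, date, attributes):
    """Build (row, column family, column qualifier) for a unidirectional edge."""
    row = f"{source}{NB}{sink}"
    cf = f"{edge_type}/{'-'.join(relationship)}"
    cq = "/".join([date, *attributes])
    return (row, cf, cq)


def reversed_edge_key(source, sink, edge_type, relationship, date, attributes):
    """Second key of a bidirectional edge: vertices and relationship swapped."""
    return edge_key(sink, source, edge_type, relationship[::-1], date, attributes)


args = ("mustang", "ford", "CARS", ("MODEL", "MAKE"), "20240101",
        ["SALES", "RegionA", "Dealership1"])
print(repr(edge_key(*args)))
print(repr(reversed_edge_key(*args)))
```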

Date Index Table

By design, the primary data table permits at most one YYYYMMDD value to be encoded within the assigned row partition (i.e., Shard ID) of a given data object, and, by default, this date will serve as the basis for the date range criteria of any query that targets the object. However, a given data object may contain any number of date-related fields, any of which may be important to a user at query time for filtering purposes.

Indexing may be configured for such fields at ingest time. If date indexing is configured for a field, then its values along with the field name itself will be mapped by entries within this table to the partitions in the primary data table where the source objects are stored. Query clients can then leverage this table to enable date range filtering based on these dates, rather than on the dates encoded within the Shard IDs of the stored objects.

Note that the Date Type Name (DTN) column family identifier here is typically leveraged to provide semantic grouping of distinct field names from disparate datasets. For example, a DTN of “SALE_DATE” might be used to group the values of semantically equivalent fields such as “PURCHASE_DATE”, “RECEIPT_DATE”, “DATE_OF_SALE”, “DATE_PURCHASED”, etc.

Date Index Table Layout
Row	Column Family	Column Qualifier	Value	Purpose
YYYYMMDD₁	DTN	YYYYMMDD₂ + NB + DT + NB + NFN	BitSet denoting the date's presence within specific shard partitions. E.g., bit 0 is YYYYMMDD_0, and so on	Maps the date field (given by NFN) and its value (given by YYYYMMDD₁) to the SHARD partition(s) specified by the YYYYMMDD₂ and BitSet values
+ symbol used only as a visual delimiter above. It does not appear in the actual data
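The BitSet value can be read as a list of shard partitions, where bit N set means partition YYYYMMDD_N holds matching objects. The sketch below models the BitSet as a Python integer for simplicity; the serialized byte layout of the real value is not covered, and the function name is hypothetical.

```python
def shards_from_bits(shard_date: str, bits: int) -> list[str]:
    """Expand a bit mask into Shard IDs; bit 0 maps to <shard_date>_0."""
    shards = []
    index = 0
    while bits:
        if bits & 1:
            shards.append(f"{shard_date}_{index}")
        bits >>= 1
        index += 1
    return shards


# Bits 0, 2, and 6 are set.
print(shards_from_bits("20240101", 0b1000101))
# ['20240101_0', '20240101_2', '20240101_6']
```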

Load Dates Table

The load dates table tracks the dates on which specific field names were loaded into specific tables via DataWave Ingest. This information may be leveraged internally for the purposes of query optimization, load date-based filtering for queries, etc.

Load Dates Table Layout
Row	Column Family	Column Qualifier	Value	Purpose
NFN	FIELD_NAME + NB + Table Name	YYYYMMDD + NB + DT	Long integer encoded via Accumulo SummingCombiner (VARLEN)	Denotes that NFN from the given DT was loaded into the specified table on the given date
+ symbol used only as a visual delimiter above. It does not appear in the actual data

Other Tables

Ingest Error Tables

The layouts associated with the four ingest error tables are identical to those listed above for the Primary Data Table, Global Index Tables, and Data Dictionary Table. The only difference is that the respective error tables here are meant to capture Data Objects that failed to be fully loaded during ingest due to one or more processing errors.

That is, these tables are intended to capture all successfully processed NFN:FV pairs from their respective Data Objects, just as they would have appeared in the normal schema, along with supplemental key/value pairs related to the errors themselves. Since the schema descriptions for those four tables apply here as well, we describe below only the specific entries used to convey information about the error(s).

**Note:** The fields described below will supplement each Data Object persisted in the Ingest Errors data table
Row	Column Family	Column Qualifier	Value	Purpose
SHARD	DT + NB + UID	ERROR + NB + FV	NULL	Denotes the error category, where NFN = ERROR and FV is one of datawave.ingest.data.RawDataErrorNames
SHARD	DT + NB + UID	JOB_ID + NB + FV	NULL	Identifies the job, where NFN = JOB_ID and FV is the MapReduce Job ID
SHARD	DT + NB + UID	JOB_NAME + NB + FV	NULL	Identifies the job, where NFN = JOB_NAME and FV is the MapReduce Job Name
SHARD	D	DT + NB + UID + NB + EVENT	DC	Conveys the raw content of the actual data object that failed via DC, using EVENT as the DVN
+ symbol used only as a visual delimiter above. It does not appear in the actual data

Query Metrics Tables

The layouts associated with the four query metrics tables are identical to those listed above for the Primary Data Table, Global Index Tables, and Data Dictionary Table. The only difference here is that the respective query metrics tables are intended to persist information associated with user queries exclusively. They can be leveraged by users to gain insight into their own queries, and by administrators to gain insight into active and historical queries. Since schema descriptions for the primary data tables apply here as well, we describe below only the specific NFN and FV components that are used to represent a query metrics Data Object.

Query Metrics Schema
NFN	FV
AUTHORIZATIONS	User-specified list of Accumulo auths
BEGIN_DATE	User-specified date range start value
END_DATE	User-specified date range end value
PARAMETERS	User-specified query parameters
QUERY	User-specified query expression
QUERY_LOGIC	User-specified query logic name
QUERY_NAME	User-specified name assigned to the query
CREATE_DATE	Query creation date/time (YYYYMMDD hhmmss)
DOC_RANGES	Integer count of document-based table ranges generated by the query
FI_RANGES	Integer count of Field Index based table ranges generated by the query
HOST	Host name from which the query originated
LAST_UPDATED	Date/time of most recent update to the query (YYYYMMDD hhmmss)
LIFECYCLE	One of datawave.webservice.query.metric.BaseQueryMetric.Lifecycle
NEXT_COUNT	Number of times user requested a page of results, or invoked the /Query/{query id}/next endpoint, for the query
NUM_UPDATES	Integer count denoting the number of updates that occurred to the query
PAGE_METRICS.[n]	Metrics metadata for a single page of results, where n is an integer denoting the nth page
PLAN	The final (actual) query expression to be evaluated by DataWave
QUERY_ID	UID generated by the system upon query creation, used in subsequent API calls
QUERY_TYPE	Name of the internal Java class that was employed to encapsulate the query and its state
CREATE_TIME	Amount of time elapsed (in ms) for the creation phase of the query
ELAPSED_TIME	Total amount of time elapsed (in ms) between creation and the query's current state
SEEK_COUNT	Number of Accumulo Iterator seeks required by the query
SOURCE_COUNT	Number of Accumulo Iterator sources required by the query
USER	User name associated with the query
USER_DN	User's distinguished name from client certificate

Terms and Definitions

Data Model Terms and Definitions
Name	Definition	ID
Column Family	The portion of the Accumulo key representing the column family	CF
Column Qualifier	The portion of the Accumulo key representing the column qualifier	CQ
Document Column Family	Column family for document content. Literal string with the value of 'd'	D
Document Content	Raw document content for the 'd' column entry. Gzip compressed, then base64 encoded	DC
Data Object	A real-life object for which field names and field values can be derived. Typically, use of this term will denote an object stored within DataWave's primary data table	DO
Data Type	An identifier that describes some category or facet of the Data Objects associated with it	DT
Date Type Name	A user-specified identifier used as the CF within the Date Index table. E.g., this value may be used as a query parameter to indicate to the Query API that it should leverage the Date Index for date range filtering rather than the default, shard-based date range	DTN
Document View Name	Identifier within the 'd' column qualifier, which categorizes the content	DVN
Edge Attribute Vector	Three optional, user-supplied attributes which can be appended to the edge key's CQ. These attributes are typically delimited by '/'	EAV
EAV Element #1	The first element of the Edge Attribute Vector (EAV). Typically used to denote a subcategory or facet of the given Edge Type (ETYPE) value. This element is required (typically set in configuration), whereas the remaining EAV elements are optional	EAV[0]
Source Edge Vertex (NFV)	A Field Value processed by a Normalizer that represents the SOURCE vertex in a unidirectional edge key	ENFV₁
Sink Edge Vertex (NFV)	A Field Value processed by a Normalizer that represents the SINK vertex in a unidirectional edge key	ENFV₂
Edge Metadata Protocol Buffer	Protocol Buffer denoting the Data Object (DO) field mappings that were used in the creation of one or more Edge Table keys	EPB₀
Edge Table Protocol Buffer	Protocol Buffer containing one LONG which is the count of the edge for the day and an INT32 bitmask for hour of the day	EPB₁
Edge STATS‑ACTIVITY Protocol Buffer	Protocol Buffer containing one LONG[24] which is the count per hour for each hour of the day	EPB₂
Edge STATS‑DURATION Protocol Buffer	Protocol Buffer containing histogram as LONG[7]: [ <10 sec, 10-30 sec, 30-60 sec, 1-5 min, 5-10 min, 10-30 min, >30 min ]	EPB₃
Edge Relationship	Two user-provided string values delimited by '-' used as part of the edge key CF, for describing the relationship between the respective SOURCE and SINK vertices	EREL
Edge Type	User-provided string value used as part of the edge key CF, for categorizing edges	ETYPE
Field Index Column Family	A column family prefix for field index keys. Literal string with the value of 'fi'	FI
Field Name	The name of a field in a Data Object	FN
Field Value	The raw value associated with a particular Field Name	FV
Normalizer	An object that transforms data so that it can be sorted lexicographically	N
Null Byte Character	A character with the hex value of \x00, used as a delimiter	NB
Normalizer Class Name	The fully-qualified name of the Normalizer (N) Java class used for indexing purposes	NCN
Normalized Field Name	A Field Name (FN) transformed to be compatible with the query code	NFN
Normalized Field Value	A Field Value (FV) that has been processed/transformed by a Normalizer (N)	NFV
Normalized Term	An individual term within a document transformed to be compatible with the query code	NT
Null Value	No value is assigned for keys of this type	NULL
Protocol Buffer	Google Protocol Buffer	PB
Reversed Normalized Field Value	A Field Value (FV) that has been processed/transformed by a Normalizer (N) and then reversed	RNFV
Shard ID	The partition identifier in the form of YYYYMMDD_N, where N is an integer denoting a specific sub-partition within the given day	SHARD
Term Frequency Column Family	Column family for term frequency keys. Literal string with the value of 'tf'	TF
Term Frequency Protocol Buffer	Google Protocol Buffer containing a list of word offsets for the term in the document	TFPB
Unique Identifier	An internal identifier generated for the Data Object that is typically a hash of the raw source data. Guaranteed to be deterministic and unique within a given partition and Data Type	UID
Uid.List Protocol Buffer	A protocol buffer object that contains the number of occurrences of the field name/value pair, along with a bounded list of UIDs	ULPB
Vertex	A single node within a graph	VERT
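As an illustration of the duration histogram described for EPB₃ above, the sketch below assigns a duration in seconds to one of the seven buckets. The definition does not say which bucket a value exactly on a boundary belongs to, so placing it in the higher bucket is an assumption, as are the function names.

```python
from bisect import bisect_right

# Upper bounds (in seconds) separating the seven buckets.
BOUNDS_SEC = [10, 30, 60, 300, 600, 1800]
LABELS = ["<10 sec", "10-30 sec", "30-60 sec", "1-5 min",
          "5-10 min", "10-30 min", ">30 min"]


def duration_bucket(seconds: float) -> int:
    """Return the histogram index (0-6) for a duration in seconds."""
    return bisect_right(BOUNDS_SEC, seconds)


def record(histogram: list[int], seconds: float) -> None:
    """Increment the bucket that the duration falls into."""
    histogram[duration_bucket(seconds)] += 1


histogram = [0] * len(LABELS)
for duration in (5, 45, 45, 2000):
    record(histogram, duration)
print(histogram)  # [1, 0, 2, 0, 0, 0, 1]
```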
