Primary Data Table
The primary data table uses a sharded approach and can be described as an intra-day, hash-partitioned table in which all fields of a data object are stored together in a single partition. The Shard ID is a function of the UID, so the same object ingested at different points in time should map to the same shard. This enables de-duplication of objects when they are re-ingested. The Data Type is a user-defined category of the data that will typically be used at query time, and it allows for further reduction in the amount of data to be searched.
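The reproducibility property described above can be illustrated with a minimal sketch. The hash function (MD5), the shard count, and the `shard_id` helper are assumptions made for illustration only; the actual hashing used at ingest is not specified here.

```python
import hashlib

def shard_id(date_yyyymmdd: str, uid: str, num_shards: int) -> str:
    """Derive a YYYYMMDD_N shard ID from an object's date and UID."""
    # Assumption: any stable digest works for the illustration. Python's
    # built-in hash() is salted per process, so it would not be reproducible.
    digest = hashlib.md5(uid.encode("utf-8")).digest()
    partition = int.from_bytes(digest[:4], "big") % num_shards
    return f"{date_yyyymmdd}_{partition}"

# Ingesting the same object twice yields the same partition, so the second
# write lands on the same keys as the first instead of creating a duplicate.
first = shard_id("20240101", "uid.abc.123", 10)
again = shard_id("20240101", "uid.abc.123", 10)
print(first == again)
```

Because the partition depends only on the inputs, a re-ingested object overwrites its earlier entries rather than appearing in a second shard.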
The primary data table also contains an in-partition index, which we call the Field Index. Key/value pairs that belong to the field index are denoted by a leading ‘fi’ in the column family. The field index is used by custom Accumulo iterators at query time to find data objects in the partition.
Optionally, if the table is used to store documents, then the original document or different views of the document can be stored in the ‘d’ column family. Typically this column family would be set up as its own locality group. An example of different views of a document could be .txt and .html versions of the original document.
To enable phrase queries on documents, the ‘tf’ column family contains a protocol buffer (PB) in the value that is a list of word offsets for the term in the document. This column family could also be stored in its own locality group.
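As a rough illustration of why word offsets are sufficient for phrase queries, the sketch below records the position of each term and then checks whether the terms of a phrase occur at consecutive positions. The tokenizer (a simple regular expression) and both function names are assumptions for this example; they are not the tokenization or protocol buffer format that DataWave actually uses.

```python
import re
from collections import defaultdict

def term_offsets(text):
    """Map each lower-cased token to the word positions where it occurs."""
    offsets = defaultdict(list)
    for position, token in enumerate(re.findall(r"\w+", text.lower())):
        offsets[token].append(position)
    return dict(offsets)

def matches_phrase(offsets, phrase):
    """Return True if the phrase's terms occur at consecutive word positions."""
    terms = re.findall(r"\w+", phrase.lower())
    if not terms:
        return False
    return any(
        all(start + i in offsets.get(term, []) for i, term in enumerate(terms))
        for start in offsets.get(terms[0], [])
    )

offsets = term_offsets("1990 Ford Mustang (Red)")
print(matches_phrase(offsets, "ford mustang"))   # terms are adjacent
print(matches_phrase(offsets, "mustang ford"))   # same terms, wrong order
```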
E.g. | Row | Column Family | Column Qualifier | Value | Purpose |
---|---|---|---|---|---|
20240101_6 cars\x00uid.abc.123:DESCRIPTION\x001990 Ford Mustang (Red)<br>20240101_6 cars\x00uid.abc.123:MAKE\x00Ford<br>20240101_6 cars\x00uid.abc.123:MODEL\x00Mustang<br>20240101_6 cars\x00uid.abc.123:YEAR\x001990 | SHARD | DT + NB + UID | NFN + NB + FV | NULL | Represents a name/value pair for a given field (NFN) within the Data Object (DO) |
20240101_6 fi\x00MAKE:ford\x00cars\x00uid.abc.123<br>20240101_6 fi\x00MODEL:mustang\x00cars\x00uid.abc.123 | SHARD | fi + NB + NFN | NFV + NB + DT + NB + UID | NULL | The fi (field index) column provides an in-partition index of field names and associated values for Data Objects that exist within the given shard |
20240101_6 d:cars\x00uid.abc.123\x00JSON [Base64 encoded, gzipped json] | SHARD | d | DT + NB + UID + NB + DVN | DC | (Optional) The d (document) column provides a named view of raw content, typically the raw source input that generated the Data Object |
20240101_6 tf:cars\x00uid.abc.123\x001990\x00DESCRIPTION [Protobuf of '1990' offsets]<br>20240101_6 tf:cars\x00uid.abc.123\x00ford\x00DESCRIPTION [Protobuf of 'ford' offsets]<br>20240101_6 tf:cars\x00uid.abc.123\x00mustang\x00DESCRIPTION [Protobuf of 'mustang' offsets]<br>20240101_6 tf:cars\x00uid.abc.123\x00red\x00DESCRIPTION [Protobuf of 'red' offsets] | SHARD | tf | DT + NB + UID + NB + NT + NB + NFN | TFPB | (Optional) The tf (term frequency) column enables phrase queries by mapping a Data Object's tokenized terms to the positions (word offsets) of those terms within the source document |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.
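The key layouts for the field entries and the field index can be sketched as follows, using the car example from the table. The helper names are hypothetical, and lower-casing is only a stand-in for whatever Normalizer is actually configured for a field.

```python
NB = "\x00"  # null byte delimiter (NB in the table above)

def event_keys(shard, data_type, uid, fields):
    """Build (row, column family, column qualifier) triples for an object's fields."""
    return [(shard, f"{data_type}{NB}{uid}", f"{name}{NB}{value}")
            for name, value in fields.items()]

def field_index_keys(shard, data_type, uid, fields, indexed_fields):
    """Build the matching 'fi' entries for the fields configured as indexed."""
    keys = []
    for name, value in fields.items():
        if name in indexed_fields:
            normalized = value.lower()  # assumption: stand-in for a real Normalizer
            keys.append((shard, f"fi{NB}{name}",
                         f"{normalized}{NB}{data_type}{NB}{uid}"))
    return keys

car = {
    "DESCRIPTION": "1990 Ford Mustang (Red)",
    "MAKE": "Ford",
    "MODEL": "Mustang",
    "YEAR": "1990",
}
events = event_keys("20240101_6", "cars", "uid.abc.123", car)
index = field_index_keys("20240101_6", "cars", "uid.abc.123", car, {"MAKE", "MODEL"})
```

Note that the field entry keeps the raw value (Ford) while the field index entry carries the normalized value (ford), which is what allows index lookups to ignore case in this example.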
Global Index Tables
The forward and reverse index tables serve as global indexes mapping terms to partitions. The index maps a NFN:NFV pair to a category of data within the partitions of the primary data table. The Uid.List Protocol Buffer (ULPB) object contains the number of occurrences of the NFN:NFV pair in a category of data in the partition. Additionally, the ULPB may contain the UIDs of the objects that contain the NFN:NFV. We say “may contain” because there is an upper limit on the number of UIDs in the ULPB.
E.g. | Row | Column Family | Column Qualifier | Value | Purpose |
---|---|---|---|---|---|
ford MAKE:20240101_6\x00cars [Protobuf list of UIDs including 'uid.abc.123']<br>mustang MODEL:20240101_6\x00cars [Protobuf list of UIDs including 'uid.abc.123'] | NFV | NFN | SHARD + NB + DT | ULPB | Maps the NFV:NFN pair to a given shard partition within the Primary Data Table (PDT), and optionally to a bounded list of Data Object UIDs within that partition |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.
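The "may contain" behavior of the ULPB can be modeled with a small sketch. The `UidList` class, the `MAX_UIDS` cap of 3, and the choice to drop all UIDs once the cap is exceeded are assumptions for illustration; the real protocol buffer fields and limit are not described in this section.

```python
from dataclasses import dataclass, field

MAX_UIDS = 3  # hypothetical cap; the real upper limit is not stated here

@dataclass
class UidList:
    count: int = 0
    uids: list = field(default_factory=list)
    uids_dropped: bool = False

    def add(self, uid):
        """Always bump the count; keep the UID only while under the cap."""
        self.count += 1
        if self.uids_dropped:
            return
        if len(self.uids) < MAX_UIDS:
            self.uids.append(uid)
        else:
            # Assumption: once the cap is exceeded, only the count is kept.
            self.uids.clear()
            self.uids_dropped = True

# Global index modeled as a dict keyed by (NFV, NFN, SHARD + NB + DT).
global_index = {}

def index_term(nfv, nfn, shard, data_type, uid):
    key = (nfv, nfn, f"{shard}\x00{data_type}")
    global_index.setdefault(key, UidList()).add(uid)

for n in range(2):
    index_term("ford", "MAKE", "20240101_6", "cars", f"uid.abc.{n}")
for n in range(5):
    index_term("mustang", "MODEL", "20240101_6", "cars", f"uid.abc.{n}")
```

In this model a query can jump straight to specific objects when the UID list is present, and must otherwise fall back to searching the whole partition for the given Data Type.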
NFVs that are indexed within the global reverse index table can be searched using leading wildcards. To support this, the index is created by simply reversing the characters in the NFV:
E.g. | Row | Column Family | Column Qualifier | Value | Purpose |
---|---|---|---|---|---|
drof MAKE:20240101_6\x00cars [Protobuf list of UIDs including 'uid.abc.123']<br>gnatsum MODEL:20240101_6\x00cars [Protobuf list of UIDs including 'uid.abc.123'] | RNFV | NFN | SHARD + NB + DT | ULPB | Maps the RNFV:NFN pair to a given shard partition within the Primary Data Table (PDT), and optionally to a bounded list of Data Object UIDs within that partition |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.
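Reversing the values turns a leading wildcard into a prefix scan over sorted rows, which is the access pattern a sorted key/value store handles well. The sketch below uses a sorted Python list in place of the table; the function names and the extra sample value "fiesta" are assumptions for this example.

```python
import bisect

def reverse_index_rows(values):
    """Return the reversed values in sorted order, as they would sort as row IDs."""
    return sorted(value[::-1] for value in values)

def leading_wildcard_matches(rows, pattern):
    """Resolve a pattern such as '*stang' with a prefix scan over reversed rows."""
    if not pattern.startswith("*") or "*" in pattern[1:]:
        raise ValueError("expected exactly one leading '*'")
    prefix = pattern[1:][::-1]               # '*stang' becomes the prefix 'gnats'
    start = bisect.bisect_left(rows, prefix)  # seek to the first candidate row
    matches = []
    for row in rows[start:]:
        if not row.startswith(prefix):
            break                             # sorted order: no later row can match
        matches.append(row[::-1])
    return matches

rows = reverse_index_rows(["ford", "mustang", "fiesta"])
print(leading_wildcard_matches(rows, "*stang"))
```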
Data Dictionary Table
The data dictionary table contains metadata about the data stored in other tables and is used primarily for query planning purposes. For example, this includes information about whether a particular field is indexed, about specific type-normalization performed on a field’s values, etc. The structure of the table is as follows.
E.g. | Row | Column Family | Column Qualifier | Value | Purpose |
---|---|---|---|---|---|
DESCRIPTION e:cars<br>MAKE e:cars<br>MODEL e:cars | NFN | e | DT | NULL | The e column family denotes that NFN is a field that exists within objects of the given Data Type (DT) within the Primary Data Table (PDT). Note: The absence of an e entry for a given NFN and the presence of i and/or ri entries for the NFN denotes that it is an index-only field, meaning that no such field exists within the PDT's Data Objects for the given DT |
MAKE i:cars\x0020240101 \x01<br>MAKE i:cars\x0020240102 \x0E<br>MODEL i:cars\x0020240101 \x01<br>MODEL i:cars\x0020240102 \x0E | NFN | i | DT + NB + YYYYMMDD | Integer Count | The i column family denotes that NFN exists in both the Global Forward Index (GFIDX) table and in the Primary Data Table's (PDT) fi column. The value denotes that 'Integer Count' instances of the NFN exist for the given date. Note: The absence of an e entry for a given NFN and the presence of i and/or ri entries for the NFN denotes that it is an index-only field, meaning that no such field exists within the PDT's Data Objects for the given DT |
MODEL ri:cars\x0020240101 \x01<br>MODEL ri:cars\x0020240102 \x0E | NFN | ri | DT + NB + YYYYMMDD | Integer Count | The ri column family denotes that NFN exists in the Global Reverse Index (GRIDX) table, and the value denotes that 'Integer Count' instances of the NFN exist for the given date |
MAKE f:cars\x0020240101 \x01<br>MODEL f:cars\x0020240101 \x01<br>NON_INDEXED_FIELD f:cars\x0020240101 \x01 | NFN | f | DT + NB + YYYYMMDD | Integer Count | Similar to the i column, the f (frequency) column family conveys the 'Integer Count' times that instances of NFN were seen on the given date. These entries are recorded for non-indexed fields as well |
MAKE t:cars\x00datawave.data.type.LcNoDiacriticsType<br>YEAR t:cars\x00datawave.data.type.NumberType | NFN | t | DT + NB + NCN | NULL | The t (type) column denotes that values of the given field are normalized for indexing purposes using the specified Java class |
MAKE desc:cars Denotes the car's make/manufacturer name<br>MODEL desc:cars Denotes the car's model name | NFN | desc | DT | Text Description | The desc column family may be used to supply a text description of the NFN within type DT |
DESCRIPTION tf:cars | NFN | tf | DT | NULL | The tf column family denotes that NFN is enabled for phrase queries, i.e., that it has tf (term frequency) column entries in the Primary Data Table (PDT) |
CARS/MODEL‑MAKE edge:SALES [Protobuf containing field name mapping]<br>CARS/MODEL‑YEAR edge:SALES [Protobuf containing field name mapping] | ETYPE + / + EREL | edge | EAV[0] | EPB0 | The edge column family denotes that one or more Edge Table keys were formed using the given Row and CQ identifiers and were derived from field names in the Primary Data Table (PDT) that are given in the Protocol Buffer value. These dictionary entries allow an Edge Table key to be mapped back to the source Data Object from which it was derived (e.g., via the EdgeEventQuery query logic) |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.
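To show how a query planner might read these entries, the sketch below derives a few traits for a field from the column families present for it. The `field_traits` function and the index-only field `MAKE_TOKEN` are hypothetical, and the entries are reduced to (row, column family, column qualifier) triples with values omitted.

```python
NB = "\x00"  # null byte delimiter

# A subset of the example entries above, plus one hypothetical index-only field.
ENTRIES = [
    ("DESCRIPTION", "e", "cars"),
    ("DESCRIPTION", "tf", "cars"),
    ("MAKE", "e", "cars"),
    ("MAKE", "i", f"cars{NB}20240101"),
    ("MAKE", "t", f"cars{NB}datawave.data.type.LcNoDiacriticsType"),
    ("MODEL", "e", "cars"),
    ("MODEL", "i", f"cars{NB}20240101"),
    ("MODEL", "ri", f"cars{NB}20240101"),
    ("MAKE_TOKEN", "i", f"cars{NB}20240101"),  # hypothetical: no 'e' entry
]

def field_traits(entries, field_name, data_type):
    """Summarize what the dictionary says about one field of one Data Type."""
    families = {
        cf for row, cf, cq in entries
        if row == field_name and cq.split(NB)[0] == data_type
    }
    return {
        "forward_indexed": "i" in families,
        "reverse_indexed": "ri" in families,
        "index_only": "e" not in families and bool(families & {"i", "ri"}),
        "phrase_queries": "tf" in families,
    }

print(field_traits(ENTRIES, "MODEL", "cars"))
```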
Edge Table
The edge table may represent one or more graphs, any of which may be composed of unidirectional and bidirectional edges depending on the user’s needs. A single edge key represents a unidirectional pair of Normalized Field Value vertices, which may be thought of as a source vertex and a sink vertex respectively. Bidirectional edges are therefore generated by simply creating a second key having the original source, sink, and other attributes reversed. Additional information may be encoded into an edge key as well, such as the relationship between the two vertices and the type of the edge.
E.g. | Row | Column Family | Column Qualifier | Value | Purpose |
---|---|---|---|---|---|
mustang\x00ford CARS/MODEL‑MAKE:20240101/SALES/RegionA/Dealership1 [Protobuf histogram object]<br>mustang\x001990 CARS/MODEL‑YEAR:20240101/SALES/RegionA/Dealership1 [Protobuf histogram object] | ENFV1 + NB + ENFV2 | ETYPE + / + EREL | YYYYMMDD + / + EAV | EPB1 | A unidirectional pair of vertices that expresses some relationship between a pair of Field Values originating from a Data Object, where the CQ and value may contain additional metadata about the related activity |
(none) | ENFV1 | STATS + NB + ACTIVITY + ETYPE | YYYYMMDD + NB + EAV | EPB2 | Provides activity stats for ENFV1 for the given day |
(none) | ENFV1 | STATS + NB + DURATION + ETYPE | YYYYMMDD + NB + EAV | EPB3 | Provides duration stats for ENFV1 for the given day |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.
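The edge key layout, and the way a second key makes an edge bidirectional, can be sketched as below. The function names are hypothetical, the relationship is written with a plain ASCII hyphen, and swapping the two halves of the relationship is only one interpretation of which "other attributes" get reversed.

```python
NB = "\x00"  # null byte delimiter

def edge_key(source, sink, edge_type, relationship, date, attributes):
    """Build (row, column family, column qualifier) for a unidirectional edge."""
    row = f"{source}{NB}{sink}"
    column_family = f"{edge_type}/{relationship}"
    column_qualifier = "/".join([date, *attributes])
    return (row, column_family, column_qualifier)

def reverse_edge_key(source, sink, edge_type, relationship, date, attributes):
    """Build the companion key that points from the sink back to the source."""
    # Assumption: reversing the edge swaps the vertices and the two halves
    # of the relationship, and leaves the date and attributes unchanged.
    left, right = relationship.split("-", 1)
    return edge_key(sink, source, edge_type, f"{right}-{left}", date, attributes)

args = ("mustang", "ford", "CARS", "MODEL-MAKE", "20240101",
        ["SALES", "RegionA", "Dealership1"])
forward = edge_key(*args)
backward = reverse_edge_key(*args)
```

Writing both `forward` and `backward` allows the graph to be traversed starting from either vertex with a simple row lookup.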
Date Index Table
By design, the primary data table permits at most one YYYYMMDD value to be encoded within the assigned row partition (i.e., Shard ID) of a given data object, and, by default, this date will serve as the basis for the date range criteria of any query that targets the object. However, a given data object may contain any number of date-related fields, any of which may be important to a user at query time for filtering purposes.
Indexing may be configured for such fields at ingest time. If date indexing is configured for a field, then its values along with the field name itself will be mapped by entries within this table to the partitions in the primary data table where the source objects are stored. Query clients can then leverage this table to enable date range filtering based on these dates, rather than on the dates encoded within the Shard IDs of the stored objects.
Note that the Date Type Name (DTN) column family identifier here is typically leveraged to provide semantic grouping of distinct field names from disparate datasets. For example, a DTN of “SALE_DATE” might be used to group the values of semantically equivalent fields such as “PURCHASE_DATE”, “RECEIPT_DATE”, “DATE_OF_SALE”, “DATE_PURCHASED”, etc.
Row | Column Family | Column Qualifier | Value | Purpose |
---|---|---|---|---|
YYYYMMDD1 | DTN | YYYYMMDD2 + NB + DT + NB + NFN | BitSet denoting the date's presence within specific shard partitions. E.g., bit 0 is YYYYMMDD_0, and so on | Maps the date field (given by NFN) and its value (given by YYYYMMDD1) to the SHARD partition(s) specified by the YYYYMMDD2 and BitSet values |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.
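The mapping from BitSet bits to shard partitions can be sketched as follows. For simplicity the bit set is modeled as a Python integer where bit N corresponds to partition N; how the BitSet is actually serialized in the value is not covered here, and both function names are hypothetical.

```python
def mark_shard(bits: int, partition: int) -> int:
    """Set the bit for a partition, e.g. partition 6 for shard YYYYMMDD_6."""
    return bits | (1 << partition)

def shards_from_bits(date_yyyymmdd: str, bits: int) -> list:
    """Expand a bit set into the shard IDs it refers to for the given day."""
    if bits < 0:
        raise ValueError("bit set must be non-negative")
    shards = []
    partition = 0
    while bits >> partition:
        if (bits >> partition) & 1:
            shards.append(f"{date_yyyymmdd}_{partition}")
        partition += 1
    return shards

# A date value seen in objects stored in partitions 0, 2, and 6 of 20240101.
bits = 0
for partition in (0, 2, 6):
    bits = mark_shard(bits, partition)
print(shards_from_bits("20240101", bits))
```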
Load Dates Table
The load dates table tracks the dates on which specific field names were loaded into specific tables via DataWave Ingest. This information may be leveraged internally for the purposes of query optimization, load date-based filtering for queries, etc.
Row | Column Family | Column Qualifier | Value | Purpose |
---|---|---|---|---|
NFN | FIELD_NAME + NB + Table Name | YYYYMMDD + NB + DT | Long integer encoded via Accumulo SummingCombiner (VARLEN) | Denotes that NFN from the given DT was loaded into the specified table on the given date |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.
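For readers inspecting raw values, the following sketch encodes and decodes a variable-length long. It assumes that the VARLEN encoding corresponds to the variable-length long format written by Hadoop's `WritableUtils.writeVLong`; that correspondence is an assumption here and should be checked against the Accumulo version in use. The Python function names are hypothetical.

```python
def encode_vlong(value: int) -> bytes:
    """Encode a signed 64-bit integer in the assumed variable-length layout."""
    if -112 <= value <= 127:
        return bytes([value & 0xFF])          # small values fit in one byte
    length = -112
    if value < 0:
        value = ~value                        # store the one's complement
        length = -120
    tmp = value
    while tmp != 0:                           # count the payload bytes needed
        tmp >>= 8
        length -= 1
    out = bytearray([length & 0xFF])          # marker byte carries sign and size
    nbytes = -(length + 120) if length < -120 else -(length + 112)
    for idx in range(nbytes, 0, -1):          # big-endian payload
        out.append((value >> ((idx - 1) * 8)) & 0xFF)
    return bytes(out)

def decode_vlong(data: bytes) -> int:
    """Decode a value produced by encode_vlong."""
    first = data[0] - 256 if data[0] > 127 else data[0]
    if first >= -112:
        return first
    negative = first < -120
    total = (-119 - first) if negative else (-111 - first)  # includes marker byte
    value = 0
    for byte in data[1:total]:
        value = (value << 8) | byte
    return ~value if negative else value

print(encode_vlong(14), encode_vlong(128))
```

Under this assumption, counts up to 127 occupy a single byte, and larger counts take a marker byte followed by only as many bytes as the value needs.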
Other Tables
Ingest Error Tables
The layouts associated with the four ingest error tables are identical to those listed above for the Primary Data Table, Global Index Tables, and Data Dictionary Table. The only difference is that the respective error tables here are meant to capture Data Objects that failed to be fully loaded during ingest due to one or more processing errors.
That is, these tables are intended to capture all successfully-processed NFN:FV pairs from their respective Data Objects, just as they would have appeared in the normal schema, along with supplemental key/value pairs related to the errors themselves. Since schema descriptions for the four primary data tables apply here as well, we describe below only the specific entries used to convey information about the error(s).
Row | Column Family | Column Qualifier | Value | Purpose |
---|---|---|---|---|
SHARD | DT + NB + UID | ERROR + NB + FV | NULL | Denotes the error category, where NFN = ERROR and FV is one of datawave.ingest.data.RawDataErrorNames |
SHARD | DT + NB + UID | JOB_ID + NB + FV | NULL | Identifies the job, where NFN = JOB_ID and FV is the MapReduce Job ID |
SHARD | DT + NB + UID | JOB_NAME + NB + FV | NULL | Identifies the job, where NFN = JOB_NAME and FV is the MapReduce Job Name |
SHARD | D | DT + NB + UID + NB + EVENT | DC | Conveys the raw content of the actual data object that failed via DC, using EVENT as the DVN |

The + symbol is used only as a visual delimiter above. It does not appear in the actual data.
Query Metrics Tables
The layouts associated with the four query metrics tables are identical to those listed above for the Primary Data Table, Global Index Tables, and Data Dictionary Table. The only difference here is that the respective query metrics tables are intended to persist information associated with user queries exclusively. They can be leveraged by users to gain insight into their own queries, and by administrators to gain insight into active and historical queries. Since schema descriptions for the primary data tables apply here as well, we describe below only the specific NFN and FV components that are used to represent a query metrics Data Object.
NFN | FV |
---|---|
AUTHORIZATIONS | User-specified list of Accumulo auths |
BEGIN_DATE | User-specified date range start value |
END_DATE | User-specified date range end value |
PARAMETERS | User-specified query parameters |
QUERY | User-specified query expression |
QUERY_LOGIC | User-specified query logic name |
QUERY_NAME | User-specified name assigned to the query |
CREATE_DATE | Query creation date/time (YYYYMMDD hhmmss) |
DOC_RANGES | Integer count of document-based table ranges generated by the query |
FI_RANGES | Integer count of Field Index based table ranges generated by the query |
HOST | Host name from which the query originated |
LAST_UPDATED | Date/time of most recent update to the query (YYYYMMDD hhmmss) |
LIFECYCLE | One of datawave.webservice.query.metric.BaseQueryMetric.Lifecycle |
NEXT_COUNT | Number of times user requested a page of results, or invoked the /Query/{query id}/next endpoint, for the query |
NUM_UPDATES | Integer count denoting the number of updates that occurred to the query |
PAGE_METRICS.[n] | Metrics metadata for a single page of results, where n is an integer denoting the nth page |
PLAN | The final (actual) query expression to be evaluated by DataWave |
QUERY_ID | UID generated by the system upon query creation, used in subsequent API calls |
QUERY_TYPE | Name of the internal Java class that was employed to encapsulate the query and its state |
CREATE_TIME | Amount of time elapsed (in ms) for the creation phase of the query |
ELAPSED_TIME | Total amount of time elapsed (in ms) between creation and the query's current state |
SEEK_COUNT | Number of Accumulo Iterator seeks required by the query |
SOURCE_COUNT | Number of Accumulo Iterator sources required by the query |
USER | User name associated with the query |
USER_DN | User's distinguished name from client certificate |
Terms and Definitions
Name | Definition | ID |
---|---|---|
Column Family | The portion of the Accumulo key representing the column family | CF |
Column Qualifier | The portion of the Accumulo key representing the column qualifier | CQ |
Document Column Family | Column family for document content. Literal string with the value of 'd' | D |
Document Content | Raw document content for the 'd' column entry. Gzip compressed, then base64 encoded | DC |
Data Object | A real-life object for which field names and field values can be derived. Typically, use of this term will denote an object stored within DataWave's primary data table | DO |
Data Type | An identifier that describes some category or facet of the Data Objects associated with it | DT |
Date Type Name | A user-specified identifier used as the CF within the Date Index table. E.g., this value may be used as a query parameter to indicate to the Query API that it should leverage the Date Index for date range filtering rather than the default, shard-based date range | DTN |
Document View Name | Identifier within the 'd' column qualifier, which categorizes the content | DVN |
Edge Attribute Vector | Three optional, user-supplied attributes which can be appended to the edge key's CQ. These attributes are typically delimited by '/' | EAV |
EAV Element #1 | The first element of the Edge Attribute Vector (EAV). Typically used to denote a subcategory or facet of the given Edge Type (ETYPE) value. This element is required (typically set in configuration), whereas the remaining EAV elements are optional | EAV[0] |
Source Edge Vertex (NFV) | A Field Value processed by a Normalizer that represents the SOURCE vertex in a unidirectional edge key | ENFV1 |
Sink Edge Vertex (NFV) | A Field Value processed by a Normalizer that represents the SINK vertex in a unidirectional edge key | ENFV2 |
Edge Metadata Protocol Buffer | Protocol Buffer denoting the Data Object (DO) field mappings that were used in the creation of one or more Edge Table keys | EPB0 |
Edge Table Protocol Buffer | Protocol Buffer containing one LONG which is the count of the edge for the day and an INT32 bitmask for hour of the day | EPB1 |
Edge STATS‑ACTIVITY Protocol Buffer | Protocol Buffer containing one LONG[24], which holds the count for each hour of the day | EPB2 |
Edge STATS‑DURATION Protocol Buffer | Protocol Buffer containing histogram as LONG[7]: [ <10 sec, 10-30 sec, 30-60 sec, 1-5 min, 5-10 min, 10-30 min, >30 min ] | EPB3 |
Edge Relationship | Two user-provided string values delimited by '-' used as part of the edge key CF, for describing the relationship between the respective SOURCE and SINK vertices | EREL |
Edge Type | User-provided string value used as part of the edge key CF, for categorizing edges | ETYPE |
Field Index Column Family | A column family prefix for field index keys. Literal string with the value of 'fi' | FI |
Field Name | The name of a field in a Data Object | FN |
Field Value | The raw value associated with a particular Field Name | FV |
Normalizer | An object that transforms data so that it can be sorted lexicographically | N |
Null Byte Character | A character with the hex value of \x00, used as a delimiter | NB |
Normalizer Class Name | The fully-qualified name of the Normalizer (N) Java class used for indexing purposes | NCN |
Normalized Field Name | A Field Name (FN) transformed to be compatible with the query code | NFN |
Normalized Field Value | A Field Value (FV) that has been processed/transformed by a Normalizer (N) | NFV |
Normalized Term | An individual term within a document transformed to be compatible with the query code | NT |
Null Value | No value is assigned for keys of this type | NULL |
Protocol Buffer | Google Protocol Buffer | PB |
Reversed Normalized Field Value | A Field Value (FV) that has been processed/transformed by a Normalizer (N) and then reversed | RNFV |
Shard ID | The partition identifier in the form of YYYYMMDD_N, where N is an integer denoting a specific sub-partition within the given day | SHARD |
Term Frequency Column Family | Column family for term frequency keys. Literal string with the value of 'tf' | TF |
Term Frequency Protocol Buffer | Google Protocol Buffer containing a list of word offsets for the term in the document | TFPB |
Unique Identifier | An internal identifier generated for the Data Object that is typically a hash of the raw source data. Guaranteed to be deterministic and unique within a given partition and Data Type | UID |
Uid.List Protocol Buffer | A protocol buffer object that contains the number of occurrences of the field name/value pair, along with a bounded list of UIDs | ULPB |
Vertex | A single node within a graph | VERT |
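The EPB1 hour-of-day bitmask and the EPB3 duration histogram defined above can be illustrated with a short sketch. The assumptions here are that bit N of the bitmask stands for hour N, and that each boundary value falls into the higher bucket (so exactly 10 seconds counts as 10-30 sec); neither detail is stated in the definitions, and the function names are hypothetical.

```python
import bisect

# Upper bounds in seconds for the first six EPB3 buckets; anything at or
# beyond 30 minutes lands in the seventh bucket.
DURATION_BOUNDS_SEC = [10, 30, 60, 300, 600, 1800]

def duration_bucket(seconds: float) -> int:
    """Return the histogram index (0 to 6) for an activity duration."""
    return bisect.bisect_right(DURATION_BOUNDS_SEC, seconds)

def duration_histogram(durations):
    """Count durations into the seven-bucket layout described for EPB3."""
    histogram = [0] * 7
    for seconds in durations:
        histogram[duration_bucket(seconds)] += 1
    return histogram

def hour_bitmask(hours):
    """Fold the hours (0 to 23) in which an edge was seen into one integer."""
    mask = 0
    for hour in hours:
        if not 0 <= hour <= 23:
            raise ValueError("hour must be between 0 and 23")
        mask |= 1 << hour
    return mask

print(duration_histogram([5, 12, 45, 200, 450, 1200, 4000]))
print(bin(hour_bitmask([0, 13, 13, 23])))
```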