Introduction
DataWave will typically accept query expressions conforming to either JEXL syntax (the default) or a modified Lucene syntax. As for JEXL, DataWave supports a subset of the language elements in the Apache Commons JEXL grammar and also provides several custom JEXL functions. DataWave has enabled support for Lucene expressions as a convenience and will provide equivalent functionality to JEXL, except where noted below.
JEXL Query Syntax
Supported JEXL Operators
- ==
- !=
- <
- <=
- >
- >=
- =~ (regex)
- !~ (negative regex)
- ||, or
- &&, and
Custom JEXL Functions
- Content Functions
  - phrase()
  - adjacent()
  - within()
- Geospatial Functions
  - within_bounding_box()
  - within_circle()
  - intersects_bounding_box()
  - intersects_radius_km()
  - contains()
  - covers()
  - coveredBy()
  - crosses()
  - intersects()
  - overlaps()
  - within()
- Utility Functions
  - between()
  - length()
A Note About Field Names
Field names in DataWave are required to conform to JEXL naming conventions. That is, a field name must contain only alphanumeric characters and underscores, and the name must begin with either an alphabetic character or an underscore.
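The naming rule above can be captured in a single regular expression. The following is a minimal sketch for client-side checking only: the helper name `is_valid_field_name` is hypothetical (it is not part of DataWave), and the pattern assumes ASCII letters and digits.

```python
import re

# Letters, digits, and underscores only; the first character may not be a digit.
# Assumption: "alphanumeric" is read here as ASCII letters and digits.
_FIELD_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_field_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rule described above."""
    return _FIELD_NAME.fullmatch(name) is not None


print(is_valid_field_name("FIELD_1"))   # conforms
print(is_valid_field_name("1FIELD"))    # does not conform: starts with a digit
print(is_valid_field_name("FIELD-1"))   # does not conform: contains a hyphen
```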
Structured vs Unstructured Objects
By extension, this would seem to imply that a given data object in DataWave, which is largely just a collection of field name/value pairs, is strictly a flat data structure, given that there is no apparent way to encode hierarchical structure within a JEXL field name.
While DataWave does adhere to 'flat object' semantics in most respects, it does allow the natural hierarchical structure of a field to be encoded and retained during data ingest, if needed. In fact, DataWave Query clients can retrieve such objects by leveraging both the flattened view and the hierarchical view of an object via their query expressions.
See the section on hierarchical data below for more information.
JEXL Unfielded Queries
JEXL is an expression language and not a text-query language per se, so JEXL doesn't natively support the notion of an 'unfielded' query, that is, a query expression containing only search terms and no specific field names to search within.
As a convenience, DataWave does provide support for unfielded JEXL queries, at least for the subset of query logic types that are designed to retrieve objects from the primary data table. To achieve this with JEXL, the user must add the internally-recognized pseudo field, _ANYFIELD_, to the query in order for it to pass syntax validation. See the examples below for usage.
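As an illustration, a client might build an unfielded JEXL expression from a list of search terms as sketched below. The helper name `jexl_unfielded` is hypothetical, and the backslash escaping of quotes inside string literals is an assumption about JEXL string handling rather than something stated in this document.

```python
def jexl_unfielded(terms, operator="&&"):
    """Join search terms into a JEXL expression using the _ANYFIELD_ pseudo field."""
    terms = list(terms)
    if not terms:
        raise ValueError("at least one search term is required")
    if operator not in ("&&", "||"):
        raise ValueError("operator must be '&&' or '||'")
    clauses = []
    for term in terms:
        # Assumption: backslash-escape backslashes and single quotes so the
        # term remains a single JEXL string literal.
        literal = term.replace("\\", "\\\\").replace("'", "\\'")
        clauses.append(f"_ANYFIELD_ == '{literal}'")
    return f" {operator} ".join(clauses)


print(jexl_unfielded(["SomeValue"]))
print(jexl_unfielded(["AAA", "BBB"]))
print(jexl_unfielded(["CCC", "DDD"], operator="||"))
```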
Lucene Query Syntax
DataWave provides a slightly modified Lucene syntax, such that the NOT operator is not unary, AND operators are not fuzzy, and the implicit operator is AND instead of OR.
Our Lucene syntax has the following form:
ModClause ::= DisjQuery [NOT DisjQuery]*
DisjQuery ::= ConjQuery [ OR ConjQuery ]*
ConjQuery ::= Query [ AND Query ]*
Query ::= Clause [ Clause ]*
Clause ::= Term | [ ModClause ]
Term ::=
    field:selector |
    field:selec* |
    field:selec*or |
    field:*lector |
    field:selec?or |
    selector | (can use wildcards)
    field:[begin TO end] |
    field:{begin TO end} |
    "quick brown dog" |
    "quick brown dog"~20 |
    #FUNCTION(ARG1, ARG2)
Note that to search for a punctuation character within a term, you need to escape that character with a backslash.
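The escaping rule can be sketched as a small helper. The character set below is the one reserved by the classic Lucene query parser; whether DataWave's modified parser reserves exactly the same set is an assumption, and the name `escape_lucene_term` is hypothetical.

```python
# Characters treated as special by the classic Lucene query parser.
# Assumption: DataWave's modified Lucene syntax reserves a similar set.
_SPECIAL = set('+-&|!(){}[]^"~*?:\\/')


def escape_lucene_term(term: str) -> str:
    """Prefix each special character in `term` with a backslash."""
    return "".join("\\" + ch if ch in _SPECIAL else ch for ch in term)


print(escape_lucene_term("a:b(c)"))
print(escape_lucene_term("1+1"))
```

Because `*` and `?` are escaped as well, a helper like this is only appropriate for literal terms, not for selectors that are meant to carry wildcards.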
JEXL and Lucene Examples
Example Queries
JEXL Query | Lucene Query |
---|---|
_ANYFIELD_ == 'SomeValue' | SomeValue |
(_ANYFIELD_ == 'AAA' && _ANYFIELD_ == 'BBB') && (_ANYFIELD_ == 'CCC' || _ANYFIELD_ == 'DDD') | (AAA BBB) (CCC OR DDD) |
FIELDNAME == 'SomeValue' | FIELDNAME:SomeValue |
FIELDNAME =~ 'SomeVal.*' | FIELDNAME:SomeVal* |
(FIELD1 == 'AAA' && FIELD2 == 'BBB') && (FIELD3 == 'CCC' || FIELD3 == 'DDD') | (FIELD1:AAA FIELD2:BBB) (FIELD3:CCC OR FIELD3:DDD) |
TEXT_FIELD == 'TextValue' && f:between(NUMBER_FIELD, 1, 10) | TEXT_FIELD:TextValue AND NUMBER_FIELD:[1 TO 10] |
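The wildcard row in the table above pairs `FIELDNAME:SomeVal*` with `FIELDNAME =~ 'SomeVal.*'`. A minimal sketch of that correspondence follows; it is not DataWave's own translation logic, the function names are hypothetical, and backslash-escaped wildcards are not handled.

```python
import re


def wildcard_to_regex(selector: str) -> str:
    """Translate a Lucene-style wildcard selector into a regular expression."""
    parts = []
    for ch in selector:
        if ch == "*":
            parts.append(".*")   # any run of characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # literal character
    return "".join(parts)


def lucene_wildcard_to_jexl(field: str, selector: str) -> str:
    """Build the JEXL regex comparison for a fielded wildcard term."""
    return f"{field} =~ '{wildcard_to_regex(selector)}'"


print(lucene_wildcard_to_jexl("FIELDNAME", "SomeVal*"))
print(lucene_wildcard_to_jexl("FIELDNAME", "selec?or"))
```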
Support for Hierarchical Data
DataWave also allows for queries that leverage the hierarchical context of structured data types, provided that the structure of the data is preserved during ingest. This requires that a special “grouping” notation be applied during ingest to any nested fields that are parsed.
For example, consider the XML objects below:
Data Object 1
<foo>
<parent>
<child>
<field1>A</field1>
</child>
<child>
<field2>B</field2>
</child>
</parent>
</foo>
Data Object 2
<bar>
<field1>A</field1>
<field2>B</field2>
</bar>
At ingest time, the field name/value pairs above may be stored logically in Accumulo as follows:
{FIELD NAME}.{GROUPING CONTEXT} = {VALUE}
Here, the grouping context preserves the fully-qualified path of the name/value pair and also conveys its relative position within the hierarchy.
Thus, given the XML above, we could flatten the data objects during ingest, transforming the name/value pairs as follows:
# Data Object 1 flattened, including grouping context
FIELD1.FOO_0.PARENT_0.CHILD_0.FIELD1_0 = A
FIELD2.FOO_0.PARENT_0.CHILD_1.FIELD2_0 = B
-------------------------
# Data Object 2 flattened, including grouping context
FIELD1.BAR_0.FIELD1_0 = A
FIELD2.BAR_0.FIELD2_0 = B
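The flattening shown above can be sketched in a few lines. This is an illustration of the naming scheme only, not DataWave's ingest code: the function names are hypothetical, and the index of each element is assumed to be its position among same-named siblings.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def flatten(xml_text: str):
    """Return (FIELD.GROUPING_CONTEXT, value) pairs for every leaf element."""
    root = ET.fromstring(xml_text)
    pairs = []

    def walk(elem, path):
        children = list(elem)
        if not children:
            value = (elem.text or "").strip()
            if value:
                pairs.append((f"{elem.tag.upper()}.{'.'.join(path)}", value))
            return
        seen = defaultdict(int)  # position of each child among same-named siblings
        for child in children:
            index = seen[child.tag]
            seen[child.tag] += 1
            walk(child, path + [f"{child.tag.upper()}_{index}"])

    walk(root, [f"{root.tag.upper()}_0"])
    return pairs


def base_field_name(key: str) -> str:
    """Drop the grouping context, keeping only the field name."""
    return key.split(".", 1)[0]


for key, value in flatten("<bar><field1>A</field1><field2>B</field2></bar>"):
    print(f"{key} = {value}  (field name: {base_field_name(key)})")
```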
As a result, the objects are flattened into simple maps with each consisting of two fields, ‘FIELD1’ and ‘FIELD2’. By default, the grouping context (i.e., all characters beyond the field name itself) will be ignored by the query API. Therefore, the following simple query could be used to return both objects as distinct search results:
JEXL Query | Lucene Query |
---|---|
FIELD1 == 'A' && FIELD2 == 'B' | FIELD1:A FIELD2:B |
However, if the objects originated from distinct XML schemas having entirely different semantics for their respective fields, then we might not want both to appear in our search results. To disambiguate the two objects, we can use the following function:
JEXL Function | Lucene Function |
---|---|
grouping:matchesInGroupLeft(F1, 'V1', F2, 'V2', ..., Fn, 'Vn', INTEGER) | #MATCHES_IN_GROUP_LEFT(F1, 'V1', F2, 'V2', ..., Fn, 'Vn', INTEGER) |
The INTEGER parameter denotes the level in the tree where the matching field name/value pairs must exist in order to constitute a match. Its values are defined as follows:
- 0 : Fields are siblings / same parent element
- 1 : Fields are cousins / same grandparent element
- 2 : Fields are 2nd cousins / same great-grandparent element
- And so on…
For example, to retrieve only Data Object 1 above, we might use the following query:
JEXL Function | Lucene Function |
---|---|
FIELD1 == 'A' && grouping:matchesInGroupLeft(FIELD1, 'A', FIELD2, 'B', 1) | FIELD1:A AND #MATCHES_IN_GROUP_LEFT(FIELD1, 'A', FIELD2, 'B', 1) |
Likewise, to return only Data Object 2, we could do the following:
JEXL Function | Lucene Function |
---|---|
FIELD1 == 'A' && grouping:matchesInGroupLeft(FIELD1, 'A', FIELD2, 'B', 0) | FIELD1:A AND #MATCHES_IN_GROUP_LEFT(FIELD1, 'A', FIELD2, 'B', 0) |
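To make the level parameter concrete, the toy model below compares the grouping contexts of the matching fields after stepping up the tree by the requested number of levels. It is written only to reproduce the two examples above and is not DataWave's implementation; in particular, the rule that an object does not match when its tree is too shallow for the requested level is an assumption.

```python
def ancestor_path(key: str, level: int):
    """Grouping path `level` steps above the field's parent, or None if too shallow."""
    tokens = key.split(".")[1:]       # drop the field name, keep the grouping context
    keep = len(tokens) - 1 - level    # drop the field's own token, then `level` more
    if keep < 1:
        return None                   # assumption: no match when the tree is too shallow
    return tuple(tokens[:keep])


def matches_in_group_left(entries, pairs, level):
    """entries: (key, value) pairs of one object; pairs: (field, value) to match."""
    candidate_sets = []
    for field, value in pairs:
        paths = {
            ancestor_path(key, level)
            for key, val in entries
            if key.split(".")[0] == field and val == value
        }
        paths.discard(None)
        candidate_sets.append(paths)
    # Every requested pair must be found under one common ancestor path.
    return bool(candidate_sets) and bool(set.intersection(*candidate_sets))


object1 = [("FIELD1.FOO_0.PARENT_0.CHILD_0.FIELD1_0", "A"),
           ("FIELD2.FOO_0.PARENT_0.CHILD_1.FIELD2_0", "B")]
object2 = [("FIELD1.BAR_0.FIELD1_0", "A"),
           ("FIELD2.BAR_0.FIELD2_0", "B")]
wanted = [("FIELD1", "A"), ("FIELD2", "B")]

for level in (0, 1):
    print(level, matches_in_group_left(object1, wanted, level),
          matches_in_group_left(object2, wanted, level))
```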
Custom Lucene Functions
DataWave has augmented Lucene to provide support for several JEXL features that were not supported natively. The table below maps the JEXL operators to the supported Lucene syntax:
JEXL Operator | Lucene Operator |
---|---|
filter:includeRegex(field, regex) | #INCLUDE(field, regex) |
filter:excludeRegex(field, regex) | #EXCLUDE(field, regex) |
filter:includeRegex(field1, regex1) <op> filter:includeRegex(field2, regex2) ... | #INCLUDE(op, field1, regex1, field2, regex2 ...) where op is 'or' or 'and' |
filter:excludeRegex(field1, regex1) <op> filter:excludeRegex(field2, regex2) ... | #EXCLUDE(op, field1, regex1, field2, regex2 ...) where op is 'or' or 'and' |
filter:isNull(field) | #ISNULL(field) |
not(filter:isNull(field)) | #ISNOTNULL(field) |
filter:occurrence(field,operator,count) | #OCCURRENCE(field,operator,count) |
filter:timeFunction(field1,field2,operator,equivalence,goal) | #TIME_FUNCTION(field1,field2,operator,equivalence,goal) |
Notes:
- None of these filter functions can be applied against index-only fields.
- The occurrence function is used to count the number of instances of a field in the event. Valid operators are '==' (or '='),'>','>=','<','<=', and '!='.
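The occurrence check described in the notes above can be modeled as a count followed by a comparison. The sketch below is a toy model over an in-memory list of field/value pairs, not DataWave's implementation; the function name and the sample event are hypothetical.

```python
import operator

# The operators listed in the note above, mapped to Python comparisons.
_OPS = {
    "==": operator.eq, "=": operator.eq,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "!=": operator.ne,
}


def occurrence(event, field, op, count):
    """event: list of (field, value) pairs belonging to one event."""
    if op not in _OPS:
        raise ValueError(f"unsupported operator: {op!r}")
    actual = sum(1 for name, _ in event if name == field)
    return _OPS[op](actual, count)


# Hypothetical event with two COLOR values and one SIZE value.
event = [("COLOR", "red"), ("COLOR", "blue"), ("SIZE", "L")]
print(occurrence(event, "COLOR", "==", 2))
print(occurrence(event, "SIZE", ">", 1))
```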
Basic Geospatial Functions
Some geo functions are also supplied that may prove useful, although the within_bounding_box function may instead be expressed as a simple range comparison (i.e., LAT_LON_USER <= <lat1>_<lon1> and LAT_LON_USER >= <lat2>_<lon2>).
JEXL Operator | Lucene Operator |
---|---|
geo:within_bounding_box(latLonField, lowerLeft, upperRight) | #GEO(bounding_box, latLonField, 'lowerLeft', 'upperRight') |
geo:within_bounding_box(lonField, latField, minLon, minLat, maxLon, maxLat) | #GEO(bounding_box, lonField, latField, minLon, minLat, maxLon, maxLat) |
geo:within_circle(latLonField, center, radius) | #GEO(circle, latLonField, center, radius) |
Notes:
- All lat and lon values are in decimal degrees.
- The lowerLeft, upperRight, and center are of the form <lat>_<lon> and must be surrounded by single quotes.
- The radius is in decimal degrees as well.
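The formatting rules in these notes can be sketched as string builders for the single-field forms of the geo functions. The helper names are hypothetical, and the range checks on latitude and longitude are an added safeguard rather than documented DataWave behavior.

```python
def lat_lon(lat: float, lon: float) -> str:
    """Format a coordinate pair as <lat>_<lon>, in decimal degrees."""
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")
    return f"{lat}_{lon}"


def within_bounding_box(field, lower_left, upper_right, lucene=False):
    """Build the bounding-box expression; corners are (lat, lon) tuples."""
    ll, ur = lat_lon(*lower_left), lat_lon(*upper_right)
    if lucene:
        return f"#GEO(bounding_box, {field}, '{ll}', '{ur}')"
    return f"geo:within_bounding_box({field}, '{ll}', '{ur}')"


def within_circle(field, center, radius_degrees, lucene=False):
    """Build the circle expression; center is a (lat, lon) tuple."""
    c = lat_lon(*center)
    if lucene:
        return f"#GEO(circle, {field}, '{c}', {radius_degrees})"
    return f"geo:within_circle({field}, '{c}', {radius_degrees})"


print(within_bounding_box("LAT_LON_USER", (40.69, -74.1), (40.75, -73.9)))
print(within_circle("LAT_LON_USER", (40.71, -74.01), 0.5, lucene=True))
```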
GeoWave Functions
GeoWave is an optional component that provides the following functions when enabled:
JEXL Operator | Lucene Operator |
---|---|
geowave:intersects_bounding_box(geometryField, westLon, eastLon, southLat, northLat) | #INTERSECTS_BOUNDING_BOX(geometryField, westLon, eastLon, southLat, northLat) |
geowave:intersects_radius_km(geometryField, centerLon, centerLat, radiusKm) | #INTERSECTS_RADIUS_KM(geometryField, centerLon, centerLat, radiusKm) |
geowave:contains(geometryField, Well-Known Text) | #CONTAINS(geometryField, Well-Known Text) |
geowave:covers(geometryField, Well-Known Text) | #COVERS(geometryField, Well-Known Text) |
geowave:coveredBy(geometryField, Well-Known Text) | #COVERED_BY(geometryField, Well-Known Text) |
geowave:crosses(geometryField, Well-Known Text) | #CROSSES(geometryField, Well-Known Text) |
geowave:intersects(geometryField, Well-Known Text) | #INTERSECTS(geometryField, Well-Known Text) |
geowave:overlaps(geometryField, Well-Known Text) | #OVERLAPS(geometryField, Well-Known Text) |
geowave:within(geometryField, Well-Known Text) | #WITHIN(geometryField, Well-Known Text) |
Notes:
- All lat and lon values are in decimal degrees.
- The lowerLeft, upperRight, and center are of the form <lat>_<lon> and must be surrounded by single quotes.
- Geometry is represented according to the Open Geospatial Consortium standard for Well-Known Text, with x as longitude and y as latitude, both in decimal degrees. For example, a point at New York can be represented as 'POINT (-74.01 40.71)' and a box at New York can be represented as 'POLYGON((-74.1 40.75, -74.1 40.69, -73.9 40.69, -73.9 40.75, -74.1 40.75))'.
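The Well-Known Text strings in the note above can be generated from plain coordinates, which helps keep the longitude-first ordering straight. The sketch below uses hypothetical helper names and a hypothetical field name, GEO_FIELD; it only builds strings and does not validate geometry.

```python
def point_wkt(lon: float, lat: float) -> str:
    """WKT point; x is longitude and y is latitude."""
    return f"POINT ({lon} {lat})"


def bbox_wkt(west: float, south: float, east: float, north: float) -> str:
    """WKT polygon for a bounding box, closed by repeating the first vertex."""
    ring = [(west, north), (west, south), (east, south), (east, north), (west, north)]
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON(({coords}))"


def geowave_intersects(field: str, wkt: str) -> str:
    """JEXL form of the intersects function with a quoted WKT argument."""
    return f"geowave:intersects({field}, '{wkt}')"


print(point_wkt(-74.01, 40.71))
print(bbox_wkt(-74.1, 40.69, -73.9, 40.75))
print(geowave_intersects("GEO_FIELD", point_wkt(-74.01, 40.71)))
```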
Date Functions
Some additional functions are supplied to handle dates more smoothly. The intent is that the need for these functions may go away in future versions. In the Lucene forms below, the keywords between, after, and before are literal; other parameters are substituted with appropriate values:
JEXL Operator | Lucene Operator |
---|---|
filter:betweenDates(field, start date, end date) | #DATE(field, start date, end date) or #DATE(field, between, start date, end date) |
filter:betweenDates(field, start date, end date, start/end date format) | #DATE(field, start date, end date, start/end date format) or #DATE(field, between, start date, end date, start/end date format) |
filter:betweenDates(field, field date format, start date, end date, start/end date format) | #DATE(field, field date format, start date, end date, start/end date format) or #DATE(field, between, field date format, start date, end date, start/end date format) |
filter:afterDate(field, date) | #DATE(field, after, date) |
filter:afterDate(field, date, date format) | #DATE(field, after, date, date format) |
filter:afterDate(field, field date format, date, date format) | #DATE(field, after, field date format, date, date format) |
filter:beforeDate(field, date) | #DATE(field, before, date) |
filter:beforeDate(field, date, date format) | #DATE(field, before, date, date format) |
filter:beforeDate(field, field date format, date, date format) | #DATE(field, before, field date format, date, date format) |
filter:betweenLoadDates(LOAD_DATE, start date, end date) | #LOADED(start date, end date) or #LOADED(between, start date, end date) |
filter:betweenLoadDates(LOAD_DATE, start date, end date, start/end date format) | #LOADED(start date, end date, start/end date format) or #LOADED(between, start date, end date, start/end date format) |
filter:afterLoadDate(LOAD_DATE, date) | #LOADED(after, date) |
filter:afterLoadDate(LOAD_DATE, date, date format) | #LOADED(after, date, date format) |
filter:beforeLoadDate(LOAD_DATE, date) | #LOADED(before, date) |
filter:beforeLoadDate(LOAD_DATE, date, date format) | #LOADED(before, date, date format) |
filter:timeFunction(DOWNTIME, UPTIME, '-', '>', 2522880000000L) | #TIME_FUNCTION(DOWNTIME, UPTIME, '-', '>', '2522880000000L') |
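The goal value in the timeFunction row is easier to read once converted: 2522880000000 milliseconds is 80 years of 365 days. The sketch below shows that arithmetic, together with a toy evaluation that reads the function as "apply the operator to the two field values, then compare the result to the goal". That reading, the helper names, and the set of accepted comparison symbols are assumptions, not documented behavior.

```python
import operator

MS_PER_DAY = 24 * 60 * 60 * 1000


def years_to_millis(years: int, days_per_year: int = 365) -> int:
    """Convert a span in years to milliseconds, ignoring leap days."""
    return years * days_per_year * MS_PER_DAY


# Assumption: only the '-' operator shown in the table is modeled here.
_ARITHMETIC = {"-": operator.sub}
# Assumption: these comparison symbols are accepted as the equivalence argument.
_COMPARE = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def time_function(field1_ms, field2_ms, op, equivalence, goal_ms):
    """Toy model: (field1 op field2) equivalence goal, all in milliseconds."""
    return _COMPARE[equivalence](_ARITHMETIC[op](field1_ms, field2_ms), goal_ms)


print(years_to_millis(80))
print(time_function(years_to_millis(81), 0, "-", ">", years_to_millis(80)))
```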
Notes:
- None of these filter functions can be applied against index-only fields.
- Between functions are inclusive, and the other functions are exclusive of the entered dates.
- Date formats must be entered using Java SimpleDateFormat pattern syntax.
- If no date format is specified, then the following list of date formats will be tried:
  - yyyyMMdd:HH:mm:ss:SSSZ
  - yyyyMMdd:HH:mm:ss:SSS
  - EEE MMM dd HH:mm:ss zzz yyyy
  - d MMM yyyy HH:mm:ss 'GMT'
  - yyyy-MM-dd HH:mm:ss.SSS Z
  - yyyy-MM-dd HH:mm:ss.SSS
  - yyyy-MM-dd HH:mm:ss.S Z
  - yyyy-MM-dd HH:mm:ss.S
  - yyyy-MM-dd HH:mm:ss Z
  - yyyy-MM-dd HH:mm:ssz
  - yyyy-MM-dd HH:mm:ss
  - yyyyMMdd HHmmss
  - yyyy-MM-dd'T'HH'|'mm
  - yyyy-MM-dd'T'HH':'mm':'ss'.'SSS'Z'
  - yyyy-MM-dd'T'HH':'mm':'ss'Z'
  - MM'/'dd'/'yyyy HH':'mm':'ss
  - E MMM d HH:mm:ss z yyyy
  - E MMM d HH:mm:ss Z yyyy
  - yyyyMMdd_HHmmss
  - yyyy-MM-dd
  - MM/dd/yyyy
  - yyyy-MMMM
  - yyyy-MMM
  - yyyyMMddHHmmss
  - yyyyMMddHHmm
  - yyyyMMddHH
  - yyyyMMdd
- A special date format of 'e' can be supplied to mean milliseconds since epoch.
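The fallback behavior described in these notes can be sketched as a loop that tries each candidate format until one parses. The example below is a loose Python analogue, not DataWave's Java code: it covers only four of the listed patterns, hand-translated to strptime directives, and it assumes that formats are tried in listed order and that dates without a zone are UTC.

```python
from datetime import datetime, timezone

# Hand-translated strptime equivalents of a few of the patterns listed above.
_FALLBACK_FORMATS = [
    "%Y-%m-%d %H:%M:%S",   # yyyy-MM-dd HH:mm:ss
    "%Y-%m-%d",            # yyyy-MM-dd
    "%m/%d/%Y",            # MM/dd/yyyy
    "%Y%m%d",              # yyyyMMdd
]


def parse_date(text: str, fmt: str = None) -> datetime:
    """Parse `text` with `fmt` (a strptime format here), or try the fallbacks."""
    if fmt == "e":
        # Special format: milliseconds since the epoch.
        return datetime.fromtimestamp(int(text) / 1000, tz=timezone.utc)
    formats = [fmt] if fmt else _FALLBACK_FORMATS
    for candidate in formats:
        try:
            # Assumption: values without a zone are interpreted as UTC.
            return datetime.strptime(text, candidate).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"no known date format matches {text!r}")


print(parse_date("2024-01-15 10:30:00"))
print(parse_date("01/15/2024"))
print(parse_date("1000", "e"))
```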