DataWave 6.x Docs - Query Syntax Guide

This page gives an overview of DataWave's JEXL and Lucene query syntax options.

Introduction

DataWave will typically accept query expressions conforming to either JEXL syntax (the default) or a modified Lucene syntax. As for JEXL, DataWave supports a subset of the language elements in the Apache Commons JEXL grammar and also provides several custom JEXL functions. DataWave has enabled support for Lucene expressions as a convenience and will provide equivalent functionality to JEXL, except where noted below.

JEXL Query Syntax

Supported JEXL Operators

==
!=
<
<=
>
>=
=~ (regex)
!~ (negative regex)
|| , or
&& , and

Custom JEXL Functions

Content Functions
- phrase()
- adjacent()
- within()
Geospatial Functions
- within_bounding_box()
- within_circle()
- intersects_bounding_box()
- intersects_radius_km()
- contains()
- covers()
- coveredBy()
- crosses()
- intersects()
- overlaps()
- within()
Utility Functions
- between()
- length()

A Note About Field Names

Field names in DataWave are required to conform to JEXL naming conventions. That is, a field name must contain only alphanumeric characters and underscores, and the name must begin with either an alphabetic character or an underscore.
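
The naming rule above can be checked mechanically. The following is a minimal sketch, not part of DataWave itself: the helper name is_valid_field_name is hypothetical, and restricting "alphanumeric" to ASCII letters and digits is an assumption.

```python
import re

# Pattern derived from the rule stated above: letters, digits, and
# underscores only, beginning with a letter or an underscore.
# Treating "alphanumeric" as ASCII-only is an assumption of this sketch.
FIELD_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_field_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rule described above."""
    return FIELD_NAME_PATTERN.fullmatch(name) is not None
```

Under this sketch, FIELD_1 and _ANYFIELD_ pass, while 1FIELD and FIELD.NAME are rejected.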

Structured vs Unstructured Objects

By extension, this would seem to imply that a given data object in DataWave, which is largely just a collection of field name/value pairs, is strictly a flat data structure, given that there is no apparent way to encode hierarchical structure within a JEXL field name.

While DataWave does adhere to 'flat object' semantics in most respects, it does allow the natural hierarchical structure of a field to be encoded and retained during data ingest, if needed. In fact, DataWave Query clients can retrieve such objects by leveraging both the flattened view and the hierarchical view of an object via their query expressions.

See the section on hierarchical data below for more information.

JEXL Unfielded Queries

JEXL is an expression language and not a text-query language per se, so JEXL doesn't natively support the notion of an 'unfielded' query, that is, a query expression containing only search terms and no specific field names to search within.

As a convenience, DataWave does provide support for unfielded JEXL queries, at least for the subset of query logic types that are designed to retrieve objects from the primary data table. To achieve this with JEXL, the user must add the internally-recognized pseudo field, _ANYFIELD_, to the query in order for it to pass syntax validation. See the examples below for usage.
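
As an illustration of the pseudo field, a client could assemble an unfielded JEXL expression from plain search terms. This is only a sketch: the helper any_field_query is hypothetical, and it rejects terms containing single quotes rather than guessing at quote-escaping rules.

```python
ANY_FIELD = "_ANYFIELD_"  # pseudo field named in the text above


def any_field_query(terms, operator="&&"):
    """Build a JEXL expression that matches each term against _ANYFIELD_."""
    if operator not in ("&&", "||"):
        raise ValueError("operator must be '&&' or '||'")
    clauses = []
    for term in terms:
        if "'" in term:
            # Quote handling is deliberately out of scope for this sketch.
            raise ValueError("terms containing single quotes are not supported")
        clauses.append(f"{ANY_FIELD} == '{term}'")
    return f" {operator} ".join(clauses)
```

For instance, any_field_query(["AAA", "BBB"]) yields _ANYFIELD_ == 'AAA' && _ANYFIELD_ == 'BBB'.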

Lucene Query Syntax

DataWave provides a slightly modified Lucene syntax, such that the NOT operator is not unary, AND operators are not fuzzy, and the implicit operator is AND instead of OR.

Our Lucene syntax has the following form:

    ModClause ::= DisjQuery [NOT DisjQuery]*
    DisjQuery ::= ConjQuery [ OR ConjQuery ]*
    ConjQuery ::= Query [ AND Query ]*
    Query ::= Clause [ Clause ]*
    Clause ::= Term | [ ModClause ]
    Term ::=
        field:selector |
        field:selec* |
        field:selec*or |
        field:*lector |
        field:selec?or |
        selector | (can use wildcards)
        field:[begin TO end] |
        field:{begin TO end} |
        "quick brown dog" |
        "quick brown dog"~20 |
        #FUNCTION(ARG1, ARG2)

Note that to search for a punctuation character within a term, you need to escape it with a backslash.
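
A small helper can apply the backslash escaping described in the note. This is a sketch under assumptions: the text says only "punctuation", so the character set below is borrowed from the special characters of classic Lucene query syntax, and the name escape_lucene_term is hypothetical.

```python
# Assumed set of characters to escape. The text above only says
# "punctuation"; this list mirrors classic Lucene special characters.
LUCENE_SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_lucene_term(term: str) -> str:
    """Prefix each assumed special character in `term` with a backslash."""
    return "".join(
        "\\" + ch if ch in LUCENE_SPECIAL_CHARS else ch
        for ch in term
    )
```

Because this sketch also escapes * and ?, it should be applied only to terms that are meant to be matched literally, not to terms that rely on wildcards.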

JEXL and Lucene Examples

Example Queries

JEXL Query	Lucene Query
_ANYFIELD_ == 'SomeValue'	SomeValue
(_ANYFIELD_ == 'AAA' && _ANYFIELD_ == 'BBB') && (_ANYFIELD_ == 'CCC' || _ANYFIELD_ == 'DDD')	(AAA BBB) (CCC OR DDD)
FIELDNAME == 'SomeValue'	FIELDNAME:SomeValue
FIELDNAME =~ 'SomeVal.*'	FIELDNAME:SomeVal*
(FIELD1 == 'AAA' && FIELD2 == 'BBB') && (FIELD3 == 'CCC' || FIELD3 == 'DDD')	(FIELD1:AAA FIELD2:BBB) (FIELD3:CCC OR FIELD3:DDD)
TEXT_FIELD == 'TextValue' && f:between(NUMBER_FIELD,1, 10)	TEXT_FIELD:TextValue AND NUMBER_FIELD:[1 TO 10]

Support for Hierarchical Data

DataWave also allows for queries that leverage the hierarchical context of structured data types, provided that the structure of the data is preserved during ingest. This requires that a special “grouping” notation be applied during ingest to any nested fields that are parsed.

For example, consider the XML objects below:

Data Object 1

   <foo>
      <parent>
         <child>
            <field1>A</field1>
         </child>
         <child>
            <field2>B</field2>
         </child>
      </parent>
   </foo>

Data Object 2

   <bar>
      <field1>A</field1>
      <field2>B</field2>
   </bar>

At ingest time, the field name/value pairs above may be stored logically in Accumulo as follows:

{FIELD NAME}.{GROUPING CONTEXT} = {VALUE}

Here, the grouping context preserves the fully-qualified path of the name/value pair and also conveys its relative position within the hierarchy.

Thus, given the XML above, we could flatten the data objects during ingest, transforming the name/value pairs as follows:

  # Data Object 1 flattened, including grouping context
  FIELD1.FOO_0.PARENT_0.CHILD_0.FIELD1_0 = A
  FIELD2.FOO_0.PARENT_0.CHILD_1.FIELD2_0 = B
  -------------------------
  # Data Object 2 flattened, including grouping context
  FIELD1.BAR_0.FIELD1_0 = A
  FIELD2.BAR_0.FIELD2_0 = B

As a result, the objects are flattened into simple maps with each consisting of two fields, ‘FIELD1’ and ‘FIELD2’. By default, the grouping context (i.e., all characters beyond the field name itself) will be ignored by the query API. Therefore, the following simple query could be used to return both objects as distinct search results:

JEXL Query	Lucene Query
FIELD1 == 'A' && FIELD2 == 'B'	FIELD1:A FIELD2:B
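
To make the default behavior concrete, the sketch below splits a flattened key of the form {FIELD NAME}.{GROUPING CONTEXT} at its first dot and discards the context. It is an illustration only; the function names are hypothetical and this is not how DataWave is implemented internally.

```python
def split_grouping_context(key: str):
    """Split '{FIELD NAME}.{GROUPING CONTEXT}' into (field_name, context)."""
    field_name, _, context = key.partition(".")
    return field_name, context


def flat_view(pairs):
    """Drop the grouping context from each key, keeping name/value pairs."""
    return {split_grouping_context(key)[0]: value for key, value in pairs}


# Flattened forms of Data Object 1 and Data Object 2 from the text above.
OBJECT_1 = [
    ("FIELD1.FOO_0.PARENT_0.CHILD_0.FIELD1_0", "A"),
    ("FIELD2.FOO_0.PARENT_0.CHILD_1.FIELD2_0", "B"),
]
OBJECT_2 = [
    ("FIELD1.BAR_0.FIELD1_0", "A"),
    ("FIELD2.BAR_0.FIELD2_0", "B"),
]
```

With the context dropped, both objects reduce to the same two fields, which is why the simple query above matches both.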

However, if the objects originated from distinct XML schemas having entirely different semantics for their respective fields, then we might not want both to appear in our search results. To disambiguate the two objects, we can use the following function:

JEXL Function	Lucene Function
grouping:matchesInGroupLeft(F1, 'V1', F2, 'V2', ..., Fn, 'Vn', INTEGER)	#MATCHES_IN_GROUP_LEFT(F1, 'V1', F2, 'V2', ..., Fn, 'Vn', INTEGER)

The INTEGER parameter denotes the level in the tree where the matching field name/value pairs must exist in order to constitute a match. Its values are defined as follows:

0 : Fields are siblings / same parent element
1 : Fields are cousins / same grandparent element
2 : Fields are 2nd cousins / same great-grandparent element
And so on…

For example, to retrieve only Data Object 1 above, we might use the following query:

JEXL Function	Lucene Function
FIELD1 == 'A' && grouping:matchesInGroupLeft(FIELD1, 'A', FIELD2, 'B', 1)	FIELD1:A AND #MATCHES_IN_GROUP_LEFT(FIELD1, 'A', FIELD2, 'B', 1)

Likewise, to return only Data Object 2, we could do the following:

JEXL Function	Lucene Function
FIELD1 == 'A' && grouping:matchesInGroupLeft(FIELD1, 'A', FIELD2, 'B', 0)	FIELD1:A AND #MATCHES_IN_GROUP_LEFT(FIELD1, 'A', FIELD2, 'B', 0)
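
The level parameter can be pictured with a simplified model of the grouping context. The sketch below is not DataWave's implementation: it assumes that level N means "drop the last N + 1 components of the grouping context and compare what remains", and that a key with no ancestor at that level never matches. Both assumptions were chosen only because they reproduce the two examples above; the function names are hypothetical.

```python
def ancestor_path(key: str, level: int):
    """Return the grouping path `level` + 1 steps above the leaf, or None."""
    _, _, context = key.partition(".")
    parts = context.split(".") if context else []
    keep = len(parts) - (level + 1)
    if keep <= 0:
        return None  # assumption: no ancestor exists at that level
    return ".".join(parts[:keep])


def in_same_group(key_a: str, key_b: str, level: int) -> bool:
    """True if both keys share the same ancestor path at `level`."""
    path_a = ancestor_path(key_a, level)
    path_b = ancestor_path(key_b, level)
    return path_a is not None and path_a == path_b


# Flattened keys of Data Object 1 and Data Object 2 from the text above.
OBJ1_F1 = "FIELD1.FOO_0.PARENT_0.CHILD_0.FIELD1_0"
OBJ1_F2 = "FIELD2.FOO_0.PARENT_0.CHILD_1.FIELD2_0"
OBJ2_F1 = "FIELD1.BAR_0.FIELD1_0"
OBJ2_F2 = "FIELD2.BAR_0.FIELD2_0"
```

In this model, the fields of Data Object 1 share PARENT_0 at level 1 but sit under different CHILD elements at level 0, while the fields of Data Object 2 share BAR_0 at level 0 and have no grandparent element at level 1.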

Custom Lucene Functions

DataWave has augmented Lucene to provide support for several JEXL features that were not supported natively. The table below maps the JEXL operators to the supported Lucene syntax:

JEXL Operator	Lucene Operator
filter:includeRegex(field, regex)	#INCLUDE(field, regex)
filter:excludeRegex(field, regex)	#EXCLUDE(field, regex)
filter:includeRegex(field1, regex1) <op> filter:includeRegex(field2, regex2) ...	#INCLUDE(op, field1, regex1, field2, regex2 ...) where op is 'or' or 'and'
filter:excludeRegex(field1, regex1) <op> filter:excludeRegex(field2, regex2) ...	#EXCLUDE(op, field1, regex1, field2, regex2 ...) where op is 'or' or 'and'
filter:isNull(field)	#ISNULL(field)
not(filter:isNull(field))	#ISNOTNULL(field)
filter:occurrence(field,operator,count)	#OCCURRENCE(field,operator,count)
filter:timeFunction(field1,field2,operator,equivalence,goal)	#TIME_FUNCTION(field1,field2,operator,equivalence,goal)

Notes:

None of these filter functions can be applied against index-only fields.
The occurrence function is used to count the number of instances of a field in the event. Valid operators are '==' (or '='),'>','>=','<','<=', and '!='.

Basic Geospatial Functions

Some geo functions are also supplied that may prove useful, although the within_bounding_box function may instead be expressed as a simple range comparison (i.e., LAT_LON_USER <= <lat1>_<lon1> and LAT_LON_USER >= <lat2>_<lon2>).

JEXL Operator	Lucene Operator
geo:within_bounding_box(latLonField, lowerLeft, upperRight)	#GEO(bounding_box, latLonField, 'lowerLeft', 'upperRight')
geo:within_bounding_box(lonField, latField, minLon, minLat, maxLon, maxLat)	#GEO(bounding_box, lonField, latField, minLon, minLat, maxLon, maxLat)
geo:within_circle(latLonField, center, radius)	#GEO(circle, latLonField, center, radius)

Notes:

All lat and lon values are in decimal degrees.
The lowerLeft, upperRight, and center are of the form <lat>_<lon> and must be surrounded by single quotes.
The radius is in decimal degrees as well.
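
The <lat>_<lon> convention and the quoting requirement can be sketched as a small string builder for the JEXL form of the bounding-box function. The helper names are hypothetical, and the use of Python's default float formatting is an assumption of this example.

```python
def lat_lon(lat: float, lon: float) -> str:
    """Format a point as '<lat>_<lon>' in decimal degrees."""
    return f"{lat}_{lon}"


def within_bounding_box(field: str, lower_left, upper_right) -> str:
    """Build geo:within_bounding_box(latLonField, lowerLeft, upperRight).

    `lower_left` and `upper_right` are (lat, lon) tuples. Each corner is
    wrapped in single quotes, as the notes above require.
    """
    return "geo:within_bounding_box({}, '{}', '{}')".format(
        field, lat_lon(*lower_left), lat_lon(*upper_right)
    )
```

For example, a box with corners (40.69, -74.1) and (40.75, -73.9) on a field named LAT_LON_USER produces geo:within_bounding_box(LAT_LON_USER, '40.69_-74.1', '40.75_-73.9').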

GeoWave Functions

GeoWave is an optional component that provides the following functions when enabled:

JEXL Operator	Lucene Operator
geowave:intersects_bounding_box(geometryField, westLon, eastLon, southLat, northLat)	#INTERSECTS_BOUNDING_BOX(geometryField, westLon, eastLon, southLat, northLat)
geowave:intersects_radius_km(geometryField, centerLon, centerLat, radiusKm)	#INTERSECTS_RADIUS_KM(geometryField, centerLon, centerLat, radiusKm)
geowave:contains(geometryField, Well-Known Text)	#CONTAINS(geometryField, Well-Known Text)
geowave:covers(geometryField, Well-Known Text)	#COVERS(geometryField, Well-Known Text)
geowave:coveredBy(geometryField, Well-Known Text)	#COVERED_BY(geometryField, Well-Known Text)
geowave:crosses(geometryField, Well-Known Text)	#CROSSES(geometryField, Well-Known Text)
geowave:intersects(geometryField, Well-Known Text)	#INTERSECTS(geometryField, Well-Known Text)
geowave:overlaps(geometryField, Well-Known Text)	#OVERLAPS(geometryField, Well-Known Text)
geowave:within(geometryField, Well-Known Text)	#WITHIN(geometryField, Well-Known Text)

Notes:

All lat and lon values are in decimal degrees.
The lowerLeft, upperRight, and center are of the form <lat>_<lon> and must be surrounded by single quotes.
Geometry is represented according to the Open Geospatial Consortium standard for Well-Known Text. It is in decimal degrees, with longitude for x and latitude for y. For example, a point at New York can be represented as 'POINT (-74.01 40.71)' and a box at New York can be represented as 'POLYGON(( -74.1 40.75, -74.1 40.69, -73.9 40.69, -73.9 40.75, -74.1 40.75))'.
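
Since Well-Known Text puts longitude first, it is easy to swap the coordinates by accident. The sketch below builds the point and box strings from named arguments; the helper names are hypothetical, and the ring order simply follows the New York example above.

```python
def point_wkt(lon: float, lat: float) -> str:
    """Return a WKT point with longitude as x and latitude as y."""
    return f"POINT ({lon} {lat})"


def bounding_box_wkt(west: float, south: float, east: float, north: float) -> str:
    """Return a closed WKT polygon for a lon/lat bounding box."""
    ring = [
        (west, north),
        (west, south),
        (east, south),
        (east, north),
        (west, north),  # repeat the first vertex to close the ring
    ]
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON(({coords}))"
```

Calling bounding_box_wkt(-74.1, 40.69, -73.9, 40.75) reproduces the New York box from the note, apart from whitespace.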

Date Functions

Some additional functions are supplied to handle dates more smoothly. The need for these functions may go away in future versions. In the table below, keywords such as between, after, and before are literal; other parameters are substituted with appropriate values:

JEXL Operator	Lucene Operator
filter:betweenDates(field, start date, end date)	#DATE(field, start date, end date) or #DATE(field, between, start date, end date)
filter:betweenDates(field, start date, end date, start/end date format)	#DATE(field, start date, end date, start/end date format) or #DATE(field, between, start date, end date, start/end date format)
filter:betweenDates(field, field date format, start date, end date, start/end date format)	#DATE(field, field date format, start date, end date, start/end date format) or #DATE(field, between, field date format, start date, end date, start/end date format)
filter:afterDate(field, date)	#DATE(field, after, date)
filter:afterDate(field, date, date format)	#DATE(field, after, date, date format)
filter:afterDate(field, field date format, date, date format)	#DATE(field, after, field date format, date, date format)
filter:beforeDate(field, date)	#DATE(field, before, date)
filter:beforeDate(field, date, date format)	#DATE(field, before, date, date format)
filter:beforeDate(field, field date format, date, date format)	#DATE(field, before, field date format, date, date format)
filter:betweenLoadDates(LOAD_DATE, start date, end date)	#LOADED(start date, end date) or #LOADED(between, start date, end date)
filter:betweenLoadDates(LOAD_DATE, start date, end date, start/end date format)	#LOADED(start date, end date, start/end date format) or #LOADED(between, start date, end date, start/end date format)
filter:afterLoadDate(LOAD_DATE, date)	#LOADED(after, date)
filter:afterLoadDate(LOAD_DATE, date, date format)	#LOADED(after, date, date format)
filter:beforeLoadDate(LOAD_DATE, date)	#LOADED(before, date)
filter:beforeLoadDate(LOAD_DATE, date, date format)	#LOADED(before, date, date format)
filter:timeFunction(DOWNTIME, UPTIME, '-', '>', 2522880000000L)	#TIME_FUNCTION(DOWNTIME, UPTIME, '-', '>', '2522880000000L')
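
The goal value in the timeFunction row is a duration in milliseconds. As a worked check, 2522880000000 equals 80 years of 365 days each; reading that as the intent of the example is an assumption, and the helper below is hypothetical.

```python
MILLIS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms


def years_to_millis(years: int, days_per_year: int = 365) -> int:
    """Convert whole years to milliseconds, ignoring leap days."""
    return years * days_per_year * MILLIS_PER_DAY
```

Here years_to_millis(80) gives 2522880000000, the literal used in the table.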

Notes:

None of these filter functions can be applied against index-only fields.
Between functions are inclusive, and the other functions are exclusive of the entered dates.
Date formats must be entered in the Java SimpleDateFormat object format.
If the entered date format is not specified, then the following list of date formats will be tried:

yyyyMMdd:HH:mm:ss:SSSZ
yyyyMMdd:HH:mm:ss:SSS
EEE MMM dd HH:mm:ss zzz yyyy
d MMM yyyy HH:mm:ss 'GMT'
yyyy-MM-dd HH:mm:ss.SSS Z
yyyy-MM-dd HH:mm:ss.SSS
yyyy-MM-dd HH:mm:ss.S Z
yyyy-MM-dd HH:mm:ss.S
yyyy-MM-dd HH:mm:ss Z
yyyy-MM-dd HH:mm:ssz
yyyy-MM-dd HH:mm:ss
yyyyMMdd HHmmss
yyyy-MM-dd'T'HH'|'mm
yyyy-MM-dd'T'HH':'mm':'ss'.'SSS'Z'
yyyy-MM-dd'T'HH':'mm':'ss'Z'
MM'/'dd'/'yyyy HH':'mm':'ss
E MMM d HH:mm:ss z yyyy
E MMM d HH:mm:ss Z yyyy
yyyyMMdd_HHmmss
yyyy-MM-dd
MM/dd/yyyy
yyyy-MMMM
yyyy-MMM
yyyyMMddHHmmss
yyyyMMddHHmm
yyyyMMddHH
yyyyMMdd

A special date format of 'e' can be supplied to mean milliseconds since epoch.
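
When using the 'e' format, the date argument is a count of milliseconds since the epoch. The following sketch converts a timezone-aware datetime to that value and places it in the three-argument filter:afterDate form from the table above. The helper names are hypothetical, and passing the number as a quoted string is an assumption.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt: datetime) -> int:
    """Return whole milliseconds since the epoch for an aware datetime."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer division of timedeltas avoids floating-point rounding.
    return (dt - EPOCH) // timedelta(milliseconds=1)


def after_date_epoch(field: str, dt: datetime) -> str:
    """Build filter:afterDate(field, date, date format) using format 'e'.

    Quoting the millisecond value is an assumption of this sketch.
    """
    return f"filter:afterDate({field}, '{to_epoch_millis(dt)}', 'e')"
```

For midnight UTC on 1 January 2020, to_epoch_millis returns 1577836800000.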
