Structural Informatics Group (SIG) logo
Home | Projects | Demos | Downloads | Publications | Local Info | About Us | New site

StruQL

(Excerpted from Peter Mork's upcoming paper about OQAFMA.)

Recall that a semantic network is a collection of interconnected nodes. Each node corresponds to a concept (frame) or atomic value and each edge corresponds to a relationship (slot). Expressed in first order logic, an edge (E) that connect one node (N1) to another (N2) means that the relationship E holds for N1 and N2 . In addition, if an edge (E) connects a node (N1) to a value (V), then the relationship E holds for N1 and V. For example, if the nodes named Heart and Aorta are connected with an edge labeled part-of, then it is necessarily the case that the part-of relationship holds for Heart and Aorta. An edge labeled synonym linking Heart and the atomic value "Coer" means that the synonym relationship holds for Heart and "Coer." As is always the case with first order logic, the interpretation of these statements depends on the actual concepts we associate with Heart, Aorta, part-of and synonym. We assume the obvious correspondence between English names and real world concepts (e.g., synonym is associated the synonym relationship).

We have chosen STRUQL as the basis for querying the FMA because it assumes an edge labeled graph as the underlying data structure. This corresponds exactly with a semantic network, and very closely to a frame-based data model.


a) Basic Operations
The organization of a STRUQL query is very simple. There is a WHERE clause, which binds variables to a subset of the semantic network followed by CREATE and LINK clauses (the LINK clause has not yet been implemented), which construct nodes and edges in the result. There are two basic constructs in the WHERE clause. First, one can express necessary relationships between two nodes or between a node and a value. Second, one can express binary conditions between a variable and an atomic value. Readers familiar with SQL should note that re-using a variable name supports the equivalent of an equi-join operation.
These basic operations support all possible browse queries.


b) Path Expressions
STRUQL supports arbitrary regular expressions (paths) to be used in place of specific edge names. The basic operations include concatenation (.), alternation (| and ?) and closure (* and +). In addition, variables can be used in place of an edge name. With these additional operations, export and extraction are fully supported.
b.1) Concatenation
The concatenation of two paths (P1.P2) generates a new "longer" path, which can be interpreted as the first path, followed by the second path. Formally, this new relationship holds for X and Y whenever there exists some node (N) such that P1 holds for X and N and P2 holds for N and Y. (In the frame literature this is known as a slot-chain.) For example, in a semantic network corresponding to a family tree, grandparents can be retrieved by concatenating "parent" with itself (i.e., "parent"."parent"). Note that there is no requirement that N be distinct from X (or Y).
Recursive constructions are also possible. Paths of arbitrary (finite) length can be constructed by concatenating multiple edges together. This construct is supported by OKBC, but not Protégé-2000.
b.2) Alternation
The alternation of two paths (P1|P2) generates a choice between the two paths. Formally, this new relationship holds for X and Y whenever either P1 holds for X and Y or P2 holds. This operation can also be thought of as union; intersection is supported by adding an additional constraint (e.g., X'P1'Y, X'P2'Y). Continuing the family tree example, siblings can be retrieved by alternating "brother" and "sister" (i.e., "brother"|"sister"). Another version of alternation is the optional operator (?). This operator makes the preceding path optional. This can also be thought of as (P|e), except STRUQL does not support e (the empty path of length 0). Formally, the relationship P? holds for X and Y whenever P holds for X and Y or X and Y are the same node (i.e., X == Y).
b.3) Closure
Closure (P+) means following P an arbitrary number of times. Formally, this relationship holds for X and Y whenever there exists a collection of intermediate nodes such that P holds for each successive pair of nodes. This constraint is most naturally expressed (as in Figure 7) as a recursive relationship. In the relational literature this is implemented using fixed-point operator. This operation is essential to the traversal of hierarchies (for example 'part-of' or 'is-a') since it allows traversal to an arbitrary depth. Thus, closure is crucial whenever a relationship exhibits transitivity. To complete the family tree example, ancestors can be retrieved by performing closure on the "parent" relationship (i.e., "parent"+). Because of the frequency with the closure operator is followed by the optional operator, these two operators are combined by the star (*) operator. Thus, P* means follow P any number times or not at all.
b.4) Complex Expressions
The careful reader will note that all of the preceding definitions referred to paths and not edges. This is because STRUQL supports applying any of these operators to any path. This allows one to express arbitrary complex constraints. One of the early motivations for choosing a language that supports regular expressions over relationships was the way in which containment is modeled in the FMA.
In the FMA, an organ can only be contained in a cavity. Thus, it is valid to say that the left lung is contained in the thoracic cavity. However, it is not valid (in the model) to say that the left lung is contained in the thorax. A reasonable user query is "What are all of the organs contained in the thorax?" In the model, the answer is the empty set. Of course if you asked anyone with a passing knowledge of anatomy this question, they would be able to list several organs. In terms of the FMA, the actual user query needs to be phrased as "What are all of the organs contained in the thorax or any of its sub-parts?" This complex relationship can be written succinctly using the STRUQL operators as "part"*."contains" which can be read as "Starting from the thorax traverse 0 or more part relationships followed by a single containment relationship."
This example reveals one of the advantages of complex relationships. The underlying model can be constructed with great precision. For example, it is technically true that only cavities can contain things. However, the query interface should support higher levels of abstraction. For example, the query interface could suggest (or even automatically convert) replacing every appearance of "contains" with the less precise, but more intuitive, "part"*."contains" to the user. It is among our goals to construct a library of such conversions.
Moreover, this indirection allows the underlying model to evolve while still exposing the same collection of virtual relationships. This logical independence is exactly analogous to virtual tables (i.e., views) in a relational database system.

c) Syntax and Semantics
A formal grammar describing the subset of STRUQL we have implemented can be seen in the following EBNF grammar. Briefly, a STRUQL query consists of a WHERE clause followed by a CREATE clause. The WHERE clause consists of variables, which are related to one another via path expressions: X'P'Y. Variables can also be equated with constants: X=="Value". For convenience, the expression X'P'Y, Y=="Value" can be abbreviated: X'P'"Value". In STRUQL arbitrary paths are supported; at present we require a specific order of operations: closure, then alternation, then concatenation. The CREATE clause consists of node construction functions, parameterized using variables from the WHERE clause.


Query:- WhereClause CreateClause

WhereClause:-WHERE [WhereExpression,]* WhereExpression

WhereExpression:- Var->Path->Term
               | Var==String
               | Var { String }

Term:- Var
     | String 

Path:- Path.Path
     | Alternation
     | Var

Alternation:- Alternation | Alternation
            |Closure

Closure:- String Modifier

Modifier:- ?
         | *
         | +
                      
CreateClause:- CREATE [CreateExpression]* CreateExpression

CreateExpression:- Var ( [Var,]*)


Precise semantics can be found below, but we will first consider an example. Until now we have been assuming that the (unique) identifier for every node in the semantic network is meaningful. This is not, in fact, the case. Nodes are uniquely identified using arbitrary numbers. However, because the FMA was constructed using Protégé-2000, every node (frame) has exactly one ":NAME" edge relating that node to its human-readable name.

d) WHERE Semantics
The WHERE clause generates a series of variable bindings. Let N represent all of the nodes in the semantic network and let E represent all of the edges in the semantic network. Finally, let Q represent all of the variables in the WHERE clause. The semantics of the WHERE clause is the set of all assignments from Q into N E such that all of the conditions in the where clause are satisfied. That is, for every path constraint X'P'Y, the relationship P(X,Y) holds. Note that P could itself be a variable; this does not, however, change the semantics. In addition, every binary condition (of the form X == "String" or X {"String"}) must also hold. The former holds if the value of X equals the string constant. The latter construct allows one to use any binary condition supported by SQL; the semantics of this comparison are defined by SQL. It should be noted that there are potentially an infinite number of paths connecting X and Y. However, the semantics of the WHERE clause restricts the output to a finite collection. Let N E contain k elements and let Q contain q variables. There are at most kq assignments from Q into N E. Thus, theWHERE clause must return a finite collection (polynomial in the size of the network).

e) CREATE Semantics
The CREATE clause generates output based on the results returned by the WHERE clause. Each expression in the CREATE clause corresponds to a Skolem function. Each Skolem function returns an arbitrary object subject to the following constraints: Whenever a Skolem function is passed different collections of parameters, it will produce different return values. Whenever it is passed the same collection of parameters, it will return the same value.
As a result, each expression in the CREATE clause constructs a new element for every unique binding of the specified parameters returned by the WHERE clause. To guarantee that the resulting document is valid XML, the create clause wraps all of the elements it creates in a <results> tag. Each Skolem function returns elements corresponding to the name of the function. The sub-elements of these elements are the names of the variables, the contents of which are the values bound to the variables.

e) Example

WHERE
X->":NAME"->"Thorax",
X->"part"*."contains"->Y,
Y->":NAME"->Contains
CREATE
TheThorax(Contains)

The query in the above figure retrieves the names of the things contained in the thorax, or one of its sub-parts. The only possible binding for X is the node whose name is "Thorax". There are several possible bindings for Y, one for each node reachable from X by the path "generic part"*."contains". Finally, for every binding of Y, there is exactly one binding for Contains because each node has a unique ":NAME". The results returned by the WHERE clause are displayed below:


<results>   
  <TheThorax>   
    <Contains>Upper lobe of right lung</Contains>   
  </TheThorax>   
  <TheThorax>   
    <Contains>Right lung</Contains>   
  </TheThorax>   
  <TheThorax>   
    <Contains>Lung</Contains>   
  </TheThorax>   
  <TheThorax>   
    <Contains>Upper lobe of left lung</Contains>   
  </TheThorax>   
  <TheThorax>   
    <Contains>Left lung</Contains>   
  </TheThorax>   
  <TheThorax>   
    <Contains>Pleural fluid</Contains>   
  </TheThorax>   
  <TheThorax>   
    <Contains>Pericardial fluid</Contains>   
  </TheThorax>   
  <TheThorax>   
    <Contains>Thymus</Contains>   
  </TheThorax>   
</results>   

Return to the OQAFMA page.


Last modified: Mon Apr 28 16:46:49 PDT 2003 by Kevin Hinshaw (khinshaw (at) u.washington.edu)