(Excerpted from Peter Mork's upcoming paper about OQAFMA.)
Recall that a semantic network is a collection of interconnected nodes. Each
node corresponds to a concept (frame) or atomic value and each edge corresponds
to a relationship (slot). Expressed in first order logic, an edge (E) that connect
one node (N1) to another (N2) means that the relationship E holds for N1 and
N2 . In addition, if an edge (E) connects a node (N1) to a value (V), then the
relationship E holds for N1 and V. For example, if the nodes named Heart and
Aorta are connected with an edge labeled part-of, then it is necessarily the
case that the part-of relationship holds for Heart and Aorta. An edge labeled
synonym linking Heart and the atomic value "Coer" means that the synonym
relationship holds for Heart and "Coer." As is always the case with
first order logic, the interpretation of these statements depends on the actual
concepts we associate with Heart, Aorta, part-of and synonym. We assume the
obvious correspondence between English names and real world concepts (e.g.,
synonym is associated the synonym relationship).
We have chosen STRUQL as the basis for querying the FMA because it assumes an
edge labeled graph as the underlying data structure. This corresponds exactly
with a semantic network, and very closely to a frame-based data model.
a) Basic Operations
The organization of a STRUQL query is very simple. There is a WHERE clause,
which binds variables to a subset of the semantic network followed by CREATE
and LINK clauses (the LINK clause has not yet been implemented), which construct
nodes and edges in the result. There are two basic constructs in the WHERE clause.
First, one can express necessary relationships between two nodes or between
a node and a value. Second, one can express binary conditions between a variable
and an atomic value. Readers familiar with SQL should note that re-using a variable
name supports the equivalent of an equi-join operation.
These basic operations support all possible browse queries.
b) Path Expressions
STRUQL supports arbitrary regular expressions (paths) to be used in place of
specific edge names. The basic operations include concatenation (.), alternation
(| and ?) and closure (* and +). In addition, variables can be used in place
of an edge name. With these additional operations, export and extraction are
fully supported.
b.1) Concatenation
The concatenation of two paths (P1.P2) generates a new "longer" path,
which can be interpreted as the first path, followed by the second path. Formally,
this new relationship holds for X and Y whenever there exists some node (N)
such that P1 holds for X and N and P2 holds for N and Y. (In the frame literature
this is known as a slot-chain.) For example, in a semantic network corresponding
to a family tree, grandparents can be retrieved by concatenating "parent"
with itself (i.e., "parent"."parent"). Note that there is
no requirement that N be distinct from X (or Y).
Recursive constructions are also possible. Paths of arbitrary (finite) length
can be constructed by concatenating multiple edges together. This construct
is supported by OKBC, but not Protégé-2000.
b.2) Alternation
The alternation of two paths (P1|P2) generates a choice between the two paths.
Formally, this new relationship holds for X and Y whenever either P1 holds for
X and Y or P2 holds. This operation can also be thought of as union; intersection
is supported by adding an additional constraint (e.g., X'P1'Y, X'P2'Y). Continuing
the family tree example, siblings can be retrieved by alternating "brother"
and "sister" (i.e., "brother"|"sister"). Another
version of alternation is the optional operator (?). This operator makes the
preceding path optional. This can also be thought of as (P|e), except STRUQL
does not support e (the empty path of length 0). Formally, the relationship
P? holds for X and Y whenever P holds for X and Y or X and Y are the same node
(i.e., X == Y).
b.3) Closure
Closure (P+) means following P an arbitrary number of times. Formally, this
relationship holds for X and Y whenever there exists a collection of intermediate
nodes such that P holds for each successive pair of nodes. This constraint is
most naturally expressed (as in Figure 7) as a recursive relationship. In the
relational literature this is implemented using fixed-point operator. This operation
is essential to the traversal of hierarchies (for example 'part-of' or 'is-a')
since it allows traversal to an arbitrary depth. Thus, closure is crucial whenever
a relationship exhibits transitivity. To complete the family tree example, ancestors
can be retrieved by performing closure on the "parent" relationship
(i.e., "parent"+). Because of the frequency with the closure operator
is followed by the optional operator, these two operators are combined by the
star (*) operator. Thus, P* means follow P any number times or not at all.
b.4) Complex Expressions
The careful reader will note that all of the preceding definitions referred
to paths and not edges. This is because STRUQL supports applying any of these
operators to any path. This allows one to express arbitrary complex constraints.
One of the early motivations for choosing a language that supports regular expressions
over relationships was the way in which containment is modeled in the FMA.
In the FMA, an organ can only be contained in a cavity. Thus, it is valid to
say that the left lung is contained in the thoracic cavity. However, it is not
valid (in the model) to say that the left lung is contained in the thorax. A
reasonable user query is "What are all of the organs contained in the thorax?"
In the model, the answer is the empty set. Of course if you asked anyone with
a passing knowledge of anatomy this question, they would be able to list several
organs. In terms of the FMA, the actual user query needs to be phrased as "What
are all of the organs contained in the thorax or any of its sub-parts?"
This complex relationship can be written succinctly using the STRUQL operators
as "part"*."contains" which can be read as "Starting
from the thorax traverse 0 or more part relationships followed by a single containment
relationship."
This example reveals one of the advantages of complex relationships. The underlying
model can be constructed with great precision. For example, it is technically
true that only cavities can contain things. However, the query interface should
support higher levels of abstraction. For example, the query interface could
suggest (or even automatically convert) replacing every appearance of "contains"
with the less precise, but more intuitive, "part"*."contains"
to the user. It is among our goals to construct a library of such conversions.
Moreover, this indirection allows the underlying model to evolve while still
exposing the same collection of virtual relationships. This logical independence
is exactly analogous to virtual tables (i.e., views) in a relational database
system.
c) Syntax and Semantics
A formal grammar describing the subset of STRUQL we have implemented can be
seen in the following EBNF grammar. Briefly, a STRUQL query consists of a WHERE
clause followed by a CREATE clause. The WHERE clause consists of variables,
which are related to one another via path expressions: X'P'Y. Variables can
also be equated with constants: X=="Value". For convenience, the expression
X'P'Y, Y=="Value" can be abbreviated: X'P'"Value". In STRUQL
arbitrary paths are supported; at present we require a specific order of operations:
closure, then alternation, then concatenation. The CREATE clause consists of node construction functions,
parameterized using variables from the WHERE clause.
Query:- WhereClause CreateClause WhereClause:-WHERE [WhereExpression,]* WhereExpression WhereExpression:- Var->Path->Term | Var==String | Var { String } Term:- Var | String Path:- Path.Path | Alternation | Var Alternation:- Alternation | Alternation |Closure Closure:- String Modifier Modifier:- ? | * | + CreateClause:- CREATE [CreateExpression]* CreateExpression CreateExpression:- Var ( [Var,]*)
Precise semantics can be found below, but we will first consider an example.
Until now we have been assuming that the (unique) identifier for every node
in the semantic network is meaningful. This is not, in fact, the case. Nodes
are uniquely identified using arbitrary numbers. However, because the FMA was
constructed using Protégé-2000, every node (frame) has exactly
one ":NAME" edge relating that node to its human-readable name.
d) WHERE Semantics
The WHERE clause generates a series of variable bindings. Let N represent all
of the nodes in the semantic network and let E represent all of the edges in
the semantic network. Finally, let Q represent all of the variables in the WHERE
clause. The semantics of the WHERE clause is the set of all assignments from
Q into N E such that all of the conditions in the where clause are satisfied.
That is, for every path constraint X'P'Y, the relationship P(X,Y) holds. Note
that P could itself be a variable; this does not, however, change the semantics.
In addition, every binary condition (of the form X == "String" or
X {"String"}) must also hold. The former holds if the value of X equals
the string constant. The latter construct allows one to use any binary condition
supported by SQL; the semantics of this comparison are defined by SQL. It should
be noted that there are potentially an infinite number of paths connecting X
and Y. However, the semantics of the WHERE clause restricts the output to a
finite collection. Let N E contain k elements and let Q contain q variables.
There are at most kq assignments from Q into N E. Thus, theWHERE clause must
return a finite collection (polynomial in the size of the network).
e) CREATE Semantics
The CREATE clause generates output based on the results returned by the WHERE
clause. Each expression in the CREATE clause corresponds to a Skolem function.
Each Skolem function returns an arbitrary object subject to the following constraints:
Whenever a Skolem function is passed different collections of parameters, it
will produce different return values. Whenever it is passed the same collection
of parameters, it will return the same value.
As a result, each expression in the CREATE clause constructs a new element for
every unique binding of the specified parameters returned by the WHERE clause.
To guarantee that the resulting document is valid XML, the create clause wraps
all of the elements it creates in a <results> tag. Each Skolem function
returns elements corresponding to the name of the function. The sub-elements
of these elements are the names of the variables, the contents of which are
the values bound to the variables.
e) Example
WHERE
X->":NAME"->"Thorax",
X->"part"*."contains"->Y,
Y->":NAME"->Contains
CREATE
TheThorax(Contains)
The query in the above figure retrieves the names of the things contained in
the thorax, or one of its sub-parts. The only possible binding for X is the
node whose name is "Thorax". There are several possible bindings for
Y, one for each node reachable from X by the path "generic part"*."contains".
Finally, for every binding of Y, there is exactly one binding for Contains because
each node has a unique ":NAME". The results returned by the WHERE
clause are displayed below:
<results> <TheThorax> <Contains>Upper lobe of right lung</Contains> </TheThorax> <TheThorax> <Contains>Right lung</Contains> </TheThorax> <TheThorax> <Contains>Lung</Contains> </TheThorax> <TheThorax> <Contains>Upper lobe of left lung</Contains> </TheThorax> <TheThorax> <Contains>Left lung</Contains> </TheThorax> <TheThorax> <Contains>Pleural fluid</Contains> </TheThorax> <TheThorax> <Contains>Pericardial fluid</Contains> </TheThorax> <TheThorax> <Contains>Thymus</Contains> </TheThorax> </results>