Compositional Semantics in a Localist
Neural Network
Reprint from the Proceedings
of ICAI
– 2001 (3)
Artificial Intelligence
Center
Athens, GA 30602-7415
email:
nrushton@ai.uga.edu
Abstract: We
discuss representation and content-based retrieval of atomic statements of
predicate logic in a localist neural network. The semantics of our
representation is compositional, despite the fact that it employs no symbols —
countering prominent objections to connectionism raised by Fodor and Pylyshin
[2]. Our techniques can implement massively parallel database retrieval, as well
as case-based and analogical reasoning. The key technique is the use of network
nodes whose stimulation values are matrices rather than the scalars. Examples
and simulation results are included.
Keywords: compositional semantics, neural networks,
database, multichannel networks.
1.
Introduction
Connectionism
has spawned successful models in many areas of AI, such as content based
retrieval, word-sense disambiguation, and pattern recognition. However, these
models have generally failed to address the compositionality of propositions used in
“higher” cognition — whose meaning depends on structure rather than the mere
presence or absence of features. Thus
most simulations of higher
cognitive functions have used symbol structures to represent knowledge. Even
many systems which are connectionist in their low level detail, such as Anderson
& Lebiere’s ACT-RN [1], and Touretzky’s BoltzCONS [6], represent atomic data
components by patterns of activation, which can be instantiated and then copied
from one battery of nodes to another. These batteries are essentially Platonic
“wax tablets,” which record symbols consisting of patterns of activation.
Indeed, both systems are connectionist implementations of preexisting
symbol-based architectures — LISP in the case of Touretzky, and ACT-R in the case of Anderson and
Lebiere.
Here we take a different
approach — a thoroughgoing connectionist approach — to the representation and
retrieval of combinatorial data structures. Section 2 introduces the basic tool,
the multichannel network, and
describes how it is used to represent knowledge in short term memory. Section 3
describes storage in long term memory. Section 4 treats content-based retrieval
from long term memory. Section 5 gives examples and simulation
results. Section
6 discusses neural plausibility.
2.
Storage in Short Term Memory
Consider a traditional
network with four nodes representing the predicates red, blue, dog, and cat. In order to represent the presence
of, say, a blue dog, we stimulate the blue and dog nodes. In this way we can represent
the predicates in any arrangement, so long as they adhere in a single subject.
But now what if we want to represent the presence of a blue dog and a red cat?
What we need is a way to
keep track of node activation “about” different subjects, without interference
between activations pertaining to different ones. An analogy comes to mind with
a telephone network: many conversations can travel along the same lines at once,
without interfering, because the signals
travel on different channels in the network. This analogy will carry us a
long way.
Suppose our network has,
say, three different channels, labeled x, y, and z. We can now represent the presence of
a blue dog and a red cat by
activating, say, the blue and dog nodes on channel x, and the red and cat nodes on channel y. By introducing multiple channels into
the network nodes, it goes from representing propositions (e.g., the feature blue is present) to representing bona
fide one-place predicates (e.g., x is
blue). Note that the stimulation of each node can now be viewed as a
3-dimensional vector, where stimulation on each channel occupies one component
in the vector; so stimulation of strength 0.0 on channel x, 1.0 on
channel y, and 0.5 on channel z could be viewed as stimulation lf
[0.0, 1.0, 0.5] for the node[1].
The problem arises again
when we try to represent higher arity predicates. Consider the proposition that
Tony Loves Maria. However we
stimulate nodes for Tony, Maria, and Loves, even using multiple channels, we
cannot account for who loves whom. The problem is that Loves is not simply a predicate about
both Tony and Maria — it also concerns the roles the subjects play in the relation.
The difficulty in representing such structures is perennial in network
architectures, and has been put forth by Fodor and Pylyshin [2] as an
insurmountable barrier to the connectionist modeling of higher
cognition.
But if scalar valued nodes
can represent 0-ary propositions, and vector valued nodes can represent
one-place predicates, then for two-place predicates it is natural to turn to matrix valued node stimulation. So
suppose that, in addition to the three “feature” channels x, y, and z, our network has nine “relation”
channels: xx, xy, xz, yx, yy, yz, zx,
zy, and zz. The stimulation of a
node for a two-place relation, such as Loves, can now be regarded as a
3´3 matrix consisting of its
stimulation levels on the various relation channels. In order to represent that
Tony loves Maria, we can stimulate a node for Tony on channel x, one for Maria on channel y, and the loves node on channel xy. Note this is different from the
representation of Maria Loves Tony,
in which the loves node would be
stimulated on channel yx (which,
however, could be present instead or in addition).
One might fear that in order to represent higher arity relations, we would have to move to higher dimensional “hyper matrices.” However, it turns out that two is the magic number, and higher arity relations can be viewed as complexes of binary relations. In general, Pred(x1,x2,…,xn) can be represented as Pred-instance(x), first-arg(x,x1), second-arg(x,x2),…, nth-arg(x,xn).
3. Storage in Long Term
Memory
The previous section showed how to represent information in short term memory of a multichannel network, in terms of stimulation levels. Given a fixed number of channels, only a handful of propositions can be represented in this way at once, and in any case such information fades away when the network is dormant. This section shows how to form connections among nodes so that propositions can be stored permanently in long term memory, so that the number of available channels does not limit the number of propositions able to be simultaneously stored.
We
introduce a special interface node,
labeled ‘Ä’, which mediates between
the feature and relation nodes. Each interface node has three slots which can be
used to make connections with exactly three other nodes — two feature nodes and
a relation node, which at any given interface will be labeled F1, F2, and R (imagining a physical network, these
labels can be thought of as being implicit in the physical structure of the
interface node). An interface node with connections made in this way stores the
proposition “F1 stands in relation R to F2.” We will see in Sections 4 and 5
that, with an appropriate algorithm
for spreading activation across interface nodes, this reading is the right one.
We can now represent, in long term memory, knowledge involving both features and relations, such as Lisbon is the capital of Portugal:
Figure 1:
![]()
Lisbon
Portugal
![]()
F1
F2
![]()
R
capital_of
Alternatively, such facts could be represented in proposition networks reminiscent of, e.g., Anderson and Lebiere (1998). A node is allocated for the fact itself, and then connections are made to indicate its subject and predicate, and possibly an object, instrument, time, place, and other semantic roles. Thus rather than capital_of(Portugal, Lisbon), we would allocate a node for the proposition P that Lisbon is the capital of Portugal, and then form connections representing the facts subject_of(P,Lisbon), object_of(P, Portugal), predicate_of(P, capital_of). In this way, the proposition nodes give each fact a “handle,” allowing us to recursively represent such statements as Bob thinks Mary said Lisbon is the Capital of Portugal.
4. Retrieval from Long Term
Memory
A
multichannel network, consisting of nodes and connections formed as above, can
use spread-of-activation to retrieve stored propositions relevant to a query,
and to deliver its answer or answers. Different approaches to propagation give
different effects; conservative propagation favors soundness while a more
liberal approach favors completeness and spontaneous generalization. The basic
method used here is designed primarily to illustrate the channel-sensitive approach to
propagation which allows the network to represent and retrieve combinatorial
structures.
The stimulation of the relation nodes is determined by the query and held constant at 1.0 for each channel which is stimulated. At each time step, activation is propagated among the feature nodes according to the following algorithm :
Stimulation on each channel of each node is nondecreasing. The amount of new energy introduced into the system decreases by 1/3 at each time step, and so the total energy of the network forms a convergent geometric series. Thus the network state converges in every case.
5. Examples
5.1 Basic
retrieval
Our
first example illustrates basic data retrieval. A proposition network was formed
representing the following data using the binary predicates subject, object, and relation:
A
hippie is in the park.
A
hippie is in the church.
A
hippie is in the bank.
A
captain is in the park.
A
captain is in the church.
A
debutante is in the bank.
A
fireman is in the park.
A
lawyer is in the cave.
The
church is in the
park
The
query who is in the church is
rendered into the following source stimulations: church(x), in(y), subject_of(pz), relation_of(py), object_of(px). z is the query variable: as
the network evolves, we should expect nodes representing correct answers to
accumulate stimulation on Channel z.
After evolving the network
10 cycles, the correct answers captain and hippie respond on Channel z with
stimulation levels 0.17 and 0.21 respectively. The stimulation level of the
nearest competitor is 0.035, roughly an order of magnitude less. The slight
disparity between the stimulation of the correct answers is due to the fact that
being the agent of ‘in-ness,’ according to our database, is slightly more
typical of hippies than of captains.
It is of interest that the incorrect answer park does not respond on Channel z at all, even though it is associated with query atoms church and in, since our database contains the proposition The church is in the park. This illustrates the ability of the network to take into account the structure of the data and query, and is the main difference between the mulichannel approach and classical models for spread-of-activation retrieval.
5.2 Case based
reasoning
Table 1:
species pupils shape of head general color venom type
|
copperhead |
elliptical |
triangular |
red |
venomous |
|
rattlesnake |
elliptical |
triangular |
brown |
venomous |
|
black rat snake |
round |
tapered |
dark |
non-venemous |
|
black racer |
round |
tapered |
dark |
non-venomous |
To
illustrate case-based reasoning we use the data of Table 1, represented in a network using
the binary relations
pupil_shape, head_shape, color, and venom_type. This parallel database
could be used to retrieve stored propositions as in the previous example, but we
are now concerned with what happens when the answer to the query is not in the database. For a test case
suppose we encounter a water moccasin, which has elliptical pupils, a triangular
head, and is dark in color. To query whether the moccasin is venomous, we initialize the network
with the following activations: elliptical(w), triangular(x), dark(y), pupil_shape(vw), head_shape(vx), color(vy), venom_type(vz). Again, z is the query variable,
corresponding to the channel on which appropriate answers should respond.
After evolving the network
for 10 cycles, venomous responds on
Channel z with a strength of 0.39 while non-venomous responds with strength
0.195. The strength of the venomous
response is roughly twice that of non-venomous because, according to our
database, the water moccasin shares twice as many features in common with
venomous snakes as with non-venomous ones, on average per species. In a more
sophisticated model, features could be weighted in importance by altering
network connection strengths, and combined non-linearly by adding “hidden nodes”
corresponding to conjunctions of basic features.
6. Neural
Plausibility
The
stimulation levels of biological neurons are almost certainly scalar valued
(i.e., single channeled). However, a multichannel network can easily be
constructed out of traditional (scalar valued) nodes and connections. For an N-channel network we simply allocate a
pool of N neurons for each feature
node, and N2 for each
binary relation node.
The primary operation
carried out at an interface node is that of determining candidates for
stimulation among channels of the feature nodes. A close examination of the
algorithm described in Section 6 reveals the essential operation is that of multiplying a real N-vector by an
N´N Boolean matrix,
on the
right or left depending on whether the given object is appears in the first or
second argument of the given relation. For example if feature node A has stimulation vector v and is connected to slot F1 of relation node R having stimulation matrix m, then the vector v´m describes the potential
stimulation transmitted from A to
slot F2 of R. Thus it seems that implementing matrix-vector arithmetic is an
alternative to the use of symbols in representing and processing statements of
predicate logic.
It is plausible that the
brain does employ vector and perhaps matrix representations – if for no other
reason in order to handle perception and motor activities in physical space.
Several models for viseospatial tasks have been proposed which involve vector
representations and vector arithmetic; and some of these have been found to
correspond with readings taken from brain cells in behaving animals ( [3], [5]).
However to my knowledge none of these has dealt in particular with matrix-vector
multiplication.
Multichannel networks are also extremely simple, especially compared to the neural symbol systems of, say, Touretzky (1990) or Anderson and Lebiere (1998), which makes them good candidates to arise natural selection (especially given the preexistence of vector representations). These factors weigh in favor of neural plausibility for multichannel networks. In the current state of science, however, plausibility is all that can be claimed for any theory of cognitive neuroscience; and we certainly make no claims for the neural reality of multichannel networks.
7. Conclusion
Multichannel networks
provide a massively parallel, non-symbolic framework for the storage and
retrieval of atomic statements of predicate logic. Unlike traditional (scalar)
network models, multichannel networks can represent and retrieve propositions
based on their compositional semantic structure. Using an appropriate algorithm
for spreading activation, retrieval degrades gracefully in case of missing data,
in that near matches are retrieved in order of their structural similarity to
the desired datum. This graceful degradation yields a capacity for analogical
and case-based reasoning in the network.
References
[1]
Anderson and Lebiere (1998). The
atomic
components of thought. Lawrence Erlbaum
Associates, Mahwah, NJ.
[2] Jerry Fodor and Zenon Pylyshin (1988).
Connectionism and cognitive architecture:
A critical analysis. Cognition 28: 3-71.
[3] Apostolos P. Georgopoulous et al. Mental
rotation of the neuronal population vector.
Science 243: 234-236.
[4]
Hans Kamp and Ewe Reyle (1993).
From
Discourse to Logic. Klewer Academic
Boston.
[5] A.D. Redish, D. Touretzky. The reaching
task: evidence for vector arithmetic in the
motor system, Biological Cybernetics,
71(4): 307-317.
[6] David Touretzky (1990). BoltzCONS –
dynamic symbol structures in a
connectionist network. Artificial
Intelligence 46 (1-2): 5-46
[1] The mulit-channel approach is partly inspired by the (non-connectionist) work of Kamp and Reyle (1993), and the channels play a role similar to Kamp and Reyle’s discourse variables. The number of channels essentially bounds the number of distinct objects which can be simultaneously entertained in short term memory.