Compositional Semantics in a Localist

Neural Network

 

Reprint from the Proceedings of ICAI – 2001 (3)

 

J.N. Rushton

Artificial Intelligence Center

The University of Georgia

 Athens, GA  30602-7415

email: nrushton@ai.uga.edu

 

 

 

Abstract: We discuss representation and content-based retrieval of atomic statements of predicate logic in a localist neural network. The semantics of our representation is compositional, despite the fact that it employs no symbols — countering prominent objections to connectionism raised by Fodor and Pylyshin [2]. Our techniques can implement massively parallel database retrieval, as well as case-based and analogical reasoning. The key technique is the use of network nodes whose stimulation values are matrices rather than the scalars. Examples and simulation results are included.


 

 


Keywords: compositional semantics, neural networks, database, multichannel networks.

 

 

 


1. Introduction

 

Connectionism has spawned successful models in many areas of AI, such as content based retrieval, word-sense disambiguation, and pattern recognition. However, these models have generally failed to address the compositionality of propositions used in “higher” cognition — whose meaning depends on structure rather than the mere presence or absence of features. Thus  most simulations of  higher cognitive functions have used symbol structures to represent knowledge. Even many systems which are connectionist in their low level detail, such as Anderson & Lebiere’s ACT-RN [1], and Touretzky’s BoltzCONS [6], represent atomic data components by patterns of activation, which can be instantiated and then copied from one battery of nodes to another. These batteries are essentially Platonic “wax tablets,” which record symbols consisting of patterns of activation. Indeed, both systems are connectionist implementations of preexisting symbol-based architectures — LISP in the case of Touretzky, and  ACT-R in the case of Anderson and Lebiere.

            Here we take a different approach — a thoroughgoing connectionist approach — to the representation and retrieval of combinatorial data structures. Section 2 introduces the basic tool, the multichannel network, and describes how it is used to represent knowledge in short term memory. Section 3 describes storage in long term memory. Section 4 treats content-based retrieval from long term memory. Section 5 gives examples and simulation results. Section 6 discusses neural plausibility.

 

 

 

 

 

2. Storage in Short Term Memory

 

Consider a traditional network with four nodes representing the predicates red, blue, dog, and cat. In order to represent the presence of, say, a blue dog, we stimulate the blue and dog nodes. In this way we can represent the predicates in any arrangement, so long as they adhere in a single subject. But now what if we want to represent the presence of a blue dog and a red cat?

What we need is a way to keep track of node activation “about” different subjects, without interference between activations pertaining to different ones. An analogy comes to mind with a telephone network: many conversations can travel along the same lines at once, without interfering, because the signals travel on different channels in the network. This analogy will carry us a long way.

Suppose our network has, say, three different channels, labeled x, y, and z. We can now represent the presence of a blue dog and a red cat by activating, say, the blue and dog nodes on channel x, and the red and cat nodes on channel y. By introducing multiple channels into the network nodes, it goes from representing propositions (e.g., the feature blue is present) to representing bona fide one-place predicates (e.g., x is blue). Note that the stimulation of each node can now be viewed as a 3-dimensional vector, where stimulation on each channel occupies one component in the vector; so stimulation of strength 0.0 on channel x, 1.0 on channel y, and 0.5 on channel z could be viewed as stimulation lf [0.0, 1.0, 0.5] for the node[1]. 

The problem arises again when we try to represent higher arity predicates. Consider the proposition that Tony Loves Maria. However we stimulate nodes for Tony, Maria, and Loves, even using multiple channels, we cannot account for who loves whom. The problem is that Loves is not simply a predicate about both Tony and Maria — it also concerns the roles the subjects play in the relation. The difficulty in representing such structures is perennial in network architectures, and has been put forth by Fodor and Pylyshin [2] as an insurmountable barrier to the connectionist modeling of higher cognition.

But if scalar valued nodes can represent 0-ary propositions, and vector valued nodes can represent one-place predicates, then for two-place predicates it is natural to turn to matrix valued node stimulation. So suppose that, in addition to the three “feature” channels x, y, and z, our network has nine “relation” channels: xx, xy, xz, yx, yy, yz, zx, zy, and zz. The stimulation of a node for a two-place relation, such as Loves, can now be regarded as a 3´3 matrix consisting of its stimulation levels on the various relation channels. In order to represent that Tony loves Maria, we can stimulate a node for Tony on channel x, one for Maria on channel y, and the loves node on channel xy. Note this is different from the representation of Maria Loves Tony, in which the loves node would be stimulated on channel yx (which, however, could be present instead or in addition).

One might fear that in order to represent higher arity relations, we would have to move to higher dimensional “hyper matrices.” However, it turns out that two is the magic number, and higher arity relations can be viewed as complexes of binary relations. In general, Pred(x1,x2,…,xn) can be represented as Pred-instance(x), first-arg(x,x1), second-arg(x,x2),…, nth-arg(x,xn).

 

 

 

 

3. Storage in Long Term Memory

 

The previous section showed how to represent information in short term memory of a multichannel network, in terms of stimulation levels. Given a fixed number of channels, only a handful of propositions can be represented in this way at once, and in any case such information fades away when the network is dormant. This section shows how to form connections among nodes so that propositions can be stored permanently in long term memory, so that the number of available channels does not limit the number of propositions able to be simultaneously stored.

We introduce a special interface node, labeled ‘Ä’, which mediates between the feature and relation nodes. Each interface node has three slots which can be used to make connections with exactly three other nodes — two feature nodes and a relation node, which at any given interface will be labeled F1, F2, and R (imagining a physical network, these labels can be thought of as being implicit in the physical structure of the interface node). An interface node with connections made in this way stores the proposition “F1 stands in relation R to F2.” We will see in Sections 4 and 5 that, with an  appropriate algorithm for spreading activation across interface nodes, this reading is the right one.

We can now represent, in long term memory, knowledge involving both features and relations, such as Lisbon is the capital of Portugal:


 

 

Figure 1:

                                          Lisbon                                        Portugal

                                               

                                                           

                         

                                                          F1           F2

 


                                                               R

                                     

 

                                                                 capital_of

 

 


 

Alternatively, such facts could be represented in proposition networks reminiscent of, e.g., Anderson and Lebiere (1998). A node is allocated for the fact itself, and then connections are made to indicate its subject and predicate, and possibly an object, instrument, time, place, and other semantic roles. Thus rather than capital_of(Portugal, Lisbon), we would allocate a node for the proposition P that Lisbon is the capital of Portugal, and then form connections representing the facts subject_of(P,Lisbon), object_of(P, Portugal), predicate_of(P, capital_of). In this way, the proposition nodes give each fact a “handle,” allowing us to recursively represent such statements as Bob thinks Mary said Lisbon is the Capital of Portugal.

 

 

4. Retrieval from Long Term Memory

 

A multichannel network, consisting of nodes and connections formed as above, can use spread-of-activation to retrieve stored propositions relevant to a query, and to deliver its answer or answers. Different approaches to propagation give different effects; conservative propagation favors soundness while a more liberal approach favors completeness and spontaneous generalization. The basic method used here is designed primarily to illustrate the channel-sensitive approach to propagation which allows the network to represent and retrieve combinatorial structures.

The stimulation of the relation nodes is determined by the query and held constant at 1.0 for each channel which is stimulated. At each time step, activation is propagated among the feature nodes according to the following algorithm :

 

 

  1. Each source node (as designated by the query) receives a stimulation of strength 1 on its designated channel(s).
  2. If R is a binary relation node stimulated on channel xy, and N1 and N2 are feature nodes connected to Slots 1 and 2 of (an interface node of) R respectively, then channel x of N1 is a candidate to receive stimulation from channel y of N2.
  3. If R is a binary relation node stimulated on channel xy, and N1 and N2 are feature nodes connected to Slots 1 and 2 of (an interface node of) R respectively, then channel y of N2 is a candidate to receive stimulation from channel x of N1.
  4. Each channel of each feature node passes on 2/3 of its current stimulation, divided evenly among  all candidates. The remaining 1/3 of the stimulation dissipates. These transfers occur simultaneously so that stimulation received during the current time step cannot be transferred until the subsequent time step.

 

Stimulation on each channel of each node is nondecreasing. The amount of new energy introduced into the system decreases by 1/3 at each time step, and so the total energy of the network forms a convergent geometric series. Thus the network state converges in every case. 

 

 

5. Examples

 

5.1 Basic retrieval

 

Our first example illustrates basic data retrieval. A proposition network was formed representing the following data using the binary predicates subject, object, and relation:

 

A hippie is in the park.

A hippie is in the church.

A hippie is in the bank.

A captain is in the park.

A captain is in the church.

A debutante is in the bank.

A fireman is in the park.

A lawyer is in the cave.

The church is in the park



The query who is in the church is rendered into the following source stimulations: church(x), in(y), subject_of(pz), relation_of(py), object_of(px). z is the query variable: as the network evolves, we should expect nodes representing correct answers to accumulate stimulation on Channel z.

After evolving the network 10 cycles, the correct answers captain and hippie respond on Channel z with stimulation levels 0.17 and 0.21 respectively. The stimulation level of the nearest competitor is 0.035, roughly an order of magnitude less. The slight disparity between the stimulation of the correct answers is due to the fact that being the agent of ‘in-ness,’ according to our database, is slightly more typical of hippies than of captains.

It is of interest that the incorrect answer park does not respond on Channel z at all, even though it is associated with query atoms church and in, since our database contains the proposition The church is in the park. This illustrates the ability of the network to take into account the structure of the data and query, and is the main difference between the mulichannel approach and classical models for spread-of-activation retrieval.


 

 

5.2 Case based reasoning

 

Table 1:

 

      species                  pupils                                shape of head              general color               venom type

copperhead

elliptical

triangular

red

venomous

rattlesnake

elliptical

triangular

brown

venomous

black rat snake

round

tapered

dark

non-venemous

black racer

round

tapered

dark

non-venomous

 

 


To illustrate case-based reasoning we use the data of  Table 1, represented in a network using the binary relations pupil_shape, head_shape, color, and venom_type. This parallel database could be used to retrieve stored propositions as in the previous example, but we are now concerned with what happens when the answer to the query is not in the database. For a test case suppose we encounter a water moccasin, which has elliptical pupils, a triangular head, and is dark in color. To query whether the moccasin  is venomous, we initialize the network with the following activations: elliptical(w), triangular(x), dark(y), pupil_shape(vw), head_shape(vx), color(vy), venom_type(vz). Again, z is the query variable, corresponding to the channel on which appropriate answers should respond.

After evolving the network for 10 cycles, venomous responds on Channel z with a strength of 0.39 while non-venomous responds with strength 0.195. The strength of the venomous response is roughly twice that of non-venomous because, according to our database, the water moccasin shares twice as many features in common with venomous snakes as with non-venomous ones, on average per species. In a more sophisticated model, features could be weighted in importance by altering network connection strengths, and combined non-linearly by adding “hidden nodes” corresponding to conjunctions of basic features.

 

 

6.  Neural Plausibility

 

The stimulation levels of biological neurons are almost certainly scalar valued (i.e., single channeled). However, a multichannel network can easily be constructed out of traditional (scalar valued) nodes and connections. For an N-channel network we simply allocate a pool of N neurons for each feature node, and N2 for each binary relation node.

The primary operation carried out at an interface node is that of determining candidates for stimulation among channels of the feature nodes. A close examination of the algorithm described in Section 6 reveals the essential operation is that of multiplying a real N-vector by an N´N Boolean matrix, on the right or left depending on whether the given object is appears in the first or second argument of the given relation. For example if feature node A has stimulation vector v and is connected to slot F1 of relation node R having stimulation matrix m, then the vector v´m describes the potential stimulation transmitted from A to slot F2 of R. Thus it seems that implementing matrix-vector arithmetic is an alternative to the use of symbols in representing and processing statements of predicate logic.

It is plausible that the brain does employ vector and perhaps matrix representations – if for no other reason in order to handle perception and motor activities in physical space. Several models for viseospatial tasks have been proposed which involve vector representations and vector arithmetic; and some of these have been found to correspond with readings taken from brain cells in behaving animals ( [3], [5]). However to my knowledge none of these has dealt in particular with matrix-vector multiplication.

Multichannel networks are also extremely simple, especially compared to the neural symbol systems of, say, Touretzky (1990) or Anderson and Lebiere (1998), which makes them good candidates to arise natural selection (especially given the preexistence of vector representations). These factors weigh in favor of neural plausibility for multichannel networks. In the current state of science, however, plausibility is all that can be claimed for any theory of cognitive neuroscience; and we certainly make no claims for the neural reality of multichannel networks.

 

           

7.  Conclusion

 

Multichannel networks provide a massively parallel, non-symbolic framework for the storage and retrieval of atomic statements of predicate logic. Unlike traditional (scalar) network models, multichannel networks can represent and retrieve propositions based on their compositional semantic structure. Using an appropriate algorithm for spreading activation, retrieval degrades gracefully in case of missing data, in that near matches are retrieved in order of their structural similarity to the desired datum. This graceful degradation yields a capacity for analogical and case-based reasoning in the network.

 

 


References

 

 

 


[1]   Anderson and Lebiere (1998). The atomic   

        components of thought. Lawrence Erlbaum

        Associates, Mahwah, NJ.

 

[2]    Jerry Fodor and Zenon Pylyshin (1988).

        Connectionism  and cognitive architecture:

        A critical analysis. Cognition 28: 3-71.

 

[3]     Apostolos P. Georgopoulous et al. Mental

          rotation of the neuronal population vector.

          Science 243: 234-236.

 

 

[4]    Hans Kamp and Ewe Reyle (1993).  From

         Discourse to Logic. Klewer Academic

         Boston.

 

[5]    A.D. Redish, D. Touretzky. The reaching

         task:  evidence for vector arithmetic in the

         motor system,  Biological Cybernetics

         71(4): 307-317.

 

[6]    David Touretzky (1990). BoltzCONS –

        dynamic symbol structures in a  

        connectionist network. Artificial

        Intelligence 46 (1-2): 5-46


 



[1] The mulit-channel approach is partly inspired by the (non-connectionist) work of Kamp and Reyle (1993), and the channels play a role similar to Kamp and Reyle’s discourse variables. The number of channels essentially bounds the number of distinct objects which can be simultaneously entertained in short term memory.