Learning Patterns and Pattern Sequences by Self-Organizing Nets of Threshold Elements
Shun'ichi Amari 1972
1. The Threshold Element
A threshold element has $n$ inputs with weights $w_1, \ldots, w_n$ and threshold $h$. Inputs $y_1, \ldots, y_n$ take values $\pm 1$. Output:
$$z = \operatorname{sgn}\Big( \sum_{i=1}^n w_i y_i - h \Big) \tag{1}$$
where
$$\operatorname{sgn}(u) = \begin{cases} +1 & u > 0 \\ -1 & u < 0 \end{cases} \tag{2}$$
Two elements $E(w_i;\, h)$ and $E(cw_i;\, ch)$, $c > 0$, have the same characteristics. So normalize:
$$|w_i| \leq 1, \qquad \max_i |w_i| = 1.$$
2. The Net
An autonomous net composed of $n$ threshold elements $E_1, \ldots, E_n$. The output of $E_i$ is $x_i$. Each output feeds into all others with unit time delay.
Weight matrix $W = (w_{ij})$, size $n \times n$, where $w_{ij}$ connects the output of $E_j$ to the input of $E_i$. Threshold vector $h = (h_i)$.
State = column vector $x = (x_i)$, each $x_i = \pm 1$. The input–output relation of each element gives:
$$x_i' = \operatorname{sgn}\Big( \sum_{j=1}^n w_{ij}\, x_j - h_i \Big)$$
Denoting the state-transition operator by $T$, the next state is written as
$$x' = Tx.$$
Equilibrium
A state $x = (x_i)$ which is invariant under $T$,
$$x = Tx,$$
is called an equilibrium state.
Cycles
An ordered set $B = \{x_1, \ldots, x_m\}$ with $x_{i+1} = Tx_i$ for $i = 1, \ldots, m{-}1$ is a state-transition sequence of length $m$. If in addition $Tx_m = x_1$ holds, it is a length-$m$ cycle.
Since the state space is finite, the sequence of states beginning at any $x_0$ either converges to an equilibrium or falls into a cycle within a finite number of transitions. Equilibrium states and cycles are patterns that the net can retain persistently.
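The transition dynamics above can be sketched in Python; `step` and `trajectory` are our own helper names, not the paper's notation:

```python
def step(W, h, x):
    """One application of T: component i of x' is sgn(sum_j w_ij x_j - h_i)."""
    n = len(x)
    return tuple(1 if sum(W[i][j] * x[j] for j in range(n)) - h[i] > 0 else -1
                 for i in range(n))

def trajectory(W, h, x):
    """Iterate T from x until a state repeats. The state space is finite,
    so every trajectory ends in an equilibrium or falls into a cycle."""
    seen = []
    while x not in seen:
        seen.append(x)
        x = step(W, h, x)
    return seen, x  # visited states, and the first repeated state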
3. Distance and the $u_i$ Functions
Let $\mathrm{dis}(x, y)$ be the Hamming distance between two states $x = (x_i)$ and $y = (y_i)$:
$$\mathrm{dis}(x, y) = \tfrac{1}{2} \sum_i |x_i - y_i|,$$
the number of components that differ. The set of states within distance $k$ of $x$ is the $k$-neighborhood:
$$N(x, k) = \{ y \mid \mathrm{dis}(y, x) \leq k \}.$$
Define $n$ functions $u_i(x, y)$ ($i = 1, 2, \ldots, n$) of two states $x$ and $y$ by
$$u_i(x, y) = y_i \Big( \sum_j w_{ij}\, x_j - h_i \Big).$$
Then $u_i(x, y) > 0$ when the $i$th component of $Tx$ coincides with $y_i$, and $u_i(x, y) < 0$ otherwise.
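A direct transcription, writing $u_i(x, y) = y_i\big(\sum_j w_{ij} x_j - h_i\big)$ with $w_{ij}$ the weight from $E_j$ into $E_i$:

```python
def u(W, h, x, y):
    """The n quantities u_i(x, y) = y_i * (sum_j w_ij x_j - h_i).
    u_i > 0 exactly when the i-th component of Tx coincides with y_i."""
    n = len(x)
    return [y[i] * (sum(W[i][j] * x[j] for j in range(n)) - h[i])
            for i in range(n)]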
4. Stability of State Transition
The operators $\min(k)$ play an important role. For $n$ real numbers $u_1, u_2, \ldots, u_n$, rearranged in order of magnitude as
$$u_{i_1} \leq u_{i_2} \leq \cdots \leq u_{i_n},$$
$\min(k)\{u_i\}$ denotes the $(k{+}1)$st smallest number:
$$\min(k)\{u_i\} = \begin{cases} u_{i_{k+1}} & 0 \leq k < n \\ u_{i_n} & k \geq n \end{cases}$$
Lemma 1
A necessary and sufficient condition that state $x$ falls in the $k$-neighborhood of $y$ after a single step, i.e. $Tx \in N(y, k)$, is
$$\min(k)\{u_i(x, y)\} > 0.$$
The number of negative $u_i$'s equals $\mathrm{dis}(Tx, y)$, so the inequality holds iff at most $k$ of the $u_i$'s are negative.
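The $\min(k)$ operator and the Lemma 1 test reduce to one sort; a minimal sketch with our own helper names:

```python
def min_k(k, us):
    """min(k){u_i}: the (k+1)-st smallest of the u_i, capped at the largest."""
    s = sorted(us)
    return s[min(k, len(s) - 1)]

def lands_within(us, k):
    """Lemma 1: Tx lies in N(y, k) iff min(k){u_i(x, y)} > 0,
    equivalently iff at most k of the u_i are negative."""
    return min_k(k, us) > 0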
$k$-stability number
The $k$-stability number $s(x, k)$ of the transition $x \to Tx$ is defined by
$$s(x, k) = \left\lfloor \tfrac{1}{2}\, \min(k)\{u_i(x, Tx)\} \right\rfloor,$$
where $\lfloor r \rfloor$ denotes the integer part of $r$.
Lemma 2
When $z$ belongs to the $s(x, k)$-neighborhood of $x$, $Tz$ belongs to the $k$-neighborhood of $Tx$. In particular, if $z$ belongs to the $s(x, 0)$-neighborhood of $x$, then $Tz = Tx$.
The stability number is a basin radius: perturbations within $s(x, k)$ don't knock the trajectory out of the $k$-ball around the target.
5. Stability of Equilibrium States
Theorem 1
State $x$ is equilibrium iff
$$\min_i \{u_i(x, x)\} > 0.$$
This is Lemma 1 with $y = x$ and $k = 0$.
Stability number of an equilibrium
The first stability number $s_1(x)$ is defined by
$$s_1(x) = s(x, 0) = \left\lfloor \tfrac{1}{2}\, \min(0)\{u_i(x, x)\} \right\rfloor.$$
The $j$th stability number is recursively defined by
$$s_{j+1}(x) = \left\lfloor \tfrac{1}{2}\, \min\big(s_j(x)\big)\{u_i(x, x)\} \right\rfloor.$$
The sequence $s_1(x), s_2(x), \ldots$ is monotonically nondecreasing, converging to the limit $s(x)$ within a finite number of terms. We call $s(x)$ the stability number of equilibrium state $x$. An equilibrium is stable if $s(x) > 0$, and unstable when $s_1(x) = 0$ (in which case $s(x) = 0$).
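The recursion runs directly on the sorted $u_i(x, x)$; a sketch assuming $x$ is already an equilibrium:

```python
import math

def stability_number(W, h, x):
    """s(x) of an equilibrium x: start from s_1 = floor(min(0){u_i}/2) and
    iterate s_{j+1} = floor(min(s_j){u_i}/2) until the value stops growing."""
    n = len(x)
    us = sorted(x[i] * (sum(W[i][j] * x[j] for j in range(n)) - h[i])
                for i in range(n))
    s = math.floor(us[0] / 2)              # s_1(x) = s(x, 0)
    while True:
        s_next = math.floor(us[min(s, n - 1)] / 2)
        if s_next == s:
            return s
        s = s_next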
Stability domain
The $k$th stability domain of $x$ is
$$D_k(x) = N\{x,\, s_k(x)\}.$$
Let $D_0(x)$ and $D(x)$ be, respectively,
$$D_0(x) = N(x, 0) = \{x\} \tag{10}$$
$$D(x) = N\{x,\, s(x)\}. \tag{11}$$
We call $D(x)$ simply the stability domain. These nest:
$$D_0(x) \subset D_1(x) \subset D_2(x) \subset \cdots \subset D(x)$$
Theorem 2
Let $x$ be an equilibrium state. Then the net arrives at $x$ after a finite number of state transitions if its initial state belongs to $D(x)$. More specifically, the net arrives at $x$ within $k$ transitions if the initial state belongs to $D_k(x)$.
6. Example: A 3-Element Net
A net of three elements. The weight matrix and threshold vector are
$$W = \begin{pmatrix} 0.6 & 1.0 & 0.5 \\ 1.0 & 0.6 & 0.6 \\ 0.5 & 1.0 & 0.8 \end{pmatrix}, \qquad h = \begin{pmatrix} 0 \\ -1.8 \\ -4.0 \end{pmatrix}$$
There are eight states. For $x = (1, 1, 1)^t$, where ${}^t$ denotes transposition,
$$u_1(x, x) = 2.1, \qquad u_2(x, x) = 4.0, \qquad u_3(x, x) = 6.3.$$
Since $\min\{u_i\} = 2.1 > 0$, $x$ is equilibrium. It is easy to show there are no other equilibrium states. The stability numbers of $x = (1, 1, 1)^t$ are calculated recursively:
$$\begin{aligned} s_1(x) &= \lfloor \tfrac{1}{2} \times 2.1 \rfloor = 1 \\ s_2(x) &= \lfloor \tfrac{1}{2} \times \min(1)\{u_i\} \rfloor = 2 \\ s_3(x) &= \lfloor \tfrac{1}{2} \times \min(2)\{u_i\} \rfloor = 3 \end{aligned}$$
So $s(x) = 3$. The stability domain is the entire state space: from any initial state, the net converges to $(1, 1, 1)$.
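These claims can be checked by brute force over all eight states (a small verification script, not from the paper):

```python
from itertools import product

W = [[0.6, 1.0, 0.5],
     [1.0, 0.6, 0.6],
     [0.5, 1.0, 0.8]]
h = [0.0, -1.8, -4.0]

def T(x):
    return tuple(1 if sum(W[i][j] * x[j] for j in range(3)) - h[i] > 0 else -1
                 for i in range(3))

states = list(product([-1, 1], repeat=3))
equilibria = [x for x in states if T(x) == x]

# Iterate a few steps from every state; all trajectories should end at (1,1,1).
endpoints = []
for x in states:
    for _ in range(8):
        x = T(x)
    endpoints.append(x)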
7. Stability of Sequences and Cycles
Let $B = \{y_1, y_2, \ldots, y_m\}$ be a state-transition sequence satisfying $y_{i+1} = Ty_i$. Starting at $s_B(y_m) = 0$, the following quantities are calculated recursively:
$$s_B(y_i) = s\big(y_i,\, s_B(y_{i+1})\big), \qquad i = m{-}1, \ldots, 1. \tag{12}$$
We call $s_B(y_i)$ the stability number of $y_i$ in state-transition sequence $B$. If the initial state belongs to the $s_B(y_i)$-neighborhood of $y_i$, the net passes successively through the $s_B(y_j)$-neighborhoods of $y_j$ ($j = i{+}1, \ldots, m{-}1$) and arrives at $y_m$ after $m - i$ transitions.
Cycles
For a cycle $C = \{y_1, \ldots, y_m\}$ (where $Ty_m = y_1$), define $s^1(y_i)$ from one pass around the cycle, then $s^2(y_i)$ from a second pass, etc. The sequence $s^1(y_i), s^2(y_i), \ldots$ is monotonically nondecreasing, converging within a finite number of terms. The limit
$$s_C(y_i) = \lim_{k \to \infty} s^k(y_i) \tag{13}$$
is the stability number of $y_i$ in cycle $C$.
Theorem 3
Let $s_C(y_i)$ be the stability number of $y_i$ in cycle $C$. Then the state of the net falls into cycle $C$ after a finite number of state transitions, provided the initial state belongs to the $s_C(y_i)$-neighborhood of $y_i$ for some $i$.
The stability number of the cycle:
$$s(C) = \min_i\, s_C(y_i)$$
8. Self-Organization
Patterns are applied to the net from outside. Let $x(t)$ be the stimulus pattern at time $t$. The weight increments are represented by
$$w_{ij}(t+1) - w_{ij}(t) = -\alpha\, w_{ij}(t) + \beta\, f_{ij}(t) \tag{14}$$
where $\alpha, \beta > 0$ are small positive constants.
Type I net
$$f_{ij}(t) = x_i(t)\, x_j(t)$$Weight $w_{ij}$ connecting the output of $E_j$ to the input of $E_i$ increases by a unit when the stimulus pattern coincides with the $j$th component, and decreases when they differ.
Type II net
$$f_{ij}(t) = x_i(t)\, x_j(t-1)$$$w_{ij}$ increases when the $i$th component of the current pattern coincides with the $j$th component of the previous one.
Solution
Eq. (14) can be solved to yield
$$w_{ij}(t) = (1 - \alpha)^t\, w_{ij}^0 + \beta \sum_{\tau=0}^{t-1} (1 - \alpha)^{t - \tau - 1}\, f_{ij}(\tau)$$
where $w_{ij}^0$ is the initial value. As $t$ becomes large, the first term attenuates and the second dominates.
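The closed form can be checked against the recursion it solves; a quick numeric sketch with an assumed constant stimulus term $f$:

```python
# Recursion of eq. (14): w(t+1) = (1 - a) w(t) + b f(t), constant f here.
a, b, w0, f, T = 0.1, 0.1, 0.0, 1.0, 60

w = w0
for t in range(T):
    w = (1 - a) * w + b * f

# Closed-form solution with the same data.
closed = (1 - a) ** T * w0 + b * sum((1 - a) ** (T - tau - 1) * f
                                     for tau in range(T))
# Both approach (b / a) * f as T grows, since the first term attenuates.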
The correlation matrix
When $\alpha$ is sufficiently small, the weighted time average of $f_{ij}(\tau)$ can be approximated by the ordinary time average. Therefore,
$$w_{ij}(t) \approx \frac{\beta}{\alpha}\, k_{ij}$$
holds approximately for large $t$, where $K = (k_{ij})$ is the correlation matrix of the stimulus patterns. When $\alpha = \beta$, $w_{ij}(t)$ converges to $k_{ij}$.
9. Learning a Single Pattern Under Noise
Type I net. Initial weights $w_{ij}^0 = \delta_{ij}$ (identity). Threshold $h_i = 0$. Every component threshold element is self-dual: $Tx = y$ implies $T(-x) = -y$. So $x$ and $-x$ are treated as the same pattern.
When only pattern $x$ is applied repeatedly (noise-free, with $\alpha = \beta$), the weight matrix converges to
$$W = x\, x^t.$$
Noise model
Let $\tilde{x}$ be a random pattern whose components are obtained by reversing the sign of each component of $x$ independently with probability $p$. The correlation entries are
$$k_{ij} = (1 - 2p)^2\, x_i x_j \ \ (i \neq j), \qquad k_{ii} = 1.$$
By putting $1 - \sigma^2 = (1 - 2p)^2$, or
$$\sigma^2 = 4p(1 - p), \tag{17}$$
we obtain
$$K = (1 - \sigma^2)\, x\, x^t + \sigma^2 E,$$
where $E$ is the unit matrix. The weight matrix eventually coincides with this $K$.
Theorem 4
Under noise disturbances of intensity $\sigma^2$, the net is self-organized as follows.
a) Pattern $x$ is stored in the net as an equilibrium state and its stability number is
$$s(x) = \left\lfloor \tfrac{1}{2}\big\{(1 - \sigma^2)\, n + \sigma^2\big\} \right\rfloor.$$
b) Those $y$ satisfying
$$|\cos(x, y)| < \frac{\sigma^2}{(1 - \sigma^2)\, n} \tag{20}$$
remain unstable equilibrium states with stability number 0. All other patterns change to $x$ in a single state transition.
Here $\cos(x, y) = \tfrac{1}{n}\, x \cdot y$ is the normalized inner product. The net remembers $x$ correctly: even when the noise intensity is very strong, $x$ remains an equilibrium provided $n$ is sufficiently large. When a pattern $y$ not satisfying (20) is given from the outside, the net changes state automatically to $x$ and remains in it.
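A small simulation of this setting (sizes, seed, and noise level our own): learn one random pattern under component-flip noise, then check that it is an equilibrium and that a corrupted cue is corrected in one step:

```python
import random
random.seed(1)

n, p, steps = 32, 0.2, 500        # sigma^2 = 4p(1 - p) = 0.64
x = [random.choice([-1, 1]) for _ in range(n)]

# Type I learning: W converges to the correlation matrix of the noisy stimuli.
W = [[0.0] * n for _ in range(n)]
for _ in range(steps):
    xt = [xi if random.random() > p else -xi for xi in x]
    for i in range(n):
        for j in range(n):
            W[i][j] += xt[i] * xt[j] / steps

def T(z):
    return [1 if sum(W[i][j] * z[j] for j in range(n)) > 0 else -1
            for i in range(n)]

cue = x[:]
for i in (0, 1, 2):               # corrupt three components of the cue
    cue[i] = -cue[i]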
10. Learning Many Patterns
$m$ patterns $x_\alpha$ ($\alpha = 1, 2, \ldots, m$) are applied repeatedly to the net with relative frequency $\lambda_\alpha$ under noise disturbance of intensity $\sigma_\alpha^2$. Put
$$\tilde{\lambda}_\alpha = \lambda_\alpha (1 - \sigma_\alpha^2), \qquad \sigma^2 = \sum_\alpha \lambda_\alpha \sigma_\alpha^2.$$
Then by learning these patterns, the weight matrix converges to
$$W = \sum_\alpha \tilde{\lambda}_\alpha\, x_\alpha x_\alpha^t + \sigma^2 E$$
Cross-talk
Let $\varepsilon_{\alpha\beta} = \cos(x_\alpha, x_\beta)$ be the cosine of the angle between two patterns. Let $\tilde{\varepsilon}_\alpha$ be
$$\tilde{\varepsilon}_\alpha = \sum_{\beta \neq \alpha} \tilde{\lambda}_\beta\, |\varepsilon_{\alpha\beta}|,$$
a weighted sum of the absolute values of the cosines between $x_\alpha$ and all other patterns. When $x_\alpha$ is orthogonal to all the other patterns, $\tilde{\varepsilon}_\alpha = 0$.
Theorem 5
The net remembers $x_\alpha$ as an equilibrium state when
$$\tilde{\lambda}_\alpha > \tilde{\varepsilon}_\alpha.$$
Moreover, the net recalls $x_\alpha$ correctly from any $z$ belonging to $N(x_\alpha, s_\alpha)$, where
$$s_\alpha = \left\lfloor \tfrac{1}{2}\big\{(\tilde{\lambda}_\alpha - \tilde{\varepsilon}_\alpha)\, n + \sigma^2\big\} \right\rfloor.$$
As $\lambda_\alpha$ becomes large, or $\tilde{\varepsilon}_\alpha$ becomes small, $x_\alpha$ can be remembered better. When the $x_\alpha$ are orthogonal to one another, $\tilde{\varepsilon}_\alpha = 0$ holds and all the patterns are remembered. If their relative frequencies are the same ($\lambda_\alpha = 1/m$), $x_\alpha$ is recalled from a pattern belonging to $N(x_\alpha, s_\alpha)$, where
$$s_\alpha = \frac{(1 - \sigma_\alpha^2)\, n}{2m} + \frac{\sigma^2}{2}\,.$$
Capacity
A type I net can store $n$ mutually orthogonal patterns, each carrying $n$ bits of information. The number of independent synaptic weights is $\tfrac{1}{2}n(n+1)$, so the net stores roughly as many bits as it has parameters.
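A minimal instance with two orthogonal patterns (data our own), stored by the correlation rule $W = \tfrac{1}{m}\sum_\alpha x_\alpha x_\alpha^t$ and recalled from a corrupted cue:

```python
n, m = 8, 2
x1 = [1,  1,  1,  1,  1,  1,  1,  1]
x2 = [1, -1,  1, -1,  1, -1,  1, -1]     # orthogonal: x1 . x2 = 0

W = [[(x1[i] * x1[j] + x2[i] * x2[j]) / m for j in range(n)] for i in range(n)]

def T(z):
    return [1 if sum(W[i][j] * z[j] for j in range(n)) > 0 else -1
            for i in range(n)]

cue = x1[:]
cue[0] = -cue[0]                          # one corrupted bit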
11. Learning Pattern Sequences
Type II net. When pattern sequences $B_\alpha = \{x_1^\alpha, \ldots, x_{m_\alpha+1}^\alpha\}$ are applied repeatedly as stimuli with relative frequencies $\lambda_\alpha$, weight matrix $W$ is expected to converge to
$$W = \sum_\alpha \frac{\lambda_\alpha}{m_\alpha} \sum_{j=1}^{m_\alpha} x_{j+1}^\alpha\, (x_j^\alpha)^t.$$
Cycles
Let $B_\alpha$ ($\alpha = 1, 2, \ldots, k$) be $k$ pattern sequences, $B_\alpha$ being of length $m_\alpha$. Some of the $B_\alpha$ may be cycles.
Theorem 6
By applying $k$ sequences $B_\alpha$ with relative frequencies $\lambda_\alpha$, the net remembers $B_\alpha$ as a state-transition sequence or cycle when
$$\frac{\lambda_\alpha}{m_\alpha} > \varepsilon_j^\alpha$$
for all $j = 1, 2, \ldots, m_\alpha{+}1$, where $\varepsilon_j^\alpha$ is the cross-talk of $x_j^\alpha$ with all the other patterns. The stability number of $x_j^\alpha$ in $B_\alpha$ is not less than
$$s_j^\alpha = \left\lfloor \frac{n}{2}\!\left(\frac{\lambda_\alpha}{m_\alpha} - \varepsilon_j^\alpha\right) \right\rfloor$$
Sequential recall
After the remembrance of $B_\alpha$, if a pattern $z$ resembling $x_j^\alpha$ is given as a stimulus, the net successively produces $x_{j+1}^\alpha, x_{j+2}^\alpha, \ldots$ and arrives at the final pattern $x_{m_\alpha+1}^\alpha$. The net recalls the part of the sequence after the cue, not before it. This sequential recall resembles that of human memory.
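A minimal type II instance (data our own): a two-state cycle $x_1 \to x_2 \to x_1$ of orthogonal patterns stored with the cross-correlation rule $W = \tfrac{1}{m}\sum_j x_{j+1}\, x_j^t$:

```python
n = 8
x1 = [1,  1,  1,  1,  1,  1,  1,  1]
x2 = [1, -1,  1, -1,  1, -1,  1, -1]     # orthogonal to x1

# Cross-correlation weights: each stored transition maps x_j onto x_{j+1}.
W = [[(x2[i] * x1[j] + x1[i] * x2[j]) / 2 for j in range(n)] for i in range(n)]

def T(z):
    return [1 if sum(W[i][j] * z[j] for j in range(n)) > 0 else -1
            for i in range(n)]

cue = x1[:]
cue[0] = -cue[0]                          # a pattern resembling x1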
Corollary
When the patterns constituting the sequences are orthogonal to one another, all sequences are remembered, and the stability number of $B_\alpha$ is
$$s_\alpha = \left\lfloor \frac{n\lambda_\alpha}{2m_\alpha} \right\rfloor.$$
12. Formation of Representative Patterns
A number of stimulus patterns are randomly applied to a type II net. Let $S_X = \{x_1, \ldots, x_m\}$ be a set of patterns, all equally likely. The weight matrix converges to
$$W = \frac{1}{m^2} \sum_{i,j} x_i\, x_j^t.$$
If we put the mean pattern
$$X = \frac{1}{m} \sum_i x_i,$$
$W$ is rewritten as $W = XX^t$. Since the components of $X$ are not necessarily $\pm 1$, define the representative pattern
$$x = \operatorname{sgn} X, \qquad x_i = \operatorname{sgn}(X_i),$$
where every component $X_i$ of $X$ is assumed not to vanish.
Theorem 7
When pattern set $S_X$ is applied, $x = \operatorname{sgn} X$ becomes an equilibrium state of the net, and
$$Tx_i = x \quad \text{for all } i, \text{ provided } x_i \cdot X > 0.$$
The net forms a "concept" — a representative pattern — from the set of stimuli.
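A toy pattern set (our own data) illustrating Theorem 7: with $W = XX^t$, the sign of the mean pattern is an equilibrium, and each stimulus with $x_i \cdot X > 0$ maps to it in one step:

```python
m, n = 3, 6
S = [[1,  1,  1, -1, -1,  1],
     [1,  1, -1, -1, -1, -1],
     [1, -1,  1, -1,  1,  1]]

X = [sum(s[i] for s in S) / m for i in range(n)]      # mean pattern
rep = [1 if Xi > 0 else -1 for Xi in X]               # representative sgn X
W = [[X[i] * X[j] for j in range(n)] for i in range(n)]

def T(z):
    return [1 if sum(W[i][j] * z[j] for j in range(n)) > 0 else -1
            for i in range(n)]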
Two sets of patterns
Consider the case where two sets of patterns are applied. Let $S_X = \{x_1, \ldots, x_m\}$ and $S_Y = \{y_1, \ldots, y_k\}$ be two sets, and let $X$ and $Y$ be their means respectively. After organization, the weight matrix converges to
$$W = \tfrac{1}{2}\big(X X^t + Y Y^t\big),$$
where the small cross terms between the two sets are ignored. Put
$$y = \operatorname{sgn} Y \tag{29}$$
Theorem 8
A necessary and sufficient condition for $x = \operatorname{sgn} X$ to be equilibrium is
$$\varepsilon_x > 0, \tag{32}$$
and similarly $\varepsilon_y > 0$ for $y = \operatorname{sgn} Y$, where $\varepsilon_x$ and $\varepsilon_y$ are respectively
$$\varepsilon_x = \min_i \tfrac{1}{2}\big\{|X_i|\,(x \cdot X) + x_i Y_i\,(x \cdot Y)\big\} \tag{30}$$
$$\varepsilon_y = \min_i \tfrac{1}{2}\big\{|Y_i|\,(y \cdot Y) + y_i X_i\,(y \cdot X)\big\} \tag{31}$$
The two equilibrium states, corresponding to $S_X$ and $S_Y$, are separately formed when $|x \cdot Y|$ is much smaller than $\sum_i |X_i|$. The result is similar to that of the four-layer perceptron. The present net may be regarded as a model for concept formation.
13. The Big Picture
The whole paper is one idea iterated at increasing levels of complexity:
| Object | What the net does | Measured by |
| --- | --- | --- |
| equilibrium state | remembers a pattern | $s(x)$ |
| cycle | remembers a sequence | $s(C)$ |
| stability domain $D(x)$ | recalls from noisy input | Hamming radius |
| correlation matrix $K$ | emerges from repeated stimuli | weight convergence |
| representative pattern | forms a concept from a set | $\operatorname{sgn}$ of mean |
The stability number $s(x)$ is the central invariant. It converts the qualitative question "does the net remember?" into a quantitative one: "how large is the basin?"
The capacity result: a type I net with $n$ elements and $\tfrac{1}{2}n(n+1)$ independent weights can store $n$ orthogonal patterns of $n$ bits each, so the information capacity is of the same order as the number of parameters.
The bridge to Hopfield (1982): Hopfield's energy function $E = -\tfrac{1}{2} \sum w_{ij}\, x_i x_j$ is implicit here. Amari's stability numbers give the same convergence guarantees, but from the state-transition side rather than the energy side.