This paper was published in *ACM Sigplan Notices*, Vol 29 No 9, September 1994, pp 159-167.

This WWW version was prepared by the author from the text submitted for publication, and appears by permission of the copyright holder. Copyright © 1994 by the Association for Computing Machinery, Inc. Please click here for full ACM copyright notice.

**Brian Meek**

**King's College London**

*b.meek@hazel.cc.kcl.ac.uk*

[Note added in 1996: now *brian.meek@kcl.ac.uk*]

**Introduction**

This is the second article based on language-independent standardisation
work being carried out by international standards working group ISO/IEC
JTC1/SC22 WG11 (Language Bindings). The first, *Programming languages -
towards greater commonality* [Meek 1994] was an overview, which briefly
described the various and the relationships between them. This article goes
into one of these projects in greater detail. Both articles are based on
presentations to the DECUS (UK and Ireland) symposia, this one given in May
1994. A third article, *What is a procedure call?*, based on a similar
presentation in 1993 [Meek 1993], is projected.

**What exactly is a datatype?**

It is surprisingly difficult to get people to agree on what constitutes
a "datatype". Most have a clear idea of what a datatype is - an idea, anyway -
but as with many things in this field - like what a procedure call is,
discussed in the projected third paper - they are often not the *same*
idea.

The datatype concept crops up in a wide range of contexts, data being
ubiquitous, and different contexts - databases, communications, many different
programming languages - have their own culture and conventions. Such a variety
of background is almost certain to lead to different perceptions. The taxonomy
described here is based, though informally and fairly loosely, on that used in
Draft International Standard 11404, *Language Independent Datatypes*
(LID); the "language independent" (LI) indicates the attempt to avoid the
presuppositions usually present in the culture and conventions of any
particular language - or indeed any other use of the concept. Interpretations
and details are the author's own, and should not be assumed to be present in
the standard itself.

In constructing any taxonomy it is necessary to decide not only what a datatype is but what it is not. Despite the apparent derivation of the word, it is not simply "a type of data", but a concept in its own right. (This is the main reason for removing the space in "data type".) Things that have a datatype may not themselves be an item of data (at least of that datatype; it may be a value of some other datatype). An integer variable is not itself an integer. If that distinction is not clear (and some programmers find it difficult to see), an X channel (for I/O) in occam is not an X value - it is something that transports X values and only X values. Something "of datatype X" has properties (depending on its own nature) that in some way relate to the type of data that X values are. This is fine distinction, but it emphatically is not nitpicking: it is crucial to any taxonomy if it is to have any hope of encompassing generically the many different views that there are.

The next thing is to abandon any representational view of a datatype. Data is an abstraction, capable of representation in many forms, and "datatype" is a more abstract still; it is disastrous to link the concept to particular representational forms. This applies as much to second-order representational assumptions (e.g. a complex value as an ordered pair of real values, or an array as a contiguous block of values) as it does to first-order assumptions such as bit patterns. Such assumptions may be useful, even essential, in particular contexts, but a generic taxonomy cannot afford such restrictions.

Other things that have to go from the definition of datatype are the operations that can be performed on the data values. Many languages people are startled by this, even horrified. To them, the values and the operations on them are inseparable. In that data in a program is of limited use if you cannot do operations, this is an understandable attitude, but it overlooks two things.

One is that you can distinguish between static and dynamic aspects -
programming languages separate static data declaration and dynamic procedural
aspects, implicitly if not explicitly. The point of this is that some
functionality related to languages is purely static as far as the data is
concerned. For example, a data transmission channel operates *with* the
data (dynamic) but not *on* the data, which is unchanged (if the channel
is working correctly!). Operations such as addition and subtraction are an
irrelevance here, and can even be a nuisance, for example when talking of
conformity to a standard.

The other overlooked aspect is that it is not always obvious what the
operations should be. Languages do not always agree, and some with "derived
datatype" facilities allow default operations to be suppressed or replaced, and
others added. An example is that, for a **new** **integer** called
**year**, add and multiply are suppressed (meaningless) and subtraction is
replaced; the subtraction is normal but the datatype of the result is not
**year** but a datatype representing number of years.

I am not saying you could not include operations in a taxonomy, but to do it properly needs the full machinery of object-orientation. In standards work you need to set yourselves a specific achievable target, not make an open-ended commitment, so DIS 11404 does not attempt it. A taxonomy based on values alone is infinite, but manageably infinite, allow operations, and it becomes hard to see how or where you would ever stop, quite apart from (in a standard) specifying testable conformity requirements.

**The basis of the taxonomy**

The taxonomy follows the common practice of starting with a number of primitive datatypes and then using these to construct others. There are three main kinds of constructed datatypes: subtypes, generated datatypes, and aggregates. (In fact aggregates are technically also generated datatypes, but important enough to deserve separate classification.)

**Primitive datatypes**

Primitive datatypes are datatypes whose values are regarded fundamental
- not subject to any reduction. They just "are". Those values are what some
languages have called "atomic", or "plain values". Many primitive datatypes
are also *generic*, in the sense that they have an unlimited number of
values, and hence the datatypes often used in practice are confined to a finite
subset of them. The reason that they are used in the taxonomy, rather than
"actual" achievable datatypes, is threefold: it is a convenient way to identify
a class of datatypes which is infinite in extent; language definitions
commonly use them, meaning that they simplify the binding between the LI
datatypes and the ones used by specific languages; and it allows for the
possibility of actually supporting them if a language is designed to do so (in
the sense, for example, that an integer of any arbitrary magnitude can be
accommodated, even if in practice at some point you will run out of storage or
time).

It is important here to distinguish "a language" from any given implementation
of that language. A language definition will usually specify **integer** as
a datatype, but leave the actual range of integers supported to the
implementation. The most they are likely to specify is that the range must be
of continuous values and cover at least a specific minimal range; some do not
even go that far. However, for the LI datatype taxonomy, each one of all the
possible subsets of the integer domain is a distinct datatype. If you are
specifying a LI service in terms of LI datatypes, you cannot just bind
**integer** to **integer** and leave it at that - you won't always be
able to be certain that a service will work. For some services, a mismatch, or
certain kinds of mismatch, will not matter, but for some (or for some uses of
the service) it can matter. Programmers have long experienced such problems
when transferring applications, even from one implementation of a language to
another - and even when both conform to the language standard, if the standard
has weak conformity requirements in this area.

The primitive datatypes of the taxonomy are **Boolean**, **State**,
**Enumerated**, **Character**, **Ordinal**, **Date-and-time**,
**Integer**, **Rational**, **Scaled**, **Real**, **Complex** and
**Void**. This is a much longer list than that which most languages
designate as "primitive", sometimes because they classify datatypes
differently, sometimes because they represent some in terms of others (or
assume that the programmer will). A simple example of such a represented
datatype is **complex** as an ordered pair of real numbers, the "real" and
"imaginary" parts. However, LID eschews this form of "representation" as well
as the more obvious bit-level form. A LID datatype is just a set of values,
and using the cartesian form to identify them is only a convenience for some
purposes - think of the polar form, for example. Similarly, **Rational** is
not a directly supported datatype of any well-known language (though Forth goes
some way towards it), but its value-space is distinctive, and here again the
integer-pair relationship runs into trouble, e.g. through multiple
representations (2/4, 3/6, 34/68 etc) and special cases (1/0, etc). It is best
treated as primitive in its own right.

Most of these primitive datatypes are generic, the actual specific, usable
datatypes being derived by the use of parameters or qualifiers. **Boolean**
and **Void** are exceptions. (An earlier paper [Meek 1990] discusses the
multiplicity of ways two-valued datatypes like **Boolean** can be - and are
- handled.)

**Subtypes**

In the taxonomy subtypes are created by modifying the value-space of a
"base" datatype in various ways - specifying a range or size; selecting
values; excluding values; extending the value-space; or defining explicitly
how the value-space is constructed from that of a "base" datatype. Any
combination of these is possible too. These are fairly self-explanatory, even
though eyebrows might be raised at the idea of extending, where the subtype
ends up with a wider value space than the base datatype! However, in the
taxonomy *any* datatype can be used as the base, not just the primitive
ones, and in that context extension is a useful subtype constructor e.g. you
can make a new subtype by extending an existing one.

**Generated datatypes**

The (non-aggregate) generated datatypes in the taxonomy are pointer,
procedure, and choice datatypes. Such datatypes are produced from other
datatypes by the methods familiar from languages that include them, e.g. in the
case of a procedure datatype it is constructed from the datatypes of the
procedure parameters and of the returned result (if any). The primitive
datatype **Void** is useful here for subroutine procedure datatypes that do
not return a value. **Void** is also used in choices, where not making any
choice between alternative "proper" datatypes is an allowed option.

**Aggregate datatypes**

An aggregate datatype is one whose values are made up of a number, in general more than one, of component values, each of which is a value of another datatype. In many ways this the most complex and interesting subclass of datatypes, simply because of the number of combinations and variations that are possible.

It is important for any taxonomy to get clear what qualifies as an aggregate, what you regard as a component of it, and what it means to talk about a value of such a datatype. Here, the statement that each value of an aggregate has "in general" more than one component value does not preclude cases where the "aggregate" value has only one component, or even none at all. (As far as the LID standard is concerned, languages may allow such "degenerate" cases in their own right, regard them as equivalent to non-aggregate types, or forbid them altogether.) Next, a component value may itself be of an aggregate datatype. The difference between aggregate and non-aggregate components is immaterial when considering the properties of the composite; while "inside" the aggregate the individual components remain as single entities ("closed boxes"). Similarly, the aggregate as whole is regarded as having a single value, a closed box, whose datatype is determined by the datatypes of its components and the structure of the aggregate.

The approach adopted is to start with the most general form of aggregate datatype, capable of containing anything, and then describing additional properties or constraints used to identify various kinds of aggregate datatype that are encountered computationally. These properties and constraints are not all mutually orthogonal; they may interact with others, in various ways. The particular mix of properties used for a given aggregate datatype will depend on the envisaged computational uses of the datatype and its values. This taxonomy shows a way of building any of the commonly-found forms of aggregate, and how to construct others if needed, by appropriate mixing and matching of a relatively small number of properties. The taxonomy is in fact capable of expressing bizarre conceptual datatypes of no practical interest, including unlimited numbers of components, values not in practice representable, and so on - again sharing this property with many definitions of programming languages.

The main kinds of aggregate in the taxonomy are **Bag**, **Set**,
**Record**, **Sequence**, **Array**, and **Table**. These will be
introduced as they appear in the discussion of the various properties
aggregates can have. They are technically also datatype generators, though we
shall call them datatypes for simplicity - the components will be of one or
more base datatypes from which the aggregate datatype values are constructed.

*The most general aggregate datatype*

The most general aggregate datatype is one whose values are each
made up of any number of component values (including zero or one component
values), every such component value being of any datatype at all. This kind of
aggregate datatype is called a **Bag**. It is completely unstructured, with
no internal relationships at all. It is of limited practical interest but is
useful in the taxonomy since all other aggregate datatypes can be expressed in
terms of constraints and properties applied to it.

*Distinguishing components by component values*

Since a **Bag** is completely unstructured, it is not in
general possible to distinguish between different component values of a
**Bag** value. This is because the components may have the same datatype
and the same value of that datatype: a **Bag** value may contain duplicate
components. An example is the simple probability exercise: a bag contains
four white balls and eight blacks balls, what is the probability that the first
two balls removed from the bag will be of the same colour?

Usually, component values of aggregate values are distinguished by their
relationships with the structure of the aggregate datatype as a whole, and
relationships which exist between the components within it the aggregate as a
result of their membership of the aggregate (*rather than as values of their
own datatype)*.

A **Set**, in this taxonomy, is a **Bag** which has the constraint added
that there are no duplicate values, i.e. given two components, either their two
datatypes are different, or they have the same datatype but have two different
values of that datatype.

*Homogeneity*

Homogeneity is another constraint that can be added, either to
**Bag** or **Set**. This means that the component values are drawn
solely from one datatype, known as the base datatype. If the base datatype is
**B**, the datatypes resulting are **Bag of** **B** and **Set of**
**B** respectively.

Between the total generality of unconstrained **Bag** and **Bag of**
**B** there would seem to be a limitless number of intermediate cases where
component values may be drawn from a range of different datatypes but not all
possible datatypes. In this taxonomy many such possibilities can be
accommodated by using **Choice**.

*Size*

Returning to the most general **Bag**, another constraint that
can be placed on values is the size of the aggregate value, i.e. the allowed
number of components. In fact this consists of two constraints, one being the
lower limit on the number of components, which must exist, and the other being
the upper limit, which may or may not exist. If these two are the same, every
value of the aggregate datatype has the same size, if they are not then any
size between the upper and lower limits is possible. The size constraint on
aggregate values should not be confused with the number of different possible
such values, which depends on the number of possible combinations of possible
values of the components within their own datatypes.

The lower limit on size must be at least zero, the upper limit at least the same as the lower limit. For the purposes of the taxonomy, it is assumed that both limits are actually achieved; if no upper limit exists, this means that values of the aggregate datatype of any arbitrary size are valid. Where no upper limit exists, this means that the aggregate datatype has an unlimited (infinite) set of possible values. In any practical case of course, only a finite (though perhaps unspecified) number actual aggregate values, each of a finite (though perhaps not predetermined) size will be used.

More complicated situations are possible where the size of each aggregate value
may be any one of two or more possibilities not expressible in terms of lower
and upper bounds of a contiguous range: e.g. the size might be defined as a
multiple of three (3 components, 6 components, ....). Again, the taxonomy can
be extended to cover this by use of **Choice**.

It is at this point that it becomes evident that there may be interaction
between the various constraints. For example, if **Set of** **B** is
defined where the base datatype **B** has a finite number **n** of
distinct values, then because of the constraint "every component value is
distinct" in the definition of **Set**, **n** is the largest number of
components that any value of datatype **Set of** **B** can possibly
have.

In general from now on, additional properties or constraints used to identify various kinds of aggregate datatype may interact with others, in various ways.

*The aggregate datatype and its base datatype*

It is very important in this taxonomy to distinguish between
properties of the aggregate datatype and those that the base datatype has in
its own right. In general, any property of the base datatype does not induce
the same or a similar property in any aggregate datatype whose values are
composed of its values, except perhaps where it interacts with a property of
that aggregate datatype. For example, the finiteness of the base datatype in
the above example does induce a constraint on the upper limit of size of values
of the aggregate datatype, but only through interaction with another constraint
on the aggregate datatype. It does not induce a similar constraint for
**Bag**. Similarly, a base datatype may have a specific ordering, but this
does not induce a similar ordering of the values of the aggregate, or of the
components within any aggregate value. In fact, as will be seen, either or
both such orderings at the aggregate level will normally be different from any
ordering of the base datatype values, and indeed can validly exist whether a
base datatype ordering is defined or not.

*Distinguishing components by tagging*

It is possible to distinguish the different components by
"tagging" each one; for example, a language may provide a means of defining
names (identifiers) for the various components. The term "tag" is used here
simply to avoid any implication that the distinguishing syntax is necessarily
an alphanumeric identifier - it could, for example, be a numerical label. The
essence of tagging is that the tags themselves are not in themselves members of
a datatype, and have no significance except as a means of referring to the
aggregate component concerned. In particular, they do not imply any ordering
of the components within the aggregate. (For example, numerical labels *used
as tags* are not members of an arithmetic datatype, and have no arithmetic
properties.) However, the combination of a tag with a value of the relevant
aggregate datatype will of course have a datatype, that of the component
selected. In this sense the tag *does* have an associated datatype, that
of the aggregate component it tags, but on its own it is not a "value" of
anything. Of course, the datatype of any given tagged component can be a
**Choice** datatype, of any required degree of complexity.

Theoretically, tagging can be partial, i.e. only some of the components are given tags. This taxonomy could be extended to include it, but for current purposes the complications this would entail are not justified by the benefits.

*Record datatypes*

A **Record** datatype in this taxonomy can be described as a
**Bag** in which each component is tagged. However, tagging imposes so many
constraints on what is in the **Bag**, that the unstructured nature of the
unconstrained **Bag** only remains through the absence of homogeneity (see
above) and of ordering (see below). In this taxonomy, tagging determines the
number of components (i.e. it fixes the size of the aggregate), and it fixes
the datatypes of them. However, as noted earlier, flexibility of component
datatype can be achieved through using **Choice** datatypes for them, and
similarly **Choice of Record** can be used to achieve size variability, and
other alternatives such as "variant records" which may be needed whether or not
the number of components is variable.

*Ordered datatypes*

In **Bag**s, **Set**s and **Record**s, the components of
the aggregate are *unordered*. That means that, given any pair of
components, it is meaningless to ask whether either comes before (precedes) or
after (succeeds) the other. If ordering is a property which is added to the
aggregate, this means that an ordering relationship exists between the various
components. This can be done simply by giving mutual ordering properties
between the components (e.g. suitable operations on them); by giving them a
position (in some sense) in the overall aggregate structure, which induces an
ordering; or both. Any precedence relationship between components is
independent of any precedence relationship between the *values* of those
components. For example if *C1* and *C2* are components of an
aggregate and *C1* comes before *C2* in the aggregate, then this says
nothing about ordering of the values. If the aggregate is inhomogeneous,
*C1* and *C2* may have different datatypes. If it is homogeneous,
*C1* and *C2* may be members of a datatype which is unordered. If
the datatype of *C1* and *C2* is an ordered datatype, the value of
*C1* need not precede the value of *C2* by the ordering rules of that
datatype. Ordering of components within aggregates does not result in an
ordering of values of the aggregate datatype.

In this taxonomy, the values of any datatype either are totally ordered, or are unordered. Theoretically datatypes can exist which are partially ordered. Extending the taxonomy to include such a concept is possible but would add considerable complication with little apparent benefit. For the purposes of this taxonomy it is unnecessary to pursue this further.

(Few if any actual programming languages seem to support partially ordered datatypes, and in practical applications needing it, often it is adequate to use total ordering but then either ignore ordering when not required, or check that it is meaningful in a particular case.)

In this taxonomy, **Sequence** is an aggregate with a strict and unique
ordering, but whose components in general are not distinguishable from one
another in any other way than by this ordering. By "strict and unique" is
meant that every component (except the first) has one and only one immediate
predecessor, and (except the last) one and only one immediate successor. A
**Sequence** can be homogeneous (**Sequence of** **B**) or not, and
its size may be fixed or variable in any of the ways discussed earlier.

Starting from the first component and taking successor after successor until
the last component is reached, each component can be related to the
**Sequence** as a whole, in terms of "distance" from one end of the
**Sequence** or the other. This does not imply that any given component can
be accessed in any other way than by systematic searching for it through the
ordering. There may not even be any direct means of identifying which is
"first", though commonly in languages this can be indicated through the lexical
ordering of the definition of aggregate values. Notionally components may be
distinguished (keyed or indexed, see below) by implicit association with
ordinal values *first*, *second*, *third*, ..., but in this
taxonomy a **Sequence** does not have a general such method of picking out
any individual component. Ways may be provided (e.g. special operations on the
aggregate as a whole) of finding (say) the first or the last, but not for all.
Operations to find the (immediate) successor or predecessor are characterising
of **Sequence** in this taxonomy, and those do apply to all components. In
this taxonomy, therefore, **Sequence** is distinct from **Vector**, which
has a similar structure but is indexed, as shown below.

The taxonomy can be extended to include cyclic datatypes (where the choice of "first" component is arbitrary and the successor of the "last" is the "first"), and cases with more complicated topology (directed graphs or digraphs). Here too, for current purposes the complications this would entail are not justified by the benefits. If it is really needed in a particular case, it can be done at the operations level, but suitably modifying those for obtaining the predecessor or successor of a component.

*Recursive aggregates*

A special situation arises when a datatype is defined
recursively, i.e. where the base datatype of the components is a **Choice**
which includes, as one of the alternatives, the aggregate datatype itself.
(There has to be a **Choice**, to enable the recursion to terminate.) While
in principle this can apply generally, in this taxonomy only one case appears
explicitly, namely **Tree** identified as a recursively defined
**Sequence**, i.e. **Tree of** **B** is **Sequence of (Choice of
(B,Tree of B)**). In some programming languages this kind of aggregate is
called a list, but the word "list" is avoided here it is sometimes taken to
mean what in this taxonomy is called a **Sequence**.

**Sequence** can be regarded as a "flat" form of **Tree**, where the
recursive property is not used; this is clear from the definition of
**Tree**. In fact, assuming other constraints (such as size constraints)
are the same for both, in general the values of the datatype **Tree of**
**B** will include all of the values of the datatype **Sequence of**
**B**.

*Distinguishing components by keying*

Tagging was introduced earlier as a means of identifying each individual component of an aggregate, and it was noted that in this taxonomy the tags are not values of a datatype, and have no significance except as names for the components they reference.

Keying is similar to tagging except that keys are values of a datatype; i.e., there is a one-to-one correspondence between values of the key datatype and the components of the aggregate datatype. The key datatype may or may not be ordered, and this ordering may or may not be used in relation to the keying; that is, though the datatype the keys are taken from is an ordered datatype, this property need not be used by the aggregate datatype being keyed.

Keying may either be *internal* or *external* to the aggregate
datatype. If the keying is internal, this means that the values of the keys
appear themselves explicitly in the aggregate as components in their own right.
In general, this results in the aggregate datatype having more than one
dimension, so further discussion of internal keying will be deferred until the
discussion of dimensionality.

If keying is external, the key values are used more like tags but with the added feature that they have properties of their own. In particular, if the key datatype is ordered, it induces an equivalent ordering on the components of the aggregate. In the case of internal keying, any ordering of the key datatype will not in general induce an ordering of the aggregate components, since that will depend on where the keys appear as components themselves.

In this taxonomy, a keyed aggregate is homogeneous. The base datatype, as
usual, can be made a **Choice**, for greater flexibility. As with tagging,
keying could be partial, but this taxonomy is not extended to include that for
the same reasons as before.

*Distinguishing components by indexing*

Indexing is a special case of keying where the values of the key datatype are an uninterrupted range of successive values from a base datatype from which the keys are drawn. The number and values of the keys are defined in terms of the lower and upper bounds of the range. Indexing is commonly used in languages since it can provide a bridge between mathematical concepts such as indexed variables, which are useful in many applications, and an efficient representational method using contiguous storage locations and simple means of calculating the storage position of an individual indexed component.

A **Vector** is an aggregate with one index, where indexing is understood in
the sense described here. Each value of the index identifies a unique
component of the aggregate datatype, and the ordering of the index datatype
induces an ordering on those components. A **Vector** becomes a
**Sequence** if the index datatype values identifying the components are
discarded, but the ordering remains, though to be able to extract individual
components some means other than indexing of identifying individual components
will be needed.

Some languages merge, or regard as interchangeable, the concepts of
**Sequence** and **Vector**. Some treat all **Vector**s of the same
length as indistinguishable, and some place constraints on the index datatype
and/or on the bounds. Examples of constraints are restricting the index
datatype to **integer**, and fixing the lower bound.

Some languages allow (or assume) that **Vector**s (and/or **Array**s, see
below) are of variable bounds or size. In this taxonomy, this can be allowed
for by suitable use of **Choice**. Variable **Vector**s could equally be
provided by extending the taxonomy to allow either or both of the actual bounds
of any aggregate value to be one of a range. It could also be done by
"padding" shorter aggregate values with meaningless values for the "missing"
components, or by accompanying each aggregate value by other values identifying
the meaningful values, but in this taxonomy these are regarded as
representational matters.

As with tagging and keying, indexing could be partial, but this taxonomy is not extended to include that for the same reasons as before.

To summarise the differences between tagging, keying and indexing, tags
individually identify components but are not values of a datatype, while keys
do the same but are values of a datatype. Keys may or may not be ordered, and
can be provided externally or internally. Indexes are external keys forming a
consecutive range from an ordered datatype. Partial use (some but not all) has
already been referred to, and the use of a mixture (e.g. some components are
indexed, the rest tagged), though theoretically possible, is left out of this
taxonomy for similar reasons. However, occasion may arise where hybrids occur,
for example aggregates which are both keyed and tagged, so components can be
identified in two different ways. That the same aggregate value can be used
either as a **Record** or as a **Vector** is not in itself a problem, but
in general it will be necessary to determine which of the two is required for
an external mapping.

*Aggregates with more than one dimension*

The discussion so far has been of aggregates either with no
structure or where the structure is essentially one-dimensional. This is a
consequence of the property of ordering. A Tree might be regarded as an
exception, but in this taxonomy it is regarded as one-dimensional even if some
components have substructure. Indeed, since the components which are
themselves **Tree**s can be regarded as one-dimensional in the same way, the
substructures can be "unpacked" recursively just as they are built recursively,
until no substructures remain, in which case a **Sequence** of base datatype
components has been produced. This process, sometimes called
*flattening*, is similar to one that can be used in list processing as a
simple form of searching, though in that case the substructures are "visited"
but not removed. This is often displayed lexically by use of parentheses,
where e.g. *(a, (b, c), (d, (e, f), (g, h)))* is a **Tree** and *(a,
b, c, d, e, f, g, h)*, with the parentheses removed, is the flattened
version. Note that this "unpacking" of the **Tree** components of the
original **Tree** releases their own components to the parent aggregate
value, but does not destroy the ordering of them.

In this taxonomy, an aggregate datatype is multidimensional if more than one piece of information is needed to identify each component. For example, one might need two indexes, or two keys, or one key and then an ordering, or one key and one index. Each of these would be two-dimensional. An aggregate needing three indexes, or two indexes and a key, would be three-dimensional, and so on.

Multidimensionality can either be *inherent* or *induced*.
Multidimensionality is *inherent* if the aggregate datatype is defined
directly in terms of using more than one piece of information to identify each
component. The most common example of inherent multidimensionality is the
**Array** datatype. **Array**
(*lb1*:*ub1*,*lb2*:*ub2*) **of** **B**, where
**B** is the base datatype and *lb1*, *ub1*, *lb2*, and
*ub2* are values of an index datatype, is a two-dimensional aggregate
datatype, and two index values, one for each dimension; this can readily be
extended to any required number of dimensions. The index datatype is the same
for all dimensions. Aggregate datatypes are possible are defined similarly but
do not have this constraint, but in this taxonomy it is not called an
**Array**. A **Vector** is identical to an **Array** with only one
dimension.

With inherent multidimensionality, *no ordering of dimensions, and hence no
general ordering of components*, though the components along each dimension,
when the indexes for the other dimensions are fixed, *are* ordered
(induced, as usual, by the ordering of the index values). The ordering of
definition of the bounds as shown above is purely lexical, and the use of
*1* and *2* etc in the names used for the bounds is purely a
linguistic convenience without any implied ordering.

Multidimensionality is *induced* by defining an aggregate datatype as an
aggregate of components which are themselves aggregates, but which are then
unpacked, so that its own components are thereafter regarded as components of
the larger datatype. Here, *unpacked* is used in the sense that, though
the outermost structure boundaries have been removed, the "exposed" components
retain any properties they had as part of that (component) aggregate, and make
them available in the overall aggregate. For example, an unpacked
**Record** will retain its keys, a **Vector** will retain its indexing,
and an ordered aggregate will retain the ordering of the components that came
from it. Various kinds of two-dimensional and higher dimensional "tabular"
datatypes can be produced in this way.

We have already used the concept of unpacking in relation to flattening of a
**Tree**, where it can be noted that the ordering of components within
subtrees is retained on unpacking, hence making the flattened **Tree**
ordered and hence a **Sequence**.

Note that the definitions make it possible to produce a multidimensional
aggregate with the same components either by direct definition (inherent) or by
unpacking aggregates of aggregates. In this taxonomy, the two are kept
distinct, and in particular properties like ordering may be different between
the two. The definition of inherent multidimensionality as not implying an
ordering of dimension, whereas unpacking of aggregates of (ordered) aggregates
preserves the ordering. Thus, for example, a **Vector** of **Vector**s
is one-dimensional before unpacking, and two-dimensional and ordered after
unpacking, whereas the equivalent two-dimensional **Array** is not (totally)
ordered. These distinctions are maintained in this taxonomy because it is
logically possible and because languages exist that make such distinctions.
However, some languages may regard them as equivalent, or even *define*
the multidimensional aggregates as being built up of successive aggregating and
unpacking of one-dimensional aggregates.

The definition of unpacking has further consequences when higher dimensionality
than two is involved, with various intermediate stages. For example, a
three-dimensional **Array** will have corresponding structures of the kind
**Vector of Vector of Vector**, **Vector** of two-dimensional
**Array**, and two-dimensional **Array of Vector**. Each of these will
have their own ordering rules.

Finally, a **Table** is a special form of multidimensional aggregate with
internal keying. It is easiest to visualise in its two-dimensional form, with
one or more columns (say) holding keys to identify particular rows. The
complications possible in building **Table**s, using similar elaborations to
those already discussed, can be left for the reader to imagine!

*Subaggregates*

As noted, it is possible to obtain an aggregate of lower
dimensionality by fixing one or more components in other dimensions. For
example a two-dimensional **Array** can yield "row" **Vector**s and
"column" **Vector**s, depending on which of the two indices is fixed.

Other methods of producing subaggregates are selecting subsets or subranges of
keys or indexes, which can apply equally to one-dimensional aggregates. Mostly
the various possibilities will be clear from the nature of the original
datatype. Note that the properties of the subaggregate may not necessarily be
the same as those of the original. Clearly a reduction of dimensionality
changes that property, but it can change others too, for example a
**Vector** derived from an **Array** of higher dimensions will have the
induced ordering of the indexing whereas the original was not (totally)
ordered. Properties may also be lost, e.g. a subaggregate produced from a
**Vector** by selecting specific and non-consecutive values of its index
will no longer be a **Vector**, and the residual index values effectively
form a kind of external keying.

The properties that are preserved, changed, gained or lost can be deduced from consideration of the properties of the original, including index or key datatypes as appropriate, and the method of obtaining the subaggregates.

**Derived datatypes and datatype generators**

The taxonomy allows for **new** datatypes to be produced from
existing ones (copies, or "clones") and for further datatypes and datatype
generators to be derived from the basic primitive and generated ones -
**Tree** (mentioned above) is one example, and **CharacterString** and
**BitString** are others. Non-aggregate derived datatypes include
**Bit**, **Modulo**, and **TimeInterval**. Space precludes detailed
discussion of them, but details are in the standard.

**Conclusion**

The taxonomy has been introduced here to show the kind of way in which the LID and other LI standards can distil the essence of commonality of concept that underlies the often confusing and conflicting versions that exist in programming languages and elsewhere [Meek 1994]. Its value, and the value of the forthcoming standard, if properly exploited, will be to help bridge the gulf between different, incompatible approaches between languages and systems, that have so bedevilled users over the years.

**References**

ISO/IEC DIS 11404, *Language Independent Datatypes*, 1994

B.L. Meek, *Two-valued datatypes*, *Sigplan Notices of the ACM*, Vol
25 No 8, pp 75-79, August 1990

B.L. Meek, *What is a procedure call?*, DECUS Symposium 1993, submission
to *Sigplan Notices of the ACM* in preparation [Note added in 1997: published in 1995,
click on link]

B.L. Meek, *Programming languages - towards greater commonality*,
*Sigplan Notices of the ACM*, April 1994 (based on DECUS Symposium 1992
presentation)

Copyright © 1994 by the Association for Computing Machinery, Inc. Please click here for full ACM copyright notice.

WWW page prepared by Brian Meek,