A taxonomy of datatypes

This paper was published in ACM Sigplan Notices, Vol 29 No 9, September 1994, pp 159-167.

This WWW version was prepared by the author from the text submitted for publication, and appears by permission of the copyright holder. Copyright © 1994 by the Association for Computing Machinery, Inc. Please click here for full ACM copyright notice.

A taxonomy of datatypes

Brian Meek
King's College London
b.meek@hazel.cc.kcl.ac.uk

[Note added in 1996: now brian.meek@kcl.ac.uk]

Introduction

This is the second article based on language-independent standardisation work being carried out by international standards working group ISO/IEC JTC1/SC22 WG11 (Language Bindings). The first, Programming languages - towards greater commonality [Meek 1994] was an overview, which briefly described the various and the relationships between them. This article goes into one of these projects in greater detail. Both articles are based on presentations to the DECUS (UK and Ireland) symposia, this one given in May 1994. A third article, What is a procedure call?, based on a similar presentation in 1993 [Meek 1993], is projected.

What exactly is a datatype?

It is surprisingly difficult to get people to agree on what constitutes a "datatype". Most have a clear idea of what a datatype is - an idea, anyway - but as with many things in this field - like what a procedure call is, discussed in the projected third paper - they are often not the same idea.

The datatype concept crops up in a wide range of contexts, data being ubiquitous, and different contexts - databases, communications, many different programming languages - have their own culture and conventions. Such a variety of background is almost certain to lead to different perceptions. The taxonomy described here is based, though informally and fairly loosely, on that used in Draft International Standard 11404, Language Independent Datatypes (LID); the "language independent" (LI) indicates the attempt to avoid the presuppositions usually present in the culture and conventions of any particular language - or indeed any other use of the concept. Interpretations and details are the author's own, and should not be assumed to be present in the standard itself.

In constructing any taxonomy it is necessary to decide not only what a datatype is but what it is not. Despite the apparent derivation of the word, it is not simply "a type of data", but a concept in its own right. (This is the main reason for removing the space in "data type".) Things that have a datatype may not themselves be an item of data (at least of that datatype; it may be a value of some other datatype). An integer variable is not itself an integer. If that distinction is not clear (and some programmers find it difficult to see), an X channel (for I/O) in occam is not an X value - it is something that transports X values and only X values. Something "of datatype X" has properties (depending on its own nature) that in some way relate to the type of data that X values are. This is fine distinction, but it emphatically is not nitpicking: it is crucial to any taxonomy if it is to have any hope of encompassing generically the many different views that there are.

The next thing is to abandon any representational view of a datatype. Data is an abstraction, capable of representation in many forms, and "datatype" is a more abstract still; it is disastrous to link the concept to particular representational forms. This applies as much to second-order representational assumptions (e.g. a complex value as an ordered pair of real values, or an array as a contiguous block of values) as it does to first-order assumptions such as bit patterns. Such assumptions may be useful, even essential, in particular contexts, but a generic taxonomy cannot afford such restrictions.

Other things that have to go from the definition of datatype are the operations that can be performed on the data values. Many languages people are startled by this, even horrified. To them, the values and the operations on them are inseparable. In that data in a program is of limited use if you cannot do operations, this is an understandable attitude, but it overlooks two things.

One is that you can distinguish between static and dynamic aspects - programming languages separate static data declaration and dynamic procedural aspects, implicitly if not explicitly. The point of this is that some functionality related to languages is purely static as far as the data is concerned. For example, a data transmission channel operates with the data (dynamic) but not on the data, which is unchanged (if the channel is working correctly!). Operations such as addition and subtraction are an irrelevance here, and can even be a nuisance, for example when talking of conformity to a standard.

The other overlooked aspect is that it is not always obvious what the operations should be. Languages do not always agree, and some with "derived datatype" facilities allow default operations to be suppressed or replaced, and others added. An example is that, for a new integer called year, add and multiply are suppressed (meaningless) and subtraction is replaced; the subtraction is normal but the datatype of the result is not year but a datatype representing number of years.

I am not saying you could not include operations in a taxonomy, but to do it properly needs the full machinery of object-orientation. In standards work you need to set yourselves a specific achievable target, not make an open-ended commitment, so DIS 11404 does not attempt it. A taxonomy based on values alone is infinite, but manageably infinite, allow operations, and it becomes hard to see how or where you would ever stop, quite apart from (in a standard) specifying testable conformity requirements.

The basis of the taxonomy

The taxonomy follows the common practice of starting with a number of primitive datatypes and then using these to construct others. There are three main kinds of constructed datatypes: subtypes, generated datatypes, and aggregates. (In fact aggregates are technically also generated datatypes, but important enough to deserve separate classification.)

Primitive datatypes

Primitive datatypes are datatypes whose values are regarded fundamental - not subject to any reduction. They just "are". Those values are what some languages have called "atomic", or "plain values". Many primitive datatypes are also generic, in the sense that they have an unlimited number of values, and hence the datatypes often used in practice are confined to a finite subset of them. The reason that they are used in the taxonomy, rather than "actual" achievable datatypes, is threefold: it is a convenient way to identify a class of datatypes which is infinite in extent; language definitions commonly use them, meaning that they simplify the binding between the LI datatypes and the ones used by specific languages; and it allows for the possibility of actually supporting them if a language is designed to do so (in the sense, for example, that an integer of any arbitrary magnitude can be accommodated, even if in practice at some point you will run out of storage or time).

It is important here to distinguish "a language" from any given implementation of that language. A language definition will usually specify integer as a datatype, but leave the actual range of integers supported to the implementation. The most they are likely to specify is that the range must be of continuous values and cover at least a specific minimal range; some do not even go that far. However, for the LI datatype taxonomy, each one of all the possible subsets of the integer domain is a distinct datatype. If you are specifying a LI service in terms of LI datatypes, you cannot just bind integer to integer and leave it at that - you won't always be able to be certain that a service will work. For some services, a mismatch, or certain kinds of mismatch, will not matter, but for some (or for some uses of the service) it can matter. Programmers have long experienced such problems when transferring applications, even from one implementation of a language to another - and even when both conform to the language standard, if the standard has weak conformity requirements in this area.

The primitive datatypes of the taxonomy are Boolean, State, Enumerated, Character, Ordinal, Date-and-time, Integer, Rational, Scaled, Real, Complex and Void. This is a much longer list than that which most languages designate as "primitive", sometimes because they classify datatypes differently, sometimes because they represent some in terms of others (or assume that the programmer will). A simple example of such a represented datatype is complex as an ordered pair of real numbers, the "real" and "imaginary" parts. However, LID eschews this form of "representation" as well as the more obvious bit-level form. A LID datatype is just a set of values, and using the cartesian form to identify them is only a convenience for some purposes - think of the polar form, for example. Similarly, Rational is not a directly supported datatype of any well-known language (though Forth goes some way towards it), but its value-space is distinctive, and here again the integer-pair relationship runs into trouble, e.g. through multiple representations (2/4, 3/6, 34/68 etc) and special cases (1/0, etc). It is best treated as primitive in its own right.

Most of these primitive datatypes are generic, the actual specific, usable datatypes being derived by the use of parameters or qualifiers. Boolean and Void are exceptions. (An earlier paper [Meek 1990] discusses the multiplicity of ways two-valued datatypes like Boolean can be - and are - handled.)

Subtypes

In the taxonomy subtypes are created by modifying the value-space of a "base" datatype in various ways - specifying a range or size; selecting values; excluding values; extending the value-space; or defining explicitly how the value-space is constructed from that of a "base" datatype. Any combination of these is possible too. These are fairly self-explanatory, even though eyebrows might be raised at the idea of extending, where the subtype ends up with a wider value space than the base datatype! However, in the taxonomy any datatype can be used as the base, not just the primitive ones, and in that context extension is a useful subtype constructor e.g. you can make a new subtype by extending an existing one.

Generated datatypes

The (non-aggregate) generated datatypes in the taxonomy are pointer, procedure, and choice datatypes. Such datatypes are produced from other datatypes by the methods familiar from languages that include them, e.g. in the case of a procedure datatype it is constructed from the datatypes of the procedure parameters and of the returned result (if any). The primitive datatype Void is useful here for subroutine procedure datatypes that do not return a value. Void is also used in choices, where not making any choice between alternative "proper" datatypes is an allowed option.

Aggregate datatypes

An aggregate datatype is one whose values are made up of a number, in general more than one, of component values, each of which is a value of another datatype. In many ways this the most complex and interesting subclass of datatypes, simply because of the number of combinations and variations that are possible.

It is important for any taxonomy to get clear what qualifies as an aggregate, what you regard as a component of it, and what it means to talk about a value of such a datatype. Here, the statement that each value of an aggregate has "in general" more than one component value does not preclude cases where the "aggregate" value has only one component, or even none at all. (As far as the LID standard is concerned, languages may allow such "degenerate" cases in their own right, regard them as equivalent to non-aggregate types, or forbid them altogether.) Next, a component value may itself be of an aggregate datatype. The difference between aggregate and non-aggregate components is immaterial when considering the properties of the composite; while "inside" the aggregate the individual components remain as single entities ("closed boxes"). Similarly, the aggregate as whole is regarded as having a single value, a closed box, whose datatype is determined by the datatypes of its components and the structure of the aggregate.

The approach adopted is to start with the most general form of aggregate datatype, capable of containing anything, and then describing additional properties or constraints used to identify various kinds of aggregate datatype that are encountered computationally. These properties and constraints are not all mutually orthogonal; they may interact with others, in various ways. The particular mix of properties used for a given aggregate datatype will depend on the envisaged computational uses of the datatype and its values. This taxonomy shows a way of building any of the commonly-found forms of aggregate, and how to construct others if needed, by appropriate mixing and matching of a relatively small number of properties. The taxonomy is in fact capable of expressing bizarre conceptual datatypes of no practical interest, including unlimited numbers of components, values not in practice representable, and so on - again sharing this property with many definitions of programming languages.

The main kinds of aggregate in the taxonomy are Bag, Set, Record, Sequence, Array, and Table. These will be introduced as they appear in the discussion of the various properties aggregates can have. They are technically also datatype generators, though we shall call them datatypes for simplicity - the components will be of one or more base datatypes from which the aggregate datatype values are constructed.

The most general aggregate datatype

The most general aggregate datatype is one whose values are each made up of any number of component values (including zero or one component values), every such component value being of any datatype at all. This kind of aggregate datatype is called a Bag. It is completely unstructured, with no internal relationships at all. It is of limited practical interest but is useful in the taxonomy since all other aggregate datatypes can be expressed in terms of constraints and properties applied to it.

Distinguishing components by component values

Since a Bag is completely unstructured, it is not in general possible to distinguish between different component values of a Bag value. This is because the components may have the same datatype and the same value of that datatype: a Bag value may contain duplicate components. An example is the simple probability exercise: a bag contains four white balls and eight blacks balls, what is the probability that the first two balls removed from the bag will be of the same colour?

Usually, component values of aggregate values are distinguished by their relationships with the structure of the aggregate datatype as a whole, and relationships which exist between the components within it the aggregate as a result of their membership of the aggregate (rather than as values of their own datatype).

A Set, in this taxonomy, is a Bag which has the constraint added that there are no duplicate values, i.e. given two components, either their two datatypes are different, or they have the same datatype but have two different values of that datatype.

Homogeneity

Homogeneity is another constraint that can be added, either to Bag or Set. This means that the component values are drawn solely from one datatype, known as the base datatype. If the base datatype is B, the datatypes resulting are Bag of B and Set of B respectively.

Between the total generality of unconstrained Bag and Bag of B there would seem to be a limitless number of intermediate cases where component values may be drawn from a range of different datatypes but not all possible datatypes. In this taxonomy many such possibilities can be accommodated by using Choice.

Size

Returning to the most general Bag, another constraint that can be placed on values is the size of the aggregate value, i.e. the allowed number of components. In fact this consists of two constraints, one being the lower limit on the number of components, which must exist, and the other being the upper limit, which may or may not exist. If these two are the same, every value of the aggregate datatype has the same size, if they are not then any size between the upper and lower limits is possible. The size constraint on aggregate values should not be confused with the number of different possible such values, which depends on the number of possible combinations of possible values of the components within their own datatypes.

The lower limit on size must be at least zero, the upper limit at least the same as the lower limit. For the purposes of the taxonomy, it is assumed that both limits are actually achieved; if no upper limit exists, this means that values of the aggregate datatype of any arbitrary size are valid. Where no upper limit exists, this means that the aggregate datatype has an unlimited (infinite) set of possible values. In any practical case of course, only a finite (though perhaps unspecified) number actual aggregate values, each of a finite (though perhaps not predetermined) size will be used.

More complicated situations are possible where the size of each aggregate value may be any one of two or more possibilities not expressible in terms of lower and upper bounds of a contiguous range: e.g. the size might be defined as a multiple of three (3 components, 6 components, ....). Again, the taxonomy can be extended to cover this by use of Choice.

It is at this point that it becomes evident that there may be interaction between the various constraints. For example, if Set of B is defined where the base datatype B has a finite number n of distinct values, then because of the constraint "every component value is distinct" in the definition of Set, n is the largest number of components that any value of datatype Set of B can possibly have.

In general from now on, additional properties or constraints used to identify various kinds of aggregate datatype may interact with others, in various ways.

The aggregate datatype and its base datatype

It is very important in this taxonomy to distinguish between properties of the aggregate datatype and those that the base datatype has in its own right. In general, any property of the base datatype does not induce the same or a similar property in any aggregate datatype whose values are composed of its values, except perhaps where it interacts with a property of that aggregate datatype. For example, the finiteness of the base datatype in the above example does induce a constraint on the upper limit of size of values of the aggregate datatype, but only through interaction with another constraint on the aggregate datatype. It does not induce a similar constraint for Bag. Similarly, a base datatype may have a specific ordering, but this does not induce a similar ordering of the values of the aggregate, or of the components within any aggregate value. In fact, as will be seen, either or both such orderings at the aggregate level will normally be different from any ordering of the base datatype values, and indeed can validly exist whether a base datatype ordering is defined or not.

Distinguishing components by tagging

It is possible to distinguish the different components by "tagging" each one; for example, a language may provide a means of defining names (identifiers) for the various components. The term "tag" is used here simply to avoid any implication that the distinguishing syntax is necessarily an alphanumeric identifier - it could, for example, be a numerical label. The essence of tagging is that the tags themselves are not in themselves members of a datatype, and have no significance except as a means of referring to the aggregate component concerned. In particular, they do not imply any ordering of the components within the aggregate. (For example, numerical labels used as tags are not members of an arithmetic datatype, and have no arithmetic properties.) However, the combination of a tag with a value of the relevant aggregate datatype will of course have a datatype, that of the component selected. In this sense the tag does have an associated datatype, that of the aggregate component it tags, but on its own it is not a "value" of anything. Of course, the datatype of any given tagged component can be a Choice datatype, of any required degree of complexity.

Theoretically, tagging can be partial, i.e. only some of the components are given tags. This taxonomy could be extended to include it, but for current purposes the complications this would entail are not justified by the benefits.

Record datatypes

A Record datatype in this taxonomy can be described as a Bag in which each component is tagged. However, tagging imposes so many constraints on what is in the Bag, that the unstructured nature of the unconstrained Bag only remains through the absence of homogeneity (see above) and of ordering (see below). In this taxonomy, tagging determines the number of components (i.e. it fixes the size of the aggregate), and it fixes the datatypes of them. However, as noted earlier, flexibility of component datatype can be achieved through using Choice datatypes for them, and similarly Choice of Record can be used to achieve size variability, and other alternatives such as "variant records" which may be needed whether or not the number of components is variable.

Ordered datatypes

In Bags, Sets and Records, the components of the aggregate are unordered. That means that, given any pair of components, it is meaningless to ask whether either comes before (precedes) or after (succeeds) the other. If ordering is a property which is added to the aggregate, this means that an ordering relationship exists between the various components. This can be done simply by giving mutual ordering properties between the components (e.g. suitable operations on them); by giving them a position (in some sense) in the overall aggregate structure, which induces an ordering; or both. Any precedence relationship between components is independent of any precedence relationship between the values of those components. For example if C1 and C2 are components of an aggregate and C1 comes before C2 in the aggregate, then this says nothing about ordering of the values. If the aggregate is inhomogeneous, C1 and C2 may have different datatypes. If it is homogeneous, C1 and C2 may be members of a datatype which is unordered. If the datatype of C1 and C2 is an ordered datatype, the value of C1 need not precede the value of C2 by the ordering rules of that datatype. Ordering of components within aggregates does not result in an ordering of values of the aggregate datatype.

In this taxonomy, the values of any datatype either are totally ordered, or are unordered. Theoretically datatypes can exist which are partially ordered. Extending the taxonomy to include such a concept is possible but would add considerable complication with little apparent benefit. For the purposes of this taxonomy it is unnecessary to pursue this further.

(Few if any actual programming languages seem to support partially ordered datatypes, and in practical applications needing it, often it is adequate to use total ordering but then either ignore ordering when not required, or check that it is meaningful in a particular case.)

In this taxonomy, Sequence is an aggregate with a strict and unique ordering, but whose components in general are not distinguishable from one another in any other way than by this ordering. By "strict and unique" is meant that every component (except the first) has one and only one immediate predecessor, and (except the last) one and only one immediate successor. A Sequence can be homogeneous (Sequence of B) or not, and its size may be fixed or variable in any of the ways discussed earlier.

Starting from the first component and taking successor after successor until the last component is reached, each component can be related to the Sequence as a whole, in terms of "distance" from one end of the Sequence or the other. This does not imply that any given component can be accessed in any other way than by systematic searching for it through the ordering. There may not even be any direct means of identifying which is "first", though commonly in languages this can be indicated through the lexical ordering of the definition of aggregate values. Notionally components may be distinguished (keyed or indexed, see below) by implicit association with ordinal values first, second, third, ..., but in this taxonomy a Sequence does not have a general such method of picking out any individual component. Ways may be provided (e.g. special operations on the aggregate as a whole) of finding (say) the first or the last, but not for all. Operations to find the (immediate) successor or predecessor are characterising of Sequence in this taxonomy, and those do apply to all components. In this taxonomy, therefore, Sequence is distinct from Vector, which has a similar structure but is indexed, as shown below.

The taxonomy can be extended to include cyclic datatypes (where the choice of "first" component is arbitrary and the successor of the "last" is the "first"), and cases with more complicated topology (directed graphs or digraphs). Here too, for current purposes the complications this would entail are not justified by the benefits. If it is really needed in a particular case, it can be done at the operations level, but suitably modifying those for obtaining the predecessor or successor of a component.

Recursive aggregates

A special situation arises when a datatype is defined recursively, i.e. where the base datatype of the components is a Choice which includes, as one of the alternatives, the aggregate datatype itself. (There has to be a Choice, to enable the recursion to terminate.) While in principle this can apply generally, in this taxonomy only one case appears explicitly, namely Tree identified as a recursively defined Sequence, i.e. Tree of B is Sequence of (Choice of (B,Tree of B)). In some programming languages this kind of aggregate is called a list, but the word "list" is avoided here it is sometimes taken to mean what in this taxonomy is called a Sequence.

Sequence can be regarded as a "flat" form of Tree, where the recursive property is not used; this is clear from the definition of Tree. In fact, assuming other constraints (such as size constraints) are the same for both, in general the values of the datatype Tree of B will include all of the values of the datatype Sequence of B.

Distinguishing components by keying

Tagging was introduced earlier as a means of identifying each individual component of an aggregate, and it was noted that in this taxonomy the tags are not values of a datatype, and have no significance except as names for the components they reference.

Keying is similar to tagging except that keys are values of a datatype; i.e., there is a one-to-one correspondence between values of the key datatype and the components of the aggregate datatype. The key datatype may or may not be ordered, and this ordering may or may not be used in relation to the keying; that is, though the datatype the keys are taken from is an ordered datatype, this property need not be used by the aggregate datatype being keyed.

Keying may either be internal or external to the aggregate datatype. If the keying is internal, this means that the values of the keys appear themselves explicitly in the aggregate as components in their own right. In general, this results in the aggregate datatype having more than one dimension, so further discussion of internal keying will be deferred until the discussion of dimensionality.

If keying is external, the key values are used more like tags but with the added feature that they have properties of their own. In particular, if the key datatype is ordered, it induces an equivalent ordering on the components of the aggregate. In the case of internal keying, any ordering of the key datatype will not in general induce an ordering of the aggregate components, since that will depend on where the keys appear as components themselves.

In this taxonomy, a keyed aggregate is homogeneous. The base datatype, as usual, can be made a Choice, for greater flexibility. As with tagging, keying could be partial, but this taxonomy is not extended to include that for the same reasons as before.

Distinguishing components by indexing

Indexing is a special case of keying where the values of the key datatype are an uninterrupted range of successive values from a base datatype from which the keys are drawn. The number and values of the keys are defined in terms of the lower and upper bounds of the range. Indexing is commonly used in languages since it can provide a bridge between mathematical concepts such as indexed variables, which are useful in many applications, and an efficient representational method using contiguous storage locations and simple means of calculating the storage position of an individual indexed component.

A Vector is an aggregate with one index, where indexing is understood in the sense described here. Each value of the index identifies a unique component of the aggregate datatype, and the ordering of the index datatype induces an ordering on those components. A Vector becomes a Sequence if the index datatype values identifying the components are discarded, but the ordering remains, though to be able to extract individual components some means other than indexing of identifying individual components will be needed.

Some languages merge, or regard as interchangeable, the concepts of Sequence and Vector. Some treat all Vectors of the same length as indistinguishable, and some place constraints on the index datatype and/or on the bounds. Examples of constraints are restricting the index datatype to integer, and fixing the lower bound.

Some languages allow (or assume) that Vectors (and/or Arrays, see below) are of variable bounds or size. In this taxonomy, this can be allowed for by suitable use of Choice. Variable Vectors could equally be provided by extending the taxonomy to allow either or both of the actual bounds of any aggregate value to be one of a range. It could also be done by "padding" shorter aggregate values with meaningless values for the "missing" components, or by accompanying each aggregate value by other values identifying the meaningful values, but in this taxonomy these are regarded as representational matters.

As with tagging and keying, indexing could be partial, but this taxonomy is not extended to include that for the same reasons as before.

To summarise the differences between tagging, keying and indexing, tags individually identify components but are not values of a datatype, while keys do the same but are values of a datatype. Keys may or may not be ordered, and can be provided externally or internally. Indexes are external keys forming a consecutive range from an ordered datatype. Partial use (some but not all) has already been referred to, and the use of a mixture (e.g. some components are indexed, the rest tagged), though theoretically possible, is left out of this taxonomy for similar reasons. However, occasion may arise where hybrids occur, for example aggregates which are both keyed and tagged, so components can be identified in two different ways. That the same aggregate value can be used either as a Record or as a Vector is not in itself a problem, but in general it will be necessary to determine which of the two is required for an external mapping.

Aggregates with more than one dimension

The discussion so far has been of aggregates either with no structure or where the structure is essentially one-dimensional. This is a consequence of the property of ordering. A Tree might be regarded as an exception, but in this taxonomy it is regarded as one-dimensional even if some components have substructure. Indeed, since the components which are themselves Trees can be regarded as one-dimensional in the same way, the substructures can be "unpacked" recursively just as they are built recursively, until no substructures remain, in which case a Sequence of base datatype components has been produced. This process, sometimes called flattening, is similar to one that can be used in list processing as a simple form of searching, though in that case the substructures are "visited" but not removed. This is often displayed lexically by use of parentheses, where e.g. (a, (b, c), (d, (e, f), (g, h))) is a Tree and (a, b, c, d, e, f, g, h), with the parentheses removed, is the flattened version. Note that this "unpacking" of the Tree components of the original Tree releases their own components to the parent aggregate value, but does not destroy the ordering of them.

In this taxonomy, an aggregate datatype is multidimensional if more than one piece of information is needed to identify each component. For example, one might need two indexes, or two keys, or one key and then an ordering, or one key and one index. Each of these would be two-dimensional. An aggregate needing three indexes, or two indexes and a key, would be three-dimensional, and so on.

Multidimensionality can either be inherent or induced. Multidimensionality is inherent if the aggregate datatype is defined directly in terms of using more than one piece of information to identify each component. The most common example of inherent multidimensionality is the Array datatype. Array (lb1:ub1,lb2:ub2) of B, where B is the base datatype and lb1, ub1, lb2, and ub2 are values of an index datatype, is a two-dimensional aggregate datatype, and two index values, one for each dimension; this can readily be extended to any required number of dimensions. The index datatype is the same for all dimensions. Aggregate datatypes are possible are defined similarly but do not have this constraint, but in this taxonomy it is not called an Array. A Vector is identical to an Array with only one dimension.

With inherent multidimensionality, no ordering of dimensions, and hence no general ordering of components, though the components along each dimension, when the indexes for the other dimensions are fixed, are ordered (induced, as usual, by the ordering of the index values). The ordering of definition of the bounds as shown above is purely lexical, and the use of 1 and 2 etc in the names used for the bounds is purely a linguistic convenience without any implied ordering.

Multidimensionality is induced by defining an aggregate datatype as an aggregate of components which are themselves aggregates, but which are then unpacked, so that its own components are thereafter regarded as components of the larger datatype. Here, unpacked is used in the sense that, though the outermost structure boundaries have been removed, the "exposed" components retain any properties they had as part of that (component) aggregate, and make them available in the overall aggregate. For example, an unpacked Record will retain its keys, a Vector will retain its indexing, and an ordered aggregate will retain the ordering of the components that came from it. Various kinds of two-dimensional and higher dimensional "tabular" datatypes can be produced in this way.

We have already used the concept of unpacking in relation to flattening of a Tree, where it can be noted that the ordering of components within subtrees is retained on unpacking, hence making the flattened Tree ordered and hence a Sequence.

Note that the definitions make it possible to produce a multidimensional aggregate with the same components either by direct definition (inherent) or by unpacking aggregates of aggregates. In this taxonomy, the two are kept distinct, and in particular properties like ordering may be different between the two. The definition of inherent multidimensionality as not implying an ordering of dimension, whereas unpacking of aggregates of (ordered) aggregates preserves the ordering. Thus, for example, a Vector of Vectors is one-dimensional before unpacking, and two-dimensional and ordered after unpacking, whereas the equivalent two-dimensional Array is not (totally) ordered. These distinctions are maintained in this taxonomy because it is logically possible and because languages exist that make such distinctions. However, some languages may regard them as equivalent, or even define the multidimensional aggregates as being built up of successive aggregating and unpacking of one-dimensional aggregates.

The definition of unpacking has further consequences when higher dimensionality than two is involved, with various intermediate stages. For example, a three-dimensional Array will have corresponding structures of the kind Vector of Vector of Vector, Vector of two-dimensional Array, and two-dimensional Array of Vector. Each of these will have their own ordering rules.

Finally, a Table is a special form of multidimensional aggregate with internal keying. It is easiest to visualise in its two-dimensional form, with one or more columns (say) holding keys to identify particular rows. The complications possible in building Tables, using similar elaborations to those already discussed, can be left for the reader to imagine!

Subaggregates

As noted, it is possible to obtain an aggregate of lower dimensionality by fixing one or more components in other dimensions. For example a two-dimensional Array can yield "row" Vectors and "column" Vectors, depending on which of the two indices is fixed.

Other methods of producing subaggregates are selecting subsets or subranges of keys or indexes, which can apply equally to one-dimensional aggregates. Mostly the various possibilities will be clear from the nature of the original datatype. Note that the properties of the subaggregate may not necessarily be the same as those of the original. Clearly a reduction of dimensionality changes that property, but it can change others too, for example a Vector derived from an Array of higher dimensions will have the induced ordering of the indexing whereas the original was not (totally) ordered. Properties may also be lost, e.g. a subaggregate produced from a Vector by selecting specific and non-consecutive values of its index will no longer be a Vector, and the residual index values effectively form a kind of external keying.

The properties that are preserved, changed, gained or lost can be deduced from consideration of the properties of the original, including index or key datatypes as appropriate, and the method of obtaining the subaggregates.

Derived datatypes and datatype generators

The taxonomy allows for new datatypes to be produced from existing ones (copies, or "clones") and for further datatypes and datatype generators to be derived from the basic primitive and generated ones - Tree (mentioned above) is one example, and CharacterString and BitString are others. Non-aggregate derived datatypes include Bit, Modulo, and TimeInterval. Space precludes detailed discussion of them, but details are in the standard.

Conclusion

The taxonomy has been introduced here to show the kind of way in which the LID and other LI standards can distil the essence of commonality of concept that underlies the often confusing and conflicting versions that exist in programming languages and elsewhere [Meek 1994]. Its value, and the value of the forthcoming standard, if properly exploited, will be to help bridge the gulf between different, incompatible approaches between languages and systems, that have so bedevilled users over the years.

References

ISO/IEC DIS 11404, Language Independent Datatypes, 1994

B.L. Meek, Two-valued datatypes, Sigplan Notices of the ACM, Vol 25 No 8, pp 75-79, August 1990

B.L. Meek, What is a procedure call?, DECUS Symposium 1993, submission to Sigplan Notices of the ACM in preparation [Note added in 1997: published in 1995, click on link]

B.L. Meek, Programming languages - towards greater commonality, Sigplan Notices of the ACM, April 1994 (based on DECUS Symposium 1992 presentation)

Copyright © 1994 by the Association for Computing Machinery, Inc. Please click here for full ACM copyright notice.


WWW page prepared by Brian Meek, brian.meek@kcl.ac.uk, 30 June 1995, last updated 23 April 1997