Start Lecture #7
Remark: Midterm exam next recitation.
Homework assigned today is due at the beginning of the next class,
not the next recitation.
Remark: Recall from last lecture that a type system consists of:
A value must have a type compatible with the context in which the value is used. For most languages this notion is significantly weaker than equivalence; that is, there may be several non-equivalent types that can legally occur at a given context.
We first need to ask what the contexts are in which only a type
compatible with the expected type can be used.
Three important contexts are
As these examples indicate, there is often a natural type for a
context, and the question is whether the type that actually appears is
compatible with this natural type.
Thus the question reduces from "Is type T compatible at this context?"
to "Is type T compatible with type S?".
The definition of type compatibility varies greatly from one language to another. Ada is quite strict. Two Ada types S and T are compatible in only three cases.
Pascal is slightly more lenient; it permits an integer to appear
where a real number is expected.
Note that permitting an integer to stand for a real implies that a
type conversion will be performed "under the covers".
Fortran and C are more liberal and will perform more automatic type conversions.
Such implicit, or automatic, type conversions are called coercions.
Widespread coercions are a controversial feature of C and other languages since they significantly weaken the type system.
Note that in coercing a value of type T into a value of type S, the system may need to
A language like Ada never performs a coercion of the second kind (types T and S that would require this kind of conversion are not compatible). Instead, the programmer must explicitly request such conversions by using the name of the target type as a function. For example, Y:=float(3); would be used if Y is of type float.
In contrast C permits the above without the float() and even permits a naked x=3.5; when x is an int.
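Both directions of coercion can be seen in a few lines of C. This is only a minimal sketch; the function names int_to_double and double_to_int are ours, chosen for illustration.

```c
/* Widening coercion: the int is converted to double with no cast. */
double int_to_double(int i) {
    double y = i;      /* implicit int -> double */
    return y;
}

/* Narrowing coercion: C accepts this without a cast and discards
   the fractional part (truncation toward zero). */
int double_to_int(double d) {
    int x = d;         /* implicit double -> int */
    return x;
}
```

With these definitions double_to_int(3.5) yields 3. Compilers typically stay silent about the narrowing unless asked to warn (for example, gcc with -Wconversion).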
A recent trend is to have fewer coercions and thus stronger types.
In this regard, consider the C++ feature permitting
"user-defined coercions".
Specifically, C++ permits a class definition to give coercion
operations to/from other types.
Should this be called a coercion at all?
That is, are user-defined coercions really automatic?
Definition: Overloading occurs when there are multiple definitions for the same name that are distinguished by their type.
Definition: The decision as to which of these definitions applies at a given use of the name is called overload resolution.
Normally, overloading is used only in static type systems and the overloaded name is that of a function. Resolving an overloaded function name uses the signature of each of the individual functions, i.e.,
Overloading is related to coercion. Instead of saying that the int in int+float is coerced to a float, we could define another addition operator named + that takes (int,float) and produces float. Naturally, this operator works by first converting the int to a float and then doing a floating-point addition. This proposal is not practical when there are many types and/or many parameters. For example, sum4(a,b,c,d) would need 3^4 = 81 definitions if it is to work when each argument could be integer, positive, or natural.
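C has no function overloading, but C11's _Generic selection can imitate overload resolution well enough to show how quickly the number of definitions grows. The sketch below is ours, not part of any library: the names add_ii, add_id, add_di, add_dd, and the macro add are hypothetical. With two argument types and two parameters we already need 2^2 = 4 definitions.

```c
/* One function per signature, as the proposal in the text requires. */
int    add_ii(int a, int b)       { return a + b; }
double add_id(int a, double b)    { return (double)a + b; }
double add_di(double a, int b)    { return a + (double)b; }
double add_dd(double a, double b) { return a + b; }

/* "Overload resolution" on the types of both arguments. */
#define add(a, b) _Generic((a),                          \
    int:    _Generic((b), int: add_ii, double: add_id),  \
    double: _Generic((b), int: add_di, double: add_dd))(a, b)
```

A call such as add(1, 2.5) selects add_id at compile time, so no coercion of the arguments is needed at the call site; the conversion has moved inside the selected function.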
    int main (void) {
        int f(int x);
        int f(char c);
        return f(6.2);
    }
    int f(int x) { return 10; }
    int f(char c) { return 20; }
Alternatively, one could eliminate all overloading of addition by just defining float+float and coercing int+int in addition to int+float and float+int.
The combination of overloading and coercion can lead to ambiguity, as shown by the C++ example above. The g++ compiler gives the following remarkably clear error message when handed this program.
    tt.c: In function 'int main()':
    tt.c:4: error: call of overloaded 'f(double)' is ambiguous
    tt.c:2: note: candidates are: int f(int)
    tt.c:3: note:                 int f(char)

The problem is that a C/C++ double (such as 6.2) can be coerced into either int or char. This same ambiguity would occur if, instead of 6.2, we used a variable of type double.
Literals: Either literals must be considered to be of many types (e.g., 6 would be an integer, a positive, a natural, and a member of any user types derived from integer) or literals must be members of a special type universal integer that is compatible with any of the above types. Ada uses this second solution.
Several languages provide a universal reference type (also called a
generic reference type).
Elements of this type can contain references to any other type.
In C/C++ the universal reference type is void * and
can be thought of as a "pointer to anything".
In an object-oriented language the analogous type is object
in C# and Object in Java.
Since objects are normally heap allocated, one could consider
object and Object to also be "pointers to anything",
or "references to anything".
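It is type safe to put any reference into an object of universal reference type: Since the type of an object referred to by a universal reference is unknown, the compiler will not allow any operations to be performed on that object. For example, g++ rejects the pair of C++ statements below, because a void * is not implicitly converted to the char * that strlen expects. (C is laxer here: it implicitly converts void * to any other object pointer type, so gcc accepts the same pair when compiling C.)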
void *x; strlen(x);
However the reverse assignment in which a universal reference is assigned to a more specific reference type is problematic since now operations appropriate to this specific reference can be applied to a value that belongs to the general reference.
A Java-like example would go as follows. The standard java.util library contains a Stack container class. A programmer uses the constructor and obtains stack, a member of the class. Any object can be pushed onto stack since it is a stack of Objects. The programmer pushes a rectangle and a parallelogram onto the stack. This is fine: members of the stack are Objects, and any operation applicable to an Object can be legally applied to a parallelogram or to a rectangle. Now the programmer wants to pop the stack and have the top member assigned to a rectangle. Java will not permit this since the member is an Object and Objects cannot be assigned to rectangles. The programmer must use a cast when doing the pop, telling the system that the top member is a rectangle. Java actually keeps tags with objects and so can verify that the top Object is actually a rectangle. If it is in fact only a parallelogram, a run-time error is reported. This is dynamic type checking. If the top member of the stack was a rectangle and the programmer tried to pop it into a parallelogram variable, Java would detect that the top stack member belongs to a subclass of parallelogram, so no error occurs. Type safety has been preserved.
Summary: In object oriented terminology x=y; is safe if y is more specific than x (i.e., y has more operations defined). In contrast y=x; cannot be permitted without checking that the value in x is suitable for y.
The real problems arise with a language like C (and void *) since no tags are kept and the system cannot determine whether the pointer being popped references a rectangle or a parallelogram.
Hence the programmer must use an unchecked type conversion
between void * and struct rectangle *.
Type safety has been lost.
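The loss of safety, and one way a C programmer can partially recover it by hand, can be sketched as follows. Everything here is an illustrative assumption: the two struct layouts, the enum shape_tag, and both conversion functions are ours, and the checked version works only if every shape struct begins with the tag.

```c
#include <stddef.h>

enum shape_tag { TAG_PARALLELOGRAM, TAG_RECTANGLE };

struct parallelogram { enum shape_tag tag; double base, side, angle; };
struct rectangle     { enum shape_tag tag; double width, height; };

/* Unchecked conversion: C accepts this for any pointer at all. */
struct rectangle *unchecked_as_rectangle(void *p) {
    return p;   /* void * converts implicitly; nothing is verified */
}

/* Checked conversion, possible only because we store a tag ourselves.
   Assumes every shape struct starts with an enum shape_tag member. */
struct rectangle *checked_as_rectangle(void *p) {
    if (p == NULL || *(enum shape_tag *)p != TAG_RECTANGLE)
        return NULL;    /* our analogue of Java's run-time error */
    return p;
}
```

The unchecked version happily returns a struct rectangle * that really refers to a parallelogram. The checked version imitates the tags that Java keeps automatically, but nothing forces the programmer to set or test them.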
In this section we emphasize type synthesis, i.e., determining the type of an expression given the type of its constituents. Next lecture, when we study ML, we will encounter type inference, where the type of the constituents is determined by the type of the expression.
Sometimes it is easy to synthesize the type of the expression. For example,
    with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;
    procedure Ttada is
       subtype T1 is Integer range 0..20;
       subtype T2 is Integer range 10..20;
       A1 : T1;
       A2 : T2;
    begin
       Get (A1);
       Get (A2);
       Put (A1+A2);
    end Ttada;
What would be the type of A1+A2 in the Ada program above? Two possibilities come to mind.
This is also true of types with range constraints, but our use of Ada will only add ranges to subtype definitions, not type definitions.
Next lecture will be on ML, including the type system.
Homework: CYU 15, 16, 18
Records are a common feature in modern programming languages; however the terminology differs among the languages. They are referred to as records in Pascal and Ada, just types in ML and Fortran, and structs (short for structures) in C. They are similar to methodless classes in Java, C#, C++.
In all cases a record is a set of typed fields. Some choices must be made in defining records for a specific language.
Ada:
    type atomic_element is record
       Name          : String (1..2);
       Atomic_Number : Integer;
       Atomic_Weight : Float;
    end record;

C:
    struct atomic_element {
        char   name [2];
        int    atomic_number;
        double atomic_weight;
    };

ML:
    type atomic_element = {
        name          : string,
        atomic_number : int,
        atomic_weight : real};
Above are specifications of the same record type in three languages: Ada, C, and ML. Note that in ML there is no , after real. As in Algol, this ML punctuation is a separator, not a terminator.
If you want to play with ML, the command name is sml, which is
short for "Standard ML of New Jersey".
As noted above the order of the fields in a record is not
significant in ML.
It appears to me that sml alphabetizes the fields; that is,
the example is reordered to atomic_weight, atomic_number, name.
Records in Scheme are somewhat more complicated; we won't use them.
The issue here concerns alignment of fields within a record.
Today essentially all computer architectures are byte addressable
and hence, in principle, each field could start at the byte right
after the last byte of the previous field.
However, architectures normally have alignment requirements,
something like
"a datatype requiring n bytes to store must begin at an address
that is a multiple of n".
For example, a 4-byte quantity (e.g., an int in C) cannot occupy
bytes 10,002-10,005.
In most cases the truth is that an int in C could occupy any 4 consecutive bytes, but accessing the int would be significantly faster if it was properly aligned so that its starting address is a multiple of 4. We will not be so subtle and will simply use the crude approximation given by the quote in the previous paragraph.
Note that this issue is not so severe with arrays as with records. Since arrays are homogeneous, if the first element is aligned and the elements are packed with no padding in between, then all elements will be aligned. With records, such as atomic_element, it is harder. Assume the record itself is aligned on a doubleword (8-byte) boundary. That means that the name field will begin at an address like 8800 that is evenly divisible by 8. However, the atomic_number field of the record will begin 2 bytes later and hence will not be properly aligned for the integer datatype (which we assume is 4 bytes in size).
Hence atomic_number is not properly aligned and cannot be
accessed efficiently.
As a result a two-byte pad
is inserted between name and
atomic_number.
This wastes space.
Assume that the float/real/double value atomic_weight needs
8-byte alignment.
This alignment is satisfied since name + pad +
atomic_number consume 8 bytes and the record itself is
8-byte aligned.
Thus the total padding required is 2 bytes, which is 1/8 of the total
space (16 bytes) of the padded record.
Now note that if the record was reordered so the fields were weight, number, name (i.e. decreasing size and required alignment), there would be no padding at all. ML changes the ordering visibly, but does not use size as the ordering criterion. Some languages permit the compiler to reorder. Since arbitrary reordering is impossible for systems programming, where the fields can correspond to specific hardware device registers whose relative addresses are given, systems programmers must choose languages in which such reorderings can be overridden.
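The padding can be observed from within C using offsetof. The following is a minimal sketch: the reordered struct and the two helper functions are ours, and the actual numbers depend on the platform's sizes and alignment rules, which the C standard leaves to the implementation.

```c
#include <stddef.h>

/* Field order from the text: name, number, weight. */
struct atomic_element {
    char   name[2];
    int    atomic_number;
    double atomic_weight;
};

/* Decreasing size and alignment: weight, number, name. */
struct atomic_element_reordered {
    double atomic_weight;
    int    atomic_number;
    char   name[2];
};

/* Bytes of padding inserted between name and atomic_number. */
size_t pad_before_number(void) {
    return offsetof(struct atomic_element, atomic_number)
         - sizeof(((struct atomic_element *)0)->name);
}

/* Padding between atomic_number and name in the reordered record. */
size_t pad_before_name_reordered(void) {
    return offsetof(struct atomic_element_reordered, name)
         - (offsetof(struct atomic_element_reordered, atomic_number)
            + sizeof(int));
}
```

On a typical platform with 4-byte ints we would expect pad_before_number() to be 2, matching the discussion above. Note that C compilers normally also pad the end of a struct so that the elements of an array of such structs stay aligned; hence sizeof the reordered struct may well be 16 rather than 14 even though it has no interior padding.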
One last point. If the padding is garbage (i.e., nothing specific is placed in the pad), then comparing a record requires comparing the fields separately since a single large comparison would require that the two garbage pads be equal. Thus another trade-off occurs. Padding with garbage makes creating a record faster, but padding with zeros (or any other constant) makes comparing two records faster.
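The two styles of comparison look as follows in C. This is a sketch under assumptions: struct padded and both function names are ours, and the bytewise version is reliable only when the padding bytes of both records are known to hold the same values (for instance because one record was produced from the other with memcpy). Strictly speaking, C does not promise that padding bytes keep their values across later stores to the struct, so the field-by-field version is the safe default.

```c
#include <string.h>

struct padded {
    char   tag;      /* typically followed by padding */
    double value;
};

/* Field-by-field comparison: correct whatever the padding holds. */
int equal_fieldwise(const struct padded *a, const struct padded *b) {
    return a->tag == b->tag && a->value == b->value;
}

/* One large comparison: also compares the padding bytes, so it is
   meaningful only if the padding of both records is known to match. */
int equal_bytewise(const struct padded *a, const struct padded *b) {
    return memcmp(a, b, sizeof *a) == 0;
}
```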
Homework: 8.
Definition: A variant record is a record in which the fields act as alternatives, only one of which is valid at any given time. Each of the fields is itself called a variant.
    with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;
    procedure Ttada is
       type Point is record
          XCoord : Float;
          YCoord : Float;
       end record;
       type ColorType is (Blue, Green, Red, Brown);
       type FigureKind is (Circle, Square, Line);
       type Figure(Kind : FigureKind := Square) is record
          Color   : Colortype;
          Visible : Boolean;
          case Kind is                -- Kind is the discriminant
             when Line =>
                Length      : Float;
                Orientation : Float;
                Start       : Point;
             when Square =>
                LowerLeft  : Point;
                UpperRight : Point;
             when Circle =>
                Radius : Float;
                Center : Point;
          end case;
       end record;
       C1 : Figure(Circle);           -- Discriminant cannot change
       S1 : Figure;                   -- Default discriminant, can change
       function Area (F : Figure) return Float is
       begin
          case F.Kind is
             when Circle => return 3.14*F.Radius**2;
             when Square => return F.Radius;    -- ERROR
             when Line   => return 0.0;
          end case;
       end Area;
    begin
       S1 := (Square, Red, True, (1.5, 5.5), (1.2, 3.4));
       C1.Radius := 15.0;
       if S1.LowerLeft = C1.Center then
          null;
       end if;
       -- ERROR to write S1.Kind := Line;
       -- ERROR to write C1 := S1;
    end Ttada;
Actually a variant can be a set of fields. For example, the Line variant above has three fields. As the example also shows, a record such as Figure can have several non-variant fields as well as several variant fields.
In some languages (e.g., Ada, Pascal) an explicit field, called the discriminant or tag, keeps track of which variant is currently valid. In the example above the discriminant is Kind. We will see soon that Scheme employs an implicit discriminant.
Definition: When the variant record contains an explicit or implicit discriminant, the record is called a discriminated union. When no such tag is present, the variant record is called a nondiscriminated union.
Clearly languages with nondiscriminated unions, e.g. C's union construct or Fortran's equivalence, give up type safety since it is the programmer who must ensure that the variant read is the last variant that was written. However, even languages like Pascal and Modula-2, which have discriminated unions, have type weakness in this area. To see the technical details read section 7.3.4 of the 3e (that section is one of those on the CD).
Ada is fairly type safe, but it is complicated. The Modula-3 designers did not like the type dangers in Modula-2 and the complexity of Ada and omitted variant records entirely.
The Ada example illustrates a number of features of the language; here we concentrate on just the variant record aspects. Figure is a variant record with Kind the discriminant. Note that the discriminant is given a default value; that is what permits us to define S1 without giving a discriminant as we did for C1. If Kind did not have a default then the declaration of S1 would be an error.
C1 will always be a circle; see the ERROR comment at the end. S1 can be any figure. Note, however, that you can only change the tag when you change the entire variant record; see the ERROR comment near the end.
Although this program compiles, it is incorrect. The Area function checks the Kind of its argument, but, if the Kind is Square, it references the Radius component. If Area were actually called with a Square, a run-time error would occur.
Finally, null is an Ada statement that does nothing. I used it here to make the if statement legal (although useless). In reality there would be other code in the then arm.
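For comparison, a C programmer can approximate a discriminated union by pairing a union with an explicit tag. The sketch below is ours (the type and field names are hypothetical and only loosely follow the Ada example; the non-variant Color and Visible fields are omitted). The important difference from Ada is that nothing in C checks the tag: reading f->u.circle.radius when kind is SQUARE compiles and produces no run-time error, just a meaningless value.

```c
enum figure_kind { CIRCLE, SQUARE, LINE };

struct point { double x, y; };

struct figure {
    enum figure_kind kind;            /* explicit discriminant (tag) */
    union {                           /* only one arm is valid at a time */
        struct { double radius; struct point center; } circle;
        struct { struct point lower_left, upper_right; } square;
        struct { double length, orientation; struct point start; } line;
    } u;
};

/* Each case reads only the arm that matches the tag. */
double area(const struct figure *f) {
    switch (f->kind) {
    case CIRCLE:
        return 3.14 * f->u.circle.radius * f->u.circle.radius;
    case SQUARE: {
        double side = f->u.square.upper_right.x - f->u.square.lower_left.x;
        return side * side;
    }
    case LINE:
        return 0.0;
    }
    return 0.0;   /* not reached for valid tags */
}
```

Keeping the tag and the union consistent is entirely the programmer's responsibility here, which is exactly the type weakness of nondiscriminated unions noted above.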
    (define X 35)
    (+ X 5)
    (define X '(a 4))
    (+ X 5)
Recall that Scheme has strong but dynamic typing. That means that the variables are not typed but the values are. If you present the example above to a Scheme interpreter, you will get an error for the last line. X was first an integer; then a list. Although there is no explicit tag, the system keeps one implicitly and thus can tell that the last line is a type error since you can't add an integer to a list.
Instead of the type-unsafe unions of C, a Java programmer could have a base class of the non-varying components and then extend it to several classes, one for each arm of the union.
Definition: An array is a mapping from an index type to an element type (or component type).
Arrays are found in essentially all languages. Most languages require the index type to be an integral type. Pascal, Ada, and Haskell allow the index type to be any discrete type.
Ada: (82, 21, 5)
Scheme: #(82 21 5)
First-Classness: In C/C++ functions cannot return arrays and there are no array literals (but there are array initializers). Scheme and Ada have array literals, as shown above.
The syntax of array usage varies little across languages. The one distinction is whether to use [] or () for the subscript. Most common is [], which permits array usage to stand out in the program. Fortran uses () since Fortran predates the widespread availability of [] (the 026 keypunch didn't have []). Ada uses () since the designers wanted to emphasize the close relation between subscripted references and function evaluations. A(I) can be thought of as the Ith component of the array A or the value of the function A at argument I. Since an array is a function from the index type, there is merit to this argument.
C:
    float x[50];

Ada:
    type Idx is new Integer range 0..49;
    X : array (Idx) of Float;
    Y : array (Integer range 0..49) of Float;
    Z : array (0..49) of Float;
    type Vec is array (0..9) of Float;
    Mat1 : array (0..9) of Vec;
    Mat1(3)(4) := 5.0;
    Mat2 : array (0..9, 0..9) of Float;
    Mat2(3,4) := 5.0;
An array declaration needs to specify the index type and the element type. For a language like C, the index type is always the integers from zero up to some bound. Thus all that is needed is to specify the component type and the number of elements (one more than the upper bound), as shown above.
With the increased flexibility of Ada arrays comes the requirement
to specify additional information, namely the full index type.
The first two Ada lines give us 50 Floats, just like the C example, but
the index type is Idx, so you can't write X(I) where I is an
Integer variable.
You can write Y(I) and Z(I); the declarations of Y and Z specify the
same index type (the declaration of Z is an abbreviation
of the declaration of Y).
Multi-Dimensional Arrays: Essentially all languages support multidimensional arrays; the question is whether a two-dimensional (2D) array is just a 1D array of 1D arrays. Ada offers both possibilities, as shown above. The 2D array Mat2 has elements referenced as Mat2(2,3), whereas the 1D array of 1D arrays Mat1 has elements referenced as Mat1(2)(3). The 2D notation is less clumsy, but slices (see below) are possible only with Mat1. C arrays are like Mat1: they are 1D arrays of 1D arrays. However, C does not support slices.
A slice or section is a rectangular section of an array. Fortran 90 and many scripting languages provide extensive facilities for slicing. Ada provides only 1D support. Z(11..44) is a slice of floats and Mat1(2..6) is a slice of Vec's. Note that this second example is still a 1D slice. You cannot get a general 2D rectangular subset of the 100 floats in Mat1.
Definition: The shape of an array consists of the number and bounds of each dimension.
Definition: A dope vector contains the shape information of the array. It is needed during compile time (not our concern) and for the most flexible arrays is needed during run-time as well.
The binding time of the shape as well as the lifetime of the array determines the storage policy used for the array. The trade-off is between flexibility and run-time efficiency.
The layout for 1D arrays is clear: Align the first element and store all the succeeding elements contiguously (they are then guaranteed to be aligned).
We will discuss 2D arrays; higher dimensional arrays are treated analogously. Assume we have an array A with 6 rows (numbered 0..5) and 3 columns. Clearly A[0][0] will be stored first (and aligned). The big question is what is stored next.
Row-major order: A[0][0], A[0][1], A[0][2], A[1][0], ..., A[5][2]
Column-major order: A[0][0], A[1][0], ..., A[5][0], A[0][1], ..., A[5][2]
The reason the layout is important is that if an array is traversed in an order different from how it is laid out, the execution time increases, perhaps dramatically (cache misses, page faults, stride prediction errors).
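The two layouts correspond to two simple index formulas, sketched here for the 6-by-3 array A of the text. The function names are ours. Since C stores its arrays of arrays row by row, the third function, which measures the real position of an element inside a C array, should agree with the row-major formula.

```c
#include <stddef.h>

enum { ROWS = 6, COLS = 3 };

/* Position of A[i][j], counted in elements from A[0][0]. */
size_t row_major_index(size_t i, size_t j) { return i * COLS + j; }
size_t col_major_index(size_t i, size_t j) { return j * ROWS + i; }

/* Where a C compiler actually places A[i][j] (in elements). */
size_t actual_index(size_t i, size_t j) {
    static double A[ROWS][COLS];
    return (size_t)((char *)&A[i][j] - (char *)A) / sizeof(double);
}
```

A loop nest with j innermost walks a row-major array with stride 1, whereas with i innermost each step jumps COLS elements; that difference in stride is the source of the cache effects mentioned above.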
Alignment is not a major issue for most arrays; align the first element and store the rest contiguously. The exception is arrays of records. Imagine a record consisting of a 1-byte character and an 8-byte real. The total size of the fields is 9-bytes. If the byte is stored first and the record itself is 8-byte aligned (as it would be in most languages) then 7 bytes of padding are needed before the real. Thus the record size would be 16 bytes and an array of 100 such records would require 1600 bytes, of which only 900 bytes are data.
If the real is stored first, the character is aligned without padding and the record takes 9 bytes, but the next record in the array needs 7 bytes of padding so a 100-element array of these records takes 1593 bytes (there is no padding needed after the last record).
If, instead of an array of records, the data structure was organized as a record of arrays (an array of characters followed by an array of reals), the memory efficiency is better. If the reals are stored first, no padding is needed so the space required is 900 bytes. If the characters are stored first, the array of reals will need a 4-byte pad (since 100/8 leaves a remainder of 4) and the total space required is 904 bytes.
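The arithmetic of the last three paragraphs can be checked with a small helper that rounds an offset up to the next multiple of an alignment. This sketch follows the simplified model of the text (a 1-byte character, an 8-byte real aligned on a multiple of 8, no padding after the last element); the function names are ours, and a real C compiler would also pad the end of each struct.

```c
#include <stddef.h>

/* Round offset up to the next multiple of alignment. */
size_t align_up(size_t offset, size_t alignment) {
    return (offset + alignment - 1) / alignment * alignment;
}

/* Array of n records {char; real}: 7 pad bytes inside each record. */
size_t aos_char_first(size_t n) {
    size_t real_offset = align_up(1, 8);      /* 8: 1 char + 7 pad */
    return n * (real_offset + 8);             /* 16 bytes per record */
}

/* Array of n records {real; char}: 7 pad bytes between records only. */
size_t aos_real_first(size_t n) {
    return n == 0 ? 0 : n * align_up(9, 8) - 7;
}

/* Record of arrays, characters first: pad so the reals are aligned. */
size_t soa_char_first(size_t n) {
    return align_up(n, 8) + n * 8;
}

/* Record of arrays, reals first: no padding needed. */
size_t soa_real_first(size_t n) {
    return n * 8 + n;
}
```

For n = 100 these give 1600, 1593, 904, and 900 bytes respectively, the figures quoted in the text.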
Homework: Consider the following C definitions.
    struct s1 {
        char   c1;
        double d1;
        char   c2;
        double d2;
    } A[100];
    struct s2 {
        char   c3[100];
        double d3[100];
        char   c4[100];
        double d4[100];
    } B;

Assume char requires 1 byte and double requires 8 bytes and that a datatype must be aligned on a multiple of its size.
An important distinction that must be made before looking at pointers and recursive types is the value model of variables vs the reference model.
In the value model, used by C, Pascal, and Ada, a variable such
as X denotes its value, say 12.
In the reference model, used by Scheme and ML, X denotes a reference
to 12 (in a coarse sense X "points to" 12).
You might think that with the reference model, expressions would be
difficult, but that is not so.
When you write X+1 the reference X is automatically dereferenced to
the value 12.
Java uses the value model for built-in scalar types and uses the reference model for user-defined types.
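C uses the value model throughout, but a pointer can imitate a reference, which makes the contrast visible within one language. This is only a rough analogy (the function names are ours): in a reference-model language the dereference is implicit, whereas in C it must be written out with *.

```c
/* Value model: y receives a copy of x's value. */
int copy_then_change(void) {
    int x = 12;
    int y = x;       /* copies the value 12 */
    y = y + 1;       /* changes only the copy */
    return x;        /* x is still 12 */
}

/* A reference imitated with a pointer: r refers to x's storage. */
int alias_then_change(void) {
    int x = 12;
    int *r = &x;     /* r "points to" the 12 */
    *r = *r + 1;     /* explicit dereference; changes x itself */
    return x;        /* x is now 13 */
}
```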
Turning to recursive types, consider two very common linked data structures: linked lists and trees. Take a trivial example, where a tree node is one character and two pointers (to the left and right subtrees; we are considering only binary trees). So a node is a record with three fields: a character and two references to child nodes.
In ML we would define a tree node tnode as
    datatype tnode = empty | node of char * tnode * tnode;

The | indicates an alternative. The empty is used for a leaf (we are assuming the data of the tree is the characters stored in the non-leaves). The interesting part is after the |. A tnode is a tuple consisting of a character and two tnodes. But that is not possible! It is really a character and two references to tnodes. But ML uses the reference model, so writing tnode means a reference to tnode.
Ada:
    procedure Ttada is
       type Tnode;                       -- an incomplete type
       type TnodePtr is access Tnode;
       type Tnode is record
          C : Character;
          L : TnodePtr;
          R : TnodePtr;
       end record;
       T : TnodePtr := new Tnode'('A',null,null);
    begin
       T.L := new Tnode'('L',null,null);
       T.R := new Tnode'('R',null,null);
       T.all.R.all.C := 'N';
       T.R.C := 'N';
    end Ttada;

C++:
    struct tnode {
        char c;
        tnode *left;
        tnode *right;
    };
    struct tnode *t;
With the value model, we need something to get
"a reference to" a tnode.
The Ada code is above.
The first line declares (but doesn't define) the incomplete type
Tnode.
The next line defines the type "reference to Tnode".
The key is that an access definition does not need the
entire definition of the target.
Then we can define Tnode and build a simple tree with a root
and two leaves for children.
A tree is just a pointer to the root.
The C++ structure definition is shorter since you can mention the
type being defined within the definition.
However, when two different types contain references to each other,
the C++ code uses the same idea of "incomplete declaration
followed by complete definition" that Ada does.
Pointers and Dereferencing: Looking again at the Ada example, T is a reference to a tree-node. That is nice, but we also want to refer to the tree-node itself and to its components. That is we want to dereference T. For this we use .all and write T.all, which is a record containing three fields.
As with any record, we write .L to refer to the field named L and so T.all.L is the left component of the root. Similarly T.all.R is the right component and T.all.R.all.C:='N' changes the character field of the right child to 'N'. The construct .all.fieldName occurs often and can be abbreviated to .fieldName as shown.
In C/C++ the dereferencing is done by *. Thus t points to a tree-node, *t is the tree node, and (*t).c is the character component. You can abbreviate the combination of (* ). by -> and write t->c.
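A plain C version of the tree node shows both notations at work. This is a minimal sketch; the helper functions new_leaf and free_tree are ours and are not part of the lecture's examples.

```c
#include <stdlib.h>

struct tnode {
    char c;
    struct tnode *left;    /* a pointer may refer to the type being defined */
    struct tnode *right;
};

/* Allocate a node with no children; returns NULL if malloc fails. */
struct tnode *new_leaf(char c) {
    struct tnode *t = malloc(sizeof *t);
    if (t != NULL) {
        t->c = c;
        t->left = NULL;
        t->right = NULL;
    }
    return t;
}

/* Free a whole tree, children first. */
void free_tree(struct tnode *t) {
    if (t == NULL)
        return;
    free_tree(t->left);
    free_tree(t->right);
    free(t);
}
```

After t = new_leaf('A'), t->left = new_leaf('L'), and t->right = new_leaf('R'), the two statements (*(*t).right).c = 'N'; and t->right->c = 'N'; have the same effect, mirroring the pair T.all.R.all.C := 'N'; and T.R.C := 'N'; in the Ada code.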
    void f (int *p) {...}
    int a[10];
    f(a);
    f(&a[0]);
    f(&(a[0]));

    int *p = new int[4];
    ... p[0]  ...    ... *p      ...
    ... p[1]  ...    ... *(p+1)  ...
    ... p[10] ...    ... *(p+10) ...
In C/C++ pointers and arrays are closely related. Specifically, using an array name often is the same as using a pointer to the first element of the array. For example, the three function calls above are all the same. Similarly the references involving p are the same for both members of each pair. Also note that the last pair references past the end of the array; these are undetected errors.
There are, however, places where pointers and arrays are different
in C, specifically in definitions.
Consider int x[100], *y;
Although both x and y point to integers, the first definition
allocates 100 (probably 4-byte) integers, while the second allocates
one (probably 4-byte or 8-byte) pointer.
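Both the similarity and the difference can be sketched in C. The function names below are ours. The two summing functions receive a pointer whether the caller passes an array name or the address of its first element; the last two functions show that the definitions themselves are not alike.

```c
#include <stddef.h>

/* p[i] is defined to mean *(p + i). */
int sum_by_index(const int *p, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += p[i];
    return total;
}

/* The same traversal written with explicit pointer arithmetic. */
int sum_by_pointer(const int *p, size_t n) {
    int total = 0;
    for (const int *q = p; q != p + n; q++)
        total += *q;
    return total;
}

/* In a definition the two differ: x is 100 ints, y is one pointer. */
size_t size_of_array(void)   { int x[100]; return sizeof x; }
size_t size_of_pointer(void) { int *y;     return sizeof y; }
```

Given int a[4], the calls sum_by_index(a, 4) and sum_by_index(&a[0], 4) are equivalent, while size_of_array() returns 100 * sizeof(int) and size_of_pointer() returns the size of a single pointer.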
It is quite easy to get into trouble with pointers if low-level manipulation is permitted. For example:

    int *p1, *p2;
    p1 = new int[10];
    p2 = p1;
    delete[] p1;
    p2[5] = ...        // p2 now refers to storage that has been freed
First we need to define these three related concepts.
Definition: A list is an ordered collection of elements.
Definition: A set is a collection of elements without duplicates and with fast searching.
Definition: A map is a collection of (key,value) pairs with fast key lookup.
Normally, only very high level languages have these built in. They are often in standard libraries and can be programmed in essentially all languages. Specifically,
Assigning types to procedures is needed if procedures can be passed
as arguments or returned by functions.
A statically typed language like Ada does need the signature of
procedures so that it can type check calls.
However, it does not need to assign a type to a
procedure itself since no procedure can be passed as an argument.
Thus you can't say in Ada something like
type proc_type is procedure (x : integer);
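By way of contrast, C does assign types to procedures, through pointer-to-function types, so the analogue of proc_type can be written down and used as a parameter type. The sketch below is ours; record_arg, last_seen, and apply_to_each are hypothetical names used only for the demonstration.

```c
/* A type for "procedure taking one int", the analogue of proc_type. */
typedef void (*proc_type)(int);

int last_seen = 0;   /* demo state: the most recent argument seen */

void record_arg(int x) { last_seen = x; }

/* A procedure passed as an argument; the call p(items[i]) is type
   checked against the signature named by proc_type. */
void apply_to_each(proc_type p, const int *items, int n) {
    for (int i = 0; i < n; i++)
        p(items[i]);
}
```

A call such as apply_to_each(record_arg, v, 3) is accepted because record_arg has exactly the signature that proc_type describes; passing a function with a different signature draws a compiler diagnostic.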
As noted above statically typed languages like Ada require procedures to be declared with full signatures. C supports procedures with varying numbers of arguments, which is a loophole in the C type system. Java supports a limited version of varargs that is type safe. Specifically all the optional arguments must be of the same (declared) type.
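The C loophole is easy to exhibit with the standard <stdarg.h> facilities. In this sketch (the function sum_ints is ours) the callee simply trusts that the caller passed count values and that each of them is an int; the compiler checks neither.

```c
#include <stdarg.h>

/* Sums `count` int arguments. Nothing verifies that the caller
   really supplied `count` values, or that they are ints. */
int sum_ints(int count, ...) {
    va_list ap;
    int total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);   /* type asserted, not checked */
    va_end(ap);
    return total;
}
```

The call sum_ints(3, 1, 2, 3) behaves as expected, but sum_ints(2, 1.5, 2.5) also compiles and has undefined behavior, since the doubles are read back as ints. Java's restriction that all optional arguments share one declared type rules out precisely this kind of mismatch.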