Programming Languages

Start Lecture #7

Remark: Midterm exam next recitation.
Homework assigned today is due at the beginning of the next class, not the next recitation.

Remark: Recall from last lecture that a type system consists of:

7.2.2: Type Compatibility

A value must have a type compatible with the context in which the value is used. For most languages this notion is significantly weaker than equivalence; that is, there may be several non-equivalent types that can legally occur at a given context.

We first need to ask what are the contexts in which only a type compatible with the expected type can be used. Three important contexts are

As these examples indicate, there is often a natural type for a context and the question is whether the type that actually appears is compatible with this natural type. Thus the question reduces from "Is type T compatible at this context?" to "Is type T compatible with type S?"

The definition of type compatibility varies greatly from one language to another. Ada is quite strict. Two Ada types S and T are compatible in only three cases.

Pascal is slightly more lenient; it permits an integer to appear where a real number is expected. Note that permitting an integer to stand for a real implies that a type conversion will be performed under the covers.

Fortran and C are more liberal and will perform more automatic type conversions.

Coercions

Such implicit, or automatic, type conversions are called coercions.

Widespread coercions are a controversial feature of C and other languages since they significantly weaken the type system.

Note that in coercing a value of type T into a value of type S, the system may need to

  1. Check if the value meets a constraint.
    For example, an integer must be >0 in order to be considered a positive.

  2. Convert the low-level representation of the value.
    For example, a 32-bit signed integer (type T) must have its representation changed significantly when coerced into an IEEE-standard, double-precision floating point number (type S).

    In this example no constraint checking is needed and no precision can be lost since every 32-bit signed integer has a (unique) representation as a (normalized) value in S.

    The reverse coercion (S to T) is more involved. Many floats are too big in absolute value to be represented in T, and most of those that do fit lose precision in the process.

A language like Ada never performs a coercion of the second kind (types T and S that would require this kind of conversion are not compatible). Instead, the programmer must explicitly request such conversions by using the name of the target type as a function. For example Y:=float(3); would be used if Y is of type float.

In contrast C permits the above without the float() and even permits a naked x=3.5; when x is an int.
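The two directions of the int/double coercion discussed above can be sketched in C++ as follows. The function names widen and narrow are made up for this illustration, and the sketch assumes a double is an IEEE double-precision value.

```cpp
#include <cassert>
#include <cstdint>

// Widening: every 32-bit signed integer is exactly representable as a double,
// so this implicit coercion needs no check and loses nothing.
double widen(std::int32_t i) {
    return i;               // implicit coercion: int -> double
}

// Narrowing: C and C++ accept this silently. The fraction is discarded
// (truncation toward zero); the behavior is undefined if the truncated
// value does not fit in an int.
int narrow(double d) {
    int x = d;              // implicit coercion: double -> int, like x=3.5;
    return x;
}
```

Calling narrow(3.5) yields 3, which is the silent loss of information that Ada avoids by requiring an explicit conversion.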

A recent trend is to have fewer coercions and thus stronger types. In this regard, consider the C++ feature permitting user-defined coercions. Specifically, C++ permits a class definition to give coercion operations to/from other types. Should this be called a coercion at all? That is, are user-defined coercions really automatic?
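As a rough illustration of user-defined coercions, here is a minimal C++ sketch. The class Celsius and the function half are hypothetical names invented for this example.

```cpp
#include <cassert>

// Hypothetical class, used only for illustration.
class Celsius {
public:
    Celsius(double degrees) : degrees_(degrees) {}   // converting constructor: double -> Celsius
    operator double() const { return degrees_; }     // conversion operator: Celsius -> double
private:
    double degrees_;
};

double half(Celsius c) {
    double d = c;    // the compiler inserts a call to operator double()
    return d / 2;
}
```

A call such as half(98.0) compiles even though no Celsius appears at the call site: the compiler applies the converting constructor on its own. Marking the constructor or the conversion operator explicit turns this off and forces the programmer to write the conversion, which bears on the question of whether such conversions are really automatic.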

Overloading and Coercion

Definition: Overloading occurs when there are multiple definitions for the same name that are distinguished by their type.

Definition: The decision which of these definitions applies at a given use of the name is called overload resolution.

Normally, overloading is used only in static type systems and the overloaded name is that of a function. Resolving an overloaded function name uses the signature of each of the individual functions, i.e., the number and types of its parameters (and, in some languages such as Ada, the type of its result).

Overloading is related to coercion. Instead of saying that the int in int+float is coerced to a float, we define another addition operator named + that takes (int,float) and produces float. Naturally, this operator works by first converting the int to a float and then doing a floating point addition. This proposal is not practical when there are many types and/or many parameters. For example sum4(a,b,c,d) would need 3^4 = 81 definitions if it is to work when each argument could be integer, positive, or natural.

int main (void) {
  int f(int x);
  int f(char c);
  return f(6.2);
}
int f(int x) {
  return 10;
}
int f(char c) {
  return 20;
}

Alternatively, one could eliminate all overloading of addition by just defining float+float and coercing int+int in addition to int+float and float+int.

The combination of overloading and coercion can lead to ambiguity as shown by the C++ example above. The g++ compiler gives the following remarkably clear error message when handed this program.

tt.c: In function 'int main()':
tt.c:4: error: call of overloaded 'f(double)' is ambiguous
tt.c:2: note: candidates are: int f(int)
tt.c:3: note:                 int f(char)

The problem is that a C/C++ double (such as 6.2) can be coerced into either int or char. This same ambiguity would occur if, instead of 6.2, we used a variable of type double.
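One way out of the ambiguity is for the programmer to choose the conversion explicitly, so that exactly one overload matches without any coercion. The sketch below reuses the two definitions of f; the call_with_* helper names are invented for the illustration.

```cpp
#include <cassert>

int f(int)  { return 10; }
int f(char) { return 20; }

// f(6.2) is rejected as ambiguous, so the conversion is written out.
int call_with_int()  { return f(static_cast<int>(6.2)); }   // selects f(int)
int call_with_char() { return f(static_cast<char>(6.2)); }  // selects f(char)

// An argument whose type matches a signature exactly needs no coercion.
int call_exact()     { return f('a'); }                     // selects f(char)
```

With the cast in place, overload resolution finds an exact match and the question of which coercion to prefer never arises.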

Literals   Either literals must be considered to be of many types (e.g., 6 would be an integer, a positive, a natural, and a member of any user types derived from integer) or literals must be members of a special type universal integer that is compatible with any of the above types. Ada uses this second solution.

Universal Reference Types

Several languages provide a universal reference type (also called a generic reference type). Elements of this type can contain references to any other type. In C/C++ the universal reference type is void * and can be thought of as a pointer to anything.

In an object-oriented language the analogous type is object in C# and Object in Java. Since objects are normally heap allocated, one could consider object and Object to also be pointers to anything, or references to anything.

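It is type safe to put any reference into an object of universal reference type: since the type of an object referred to by a universal reference is unknown, the compiler will not allow any operations to be performed on that object. For example, g++ complains about the pair of C++ statements below, since a void * is not implicitly converted to the const char * that strlen expects (C itself is more permissive and accepts this conversion).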

    void *x;     strlen(x);
  

However the reverse assignment in which a universal reference is assigned to a more specific reference type is problematic since now operations appropriate to this specific reference can be applied to a value that belongs to the general reference.

A Java-like example would go as follows. The standard java.util library contains a Stack container class. A programmer uses the constructor and obtains stack, a member of the class. Any object can be pushed onto stack since it is a stack of Objects. The programmer pushes a rectangle and a parallelogram onto the stack. This is fine: members of the stack are Objects and any operation applicable to an Object can be legally applied to a parallelogram or to a rectangle.

Now the programmer wants to pop the stack and have the top member assigned to a rectangle. Java will not permit this since the member is an Object and Objects cannot be assigned to rectangles. The programmer must use a cast when doing the pop, telling the system that the top member is a rectangle. Java actually keeps tags with objects and so can verify that the top Object is actually a rectangle. If it is in fact only a parallelogram, a run-time error is reported. This is dynamic type checking.

If the top member of the stack was a rectangle and the programmer tried to pop it onto a parallelogram object, Java would detect that the top stack member belongs to a subclass of parallelogram so no error occurs. Type safety has been preserved.

Summary: In object oriented terminology x=y; is safe if y is more specific than x (i.e., y has more operations defined). In contrast y=x; cannot be permitted without checking that the value in x is suitable for y.

The real problems arise with a language like C (and void *) since no tags are kept and the system cannot determine whether or not the pointer being popped does reference a rectangle or parallelogram.

Hence the programmer must use an unchecked type conversion between
   void * and (struct rectangle) *.
Type safety has been lost.
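The contrast between the checked Java-style cast and the unchecked void * conversion can be sketched in C++, which offers both. The classes Shape, Parallelogram, and Rectangle and the helper functions are hypothetical names for this illustration; Shape plays the role of Java's Object.

```cpp
#include <cassert>

struct Shape { virtual ~Shape() {} };              // stands in for Java's Object
struct Parallelogram : Shape { double skew = 0.0; };
struct Rectangle : Parallelogram { };              // every rectangle is a parallelogram

// Checked downcast: dynamic_cast consults the run-time type tag kept with
// each polymorphic object and yields nullptr on a mismatch.
bool is_rectangle(Shape* s)     { return dynamic_cast<Rectangle*>(s) != nullptr; }
bool is_parallelogram(Shape* s) { return dynamic_cast<Parallelogram*>(s) != nullptr; }

// Unchecked conversion through void *: no tag is consulted, so the compiler
// simply trusts the programmer. Passing the address of a Parallelogram here
// would compile and silently produce a bad Rectangle pointer.
Rectangle* trust_me(void* p) { return static_cast<Rectangle*>(p); }
```

A Rectangle passes both checks, a plain Parallelogram fails is_rectangle, and trust_me performs no check at all.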

7.2.3: Type Inference/Synthesis

In this section we emphasize type synthesis, i.e., determining the type of an expression given the type of its constituents. Next lecture, when we study ML, we will encounter type inference, where the type of the constituents is determined by the type of the expression.

Easy Cases

Sometimes it is easy to synthesize the type of the expression. For example,

with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;
procedure Ttada is
  subtype T1 is Integer range  0..20;
  subtype T2 is Integer range 10..20;
  A1 : T1;
  A2 : T2;
begin
  Get (A1); Get (A2);
  Put (A1+A2);
end Ttada;

Subranges

What would be the type of A1+A2 in the Ada program above? Two possibilities come to mind.

Ada chooses the first approach. Indeed, in Ada values are associated with types, not subtypes. It is variables that can be associated with subtypes. As a result in Ada, assigning to a subtype that has a range constraint may require a run-time check.

This is also true of types with range constraints, but our use of Ada will only add ranges to subtype definitions, not type definitions.
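The idea that values are plain integers while variables carry the range constraint can be imitated in C++. This is only an analogy, not how Ada is implemented; the template Ranged and the helpers sum and fits_in_T1 are hypothetical names.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for an Ada subtype "Integer range Lo..Hi".
template <int Lo, int Hi>
class Ranged {
public:
    Ranged(int v) : value_(checked(v)) {}      // run-time range check on every assignment
    operator int() const { return value_; }    // the value itself is a plain integer
private:
    static int checked(int v) {
        if (v < Lo || v > Hi) throw std::out_of_range("range check failed");
        return v;
    }
    int value_;
};

using T1 = Ranged<0, 20>;
using T2 = Ranged<10, 20>;

// The sum is an unconstrained int; no range applies until it is assigned.
int sum(T1 a1, T2 a2) {
    int x = a1;
    int y = a2;
    return x + y;
}

// Reports whether assigning v to a T1 variable would pass the range check.
bool fits_in_T1(int v) {
    try { T1 t = v; (void)t; return true; }
    catch (const std::out_of_range&) { return false; }
}
```

Here sum(T1(15), T2(12)) is 27, a perfectly good integer, yet storing it back into a T1 variable fails the run-time check.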

Composite Types

7.2.4: The ML Type System

Next lecture will be on ML, including the type system.

Homework: CYU 15, 16, 18

7.3: Records (Structures) and Variants (Unions)

Records are a common feature in modern programming languages; however the terminology differs among the languages. They are referred to as records in Pascal and Ada, just types in ML and Fortran, and structs (short for structures) in C. They are similar to methodless classes in Java, C#, C++.

In all cases a record is a set of typed fields. Some choices to be made in defining records for a specific language.

Ada
type atomic_element is record
  Name          : String (1..2);
  Atomic_Number : Integer;
  Atomic_Weight : Float;
end record;
C
struct atomic_element {
  char   name [2];
  int    atomic_number;
  double atomic_weight; };
ML
type atomic_element = {
  name          : string,
  atomic_number : int,
  atomic_weight : real};

7.3.1: Syntax and Operations

Above are specifications of the same record type in three languages, Ada, C, and ML. Note that in ML there is no , after real. Like Algol this ML punctuation is a separator not a terminator.

If you want to play with ML, the command name is sml, which is short for Standard ML of New Jersey. As noted above the order of the fields in a record is not significant in ML. It appears to me that sml alphabetizes the fields; that is, the example is reordered to atomic_weight, atomic_number, name.

Records in Scheme are somewhat more complicated; we won't use them.

7.3.2: Memory Layout and Its Impact

The issue here concerns alignment of fields within a record. Today essentially all computer architectures are byte addressable and hence, in principle, each field could start at the byte right after the last byte of the previous field. However, architectures normally have alignment requirements, something like "a datatype requiring n bytes to store must begin at an address that is a multiple of n". For example, a 4-byte quantity (e.g., an int in C) cannot occupy bytes 10,002-10,005.

In most cases the truth is that an int in C could occupy any 4 consecutive bytes, but accessing the int would be significantly faster if it was properly aligned so that its starting address is a multiple of 4. We will not be so subtle and will simply use the crude approximation given by the quote in the previous paragraph.

Note that this issue is not so severe with arrays as with records. Since arrays are homogeneous, if the first element is aligned and the elements are packed with no padding in between, then all elements will be aligned. With records, such as atomic_element, it is harder. Assume the record itself is aligned on a doubleword (8-byte) boundary. That means that the name field will begin on an address like 8800 that is evenly divisible by 8. However, the atomic_number field of the record will begin 2 bytes later and hence will not be properly aligned for the integer datatype (which we assume is 4 bytes in size).

Hence atomic_number is not properly aligned and cannot be accessed efficiently. As a result a two-byte pad is inserted between name and atomic_number. This wastes space. Assume that the float/real/double value atomic_weight needs 8-byte alignment. This alignment is satisfied since name + pad + atomic_number consume 8 bytes and the record itself is 8-byte aligned. Thus the total padding required is 2-bytes which is 1/8 of the total space of the padded record.

Now note that if the record was reordered so the fields were weight, number, name (i.e. decreasing size and required alignment), there would be no padding at all. ML changes the ordering visibly, but does not use size as the ordering criterion. Some languages permit the compiler to reorder. Since arbitrary reordering is impossible for systems programming, where the fields can correspond to specific hardware device registers whose relative addresses are given, systems programmers must choose languages in which such reorderings can be overridden.

One last point. If the padding is garbage (i.e., nothing specific is placed in the pad), then comparing a record requires comparing the fields separately since a single large comparison would require that the two garbage pads be equal. Thus another trade-off occurs. Padding with garbage makes creating a record faster, but padding with zeros (or any other constant) makes comparing two records faster.
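The layout discussion for atomic_element can be checked with offsetof. The sketch below assumes a typical platform (4-byte int with 4-byte alignment, 8-byte double with 8-byte alignment); the struct names and the two helper functions are invented for this example. Note that C and C++ compilers also add trailing padding so that a record's size is a multiple of its alignment, which the count in the text does not include.

```cpp
#include <cassert>
#include <cstddef>

struct AtomicElement {          // field order from the text
    char   name[2];
    int    atomic_number;
    double atomic_weight;
};

struct Reordered {              // decreasing size and required alignment
    double atomic_weight;
    int    atomic_number;
    char   name[2];
};

// Bytes of padding between the end of name and the start of atomic_number.
std::size_t pad_after_name() {
    return offsetof(AtomicElement, atomic_number) - sizeof(char[2]);
}

// Padding between fields of the reordered record (trailing padding not counted).
std::size_t internal_pad_reordered() {
    return offsetof(Reordered, name) - (sizeof(double) + sizeof(int));
}
```

Under the stated assumptions pad_after_name() gives the two-byte pad described above and internal_pad_reordered() gives zero, although sizeof(Reordered) is still rounded up to a multiple of 8.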

Homework: 8.

7.3.3: with Statements

7.3.4: Variant Records (Unions)

Definition: A variant record is a record in which the fields act as alternatives, only one of which is valid at any given time. Each of the fields is itself called a variant.

with Ada.Integer_Text_IO; use Ada.Integer_Text_IO;
procedure Ttada is
  type Point is record
    XCoord : Float;
    YCoord : Float;
  end record;
  type ColorType  is (Blue, Green, Red, Brown);
  type FigureKind is (Circle, Square, Line);
  type Figure(Kind : FigureKind := Square) is record
    Color   : Colortype;
    Visible : Boolean;
    case Kind is         -- Kind is the discriminant
      when Line   => Length      : Float;
                     Orientation : Float;
                     Start       : Point;
      when Square => LowerLeft   : Point;
                     UpperRight  : Point;
      when Circle => Radius      : Float;
                     Center      : Point;
    end case;
  end record;
  C1 : Figure(Circle);   -- Discriminant cannot change
  S1 : Figure;   -- Default discriminant, can change
  function Area (F : Figure) return Float is
  begin
    case F.Kind is
      when Circle => return 3.14*F.Radius**2;
      when Square => return F.Radius;       -- ERROR
      when Line   => return 0.0;
    end case;
  end Area;
begin
  S1 := (Square, Red, True, (1.5, 5.5), (1.2, 3.4));
  C1.Radius := 15.0;
  if S1.LowerLeft = C1.Center then null; end if;
  -- ERROR to write S1.Kind := Line;
  -- ERROR to write C1 := S1;
end Ttada;

Actually a variant can be a set of fields. For example the Line variant above has three fields. As is also shown in the example a record Figure can have several non-variant fields as well as several variant fields.

In some languages (e.g., Ada, Pascal) an explicit field, called the discriminant or tag, keeps track of which variant is currently valid. In the example above the discriminant is Kind. We will see soon that Scheme employs an implicit discriminant.

Definition: When the variant record contains an explicit or implicit discriminant, the record is called a discriminated union. When no such tag is present, the variant record is called a nondiscriminated union.

Safety

Clearly languages with nondiscriminated unions, e.g. C's union construct or Fortran's equivalence, give up type safety since it is the programmer who must ensure that the variant read is the last variant that was written. However, even languages like Pascal and Modula-2, which have discriminated unions, have type weakness in this area. To see the technical details read section 7.3.4 of the 3e (that section is one of those on the CD).

Ada is fairly type safe, but it is complicated. The Modula-3 designers did not like the type dangers in Modula-2 and the complexity of Ada and omitted variant records entirely.

Variants in Ada

The Ada example illustrates a number of features of the language; here we concentrate on just the variant record aspects. Figure is a variant record with Kind the discriminant. Note that the discriminant is given a default value; that is what permits us to define S1 without giving a discriminant as we did for C1. If Kind did not have a default then the declaration of S1 would be an error.

C1 will always be a circle; see the ERROR comment at the end. S1 can be any figure. Note, however, that you can only change the tag when you change the entire variant record; see the ERROR comment near the end.

Although this program compiles, it is incorrect. The Area function checks the Kind of its argument, but, if the Kind is Square, it references the Radius component. If Area were actually called with a Square, a run-time error would occur.
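The run-time discriminant check that Ada performs can be imitated by hand in C++, where a union by itself keeps no tag. This is a sketch only: the Line variant is left out for brevity, and CircleData, SquareData, make_circle, make_square, and radius_of are hypothetical names.

```cpp
#include <cassert>
#include <stdexcept>

enum class FigureKind { Circle, Square };

struct Point { float x, y; };
struct CircleData { float radius; Point center; };
struct SquareData { Point lower_left; Point upper_right; };

// Hand-written discriminated union: kind is the tag.
struct Figure {
    FigureKind kind;
    union {
        CircleData circle;
        SquareData square;
    };
};

Figure make_circle(float radius, Point center) {
    Figure f;
    f.kind = FigureKind::Circle;
    f.circle = CircleData{radius, center};
    return f;
}

Figure make_square(Point lower_left, Point upper_right) {
    Figure f;
    f.kind = FigureKind::Square;
    f.square = SquareData{lower_left, upper_right};
    return f;
}

// Checked access: refuse to read a variant that is not the current one.
float radius_of(const Figure& f) {
    if (f.kind != FigureKind::Circle)
        throw std::logic_error("discriminant check failed: not a Circle");
    return f.circle.radius;
}

float area(const Figure& f) {
    switch (f.kind) {
        case FigureKind::Circle: return 3.14f * radius_of(f) * radius_of(f);
        case FigureKind::Square: return radius_of(f);   // same mistake as in Area: throws
    }
    return 0.0f;
}
```

As in the Ada program, the mistaken Square arm compiles and fails only when area is actually called with a square. Without the check in radius_of, the read would silently reinterpret the square's bytes.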

Finally, null is an Ada statement that does nothing. I used it here to make the if statement legal (although useless). In reality there would be other code in the then arm.

(define X 35)
(+ X 5)
(define X '(a 4))
(+ X 5)

Discriminated Unions with Dynamic Typing

Recall that Scheme has strong, but dynamic typing. That means that the variables are not typed but the values are. If you present the example above to a Scheme interpreter, you will get an error for the last line. X was first an integer; then a list. Although there is no explicit tag, the system keeps one implicitly and thus can tell that the last line is a type error since you can't add an integer to a list.

The Object-Oriented Alternative

Instead of the type-unsafe unions of C, a Java programmer could have a base class of the non-varying components and then extend it to several classes, one for each arm of the union.
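The same design can be sketched in C++, which has classes as well as unions. The class names and fields below are illustrative assumptions loosely modeled on the Figure example, not a translation of it.

```cpp
#include <cassert>

// Base class: the non-varying components.
struct Figure {
    bool visible = true;
    virtual ~Figure() {}
    virtual double area() const = 0;
};

// One derived class per arm of the union.
struct Circle : Figure {
    double radius;
    explicit Circle(double r) : radius(r) {}
    double area() const override { return 3.14 * radius * radius; }
};

struct Square : Figure {
    double side;
    explicit Square(double s) : side(s) {}
    double area() const override { return side * side; }
};

// Dynamic dispatch selects the right area(); no tag is inspected by hand.
double total_area(const Figure& a, const Figure& b) {
    return a.area() + b.area();
}
```

Since a Square object has no radius member, the mistake of reading the radius of a square is rejected at compile time rather than detected at run time.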

7.4: Arrays

Definition: An array is a mapping from an index type to an element type (or component type).

Arrays are found in essentially all languages. Most languages require the index type to be an integral type. Pascal, Ada, and Haskell allow the index type to be any discrete type.

  Ada:    (82, 21, 5)
  Scheme: #(82 21 5)

First-Classness: In C/C++ functions cannot return arrays and there are no array literals (but there are array initializers). Scheme and Ada have array literals as shown above.

7.4.1: Syntax and Operations

The syntax of array usage varies little across languages. The one distinction is whether to use [] or () for the subscript. Most common is [], which permits array usage to stand out in the program. Fortran uses () since Fortran predates the widespread availability of [] (the 026 keypunch didn't have []). Ada uses () since the designers wanted to emphasize the close relation between subscripted references and function evaluations. A(I) can be thought of as the Ith component of the array A or the value of the function A at argument I. Since an array is a function from the index type, there is merit to this argument.

  C
  float x[50];
  Ada
  type Idx is new Integer range 0..49;
  X : array (Idx) of Float;
  Y : array (Integer range 0..49) of Float;
  Z : array (0..49) of Float;
  type Vec is array (0..9) of Float;
  Mat1 : array (0..9) of Vec;
  Mat1(3)(4) := 5.0;
  Mat2 : array (0..9, 0..9) of Float;
  Mat2(3,4) := 5.0;

Declarations

An array declaration needs to specify the index type and the element type. For a language like C, the index type is always the integers from zero up to some bound. Thus all that is needed is to specify the component type and the upper bound as shown above.

With the increased flexibility of Ada arrays comes the requirement to specify additional information, namely the full index type. The first two lines give us 50 Floats, just like the C example, but the index type is Idx so you can't write X(I) where I is an integer variable. You can write Y(I) and Z(I), which are of the same type (the declaration of Z is an abbreviation of the declaration of Y).

Multi-Dimensional Arrays

Essentially all languages support multidimensional arrays; the question is whether a two-dimensional (2D) array is just a 1D array of 1D arrays. Ada offers both possibilities as shown above. The 2D array Mat2 has elements referenced as Mat2(2,3); whereas the 1D array of 1D arrays Mat1 has elements referenced as Mat1(2)(3). The 2D notation is less clumsy, but slices (see below) are possible only with Mat1. C arrays are like Mat1: they are 1D arrays of 1D arrays. However C does not support slices.

Slices and Array Operations

A slice or section is a rectangular section of an array. Fortran 90 and many scripting languages provide extensive facilities for slicing. Ada provides only 1D support. Z(11..44) is a slice of floats and Mat1(2..6) is a slice of Vecs. Note that this second example is still a 1D slice. You cannot get a general 2D rectangular subset of the 100 floats in Mat1.

7.4.2: Dimensions, Bounds, and Allocation

Definition: The shape of an array consists of the number and bounds of each dimension.

Definition: A dope vector contains the shape information of the array. It is needed during compile time (not our concern) and for the most flexible arrays is needed during run-time as well.

The binding time of the shape as well as the lifetime of the array determines the storage policy used for the array. The trade-off is between flexibility and run-time efficiency.

7.4.3: Memory Layout

The layout for 1D arrays is clear: Align the first element and store all the succeeding elements contiguously (they are then guaranteed to be aligned).

We will discuss 2D arrays; higher dimensional arrays are treated analogously. Assume we have an array A with 6 rows (numbered 0..5) and 3 columns. Clearly A[0][0] will be stored first (and aligned). The big question is what is stored next.

The reason the layout is important is that if an array is traversed in an order different from how it is laid out, the execution time increases, perhaps dramatically (cache misses, page faults, stride prediction errors).
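C and C++ store 2D arrays in row-major order, so A[0][1] comes right after A[0][0]; Fortran uses the opposite, column-major, order. The following sketch checks the row-major address formula for the 6x3 example; ROWS, COLS, and the two functions are names chosen for this illustration.

```cpp
#include <cassert>
#include <cstddef>

const int ROWS = 6;
const int COLS = 3;

// Actual byte offset of A[i][j] from the start of the array.
std::ptrdiff_t byte_offset(const int (&A)[ROWS][COLS], int i, int j) {
    const char* base = reinterpret_cast<const char*>(&A[0][0]);
    const char* elem = reinterpret_cast<const char*>(&A[i][j]);
    return elem - base;
}

// Row-major formula: whole rows are stored one after another,
// so element (i,j) is preceded by i full rows and j elements.
std::ptrdiff_t row_major_offset(int i, int j) {
    return static_cast<std::ptrdiff_t>((i * COLS + j) * sizeof(int));
}
```

A loop that varies j fastest therefore walks memory sequentially, while a loop that varies i fastest jumps COLS elements at each step, which is the traversal-order cost mentioned above.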

Alignment and Memory Usage

Alignment is not a major issue for most arrays; align the first element and store the rest contiguously. The exception is arrays of records. Imagine a record consisting of a 1-byte character and an 8-byte real. The total size of the fields is 9-bytes. If the byte is stored first and the record itself is 8-byte aligned (as it would be in most languages) then 7 bytes of padding are needed before the real. Thus the record size would be 16 bytes and an array of 100 such records would require 1600 bytes, of which only 900 bytes are data.

If the real is stored first, the character is aligned without padding and the record takes 9 bytes, but the next record in the array needs 7 bytes of padding so a 100-element array of these records takes 1593 bytes (there is no padding needed after the last record).

If, instead of an array of records, the data structure was organized as a record of arrays (an array of characters followed by an array of reals), the memory efficiency is better. If the reals are stored first, no padding is needed so the space required is 900 bytes. If the characters are stored first, the array of reals will need a 4-byte pad (since 100/8 leaves a remainder of 4) and the total space required is 904 bytes.
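The array-of-records versus record-of-arrays comparison can be measured with sizeof. The sketch assumes a typical platform where a double is 8 bytes and 8-byte aligned; all struct and function names are invented here. Because C and C++ round every record's size up to a multiple of its alignment, sizeof counts trailing padding that the figures 1593 and 900 above leave out.

```cpp
#include <cassert>
#include <cstddef>

struct CharFirst { char c; double d; };   // padding between c and d
struct RealFirst { double d; char c; };   // padding after c (trailing)

struct RecordOfArrays {                   // reals first, then characters
    double d[100];
    char   c[100];
};

std::size_t char_first_record_bytes() { return sizeof(CharFirst); }
std::size_t real_first_record_bytes() { return sizeof(RealFirst); }
std::size_t array_of_records_bytes()  { return sizeof(CharFirst[100]); }
std::size_t record_of_arrays_bytes()  { return sizeof(RecordOfArrays); }
```

Under the stated assumptions both record orders occupy 16 bytes per element, the array of records occupies 1600 bytes, and the record of arrays occupies 904 bytes (900 bytes of data plus 4 bytes of trailing padding).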

Homework: Consider the following C definitions.

    struct s1 {
      char   c1;
      double d1;
      char   c2;
      double d2;
    } A[100];
    struct s2 {
      char   c3[100];
      double d3[100];
      char   c4[100];
      double d4[100];
    } B;
  
Assume char requires 1 byte and double requires 8 bytes and that a datatype must be aligned on a multiple of its size.
  1. How much space is occupied by the actual data (not counting padding) in all of A? In all of B?
  2. How large are A and B including the padding?
  3. Reorder the fields of A and B to minimize padding. How much space did you save?

7.5: Strings

7.6: Sets

7.7: Pointers and Recursive Types

An important distinction that must be made before looking at pointers and recursive types is the value model of variables vs the reference model.

In the value model, used by C, Pascal, and Ada, a variable such as X denotes its value, say 12. In the reference model, used by Scheme and ML, X denotes a reference to 12 (in a coarse sense X points to 12). You might think that with the reference model, expressions would be difficult, but that is not so. When you write X+1 the reference X is automatically dereferenced to the value 12.

Java uses the value model for built-in scalar types and uses the reference model for user-defined types.

7.7.1: Syntax and Operations

Turning to recursive types, consider two very common linked data structures: linked lists and trees. Take a trivial example, where a tree node is one character and two pointers (to the left and right subtrees; we are considering only binary trees). So a node is a record with three fields: a character and two references to child nodes.

Reference Model

In ML we would define a tree node tnode as

    datatype tnode = empty | node of char * tnode * tnode;
  
The | indicates an alternative. The empty is used for a leaf (we are assuming the data of the tree is the characters stored in the non-leaves). The interesting part is after the |. A tnode is a tuple consisting of a character and two tnodes. But that is not possible! It is really a character and two references to tnodes. But ML uses the reference model so writing tnode means a reference to tnode.

Ada
procedure Ttada is
  type Tnode;        -- an incomplete type
  type TnodePtr is access Tnode;
  type Tnode is record
    C : Character;
    L : TnodePtr;
    R : TnodePtr;
  end record;
  T : TnodePtr := new Tnode'('A',null,null);
begin
  T.L := new Tnode'('L',null,null);
  T.R := new Tnode'('R',null,null);
  T.all.R.all.C := 'N';
  T.R.C := 'N';
end Ttada;
C++
struct tnode {
  char  c;
  tnode *left;
  tnode *right;
};
struct tnode *t;

Value Model

With the value model, we need something to get a reference to a tnode. The Ada code is above. The first line declares (but doesn't define) the incomplete type Tnode. The next line defines the type reference to Tnode. The key is that an access definition does not need the entire definition of the target. Then we can define Tnode and build a simple tree with a root and two leaves for children. A tree is just a pointer to the root.

The C++ structure definition is shorter since you can mention the type being defined within the definition. However, when two different types contain references to each other, the C++ code uses the same idea of incomplete declaration followed by complete definition that Ada does.

Pointers and Dereferencing: Looking again at the Ada example, T is a reference to a tree-node. That is nice, but we also want to refer to the tree-node itself and to its components. That is we want to dereference T. For this we use .all and write T.all, which is a record containing three fields.

As with any record, we write .L to refer to the field named L and so T.all.L is the left component of the root. Similarly T.all.R is the right component and T.all.R.all.C:='N' changes the character field of the right child to 'N'. The construct .all.fieldName occurs often and can be abbreviated to .fieldName as shown.

In C/C++ the dereferencing is done by *. Thus t points to a tree node, *t is the tree node, and (*t).c is the character component. You can abbreviate (*t).c as t->c; that is, -> combines the dereference and the field selection.
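For comparison with the Ada program, here is a C++ sketch that builds the same three-node tree using the tnode structure, showing both the long and the abbreviated dereferencing forms. The functions make_tree and free_tree are names introduced for the illustration.

```cpp
#include <cassert>

struct tnode {
    char   c;
    tnode *left;
    tnode *right;
};

// Build the tree from the Ada example: root 'A' with children 'L' and 'R'.
tnode *make_tree() {
    tnode *t = new tnode{'A', nullptr, nullptr};
    t->left    = new tnode{'L', nullptr, nullptr};
    (*t).right = new tnode{'R', nullptr, nullptr};  // same as t->right = ...
    (*(*t).right).c = 'N';   // long form, like T.all.R.all.C := 'N'
    t->right->c = 'N';       // abbreviated form, like T.R.C := 'N'
    return t;
}

// C++ has no garbage collector, so the nodes must be released explicitly.
void free_tree(tnode *t) {
    if (t == nullptr) return;
    free_tree(t->left);
    free_tree(t->right);
    delete t;
}
```

The two assignments to the right child's character are equivalent, just as the two Ada assignments are.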

  void f (int *p) {...}
  int a[10];
  f(a); f(&a[0]); f(&(a[0]));
  int *p = new int[4];
  ... p[0]   ...
  ... *p     ...
  ... p[1]   ...
  ... *(p+1) ...
  ... p[10]  ...
  ... *(p+10) ...

Pointers and Arrays in C

In C/C++ pointers and arrays are closely related. Specifically, using an array name is often the same as using a pointer to the first element of the array. For example, the three function calls above are all the same. Similarly, the references involving p are the same for both members of each pair. Also note that the last pair references past the end of the array; these are undetected errors.

There are, however, places where pointers and arrays are different in C, specifically in definitions.
Consider   int x[100], *y;   Although both x and y point to integers, the first definition allocates 100 (probably 4-byte) integers, while the second allocates one (probably 4-byte or 8-byte) pointer.
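Both the similarity and the difference can be observed directly. In this sketch the function names are made up for the example; the out-of-bounds references from the listing are deliberately not reproduced since their behavior is undefined.

```cpp
#include <cassert>
#include <cstddef>

int first(int *p) { return *p; }   // takes a pointer to the first element

bool demo_equivalence() {
    int a[10] = {7, 8, 9};
    // The array name decays to a pointer to a[0] in all three calls.
    bool same = (first(a) == first(&a[0])) && (first(&a[0]) == first(&(a[0])));
    int *p = a;
    // Subscripting is defined as pointer arithmetic followed by a dereference.
    same = same && (p[0] == *p) && (p[1] == *(p + 1));
    return same;
}

// In definitions the two differ: x is storage for 100 ints, y is one pointer.
std::size_t array_bytes()   { int x[100]; return sizeof(x); }
std::size_t pointer_bytes() { int *y = nullptr; return sizeof(y); }
```

On a machine with 4-byte ints, array_bytes() reports 400 while pointer_bytes() reports only the size of a single pointer.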

Dangers with Pointers

It is quite easy to get into trouble with pointers if low-level manipulation is permitted. For example.

7.8: Lists (and Sets and Maps)

First we need to define these three related concepts.

Definition: A list is an ordered collection of elements.

Definition: A set is a collection of elements without duplicates and with fast searching.

Definition: A map is a collection of (key,value) pairs with fast key lookup.

Normally, only very high level languages have these built in. They are often in standard libraries and can be programmed in essentially all languages. Specifically,

7.A: Procedures and Types

Procedure Types

Assigning types to procedures is needed if procedures can be passed as arguments or returned by functions. A statically typed language like Ada does need the signature of procedures so that it can type check calls. However, it does not need to assign a type to a procedure itself since no procedure can be passed as an argument. Thus you can't say in Ada something like
type proc_type is procedure (x : integer);

Varargs

As noted above statically typed languages like Ada require procedures to be declared with full signatures. C supports procedures with varying numbers of arguments, which is a loophole in the C type system. Java supports a limited version of varargs that is type safe. Specifically all the optional arguments must be of the same (declared) type.
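The loophole and a type-safe alternative can be put side by side in C++. The function sum_varargs uses the C mechanism from <cstdarg>; sum_checked uses std::initializer_list, which is only similar in spirit to Java's restricted varargs. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdarg>
#include <initializer_list>

// C-style varargs: the callee must trust count and the argument types.
int sum_varargs(int count, ...) {
    va_list args;
    va_start(args, count);
    int total = 0;
    for (int i = 0; i < count; ++i)
        total += va_arg(args, int);   // nothing checks that an int was really passed
    va_end(args);
    return total;
}

// Type-safe alternative: all optional arguments share one declared type,
// so the compiler can check every one of them.
int sum_checked(std::initializer_list<int> values) {
    int total = 0;
    for (int v : values) total += v;
    return total;
}
```

A call such as sum_varargs(3, 1, 2, 3) works, but sum_varargs(3, 1.5, 2, 3) also compiles and then misbehaves at run time, whereas the compiler rejects sum_checked({1.5, 2, 3}) because 1.5 would have to be narrowed to an int.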

7.9: Files and Input/Output

7.10: Equality Testing and Assignment