Programming Languages

G22.2110 Spring 1998

 

Class #4

Conventional Imperative Programming Languages - Part II

 

Homework #4: Sethi Exercises 4.9, 4.10, 5.3, 5.7, 5.10

Read Sethi’s Chapters 4, 5, 15.1, 15.2, handouts and references

 

Contents

 

C4.0. Current Assignments, PL Project

C4.1. Statement-Level Control Structures (quick review of material covered in class #2)

C4.2. Data Types

C4.3. Names, Bindings, Type Checking, and Scopes

C4.4. Procedures and their implementation

C4.5. Comments on Sethi's Chapters 4, 5, 15.1, 15.2

 

C4.0. Current Assignments, PL Project

 

All homeworks including #3 should have been turned in - solutions to HW#1 & 2 on the web.

 

PL Project - Part I, due date 3/2/98

 

Office Hours: as posted, appointments after business hours also possible with Instructor and TAs

 

C4.2. Data Types

 

Introduction


In early languages, all problem space data structures had to be modeled with a few basic language-supported data structures (e.g., pre-90 FORTRAN where linked lists, nonlinked lists, and binary trees were modeled with arrays).

 

COBOL was first to allow the specification of the accuracy of decimal data values, and to provide a structured data type for records of information.

 

PL/I extended the concept of accuracy specification to integer and floating-point types.

 

ALGOL 68 introduced an important advance in the evolution of data type design:

 

Primitive Data Types

 

They correspond to data types which are not defined in terms of other types.

 

Primitive data types are used, along with one or more type constructors, to provide the structured types.

 

Some of the primitive types are merely reflections of the hardware.

 

Numeric Types

Integer

 

Most computers support several sizes of integers (e.g., VAX-11 has byte, word (2 bytes), longword (4 bytes), and quadword (8 bytes).

 

These capabilities are reflected in some programming languages (e.g., Ada allows SHORT INTEGER, INTEGER, and LONG INTEGER, C includes unsigned integers types, etc.).

 

Integer value represented in a computer by a string of bits, one of which (usually the leftmost) represents the sign.

 

Integer types are supported directly by the hardware.

 

Storage of negative integer:

Floating-Point

 

Model real numbers.

 

Representations are only approximations for most real values (e.g., p cannot be correctly represented in floating point notation).

 

On most computers, floating-point numbers are stored in binary, which makes the problem worse (e.g., even 0.1 in decimal cannot be represented by a finite number of binary digits).

 

Another problem with floating-point types is the loss of accuracy through arithmetic operations.

 

Floating-point values are represented as fractions, and exponents, with different implementations choosing different formats, usually modeling the hardware representations of the computer system.

 

Languages that are designed to support scientific programming usually include two floating-point types, often called real (standard single memory word size) and double-precision (provided for situations where larger fractional parts are needed; such variables occupy twice as much storage as real variables and provide at least twice the number of bits of accuracy).

 

Floating-point types have range values that are defined in terms of precision and range:

 

Floating-point values have been stored in a variety of different formats which all have the same basic components:

 

IEEE Floating-Point Standard 754 format:

 

Smaller computers that do not have hardware for floating-point operations emulate them in software (10 to 100 times slower).

 

Decimal

 

Decimal data types store a fixed number of decimal digits, with the decimal point at a fixed position in the value.

 

Primary data types for business data processing (e.g., essential to COBOL).

Advantages:

 

Disadvantages:

 

Decimal types storage:

 

Operations on decimal values are done in hardware on machines with such capabilities, or they are supported in software.

 

Boolean Types

 

Simplest of all types.

 

Range of values only has two elements, one for true, and one for false.

 

Introduced in ALGOL 60.

 

Two popular exceptions are C and C++:

- numeric type variables and constants can be referenced in expressions used as conditionals as if they were Boolean (operands with nonzero values are considered true, and zero is considered false).

 

Boolean types are often used to represent switches or flags in programs.

 

Boolean values could be represented by a single bit but, for practical reasons, they are often stored in the smallest efficiently addressable cell of memory, i.e. a byte.

 

Character Types

 

Stored in computers as numeric codings.

 

Most commonly used coding is ASCII, which uses the values 0..127 to code 128 different characters.

 

Many programming languages include a primitive for single characters to provide the means of processing codings of single characters.

 

Character String Types

 

Values consist of sequences of characters.

 

Design Issues

 

Should strings be a primitive type or simply a special kind of character array ?

Should strings have static or dynamic length ?

 

Strings and Their Operations

 

Languages in which strings are not defined as primitive type

 

Substring reference in Ada:

 

Character string catenation in Ada:

 

Strings in C and C++:

 

Languages in which strings are defined as primitive types:

 

Assigments and comparison operations on character strings are complicated by the possibility of assigning and comparing operands of different lengths.

 

Pattern matching:

 

LETTER = 'abcdefghijklmnopqrstuvwxyz'

WORDPAT = BREAK(LETTER) SPAN(LETTER) . WORD

 

LETTER is a variable with the value of a string of all lowercase letters.

WORDPAT is a pattern that describes words by first skipping until a letter is found, then

span letters until a nonletter is found. The pattern also includes a "." operator which

specifies that the string that matches the pattern is to be assigned to the variable WORD.

 

TEXT WORDPAT (statement which attempts to find a string of letters in the string value

of the variable TEXT.

 

String Length Options

 

Length can be static and specified in the declaration. (static length string). Such is the choice in FORTRAN 77, FORTRAN 90, COBOL, Pascal, and Ada.

 

e.g., FORTRAN 90

CHARACTER(LEN = 15) NAME1, NAME2

 

Strings can have varying length up to a declared and fixed maximum set by the variable's definition (limited dynamic length). Such is the choice in C and C++.

 

Strings can have varying length with no maximum (dynamic length strings). Such is the choice in SNOBOL4). This flexibility requires the overhead of dynamic storage allocation and deallocation.

 

Evaluation

 

Addition of strings as a primitive type to a language is not costly, in terms of either language or compiler complexity.

 

Difficult to justify the omission of primitive string types in some contemporary languages.

 

Implementation of Character String Types

 

Sometimes supported directly in hardware or software is used to implement string storage, retrieval, and manipulation.

 

Static character string types:

 

Limited dynamic strings:

 

Dynamic length strings:

 

 

User-Defined Ordinal Types

 

An ordinal type is a type in which the range of possible values can be easily associated with the set of positive integers (e.g., integer, char, and Boolean primitive types in Pascal and Ada).

 

Enumeration Types:

 

All of the possible values, which are symbolic constants (could also be character literals in Ada), are enumerated in the definition (e.g., type DAYS is (Mon, Tue, Wed, Thu, Fri, Sat, Sun);).

 

Is a literal constant allowed to appear in more than one type definition ? If so, how is the type of an occurrence of that literal in the program checked ?

 

Design:

 

Pascal:

 

 

e.g., type colortype = (red, blue, green, yellow)

var color: colortype;

…

color := blue;

if (color > red) …

 

boolean expression above will evaluate to true.

 

ANSI C / C++:

 

 

Ada:

 

 

e.g., type LETTERS is (‘A’, ‘B’, ‘C’, ‘D’, ‘E’, ‘F’, ‘G’, ‘H’, ‘I’, …);

type VOWELS is (‘A’, ‘E’, ‘I’);

 

for LETTER in ‘A’ .. ‘I’ loop ???

should be

for LETTER in VOWELS’ (‘A’) .. VOWELS’ (‘I’) loop

 

 

Evaluation:

 

 

Subrange Types:

 

Contiguous sequence of an ordinal type (e.g., 12..14 is a subrange of an integer type).

 

Supported in Pascal, Modula-2, and Ada.

 

No particular design issues.

 

Designs:

 

Connection of a subrange type to its parent type is established by matching the values in the subrange definition to those in previously declared or built-in ordinal types (e.g., index = 1 .. 100, index is defined to be a subrange of the integers).

 

In Ada, subranges are included in the class of types called subtypes (new names for possibly restricted, or constrained versions of existing types). All operations defined for the parent type are also defined for the subtype.

 

e.g., subtype WEEKDAYS is DAYS range Mon .. Fri;

 

Subranges of ordinal types are commonly used (e.g., indexes of arrays, loop variables, etc.).

 

Evaluation:

 

Enhanced readability (variables of subtypes can store only certain range of values).

 

Increased reliability (range errors are detected by the runtime system).

 

Implementation of User-Defined Ordinal Types:

 

Enumeration types are usually implemented by associating a non negative integer value with each symbolic constant in the type (i.e., 0 for first enumeration value, 1 for second, etc.).

 

Subrange types are implemented in the same way as their parent types, except that range checks must be included in every assignment at the expense of increased code size and execution time (well worth it usually, and optimizing compiler may optimize the checking).

 

Array Types

 

Homogeneous aggregate of data elements in which the individual element is identified by its position in the aggregate, relative to the first element.

 

Individual elements are of some previously defined type, either primitive or not.

 

Design Issues:

 

What types are legal for subscripts ?

When are subscript ranges bound ?

When does array allocation take place ?

How many subscripts are allowed ?

Can arrays be initialized when they have their storage allocated ?

What kind of slices are allowed, if any ?

 

Arrays and Indexes:

 

References are obtained by means of a two-level syntactic mechanism:

 

If all the indexes in a reference are constant, the selector is static, otherwise it is dynamic.

 

Arrays are sometimes called "finite mappings"

i.e., array_name[index_value_list] -> element

 

Brackets are used in Pascal, C, C++,and Modula-2 to delimit array indices v.s. parenthesis in order to avoid confusing array subscripts with subprogram parameters.

 

The designers of Ada chose to use parenthesis to ensure uniformity between array references and function calls in expressions (the ambiguities are easily handled in Ada, as the compiler has access to information from previously compiled programs about all names that can be referenced in a program unit that is being compiled). Pre-90 FORTRAN and PL/I designers chose parenthesis as no other suitable characters were available at the time, and the compiler had to resolve the ambiguities.

 

Two distinct types involved in an array type: the element type and the subscripts type. Boolean, character, and enumeration types are allowed as subscripts types in languages such as Pascal, Modula-2, and Ada.

 

Subscript Bindings and Array Categories:

 

In some languages the lower bound of the subscript range is implicit (e.g., fixed to 0 in C, fixed to 1 in pre 77 FORTRAN, defaults to 1 in FORTRAN 77 & 90).

 

The binding of the subscript type to an array is usually static, but the subscript value ranges are sometimes dynamically bound.

 

Four categories of arrays based on the binding to subscript value ranges, and the binding to storage:

 

e.g., FORTRAN 77 arrays.

 

 

e.g., arrays declared in Pascal procedures and C functions (without the static specifier).

 

 

e.g., Ada arrays as in the following

 

GET (LIST_LEN);

declare

LIST : array (1 .. LIST_LEN) of INTEGER;

begin

…

end;

 

 

e.g., FORTRAN 90

INTEGER, ALLOCATABLE, ARRAY (: , :) : : MAT

ALLOCATE (MAT(10, NUMBER_OF_COLS))

DEALLOCATE (MAT)

 

e.g., C / C++

malloc, free heap allocation/deallocation operations can be used for C/C++ arrays

operators new and delete are used to manage heap storage

No index range checking in C/C++, therefore the length of an array is of no

interest to the runtime system, so array sizes are easily changed.

 

e.g., Pascal conformant arrays

procedure sumlist ( var sum : integer;

list : array [lower .. upper : integer] of integer);

var index : integer;

begin

…

end;

 

var scores : array [1 .. 100] of integer;

…

sumlist (sum, scores)

 

The Number of Subscripts in Arrays:

 

Limited to three in FORTRAN I (efficiency reasons), and up to seven after FORTRAN IV.

 

No justification for FORTRAN’s limitations.

 

C supports multidimensional arrays, or arrays that can have arrays as elements.

e.g., int mat [5][4];

 

Array Initialization:

 

Some languages provide the means to initialize arrays at the time their storage is allocated.

 

e.g., FORTRAN &&

INTEGER LIST (3)

DATA LIST /0, 5, 5/

 

e.g., C / C++

int list [ ] = {4, 5, 7, 83}

here the compiler sets the length of the array.

 

char name [ ] = "freddie"

here name is a char array

 

char *names [ ] = {"Jo", "Bob", "Jake", "Darcie"};

here names is an array of pointers to characters (name[1] is a pointer to the letter ‘B’ in

the literal character array that contains the characters ‘B’, ‘o’, ‘b’, and null.

 

Pascal and Modula-2 do not allow array initialization in the declaration sections of programs.

 

Ada provides two mechanisms for initializing arrays in the declaration statement:

e.g.,

LIST : array (1 .. 5) of INTEGER := (1, 3, 5, 7, 9);

Here initialization values are listed in the order in which they are to be stored.

 

BUNCH : array (1 .. 5) of INTEGER := (1 => 3, 3 => 4, others => 0);

Here assignment is performed according to an index position using the => operator.

These collections of values, delimited by parenthesis, are called aggregate values.

 

Array Operations:

 

Operation that operates on an array as a unit.

 

FORTRAN 77 provides no array operations.

 

Ada allows array assignment, catenation, and equality/inequality relational operations.

 

FORTRAN 90 includes "elemental" array operations (e.g., add operator between two arrays resulting in an array of the sums of the element pairs between the two arrays). It also includes intrinsic, or library, functions for matrix multiplication, matrix transpose, and vector dot product.

 

APL is the most powerful array-processing languages ever devised. Its arithmetic operations are defined for vectors, matrices, as well as scalar operands (e.g., "+"). APL includes a collection of unary operators for vectors and matrixes (e.g., vector elements reversal, matrix columns reversal, matrix rows reversal, matrix transposition, matrix inversion). Finally, APL includes several special operators that take other operators as operands (e.g., A +.x B is the sum of the inner products of A and B if A and B are vectors, or the matrix multiplication of A and B if A and B are matrices).

 

Slices:

 

A slice of an array is some substructure of that array (e.g., last row and first column of a matrix).

 

A slice is not a new type but rather a mechanism for referencing part of an array as a unit.

 

Design issue: syntax for specifying a reference to a particular slice ?

 

e.g., FORTRAN 90

INTEGER VECTOR(1:10), MAT(1:3, 1:3), CUBE(1:3, 1:3, 1:4)

 

slices:

VECTOR(3:6) four element array with the third through sixth elements of VECTOR

MAT(1:3, 2) (second column of MAT)

CUBE(1:3, 1:3, 2)

 

Slices can appear as the destinations of assignment statements.

e.g., a single-dimensioned array could be assigned to a slice of a matrix

 

Slices can have nonregular arrangements of elements of an existing array.

e.g., VECTOR ((/3, 2, 1, 8/) array of the third, second, first, and eigth elt of VECTOR.

 

Ada only allows highly restricted slices: those consisting of consecutive elements of a single-dimensioned array.

e.g., LIST (1 .. 100)

then LIST (5 .. 10) is a slice of LIST.

 

Note that a slice of a STRING type is called a substring reference.

 

Evaluation:

 

Arrays are simple and have been well developed.

 

Only significant advances since FORTRAN I have been the inclusion of all ordinal types as possible subscript types, and dynamic arrays.

 

Being able to specify a particular substructure of a multidimensional array is not worth the increased difficulty of implementation and readability.

 

Implementation of Array Types:

 

The code to allow accessing of array elements must be generated at compile time.

 

This code is executed at run time to produce element addresses.

 

Access function for an array of the form list[k]:

address(list[k]) = address(list[lower_bound]) + (k-lower_bound)*element_size

or

address(list[k]) = (address(list[lower_bound]) - (lower_bound*element_size))

+ k * element_size.

If the element type is statically bound, and the array is statically bound to storage, the value of the constant part can be computed before run time while the computation of k * element_size and the addition of the result need to be done at run time.

If the base, or beginning address, of the array is not known until run time, the substraction must be done when the array is allocated.

 

Compile-time descriptor for single-dimensioned arrays can have the form:

(array, element type, index type, index lower bound, index upper bound, address)

The information in the descriptor is that required to construct the access function.

If run-time checking of index ranges is not done and the attributes are all static, then only the access function (not the descriptor) is needed.

If run-time checking of index ranges is done, those index ranges may need to be stored in a run-time descriptor.

If the subscript ranges of a particular array type are static, the ranges may be incorporated into the code that does the checking to eliminate the need for the run-time descriptor.

If any of the descriptor entries are dynamically bound, those parts of the descriptor must be maintained at run-time.

 

Multidimensional arrays:

More complex to implement than single dimensioned arrays.

Require mapping onto the single dimensioned hardware memory.

Mapping can be either a row major (e.g., most languages but FORTRAN) or a column major order (e.g., FORTRAN) mapping.

It is sometimes essential to know the storage order of multidimensional arrays (e.g., when arrays processed using pointers in C are EQUIVALENCEd to another array with a different shape in a FORTRAN program). It is important in all languages when execution speed is a major concern, and the computer uses virtual memory as sequential access to matrix elements is faster if they are accessed in the order in which they are stored in order to minimize paging.

 

Access function of a two-dimensional array stored in row major:

Mapping of the array’s base address and a set of index values to the address in memory of the element specified by the index values.

Location (a[i, j]) = (address of a[row_lb, col_lb])+

(((i - row_lb) * n) + (j - col_lb)) * element_size

where n is the number of elements per row, and row_lb, col_lb are respectively the row and column lower bounds.

Or

Location(a[i, j]) = (address of a[row_lb, col_lb]) -

((row_lb*n) + (element_size*col_lb)) +

(((i*n) + j) * element_size

The first two terms are the constant part, the last part is the variable part.

This can be generalized relatively easily to an arbitrary number of dimensions (for each dimension of an array, one add and one multiply instruction is required for the access function).

Accesses to elements of arrays with several subscripts are costly.

 

Compile-tile descriptor for a multi-dimensional array:

(multidimensioned array, element type, index type, number of dimensions, index range 1, …, index range n, address)

 

Slices add another layer of complexity to storage mapping functions:

e.g., INTEGER MAT (1:10, 1:5), LIST (1:10)

…

LIST = MAT (1:3, 3) (column 3 of the matrix is assigned to the array)

storage mapping function for MAT, assuming row major order and an element size of 1 is

location(MAT[i, j]) = (address of MAT[1,1]) + ((i-1)*5 + (j-1))*1

= ((address of MAT[1,1]) - 6) + ((5 * i) + j)

storage mapping function for the slice reference MAT[1:3, 3] is

location(MAT[i, 3]) = (address of MAT[1,1]) + ((i-1)*5 + (3-1))*1

= ((address of MAT[1,1]) - 3) + (5 * i)

This mapping has the same form as any other one-dimensional array access function, although the constant part is different due to the fact that the basic array is two-dimensional (elements of MAT to be assigned to LIST are found by letting I take on the values in the subscript range of the first dimension of MAT).

 

Record Types

 

Possibly heterogeneous aggregate of data elements in which the individual elements are identified by names.

 

Part of most popular languages, except pre-90 versions of FORTRAN, since the early 1960s when they were introduced by COBOL.

 

The Structure of Records

 

Fields are named with identifiers, and references are made using these identifiers (due to the heterogeneity of record elements).

 

Records are also allowed to include unions.

 

COBOL uses level numbers to indicate records

e.g.,

01 EMPLOYEE-RECORD

02 EMPLOYEE-NAME

05 FIRST PICTURE IS X(20).

05 MIDDLE PICTURE IS X(10).

05 LAST PICTURE IS X(20).

02 HOURLY-RATE PICTURE IS 99V99

 

Psacal, Modula-2, and Ada use nesting instead of level numbers

e.g.,

EMPLOYEE_RECORD :

record

EMPLOYEE_NAME :

record_

FIRST : STRING (1..20);

MIDDLE : STRING (1..20);

LAST : STRING(1..20);

end record;

HOURLY_RATE : FLOAT;

end record;

 

C provides records, which are called structures (do not include record variants, or unions like Pascal).

 

FORTRAN 90 record declarations require that any nested records be previously defined as types.

 

References to Record Fields

 

Either by naming the desired field and its enclosing records:

 

field_name OF record_name_1 OF … OF record_name_n

 

e.g., COBOL

MIDDLE OF EMPLOYEE-NAME OF EMPLOYEE-RECORD

 

Or using dot notation for field references (name of largest enclosing record used first)

 

e.g., Ada

EMPLOYEE_RECORD.EMPLOYEE_NAME.MIDDLE

 

FORTRAN 90 uses dot notation with percent signs instead of periods.

 

Fully qualified reference to a record field: one in which all intermediate record names are named in the reference.

 

Elliptical references: field is named, but any or all of the enclosing record names can be omitted as long as the resulting reference is unambiguous in the referencing environment.

e.g., COBOL or PL/I

FIRST, FIRST OF EMPLOYEE-NAME, FIRST OF EMPLOYEE-RECORD

 

Pascal provides with clauses to avoid the necessity of using fully qualified references:

e.g.,

employee.name := 'Bob';

employee.age := 42;

employee.sex := 'M';

employee.salary := 23750.0;

 

with employee do

begin

name := 'Bob';

age := 42;

sex := 'M';

salary := 23750.0

end; { end of with }

 

Operation on Records

 

Pascal & Modula-2 only support assignments.

 

Ada allows assignment and comparison for equality and inequality for records that have compatible types. Additional record operations can be defined as overloaded operators. Ada allows record initialization at compile time, where data aggregates are used as the initial values. Aggregate values can also be assigned to records by assignment statements.

 

C allows address-of and field selection, while C and C++ also allow records to be assigned.

 

COBOL provides the MOVE CORRESPONDING statement for moving records. This statement copies a field of the specified source record to the destination record only if the destination record has a field with the same name.

e.g.,

01 INPUT-RECORD.

02 NAME.

05 LAST PICTURE IS X(20).

05 MIDDLE PICTURE IS X(15).

05 FIRST PICTURE IS X(20).

02 EMPLOYEE-NUMBER PICTURE IS 9(10).

02 HOURS-WORKED PICTURE IS 99.

 

01 OUTPUT-RECORD.

02 NAME.

05 FIRST PICTURE IS X(20).

05 MIDDLE PICTURE IS X(15).

05 LAST PICTURE IS X(20).

02 EMPLOYEE-NUMBER PICTURE IS 9(10).

02 GROSS-PAY PICTURE IS 999V99.

02 NET-PAY PICTURE IS 999V99.

 

MOVE CORRESPONDING INPUT-RECORD TO OUTPUT-RECORD.

 

Evaluation

 

Straightforward design and safe use.

Elliptical references are not clearly readable.

 

Implementation of Record Types

 

Fields are stored in adjacent memory locations.

Because the sizes of the fields are not necessarily the same, the access method used for arrays is not used for records.

Instead, the offset address, relative to the beginning of the record, is associated with each field, and field accesses are all handled using thes offsets.

 

Compile-time descriptor:

field1 (record, name, type, offset) … fieldn (name, type, offset) Address.

 

Run-time descriptors are unnecessary.

 

Union Types

 

Type that may store different type values at different times during program execution.

 

Design Issues

 

Should type checking be required ? Note that any such type checking must be dynamic.

 

Should unions be embedded in records ?

 

FORTRAN Union Types

 

e.g.,

INTEGER X

REAL Y

EQUIVALENCE (X, Y)

 

X and Y are aliases.

No type checking is done.

 

ALGOL 68 Union Types

 

Discriminated union: union with which is associated an additional value called a tag, or discriminant, that identifies the current type value stored in the union.

 

e.g., union (int, real) ir1, ir2

 

union (int, real) ir1;

int count;

…

ir1 := 33;

…

count := ir1;

 

second statement is not legal, as the system cannot statically check the type of ir1 (compiler cannot guarantee that ir1 will actually contain an integer value).

 

Conformity clauses are provided to alleviate the above issue:

e.g.,

union (int, real) ir1;

int count;

real sum;

…

case ir1 in

(int intval) : count := intval,

(real realval) : sum := realval

esac

Allows static type checking of user coded, and dynamic checking of system discriminants in order to disallow erroneous uses of values. Therefore, it is a safe way of implementing discriminated unions.

 

ALGOL 68 only provides assignment as a built-in operation for discriminated unions. Other operations can be constructed by user-defined overloaded operators.

 

Pascal Union Types (also applies to Modula-2)

 

Integration of discriminated unions with a record structure.

 

In that case, the discriminated union is called a record variant, or variant part of a record. The discriminant is a user-accessible variable in the record that stores the current type value in the variant.

 

e.g.,

type shape = (circle, triangle, rectangle);

object =

record

case form : shape of

circle: (diameter : real);

triangle: (leftside : integer;

rightside : integer;

angle : real);

rectangle: (side1 : integer;

side2 : integer)

end;

var figure : object;

figure consists of the tag, which is named form (discriminant), and sufficient storage for its largest variant. At any time during execution, the tag should indicate which variant is currently stored.

 

Issues:

 

e.g.,

type

object =

record

case shape of

circle : (diameter : real);

triangle : (leftside : integer);

…

end

 

Variant records in Pascal are frequently used to get around some of the restrictions of the language.

e.g., pointer arithmetic is not allowed in Pascal, to circumvent this restriction, a pointer can be placed in a variant with an integer, and the integer form can be manipulated as desired.

 

Ada Union Types

 

Safety:

 

Allows "constrained variant variable":

 

e.g.,

type SHAPE is (CIRCLE, TRIANGLE, RECTANGLE);

type OBJECT (FORM : SHAPE) is

record

case FORM is

when CIRCLE =>

DIAMETER : FLOAT;

when TRIANGLE =>

LEFT_SIDE : INTEGER;

RIGHT_SIDE : INTEGER;

ANGLE : FLOAT;

when RECTANGLE =>

SIDE_1 : INTEGER;

SIDE _2 : INTEGER;

end case;

end record;

 

FIGURE_1 : OBJECT; (unconstrained variant record with no initial value)

type can change by assignment of a whole record, including the discriminant

FIGURE_1 := (FORM => RECTANGLE, SIDE_1 => 12, SIDE_2 => 3);

references to fields in unconstrained variants must be dynamically checked.

 

FIGURE_2 : OBJECT ( FORM => TRIANGLE); (constrained variant variable)

 

C and C++ Union Types

 

Free union types:

Evaluation

 

Unions are potentially unsafe constructs in many languages (Oberon and Modula-3 do not include them).

 

Unions are a reason why Pascal, C, C++, and Modula-2 are not strongly typed (do not allow type checking of references to variant parts of records).

 

Unions provide programming flexibility.

 

Unions can be designed so that they can be safely used (e.g., Ada).

 

Implementation of Union Types

 

Discriminated unions:

 

At compile time the complete description of each variant must be stored:

 

e.g., Ada discriminated union

type NODE (TAG : BOOLEAN) is

record

case TAG is

when true => COUNT : INTEGER;

when false => SUM : FLOAT;

end case;

end record;

compile-time descriptor:

(discriminated union, TAG : BOOLEAN, offset, pointer to case table, Address)

case table (true, pointer to true case tag, false, pointer to false case tag)

true case tag (name : COUNT, type : INTEGER)

false case tag (name : SUM, type : FLOAT)

 

 

Set Types

 

Type whose variables can store unordered collections of distinct values from some ordinal type called base type.

 

Design issue: what should the maximum number of elements in a set base type be ?

 

Sets in Pascal and Modula-2

 

Pascal:

 

type colors = (red, blue, green, yellow, orange, white, black);

colorset = set of colors;

var set1, set2 : colorset;

 

set1 := [red, blue, yellow, white];

set2 := [black, blue];

 

Modula-2/Modula-3:

 

e.g.,

TYPE setype1 = SET OF [red, blue, green, yellow];

setype2 = SET OF [blue, yellow]

VAR setvar1 : setype1;

In a program that includes these declarations, {blue} may be ambiguous

However, in

setvar1 := setype1 {blue}

the type of the constant is very clear.

 

Note that set type variables, like enumeration variables, can be neither input nor output in Pascal or Modula-2.

 

Evaluation

 

Ada does not include set types. Instead, Ada provides a set membership operator for its enumeration types.

 

In other languages without set types, set operations must be done with arrays, and the user must write the code to provide the operations (more cumbersome and most likely less efficient).

Implementation of Set Types

 

Sets are usually stored as bit strings in memory.

 

e.g.,

set with ordinal base type ['a'..'p']

variables of this set type can use the first 16 bits of a machine word with each set bit (1) representing a present element, and each clear bit (0) representing an absent element.

set value: ['a', 'c', 'h', 'o'] represented as 1010000100000010

a typical operation such as a set union can be computed as a single machine instruction (logical OR).

set membership can be done in a single instruction when the base set cardinality is less than or equal to the machine's word size (AND operation).

 

Pointer Types

 

Type in which the variables have a range of values that consists of memory addresses and a special value, nil.

 

nil is not a valid address, and is used to indicate that a pointer cannot currently be used to reference another object.

 

Pointers provide two distinct kinds of uses:

 

Dynamic variables: variables that are dynamically allocated from the heap.

Variables without names are called anonymous variables.

Pointers are not structured types, although they are defined using a type operator (* in C and C++, access in Ada, and ^ in Pascal).

Pointers are different from scalar variables, as they are most often used to reference some other variable, rather than being used to store data of some sort.

 

Design Issues

 

What are the scope and lifetime of a pointer variable ?

What is the lifetime of a dynamic variable ?

Are pointers restricted as to the type of object to which they can point ?

Are pointers used for dynamic storage management, indirect addressing, or both ?

 

Pointer Operations

 

Assignment: sets a pointer variable to the address of some object

 

Occurrence of a pointer variable in an expression can be interpreted in two ways:

Dereferencing: takes a reference through one level of indirection.

e.g., ptr -> (7080)

7080 -> (206)

normal reference to ptr yields 7080

dereferenced reference to ptr yields 206.

 

Dereferencing is implicit in ALGOL 68 and FORTRAN 90.

 

Dereferencing in most contemporary languages only occurs when explicitly specified:

 

Pointers to records:

 

Languages that provide pointers for the management of a heap must include an explicit allocation operation. Some also provide an explicit deallocation operation. These are ofte in the form of built-in subprograms, although Ada and C++ use an allocator operator.

 

Pointers and Pointer Problems in PL/I

 

PL/I was the first high-level programming language to include pointer variables.

Pointers can be used to refer to both dynamic variables and other program variables.

PL/I pointers are highly flexible, but their use can lead to several kinds of programming errors.

 

Type Checking

 

Domain type: type of object to which a pointer can point.

A PL/I pointer is not restricted to a single domain type.

This makes static type checking of pointer references impossible.

 

Dangling Pointers

 

Dangling pointer or dangling reference: pointer containing the address of a dynamic variable that has been deallocated.

 

PL/I provides two ways to create dangling pointers:

 

e.g.,

BEGIN

DECLARE POINT POINTER;

…

BEGIN

DECLARE SUM FIXED;

POINT = ADDR(SUM);

…

END;

…

END;

POINT is dangling after its exit from the inner block, as SUM has been deallocated.

 

 

e.g.,

DECLARE SUM FLOAT BASED (POINT);

DECLARE POINT2 POINTER;

…

ALLOCATE SUM;

POINT2 = POINT;

FREE SUM;

POINT2 is dangling after FREE SUM.

 

Lost Objects

 

Allocated dynamic object that is no longer accessible to the user program but may still contain useful data (garbage).

 

Not useful to the user program that allocated them, and cannot be reallocated for some new use by the system.

 

e.g.,

DECLARE SUM FLOAT BASED (POINT);

…

ALLOCATE SUM;

…

ALLOCATE SUM;

 

Pointers in Pascal

 

Only used to access dynamically allocated anonymous variables.

 

Dangling pointers can be created easily with explicit deallocation.

 

Pascal programmer creates dynamic variables with new and destroys them with dispose (would need to find and set all pointers pointing to the dynamic variable being destroyed to nil).

 

Alternatives for explicit deallocation implementation in Pascal:

Pointers in Ada

 

Pointers are called access types.

 

Dangling pointer problem partially alleviated:

 

Lost objects can be created in the same way as in PL/I.

 

All pointers are implicitly initialized to null (Ada's nil).

 

Pointers in C and C++

 

Used much as addresses in assembly languages.

 

Extremely flexible but must be used with great care.

 

No solutions to the dangling pointer or lost object problems.

 

Pointer arithmetic is possible.

 

* denotes the dereferencing operations, and & denotes the operator for producing the address of a variables.

 

e.g.,

int *ptr;

int count, init;

…

ptr = &init;

count = *ptr;

equivalent to count = init;

 

The declaration of a pointer specifies its domain type.

 

Pointers can be assigned the constant zero, which is used for nil

 

Pointer arithmetic:

 

e.g., C/C++ array

array name without subscript refers to the address of the first element (treated as a pointer but is a constant, and therefore cannot be assigned).

ptr = list assigns the address of list[0] to ptr

*(ptr+1) is equivalent to list[1],

*(ptr+index) is equivalent to list[index],

ptr[index] is equivalent to list[index] (pointers to arrays can be indexed as if they were array names).

 

Pointers can point to functions (used to pass functions as parameters to other functions), and can be used for parameter passing.

 

C/C++ include pointers of type void * (can point at objects of any type). At the difference of PL/I, type checking of void * pointers is not a problem because they cannot be dereferenced.

 

Pointers in FORTRAN 90

 

Used to point to both dynamic variables and static variables.

e.g.,

INTEGER, POINTER :: INT_PTR

INTEGER, POINTER, DIMENSION (: ) :: INT_LIST_PTR

 

Implicitly dereferenced in most uses (when pointer appears in normal expression).

 

pointer => target assignment statement used when dereferencing is not desired.

 

e.g.,

INTEGER, TARGET :: APPLE

INTEGER ORANGE

APPLE can be pointed to by INT_PTR, but ORANGE cannot.

 

FORTRAN 90 pointers can easily become dangling, because the DEALLOCATE statement makes no attempt to determine if other pointers are pointing at a dynamic variable that is being deallocated.

 

C++ Reference Types

 

Special kind of pointer type used primarily for the parameters in function definitions.

 

A reference type variable is a constant pointer that is always implicitly dereferenced. It must be initialized with the address of some variable in its definition and can never be set to reference any other variable.

 

e.g.,

int result = 0;

int &ref_result = result;

…

ref_result = 100;

// result and ref_result are aliases

 

When used as parameters in function definitions, reference types provide for two-way communication between the caller function and the called function (called function references parameters that are referenced exactly as are other parameters, and the compiler passes addresses, rather than values, to reference parameters).

 

Evaluation

 

Pointer variables widen the range of memory cells that can be referenced by a variable.

Pointers provide the means by which recursively defined dynamic data structures can be conveniently represented (e.g., linked lists, binary trees, etc.).

Pointers provide the mechanism for managing dynamic storage in a heap.

Pointers create hazards …

 

Implementation of Pointer Types

 

Representation of Pointers

 

In most larger computers, pointers are single values stored in either two or four byte memory cells, depending on the size of the address space of the machine.

 

On microcomputers based on Intel microprocessors, addresses are composed of a segment and an offset, so pointers are implemented in these systems as pairs of 16-bit words, one for each of the two parts of an address.

 

Solutions to the Dangling Pointer Problem

 

Use of extra heap cells, called tombstones:

 

Have all dynamic variables include a special cell, called a tombstone, that is itself a pointer to the dynamic variable.

The actual pointer variable points only at tombstones, and never to dynamic variables.

When a dynamic variable is deallocated, the tombstone remains but is set to nil, indicating that the dynamic variable no longer exists. This prevents a pointer from ever pointing to a deallocated variable.

Tombstones are costly in both time and space (tombstone are never deallocated, access to a dynamic variable through a tombstone requires one more level of indirection / additional machine cycle on most computers).

 

Lock-and-keys approach used in the implementation of UW-Pascal:

 

Pointer values are represented as ordered pairs (key, address) where the key is an integer value.

Dynamic variables are represented as the storage for the variable plus a header cell that stores an integer lock value.

When a dynamic variable is allocated, a lock value is created and placed both in the lock cell of the dynamic variable and in the key cell of the pointer that is specified in the call to new.

Every access to the dereferenced pointer compares the key value of the pointer to the lock value in the dynamic variable. If they match, the access is legal, otherwise it is treated as a runtime error.

Any copies of the pointer value to other pointers must copy the key value.

Therefore, any number of pointers can reference a given dynamic variable.

When a dynamic variable is deallocated with dispose, its lock value is cleared to an illegal lock value. If a pointer other than the one specified in the dispose is dereferenced, its address value will still be intact, but its key value will no longer match the lock, so the access will not be allowed.

 

Heap Management (more of a system programming issue than a language design issue)

 

Heap storage allocation and deallocation in units of a single size

 

Further simplified when every cell (unit of a single size) contains a pointer (e.g., LISP data which consists of cells connected into linked lists).

 

All available cells are linked together using the pointers in the cells, forming a list of available space.

 

Allocation consists of taking the required number of cells from the list as needed.

 

Deallocation:

 

Two distinct opposite processes for reclaiming garbage:

 

Reference counter method:

 

Problems with reference counter method:

 

Garbage collection method:

 

Problem with garbage collection method:

 

Heap storage allocation and deallocation of variable-size segments

 

Additional problems:

 

Conclusions

 

Data types determine language style and use (heart of a language along with control structures).

 

Primitive types: numeric, character, boolean.

 

User-defined: enumeration & subrange types are convenient and make programs more readable.

 

Arrays are part of most programming languages.

 

Records are now included in most languages.

 

Safe discriminated unions are difficult to design.

 

Sets are relatively easy to implement.

 

Pointers have inherent dangers: dangling pointers and garbage collection issues.

 

Pointers are costly to implement if they are used for dynamic storage management and if steps are taken to avoid dangling pointers.

 

Heap management is complicated with variable-size cell allocation and deallocation.

 

 

C4.3. Names, Bindings, Type Checking, and Scopes

 

(Next Class - March 2)

 

C4.4. Procedures and their implementation

 

(Next Class - March 2)

 

C4.5. Comments on Sethi's Chapters 4, 5, 15.1, 15.2