Caml

Document Tree

NAVIGATION

Programming

Caml

RELATED

Contact

Caml is a programming language which is very different from other, "classical"[1] languages. While a program written in a classical languages consists mainly of statements, Caml programs are composed of functions. A statement takes its input values from the environment (such as variables), computes something, and stores the result again in the environment. To achieve more complex behaviour, you can execute one statement after another, or put your statements into loops. Functions are differ from this model, as there is no possibility to store something into the environment.

For example, to determine the biggest number of an array, execute the following C statements:

maximum = a[0];
for (i=1; i<n; i++) {
  if (a[i] > maximum) maximum = a[i];
}

The variable "maximum" contains the biggest number which has been found so far. The characteristics of "maximum" is that it is a store, i.e. the value associated with it can be changed.

In a functional language a different style is preferred[2]. Variables are normally considered immutable, and behave much more like variables in mathematics. Repetitive execution of the same code is expressed by recursive functions instead of loops:

let rec determine_maximum maximum i =
  if i < n then
    if a.(i) > maximum then
      determine_maximum a.(i) (i+1)
    else
      determine_maximum maximum (i+1) 
  else
    maximum
in
  determine_maximum a.(0) 1

Here, "let", "rec", "in", "if", "then", and "else" are keywords. The function "determine_maximum" takes two arguments, "maximum", and "i", and computes the maximum of "maximum" and all array elements starting with "i". This function is a local function, i.e. it is not visible in the whole program, but only in the expression following "in". Some notes to better understand this piece of code:

Every expression returns a value. The whole expression returns the maximum of the array; every function invocation returns the maximum of an array slice; even the "if" constructs returns either the value computed after "then" or after "else" (this kind of "if ... then .. else" is comparable to "... ? ... : ..." in C).
The whole expression is an example of the "let rec fname arg1 arg2 ... = expr1 in expr2" construct. It defines a local recursive function "fname" with several arguments and "expr1" as its function body (the code that is being executed when the function is called), but "fname" is only visible within "expr2". The value of the whole construct is the result of the evaluation of "expr2".
There is no special operator or notation for the function call in Caml; you simply write "f x" to call "f" with one argument, or "f x1 x2 ..." to invoke it with several arguments. As every function must have been defined before it can be used, Caml knows how many arguments to collect. Note that functions must have at least one argument such that "f" without arguments denotes the function itself and not the invocation with empty argument list[3]
The Caml compiler eliminates the recursion in the machine code, because it recognizes that a simple "goto" statement is sufficient to translate the function; especially it is not necessary to put values onto the processor stack while the inner self-invocations are performed. This simply means that the compiled code runs as efficient as the classical formulation.

But what is the advantage of a functional programming style? It seems that you must write much more lines of code to get the same effect. The answer is that Caml functions are more powerful than the functions you find in classical languages such as C, and that the above example does not take advantage of this power. Here an alternative formulation which is much shorter:

Array.fold_left (fun r x -> if x>r then x else r) a.(0) a

The whole expression is just an invocation of the function "Array.fold_left" with three arguments. The idea is that this function runs over the array "a" (given as third argument), and invokes for every element a user-defined function which gets the value of the element "x" and an intermediate result "r" and which computes a new intermediate result that is to be used for the next element. The first argument of "Array.fold_left" is the user function (i.e. the expression in parantheses); the second argument is the initial intermediate value which is used for the first element.

Here, the intermediate result is the maximum of the array elements that have been seen so far. We can use "a.(0)" as the initial value because this number is equal to or smaller than the maximum of the whole array. The user function is a so-called anonymous function, i.e. a function not associated with an identifier serving as the name of the function. Such functions are written "fun arg1 arg2... -> expr" where "fun" is the keyword indicating that a function definition follows. This function takes "arg1", "arg2" and so on as arguments, and the result of the function is computed by evaluating the expression "expr". In our example, the first argument "r" is the intermediate result, and the second argument "x" is the current array element. The function body calculates the maximum of both values.

Of course, it is possible to use a named function, too[4]:

let calculate_maximum r x = 
  if x>r then x else r
in
  Array.fold_left calculate_maximum a.(0) a

The only difference of an anonymous function is the missing name.

The function "Array.fold_left" takes a function as argument. This is important to mention because Caml makes no difference between ordinary values like numbers, and functions. "Functions are first-class values", i.e. they can be used in the same way as input to or output from computations like any other value of the language.

It is time to tell the truth about the "let" construct. I have introduced it as a way to define local functions, but it is not restricted to this usage. In general, "let" binds a name with a value, i.e. every time the name is mentioned after that, the bound value is meant instead. These names are often called variables but this is a misnomer; actually the values bound to the variables cannot be exchanged by different values[5]. For example, these "let" bindings are possible:

let pi = 3.14 in sin pi

let calculate_maximum = 
  fun r x -> if x>r then x else r 
in 
  calculate_maximum 5 6

As you can see in the second example, "calculate_maximum" is just a name for a function, and the literal value of the function is given by an anonymous function just like the literal value of pi is given by a numeric constant. If you write "let f arg1 arg2 ... = expr1 in expr2" this is only an abbreviation of "let f = fun arg1 arg2 ... -> expr1 in expr2".

Although the function "Array.fold_left" is predefined[6] it is possible to define the function yourself, i.e. there is nothing "magic" about it. This is very illustrative as it demonstrates that functions are more powerful than statements; imagine you could define a "for" loop yourself in C ("Array.fold_left" has the same meaning as a "for" loop iterating over all elements of an array). This is a possible definition:

let fold_left f r0 a =
  let n = 
    Array.length a 
  in
    let rec iterate r i =
      if i<n then
        let x = a.(i) in
        iterate (f r x) (i+1)
      else
        r
    in
      iterate r0 0

The function "fold_left" takes the already known three arguments and refers to them by "f", "r0", and "a". Note that we need not to declare that "f" can only be instantiated by a function, it appears as every other variable. The definition of "fold_left" lacks an "in" clause; this means that the name "fold_left" is globally visible and can be used in all subsequent expressions. The function body of "fold_left" has two inner "let" bindings, one for "n", and one for "iterate". The bindings are nested as indicated by the indentation of the lines[7], i.e. "n" is visible in the function body of "iterate". The function "iterate" has two arguments: "r" is the intermediate result, and "i" is the index of the array element which is processed next. "iterate" invokes itself unless the end of the array is exceeded, and the next intermediate result is computed by calling "f" with the current result "r" and the current array element "x" which is "a.(i)".

Observe that the local function "iterate" can access the variables "f", "a", and "n" which are introduced outside "iterate". In the code of "fold_left" there is nothing problematic with this, but in a functional language it is possible to evaluate a function in a different context than it was defined. For example, if we want to compute the maximum of all array elements not bigger than "m", we can simply write

let m =
   ...whatever m is...
in
  Array.fold_left 
    (fun r x -> if (x>r || r>m) && x<=m then x else r) 
    a.(0) 
    a

Note that if this expression returns a value bigger than "m" this means that the maximum does not exist.

The problem with this construct is that the anonymous function passed to "Array.fold_left" contains a reference to "m". When the anonymous function is being evaluated this happens in the context of "Array.fold_left" which has been defined earlier, and "m" did not exist at this moment.

Classical languages that allow functions to be passed to other functions solve this problem simply by forbidding it. For example, Pascal which knows local functions does not allow to pass local functions as arguments. This is not what the programmer wants as the construct makes sense.

Caml solves the problem as follows: It takes the value of "m" at the moment the anonymous function is being defined, and this value is "incorporated" into the function, i.e. it substitutes all occcurences of "m", and the name "m" vanishes. For instance, if "m" happens to be 42 at the moment of definition, the function passed to "Array.fold_left" is actually:

fun x y -> if (x>r || r>42) && x<=42 then x else r

This implicitly created function is called the closure of the original function, because the variable "m" would be a free (unbound, open) variable if the function was passed in its original form. Note that the translation of closures leads to very efficient machine code, as the substitution is deferred until the substituted variable is actually accessed.

Of course, functions are the most interesting concept of Caml, but there are other features worth to be mentioned:

Caml is a statically typed language; this means that the compiler checks typing constraints for every expression such that it is guaranteed at runtime that every operation gets only values it can operate on. These checks are known from Pascal and ANSI C, but Caml again deviates from the main stream. The classical way of handling type checking is that every variable and every procedure must be declared before it can be used; the declaration must contain the typing constraint the programmer wants to apply on the object. Caml has a feature called type inference which avoids explicit declarations in most cases. The compiler is intelligent enough to find out how a variable is used and derives the minimal necessary typing constraints from that. For example, if "x<3" is an expression of the program, the compiler finds out that "x" must be an integer because it is compared with an integer; the comparison operator demands that both operands are of the same type. If later "x<'a'" occurs, a typing error is signaled, because this expression expects that "x" is a character.
Caml has a rich type system. There are many data structures such as arrays, lists, streams, stacks, queues, sets and others already built-in, or provided with the standard library. Furthermore, Caml knows polymorphism which means that you can formulate functions which take input values or produce output of arbitrary type (or constrained in a way that a part of a value has arbitrary type), and that user-defined data structures may contain type variables. For example, you can define lists whose members are of arbitrary type, and provide functions working on these lists. Note that "arbitrary" means that the type of the members is only unspecified at the moment of the definition of the list type, but that every application of such lists must be concrete at last (lists of integers, lists of strings, ...). There is not any notational overhead in order to make a polymorphic type concrete, just use it in a concrete way!
Caml has a garbage collector, i.e. you need not to allocate or free memory yourself, the system does it for you. Caml knows references (pointers), but not the NULL pointer, or even uninitialized references. It is impossible to produce "segfaults".
Caml has advanced language features such as exceptions, modules, functors, and classes.
Caml programs can be multi-threaded.
There is a compiler producing bytecode which is platform-independent, and a compiler generating machine code. Furthermore, it is possible to execute Caml programs as scripts (ad hoc compiled). The resulting binaries have only moderate size. For example, the minimal bytecode interpreter has about 200K; you must add the standard library (about 500K), additional libraries, and your program. It is possible to create customized interpreters that already incorporate libraries which may be shared by several programs.

[1]: The correct term would be "imperative" languages, but I prefer to speak of "classical" languages to emphasize that these languages have the properties that most programmers expect from a language.
[2]: I say "preferred" because every practical functional language allows it to program with statements, too. The reason for this is that there are some algorithms running quicker when formulated with statements instead of functions. Because of this, Caml programs are often a mixture of functions and statements.
[3]: If you do not need an argument, Caml provides the special value "()", called the "thing". There are no operators that would allow to manipulate "()" ; it is just there. For example, write "f ()" to call "f" with a useless argument.
[4]: Because the function is not recursive, we use the "let" construct instead of "let rec". The only difference is that self-invocations are not recognized within "let".
[5]: The reason for the term "variable" is that every "let" binding can be considered as a lambda abstraction that is immediately followed by a lambda application. As you see there is some theory behind the language...
[6]: More exactly, it is defined in the standard library of Caml.
[7]: The "let ... in ..." construct is right-associative.

This web site is published by Informatikbüro Gerd Stolpmann

Plasma	GitLab	Archive
Projects	Blog	Knowledge