Friday, January 9, 2009

What is a Scalable Computer Language?



What is a scalable computer language?

Simply put, a scalable computer language is a language that one can write very large programs in (and extend very large programs that have already been written) without feeling an undue amount of pain. In a scalable programming language, the difficulty of managing the complexity of the program goes up roughly linearly with the size of the program. Conversely, a nonscalable computer language is one in which increasing the size and scope of a problem tends to make the program much harder to manage (i.e. the complexity of the program goes up much more than linearly with its size).

Aspects of a programming language that affect scalability

Garbage collection: very, very good

Garbage collection (GC) means that the computer language (literally, the computer language's runtime system) automatically manages the reclamation ("freeing") of memory that is no longer being used. This is a huge win for the programmer, and can dramatically increase the scalability of the language. The reason is simple. It's usually obvious where one has to allocate memory. It can be very difficult to know exactly when it's safe to free it. This is especially true when references to the allocated memory are passed around, returned from functions, stored in multiple data structures that are later deleted (or not), etc. etc. The effects of this are very insidious. Languages without GC implicitly discourage the programmer from using all but the simplest data structures, because more often than not, the memory management problem quickly becomes intractable when using more complex data structures. What usually happens in practice is that the programmer rolls his/her own (bad) garbage collector (maybe a reference counter), with performance that is usually worse than what you would get if you used a language with GC built in.
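To make the ownership problem concrete, here is a small, purely hypothetical C sketch in which two data structures end up sharing a single allocation. Deciding which of them is responsible for freeing it is exactly the bookkeeping a garbage collector does for you:

    #include <stdlib.h>
    #include <string.h>

    /* Both records alias the same heap-allocated string; nothing in the
       types says who "owns" it. */
    typedef struct { char *name; } record_t;

    static record_t *make_record(char *name) {
        record_t *r = malloc(sizeof *r);
        r->name = name;                 /* shared, not copied */
        return r;
    }

    int main(void) {
        char *name = strdup("shared string");
        record_t *a = make_record(name);
        record_t *b = make_record(name);

        free(a->name);                  /* fine... */
        free(a);
        /* free(b->name) here would be a double free, because b->name is the
           same block we just freed; forgetting both frees would be a leak.
           With GC, neither mistake is possible. */
        free(b);
        return 0;
    }

Multiply this by thousands of allocations whose lifetimes cross module boundaries and you have the problem GC solves.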

I saw a rather stark example of this recently. One of the things I do professionally is teach the C programming language to Caltech undergraduates. I emphasized how important it was to always free memory that had been allocated. However, many of my students simply ignored me, and their code was littered with memory leaks. I got so tired of writing "this code has a memory leak here" on their assignments that I wrote a very simple memory leak checker. They are now required to write code that passes through the memory leak checker without any reported leaks before they submit their assignments. However, I was somewhat dismayed to find that my own answers to the assignments had a couple of subtle memory leaks as well! Since I have more than ten years of C programming experience, and have worked on several very large projects, this suggests to me that manual memory management is much harder than I'd previously supposed it to be.
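For the curious, a leak checker of this kind can be surprisingly small. The sketch below is only my guess at the general idea, not the actual checker used in the course: wrap malloc and free, count live allocations, and complain at exit if the count is nonzero.

    #include <stdio.h>
    #include <stdlib.h>

    static long live_allocations = 0;

    void *checked_malloc(size_t size) {
        void *p = malloc(size);
        if (p != NULL)
            live_allocations++;
        return p;
    }

    void checked_free(void *p) {
        if (p != NULL)
            live_allocations--;
        free(p);
    }

    void report_leaks(void) {
        if (live_allocations != 0)
            fprintf(stderr, "leak check: %ld allocation(s) never freed\n",
                    live_allocations);
    }

    int main(void) {
        atexit(report_leaks);               /* report at program exit */
        char *buf = checked_malloc(100);    /* allocated... */
        (void)buf;                          /* ...but never freed: reported */
        return 0;
    }

A real checker would also record the file and line of each allocation, but even this much is enough to shame a leaky program.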

There is a cost to GC, both in time and space efficiency. Well-designed garbage collectors (especially generational GCs) can be extremely efficient (more efficient, for instance, than naive approaches such as reference counting). However, to achieve this they tend to need significantly more total space than programs that manage memory manually (I've heard estimates on the order of 50% more). On the other hand, a program that leaks memory has the greatest space usage of all. I've wasted way too much of my life hunting down memory leaks in large C programs, and I have no interest in continuing to do so.

In conclusion, I would say that of all the items I'm discussing here, GC is the single most important one to ensure that a programming language is scalable. This is why programmers who move from a language without GC (say C++) to one of roughly equivalent abstractive power but with GC (say Java) invariably say how much happier they are now that they don't have to worry about memory management and can concentrate on the algorithms they're trying to write. Personally, I'd rather pull my own teeth out than write a large project in a language without GC.

Direct access to memory and pointer arithmetic: very bad

Some computer languages, notably C and C++, allow the programmer to interact directly with memory addresses through pointers, and also allow pointer arithmetic (incrementing and decrementing pointer variables). This kind of low-level programming is sometimes necessary (e.g. when writing device drivers) and sometimes simply useful (e.g. when micro-optimizing code that has to run as fast as possible). However, my (imperfect) understanding of the issue is that programming with pointers makes precise garbage collection impossible, or nearly so. There is a kind of GC called "conservative" garbage collection (here is a free implementation of the Boehm-Demers-Weiser conservative GC) which can work with languages like C and C++. It's certainly better than nothing, but there are no guarantees that all memory will be managed correctly (i.e. memory leaks are unlikely, but possible). In practice this supposedly isn't a problem, but I note with some interest that very little of the C/C++ code I've seen actually uses conservative GC, and the code that does is usually implementing a language that itself provides GC.
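As a rough illustration (assuming the collector library is installed and the program is linked with -lgc), using the conservative collector from C looks something like this:

    #include <gc.h>       /* Boehm-Demers-Weiser collector */
    #include <stdio.h>

    int main(void) {
        GC_INIT();        /* initialize the collector */

        for (int i = 0; i < 1000000; i++) {
            int *p = GC_MALLOC(sizeof *p);   /* never freed explicitly */
            *p = i;
        }
        /* With plain malloc() this loop would leak several megabytes; with
           the conservative collector the unreachable blocks are recycled. */
        printf("done\n");
        return 0;
    }

Note that the collector has to guess which machine words are pointers, which is why the guarantees are weaker than in a language designed for GC.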

The scalability cost of pointers goes far beyond just making GC harder. Pointers (and especially pointer arithmetic) tend to destroy any safety guarantees you might want to be able to make about a program. It's not hard to see why. When you have a pointer to (say) an integer, and you can add 1,000,000 to that pointer and dereference some random region of memory which may or may not be part of your program's run-time image, all hell can break loose. If you're lucky, you'll just get a core dump and your program will terminate. If you're not so lucky, some part of the program memory will be corrupted, leading to mysterious bugs that are extremely difficult to track down, because they manifest themselves far away from where the original problem was. This leads to a huge increase in debugging times, which dramatically hurts programmer productivity. As the program gets larger, the opportunities for this kind of problem increase, which represents a significant barrier to scalability.
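Here is a deliberately broken C program showing how little the language does to stop you; whether it crashes, corrupts something, or appears to work depends entirely on what happens to live at the address you compute:

    #include <stdio.h>

    int main(void) {
        int x = 42;
        int *p = &x;

        /* Pointer arithmetic happily takes us far outside the object p
           points to.  Dereferencing the result is undefined behavior. */
        int *far_away = p + 1000000;
        *far_away = 7;        /* may crash, may silently trash other memory */

        printf("%d\n", x);    /* by now, anything at all may have happened */
        return 0;
    }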

The usual argument in favor of direct pointer manipulations is that they make it possible to write faster code. This is often true; I've seen cases where using pointer arithmetic judiciously increased the speed of a program by a factor of five. However, the reverse is also often true; many program optimizations (normally performed automatically by the compiler) are rendered much more difficult or impossible in code that uses pointers. In other words, languages that enable micro-optimizations often make macro-optimizations impossible.
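A small example of the idiom in question: summing an array with explicit indices versus hand-rolled pointer walking. Decades ago the pointer version was often measurably faster; today a decent optimizer usually produces equivalent code for both, and the pointer-heavy style mainly costs clarity and, as noted above, analyzability.

    #include <stddef.h>

    /* Straightforward indexed version */
    long sum_indexed(const int *buf, size_t n) {
        long total = 0;
        for (size_t i = 0; i < n; i++)
            total += buf[i];
        return total;
    }

    /* Hand-"optimized" pointer-walking version */
    long sum_pointers(const int *buf, size_t n) {
        long total = 0;
        for (const int *p = buf, *end = buf + n; p < end; p++)
            total += *p;
        return total;
    }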

The author of the Eiffel language, Bertrand Meyer, has said that (I'm paraphrasing) "you can have pointer arithmetic, or you can have correct programs, but you can't have both". I agree with him. I think that direct memory access through pointers is the single biggest barrier to programming language scalability.

None of this is meant to imply that pointers and pointer arithmetic don't have their place; they are crucial for low-level close-to-the-metal programming. However, large programs written at a low level are simply not scalable. The right way to use languages like C is to implement small, focused low-level components of applications written primarily in higher-level languages. In fact, this is also the right way to use C++; you write some low-level classes that use pointers in some of their methods and then encapsulate these methods so you never have to expose the pointer manipulations to a user of the class. This would be nearly ideal, except that the compiler has no way to enforce this; you can always use pointers if you want to.
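The same encapsulation discipline can be illustrated in plain C with an opaque type (the names below are made up for the example): all of the raw pointer work lives behind a tiny interface, and callers never see an address. As with the C++ version, though, nothing stops a determined caller from cheating.

    #include <assert.h>
    #include <stdlib.h>

    /* The public interface: callers see only an opaque handle.
       (Error checking omitted to keep the sketch short.) */
    typedef struct intbuf intbuf_t;
    intbuf_t *intbuf_create(size_t capacity);
    void      intbuf_set(intbuf_t *b, size_t i, int value);
    int       intbuf_get(const intbuf_t *b, size_t i);
    void      intbuf_destroy(intbuf_t *b);

    /* The implementation: the only place raw pointers are touched. */
    struct intbuf { size_t capacity; int *data; };

    intbuf_t *intbuf_create(size_t capacity) {
        intbuf_t *b = malloc(sizeof *b);
        b->capacity = capacity;
        b->data = calloc(capacity, sizeof *b->data);
        return b;
    }

    void intbuf_set(intbuf_t *b, size_t i, int value) {
        assert(i < b->capacity);        /* bounds checked at the boundary */
        b->data[i] = value;
    }

    int intbuf_get(const intbuf_t *b, size_t i) {
        assert(i < b->capacity);
        return b->data[i];
    }

    void intbuf_destroy(intbuf_t *b) {
        free(b->data);
        free(b);
    }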

C

C combines all the power of assembly language with all the ease of use of assembly language.

-- unknown

I'd just like to take this moment to point out that C has all the expressive power of two dixie cups and a string.

-- Jamie Zawinski, in the source code for xkeycaps

There is a point in your life when you realize that you have written enough destructors, and have spent enough time tracking down a memory leak, and you have spent enough time tracking down memory corruption, and you have spent enough time using low-level insecure functions, and you have implemented way too many linked lists.

-- Miguel de Icaza

Greenspun's Tenth Rule of Programming: "Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp."

-- Philip Greenspun

Your superior intellect is no match for our puny weapons.

-- unknown, via Aaron Stern

I'll be blunt: C is a horrible language for developing large projects in. C contains almost all of the bad language features described above and almost none of the good ones. Furthermore, some of the good features it does have (like static type checking) are so corrupted by type casting that they are much less useful than they would otherwise be. C also offers essentially no abstraction capabilities whatsoever except for arrays (which are really just pointers in disguise) and structs (ditto).
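A two-line demonstration of the "arrays are really just pointers in disguise" point: pass an array to a function and its length evaporates.

    #include <stdio.h>

    void f(int a[10]) {                 /* the "[10]" is decoration; a is really an int* */
        printf("%zu\n", sizeof a);      /* size of a pointer, not of 10 ints */
    }

    int main(void) {
        int xs[10] = {0};
        printf("%zu\n", sizeof xs);     /* 40 if int is 4 bytes */
        f(xs);                          /* xs "decays" to &xs[0] */
        return 0;
    }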

Does that mean that C is a useless language? Far from it. It has its place, and its place is a vital one. C is an excellent language for writing code that has to interface directly with the machine ("bare-metal" programming). It is also a good language for implementing better languages in ;-) The Ocaml runtime engine, for instance, is written in C (with some help from assembly language). C is useful for writing the 1% of your application that absolutely, positively, has to run as fast as possible. However, if you're trying to write a large application entirely in C you are in for a miserable time. Instead, you're better off picking a good language that has a C foreign-function interface (FFI), so you can "drop down" into C when you really need to (which hopefully won't be that often).
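As a taste of what "dropping down" looks like, here is a minimal C stub of the kind OCaml's foreign-function interface expects (the function names are hypothetical, and the matching OCaml declaration is shown only as a comment):

    #include <caml/mlvalues.h>

    /* OCaml integers arrive as tagged machine words ("value"s), so they are
       unwrapped with Long_val and re-wrapped with Val_long. */
    CAMLprim value caml_fast_add(value a, value b)
    {
        return Val_long(Long_val(a) + Long_val(b));
    }

    /* On the OCaml side (shown here only as a comment):
         external fast_add : int -> int -> int = "caml_fast_add"       */

The point is that the performance-critical 1% can live in a file like this, while the other 99% of the program enjoys GC and real abstractions.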

Some people (particularly novice programmers who have no idea what they're talking about) are under the illusion that because C offers such direct control over the machine, it's the only "real" programming language for "true hackers". If that were true, all "true hackers" (whatever that means) would be writing code in assembly language, which is much closer to the machine than C is. I hope some of the quotes above, some of which are from world-famous open-source hackers, will help to dispel this myth. If not, then I recommend that you try to write a really large program (> 100,000 lines of code) in C, and tell me what you think at the end of that experience. I think you'll find my arguments much more persuasive.

C++

C++ : an octopus made by nailing extra legs onto a dog.

-- off smalltalk.org

Think of C++ as an object-oriented assembly language.

-- off the guile Scheme mailing list

C makes it easy to shoot yourself in the foot. C++ makes it harder, but when you do, you blow your whole leg off.

-- Bjarne Stroustrup

Programming in C++ is premature optimization.

-- off comp.lang.python

C++ adds object-oriented and generic programming features to C, along with a host of other features. This can significantly improve the abstraction level of C++ programs relative to C programs, and it should make the language more scalable. Unfortunately, the presence of pointers and the absence of GC undo, in my opinion, all the benefits of the useful features and make C++ non-scalable (for instance, NONE of the standard template library (STL) container classes use GC). In addition, C++ is so mind-bogglingly complex, with so many odd features that interact in peculiar ways, that it's almost impossible to master. I will readily admit that I'd rather write a large application in C++ than in C, but that's like saying I'd rather eat rotting meat than swallow sulfuric acid ;-)

To be fair, I have to point out that there is some really neat stuff possible in C++ using templates (typically lumped under the name "template metaprogramming"; see the book Modern C++ Design by Andrei Alexandrescu for more information). This is very much wizard-level programming (which is OK), but to my mind it doesn't even come close to compensating for the lack of GC. What would make C++ a more scalable language is (a) including a garbage collector as part of the standard library, and (b) having some way to restrict the use of pointers to particular modules that need direct pointer access (such as very low-level data structures). I'm not very optimistic that this will ever happen, but I hope it does.

Java and C#

Java and C# are more recent languages which have learned from the mistakes of the past. They both have GC, neither has pointers, and they both support object-oriented programming. This makes programming in these languages comparatively non-painful, and they scale well. Both also have interesting mechanisms for achieving platform independence (which is beyond the scope of the present discussion). C# is part of the .NET framework, which (in theory) permits a considerable amount of inter-language interaction; this is in my opinion a huge win, but is also beyond the scope of the present discussion. Both languages tend to be quite verbose, which is off-putting to many people (including me), but that's not that big an issue. C# also has an interesting mechanism for using unsafe code (i.e. code which uses pointers and doesn't use GC), but only when it is encapsulated in regions specifically marked "unsafe" (an idea borrowed from the Modula-3 language, which is now sadly all but defunct). If you feel you can't live without the ability to program with pointers, but you don't need them for most of your code, that's the right way to go.

The abstraction level of both C# and Java is mediocre; it's much better than C, somewhat weaker (!) than C++, and not nearly as good as that of languages that support both object-oriented and functional programming (such as Lisp and Ocaml). Therefore, I find programming in these languages to be pretty boring and tedious. Many of the scalability features of these languages are not, strictly speaking, part of the languages at all but of the environment(s) built up around them. For instance, support for components, versioning, packaging, and documentation generation is provided by the environments, not the languages themselves. I hope we will soon start to see these kinds of meta-features in better languages than Java or C#.

Introduction

Computer programmers often use the word scalable to describe a desirable feature of a program or algorithm. The idea is that something scalable not only works for very small jobs but remains equally effective (or nearly so) when the job in question is much larger (or, more commonly, when the job grows from small to large). A trivial example of poor scalability is an inefficient algorithm used in a prototype implementation of a program: it works fine as long as only very small data sets are used, but it breaks down on larger data sets because the computation takes so much time that the program effectively grinds to a halt. This kind of scalability is well understood by any decent programmer.

A different kind of scalability concerns the architecture of a computer program. A program that can't easily be extended to cope with new requirements is called brittle, which is the opposite of scalable. Since effectively all successful programs grow new parts, a nonscalable program usually requires a total rewrite when the requirements change, which is very wasteful. The usual reason for this kind of non-scalability is poor design abstraction: too many fundamental design decisions have been hard-wired into the code in so many places that it is difficult to change them all without introducing lots of bugs. Many, many books have been written about good program design; a classic is Abelson and Sussman's Structure and Interpretation of Computer Programs, which covers this and much other interesting material as well, and is highly recommended.

But that's not what I want to talk about here. What I want to talk about is what makes computer programming languages scalable or not. I want to show that the notion of scalability is every bit as valid when applied to programming languages as it is when applied to programs or algorithms. I'll also discuss several well-known and not-so-well-known programming languages from this perspective, give some concrete recommendations, and touch on some of the social factors that hinder progress in this field.

Overview

Within software engineering, programming (the implementation) is regarded as one phase in a software development process.

There is an ongoing debate on the extent to which the writing of programs is an art, a craft or an engineering discipline. Good programming is generally considered to be the measured application of all three, with the goal of producing an efficient and evolvable software solution (the criteria for "efficient" and "evolvable" vary considerably). The discipline differs from many other technical professions in that programmers generally do not need to be licensed or pass any standardized (or governmentally regulated) certification tests in order to call themselves "programmers" or even "software engineers." However, representing oneself as a "Professional Software Engineer" without a license from an accredited institution is illegal in many parts of the world.

Another ongoing debate is the extent to which the programming language used in writing computer programs affects the form that the final program takes. This debate is analogous to that surrounding the Sapir-Whorf hypothesis in linguistics, which postulates that a particular language's nature influences the habitual thought of its speakers. Different language patterns yield different patterns of thought. This idea challenges the possibility of representing the world perfectly with language, because it acknowledges that the mechanisms of any language condition the thoughts of its speaker community.

Said another way, programming is the craft of transforming requirements into something that a computer can execute.

Definitions

Traits often considered important for constituting a programming language:

* Function: A programming language is a language used to write computer programs, which involve a computer performing some kind of computation or algorithm and possibly controlling external devices such as printers, robots, and so on.

* Target: Programming languages differ from natural languages in that natural languages are only used for interaction between people, while programming languages also allow humans to communicate instructions to machines. Some programming languages are used by one device to control another. For example, PostScript programs are frequently created by another program to control a computer printer or display.

* Constructs: Programming languages may contain constructs for defining and manipulating data structures or controlling the flow of execution.

* Expressive power: The theory of computation classifies languages by the computations they are capable of expressing. All Turing complete languages can implement the same set of algorithms. ANSI/ISO SQL and Charity are examples of languages that are not Turing complete, yet often called programming languages.

Some authors restrict the term "programming language" to those languages that can express all possible algorithms; sometimes the term "computer language" is used for more limited artificial languages.

Non-computational languages, such as markup languages like HTML or formal grammars like BNF, are usually not considered programming languages. A programming language (which may or may not be Turing complete) may be embedded in these non-computational (host) languages.