2.2. The alphabet of C
This is an interesting area; alphabets are important. All the same, this
is the one part of this chapter that you can read superficially first time
round without missing too much. Read it to make sure that you've seen the
contents once, and make a mental note to come back to it later on.
2.2.1. Basic Alphabet
Few computer languages bother to define their alphabet rigorously.
There's usually an assumption that the English alphabet augmented by a
sprinkling of more or less arbitrary punctuation symbols will be available
in every environment that is trying to support the language. The
assumption is not always borne out by experience. Older languages suffer
less from this sort of problem, but try sending C programs by Telex
or restrictive e-mail links and you'll understand the difficulty.
The Standard talks about two different character sets: the one that
programs are written in and the one that programs execute with. This is
basically to allow for different systems for compiling and execution,
which might use different ways of encoding their characters. It doesn't
actually matter a lot except when you are using character constants in the
preprocessor, where they may not have the same value as they do at
execution time. This behaviour is implementation-defined, so it must be
documented. Don't worry about it yet.
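To see what the issue looks like in practice, here is a small sketch of our own (not an example from the Standard): character constants may appear in preprocessor arithmetic, but the value used there is implementation-defined and need not agree with the value the same constant has at execution time.

#include <stdio.h>

/* Whether 'A' compares equal to 65 during preprocessing is
   implementation-defined; it need not match the comparison
   made at execution time, e.g. when cross-compiling. */
#if 'A' == 65
#define PREPROCESSOR_SAW_65 1
#else
#define PREPROCESSOR_SAW_65 0
#endif

int main(void)
{
    printf("preprocessor: 'A' == 65 is %d\n", PREPROCESSOR_SAW_65);
    printf("execution:    'A' == 65 is %d\n", 'A' == 65);
    return 0;
}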
The Standard requires that an alphabet of 96 symbols is available
for C as follows:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
! " # % & ' ( ) * + , - . /
: ; < = > ? [ \ ] ^ _ { | } ~
space, horizontal and vertical tab
form feed, newline
Table 2.1. The Alphabet of C
It turns out that most of the commonly used computer alphabets contain
all the symbols that are needed for C with a few notorious exceptions. The
C alphabetic characters shown below are missing from the International
Standards Organization ISO 646 standard 7-bit character set, which is
a subset of all the widely used computer alphabets.
# [ \ ] ^ { | } ~
To cater for systems that can't provide the full 96 characters
needed by C, the Standard specifies a method of using the
ISO 646 characters to represent the missing few; the technique is the
use of trigraphs.
2.2.2. Trigraphs
Trigraphs are sequences of three ISO 646 characters that get
treated as if they were one character in the C alphabet; all of the
trigraphs start with two question marks (??), which helps
to indicate that ‘something funny’ is going on. Table 2.2 below shows the trigraphs defined in the Standard.
C character    Trigraph
#              ??=
[              ??(
]              ??)
{              ??<
}              ??>
\              ??/
|              ??!
~              ??-
^              ??'
Table 2.2. Trigraphs
As an example, let's assume that your terminal doesn't have the
# symbol. To write the preprocessor line
#define MAX 32767
isn't possible; you must use trigraph notation instead:
??=define MAX 32767
Of course trigraphs will work even if you do have a
# symbol; they are there to help in difficult
circumstances more than to be used for routine programming.
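To give a fuller flavour, here is a complete program written with trigraphs throughout; this is an illustrative sketch of our own, but a conforming compiler must treat it exactly as if the real characters had been typed.

??=include <stdio.h>

int main(void)
??<
    /* ??/ stands for backslash, so ??/n below becomes the \n escape */
    printf("hello, world??/n");
    return 0;
??>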
The ? ‘binds to the right’, so in any sequence of
repeated question marks, only the two at the right could possibly be part
of a trigraph, depending on what comes next; this disposes of any
ambiguity.
It would be a mistake to assume that programs written to be highly
portable would use trigraphs ‘in case they had to be moved to systems
that only support ISO 646’. If your system can handle all 96
characters in the C alphabet, then that is what you should be using.
Trigraphs will only be seen in restricted environments, and it is
extremely simple to write a character-by-character translator between the
two representations. However, all compilers that conform to the Standard
will recognize trigraphs when they are seen.
Trigraph substitution is the very first operation that a compiler
performs on its input text.
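One consequence, shown in the sketch below (our own illustration), is that trigraphs are replaced even inside string literals and comments, since the substitution happens before those are recognized. A string that must genuinely contain two question marks followed by one of the characters in Table 2.2 needs the \? escape to break up the sequence.

#include <stdio.h>

int main(void)
{
    /* without the escape, ??! in the string would be
       replaced by | before the string was ever examined */
    printf("is that so?\?!\n");    /* prints: is that so??! */
    return 0;
}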
2.2.3. Multibyte Characters
Support for multibyte characters is new in the Standard. Why?
A very large proportion of day-to-day computing involves data that
represents text of one form or another. Until recently, the rather
chauvinist computing industry has assumed that it is adequate to provide
support for about a hundred or so printable characters (hence the
96 character alphabet of C), based on the requirements of the
English language—not surprising, since the bulk of the development
of commercial computing has been in the US market. This alphabet
(technically called the repertoire) fits conveniently into 7 or
8 bits of storage, which is why the US-ASCII character set standard
and the architecture of mini and microcomputers both give very heavy
emphasis to the use of 8-bit bytes as the basic unit of storage.
C also has a byte-oriented approach to data storage. The smallest
individual item of storage that can be directly used in C is the byte,
which is defined to be at least 8 bits in size. Older systems or
architectures that are not designed explicitly to support this may incur a
performance penalty when running C as a result, although there are not
many that find this a big problem.
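The guarantees just mentioned can be inspected directly. The following sketch (ours, not the Standard's) prints the number of bits in a byte, which <limits.h> provides as the macro CHAR_BIT and which must be at least 8:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    /* sizeof(char) is 1 by definition; CHAR_BIT is at least 8 */
    printf("bits per byte: %d\n", CHAR_BIT);
    printf("sizeof(char): %lu\n", (unsigned long)sizeof(char));
    return 0;
}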
Perhaps there was a time when the English alphabet was acceptable for
data processing applications worldwide—when computers were used in
environments where the users could be expected to adapt—but those
days are gone. Nowadays it is absolutely essential to provide for the
storage and processing of textual material in the native alphabet of
whoever wants to use the system. Most of the US and Western European
language requirements can be squeezed together into a character set that
still fits in 8 bits per character, but Asian and other languages
simply cannot.
There are two general ways of extending character sets. One is to use a
fixed number of bytes (often two) for every character. This is what the
wide character support in C is designed to do. The other method is to use
a shift-in shift-out coding scheme; this is popular over 8-bit
communication links. Imagine a stream of characters that looks like:
a b c <SI> a b g <SO> x y
where <SI> and <SO> mean
‘switch to Greek’ and ‘switch back to English’
respectively. A display device that agreed to use that method might well
then display a, b, c, alpha, beta, gamma, x and y. This is roughly the
scheme used by the shift-JIS Japanese standard, except that once the
shift-in has been seen, pairs of characters together are used as
the code for a single Japanese character. Alternative schemes exist which
use more than one shift-in character, but they are less common.
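The standard library provides a way of walking through such shift-encoded text without knowing the details of the encoding. The sketch below is our own illustration and assumes a locale whose multibyte encoding matches the data; mblen reports how many bytes the next character occupies:

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    const char *p = "some multibyte text"; /* data in the locale's encoding */
    int n;

    setlocale(LC_ALL, "");  /* select the native locale's encoding */
    mblen(NULL, 0);         /* put mblen into the initial shift state */
    while ((n = mblen(p, MB_CUR_MAX)) > 0) {
        printf("next character takes %d byte(s)\n", n);
        p += n;
    }
    return 0;
}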
The Standard now allows explicitly for the use of extended character
sets. Only the 96 characters defined earlier are used for the C part
of a program, but in comments, strings, character constants and header
names (these are really data, not part of the program as such) extended
characters are permitted if your environment supports them. The Standard
lays down a number of pretty obvious rules about how you are allowed to
use them which we will not repeat here. The most significant one is that a
byte whose value is zero is interpreted as a null character
irrespective of any shift state. That is important, because C uses a null
character to indicate the end of strings and many library functions rely
on it. An additional requirement is that multibyte sequences must start
and end in the initial shift state.
The char type is specified by the Standard as suitable to
hold the value of all of the characters in the ‘execution character
set’, which will be defined in your system's documentation. This means
that (in the example above) it could hold the value of
‘a’ or ‘b’ or even the ‘switch to
Greek’ character itself. Because of the shift-in shift-out mechanism,
there would be no difference between the value stored in a char that was
intended to represent ‘a’ and one representing the Greek ‘alpha’
character. To distinguish them would mean using a different representation,
probably needing more than 8 bits, which on many systems would be too big
for a char. That is why the Standard introduces the
wchar_t type. To use this, you must include the
<stddef.h> header, because wchar_t is simply defined as
an alternative name for one of C's other types. We discuss it further in
Section 2.8.
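As a taste of what is to come, here is a sketch of ours using only standard library calls: mbstowcs converts a multibyte string into an array of wchar_t, where each element holds one complete character however many bytes it took in multibyte form.

#include <stddef.h>   /* defines wchar_t */
#include <stdlib.h>   /* declares mbstowcs */
#include <locale.h>
#include <stdio.h>

int main(void)
{
    wchar_t wide[32];
    size_t n;

    setlocale(LC_ALL, "");
    /* convert; each element of wide now holds one whole character */
    n = mbstowcs(wide, "hello", 32);
    if (n != (size_t)-1)
        printf("converted %lu wide characters\n", (unsigned long)n);
    return 0;
}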
Summary
- C requires at least 96 characters in the source program character set.
- Not all character sets in common use can stretch to 96 characters; trigraphs allow the basic ISO 646 character set to be used (at a pinch).
- Multibyte character support has been added by the Standard, with support for:
  - Shift-encoded multibyte characters, which can be squeezed into ‘ordinary’ character arrays, so they still have char type.
  - Wide characters, each of which may use more storage than a regular character. These usually have a different type from char.