Code Capsules

The Standard C Library, Part 2

Chuck Allison


Chuck Allison is a regular columnist with CUJ and a Senior Software Engineer in the Information and Communication Systems Department of the Church of Jesus Christ of Latter Day Saints in Salt Lake City. He has a B.S. and M.S. in mathematics, has been programming since 1975, and has been teaching and developing in C since 1984. His current interest is object-oriented technology and education. He is a member of X3J16, the ANSI C++ Standards Committee. Chuck can be reached on the Internet at 72640.1507@compuserve.com.

Last month I divided the fifteen headers of the Standard C Library into three groups, each representing a different level of mastery (see Table 1 through Table 3). I continue this month by exploring Group II.

Group II: For the "Polished" C Programmer

<assert.h>

Well-organized programs provide key points where you can make assertions, such as "the index points to the next open array element." It is important to test these assertions during development and to document them for the maintenance programmer (which, of course, is often yourself). ANSI C provides the assert macro for this purpose. You could represent the assertion above, for example, as

   #include <assert.h>
   . . .
   assert(nitems < MAXITEMS && i == nitems);
   . . .
If the condition holds, all is well and execution continues. Otherwise, assert prints a message containing the condition, the file name, and line number, and then calls abort to terminate the program.

Use assert to validate the internal logic of your program. If a certain thread of execution is supposed to be impossible, say so with the call assert(0), as in:

switch (color)
{
    case RED:
        . . .
    case BLUE:
        . . .
    case GREEN:
        . . .
    default:
        assert(0);
}
assert is also handy for validating parameters. A function that takes a string argument, for example, could do the following:

   char * f(char *s)
   {
       assert(s);
       . . .
   }
Assertions are for logic errors, of course, not for run-time errors. A logic error is one you could have avoided by correct design. For example, no action on the user's part should be able to create a null pointer — that's clearly your fault, so it is appropriate to use assert in such cases. On the other hand, a run-time error, such as a memory failure, requires more bulletproof exception handling.

When your code is ready for production, you should have caught all the bugs, so you should turn off assertion processing. To do so, you can either include the statement

   #define NDEBUG
before the first inclusion of <assert.h>, or define the macro on the command line if your compiler allows it (most use the -D switch). With NDEBUG defined, every assertion expands to a null expression that does not evaluate its argument, but the text remains in the code for documentation.
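
One consequence deserves a quick illustration: because an assertion vanishes when NDEBUG is defined, the expression inside assert should never do work the program depends on. The following is only a minimal sketch under that assumption; the names push_item, item_count, and item_at, and the capacity MAXITEMS, are hypothetical and chosen purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAXITEMS 4              /* hypothetical capacity */

static int items[MAXITEMS];
static size_t nitems = 0;       /* index of the next open element */

/* Append a value; a full array is a logic error in the caller */
void push_item(int value)
{
    assert(nitems < MAXITEMS);  /* checked in development builds only */
    items[nitems++] = value;    /* the side effect stays outside assert */
}

/* Number of items stored so far */
size_t item_count(void)
{
    return nitems;
}

/* Fetch an item; an out-of-range index is a logic error */
int item_at(size_t i)
{
    assert(i < nitems);
    return items[i];
}
```

Since the increment of nitems sits outside the assertion, push_item behaves the same whether or not assertions are compiled in.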

<limits.h>

By definition, portable programs do not depend in any way on the particulars of any one environment. Even assuming that all bytes consist of eight bits is not safe. The header <limits.h> defines the upper and lower bounds for all integer types (see Table 4). The program in Listing 1 toggles each bit in an integer on and off. It uses the value CHAR_BIT, defined in <limits.h>, as the number of bits in a byte, to determine the number of bits in an integer. As Listing 2 illustrates, you can also use <limits.h> to determine the most efficient data type to use for signed numeric values that must span a certain range.
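
Another use for these bounds is to detect integer overflow before it happens, rather than assuming a particular word size. The sketch below is one possible approach, not a library facility; the function names add_would_overflow and checked_add are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: nonzero if a + b would overflow an int */
int add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;   /* INT_MAX - b cannot overflow here */
    if (b < 0)
        return a < INT_MIN - b;   /* INT_MIN - b cannot overflow here */
    return 0;
}

/* Hypothetical helper: add two ints, treating overflow as a logic error */
int checked_add(int a, int b)
{
    assert(!add_would_overflow(a, b));
    return a + b;
}
```

The comparisons are arranged so that the test itself never overflows, whatever the actual values of INT_MIN and INT_MAX are on a given platform.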

<stddef.h>

The header <stddef.h> defines three type synonyms and two macros (see Table 5). When you subtract two pointers that refer to elements of the same array (or one position past the end of the array), you get back the difference of the two corresponding subscripts, which is the number of elements between the pointers. The type of the result is a signed integer type, typically an int or a long, whichever is appropriate for your memory model. <stddef.h> defines the appropriate type as ptrdiff_t.

The sizeof operator returns a value of type size_t. size_t is the unsigned integer type that can represent the size of the largest data object you can declare in your environment. Usually an unsigned int or unsigned long is sufficient to represent this size. size_t is usually the unsigned counterpart of the type used for ptrdiff_t. If you look through the headers in the Standard C Library, you'll find extensive use of type size_t. It is a good idea to use size_t for all array indices and for pointer arithmetic (i.e., adding an offset to a pointer), unless for some reason you need the ability to count down past zero, which unsigned integers can't do.
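
To make the two types concrete, here is a small sketch that walks an array backward with a size_t index and reports a position as a ptrdiff_t. The function name find_last is hypothetical, and returning -1 for "not found" is just one convention among several.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: index of the last occurrence of key
   in a[0..n-1], or -1 if key is absent */
ptrdiff_t find_last(const int *a, size_t n, int key)
{
    size_t i;

    assert(a != NULL || n == 0);

    /* "i-- > 0" tests before decrementing, so the loop body
       never sees an index below zero */
    for (i = n; i-- > 0; )
        if (a[i] == key)
            return &a[i] - a;   /* pointer difference: ptrdiff_t */
    return -1;
}
```

The loop shows one way to count down with an unsigned index without ever relying on a negative value.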

The type wchar_t is an implementation-defined integral type that holds a wide character, for representing characters beyond standard ASCII. You define wide character constants with a preceding L, as in:

    #include <stddef.h>
    wchar_t c = L'a';
    wchar_t *s = L"abcde";
As Listing 3 illustrates, my environment defines a wide character as a two-byte integer. This coincides nicely with the emerging 16-bit Unicode standard for international characters (see the sidebar "Character Sets"). The <stdlib.h> functions listed in Table 6 use type wchar_t. Amendment 1, an official addendum to Standard C accepted in 1994, defines many additional functions for handling wide and multi-byte characters. For more detailed information, see P. J. Plauger's columns in the April 1993 and May 1993 issues of CUJ.
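
As a rough sketch of how one of the Table 6 functions might be used, the following wraps mbstowcs so that the result is always terminated. The wrapper name widen and its convention of reporting a too-small buffer as a failure are assumptions made for this example, not part of the library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper: convert the multibyte string src into dst,
   which has room for n wide characters including the terminator.
   Returns the number of wide characters stored (excluding the
   terminator), or (size_t) -1 if src is invalid or does not fit. */
size_t widen(wchar_t *dst, size_t n, const char *src)
{
    size_t len;

    assert(dst != NULL && src != NULL && n > 0);

    len = mbstowcs(dst, src, n);
    if (len == (size_t) -1)
        return len;             /* invalid multibyte sequence */
    if (len == n)
        return (size_t) -1;     /* no room left for the terminator */
    return len;
}
```

In the default "C" locale, a plain ASCII string converts one byte to one wide character.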

The NULL macro is the universal zero-pointer constant, defined as one of 0, 0L, or (void *) 0. It is almost always a bad idea to assume any one of these definitions in a program — for safety, just include a header that defines NULL (stddef.h, stdio.h, stdlib.h, string.h, locale.h) and let the system figure out the correct representation. <stddef.h> is handy when you need only NULL defined in a translation unit and nothing else.

The offsetof macro returns the offset in bytes from the beginning of a structure to one of its members. Due to address alignment constraints, some implementations insert unused bytes after members in a structure, so you can't assume that the offset of a member is just the sum of the sizes of the members that precede it. For example, the program in Listing 4 exposes a one-byte gap in the Person structure after the name member, allowing the age member to start on a word boundary (a word is two bytes here). Use offsetof if you need an explicit pointer to a structure member:

struct Person p;
int *age_p;
age_p = (int*) ((char*)&p
   + offsetof(struct Person, age));
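
The same arithmetic is handy when code must reach a member knowing only its offset, as in table-driven record handling. The sketch below assumes the Person structure of Listing 4; the function get_int_field is hypothetical and assumes the caller passes the offset of a member that really is an int.

```c
#include <assert.h>
#include <stddef.h>

struct Person
{
    char name[15];
    int age;
};

/* Hypothetical helper: read the int member stored at the given
   byte offset within a record */
int get_int_field(const void *record, size_t offset)
{
    assert(record != NULL);
    return *(const int *) ((const char *) record + offset);
}
```

A call such as get_int_field(&p, offsetof(struct Person, age)) yields the same value as p.age, whatever padding the compiler inserts.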

<time.h>

Most environments provide some mechanism for keeping time. <time.h> provides the type clock_t, a numeric type that tracks processor time (see Table 7). The clock function returns an implementation-defined value of type clock_t that represents the current processor time. Unfortunately, what is meant by "processor time" varies across platforms, so clock by itself isn't very useful. You can, however, subtract one processor time from another and divide the difference by the constant CLOCKS_PER_SEC, which yields the number of seconds elapsed between two points in time. The program in Listing 5 through Listing 7 uses clock to implement such stopwatch functions.

The rest of the functions in <time.h> deal with calendar time. The time function returns a system-dependent encoding of the current date and time as type time_t (usually a long). The function localtime decodes a time_t into a struct tm (see Listing 8). The asctime function returns a text representation of a decoded time in a standard format, namely

   Mon Nov 28 14:59:03 1994
For more detail, see the Code Capsule "Time and Date Processing in C," CUJ, January 1993.
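
As a brief sketch of the calendar functions at work, the following uses mktime, which normalizes a struct tm and fills in the day of the week, to find the weekday of a given date. The function name day_of_week and its 1-based month argument are assumptions made for this example.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical helper: day of the week for a calendar date
   (0 = Sunday ... 6 = Saturday), or -1 if the date cannot be
   represented as a time_t */
int day_of_week(int year, int month, int day)
{
    struct tm t = {0};

    assert(month >= 1 && month <= 12);

    t.tm_year = year - 1900;    /* years since 1900 */
    t.tm_mon = month - 1;       /* months run 0 - 11 */
    t.tm_mday = day;
    t.tm_hour = 12;             /* noon keeps clear of DST shifts */
    t.tm_isdst = -1;            /* let the library decide */

    if (mktime(&t) == (time_t) -1)
        return -1;
    return t.tm_wday;           /* filled in by mktime */
}
```

Because mktime normalizes out-of-range fields, a day value past the end of the month simply rolls over into the next month.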

I'll conclude this series on the Standard C Library next month with a discussion of the functionality of the headers in Group III.

Character Sets

A script is a set of symbols used to convey textual information. There are over 30 major scripts in the world. Some scripts, such as Roman and Cyrillic, serve many languages. World scripts can be categorized according to the hierarchy in Figure 1.

Most scripts are alphabetic. The Han script used by Chinese, Japanese, and Korean, however, is an ideographic (or more accurately, logographic) script. Each Han character represents an object or concept — these languages have no notion of words composed of letters from an alphabet.

A character set is a collection of text symbols with an associated numerical encoding. The ASCII character set with which most of us are familiar maps the letters and numerals used in our culture to integers in the range [32, 126], with special control codes filling out the 7-bit range [0, 127]. As the 'A' in the acronym suggests, this is strictly an American standard. Moreover, this standard only specifies half of the 256 code points available in a single 8-bit byte.

There are a number of extended ASCII character sets that fill the upper range [128, 255] with graphics characters, accented letters, or non-Roman characters. Since 256 code points are not enough to cover even the Roman alphabets in use today, there are five separate, single-byte standards for applications that use Roman characters (see Figure 2).

The obvious disadvantage of single-byte character sets is the difficulty of processing data from distinct regions, such as Greek and Hebrew, in a single application. Single-byte encoding is wholly unfit for Chinese, Japanese, and Korean, since there are thousands of Han characters.

One way to increase the number of characters in a single encoding is to map characters to more than one byte. A multibyte character set maps a character to a variable-length sequence of one or more byte values. In one popular encoding scheme, if the most significant bit of a byte is zero, the character it represents is standard ASCII; if not, that byte and the next form a 16-bit code for a local character.

Multibyte encodings are storage efficient since they have no unused bytes, but they require special algorithms to compute indices into a string, or to find string length, since characters are represented as a variable number of bytes. To overcome string indexing problems, Standard C defines functions that process multibyte characters, and that convert multibyte strings into wide-character strings (i.e., strings of wchar_t, usually two-byte characters). Unfortunately, these multibyte and wide-character functions are commonly available only on XPG4-compliant UNIX platforms and Japanese platforms. The recently approved Amendment 1 to the C Standard defines many additional functions for processing sequences of multibyte and wide characters, and should entice U.S. vendors to step out of their cultural comfort zone.
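
To see why string length needs a special algorithm, consider the simplified scheme described above, in which a byte with its high bit set begins a two-byte character. The sketch below counts characters under that assumption only; real encodings such as Shift JIS have more intricate rules, and the function name dbcs_length is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the characters in a string where a
   byte with the high bit set, together with the byte after it,
   forms one two-byte character */
size_t dbcs_length(const unsigned char *s)
{
    size_t count = 0;

    assert(s != NULL);

    while (*s != '\0')
    {
        if ((*s & 0x80) && s[1] != '\0')
            s += 2;     /* lead byte plus trail byte */
        else
            s += 1;     /* ASCII, or a truncated lead byte at the end */
        ++count;
    }
    return count;
}
```

Note that the count of characters can be smaller than the count of bytes that strlen would report for the same data.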

Code Pages

Since standard ASCII consists of only 128 code points, there are 128 more waiting to be used in an eight-bit character encoding. It has been common practice to populate the upper 128 codes with characters suitable for local use. The combination of values 128-255 together with ASCII is called a code page under MS-DOS. The default code page for the IBM PC in the United States and much of Europe (#437) includes some box-drawing and other graphics characters, and Roman characters with diacritical marks. Other MS-DOS code pages include:

863 Canadian-French

850 Multi-Lingual (Latin-1)

865 Nordic

860 Portuguese

852 Slavic (Latin-2)

Non-U.S. versions of MS-DOS define other code pages. You can switch between code pages in MS-DOS applications, but not in U.S. Microsoft Windows (except in a DOS window). Only one code page remains active for Windows-hosted applications throughout an entire Windows session. Different versions of Windows have code pages appropriate for their region. For example, Windows-J for Japan uses a code page based on Shift-JIS. Windows 95 (a.k.a. Chicago) will support full code-page switching.

Since code pages use code points in the range [128, 255], it is important to avoid depending on or modifying the high-bit value in any byte of your program's data. A program that follows this discipline is called 8-bit clean.

Character Set Standards

Seven-bit ASCII is the world's most widely used character set. ISO 646 is essentially ASCII with a few codes subject to localization. For example, the currency symbol, code point 0x24, is '$' only in the United States, and is allowed to "float" to adhere to local conventions. ISO 646 is sometimes called the portable character set (PCS) and is the standard alphabet for programming languages.

ISO 8859 is a standard that takes advantage of all 256 single-byte code points to define nine eight-bit mappings for nine selected alphabets (see Figure 2). Each of these mappings retains ISO 646 as a subset, hence they differ mainly in the upper 128 code points. Some of these mappings are the basis for MS-DOS code pages.

There is no official ISO standard for multibyte character sets in the Far East. However, each region of the Far East has its own local (national) standards. PC-industry standards, based on national standards, are also in common use in the Far East. Examples include Eten, Big Five, and Shift JIS.

ISO 10646

To simplify the development of internationalized applications, ISO developed the Universal Multiple-Octet Coded Character Set (ISO 10646) to accommodate all characters from all significant modern languages in a single encoding. An octet is a contiguous, ordered collection of eight bits, which is a byte on most systems. ISO 10646 allows for 2,147,483,648 (2^31) characters, although only 34,168 have been defined. It is organized into 128 groups, each group containing 256 planes of 65,536 characters each (256 rows x 256 columns).

Any one of the 2^31 characters can be addressed by four octets, representing respectively the group, plane, row, and column of its location in the four-dimensional space. Consequently, ISO 10646 is a 32-bit character encoding. ASCII code points are a subset of ISO 10646: you just add leading zeroes to fill out 32 bits. For example, the encoding for the letter 'a' is 00000061 hexadecimal (i.e., Group 0, Plane 0, Row 0, Column 0x61).
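
The four-octet addressing can be sketched in a few lines of C. The structure ucs4_parts and the function ucs4_split are hypothetical names used only for illustration; unsigned long is used because it is guaranteed to hold at least 32 bits.

```c
#include <assert.h>

/* Hypothetical type: the four octets of a UCS-4 code */
struct ucs4_parts
{
    unsigned char group;
    unsigned char plane;
    unsigned char row;
    unsigned char column;
};

/* Hypothetical helper: split a UCS-4 code into its octets */
struct ucs4_parts ucs4_split(unsigned long code)
{
    struct ucs4_parts p;

    assert(code <= 0x7FFFFFFFUL);   /* only 2^31 positions exist */

    p.group  = (unsigned char) ((code >> 24) & 0xFF);
    p.plane  = (unsigned char) ((code >> 16) & 0xFF);
    p.row    = (unsigned char) ((code >> 8) & 0xFF);
    p.column = (unsigned char) (code & 0xFF);
    return p;
}
```

Splitting 00000061 hexadecimal this way gives Group 0, Plane 0, Row 0, Column 0x61, matching the example in the text.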

Plane 0 of Group 0 is the only one of the 32,768 planes that has been populated to date. It is called the Basic Multi-Lingual Plane (BMP). ISO 10646 allows conforming implementations to be BMP-based, i.e., requiring only two octets, representing the row and column within the BMP. The full four-octet form of encoding is called UCS-4, and the two-octet form UCS-2. Under UCS-2, therefore, the hexadecimal encoding for the letter 'a' is 0061 (Row 0, Column 0x61). Row 0 of the BMP is essentially ISO 8859-1 (Latin-1) with the U.S. dollar sign as the currency symbol.

ISO 10646 also defines combining characters, such as non-spacing diacritics. In conforming applications, combining characters always follow the base character that they modify. The UCS-2 encoding for á, then, consists of two 16-bit integers: 0061 0301 (0301 is called the non-spacing acute). For compatibility with existing character sets, there is also a single UCS-2 code point for á (00e1).

For the most part, only Roman characters have such dual representations. Some non-Roman languages, such as Arabic, Hindi, and Thai, also require the use of combining characters. ISO-10646 specifies three levels of conformance for tools and applications:

Level 1 combining characters not allowed

Level 2 combining characters allowed for Arabic, Hebrew, and Indic scripts only

Level 3 combining characters allowed with no restrictions

Unicode

Unicode is a 16-bit encoding scheme that supports most modern written languages. It began independently of ISO 10646, but with Unicode version 1.1, it is now a subset of 10646 (to be precise, it is UCS-2, Level 3). Unicode also defines mapping tables to translate Unicode characters to and from most national and international character set standards.

Some applications should readily convert to Unicode. Since ASCII is a subset, it is only necessary to change narrow (eight-bit) characters to wide characters. In C and C++, this means replacing char declarations with wchar_t. Some other character sets, such as Thai and Hangul, appear in the same relative order within Unicode, so you just need to add or subtract a fixed offset. Converting Han characters requires a lookup table.

Vendors are now beginning to support Unicode, and tools are available at both the operating system and API levels. Tools supporting the 32-bit encodings of ISO 10646 are not expected for many years — especially since no planes beyond the BMP have been populated.

Bibliography

"UCS Coexistence/Migration," X/Open Internal Report, Doc. No. SC22/WG20 N252, 1993.

The Unicode Standard. The Unicode Consortium, Addison-Wesley, 1991.

Katzner, Kenneth. The Languages of the World. 1986.

Martin, Sandra. "Internationalization Explored," UniForum, 1992.

Plauger, P. J., "Large Character Sets for C," Dr. Dobb's Journal, August 1992.

Figure 1 World Scripts

European
     Armenian, Cyrillic, Georgian, Greek, Roman
Indic
     Northern
          Bengali, Devanagari, Gujarati, Gurmukhi, Oriya
     Southern
          Kannada (Kanarese), Malayalam, Sinhalese, Tamil,
          Telugu
     Southeast
          Burmese, Khmer, Lao, Thai
     Central Asian
          Tibetan
Middle Eastern
     Arabic, Hebrew
East Asian (Oriental)
     Han, Bopomofo, Kana (Hiragana + Katakana), Hangul
Other Asian
     Lanna Thai, Mangyan, Mongolian, Naxi, Pollard, Pahawh
     Hmong, Tai L, Tai Na, Yi
African
     Ethiopian, Osmanya, Tifinagh, Vai
Native American
     Cree
Miscellaneous
     Chemistry, Mathematics, Publishing Symbols, IPA
     (International Phonetic Alphabet)

Figure 2 Eight-bit Character Set Standards

ISO 8859-1 (Latin-1)   Western European
ISO 8859-2 (Latin-2)   Eastern European
ISO 8859-3 (Latin-3)   Southeastern European
ISO 8859-4 (Latin-4)   Northern European
ISO 8859-5             Cyrillic
ISO 8859-6             Arabic
ISO 8859-7             Greek
ISO 8859-8             Hebrew
ISO 8859-9 (Latin-5)   Western European + Turkish

Table 1 Standard C Headers: Group I (required knowledge for every C programmer)

<ctype.h>   Character Handling
<stdio.h>   Input/Output
<stdlib.h>  Miscellaneous Utilities
<string.h>  Text Processing

Table 2 Standard C Headers: Group II (tools for the professional)

<assert.h>   Assertion Support for
             Defensive Programming
<limits.h>   System Parameters for
             Integer Arithmetic
<stddef.h>   Universal Types &
             Constants
<time.h>     Time Processing

Table 3 Standard C Headers: Group III (power at your fingertips when you need it)

<errno.h>    Error Detection
<float.h>    System Parameters for
             Real Arithmetic
<locale.h>   Cultural Adaptation
<math.h>     Mathematical Functions
<setjmp.h>   Non-local Branching
<signal.h>   Interrupt Handling (sort of)
<stdarg.h>   Variable-length Argument
             Lists

Table 4 Integer boundaries defined in <limits.h>

CHAR_BIT             8
SCHAR_MIN            -127
SCHAR_MAX            127
UCHAR_MAX            255

[if char == signed char]
CHAR_MIN             SCHAR_MIN
CHAR_MAX             SCHAR_MAX
[else]
CHAR_MIN             0
CHAR_MAX             UCHAR_MAX

MB_LEN_MAX           1

SHRT_MIN             -32767
SHRT_MAX             32767
USHRT_MAX            65535

INT_MIN              -32767
INT_MAX              32767
UINT_MAX             65535

LONG_MIN             -2147483647
LONG_MAX             2147483647
ULONG_MAX            4294967295

Table 5 Definitions in <stddef.h>

ptrdiff_t  type for pointer subtraction
size_t     type for sizeof
wchar_t    wide-character type
NULL       zero pointer
offsetof   offset in bytes of structure members

Table 6 <stdlib.h> functions that use wchar_t

mbtowc    translate multibyte character to wide character
wctomb    translate wide character to multibyte character
mbstowcs  translate multibyte string to wide string
wcstombs  translate wide string to multibyte string

Table 7 Definitions in <time.h>

                     Macros
----------------------------------------------------
NULL
CLOCKS_PER_SEC

                     Types
----------------------------------------------------
size_t
clock_t          system clock type
time_t           encoded time/date value
struct tm        decoded time/date components

                     Functions
----------------------------------------------------
difftime         duration between two times
mktime           normalizes a struct tm
time             retrieves current time encoding
asctime          text representation of a time value
ctime            text representation of an encoded time
gmtime           decodes into UTC time
localtime        decodes a time value
strftime         formats a decoded time

Listing 1 A use for CHAR_BIT

/* bit3.c: Toggle bits in a word */
#include <stdio.h>
#include <limits.h>

#define WORD     unsigned int
#define NBYTES   sizeof(WORD)
#define NBITS    (NBYTES * CHAR_BIT)
#define NXDIGITS (NBYTES * 2)

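int main(void)
{
   WORD n = 0;
   int i, j;

   for (j = 0; j < 2; ++j)
      for (i = 0; i < (int) NBITS; ++i)
      {
         /* Shift an unsigned 1 so the top bit is well defined */
         n ^= ((WORD) 1 << i);
         /* The * field width must be an int */
         printf("%0*X\n", (int) NXDIGITS, n);
      }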
      }

   return 0;
}

/* Output (on a system with 16-bit ints):
0001
0003
0007
000F
001F
003F
007F
00FF
01FF
03FF
07FF
0FFF
1FFF
3FFF
7FFF
FFFF
FFFE
FFFC
FFF8
FFF0
FFE0
FFC0
FF80
FF00
FE00
FC00
F800
F000
E000
C000
8000
0000
*/

/* End of File */

Listing 2 Uses <limits.h> to choose a suitable numeric type

/* range.c */
#include <stdio.h>
#include <limits.h>

#define LOWER_BOUND <your min here>
#define UPPER_BOUND <your max here>

/* Determine minimal numeric type for range */
#if LOWER_BOUND < LONG_MIN || LONG_MAX < UPPER_BOUND
   typedef double Num_t;
#elif LOWER_BOUND < INT_MIN || INT_MAX < UPPER_BOUND
   typedef long Num_t;
#elif LOWER_BOUND < SCHAR_MIN || SCHAR_MAX < UPPER_BOUND
   typedef int Num_t;
#else
   typedef signed char Num_t;
#endif

int main(void)
{
   Num_t x;

   /* sizeof yields a size_t, so convert it for printf */
   printf("sizeof(Num_t) == %lu\n", (unsigned long) sizeof x);
   return 0;
}

/* End of File */

Listing 3 Illustrates wide character strings

#include <stddef.h>
#include <stdio.h>

int main(void)
{
   char str[] = "hello";
   wchar_t wcs[] = L"hello";

   /* sizeof yields a size_t, so convert it for printf */
   printf("sizeof str = %lu\n", (unsigned long) sizeof str);
   printf("sizeof wcs = %lu\n", (unsigned long) sizeof wcs);
   return 0;
}

/* Output:
sizeof str = 6
sizeof wcs = 12
*/
/* End of File */

Listing 4 offsetof exposes alignment within a structure

#include <stddef.h>
#include <stdio.h>

struct Person
{
    char name[15];
    int age;
};

int main(void)
{
    /* offsetof yields a size_t, so convert it for printf */
    printf("%lu\n", (unsigned long) offsetof(struct Person, age));
    return 0;
}

/* Output:
16
*/

/* End of File */

Listing 5 Stopwatch Function Prototypes

/* timer.h: Stopwatch Functions */

void timer_reset(void);
void timer_wait(double secs);
double timer_elapsed(void);

/* End of File */

Listing 6 Implementation of Stopwatch Functions

/* timer.c:     Stopwatch Functions */

#include <time.h>
#include "timer.h"

static clock_t start = (clock_t) 0;

/* Reset the timer */
void timer_reset(void)
{
   start = clock();
}

/* Wait a number of seconds */
void timer_wait(double secs)
{
   clock_t stop = clock() +
      (clock_t) (secs * CLOCKS_PER_SEC);
   while (clock() < stop)
      ;   /* busy-wait until the deadline passes */
}

/* Compute elapsed time in seconds */
double timer_elapsed(void)
{
   return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* End of File */

Listing 7 Illustrates the stopwatch functions

/* t_timer.c:  Tests the stopwatch functions */

#include <stdio.h>
#include <limits.h>
#include "timer.h"

int main(void)
{
   long i;

   timer_reset();

   /* Delay */
   for (i = 0; i < LONG_MAX; ++i)
      ;

   /* Get elapsed time */
   printf("elapsed time: %f secs\n", timer_elapsed());
   return 0;
}

/* Output:
elapsed time: 565.070000 secs
*/

/* End of File */

Listing 8 The definition of struct tm from <time.h>

struct tm
{
  int  tm_sec;     /* seconds (0 - 59) */
  int  tm_min;     /* minutes (0 - 59) */
  int  tm_hour;    /* hours (0 - 23) */
  int  tm_mday;    /* day of month (1 - 31) */
  int  tm_mon;     /* month (0 - 11) */
  int  tm_year;    /* years since 1900 */
  int  tm_wday;    /* day of week (0 - 6) */
  int  tm_yday;    /* day of year (0 - 365) */
  int  tm_isdst;   /* daylight savings flag */
};