% This file is part of CWEBx. % This program by Marc van Leeuwen based on earlier versions by % D. E. Knuth., Silvio Levy and Frank Jensen. % It is distributed WITHOUT ANY WARRANTY, express or implied. % CWEB (Revision: 2.0) % Don Knuth, July 1990 % Version 3.x, Marc van Leeuwen, December 1993 % CWEBx 2+1.0, Marc van Leeuwen, August 1994 % CWEBx 3.0, Marc van Leeuwen, Januari 1995 % CWEBx 3.02, Marc van Leeuwen, April 1996 % CWEBx 3.02a, Marc van Leeuwen, September 1996 % CWEBx 3.03, Marc van Leeuwen, January 1998 (ctangle unchanged) % Copyright (C) 1987,1990 Silvio Levy and Donald E. Knuth % Copyright 1994 Marc A. A. van Leeuwen % Permission is granted to make and distribute verbatim copies of this % document provided that the copyright notice and this permission notice % are preserved on all copies. % Permission is granted to copy and distribute modified versions of this % document under the conditions for verbatim copying, provided that the % entire resulting derived work is distributed under the terms of a % permission notice identical to this one. \def\me.{CTANGLE} \def\myroots{\.{TANGLE}} % Here is TeX material that gets inserted after \input cwebxmac \def\ASCII.{\caps{ASCII}} @i intro.inc % Here is some text that matches the start of CWEAVE @d banner "This is CTANGLE (Version x"@+version_string@+")\n" @ The following parameters are specific to \.{CTANGLE}; those which are common to \.{CTANGLE} and \.{CWEAVE} are defined in the file \.{common.inc} and appear below. Some of these values have been decreased with respect to their earlier values which were sufficient in the original \.{WEB} to handle \TeX; a motivation is given at the common declarations. The macro |variant| is used in the include file \.{common.h}, and should therefore be defined early; it refers to the |struct text| that will be declared later. @d max_toks 150000L /* number of bytes in compressed \Cee~code */ @d max_texts 2500 /* number of replacement texts, must be less than 10240 */ @d max_files 50 /* number of auxiliary output files */ @d stack_size_max 50 /* module nesting level during output */ @d max_indent 1000 /* size of buffer for save indentation characters */ @) @d variant @;text @ The program is built from two compilation units, one with source file \.{common.w}, which contains a collection of routines and data shared between \.{CTANGLE} and \.{CWEAVE}, and a second with source file \.{ctangle.w} containing all code specific to \.{\me.}, and whose typeset version you are now reading. All compilation units of the \.{CWEB} system incorporate the file \.{common.inc} containing common declarations. \.{\me.} has a fairly straightforward outline. It operates in two phases: first it reads the source file, saving the \Cee~code in compressed form; then outputs the code, after shuffling it around. It can optionally be compiled with the preprocessor symbol |STAT| defined, in which case it will keep track of how much of \.{\me.}'s resources were actually used. One command line argument is specifically designated to modify the behaviour of \.{\me.}. Normally \.{\me.} will produce machine-oriented output that supplies \&{\#line} directives (to allow compilers and debuggers to locate the original source lines of statements), while preserving line breaks but otherwise ignoring the lay-out and comments in the source text. When the flag `\.{-l}' is supplied on the command line however, the output cannot be traced back easily to the source lines, but will be easier for humans to read: no \&{\#line} directives are produced, but original lay-out and comments are preserved as well as possible. The flag controlling this behaviour is called |line_output|; in the standard mode of operation we have |line_output==true|. @d line_output flags['l'] @c @< Function prototypes used but not defined in the shared code @>@; @< Typedef and enumeration declarations @>@; @< Prototypes @>@; @< Global variables @>@; int main (int argc, char** argv) { program=ctangle; line_output=true; common_init(argc,argv); @ if (show_banner) print(banner); /* print a ``banner line'' */ phase_one(); /* read all the user's text and compress it into |tok_mem| */ phase_two(); /* output the contents of the compressed tables */ wrap_up(); /* and exit gracefully */ return 0; } @i common.inc @* Data structures exclusive to {\tt \me.}. The basic mission of \.{\me.} is a relatively simple one, namely to separate the \Cee~code from the commentary, to forget about the latter, and to output the former almost verbatim, but plugging in the program text corresponding to a module each time a reference is made. To this end the bodies of all the sections are collected during Phase~I, and the proper links are laid, after which the code is linearised for output in a straightforward manner during Phase~II. The \Cee~code itself will be stored as a sequence of bytes (more precisely, of |eight_bits|) in a single large array |tok_mem|, much in the same way as the characters for identifiers and module names are stored in |byte_mem|; the encoding used will be described below. The individual sections are represented by structures of type |text|, which contain pointers to the |tok_mem| array as well as links by which they may be connected into lists; the same structures are also used to represent the replacement texts of macros. All |text| structures reside in an array |text_table|, which is comparable to |id_table| and |mod_table|. @= typedef struct text { eight_bits* tok_start; /* pointer into |tok_mem| */ struct text* text_link; /* relates replacement texts */ } text,* text_pointer; @ Since all allocation is during Phase~I and all deallocation during Phase~II, there is no reason for reclaiming memory when it is no longer needed. Hence allocation is strictly linear, and a single pointer into |tok_mem| for each |text| suffices, since a texts ends where the next one starts. The first position of |tok_mem| that is unoccupied by a replacement text is called |tok_ptr|, and the first unused location of |text_table| is called |text_ptr|. Whenever we are not in the process of adding tokens to |tok_mem|, we have the identity |tok_begin(text_ptr)==tok_ptr|. @d tok_begin(p) (p)->tok_start @d tok_end(p) ((p)+1)->tok_start @d text_table_end (&text_table[max_texts]) @d tok_mem_end (&tok_mem[max_toks]) @= text text_table[max_texts]; text_pointer text_ptr=&text_table[0]; /* first unused position in |text_table| */ eight_bits tok_mem[max_toks]; eight_bits *tok_ptr=&tok_mem[0]; /* first unused position in |tok_mem| */ @ Invariants must be initialised. @= tok_begin(text_ptr)=tok_ptr; @ The following macro and function are used to enter one- and two-byte tokens into |tok_mem| when a replacement text is being generated. @d store_byte(c) @+ if (tok_ptr==tok_mem_end) overflow("token"); @+ else *tok_ptr++=c@; @c void store_two_bytes (sixteen_bits x) { if (tok_ptr+2>tok_mem_end) overflow("token"); *tok_ptr++ = x >> 8; /* store high byte */ *tok_ptr++ = x & 0xFF; /* store low byte */ } @ When by successive calls of |store_byte| and |store_two_bytes| a complete replacement text has been stored, the following code makes it into an official text object by advancing |text_ptr| and storing the starting location of the next text to be generated into the |text_table| array. @< Wrap up the accumulated bytes into a completed text, pointed to by |cur_text| @>= { cur_text=text_ptr++; /* consolidate the replacement text */ if (text_ptr>=text_table_end) overflow("text"); tok_begin(text_ptr)=tok_ptr; /* mark end of replacement text */ } @ The |text_link| field links a section to a possible continuation of it (the next section with the same name, or for unnamed sections the next unnamed section). The set of all sections linked together in such a way will be termed a `module'. Apart from the value |NULL| used to indicate the end of such a list, we need two other distinguished values: |macro_flag|, which indicates that the text is a macro replacement text rather than a section body, and |header_flag| similarly indicating that the text is the (quoted) file name of a header file that is included. By a fortunate coincidence, there are two unused pointer values near the end of |text_table|: since the macro |tok_end| should work for all pointers to a replacement text, the last entry of |text_table| is never used to store a replacement text; its address can be used as one reserved pointer value, and the definition of~\Cee\ generously offers us the address of the non-existent entry following it as another valid pointer value. The pointer to the first section of the unnamed module appears in |text_root|, and the address of the final link of this module is recorded in |last_unnamed|. The named modules are accessed by the extra pointer in the |mod_info| structure for a module name, which is a |text_pointer| pointing to the body of the first section of that module; in~\.{\me.} this field will be called |equiv|. @d macro_flag (text_table_end-1) /* address of unused entry */ @d header_flag text_table_end /* address just beyond |text_table| */ @d next_sec(m) ((m)->text_link) /* next section of the same module, if any */ @d equiv equiv_or_xref /* info corresponding to names */ @= text_pointer text_root=NULL; text_pointer* last_unnamed=&text_root; /* where to link new unnamed sections */ @ Here are the functions specific to \.{\me.}, which are used by the common lookup routines. The function |names_match| decides whether a name of length~|l| starting at position |q| equals the identifier represented by~|x|. The parameter |dummy| is present only for compatibility with \.{CWEAVE}, where it contains the |ilk| code. The common lookup routines refer to separate routines |init_module_name| and |init_id_name| when the data structure grows. Actually |init_id_name| is called only when |program==cweave|, but we need to declare a dummy version so that the linker won't complain about its absence. @c boolean names_match (id_pointer x,char* q,int l,int dummy) {@; char* p=name_begin(x); while (--l>=0) if (*p++!=*q++) return false; return *p=='\0'; } void init_module_name(mod_pointer node) @+{@; node->equiv=NULL; } void init_id_name (id_pointer dummy,int ilk) @+ {} @* Tokens. Clearly, \.{\me.} must hold all the code for a complete program file at the end of Phase~I (not counting the possible multiple use of some modules, which will probably not make a great difference), and so it is worth while to use a compact representation for the internal storage of the \Cee~code. Using full-fledged data compression techniques would probably be overdoing things a bit, but a few simple methods will make a great difference. There are likely to be many identical occurrences of identifiers and keywords of the language, so it is efficient to collect them into a table and replace them in the internal format by references to the table; module names have to be entered in a table anyway for the purpose of associating occurrences of the same name with each other. Furthermore white space and comments can be removed upon storing the texts; any necessary white space can be reinserted during the output process (but note that newlines must be represented internally, in order to be able to match output lines with input lines). @ These compressed texts representing \Cee~code, while physically represented as a sequence of eight-bit bytes in |tok_mem|, logically consist of streams of `tokens', some of which occupy two or more consecutive byte positions, while others take just one byte. If the first byte is |a| and, in case |a>=0x80|, the next byte is~|b|, then the interpretation of the token is as follows. \Y \item{$\bullet$} |0<=a<0x80|: the token represents the character~|a|; \item{$\bullet$} |0x80<=a<0xA8|: the two-byte token represents the identifier with name |id_at((a-0x80)*@t$2^8$@>+b)|; \item{$\bullet$} |0xA8<=a<0xD0|: the two-byte token represents the module with name |mod_at((a-0xA8)*@t$2^8$@>+b)|; \item{$\bullet$} |0xD0<=a<0xF8|: this two-byte token marks the beginning of (a part of) the repacement text for the current module, defined in section number |(a-0xD0)*@t$2^8$@>+b|; \item{$\bullet$} |a==0xF8|: the two-byte token represents the (8-bit) character~|b|. \item{$\bullet$} |a==0xF9|: the token consists of 5 bytes, say |a|,~|b|, |c|, |d| and~|e|, and represents a \&{\#line} directive for line number |b*@t$2^8$@>+c| in the file |id_at(d*@t$2^8$@>+e)|. \item{$\bullet$} |a==0xFA|: the token consists of a single byte, and represents the location, indicated by the \:p control codes in the source file, where the preprocessor directives produced by \:d and \:h will be included in the output; this location will be marked by the presence of the virtual module called `|@p|' in the typeset output. \Y Any \Cee~token that consists of a single 7-bit character is represented by that character itself; in particular, a single-character identifier like `|x|' will be a one-byte token, while all longer identifiers will occupy two bytes. Some of the 7-bit codes will not be present however, since they are control codes and they do not represent a symbol useful for \Cee\ in the \caps{MIT} extension of \ASCII., so we can use them for special purposes. The following symbolic names are used: \yskip \hang |join| denotes the concatenation of adjacent items with no space or line breaks allowed between them (the \:\& operation of \.{CWEB}). \hang |verb_quote| denotes the beginning or end of a sequence of bytes that is to be copied verbatim to the output, i.e., no spaces should be inserted. Such sequences are used to transmit numerical constants and strings (in which case the ordinary string quotes appear within the |verb_quote| bytes), and to implement the \:= operation. @^ASCII code dependencies@> @d verb_quote 0x2 /* takes the place of extended \ASCII. \.{\char2} */ @d join 0x3 /* takes the place of extended \ASCII. \.{\char3} */ @< Global var... @>= boolean atp_seen=false; /* whether \:p occurs anywhere */ @* Phase II processing. We will start to explain the structure of the tokens lists built up in memory by considering how they are output during Phase~II; this is the simpler part of the program. We will proceed in top-down fashion for this part. @= void phase_two (void); /* output the contents of the compressed tables */ void output (text_pointer); /* recursively write out modules */ void output_preproc_directives(void); /* write out all \:d and \:h stuff */ void C_newline (void); /* send a newline to the \Cee~file */ void out_char (eight_bits); /* write a single token */ @ Here is the general control routine. After Phase~I, the quantity |output_file_count| will be equal to the number of extra output files to be written. There should at least be either an unnamed section or one that produces an additional output file, for otherwise there will be no output at all; if the is no unnamed section, \:d~and~\:h commands will only have effect if \:p occurs somewhere. @c void phase_two (void) { phase=2; if (text_root==NULL && output_file_count==0) @/{@; print("\n! No program text was specified."); mark_harmless(); } @.No program text...@> else { if (show_progress) { print("\nWriting the output file%s" @.Writing the output...@> ,(text_root!=NULL)+output_file_count>1 ? "s" : ""); if (text_root!=NULL) printf(" (%s):",C_file_name); update_terminal(); } if (text_root==NULL) C_file=NULL; else { open_output_file(); cur_line=1; if (!atp_seen) output_preproc_directives(); output(text_root); } @ print_progress("\nDone.\n"); } } @ \.{\me.} can write output on multiple files. If a module name is introduced in at least one place by \:( instead of \:<, it is treated as the name of a file. All these special module names are saved in the array |output_files|. @= mod_pointer output_file[max_files]; int output_file_count=0; @ To write the named output files, we proceed as for the unnamed module, the only subtlety is that we have to open each one after closing the previous one. However, if there is no main file, there is no file to close the first time, and this situation can be recognised because we have set |C_file=NULL| in that case. The length of a module name can be at most |longest_name| (which is rather more generous than allowed elsewhere for file names), so we need no test for overflowing |output_file_name|. @= { int i; char output_file_name[longest_name+1]; /* name of the file */ for (i=0; i if (C_file!=NULL) fclose(C_file); if (output_module->equiv==NULL) @/{@; print("\n! Module not present"); @.Module not present@> print_mod(output_module); err_print(""); } else if ((C_file=fopen(output_file_name,"w"))==NULL) @/{@; print("\n! Cannot open \"%s\" as output file",output_file_name); @.Cannot open output file@> err_print(""); } else { if (show_progress) print("\n(%s):",output_file_name); cur_line=1; output(output_module->equiv); } } } @ If the module name defining the output file contains any occurrences of `\.{@@@@}', they are undoubled in forming the actual file name; apart from this should be no control codes in the module name. @< Copy name of |output_module| into |output_file_name| @>= { char* p=output_file_name, *q=name_begin(output_module); while (*q!='\0') if ((*p++=*q++)=='@@') if (*q++!='@@') { print("\n! Illegal control code in file name"); @.Illegal control code in file name@> print_mod(output_module); err_print(""); } *p='\0'; } @ Here is how a file name gets added to the list of output files. @= { int i=0; while(i= typedef struct { text_pointer repl_field; /* replacement text of active section */ eight_bits *byte_field; /* present location within replacement text */ eight_bits *end_field; /* ending location of replacement text */ sixteen_bits sec_nr_field; /* section number */ sixteen_bits indent_field; /* amount of indentation */ } output_state, * stack_pointer; @ The global variable |stack_ptr| tells how many levels of output are currently in progress; the stack grows upwards with |stack_ptr| pointing to the first vacant location above the top of the stack. The entry |stack[0]| at the bottom of the stack is not used; when |stack_ptr| points to it then |cur_state| itself has become invalid and the output process is completed; this condition can be tested as |stack_empty()|. This somewhat strange use of the stack is forced upon us because the \Cee\ language does not guarantee that a pointer value at offset~$-1$ of an allocated array exists. Since the location |stack[0]| is vacant we might as well use it to store |cur_state| in. @d cur_state stack[0] @d stack_end (&stack[stack_size_max]) @d stack_empty() (stack_ptr==&stack[0]) @= output_state stack[stack_size_max]; /* info for non-current levels */ stack_pointer stack_ptr; /* points above top of output state stack */ @ When |line_output==false| we keep track of the current indentation level in an array of white space characters |indent_buffer|. Every time a newline is output, a sequence of |cur_ind| characters from |indent_buffer| are also output by invoking |put_indent|. When relevant non-newline characters are output, a white space counterpart of the character is appended to |indent_buffer| by invoking the macro |append_white|, whose argument should not involve any side effects. The current position of writing the characters is maintained in |ind_i|, and reset to |cur_ind| on output of a newline. @d C_printf(format,x) fprintf(C_file,format,x) @d C_putc(c) putc(c,C_file) @d put_indent() (indent_buffer[ind_i=cur_ind]='\0',C_printf("%s",indent_buffer)) @d append_white(c) if (ind_i>=max_indent) overflow("indent buffer"); else indent_buffer[ind_i++]= isspace(c) ? c : ' ' @; @< Global var... @>= char indent_buffer[max_indent]; /* white space characters to produce indentation */ sixteen_bits ind_i; /* number of those characters needed to reach current position */ @ When the replacement text for name |p| is to be inserted into the output, the following function is called to save the old level of output and get the new one going. The value of |cur_sec| is not set since its value is recorded in the first token to be read after this function is called. It is by calling |push_level| that the current value of |ind_i| becomes significant, by being copied to |cur_ind|. @c void push_level (mod_pointer p) /* suspends the current level */ { if (stack_ptr==stack_end) overflow("output stack"); *stack_ptr++=cur_state; @/cur_repl=p->equiv; cur_byte=tok_begin(cur_repl); cur_end=tok_end(cur_repl); @/cur_ind=ind_i; } @ When we come to the end of a replacement text, the function |continue_or_pop_level| does what its name suggests: it either moves to the continuation of this replacement text (i.e., to the next section of the same module) or returns the state to the most recently stacked level when a module is completed. In the latter case all the fields of |cur_state| are restored; in the former case, like for |push_level|, the value of |cur_sec| is not yet set, but will be adjusted immediately afterwards. @c void continue_or_pop_level (void) { if (cur_repl->text_link != NULL) /* then link to a continuation, staying on the same level */ {@; cur_repl=next_sec(cur_repl); cur_byte=tok_begin(cur_repl); cur_end=tok_end(cur_repl); } else if (--stack_ptr > &stack[0]) cur_state=*stack_ptr; /* otherwise |stack_empty()| holds */ ind_i=cur_ind; /* reset indentation to base of current level */ } @ The function |output| handles output of tokens by sending them to a lower level function |out_char|, until the condition |stack_empty()| holds. The task of |output| is to isolate and decode tokens, to handle stacking and unstacking as necessary, and to issue special calls to |out_char| to mark the beginning and end of replacements texts and to produce appropriate \&{\#line} directives. As its name suggests, |out_char| usually handles the output of single characters, but some special codes will make it perform a few other operations. Some of these, like |verb_quote| and compressed operator codes occur, as bytes with values below |0x80| inside the replacements texts, and these get no special treatment by |output|. A few other ones are generated explicitly by |output|, and these have values from |0x80| upwards, defined in the enumeration below. Information about the precise value of two-byte tokens is passed to |out_char| via the global variable |cur_val|. @= int cur_val; /* additional information corresponding to output token */ @~The code |identifier| is used for identifiers of length two or more, in which case |cur_val| indicates the identifier name. The codes |section_start| or |section_end| are sent at the beginning and end of parts of the replacement text defined in a one section, in which case |cur_val| is the section number; this number will be recorded in a comment to aid human readers of the \Cee~file produced. When a \&{\#line} directive is to be output, |line_mark| is sent to |out_char|, which will fetch the following 4~bytes (specifying the line number and file name) from the replacement text itself; this is the only occasion where |output| does not read a complete token before calling |out_char|. Any 8-bit characters that were escaped by a byte `|0xF8|' are sent directly to |out_char|; they do not interfere with the codes below since the input routine guarantees that they only occur between |verb_quote| tokens, at which times |out_char| does not honour any special codes. @< Typedef and enumeration... @>= enum {identifier=0x80, section_start, section_end, line_mark }; @ Due to the output stack operations, the function |output| can be written iteratively rather than recursively. A user who is not afraid of recursion may however easily rewrite this code into a recursive form, and dispose of the output stack and its routines altogether; the recursion depth will be equal to the level of nesting of modules, which is not likely to cause any problems in practice. We have retained the iterative version out of reverence for the author of the original |WEB| system; it has the additional advantage that, should there be a cycle in the directed graph of module references, a reasonable overflow message will be produced by |push_level|, whereas otherwise system stack overflow would occur, and it would depend on the \Cee~runtime support whether it is properly detected. @c void output (text_pointer repl) /* sends tokens to |out_char| */ { stack_ptr=&stack[1]; @/cur_repl=repl; cur_byte=tok_begin(cur_repl); cur_end=tok_end(cur_repl); cur_ind=ind_i=0; do if (cur_byte==cur_end) { cur_val=cur_sec; continue_or_pop_level(); out_char(section_end); /* output ending section number comment */ } else @< Output the token starting at |*cur_byte|, advancing |cur_byte| correspondingly @> while(!stack_empty()); C_newline(); } @ Most tokens lead to calling |out_char| with an appropriate value, after possibly setting |cur_val| and |cur_sec|. Exceptions are module names, which will push their replacement text on the stack but not produce any direct output, and tokens |0xFA| which will produce the collected preprocessor directives by calling |output_preproc_directives|, after ensuring that this will start on a fresh line of output. @< Output the token starting at |*cur_byte|... @>= { int a = *cur_byte++; if (a<0x80) out_char(a); /* single byte token */ else if (a>=0xF8) if (a<0xFA) out_char(a==0xF8 ? *cur_byte++ : line_mark); else {@; C_newline(); output_preproc_directives(); } /* |a==0xFA| */ else { cur_val=(((a-=0x80)%0x28)<<8)+*cur_byte++; switch (a/0x28) { case 0: out_char(identifier); break; case 1: @ @+ break; case 2: cur_sec=cur_val; out_char(section_start); /* set the correct section number and output comment */ } } } @ If any defining occurrence of a module name was encountered during the Phase~I, its replacement text will have been be linked into to |equiv| field of that name; otherwise that field will contain a null pointer and we must report an error. @= { mod_pointer mod_name=mod_at(cur_val); if (mod_name->equiv!=NULL) push_level(mod_name); else {@; print("\n! Module not present"); print_mod(mod_name); err_print(""); @.Module not present@> } } @ Output of macro definitions and inclusions of header files is sufficiently different from ordinary output that we use a special function |output_preproc_directives| for it. We bypass |output| and call |out_char| directly, since in these cases there can be no nesting of replacement texts. We go through the list of all replacement texts and copy the ones that refer to macros or header files, preceded respectively by \&{\#define} and \&{\#include}. During the output of a single directive any line breaks present in the source file will be protected by backslashes. We use local variables |p| and |end| instead of |cur_byte| and |cur_end|, since the current level of the stack is already in use if this function is called from |output|. @c void output_preproc_directives (void) { text_pointer repl,l; eight_bits* p, *end; protect=true; /* newlines should be preceded by |'\\'| */ for (repl=&text_table[0]; repltext_link)==macro_flag || l==header_flag) { p=tok_begin(repl); end=tok_end(repl); C_printf ("#%se ",l==macro_flag ? "defin" : "includ"); out_state=no_space; while (p C_newline(); /* this newline is not escaped */ } protect=false; } @ The situation here is simpler than in |output|, since module names and \:p control codes are forbidden in the replacement texts handled here, nor will codes for \.{\#line} directives or section number indications have been included in them during input. Should any such token nevertheless turn up, then one of the two calls of |confusion| below is invoked; the diagnostic is not quite exhaustive as to the possible cause in either case, but we do not wish to elaborate on code that will never be executed. @< Output preprocessor token... @>= { int a=*p++; if (a<0x80) out_char(a); /* single byte token */ else if (a>=0xF8) if (a==0xF8) out_char(*p++); @+ else confusion("`@@p' within macro"); @.`@@p' within macro@> else { cur_val=(((a-=0x80)%0x28)<<8)+*p++; if (a<0x28) out_char(identifier); @+ else confusion("module within macro"); @.module within macro@> } } @* Writing characters. The |output| routine above handles the global structure of output generation; we now present the routines that transform the lexical items produced by |output| into characters. First we give a function that is called whenever we want to finish off a line sent to the \Cee~file. It keeps |cur_line| equal to the number of the next line to be output, and displays a progress report every 100 lines. @c void C_newline (void) /* writes one line to output file */ { C_putc('\n'); if (!line_output) put_indent(); if (cur_line%100==0 && show_progress) { if (cur_line%500!=0) print("."); /* progress report */ else print(cur_line%2500==0 ? "%u\n" : "%u",cur_line); update_terminal(); } ++cur_line; } @ The function |out_char| must make sure that the output has the proper ``surface structure''. When |line_output==false| everything should essentially be written as it is stored, but when |line_output==true| only the essential tokens have been stored, and white space has to be inserted where appropriate. Spaces should not occur at certain places (e.g., not in the middle of a string or a constant or an identifier, not at a \:\& position where quantities are being joined together), while in other places they are required (e.g., between identifiers, which for \.{\me.} includes reserved words, and between certain operators that might otherwise be considered as a single token). Such surface structure can very nicely be obtained by attaching a small finite state machine to the output generator, recording information about the most recent token (as a matter of fact, even \TeX's fabulous formatting of math formulae is largely based on such a simple device). In our present case, the state of the output process is recorded in the global variable |out_state|. Furthermore there is a variable |protect| that is explicitly managed by the function calling |out_char| (notably by |output_preproc_directives|), being set to |true| whenever newlines are to be escaped by a backslash. @= eight_bits out_state; /* current status of partial output */ boolean protect; /* should newline characters be quoted? */ @ The output process can be in one of following states: \yskip\hang |no_space| means that no space will precede the following item. This state is set by \:\&, and after the output of anything that could not possibly combine into a larger token, such as punctuation. \yskip\hang |num_or_id| means that the last item in the buffer is a number or identifier, hence a blank space or line break must be inserted if the next item is also a number or identifier. \yskip\hang |operator| means that the last item in the buffer is an operator, hence a blank space or line break must be inserted if the next item is also an operator. \yskip\hang |literal| means we're copying only character tokens, and that they are to be output exactly as stored. This is the case during strings, verbatim constructions and numerical constants; the |verb_quote| character serves as a quoting mechanism which switches this state on~and~off. @< Typedef and enum... @>= enum { no_space, num_or_id, operator, literal}; @ The function |out_char| is a many-way switch on the kind of character or token transmitted to it. The |verb_quote| mechanism is used amongst others to encapsulate numeric constants, so we make the whole construction behave as an identifier with respect to spacing (the user can still defeat any spaces around it by means of \:\&). The code |line_mark| will only be stored (and output) if |line_output==true|. We do not anticipate the need of indentation across literals, operators and such, and assume that module names are preceded on the same line only by white space, identifiers (keywords) and things like braces; apart from identifiers all of these come through the |default| case below. Therefore we only include an invocation of |append_white| in that case and for identifiers; to handle all possible cases the assiduous user could add such invocations to all other calls of |C_putc| and |C_printf|. @c void out_char (eight_bits c) { if (out_state==literal) if (c==verb_quote) out_state=num_or_id; /* close literal mode */ else C_putc(c); /* write literal character, possibly 8-bit */ else if (isalnum(c)&&c<0x80 || c=='_') /* single character identifier or number */ { if (out_state==num_or_id && line_output) C_putc(' '); C_putc(c); out_state=num_or_id; if (!line_output) append_white(c); } else switch (c) { case verb_quote: if (out_state==num_or_id && line_output) C_putc(' '); out_state=literal; break; case join: out_state=no_space; break; case '\n': @< Output a newline @> @+ break; case identifier: @< Output an identifier @> @+ break; case section_start: if (line_output) C_printf("/*%d:*/",cur_val); @+ else C_newline(); out_state=no_space; break; case section_end: if (line_output) C_printf("/*:%d*/",cur_val); @+ else C_newline(); out_state=no_space; break; case line_mark: @< Output a \.{\#line} directive @> @+ break; @\@ @\@ default: C_putc(c); out_state=no_space; if (!line_output) append_white(c); } } @ Newlines are escaped whenever |protect| holds; a space is prepended so that the newline is effectively replaced by a space. This also means that the user who cannot break the habit of escaping newlines in multi-line preprocessor directives, even though this is unnecessary after \:d and \:h, is not punished for this by the creation of the sequence `\.{\\\\}' at the end of the lines, but rather gets `\.{\\\ \\}', which is harmless. @< Output a newline @>= { if (protect) {@; C_putc(' '); C_putc('\\'); } C_newline(); if (out_state!=literal) out_state=no_space; } @ Apart from alphanumeric characters and underscores, we allow identifiers to contain characters in the range |0x80<=c<=UCHAR_MAX|, if a translation into a string of legal characters is known to \.{\me.}; such translations can be specified by \:l directives in limbo. These translations are stored in an array |c_trans|. @d trans_limit 9 /* maximal length of a translation string */ @d trans_of(c) c_trans[(unsigned char)(c)-0x80] @d translation_exists(c) (trans_of(c)[0]!='\0') /* non-empty translation */ @< Global var... @>= char c_trans[UCHAR_MAX+1-0x80][trans_limit+1]; @~In compatibility mode, default translations exist of the form `\.X\\{NN}', where \\{NN} is the (two digit) upper case hexadecimal representation of the character in question. Otherwise the static initialisation will have made all translation strings empty. @< Set init... @>= if (compatibility_mode) @/{@; unsigned char c=UCHAR_MAX; do sprintf(trans_of(c),"X%X",c); while (--c>=0x80); } @ In case of identifiers the name indexed by |cur_val| is written out. The translation of characters in the range from |0x80| upwards is performed here, rather than on input of such characters, since this is slightly more convenient. The check that |translations_exists| was done on input, however. @< Output an identifier @>= { char* p=name_begin(id_at(cur_val)); int l=0; if (out_state==num_or_id && line_output) C_putc(' '); do if ((unsigned char)(*p)<0x80) {@; C_putc(*p);++l; } else {@; char* q=trans_of(*p); do {@; C_putc(*q); ++l; } while (*++q!='\0'); } while (*++p!='\0'); out_state=num_or_id; if (!line_output) do append_white(' '); while (--l>0); } @ As indicated above, a call |out_char(line_mark)| causes four further bytes to be fetched directly from the token memory, increasing |cur_byte| so that |output| will resume at the correct byte. @< Output a \.{\#line} directive @>= { sixteen_bits a; a=(*cur_byte++)<<8; a+=*cur_byte++; /* get the line number */ C_newline(); C_printf("#line %u \"",a); @/ a=(*cur_byte++)<<8; a+=*cur_byte++; /* get the file name index */ C_printf("%s\"",name_begin(id_at(a))); C_newline(); out_state=no_space; } @ In some cases when two operator symbols are adjacent, a space is required in between to avoid interpretation as a compound symbol. The language definition is not explicit about when such a space is mandatory, but examples are the somewhat unusual constructions | x /@, *p |, | a++ + b != a + ++b| and |sum = a - -b|; certain backward \Cee~compilers even think |x=-x| is ambiguous without a space before the minus sign. Rather than trying to detect precisely the problematic cases we take the conservative approach of always putting in a space, unless the second operator is~`\.='; this exception is made because we store operators like `\.{+=}' as two separate tokens, but they form a single token according to the \caps{ANSI/ISO} standard. Note that other multi-character operators like `\.{\&\&}' stay intact because they will be compressed on input, and therefore do not involve the present section. @= case '+': case '-': case '*': case '/': case '%': case'?': case '<': case '>': case '&': case '|': if (out_state==operator && line_output) C_putc(' '); /* fall through */ case '=': C_putc(c); out_state=operator; break; @ Compilers don't mind doing repetitive work, so we use a macro to produce several almost identical cases. Had the cases been much longer or more numerous however, then combining the cases and using an array of strings would have been preferable. @d comp_op(op) (C_printf(out_state==operator && line_output ? " %s" : "%s",op) ,out_state=operator) @= case plus_plus: comp_op("++"); break; case minus_minus: comp_op("--"); break; case minus_gt: comp_op("->"); break; case gt_gt: comp_op(">>"); break; case eq_eq: comp_op("=="); break; case lt_lt: comp_op("<<"); break; case gt_eq: comp_op(">="); break; case lt_eq: comp_op("<="); break; case not_eq: comp_op("!="); break; case and_and: comp_op("&&"); break; case or_or: comp_op("||"); break; @* Introduction to the phase I. We have now seen that \.{\me.} will be able to output the full \Cee\ program, if we can only get that program into the byte memory in the proper format. The input process is something like the output process in reverse, since we compress the text as we read it in and we expand it as we write it out (but of course, as any programmer knows, input requires a bit more work than the corresponding output). To preserve the symmetry we shall proceed in a bottom-up fashion in presenting this part of the program. The basic character input routines are defined in the code shared with \.{CWEAVE}. At the next higher level there are three main input routines. The most interesting is the one that gets the next token of a \Cee\ text; the other two are used to scan rapidly past \TeX\ text in the \.{CWEB} source code. One of the latter routines will jump to the next token that starts with `\.{@@}', and the other skips to the end of a \Cee~comment. @ Control codes in \.{CWEB} begin with `\.{@@}', and the next character identifies the code. Some of these are of interest only to \.{CWEAVE}, so \.{\me.} ignores them; the others are converted by \.{\me.} into internal code numbers. The code numbers have been chosen such that they are distinct from ordinary characters (so that a function can return either a character or a control code), and their ordering is such as to simplify the program logic; larger numbers are given to the control codes that denote more significant milestones. This order is only relevant starting from |format| however. The code \:> is treated as ignored because it should not occur in places where we are not searching for it. The following enumeration lists the non-character values that may be returned from~|get_next|; therefore the two values |id_code| and |constant| are included although they do not belong to any control code. @= enum @/ { ignore=0x80, /* control code of no interest to \.{\me.}, or \:> */ id_code, constant, /* token codes but not control codes */ verbatim, /* control code for \:= */ at_sign_image, /* control code for \:@@ */ join_code, /* control code for \:\& */ ord, /* control code for \:' */ control_text, /* control code for \:t, \:\^, etc. */ include_preproc, /* control code for \:p */ char_trans, /* control code for \:l */ format, /* control code for \:f */ definition, /* control code for \:d */ header, /* control code for \:h */ begin_C, /* control code for \:c */ module_name, /* control code for \:< or \:( */ new_section /* control code for \:\ , \:\~, or \:* */ }; @ The conversion from the character following `\.@@' to the corresponding numeric code is performed by means of the table~|ccode|. Since we don't know whether characters are signed or not, we always access |ccode| via the macro |code_of|. @d code_of(c) ccode[(unsigned char)(c)] @= eight_bits ccode[UCHAR_MAX+1]; /* meaning of a char following `\.@@' */ @~Here we initialise |ccode|. The code code for~\:v is the only case where a value stored in the |ccode| table is an ordinary character; this is done so that it may be used in the \Cee~part of a section to stand for~`|@v|', even though there is no need for using it there. @= { unsigned char c=0; do ccode[c] = isspace (c) ? new_section : ignore; while(c++!=UCHAR_MAX); @/ccode['v']=ccode['V']='|'; @/ccode['=']=verbatim; @/ccode['@@']=at_sign_image; @/ccode['&']=join_code; @/ccode['\'']=ord; @/ccode['^']=ccode['?']=ccode['.']=ccode[':']=ccode['#']= ccode['t']=ccode['T']=ccode['q']=ccode['Q']=control_text; @/ccode['p']=ccode['P']=include_preproc; @/ccode['l']=ccode['L']=char_trans; @/ccode['f']=ccode['F']=ccode['s']=ccode['S']=format; @/ccode['d']=ccode['D']=definition; @/ccode['h']=ccode['H']=header; @/ccode['c']=ccode['C']=begin_C; @/ccode['<']=ccode['(']=module_name; @/ccode['~']=ccode['*']=new_section; if (compatibility_mode) @< Reset some control codes to match Levy/Knuth \.{CWEB} @> } @ In \.{CWEBx} there are a few control codes that also exist in \LKC. but have a different meaning. In compatibility mode we reassign the meaning of these codes to that of \LKC., making their usual function inaccessible, since it is not intended that hybrid programs should be written using the codes of \LKC. together with features particular to \.{CWEBx}. For \.{\me.} the change of meaning of \:: (which is used instead of \:? in \LKC.) has no effect since it introduces a control text anyway. @< Reset some control codes... @>= { ccode['h']=ccode['H']=include_preproc; /* \:h means \:p */ ccode['p']=ccode['P']=begin_C; /* \:p means \:c */ ccode['#']=ignore; /* \:\# means \:) */ } @ The function |skip_ahead| reads through the input at fairly high speed until finding the next non-ignorable control code, which it returns. It is used to skip over \TeX~text, and text in limbo. It uses the fact that |get_line| places a~|' '| at~|*limit|, so that placing the sentinel cannot inadvertently create a token \:@@. @c eight_bits skip_ahead (void) /* skip to next control code */ { eight_bits c; /* control code found */ while (find_char()) { limit[1]='@@'; /* place a sentinel */ while (*loc++!='@@') {} if (loc<=limit && (c=code_of(*loc++))!=ignore) return c; } return new_section; } @ The function |skip_comment| reads through the input until finding the end-comment token `\.{*/}' or a newline, or in case of a \Cpp\ one-line comment (starting with `$/\!/$') just until finding a newline. The |one_liner| parameter tells whether the latter applies, and a boolean result is returned telling whether a newline in the middle of a comment was scanned (which can only happen if |one_liner==false|). Returning such a newline is necessary so that the each newline in the \Cee\ part of a section may be copied to the output; otherwise the \&{\#line} commands inserted into the \Cee\ file by the output routines become useless. When |line_output==false|, all characters encountered are also stored, so that they can be output along with the \Cee~code. The function gives an error message if it encounters a sequence `\.{/*}', since this almost certainly means the user has forgotten to close the comment. In this way we avoid the most pernicious consequence of \Cee's rule that nested comments are forbidden, namely that such an omission often leads to a syntactically correct, but unintended program. This means that one cannot write either `\.{/*}' or `\.{*/}' in a comment; this difficulty can usually be circumvented by inserting something like `\.{\{\}}' between the two characters. As a safety measure, |skip_comment| comes to an end if it runs into the next section, and prints an error message; it is slightly non-trivial that this test does not interfere with searching for the end of the comment. If a module name occurs in a comment (presumably in `\pb'), it is scanned by the normal routines that will be given later: this could be the only place where the name is spelled out in full, and/or using \:(. @c boolean skip_comment (boolean one_liner) /* skips over comments */ { char c; /* current character */ do { if (loc>=limit) if (one_liner) return false; else if (get_line ()) return true; else {@; err_print("! Input ended in mid-comment"); return false; } @.Input ended in mid-comment@> if ((c=*loc++)=='/' && *loc=='*') err_print("! `/*' inside comment, did you forget `*/' before? "); @.`/*' inside comment...@> if (c=='@@') { eight_bits cc=code_of(*loc++); if (cc==new_section) /* watch out for \:\ , \:\~, and \:* */ @/{@; err_print("! Section ended in mid-comment"); loc-=2; return false; } @.Section ended in mid-comment@> if (cc==module_name) @/{@; @< Scan the module name... @> continue; } } if (!line_output) {@; store_byte(c); if (c=='@@' && (c=loc[-1])!='@@') store_byte(c); } } while (c!='*' || *loc!='/' || one_liner); ++loc; @+ if (!line_output) store_byte('/'); return false; } @* Getting the next token. We now come to the function |get_next|, which does more extensive scanning. It will move the input pointer over a single token of input, and if that token consists of a single character that is not an identifier or a number, it will return that character; certain two-character operators are also returned as a single character representing the operator. In the case of longer tokens, it will return one of the codes of the enumeration given earlier, after possibly setting some global variables in such a way that the complete token can be retrieved. For numeric constants |id_first| and |id_loc| will be made to point to the beginning and end of the token within~|buffer|. For string constants, which need not reside in a single line of input, the characters are copied into the separate buffer |mod_text| and this is where |id_first| and |id_loc| will then point. Identifiers and module names are even looked up, and |cur_id| or |cur_mod| is made to point to the name found. In the case of single-character identifiers or bad module names, which are not entered into the name tables, |cur_id| respectively |cur_mod| is set to~|NULL|; in the former case |id_first| is made to point to the character that forms the identifier. @= id_pointer cur_id; /* identifier just scanned */ mod_pointer cur_mod; /* module name just scanned */ @ As one might expect, |get_next| consists mostly of a big switch that branches to the various special cases that can arise. The static flag |preprocessing| is raised while we are scanning an explicit preprocessor directive (i.e., not one introduced by \:d or \:h). The static flag |comment_continues| is raised when |get_next| must return a newline while scanning a comment, so that it will remember what it was doing the next time it is called. Although we allow 8-bit character input, any characters |c>=0x80| should be contained in a comment or a character or string constant, or in case |translation_exists(c)| an identifier, since they cannot otherwise be part of a valid \Cee~symbol. @c eight_bits get_next (void) /* produces the next input token */ { static boolean preprocessing=false; /* did this line start with `\.\#'? */ static boolean comment_continues=false; /* were we scanning a comment? */ eight_bits c; /* the current character */ restart: @< Handle cases involving newlines and comments; if a token is found |return| it, or |goto restart| if it is ignored; otherwise read the next character into |c| @> if (c=='L' && (*loc=='\'' || *loc=='\"')) {@; get_string(); return constant; } if (c<0x80 ? isalpha(c) || c=='_' : translation_exists(c)) @/{@; @< Get an identifier @> return id_code; } if (c>=0x80) {@; err_print("! Illegal 8-bit character"); goto restart; } @.Illegal 8-bit character@> if (isdigit(c) || c=='.' && isdigit((eight_bits)*loc)) @/{@; @< Get a numeric constant @> return constant; } switch(c) { case '\'': case '"': get_string(); return constant; case '@@': @< Get a control code and possibly a complete module name, and either |return| it, or |goto restart| if control code is |ignore| @> @\@< In applicable cases |return| compressed two-symbol operator @> } return c; } @ The flag |preprocessing| raised when the first character of a line is~`\.\#', and is lowered at a non-escaped newline; while it is raised spaces are not ignored. The test whether a newline is escaped is not entirely sound, since a final `\.{\\}' might be part of a \:\\ control code or of a \Cpp\ one-line comment (after `$/\!/$'); a completely correct check would be more difficult, and leaving |preprocessing| raised a bit too long does no real harm. Contrary to |preprocessing|, the flag |comment_continues| will always be given a new value on the first call to |get_next| after it is set, and therefore can remain set only if |skip_comment| finds another newline. @< Handle cases involving newlines... @>= { if (loc>=limit) { if (preprocessing && limit>buffer && limit[-1]!='\\') preprocessing=false; return get_line() ? '\n' : new_section; } if (comment_continues ||(c=*loc++)=='/' && (*loc=='*' || C_plus_plus && *loc=='/')) @< Scan to the end of the comment or of the current line, setting |comment_continues| to indicate which case applies; in the former case |goto restart|, in the latter |return '\n'| @> if (isspace(c)) if (line_output) if (preprocessing) return ' '; @+ else goto restart; /* ignore other spaces */ else return c; /* when |line_output==false| preserve white space faithfully */ if (c=='#' && loc==buffer+1) preprocessing=true; } @ When we come to this code |loc| either points to the beginning of a multi-line comment or to the second character of `\.{//}' or `\.{/*}'; in the latter case we must advance |loc| before looking for `\.{*/}', lest `\.{/*/}' would be scanned as a complete comment. Another subtlety arises when both |preprocessing| and |comment_continues| are raised, which means that a preprocessor line has ended inside a comment; although in \Cee\ this means that the preprocessor line continues, \.{\me.} will terminate it because the comment is removed and a non-escaped newline remains. The warning message below can be ignored by the user unless something follows the comment, but the proper way to resolve this is to start the comment on the line after the preprocessor line, which avoids all confusion. @< Scan to the end of the comment... @>= { boolean one_liner=false; if (!comment_continues) { if (!line_output) {@; store_byte('/'); store_byte(*loc); } one_liner=*loc++=='/'; /*/ record kind of comment, and advance to comment proper {/}*/ } else if (preprocessing) { print("\nWarning: Multi-line comment in preprocessor line"); @.Multi-line comment...@> preprocessing=false; mark_harmless(); } if (comment_continues=skip_comment(one_liner)) /* scan to end of comment or newline */ return '\n'; /* comment contains a newline; |get_line| has been called */ else goto restart; /* comment complete, get next token */ } @ The following code assigns values to the combinations `\.{++}', `\.{--}', `\.{->}', `\.{>=}', `\.{<=}', `\.{==}', `\.{<<}', `\.{>>}', `\.{!=}', `\.{\v\v}' and `\.{\&\&}'. The compound assignment operators (e.g., `\.{+=}') are separate tokens, according to {\sl The C Reference Manual\/} (so there may be white space and even comments in between), but according to the \caps{ANSI/ISO} standard they are single tokens. We have not allocated single-byte codes for them, so we scan them as two separate symbols; these will be adjacent on output. Therefore, as far as \.{\me.} is concerned, one may write a separation between the two characters even if the \Cee-compiler does not accept this; however, this practice should still be avoided because it will confuse the parser of |CWEAVE|, leading to ill-formatted output. @d compress(char2,code) @+ if (*loc==char2) return ++loc,code @; @< In applicable cases... @>= case '+': compress('+',plus_plus); break; case '-': compress('-',minus_minus); compress('>',minus_gt); break; case '=': compress('=',eq_eq); break; case '>': compress('=',gt_eq); compress('>',gt_gt); break; case '<': compress('=',lt_eq); compress('<',lt_lt); break; case '&': compress('&',and_and); break; case '|': compress('|',or_or); break; case '!': compress('=',not_eq); break; @ Identifiers are looked up unless they consist of a single character |c<0x80|, in which case that character can be found as |*id_first|; there is no need to set |id_loc| in either case. In case any 8-bit characters appear in identifiers, they are only stored in the name table, and cannot interfere with any of the special codes we use. @< Get an identifier @>= { id_first=--loc; /* mark beginning of identifier */ do c=*++loc; while (c<0x80 ? isalnum(c) || c=='_' : translation_exists(c)); cur_id= loc==id_first+1 && (eight_bits)(*id_first)<0x80 ? NULL : id_lookup(id_first,loc,0); } @ Scanning numeric constants is straightforward. The only subtle point is that we must not call functions like |isdigit| with an argument of type |char|, like |*loc|, since that may fail for 8-bits characters if |char| is a signed type; therefore we copy characters into |c|, which has type |unsigned char|, before applying such functions. We don't mind \.{8}'s and \.{9}'s appearing in octal constants (although the \Cee~compiler will), so there is no need here to distinguish between octal and decimal constants. @< Get a numeric constant @>= { if (*(id_first=loc-1)=='0' && tolower((eight_bits)*loc)=='x') /* hex constant */ do c=*++loc; while (isxdigit(c)); else /* octal, decimal or float constant */ { while (isdigit(c)) c=*loc++; if (c=='.') @+ do c=*loc++; while (isdigit(c)); if (tolower(c)=='e') /* floating point constant with exponent */ { if ((c=*loc)=='+' || c=='-') c=*++loc; while (isdigit(c)) c=*++loc; } else --loc; /* back up to first character after constant */ } while (isalpha(c)) c=*++loc; /* incorporate any `\.{U}', `\.{L}', or `\.{F}' suffixes */ id_loc=loc; } @ After an `\.@@' sign has been scanned, the next character tells us whether there is more work to do. Although control codes like \:\^ and \:t are ignored by \.{\me.}, their control text has to be skipped, which is performed by |get_control_text|. Verbatim constructions also use |get_control_text|, but then return the |verbatim| control code, unless their control text was empty, in which case they are ignored. @< Get a control code... @>= { eight_bits cc=code_of(*loc++); switch(cc) { case ignore: goto restart; case control_text: get_control_text(); goto restart; case verbatim: if (get_control_text()) goto restart; @+ else break; case ord: @ @+ break; case module_name: @ } return cc; } @ Before we scan a module name, we record whether it started with \:(, so that we may record it afterwards as an auxiliary file name if necessary. @= { boolean file_module= loc[-1]=='('; /* does this module define an output file? */ cur_mod=get_module_name(); if (file_module && cur_mod!=NULL) @ } @ There is no reason why we should allow a newline within an \ASCII. constant, even if it is escaped. @= id_first=loc; /* first character after opening quote */ while (*loc!='\'') { if (*loc++=='\\') loc++; /* accept any character following backslash */ if (loc>=limit) {@; err_print("! ASCII constant didn't end"); break; } @.ASCII constant didn't end@> } id_loc=loc++; /* move past closing quote */ @* Scanning a piece of \Cee\ text. Having acquired the skills of skipping over text and isolating tokens that really matter, we are now ready to tackle complete pieces of \Cee~text. The rules for generating the replacement texts for preprocessor directives following \:d and \:h and for bodies of sections are almost identical; the only differences are that \yskip \item{a)} Module names are not possible in preprocessor directives; indeed, the appearance of a module name terminates such directives and starts the \Cee~part of the section. \item{b)} The codes \:d, \:h, \:f, and \:c are not allowed in a \Cee~part, while they terminate preprocessor directives. \item{c)} The code \:p is not allowed inside preprocessor directives. \yskip \noindent Therefore there is a single function |scan_repl| with a parameter |context| indicating which of the two kinds of replacement texts is being scanned; its value is either |preproc_directive| or |section_body|. After |scan_repl| has acted, |cur_text| will point to the replacement text just generated, and |next_control| will contain the control code that terminated the activity. @d preproc_directive 0 @d section_body 1 @= text_pointer cur_text; /* replacement text formed by |scan_repl| */ eight_bits next_control; /* control code which has already been scanned */ @ We will try to reduce storage requirements by discarding any trailing newlines both after preprocessor directives and section bodies. This does not affect the line sequencing indicated by \&{\#line} directives in either case, since in the former case such directives are not issued in the first place, and in the latter case a new \&{\#line} directive will be issued before any further code is added. Note that we cannot reliably recognise newlines in a right-to-left motion when the end of a replacement text is reached, since there is a mixture of one-byte, two-byte and five-byte tokens. Therefore, instead of tracing backwards, we mark the position of the last non-discardable token every time one is appended, and back up to the last position recorded when reaching the end of a replacement text. Given this mechanism, it is easy to discard other things than newlines, such as \&{\#line} directives that apply to no lines at all, and to back up at other occasions where the line sequencing is interrupted as well, like before applied module names. This is also a good place to add a simple but useful check, namely that in each section and macro replacement text the parentheses and braces should be matched (failure in this respect can lead to syntax errors that are very hard to locate in the \.{CWEB} source, since the real problem may be buried inside a module name or macro). @c void scan_repl (eight_bits context) /* creates a replacement text */ { eight_bits a; /* the current token */ eight_bits* keep=tok_ptr; /* non-discardable stuff up to this point */ int brace_level=0, par_level=0; if (context==section_body) @< Insert the line number into |tok_mem| @> do @< Scan and store a token, setting |keep=tok_ptr| afterwards unless it is discardable; when the replacement text has ended |goto done| @> while (true); done: tok_ptr=keep; /* backup over trailing newlines and \&{\#line} marks */ next_control=a; /* this control code must be reconsidered */ if (par_level>0 || brace_level>0) @< Report unmatched opening symbols @> @< Wrap up the accumulated bytes into a completed text, pointed to by |cur_text| @> } @ When the function |scan_repl| is called we have either just scanned \:h or the identifier following \:d or a defining occurrence of a module name (with the following `\.='); in all cases we start by calling |get_next|. @< Scan and store a token... @>= { switch (a=get_next()) { @\@< Cases where |a| is a special token (|id_code|, |module_name|, etc.): either store the corresponding bytes and |continue|, or |goto done| if |a| signals the end of this replacement text@> case format: case definition: case header: case begin_C: if (context==preproc_directive) goto done; err_print ("! `@@f', `@@d', `@@h', and `@@c' are ignored in section body"); @.`@@f', `@@d', ... are ignored...@> continue; case new_section: goto done; @\@< Cases that keep track of |par_level| and |brace_level| @> case '\n': store_byte('\n'); if (context==section_body && print_where) /* input file was switched */ { tok_ptr=keep; /* back up over discardable items */ @< Insert the line number into |tok_mem| @> } continue; case join_code: a=join; } store_byte(a); keep=tok_ptr; /* mark as non-discardable */ } @ We test matching parentheses and braces separately because it is easier, and in the unlikely case that a mismatch should go undetected the compiler will surely report an error that is easy to locate. All parentheses and braces that are returned from |get_next| are real ones, i.e., not part of a string or comment. @< Cases that keep track of |par_level| and |brace_level| @>= case '(': ++par_level; break; case ')': if (par_level<=0) {@; err_print("! Unmatched closing parenthesis"); continue; } @.Unmatched closing...@> --par_level; break; case '{': ++brace_level; break; case '}': if (brace_level<=0) {@; err_print("! Unmatched closing brace"); continue; } --brace_level; break; @ The most difficult part of matching parentheses and braces is reporting any errors at the end of the macro or section in proper English. We supply the missing closing symbols so that there is a slightly larger chance that the produced \Cee~code will compile. Note that there is no need to update |keep| here, since no backing up will follow. @< Report unmatched opening symbols @>= { char *p, *s; int l; if (par_level>0) l=par_level, s="parenthes", p= l>1 ? "es" : "is"; else l=brace_level, s="brace", p=l>1 ? "s" : ""; print("\n! There %s %d unclosed %s%s" ,par_level+brace_level>1 ? "are" : "is", l, s, p ); @.There are unclosed...@> if (par_level>0 && brace_level>0) print(" and %d unclosed brace%s" , brace_level, brace_level>1 ? "s" : ""); print(" in the previous "); err_print(context==preproc_directive ? "macro" : "section"); while (--par_level>=0) store_byte(')'); while (--brace_level>=0) store_byte('}'); } @ When a module reference is inserted the line sequencing is interrupted, so we must restore it afterwards; on the other hand newlines before the insertion can be discarded. @< Cases where |a| is...@>= case id_code: @< Store the identifier just scanned @> @+ keep=tok_ptr; continue; case module_name: if (context==preproc_directive) goto done; if (cur_mod!=NULL) /* don't record bad module name */ { sixteen_bits n = mod_index(cur_mod); /* index of module name */ @< If this looks like a defining occurrence, report a runaway section @> if (line_output) tok_ptr=keep; /* back up */ store_two_bytes(0xA800+n); keep=tok_ptr; /* store reference to module name */ @< Insert the line number into |tok_mem| @> /* to get in phase after module insertion */ } continue; case constant: case verbatim: @/@< Copy a constant or verbatim construction @> @+ keep=tok_ptr; continue; case ord: @ @+ keep=tok_ptr; continue; case include_preproc: if (context==preproc_directive) err_print("! `@@p' is forbidden in preprocessor directive"); @.`@@p' is forbidden...@> else { if (line_output) tok_ptr=keep; store_byte(0xFA); atp_seen=true; keep=tok_ptr; @< Insert the line number... @> } continue; case char_trans: err_print("! `@@l' is only allowed in limbo"); @.`@@l' is only allowed in limbo@> continue; @ Identifiers of length~$1$ have no name pointer, but |id_first| still points at the relevant character. @< Store the identifier... @>= if (cur_id==NULL) store_byte(*id_first); /* single-character identifier */ else store_two_bytes(0x8000+id_index(cur_id)); @ Here we look to see whether a module name is followed by~`\.{=}' or~`\.{+=}' (whether another `\.{=}' follows is irrelevant at this point), in which case we assume that introduces the \Cee-part of a new section, of which we somehow missed the opening \:\ . Note that this need not limit the use of module names standing for single expressions, even if followed by the `\.{=}' or the `\.{+=}' operator, since such names should be followed by `\.{@@;@@;}' for proper treatment by \.{CWEAVE}. @< If this looks like a defining occurrence, report... @>= { char *p=loc; while (*p==' ' && p } @ Constants of length one can only be single digit numbers, so they can be stored in a single (unquoted) byte like single letter identifiers. Inside strings (and verbatim constructions) {\it any\/} character may in principle show up, including the character |verb_quote| used to delimit the verbatim storage of the string. Even if this is hardly a sign of good programming, the effect on output would be so disastrous that some counter measure is called for: we replace |verb_quote| by its three-digit octal escape code, which the user should have written to begin with. Because a \:i line or a change file match can occur inside a string, we must test |print_where| at the end of a string. @< Copy a constant... @>= if (id_loc==id_first+1) store_byte(*id_first); /* single digit number */ else { store_byte(verb_quote); /* opening */ do if (*id_first==verb_quote) { ++id_first; store_byte('\\'); store_byte('0'+(verb_quote>>6)); store_byte('0'+((verb_quote>>3)&7)); store_byte('0'+(verb_quote&7)); } else { if ((eight_bits)(*id_first)>=0x80) store_byte(0xF8); /* quote 8-bit characters */ store_byte(*id_first++); } while (id_first } @ This section should be rewritten on machines that don't use \ASCII. code internally. It is probably best to use a table indexed by characters and containing \ASCII. codes for this purpose (like the |xord| array in \TeX). Incidentally, it would be tempting to write statements like `|case'\\': c=@'\\';|' here, which would in principle be correct in any other place, but which would cause circularity in a bootstrapped system like this. @^ASCII code dependencies@> @< Translate an \ASCII. constant @>= { int c=*id_first++; if (c=='@@' && *id_first++ !='@@') @/{@; err_print("! Double `@@' should be used in ASCII constant"); @.Double `@@' should be used...@> --id_first; } if (c=='\\') { c=*id_first++; switch (c) { case 't':c='\011';break; case 'n':c='\012';break; case 'b':c='\010';break; case 'f':c='\014';break; case 'v':c='\013';break; case 'r':c='\015';break; case '0':c='\0';break; case '\\':c='\134';break; case '\'':c='\047'; break; case '\"':c='\042'; break; default: err_print("! Unrecognised escape sequence"); @.Unrecognised escape sequence@> } } else { /* at this point |c| should be converted to its \ASCII. code number */ @+} if (id_first!=id_loc) if (id_loc>id_first) err_print("! ASCII constant should be single character"); @.ASCII constant should be...@> else {@; err_print("! Empty ASCII constant"); c=0; } @.Empty ASCII constant@> store_byte(verb_quote); if (c>=100) store_byte('0'+c/100); @+ if (c>=10) store_byte('0'+(c/10)%10); store_byte('0'+c%10); @/ store_byte(verb_quote); } @ Line number marks, which will lead to \&{\#line} directives on output, are inserted in several places, so we shall call a function to do the necessary work. For reasons explained below this function scans ahead a bit through the input, and in doing so it may hit the end of the input; when this happens it must be reported back to |scan_repl|. Therefore |mark_line| returns a boolean value which tells whether |input_has_ended|. @< Prototypes @>= boolean mark_line(void); @~When |mark_line| returns |true|, we should terminate |scan_repl| with |next_control==new_section|; this is achieved by jumping to |done|, where some exitialisations are done, and where |next_control| is assigned from the variable~|a|. @< Insert the line... @>= {@; if (mark_line()) {@; a=new_section; goto done; } } @~If |line_output==true|, the following items are stored for recording the line number: first a byte equal to |0xF9|, then the numeric line number, and finally the |id_index| of the file name, which is looked up as if it were an identifier (except that we use the feature of |id_lookup| to accept null-terminated strings as first argument if the second argument is~|NULL|). Note that |keep| is not increased; if nothing essential follows this line number indication then it need not be recorded. We improve compactness of storage and output a bit by discarding any newlines that are immediately ahead; the inserted line number will automatically be increased properly. While reading we might find there is no input left, in which case no line number is recorded and |true| is returned. @c boolean mark_line(void) { while (loc>=limit) @+ if (!get_line()) return true; /* skip over newlines */ print_where=false; /* request is being serviced */ if (line_output) { store_byte(0xF9);@/ id_first=changing ? change.name : cur_file_name; store_two_bytes(changing ? change_line : cur_line); store_two_bytes(id_index(id_lookup(id_first,NULL,0))); } return false; } @* Scanning a section. The function |scan_section| starts when \:\ , \:\~, or \:* has been sensed in the input, and it proceeds until the end of that section. It uses |section_count| to keep track of the current section number; hopefully \.{CWEAVE} and \.{\me.} will both assign the same numbers to sections. @c void scan_section (void) { ++section_count; if (loc[-1]=='*') print_section_progress(); @ @ } @ For \.{\me.} there is no real difference between the \TeX\ and definition parts of a section; after all, the latter can contain \&{format} commands that are ignored just like \TeX~text is. So we set up a loop that stops when |next_control>=begin_C|, and handles the cases |definition| and |header|, while ignoring anything else. The price we pay for our haste in skipping the \TeX~part of a section (and \:f commands) is that we may have to back up after finding \:< in order to scan the whole module name and set |cur_mod|, which requires calling~|get_next|. In case \:< follows a |definition| or |header| code, it is scanned by |scan_repl|, and no backing up is required; therefore the code for backing up and rescanning cannot be moved outside the loop. @= { next_control=ignore; /* clear |new_section| */ do { if (next_control scan_repl(preproc_directive); cur_text->text_link=macro_flag; /* characterise as macro */ } else /* |next_control==header| */ {@; scan_repl(preproc_directive); cur_text->text_link=header_flag; } if (next_control==module_name) @< If this is not a defining occurrence, set |next_control=ignore| @> } while (next_control= { do next_control=get_next(); /* allow white space before identifier */ while (line_output ? next_control=='\n' : next_control<0x80 && isspace(next_control)); if (next_control!=id_code) @/{@; err_print("! Macro definition ignored, must start with identifier"); @.Macro definition ignored...@> continue; } @< Store the id... @> if (isspace((eight_bits)*loc)) store_byte(' '); /* separate the replacement text from a parameterless macro */ } @ Due to the fact that module names are allowed enclosed in `\pb', it is not certain that the occurrence of a module name signals the start of the \Cee~part of the section. The decisive factor will be what follows the module name: if this is `\.=' or `\.{==}', possibly preceded by `\.+', then this is really the name of the module defined in this section. If not, then we require that the next token is a `\.\v' (the one closing `\pb'), unless we are in compatibility mode, in which case anything goes. We use the fact that scanning the `\.+' and `\.=' symbols will not destroy the value in |cur_mod|. @< If this is not a defining occurrence, set |next_control=ignore| @>= { eight_bits t=get_next(); if (t=='+') t=get_next(); /* skip an optional `\.+' */ if (t!='=' && t!=eq_eq) /* then apparently a cited occurrence, ignore it */ { next_control=ignore; if (t!='|' && !compatibility_mode) err_print("! `=' sign missing, module name ignored"); @.`=' sign missing, module name ignored@> } } @ If we come to this module with |next_control==module_name|, then the code of the previous section has just removed the `\.=' or `\.{==}' and optional `\.+' following the module name. @= { mod_pointer p; /* module name for the current section */ switch (next_control) { default: return; /* no \Cee\ part present */ case begin_C: p=NULL; break; /* section contributing to unnamed module */ case module_name: p=cur_mod; /* section contributing to named module */ } store_two_bytes(0xD000+section_count); /* record section number */ scan_repl(section_body); /* now |cur_text| points to the replacement text */ @< Link the section body at |cur_text| to the module named |p|, or to the unnamed module if |p==NULL| @> } @ Sections whose body starts with a bad name (e.g., an ambiguous prefix) will be treated as if they were unnamed, so the user had better not ignore the error message. Unnamed sections get a slightly special treatment, which makes accumulating many of them a bit more efficient than building up a named module that is defined in many sections. @= { if (p == NULL) /* unnamed section, or bad module name */ {@; *last_unnamed = cur_text; last_unnamed = &cur_text->text_link; } else { text_pointer* q=&p->equiv; /* text for the current module */ while(*q!=NULL) q=&(*q)->text_link; /* find end of list */ *q=cur_text; /* add section to module */ } cur_text->text_link=NULL; /* end list, also marking replacement text as a non-macro */ } @* Phase I processing. Finally we can wrap everything up and define the global structure of~Phase~I. @= void phase_one (void); /* read all the user's text and compress it into |tok_mem| */ @~The only thing left to do is picking up character translations in the limbo part. @c void phase_one (void) { phase=1; section_count=0; reset_input(); while ((next_control=skip_ahead())!=new_section) if (next_control==char_trans) @< Store a character translation @> while (!input_has_ended) scan_section(); /* load all sections */ check_complete(); /* verify that change file hasn't got out of sync */ } @ The code \:l should be followed (after optional space) by a two-digit hex number~|c| with |c>=0x80|, white space, and a translation string of at most |trans_limit| characters, either alphanumeric or underscores, terminated by another space. @< Store a character translation @>= { int c; while (loc else if (sscanf(loc,"%x",&c),c<0x80) err_print("! You cannot translate characters < 0x80"); @.You cannot translate...@> else { char* p=trans_of(c); int i=0; loc+=3; @+ while (find_char() && isspace((eight_bits)*loc)) ++loc; /* skip space */ if (!input_has_ended) while (isalnum((eight_bits)*loc) || *loc=='_') if (++i<=trans_limit) *p++=*loc++; @+ else break; if (i>0) *p='\0'; /* terminate translation unless none was given */ @< Report any problems with the translation string @> } } @ The main contribution of the \:l feature to \.{CWEB} is new syntax, leading to a variety of new error messages. It is possible to give an empty translation for a character by terminating the translation string by a non-space character that is not allowed in identifiers; this is forbidden however, since an empty translation for~|c| would just make |translation_exists(c)| fail, making~|c| useless anyway. Even if the translation string is not empty, terminating it by a non-space character is considered an error, since this would make the \TeX\ macro used for formatting the corresponding line of the printed document fail. @< Report any problems... @>= if (i==0) err_print("! Translation string absent after `@@l'"); @.Translation string...@> else if (i>trans_limit) err_print("! Translation string too long"); else if (!isspace((eight_bits)*loc)) err_print("! Translation string not terminated by space"); @ At the end of the run, if |STAT| was defined and the `\.{+s}' flag present, we report how much of all the arrays was actually needed. @d report(k,c,m) printf("%lu %ss (out of %lu)\n",(unsigned long)(c),k,(unsigned long)(m)) @c #ifdef STAT void print_stats (void) { print("\nMemory usage statistics:\n"); @/report("identifier", id_index(id_ptr), max_idents); @/report("module name", mod_index(mod_ptr), max_modules); @/report("byte", byte_ptr-byte_mem, max_bytes); @/report("replacement text", text_ptr-text_table, max_texts); @/report("token", tok_ptr-tok_mem, max_toks); } #endif @* Index. Here is a cross-reference table for the \.{\me.} processor. All sections in which an identifier is used are listed with that identifier, except that reserved words are indexed only when they appear in format definitions, and the appearances of identifiers in module names are not indexed. Underlined entries correspond to where the identifier was declared. Error messages and a few other things like ``\ASCII. code dependencies'' are indexed here too.