Literate Programming

Prev:WEB 1 Top:WEB 0 Next:WEB 3

WEB 2 - Generating HTML

Jon Breuer - September 8, 2024.

The original TANGLE and WEAVE programs generate both document and program from the source *.WEB file. Thus far I've been using the HTML as my source, but that leaves me formatting my <pre> sections manually. A good WEB program would format them for me. That will finally get rid of the ugly and redundant <pre>@@p blocks in my code.

The HTML output from this file contains nested <pre> blocks so the code looks only half-way correct in this file and a different half correct in the generated file. I'll add a hack style here to fix part of it.

<style>
pre pre {
    margin: -0.5em;
}
</style>

This is the header. It hasn't changed a lot from version to version.

@p
////////////
// WEB2.D
//
// This is a level 2 bootstrapping Literate Programming thing.
// It will start generating HTML from WEB files.
//
module web2;
@>

Normal includes. These will slowly grow and it would be convenient to call them out where they become useful.

@p
private import std.algorithm; // Needed for countUntil and searching
private import std.file;      // Needed for file input and output
private import std.stdio;     // Needed for error reporting and my debugging
private import std.string;    // These programs are all about string processing.
@>

This webpage would be so much prettier to read if comments and strings were syntax colored.

The normal countUntil operates from the start of the string, but I need a variant that can be moved progressively through the string. Converting to the "last half" slice of the array and back to "index within whole array" is just a bit cumbersome at the call site.

@p
ptrdiff_t countFromPosUntil(string haystack, ptrdiff_t startIndex, string needle)
{
    ptrdiff_t offset = countUntil(haystack[startIndex..haystack.length], needle);
    if(offset < 0) {
        return offset;
    }
    return startIndex + offset;
}
@>
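
As a usage sketch, here is how the helper can walk every occurrence of a needle in turn. The `findAll` name and the sample text are made up for illustration and are not part of WEB2; the helper itself is repeated only so the example compiles on its own.

```d
import std.algorithm : countUntil;
import std.stdio : writeln;

// Same helper as above, repeated so the example is self-contained.
ptrdiff_t countFromPosUntil(string haystack, ptrdiff_t startIndex, string needle)
{
    ptrdiff_t offset = countUntil(haystack[startIndex .. $], needle);
    if (offset < 0) {
        return offset;
    }
    return startIndex + offset;
}

// Hypothetical helper: collect the index of every occurrence of needle.
ptrdiff_t[] findAll(string haystack, string needle)
{
    ptrdiff_t[] found;
    ptrdiff_t pos = countFromPosUntil(haystack, 0, needle);
    while (pos != -1) {
        found ~= pos;
        // Resume just past the match so it is not found again.
        pos = countFromPosUntil(haystack, pos + needle.length, needle);
    }
    return found;
}

void main()
{
    writeln(findAll("one, two, three", ", ")); // prints [3, 8]
}
```

The call site only ever deals in indexes into the whole string, which is the point of wrapping the slice arithmetic.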

Start of program and basic error handling.

@p
void main(string[] args)
{
    if(args.length != 4) {
        writefln("Usage: WEB2 inputFile outputHTMLFile outputCodeFile");
        return;
    }
    const string inputFilename = args[1];
    string fileContents = cast(string) std.file.read(inputFilename);
    if(fileContents.length == 0) {
        writefln("Unable to read file '%s'.", inputFilename);
        return;
    }

    // Generate these strings so they don't appear in the source.
    const string startTag = "@" ~ "p";
    const string endTag = "@" ~ ">";
@>

Now we are generating two files as we loop over the @@p blocks.

@p
    string outputCodeContents = "";
    string outputDisplayContents = "";
    ptrdiff_t blockEndIndex = 0;
    ptrdiff_t startDisplayIndex = 0;
    int lineNumber = 0;
    for(
        ptrdiff_t blockStartIndex = countUntil(fileContents, startTag);
        blockStartIndex != -1;
        blockStartIndex = countFromPosUntil(fileContents, blockEndIndex, startTag)) {
@>

Copy the interstitials into the display file. A future version will really need to start tangling the code blocks out of order. Not every bit needs to be explained.

@p
        outputDisplayContents ~= fileContents[startDisplayIndex .. blockStartIndex];
@>

We will also start counting newlines so we can insert the source line numbers into the intermediate code files.

@p
        lineNumber += count(fileContents[startDisplayIndex .. blockStartIndex], '\n');
@>
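
To make the counting concrete, here is a small sketch of how a newline count maps to a line number. The `lineNumberAt` helper is hypothetical and not part of WEB2; it only illustrates that a position preceded by N newlines sits on line N + 1 when lines are numbered from 1.

```d
import std.algorithm : count;
import std.stdio : writeln;

// Hypothetical helper: 1-based line number of the character at index.
size_t lineNumberAt(string text, size_t index)
{
    // Every newline before the index pushes the position down one line.
    return count(text[0 .. index], '\n') + 1;
}

void main()
{
    string text = "first\nsecond\nthird";
    writeln(lineNumberAt(text, 0));  // prints 1
    writeln(lineNumberAt(text, 6));  // prints 2 (the 's' of "second")
    writeln(lineNumberAt(text, 13)); // prints 3 (the 't' of "third")
}
```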

The new version of escaping @ symbols will remove the doubling up from the HTML.

@p
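        if(blockStartIndex > 0 && fileContents[blockStartIndex - 1] == '@') {
            // Don't parse escaped at symbols or examples. The display already
            // holds the first '@', so resume after the second to drop the doubling.
            blockEndIndex = blockStartIndex + 1;
            startDisplayIndex = blockStartIndex + 1;
            continue;
        }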
@>

Extra robust close tag tracking.

@p
        string codeSourceContents = "";
        for(blockEndIndex = blockStartIndex + startTag.length; blockEndIndex < fileContents.length; blockEndIndex++) {
            if(fileContents[blockEndIndex] == '@' && blockEndIndex < fileContents.length - 1) {
                dchar nextChar = fileContents[blockEndIndex + 1];
                if(nextChar == '@') {
                    // Escaped @: skip the first one and copy the second.
                    blockEndIndex++;
                } else if(nextChar == '>') {
                    // Close tag: leave blockEndIndex on the '>'.
                    blockEndIndex++;
                    break;
                }
            }
            codeSourceContents ~= fileContents[blockEndIndex];
        }
        
        if(blockEndIndex >= fileContents.length) {
            writefln("Start tag without end tag found at location %d near line %d.", blockStartIndex, lineNumber);
            return;
        }
@>
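
The close-tag scan can be tried in isolation with a sketch like the one below. The `readUntilCloseTag` function and its `out` parameter are hypothetical stand-ins for the inline loop, and the markers are assembled from pieces for the same reason WEB2 does it: so they never appear whole in a WEB source.

```d
import std.stdio : writeln;

// Hypothetical stand-alone version of the scan. Copies text into code until
// the close tag, collapsing each doubled at sign into a single one.
// Returns false when the text ends before a close tag is seen.
bool readUntilCloseTag(string text, out string code)
{
    for (size_t i = 0; i < text.length; i++) {
        if (text[i] == '@' && i + 1 < text.length) {
            if (text[i + 1] == '@') {
                i++; // Escaped at sign: skip the first, copy the second.
            } else if (text[i + 1] == '>') {
                return true; // Close tag reached.
            }
        }
        code ~= text[i];
    }
    return false;
}

void main()
{
    string at = "@";
    string code;
    writeln(readUntilCloseTag("a " ~ at ~ at ~ " b" ~ at ~ ">tail", code)); // prints true
    writeln(code);                                                          // prints a @ b
    writeln(readUntilCloseTag("no close tag", code));                       // prints false
}
```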

Copying to code works the same as before.

@p
        outputCodeContents ~= format("#line %d \"%s\"", lineNumber, inputFilename);
        outputCodeContents ~= codeSourceContents;
@>

But we need to start generating the prettified <pre> tags ourselves. Luckily, D lets me declare functions out of order, so I can insert future work here.

@p
        outputDisplayContents ~= "<"~"pre>";
        outputDisplayContents ~= formatCodeForDisplay(codeSourceContents, lineNumber);
        outputDisplayContents ~= "<"~"/pre>";
@>

Update the line number and the display index.

@p
        lineNumber += count(codeSourceContents, '\n');
        startDisplayIndex = blockEndIndex + endTag.length;
@>

End the parsing for loop.

@p
    }
@>

Add the tail of the file to the display.

@p
    outputDisplayContents ~= fileContents[startDisplayIndex..fileContents.length];
@>

Write the final results.

@p
    string outputDisplayFilename = args[2];
    std.file.write(outputDisplayFilename, outputDisplayContents);

    string outputCodeFilename = args[3];
    std.file.write(outputCodeFilename, outputCodeContents);
@>

End the main function.

@p
    }
@>

C:\literate> web1 literate_programming_2.html web2.d
C:\literate> dmd web2
C:\literate> web2 literate_programming_2.html literate_programming_2b.html web2.d
C:\literate> dmd web2
C:\literate> web2 literate_programming_2.html literate_programming_2b.html web2.d

Here is the result: literate_programming_2b.html

There are two bugs in this program. First, because this HTML source file already has the <pre> tags in place, the generated HTML has double-nested <pre> blocks. Second, my line counts are off by one, but I haven't looked for why yet.

Now I want to start color formatting the code.

@p
private import std.ascii; // Character type checks.

//TODO:// Escape HTML codes so they don't mis-render.
//TODO:// Trim line lengths to fit.
string formatCodeForDisplay(string source, int lineNumber)
{
    string output = "";
    // Calling out future work...
    string scanner = escapeHTMLCharacters(source);
    scanner: while(!scanner.empty) {
@>

Each section of the scanner converts a different part of the code into colored blocks. First, comments. Match a // and scan until the end of the line.

@p
        if(scanner.startsWith("//")) {
            // Color comments.
            output ~= "<"~"span class=\"code_comment\">";
            ptrdiff_t lineLength = countUntil(scanner, "\n");
            if(lineLength < 0) {
                // The comment runs to the end of the block.
                lineLength = scanner.length;
            }
            output ~= scanner[0..lineLength];
            output ~= "<"~"/span>";
            scanner = scanner[lineLength..scanner.length];
@>

Next, strings. I handle both single and double quotes and then skip escaped characters and scan for end of string.

@p
        } else if(scanner.startsWith("\"") || scanner.startsWith("\'")) {
            // Color strings.
            char stringType = scanner[0];
            output ~= "<"~"span class=\"code_string\">";
            int stringLength = 1;
            while(stringLength < scanner.length && scanner[stringLength] != stringType) {
                if(scanner[stringLength] == '\\') {
                    stringLength += 1;
                }
                stringLength += 1;
            }
            if(stringLength >= scanner.length) {
                writefln("Unable to find close quote for string %s near line %d", scanner[0..min(scanner.length, 20)], lineNumber);
                break scanner;
            }
            output ~= scanner[0..stringLength + 1];
            output ~= "<"~"/span>";
            scanner = scanner[stringLength + 1..scanner.length];
@>

Identifiers were a bit tricky. I have a list of known identifiers, and I have to check that the block of text starts with an alpha character and continues with identifier characters. I made several mistakes here: scanning for whitespace instead of for the first non-identifier character, and allowing partial matches like fo and format instead of only for.

I'm still breaking my HTML tags apart to keep the browser from mis-rendering this code.

@p
        } else {
            if(isAlpha(scanner[0])) {
                bool isNotIdentifier(dchar ch) { return !(isAlpha(ch) || isDigit(ch) || ch == '_'); }
                ptrdiff_t wordLength = countUntil!isNotIdentifier(scanner);
                if(wordLength < 0) {
                    // The word runs to the end of the block.
                    wordLength = scanner.length;
                }

                const string[] identifiers = [ "const", "bool", "break", "char", "dchar", "else",
                    "for", "if", "import", "int", "main", "module", "private", "return", "string",
                    "std", "void", "while", ];

                if(!findAmong(identifiers, [scanner[0..wordLength]]).empty) {
                    // Special identifiers
                    output ~= "<"~"span class=\"code_identifier\">";
                    output ~= scanner[0..wordLength];
                    output ~= "<"~"/span>";
                } else {
                    output ~= scanner[0..wordLength];
                }
                // Leave the delimiter after the word for the next pass.
                scanner = scanner[wordLength..scanner.length];
@>

Unknown character. Advance and test again.

@p
            } else {
                output ~= scanner[0];
                scanner = scanner[1..scanner.length];
            }
        }
    }
    
    return output;
}
@>
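
The whole-word rule for identifiers is easy to get wrong, so here is a minimal sketch of it on its own. The `identifierLength` and `startsWithKeyword` helpers are hypothetical names, and the short keyword list is only a sample.

```d
import std.algorithm : canFind, countUntil;
import std.ascii : isAlpha, isDigit;
import std.stdio : writeln;

// Hypothetical helper: length of the identifier at the front of text.
size_t identifierLength(string text)
{
    bool isNotIdentifier(dchar ch) { return !(isAlpha(ch) || isDigit(ch) || ch == '_'); }
    ptrdiff_t n = countUntil!isNotIdentifier(text);
    // No delimiter found means the identifier runs to the end of the text.
    return n < 0 ? text.length : n;
}

// Hypothetical helper: true only when the whole leading word is a keyword.
bool startsWithKeyword(string text, string[] keywords)
{
    return keywords.canFind(text[0 .. identifierLength(text)]);
}

void main()
{
    string[] keywords = ["for", "if", "int"];
    writeln(startsWithKeyword("for(", keywords));    // prints true
    writeln(startsWithKeyword("format(", keywords)); // prints false
    writeln(startsWithKeyword("fo = 1", keywords));  // prints false
}
```

Measuring the word first and then comparing the whole word is what rules out both the too-short match (fo) and the too-long one (format).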

And the result is now something like this.

void main(string[] args)
{
    if(args.length != 4) {
        writefln("Usage: WEB2 inputFile outputHTMLFile outputCodeFile");
        return;
    }
    const string inputFilename = args[1];
    string fileContents = cast(string) std.file.read(inputFilename);
    if(fileContents.length == 0) {
        writefln("Unable to read file '%s'.", inputFilename);
        return;
    }

    // Generate these strings so they don't appear in the source.
    const string startTag = "@" ~ "p";
    const string endTag = "@" ~ ">";

I keep having to break out <pre> and <span> tags into parts so the browser doesn't choke on them. Proper text escaping will fix that.

@p
string escapeHTMLCharacters(string source)
{
    string output;
    foreach(dchar ch; source) {
        if(countUntil("<>&", ch) >= 0) {
            if(ch == '<') {
                output ~= "&lt;";
            } else if(ch == '>') {
                output ~= "&gt;";
            } else if(ch == '&') {
                output ~= "&amp;";
            } else {
                writefln("BUG: Only partly implemented support for '%s'.", ch);
            }
        } else {
            output ~= ch;
        }
    }
    return output;
}
@>
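
For comparison, the same escaping can be sketched with `std.array.replace` instead of a character loop. This is an alternative technique, not the one WEB2 uses, and `escapeWithReplace` is a hypothetical name.

```d
import std.array : replace;
import std.stdio : writeln;

// Hypothetical alternative to the character loop above.
string escapeWithReplace(string source)
{
    // The ampersand must go first, or the entities added for the angle
    // brackets would have their own ampersands escaped a second time.
    return source.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
}

void main()
{
    writeln(escapeWithReplace("if(a < b && b > c)")); // prints if(a &lt; b &amp;&amp; b &gt; c)
}
```

The loop version looks at each character exactly once, so it has no ordering concern; the replace version is shorter but depends on handling the ampersand first.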

Now I'm about to leave HTML behind as the source. WEB3 will work from *.WEB source and generate HTML and D code as the outputs.

I am occasionally creating bugs, mostly infinite loops in WEB2, which leave me with no working WEB2. I have to run WEB1 on the source first before I can get a working WEB2 again.

C:\literate> web2 literate_programming_2.html literate_programming_2b.html web2.d
core.exception.ArraySliceError@literate_programming_2.html(312): slice [0 .. 4294967295] extends past source array of length 398
----------------
0x00419FEB
0x004019CC
0x004012E8
0x00426263
0x004261C3
0x00426032
0x0041B724
0x00401C13
0x75907BA9 in BaseThreadInitThunk
0x77D0C10B in RtlInitializeExceptionChain
0x77D0C08F in RtlClearBits
C:\literate> web1 literate_programming_2.html web2.d
C:\literate> dmd web2
C:\literate> web2 literate_programming_2.html literate_programming_2b.html web2.d
C:\literate> dmd web2
C:\literate> web2 literate_programming_2.html literate_programming_2b.html web2.d
C:\literate>
