Literate Programming

WEB 2 - Generating HTML

Jon Breuer - September 8, 2024.

The original TANGLE and WEAVE programs generate both document and program from the source *.WEB file. Thus far I've been using the HTML as my source, but that leaves me formatting my <pre> sections manually. A good WEB program would format them for me. That will finally get rid of the ugly and redundant <pre>@p blocks in my code.

The HTML output from this file contains nested <pre> blocks, so the code looks only half correct in this file and wrong in a different way in the generated file. I'll add a hack style here to fix part of it.

<style>
pre pre {
    margin: -0.5em;
}
</style>

This is the header. It hasn't changed a lot from version to version.

////////////
// WEB2.D
//
// This is a level 2 bootstrapping Literate Programming thing.
// It will start generating HTML from WEB files.
//
module web2;

Normal includes. These will slowly grow and it would be convenient to call them out where they become useful.

private import std.algorithm; // Needed for countUntil and searching
private import std.file;      // Needed for file input and output
private import std.stdio;     // Needed for error reporting and my debugging
private import std.string;    // These programs are all about string processing.

This webpage would be so much prettier to read if comments and strings were syntax colored.

The normal countUntil operates from the start of the string, but I need a variant that can be moved progressively through the string. Converting to the "last half" slice of the array and back to "index within whole array" is just a bit cumbersome at the call site.

ptrdiff_t countFromPosUntil(string haystack, ptrdiff_t startIndex, string needle)
{
    ptrdiff_t offset = countUntil(haystack[startIndex..haystack.length], needle);
    if(offset < 0) {
        return offset;
    }
    return startIndex + offset;
}
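
As a usage sketch, here is how the helper can walk every occurrence of a needle. The findAll function and the sample string are hypothetical, made up for illustration, and are not part of WEB2. The helper is repeated so the sketch compiles on its own.

```d
import std.algorithm : countUntil;
import std.stdio : writeln;

// Same helper as above, repeated so this sketch is self-contained.
ptrdiff_t countFromPosUntil(string haystack, ptrdiff_t startIndex, string needle)
{
    ptrdiff_t offset = countUntil(haystack[startIndex .. $], needle);
    return offset < 0 ? offset : startIndex + offset;
}

// Hypothetical helper: collect the index of every occurrence of needle.
ptrdiff_t[] findAll(string haystack, string needle)
{
    ptrdiff_t[] hits;
    for (ptrdiff_t pos = countFromPosUntil(haystack, 0, needle);
         pos != -1;
         pos = countFromPosUntil(haystack, pos + needle.length, needle))
    {
        hits ~= pos;
    }
    return hits;
}

void main()
{
    writeln(findAll("a@pb@pc", "@p")); // prints [1, 4]
}
```

The caller only ever deals with indexes into the whole string, which is the point of wrapping the slice arithmetic.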

Start of program and basic error handling.

void main(string[] args)
{
    if(args.length != 4) {
        writefln("Usage: WEB2 inputFile outputHTMLFile outputCodeFile");
        return;
    }
    const string inputFilename = args[1];
    string fileContents = cast(string) std.file.read(inputFilename);
    if(fileContents.length == 0) {
        writefln("Unable to read file '%s'.", inputFilename);
        return;
    }

    // Generate these strings so they don't appear in the source.
    const string startTag = "@" ~ "p";
    const string endTag = "@" ~ ">";

Now we are generating two files as we loop over the @@p blocks.

    string outputCodeContents = "";
    string outputDisplayContents = "";
    ptrdiff_t blockEndIndex = 0;
    ptrdiff_t startDisplayIndex = 0;
    int lineNumber = 0;
    for(
        ptrdiff_t blockStartIndex = countUntil(fileContents, startTag);
        blockStartIndex != -1;
        blockStartIndex = countFromPosUntil(fileContents, blockEndIndex, startTag)) {

Copy the interstitials into the display file. A future version will really need to start tangling the code blocks out of order. Not every bit needs to be explained.

        outputDisplayContents ~= fileContents[startDisplayIndex .. blockStartIndex];

We will also start counting newlines so we can insert the source line numbers into the intermediate code files.

        lineNumber += count(fileContents[startDisplayIndex .. blockStartIndex], '\n');

The new version of escaping @ symbols will remove the doubling up from the HTML.

        if(blockStartIndex > 0 && fileContents[blockStartIndex - 1] == '@') {
            // Don't parse escaped at symbols or examples.
            blockEndIndex = blockStartIndex + 1;
            startDisplayIndex++;
            continue;
        }

Extra robust close tag tracking.

        string codeSourceContents = "";
        for(blockEndIndex = blockStartIndex + startTag.length; blockEndIndex < fileContents.length; blockEndIndex++) {
            if(fileContents[blockEndIndex] == '@' && blockEndIndex < fileContents.length - 1) {
                dchar nextChar = fileContents[blockEndIndex + 1];
                if(nextChar == '@') {
                    // Escaped @
                    blockEndIndex++;
                } else if(nextChar == '>') {
                    blockEndIndex++;
                    break;
                }
            }
            codeSourceContents ~= fileContents[blockEndIndex];
        }
        
        if(blockEndIndex >= fileContents.length) {
            writefln("Start tag without end tag found at location %d near line %d.", blockStartIndex, lineNumber);
            return;
        }
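
The same scan can be tried in isolation. This is a minimal sketch, assuming a hypothetical extractBlock function and a made-up input string. Neither is part of WEB2, and the sketch reports failure through its return value instead of printing an error.

```d
import std.stdio : writeln;

// Hypothetical standalone version of the scan above.
// Starting at index start, copy characters until an unescaped "@>",
// collapsing "@@" into a single "@". Returns false if no end tag is found.
bool extractBlock(string text, size_t start, out string code, out size_t endIndex)
{
    for (size_t i = start; i < text.length; i++)
    {
        if (text[i] == '@' && i + 1 < text.length)
        {
            if (text[i + 1] == '@')
            {
                i++; // skip the first @ of the pair; the second is copied below
            }
            else if (text[i + 1] == '>')
            {
                endIndex = i + 1; // index of the '>' that closes the block
                return true;
            }
        }
        code ~= text[i];
    }
    return false;
}

void main()
{
    string code;
    size_t end;
    // The escaped "@@" survives as "@"; "@>" terminates the block.
    assert(extractBlock("x = \"a@@b\";@> trailing", 0, code, end));
    writeln(code); // prints x = "a@b";
}
```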

Copying to code works the same as before.

        outputCodeContents ~= format("#line %d \"%s\"", lineNumber, inputFilename);
        outputCodeContents ~= codeSourceContents;
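
To make the effect of the #line directive concrete, here is a small sketch of what one emitted chunk looks like. The withLineDirective helper and the values passed to it are hypothetical. It also assumes the code block begins with a newline (the rest of the line after the start tag); otherwise the directive and the first line of code would share a line.

```d
import std.format : format;
import std.stdio : write;

// Hypothetical helper: prefix a code block with a #line directive so that
// compiler messages refer back to the original WEB source file.
string withLineDirective(int lineNumber, string sourceFile, string code)
{
    return format("#line %d \"%s\"", lineNumber, sourceFile) ~ code;
}

void main()
{
    // Made-up values; the leading newline in the code block is an assumption.
    write(withLineDirective(42, "literate_programming_2.html", "\nmodule web2;\n"));
    // prints:
    // #line 42 "literate_programming_2.html"
    // module web2;
}
```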
       

But we need to start generating the <pre> tags ourselves. Luckily, D lets me declare functions out of order, so I can insert future work here.

        outputDisplayContents ~= "<"~"pre>";
        outputDisplayContents ~= formatCodeForDisplay(codeSourceContents, lineNumber);
        outputDisplayContents ~= "</"~"pre>";
 

Update the line number and the display index.

        lineNumber += count(codeSourceContents, '\n');
        startDisplayIndex = blockEndIndex + endTag.length;
 

End the parsing for loop.

    }

Write the final results.

    string outputDisplayFilename = args[2];
    std.file.write(outputDisplayFilename, outputDisplayContents);

    string outputCodeFilename = args[3];
    std.file.write(outputCodeFilename, outputCodeContents);

End the main function.

}

C:\literate> web1 literate_programming_2.html web2.d
C:\literate> dmd web2
C:\literate> web2 literate_programming_2.html literate_programming_2b.html web2.d
C:\literate> dmd web2
C:\literate> web2 literate_programming_2.html literate_programming_2b.html web2.d

Here is the result: literate_programming_2b.html

There are two bugs in this program. First, because this HTML source file already has the <pre> tags in place, the generated HTML has double-nested <pre> blocks. Second, my line counts are off by one, but I haven't looked into why yet.

Now I want to start color formatting the code.

private import std.ascii; // Character type checks.

//TODO:// Escape HTML codes so they don't mis-render.
//TODO:// Trim line lengths to fit.
string formatCodeForDisplay(string source, int lineNumber)
{
    string output = "";
    // Calling out future work...
    string scanner = escapeHTMLCharacters(source);
    scanner: while(!scanner.empty) {

Each section of the scanner converts a different part of the code into colored blocks. First, comments. Match a // and scan until the end of the line.

        if(scanner.startsWith("//")) {
            // Color comments.
            output ~= "<"~"span class=\"code_comment\">";
            int lineLength = countUntil(scanner, "\n");
            output ~= scanner[0..lineLength - 1];
            output ~= "<"~"/span>";
            scanner = scanner[lineLength..scanner.length];

Next, strings. I handle both single and double quotes, skipping escaped characters while scanning for the end of the string.

        } else if(scanner.startsWith("\"") || scanner.startsWith("\'")) {
            // Color strings.
            char stringType = scanner[0];
            output ~= "<"~"span class=\"code_string\">";
            int stringLength = 1;
            while(stringLength < scanner.length && scanner[stringLength] != stringType) {
                if(scanner[stringLength] == '\\') {
                    stringLength += 1;
                }
                stringLength += 1;
            }
            if(stringLength >= scanner.length) {
                writefln("Unable to find close quote for string %s near line %d", scanner[0..min(scanner.length, 20)], lineNumber);
                break scanner;
            }
            output ~= scanner[0..stringLength + 1];
            output ~= "<"~"/span>";
            scanner = scanner[stringLength + 1..scanner.length];
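
The quote scan can also be tested on its own. This is a minimal sketch, assuming a hypothetical quotedLength helper that is not part of WEB2. It returns the length of the literal including both quotes, or -1 when the closing quote is missing.

```d
// Hypothetical helper: length of the quoted literal at the start of text,
// including both quotes, or -1 if text does not start with a quote or the
// closing quote is never found.
ptrdiff_t quotedLength(string text)
{
    if (text.length == 0 || (text[0] != '"' && text[0] != '\''))
        return -1;
    immutable char quote = text[0];
    size_t i = 1;
    while (i < text.length && text[i] != quote)
    {
        if (text[i] == '\\')
            i++; // skip the escaped character as well
        i++;
    }
    return i < text.length ? cast(ptrdiff_t)(i + 1) : -1;
}

void main()
{
    assert(quotedLength(`"a\"b" rest`) == 6); // escaped quote stays inside the literal
    assert(quotedLength(`'x'`) == 3);
    assert(quotedLength(`"unterminated`) == -1);
}
```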

Identifiers were a bit tricky. I have a list of known identifiers, and I have to check that the block of text starts with an alpha character and continues with identifier characters. I made several mistakes here: scanning for whitespace instead of non-identifier characters, and allowing partial matches like fo and format instead of only for.

I'm still breaking my HTML tags apart to keep the browser from mis-rendering this code.

             
        } else {
            if(isAlpha(scanner[0])) {
                bool isNotIdentifier(dchar ch) { return !(isAlpha(ch) || isDigit(ch) || ch == '_'); }
                int wordLength = countUntil!isNotIdentifier(scanner);
                
                const string[] identifiers = [ "const", "bool", "break", "char", "dchar", "else",
                    "for", "if", "import", "int", "main", "module", "private", "return", "string",
                    "std", "void", "while", ];
                
                if(wordLength > 0 && !findAmong(identifiers, [scanner[0..wordLength]]).empty) {
                    // Special identifiers
                    output ~= "<"~"span class=\"code_identifier\">";
                    output ~= scanner[0..wordLength + 1];
                    output ~= "<"~"/span>";
                } else {
                    output ~= scanner[0..wordLength + 1];
                }
                scanner = scanner[wordLength + 1..scanner.length];

Unknown character. Advance and test again.

                  
            } else {
                output ~= scanner[0];
                scanner = scanner[1..scanner.length];
            }
        }
    }
    
    return output;
}
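
The partial-match mistakes mentioned earlier (fo and format versus for) come down to comparing a whole word against the list. Here is a minimal sketch of that check on its own. The helper names identifierLength and startsWithKeyword and the shortened keyword list are hypothetical, not part of WEB2.

```d
import std.algorithm : canFind;
import std.ascii : isAlpha, isDigit;

// Hypothetical helper: number of identifier characters at the start of text.
size_t identifierLength(string text)
{
    size_t n = 0;
    while (n < text.length && (isAlpha(text[n]) || isDigit(text[n]) || text[n] == '_'))
        n++;
    return n;
}

// True only when the whole leading word is in the keyword list.
bool startsWithKeyword(string text, const string[] keywords)
{
    if (text.length == 0 || !isAlpha(text[0]))
        return false;
    size_t n = identifierLength(text);
    return keywords.canFind(text[0 .. n]);
}

void main()
{
    const string[] keywords = ["for", "if", "int"];
    assert(startsWithKeyword("for(;;)", keywords));
    assert(!startsWithKeyword("format(x)", keywords)); // longer word, not a keyword
    assert(!startsWithKeyword("fo = 1", keywords));    // shorter word, not a keyword
}
```

Measuring the word first and then looking up the exact slice avoids both kinds of partial match.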

And the result is now something like this.

void main(string[] args)
{
    if(args.length != 4) {
        writefln("Usage: WEB2 inputFile outputHTMLFile outputCodeFile");
        return;
    }
    const string inputFilename = args[1];
    string fileContents = cast(string) std.file.read(inputFilename);
    if(fileContents.length == 0) {
        writefln("Unable to read file '%s'.", inputFilename);
        return;
    }

    // Generate these strings so they don't appear in the source.
    const string startTag = "@" ~ "p";
    const string endTag = "@" ~ ">";

I keep having to break out <pre> and <span> tags into parts so the browser doesn't choke on them. Proper text escaping will fix that.

string escapeHTMLCharacters(string source)
{
    string output;
    foreach(dchar ch; source) {
        if(countUntil("<>&", ch) >= 0) {
            if(ch == '<') {
                output ~= "&lt;";
            } else if(ch == '>') {
                output ~= "&gt;";
            } else if(ch == '&') {
                output ~= "&amp;";
            } else {
                writefln("BUG: Only partly implemented support for '%s'.", ch);
            }
        } else {
            output ~= ch;
        }
    }
    return output;
}
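
For comparison, here is an alternative sketch, not the version above, that leans on replace from std.array. The function name escapeHtmlByReplace is hypothetical. The one thing to get right is the order: the ampersand has to be replaced first, or the & in an already generated &lt; would be escaped a second time.

```d
import std.array : replace;

// Alternative sketch: chained replace calls instead of a character loop.
// "&" must be handled first so the entities produced by the later
// replacements are not escaped again.
string escapeHtmlByReplace(string source)
{
    return source
        .replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;");
}

void main()
{
    assert(escapeHtmlByReplace("<pre>a && b</pre>")
        == "&lt;pre&gt;a &amp;&amp; b&lt;/pre&gt;");
}
```

The character loop makes a single pass over the input, while this version makes three, which should not matter for files of this size.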