Jon Breuer - September 10, 2024.
I've been looking forward to a parser that can tangle code samples out of order. On the one hand it will let me talk about the interesting things first and on the other hand it could let me define a function, discuss it, then define it a second time and have the compiler pick up the updated definition.
Now that the parser can handle more tags, I'll be writing this file in WEB as much as possible with the HTML stuff kept to a minimum.
I really want to redefine existing sections and dynamically add content into group sections.
This is the source, and this is the output.
The overall program layout looks like this:
__main__
File Header
Library Imports
Definitions
Utility Functions
Main Function
Comments to test appending and replacing
File Header
////////////
// WEB4.D
//
// This is a level 4 bootstrapping Literate Programming thing.
// It will insert indices and tables of contents. It may also allow appending or replacing sections.
//
module web3;
Library Imports
private import std.algorithm; // Needed for countUntil and searching
private import std.ascii;     // Character type checks.
private import std.file;      // Needed for file input and output
private import std.stdio;     // Needed for error reporting and my debugging
private import std.string;    // These programs are all about string processing.
I've converted the bool isCode/isIdentifier into a consistent enum and started tracking line numbers within each text block.
Definitions
enum ESectionType {
    CODE,
    HEADER,
    PARAGRAPH,
    IDENTIFIER,
    INDEX_TERM,
    PRE, // Terms? Literal/Emphasis?
    BOLD,
};
struct SSection {
    string name;
    ESectionType type;
    SBlock[] contents;
};
struct SBlock {
    ESectionType type;
    int lineNumber;
    string content;
};
New character definitions
Utility Functions
An enhanced version of countUntil that can start at a given string position.
Format code for display with colors and escaped HTML codes.
Escape special HTML Characters with their safe entities.
Parse an entire section of text, recursing for definitions if needed.
Find Matching Identifiers
expand_code_identifier
parse_web_then_tangle_and_weave
I'm inserting these hide blocks here. I wish I could insert bits of display inside a code block.
An enhanced version of countUntil that can start at a given string position.
ptrdiff_t countFromPosUntil(string haystack, ptrdiff_t startIndex, string needle) {
    ptrdiff_t offset = countUntil(haystack[startIndex..haystack.length], needle);
    if(offset < 0) {
        return offset;
    }
    return startIndex + offset;
}
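As a quick check of how this helper behaves, here is a minimal sketch. The helper is repeated so the sample compiles on its own; the sample string and the positions passed in are made up for illustration.

```d
import std.algorithm : countUntil;

// Same logic as the helper above, repeated so this sketch stands alone.
ptrdiff_t countFromPosUntil(string haystack, ptrdiff_t startIndex, string needle)
{
    ptrdiff_t offset = countUntil(haystack[startIndex .. $], needle);
    return offset < 0 ? offset : startIndex + offset;
}

void main()
{
    // Made-up input: a header tag followed by a title and some body text.
    string text = "@*Title. Body text. More.";

    // Search from just past the "@*" tag: the first period ends the title.
    assert(countFromPosUntil(text, 2, ".") == 7);

    // Search from just past that period: the result is still an index into
    // the whole string, not an offset from the start position.
    assert(countFromPosUntil(text, 8, ".") == 18);

    // A missing needle is reported as a negative value.
    assert(countFromPosUntil(text, 8, "#") < 0);
}
```

The useful property is that the return value can be used directly as a slice bound on the original string, which is how the header parser pulls the title out.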
Format code for display with colors and escaped HTML codes.
string formatCodeForDisplay(string source, int lineNumber) {
    string output = "";
    string scanner = escapeHTMLCharacters(source);
    scanner: while(!scanner.empty) {
        if(scanner.startsWith("//")) {
            // Color comments.
            output ~= "<span class=\"code_comment\">";
            int lineLength = countUntil(scanner, "\n");
            if(lineLength < 0) {
                lineLength = scanner.length;
            }
            output ~= scanner[0..lineLength - 1];
            output ~= "</span>";
            scanner = scanner[lineLength..scanner.length];
        } else if(scanner.startsWith("\"") || scanner.startsWith("\'")) {
            // Color strings.
            char stringType = scanner[0];
            output ~= "<span class=\"code_string\">";
            int stringLength = 1;
            while(stringLength < scanner.length && scanner[stringLength] != stringType) {
                if(scanner[stringLength] == '\\') {
                    stringLength += 1;
                }
                stringLength += 1;
            }
            if(stringLength >= scanner.length) {
                writefln("ERROR: Unable to find close quote for string %s near line %d in string %s\n", scanner[0..min(scanner.length, 20)], lineNumber, source);
                break scanner;
            }
            output ~= scanner[0..stringLength + 1];
            output ~= "</span>";
            scanner = scanner[stringLength + 1..scanner.length];
        } else {
            if(isAlpha(scanner[0])) {
                bool isNotIdentifier(dchar ch) {
                    return !(isAlpha(ch) || isDigit(ch) || ch == '_');
                }
                int wordLength = countUntil!isNotIdentifier(scanner);
                if(wordLength < 0) {
                    wordLength = scanner.length;
                }
                const string[] identifiers = [
                    "const", "bool", "break", "char", "dchar", "else", "for", "if", "import",
                    "int", "main", "module", "private", "return", "string", "std", "void", "while",
                ];
                if(wordLength > 0 && !findAmong(identifiers, [scanner[0..wordLength]]).empty) {
                    // Special identifiers
                    output ~= "<span class=\"code_identifier\">";
                    output ~= scanner[0..wordLength];
                    output ~= "</span>";
                } else {
                    output ~= scanner[0..wordLength];
                }
                scanner = scanner[wordLength..scanner.length];
            } else {
                output ~= scanner[0];
                scanner = scanner[1..scanner.length];
            }
        }
    }
    return output;
}
Escape special HTML Characters with their safe entities.
string escapeHTMLCharacters(string source) {
    string output;
    string scanner = source;
    foreach(dchar ch; source) {
        if(countUntil("<>&", ch) >= 0) {
            if(ch == '<') {
                output ~= "&lt;";
            } else if(ch == '>') {
                output ~= "&gt;";
            } else if(ch == '&') {
                output ~= "&amp;";
            } else {
                writefln("BUG: Only partly implemented support for '%s'.", ch);
            }
        } else {
            output ~= ch;
        }
    }
    return output;
}
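For comparison, the same escaping can be sketched with chained replacements instead of a character loop. This is an alternative technique, not the code used in this program, and the name escapeSketch is hypothetical. The ordering matters: the ampersand has to be handled first, otherwise the entities produced for the angle brackets would be escaped a second time.

```d
import std.array : replace;

// Hypothetical alternative to the character loop above.
string escapeSketch(string source)
{
    // '&' goes first so the "&lt;" and "&gt;" added afterwards stay intact.
    return source
        .replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;");
}

void main()
{
    // A line of code with all three special characters in it.
    assert(escapeSketch("if(a < b && c > d)") == "if(a &lt; b &amp;&amp; c &gt; d)");

    // Text without special characters passes through unchanged.
    assert(escapeSketch("plain text") == "plain text");
}
```

The character loop makes one pass and is easier to extend with a warning for unhandled characters; the chained version is shorter but walks the string three times.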
Parse an entire section of text, recursing for definitions if needed.
SBlock[] slurp_section(string contents, ref int offset, ref int lineNumber, bool recurse, ESectionType sectionType) {
    SBlock[] results;
    string currentBlock = "";
    int startLineNumber = lineNumber;
    int index = offset;
    for(; index < contents.length; index++) {
        if(contents[index] == '@') {
            if(recurse && contents[index + 1] == '<') {
                results ~= SBlock(sectionType, startLineNumber, currentBlock);
                currentBlock = "";
                startLineNumber = lineNumber;
                slurp identifier subsection
                if(contents[index..$].startsWith("@>")) {
                    index += 2;
                } else {
                    writefln("Identifier '%s' invoked without close tag. at %s", identifier, contents[index..min($, index + 10)]);
                    break;
                }
            } else if(contents[index + 1] == '@') {
                currentBlock ~= contents[index];
                // Skip the escaped at symbol.
                index++;
            slurp styled section
            } else {
                break;
            }
        }
        Handle embedded index points
        else {
            if(contents[index] == '\n') {
                lineNumber++;
            }
            currentBlock ~= contents[index];
        }
    }
    results ~= SBlock(ESectionType.CODE, startLineNumber, currentBlock);
    offset = index;
    return results;
}
expand_code_identifier
string expand_code_identifier(SSection[] sections, string identifier, string inputFilename) {
    string output;
    output ~= "/* from "~identifier~" */";
    SSection[] definitions = find_matching_identifiers(sections, identifier);
    if(definitions.empty) {
        writefln("ERROR: Unable to find identifier '%s'.", identifier);
        return format("ERROR: %s is undefined", identifier);
    }
    foreach(section; definitions) {
        foreach(block; section.contents) {
            output ~= format("\n#line %d \"%s\"\n", block.lineNumber, inputFilename);
            if(block.type == ESectionType.IDENTIFIER) {
                output ~= expand_code_identifier(sections, block.content, inputFilename);
            } else {
                output ~= block.content;
            }
        }
    }
    return output;
}
parse_web_then_tangle_and_weave
void parse_web_then_tangle_and_weave(ref string outputDisplayContents, ref string outputCodeContents, string fileContents, string inputFilename) {
    SSection[] fileSections;
    int lineNumber = 0;
    int charIndex = 0;
    while(charIndex < fileContents.length) {
        dchar ch = fileContents[charIndex];
        if(ch == '@') {
            dchar chNext = charIndex < fileContents.length - 1 ? fileContents[charIndex + 1] : 0;
            charIndex += 2;
            if(chNext == '@') {
                // It's just an escaped at. Continue parsing.
            } else if(chNext == 'p') {
                fileSections ~= SSection("__main__", ESectionType.CODE, slurp_section(fileContents, charIndex, lineNumber, true, ESectionType.CODE));
            } else if(chNext == '>') {
                // End tag. This should be the end of this block.
            } else if(chNext == '<') {
                SBlock[] identifierBlocks = slurp_section(fileContents, charIndex, lineNumber, false, ESectionType.IDENTIFIER);
                assert(identifierBlocks.length == 1);
                string identifier = identifierBlocks[0].content;
                SBlock[] sectionContents;
                if(fileContents[charIndex..charIndex + 3] == "@>=") {
                    charIndex += 3;
                    sectionContents = slurp_section(fileContents, charIndex, lineNumber, true, ESectionType.CODE);
                } else {
                    writefln("Identifier '%s' invoked outside program and not a definition.", identifier);
                }
                fileSections ~= SSection(identifier, ESectionType.CODE, sectionContents);
            } else if(chNext == '*') {
                int titleEndingPeriod = countFromPosUntil(fileContents, charIndex, ".");
                string title = "";
                if(titleEndingPeriod > 0) {
                    title = fileContents[charIndex..titleEndingPeriod];
                    charIndex = titleEndingPeriod + 1;
                }
                fileSections ~= SSection(title, ESectionType.HEADER, slurp_section(fileContents, charIndex, lineNumber, false, ESectionType.HEADER));
            } else {
                // '@ ' will be converted into a section.
                fileSections ~= SSection("", ESectionType.PARAGRAPH, SBlock(ESectionType.PARAGRAPH, lineNumber, "<p>") ~ slurp_section(fileContents, charIndex, lineNumber, false, ESectionType.PARAGRAPH));
            }
        } else {
            fileSections ~= SSection("", ESectionType.PARAGRAPH, slurp_section(fileContents, charIndex, lineNumber, false, ESectionType.PARAGRAPH));
        }
    }
    New Display Work in parse_web_then_tangle_and_weave
    New support for main sections
    foreach(block; mainSection[0].contents) {
        if(block.type == ESectionType.IDENTIFIER) {
            outputCodeContents ~= expand_code_identifier(fileSections, block.content, inputFilename);
        } else {
            outputCodeContents ~= block.content;
        }
    }
}
Main Function
void main(string[] args) {
    if(args.length != 4) {
        writefln("Usage: WEB3 inputFile outputHTMLFile outputCodeFile");
        // Stop here; indexing args below would fail without all three arguments.
        return;
    }
    const string inputFilename = args[1];
    string fileContents = cast(string) std.file.read(inputFilename);
    if(fileContents.length == 0) {
        writefln("Unable to read file '%s'.", inputFilename);
        return;
    }
    string outputDisplayContents = "";
    string outputCodeContents = "";
    parse_web_then_tangle_and_weave(
        outputDisplayContents,
        outputCodeContents,
        fileContents,
        inputFilename);
    string outputDisplayFilename = args[2];
    std.file.write(outputDisplayFilename, outputDisplayContents);
    string outputCodeFilename = args[3];
    std.file.write(outputCodeFilename, outputCodeContents);
}
Here's where I insert the new index/table of contents work. Each header includes a hyperlink target and the code blocks are marked for show/hide.
New Display Work in parse_web_then_tangle_and_weave
foreach(SSection section; fileSections) {
    insertTableOfContents
    insertIndex
    Better Headers
    if(section.type != ESectionType.CODE) {
        string paragraphReducer(string output, SBlock block) {
            if(block.type == ESectionType.INDEX_TERM) {
                return output ~ "<b id='"~section.name~block.content~"'><i>" ~ block.content.strip("|") ~ "</i></b>";
            } else if(block.type == ESectionType.BOLD) {
                return output ~ "<b>" ~ block.content ~ "</b>";
            } else if(block.type == ESectionType.PRE) {
                return output ~ "<i>" ~ block.content ~ "</i>";
            } else {
                return output ~ block.content;
            }
        }
        string content = reduce!paragraphReducer("", section.contents);
        outputDisplayContents ~= content;
    } else {
        Better Code Blocks
        foreach(block; section.contents) {
            string outputContent = formatCodeForDisplay(block.content, block.lineNumber);
            if(block.type == ESectionType.IDENTIFIER) {
                outputDisplayContents ~= "<b><i>" ~ outputContent ~ "</i></b>";
            } else {
                outputDisplayContents ~= outputContent;
            }
        }
        outputDisplayContents ~= "</pre>";
    }
}
Indexes require some kind of hyperlink anchor for the links to link back to. Headers and code sections are thus named for linking.
Better Headers
if(section.type == ESectionType.HEADER) {
    outputDisplayContents ~= "<h3 id=\"" ~ section.name ~ "\">" ~ section.name ~"</h3>" ~ "<p>";
} else if(section.type == ESectionType.CODE) {
    outputDisplayContents ~= "<p id=\"" ~ section.name ~ "\"><b>" ~ section.name ~"</b>";
}
This inserts a little javascript button to hide and show each code block.
Better Code Blocks
static if(false) {
    outputDisplayContents ~= format(" <button onclick=\"toggle_element_hidden('%s_code')\">Show/Hide Code</button>", section.name);
}
outputDisplayContents ~= "<pre id=\"" ~ section.name ~ "_code" ~ "\">";
I want to highlight the existing blocks where I made a change.
I want to generate something different than HTML and D.
Bug! I've noticed a bug where code segments have to be separated by non-code sections. Here's the fix in the slurp section function.
slurp identifier subsection
slurp: read identifier
slurp: check for new definition
slurp: insert subsection
Save the scanner index before reading the identifier.
slurp: read identifier
int preIdentifierIndex = index;
index += 2;
SBlock[] identifierBlocks = slurp_section(contents, index, lineNumber, false, sectionType);
assert(identifierBlocks.length == 1);
string identifier = identifierBlocks[0].content;
An identifier inserted in a block will look like @> and the definition of a new identifier will look like @>=. If the last code section is ending because of the start of a new one, revert the identifier and allow the new section to start reading.
slurp: check for new definition
if(contents[index..$].startsWith("@>=")) {
    // The end of one block has bumped into the start of another. Roll back.
    index = preIdentifierIndex;
    break;
}
Now that we're sure we're still in the old section, add the new identifier.
slurp: insert subsection
// Now that we're sure this is a reference to an identifier and not a definition of a new identifier, continue.
results ~= SBlock(ESectionType.IDENTIFIER, lineNumber, identifier);
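The reference-versus-definition check can be tried in isolation. The sketch below is a simplification under a few assumptions: the TagKind enum and the classifyTag helper are hypothetical names, the sample WEB text is made up, and escaped '@@' pairs inside a name are ignored.

```d
import std.algorithm : startsWith;
import std.string : indexOf;

// Hypothetical result type for this sketch.
enum TagKind { reference, definition, unterminated }

// Hypothetical helper: classify the "@<name@>" tag that starts at `start`.
TagKind classifyTag(string contents, size_t start)
{
    // Look for the closing "@>" after the opening "@<".
    ptrdiff_t close = contents.indexOf("@>", start + 2);
    if (close < 0)
        return TagKind.unterminated;
    // "@>=" means a new section is being defined; a bare "@>" is a reference.
    return contents[close .. $].startsWith("@>=")
        ? TagKind.definition
        : TagKind.reference;
}

void main()
{
    // Two code sections back to back, with one reference inside the first.
    string web = "@<First@>=\nint a;\n@<Helper@>\n@<Second@>=\nint b;\n";

    assert(classifyTag(web, web.indexOf("@<First")) == TagKind.definition);
    assert(classifyTag(web, web.indexOf("@<Helper")) == TagKind.reference);
    // This is the case behind the bug: the first section has to stop here
    // instead of swallowing the tag as a reference.
    assert(classifyTag(web, web.indexOf("@<Second")) == TagKind.definition);
    assert(classifyTag("@<Broken", 0) == TagKind.unterminated);
}
```

The real fix gets the same answer without a separate lookahead pass: it reads the name, checks for '@>=', and rewinds the index if the tag turns out to start a new definition.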
We're going to start generating a table of contents. I think WEB uses |special word| to generate a separate index. I've inserted a special token __table_of_contents__ to control where the TOC gets generated. Header and code sections both have titles, so I can insert them in the TOC. Headers define major sections, so I indent sub-sections below them. Luckily HTML will convert an empty list <ul></ul> into no space at all, so I can start inside an empty header and then the first header will bump us out. (Saves me tracking the start of the first header.)
insertTableOfContents
if(section.name == "__table_of_contents__") {
    outputDisplayContents ~= "<h3>Table of Contents:</h3>";
    outputDisplayContents ~= "<ul><ul>";
    foreach(SSection tocSection; fileSections) {
        if(tocSection.name.startsWith("__")) {
            continue;
        }
        if(tocSection.type == ESectionType.HEADER) {
            outputDisplayContents ~= "</ul><li><b><a href=\"#"~tocSection.name~"\">" ~ tocSection.name ~"</a></b><ul>";
        } else if(tocSection.type == ESectionType.CODE) {
            outputDisplayContents ~= "<li><a href=\"#"~tocSection.name~"\">" ~ tocSection.name ~"</a>";
        }
    }
    outputDisplayContents ~= "</ul></ul>";
    continue;
}
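The empty-list trick is easier to see with the links stripped out. This is a minimal sketch, assuming a hypothetical Entry struct and buildToc function in place of the real section types.

```d
// Hypothetical stand-in for a section: just a title and a header flag.
struct Entry
{
    string name;
    bool isHeader;
}

// Hypothetical reduced version of the TOC builder, without the hyperlinks.
string buildToc(Entry[] entries)
{
    // Open the outer list plus an inner list that may stay empty.
    string html = "<ul><ul>";
    foreach (e; entries)
    {
        if (e.isHeader)
        {
            // Close whichever inner list is open, emit the header,
            // then open a fresh inner list for its sub-sections.
            html ~= "</ul><li><b>" ~ e.name ~ "</b><ul>";
        }
        else
        {
            html ~= "<li>" ~ e.name;
        }
    }
    // Close the last inner list and the outer list.
    return html ~ "</ul></ul>";
}

void main()
{
    auto toc = buildToc([Entry("Intro", true), Entry("Setup", false)]);
    // The leading "<ul></ul>" is the empty list the first header bumps out of.
    assert(toc == "<ul><ul></ul><li><b>Intro</b><ul><li>Setup</ul></ul>");
}
```

Because every header closes one inner list and opens another, the opening and closing tags always balance without a "have we seen a header yet" flag.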
An index is the same thing as a TOC except the list is alphabetic. (Version 1 has duplicates here from both the definition and references.)
insertIndex
if(section.name == "__index__") {
    Print the index header
    Scan sections for index targets
    Sort index alphabetically
    Print index entries
    outputDisplayContents ~= "</ul>";
    continue;
}
Print the index header
outputDisplayContents ~= "<h3>Index:</h3>";
outputDisplayContents ~= "<ul>";
Like the Table of Contents, we're linking to Header and Code sections. I've added tagging for Index Terms so they get added to the index as well.
Scan sections for index targets
string[] references;
foreach(SSection indexSection; fileSections) {
    if(indexSection.name.startsWith("__")) {
        continue;
    }
    if(indexSection.type == ESectionType.HEADER || indexSection.type == ESectionType.CODE) {
        references ~= indexSection.name~"@"~indexSection.name;
    }
    foreach(SBlock block; indexSection.contents) {
        if(block.type == ESectionType.IDENTIFIER ) {
            references ~= block.content ~"@"~indexSection.name;
        }
        if(block.type == ESectionType.INDEX_TERM) {
            references ~= block.content.strip("|") ~"@"~indexSection.name~block.content;
        }
    }
}
Sort index alphabetically
import std.algorithm.mutation : SwapStrategy;
auto sortedReferences = sort!("a.toUpper < b.toUpper", SwapStrategy.stable)(references);
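Here is a small sketch of what the case-insensitive stable sort does to a handful of made-up references. The sortReferences wrapper is a hypothetical name, and a lambda stands in for the string predicate; the comparison itself is the same.

```d
import std.algorithm.mutation : SwapStrategy;
import std.algorithm.sorting : sort;
import std.uni : toUpper;

// Hypothetical wrapper so the result is easy to inspect.
string[] sortReferences(string[] references)
{
    // Work on a copy so the caller's array is left alone.
    string[] copy = references.dup;
    // Compare upper-cased strings; a stable sort keeps entries that compare
    // equal in their original order.
    sort!((a, b) => a.toUpper < b.toUpper, SwapStrategy.stable)(copy);
    return copy;
}

void main()
{
    // Made-up "term@section" references.
    string[] refs = ["b@x", "Apple@Intro", "apple@intro", "a@x"];
    // "Apple@Intro" and "apple@intro" compare equal, so they stay in
    // their input order.
    assert(sortReferences(refs) == ["a@x", "Apple@Intro", "apple@intro", "b@x"]);
}
```

Stability matters for the numbered links: references to the same term stay in document order, so link 1 points at the earliest usage.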
Print index entries
string lastReference = "";
int referenceCount = 0;
foreach(string reference; sortedReferences) {
    Print a title for each index entry
    Count the links to each usage
}
I cached the references as "block @ section" for ease in storage and sorting. Parse them out. Then check if we've hit a new term to start a new index entry versus adding numbered subscripts.
Print a title for each index entry
string[] components = reference.split("@");
if(components[0] != lastReference) {
    lastReference = components[0];
    outputDisplayContents ~= "<li><b>" ~ components[0] ~ ":" ~ "</b>";
    referenceCount = 0;
}
outputDisplayContents ~= " <a href=\"#" ~ components[1] ~ "\">" ;
Count the links to each usage
referenceCount++;
if(components[0] == components[1]) {
    outputDisplayContents ~= "(definition) "~ "</a>";
} else {
    string indexNumber = format( "%d", referenceCount);
    outputDisplayContents ~= indexNumber ~ "</a>";
}
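Putting the two fragments together on a tiny input shows how the numbering comes out. This is a condensed sketch, not the code above verbatim: renderIndex is a hypothetical name, the input is made up, and the spacing around "(definition)" is simplified.

```d
import std.array : split;
import std.format : format;

// Hypothetical condensed version of the two index fragments above.
string renderIndex(string[] sortedReferences)
{
    string html;
    string lastTerm = "";
    int count = 0;
    foreach (reference; sortedReferences)
    {
        // Each entry is stored as "term@section".
        string[] parts = reference.split("@");
        if (parts[0] != lastTerm)
        {
            // A new term starts a new list item and resets the counter.
            lastTerm = parts[0];
            count = 0;
            html ~= "<li><b>" ~ parts[0] ~ ":</b>";
        }
        count++;
        // A term that names its own section is the definition.
        string label = parts[0] == parts[1] ? "(definition)" : format("%d", count);
        html ~= " <a href=\"#" ~ parts[1] ~ "\">" ~ label ~ "</a>";
    }
    return html;
}

void main()
{
    string html = renderIndex(["Main@Main", "Main@Setup", "Parse@Setup"]);
    assert(html == "<li><b>Main:</b> <a href=\"#Main\">(definition)</a>"
        ~ " <a href=\"#Setup\">2</a>"
        ~ "<li><b>Parse:</b> <a href=\"#Setup\">1</a>");
}
```

The definition link still advances the counter, so the first numbered usage of a defined term is labeled 2 in this sketch.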
The @p tag is renamed __main__. Make sure there's exactly one.
New support for main sections
SSection[] mainSection = find_matching_identifiers(fileSections, "__main__");
if(mainSection.length != 1) {
    writefln("ERROR: Exactly 1 __main__ section needed. %d found.", mainSection.length);
    foreach(section; mainSection) {
        writefln("%s found at line %d", section.name, section.contents[0].lineNumber);
    }
    return;
}
if(mainSection[0].type != ESectionType.CODE) {
    writefln("ERROR: __main__ section needs to be code.");
    return;
}
I've wanted to append sections like includes. I don't want to call out each include, I want to add includes through the demonstration and have them accumulate at the top. For tutorial purposes, I'd like to define a section, then expand and replace it. Knuth's original WEB supports truncated identifiers. The reference may be a full sentence and the definition is truncated.
Find Matching Identifiers
SSection[] find_matching_identifiers(SSection[] sections, string identifier) {
    SSection[] results;
    foreach(section; sections) {
        // A section might be name... or name...! or name...+. Check each form.
        if(section.name.endsWith("...")) {
            if(!identifier.startsWith(section.name[0..$-3])) {
                continue;
            }
        } else if(section.name.endsWith("...!") || section.name.endsWith("...+")) {
            if(!identifier.startsWith(section.name[0..$-4])) {
                continue;
            }
        } else if(section.name != identifier) {
            continue;
        }
        // We've found a name that matches.
        if(section.name.endsWith("...+")) {
            // Append tag.
        } else if(section.name.endsWith("...!")) {
            // Replace tag.
            results = [];
        } else if(!results.empty) {
            writefln("WARNING: Multiple matches for '%s' found. Use ...+ to append or ...! to replace.", identifier);
        }
        results ~= section;
    }
    return results;
}
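The name-matching half of this function can be pulled out and checked against the section names used in the tests further down. A minimal sketch, assuming a hypothetical nameMatches helper; it covers only the matching, not the append and replace bookkeeping.

```d
import std.algorithm : endsWith, startsWith;

// Hypothetical helper: does a section defined as `sectionName`
// answer a reference to `identifier`?
bool nameMatches(string sectionName, string identifier)
{
    // A truncated name matches any identifier that begins with its prefix.
    foreach (suffix; ["...+", "...!", "..."])
    {
        if (sectionName.endsWith(suffix))
            return identifier.startsWith(sectionName[0 .. $ - suffix.length]);
    }
    // Without a suffix, the names have to be identical.
    return sectionName == identifier;
}

void main()
{
    assert(nameMatches("Test appending", "Test appending"));
    assert(nameMatches("Test appending...+", "Test appending"));
    assert(nameMatches("Test replacing...!", "Test replacing"));
    // The prefix before "..." here is "Test partial " with a trailing space.
    assert(nameMatches("Test partial ...", "Test partial names"));
    assert(!nameMatches("Test replacing...!", "Test appending"));
}
```

Since only a prefix is compared, a short truncated name can match more identifiers than intended, so truncated definitions are safest when the prefix is distinctive.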
Comments to test appending and replacing
// There should be 2 appended comments, 1 replaced comment, and 1 partial comment. This message will repeat.
Test appending
Test replacing
// There should be 2 appended comments, 1 replaced comment, and 1 partial comment. This message is the repetition.
Test appending
// This is the original named appending
Test appending...+
// This is the first appended comment
Test appending...+
// This is the second appended comment
Test replacing
// This is the wrong replaced comment
Test replacing...!
// This is an unseen replaced comment
Test replacing...!
// This is the correct replaced comment
Test partial names
Test partial ...
// Partial names appended
The result is something like this:
Results:
// There should be 2 appended comments, 1 replaced comment, and 1 partial comment. This message will repeat.
// This is the original named appending
// This is the first appended comment
// This is the second appended comment
// This is the correct replaced comment
// Partial names appended
// There should be 2 appended comments, 1 replaced comment, and 1 partial comment. This message is the repetition.
I'm tired of version 1 of each parser choking on a tag and me having to escape it either temporarily or permanently. Define tokens here.
New character definitions
const dchar CHAR_PIPE = '|';
const dchar CHAR_AT = '@';
const dchar CHAR_NEWLINE = '\n';
Handle embedded index points
else if(contents[index] == CHAR_PIPE && sectionType != ESectionType.CODE) {
    if(contents[index + 1] == CHAR_PIPE) {
        currentBlock ~= contents[index];
        // Skip the escaped pipe symbol.
        index++;
    } else {
        // Save the previous block.
        results ~= SBlock(sectionType, startLineNumber, currentBlock);
        currentBlock = "";
        startLineNumber = lineNumber;
        currentBlock ~= contents[index];
        for(index++; index < contents.length; index++) {
            currentBlock ~= contents[index];
            if(contents[index] == CHAR_NEWLINE) {
                lineNumber++;
            }
            if(contents[index] == CHAR_PIPE) {
                if(index < contents.length - 1 && contents[index + 1] == CHAR_PIPE) {
                    // Skip the escaped pipe symbol.
                    index++;
                } else {
                    break;
                }
            } else if(contents[index] == CHAR_AT) {
                if(index < contents.length - 1 && contents[index + 1] == CHAR_AT) {
                    // Skip the escaped at symbol.
                    index++;
                } else {
                    writefln("WARNING: At symbol encountered before end of piped index term.");
                    break;
                }
            }
        }
        if(index >= contents.length || contents[index] != CHAR_PIPE) {
            writefln("WARNING: Close pipe expected near line %d.", startLineNumber);
            break;
        }
        // Save the indexed identifier.
        results ~= SBlock(ESectionType.INDEX_TERM, startLineNumber, currentBlock);
        currentBlock = "";
        startLineNumber = lineNumber;
    }
}
'@^' and '@.' start certain font styles (bold and emphasis, respectively) and '@>' will end them. This code is inserted into slurp_section.
I am not sure how this will work with WEB3. A thing to test.
This is the next unstyled paragraph.
slurp styled section
} else if(contents[index + 1] == '^' || contents[index + 1] == '.') {
    ESectionType styleType = contents[index + 1] == '^' ? ESectionType.BOLD : ESectionType.PRE;
    Save off the previous block of text
    parse the styled section
    skip the close tag around the style section
    insert the style block into the section
parse the styled section
SBlock[] textBlocks = slurp_section(contents, index, lineNumber, false, sectionType);
assert(textBlocks.length == 1);
string textBlock = textBlocks[0].content;
skip the close tag around the style section
if(contents[index..$].startsWith("@>")) {
    index += 1;
} else {
    writefln("Identifier '%s' invoked without close tag. at %s", textBlock, contents[index..min($, index + 10)]);
    break;
}
insert the style block into the section
results ~= SBlock(styleType, lineNumber, textBlock);
Save off the previous block of text
results ~= SBlock(sectionType, startLineNumber, currentBlock);
currentBlock = "";
startLineNumber = lineNumber;
index += 2;