The Forum for Discussion about The Third Manifesto and Related Matters

Please or Register to create posts and topics.

Codd 1970 'domain' does not mean Date 2016 'type' [was: burble about Date's IM]

PreviousPage 22 of 22

Lexical parsing -- lexxing -- is fundamentally the process of categorising substrings. It's all about substrings.

I no longer know what point is at issue, so I'll state my case and exit stage left.

  • A lexer is something that consumes a stream of characters and emits a stream of tokens.
  • A token has a type (kind) and may have a value (of some internal compiler type). It may have incidental information, not required for the language definition but useful in various ways.
  • There is no required relationship of any kind between portions (substrings) of an input stream and tokens.
  • There is no required relationship of any kind between how the lexer assigns a value to a token and the operation of any runtime library function.
  • Later stages of the compiler are solely responsible for any conversions between token type (kind) and language types. No substrings involved.

The output of a typical lexxer is a stream of token/substring pairs. For some operations, the substring is ignored or elided. E.g., we see '+' as a substring of the input stream and categorise it as token type PLUS; we can throw away the '+' substring. Keyword tokens don't need to preserve their corresponding source substrings, nor do comment tokens unless the language supports "smart" comments, but the substrings associated with most other tokens are crucial for obtaining specific identifiers and values.

There are no substrings. It's all an illusion. I have a compiler that reads a 10 line program, generates and compiles over a million tokens, and not a substring in sight.

Language parsers are indeed typically defined in terms of token types rather than substrings (because the lexical parser is defined in terms of substring recognition, and has done that work for us) but that doesn't mean it's not about substrings. It is. It's all about substrings. We simply abstract them away at the appropriate level, but keep them as needed for identifiers, selecting values, etc.

Though some languages do more lexical parser heavy-lifting than others. In a language like FORTH, the global lexical parser only (as I recall) effectively identifies lexical types 'number' and 'word'. Everything else is up to the individual words to handle.

My recollection is that FORTH has two types of token: words and integers. Words are translated into pointer-to-code, integers are translated into push-value-on-stack (I think there are hex and dec modes). Some words are parsers, executing immediately and reading directly from the input stream. The lexer is minimalist, to say the least. The input is very much about substrings, but they don't survive. If there is a SELECTOR, it doesn't get to see any substrings.

Andl - A New Database Language - andl.org
Quote from dandl on March 22, 2020, 1:40 am

Lexical parsing -- lexxing -- is fundamentally the process of categorising substrings. It's all about substrings.

I no longer know what point is at issue, so I'll state my case and exit stage left.

  • A lexer is something that consumes a stream of characters and emits a stream of tokens.
  • A token has a type (kind) and may have a value (of some internal compiler type). It may have incidental information, not required for the language definition but useful in various ways.
  • There is no required relationship of any kind between portions (substrings) of an input stream and tokens.
  • There is no required relationship of any kind between how the lexer assigns a value to a token and the operation of any runtime library function.
  • Later stages of the compiler are solely responsible for any conversions between token type (kind) and language types. No substrings involved.

The output of a typical lexxer is a stream of token/substring pairs. For some operations, the substring is ignored or elided. E.g., we see '+' as a substring of the input stream and categorise it as token type PLUS; we can throw away the '+' substring. Keyword tokens don't need to preserve their corresponding source substrings, nor do comment tokens unless the language supports "smart" comments, but the substrings associated with most other tokens are crucial for obtaining specific identifiers and values.

There are no substrings. It's all an illusion. I have a compiler that reads a 10 line program, generates and compiles over a million tokens, and not a substring in sight.

That language must be all keywords, like a command language or configuration language. If you have values like integers, then conversion of the token to a specific integer requires use of the substring of the source that denotes that value. If you have identifiers, then conversion of the token to a reference to what the identifier denotes requires use of the substring of the source that specifies the identifier text. That's why the output of a lexxer is a stream of tokens where a token is a pair: a token type (sometimes just "the token") and the substring of source text identified as the token type. The substring may be elided for some token types (like keywords) but must be preserved for (at least) identifiers and values.

Language parsers are indeed typically defined in terms of token types rather than substrings (because the lexical parser is defined in terms of substring recognition, and has done that work for us) but that doesn't mean it's not about substrings. It is. It's all about substrings. We simply abstract them away at the appropriate level, but keep them as needed for identifiers, selecting values, etc.

Though some languages do more lexical parser heavy-lifting than others. In a language like FORTH, the global lexical parser only (as I recall) effectively identifies lexical types 'number' and 'word'. Everything else is up to the individual words to handle.

My recollection is that FORTH has two types of token: words and integers. Words are translated into pointer-to-code, integers are translated into push-value-on-stack (I think there are hex and dec modes). Some words are parsers, executing immediately and reading directly from the input stream. The lexer is minimalist, to say the least. The input is very much about substrings, but they don't survive. If there is a SELECTOR, it doesn't get to see any substrings.

You are aware that "parse" is from Latin pars, or parts, yes?

The "parts" are derived from substrings of the original source. What else would they be?

This is fundamental to parsing theory.

Selectors have to see substrings. Given an input like p := 3, a lexxer might tokenise it as {identifier, 'p'}, {assignment}, {integer, '3'}. Note the preservation of source substrings in the identifier and integer token types for later use.  The integer token type is mapped to the INTEGER selector, the '3' substring of the original source is passed as an argument to that selector, the result is a value of type INTEGER.

Or, you can instead map the integer token type to some internal type selector, the '3' substring of the original source is passed as an argument to that selector -- which might be C's atoi(), or Java's Integer.parseInt(), or C#'s int.TryParse() -- but it is a selector in TTM terms and it does receive a substring of the original source.

I'm not sure how you'd do it any other way.

 

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

There are no substrings. It's all an illusion. I have a compiler that reads a 10 line program, generates and compiles over a million tokens, and not a substring in sight.

That language must be all keywords, like a command language or configuration language.

It isn't. It just has such powerful compile-time processing that all the integers are constructed arithmetically and all the symbols are constructed by token pasting. Because the code is written lower case and the symbols are upper case, no substrings survive, except possibly the odd literal string.

You are aware that "parse" is from Latin pars, or parts, yes?

The "parts" are derived from substrings of the original source. What else would they be?

Easy. Parsing derives from 'part of speech', the analysis of natural language. Every possible word and character in English already exists in a dictionary. Parsing involves extracting a written or spoken word from context and looking it up in the dictionary. What survives into the next phase is a reference to the dictionary entry, with associated metadata. Just like in my compiler.

Of course parsing original written text means breaking it into lexemes (substrings) or phonemes in order to recognise language elements, but it provides no justification for insisting that those substrings play any further role.

Or, you can instead map the integer token type to some internal type selector, the '3' substring of the original source is passed as an argument to that selector -- which might be C's atoi(), or Java's Integer.parseInt(), or C#'s int.TryParse() -- but it is a selector in TTM terms and it does receive a substring of the original source.

In this context I take SELECTOR to mean the language-defined element responsible for creating new values, as distinct from the internal compiler element parsing integer literals. In my language the former is called INTEGER() and the latter is explicitly coded as part of the lexer. Integer tokens can also be constructed by compile-time arithmetic, with no substrings involved. They share nothing.

 

Andl - A New Database Language - andl.org
Quote from dandl on March 23, 2020, 11:42 pm

 

You are aware that "parse" is from Latin pars, or parts, yes?

The "parts" are derived from substrings of the original source. What else would they be?

Easy. Parsing derives from 'part of speech', the analysis of natural language.

Heh heh you're both right, but David is more right (or at least more recent). Etymonline:

"parse (v.) 1550s, in grammar, "to state the part of speech of a word or the words in a sentence," a verbal use of Middle English pars (n.) "part of speech" (c. 1300), from Old French pars, plural of part "a part," from Latin pars "a part, piece" (from PIE root *pere- (2) "to grant, allot") ... Transferred (non-grammatical) use is by 1788."

It's difficult to see this debate as more than about the number of angels dancing on a pin-head, but ...

Suppose your program text contains both "3" and "0003", or both "3.0" and "3.0000": the lexer is going to parse those substrings into the same token in each case, and throw away the superfluous "0"s. Or suppose your language is case-insensitive, then these'll be the same token despite being distinguishable substrings: "foo", "Foo", "FOO". The lexer might pass the substring and its line/column as metadata alongside the token, but the syntax-checker cares only about the token.

Quote from AntC on March 24, 2020, 5:42 am
Quote from dandl on March 23, 2020, 11:42 pm

You are aware that "parse" is from Latin pars, or parts, yes?

The "parts" are derived from substrings of the original source. What else would they be?

Easy. Parsing derives from 'part of speech', the analysis of natural language.

Heh heh you're both right, but David is more right (or at least more recent). Etymonline:

"parse (v.) 1550s, in grammar, "to state the part of speech of a word or the words in a sentence," a verbal use of Middle English pars (n.) "part of speech" (c. 1300), from Old French pars, plural of part "a part," from Latin pars "a part, piece" (from PIE root *pere- (2) "to grant, allot") ... Transferred (non-grammatical) use is by 1788."

It's difficult to see this debate as more than about the number of angels dancing on a pin-head, but ...

Suppose your program text contains both "3" and "0003", or both "3.0" and "3.0000": the lexer is going to parse those substrings into the same token in each case, and throw away the superfluous "0"s. Or suppose your language is case-insensitive, then these'll be the same token despite being distinguishable substrings: "foo", "Foo", "FOO". The lexer might pass the substring and its line/column as metadata alongside the token, but the syntax-checker cares only about the token.

Syntax checking only references the token (type), but code generation or interpretation references the token substring. Where leading or trailing zeroes are removed, or other transformations occur, is theoretically irrelevant and (implementation-wise) can occur anywhere prior to the associated value being selected/constructed/manifested.

The original claim was that parsing is not about substrings, but that's incorrect. Parsing is all about substrings, even when the substrings are classified (as a token) and the original text thrown away, it's still a classification of a substring of the source string.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
Quote from dandl on March 23, 2020, 11:42 pm

There are no substrings. It's all an illusion. I have a compiler that reads a 10 line program, generates and compiles over a million tokens, and not a substring in sight.

That language must be all keywords, like a command language or configuration language.

It isn't. It just has such powerful compile-time processing that all the integers are constructed arithmetically and all the symbols are constructed by token pasting. Because the code is written lower case and the symbols are upper case, no substrings survive, except possibly the odd literal string.

You are aware that "parse" is from Latin pars, or parts, yes?

The "parts" are derived from substrings of the original source. What else would they be?

Easy. Parsing derives from 'part of speech', the analysis of natural language. Every possible word and character in English already exists in a dictionary. Parsing involves extracting a written or spoken word from context and looking it up in the dictionary. What survives into the next phase is a reference to the dictionary entry, with associated metadata. Just like in my compiler.

Of course parsing original written text means breaking it into lexemes (substrings) or phonemes in order to recognise language elements, but it provides no justification for insisting that those substrings play any further role.

You appear to be referencing both your personal definitions and your personal implementations, rather than general parsing theory. In computer science, parsing is the syntactic classification of source parts. The parts are unavoidably (and by definition), substrings. That's what it is.

Or, you can instead map the integer token type to some internal type selector, the '3' substring of the original source is passed as an argument to that selector -- which might be C's atoi(), or Java's Integer.parseInt(), or C#'s int.TryParse() -- but it is a selector in TTM terms and it does receive a substring of the original source.

In this context I take SELECTOR to mean the language-defined element responsible for creating new values, as distinct from the internal compiler element parsing integer literals. In my language the former is called INTEGER() and the latter is explicitly coded as part of the lexer. Integer tokens can also be constructed by compile-time arithmetic, with no substrings involved. They share nothing.

A selector is the TTM term for obtaining a value from a defined set of values (i.e., a type). Other sources might describe it as value construction or whatever, but as this the "TTM Forum", it seems reasonable to use TTM terminology. As such, selection (aka construction) of values from source substrings (usually but not necessarily paired with token types in tokens and emitted by a lexxer) is unavoidable, whether or not the substrings undergo some semantic-preserving transformation, and whether or not the value thus selected is the end value or merely an intermediate that will be an argument to the final value selection.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

The original claim was that parsing is not about substrings, but that's incorrect. Parsing is all about substrings, even when the substrings are classified (as a token) and the original text thrown away, it's still a classification of a substring of the source string.

No, my claim is that source code substrings do not survive to the point of SELECTOR invocation. Lexing is the only part of parsing that concerns itself with substrings, and the rest of the compiler works with abstract entities that have no necessary relationship to the original substrings. Lexing is all about substrings, so the rest of the compiler doesn't have to deal with them.

Andl - A New Database Language - andl.org
Quote from dandl on March 24, 2020, 9:01 am

The original claim was that parsing is not about substrings, but that's incorrect. Parsing is all about substrings, even when the substrings are classified (as a token) and the original text thrown away, it's still a classification of a substring of the source string.

No, my claim is that source code substrings do not survive to the point of SELECTOR invocation. Lexing is the only part of parsing that concerns itself with substrings, and the rest of the compiler works with abstract entities that have no necessary relationship to the original substrings. Lexing is all about substrings, so the rest of the compiler doesn't have to deal with them.

Source code substrings absolutely do survive to the point of selector invocation, whether the selector is internal or for the end value. E.g., substring '3' winds up being passed, either directly or indirectly to the end type selector for INTEGER (or equivalent) to produce the value 3 at runtime. The fact that the source substring may undergo some semantic-preserving transformations -- e.g., from 0003 to 3 or whatever -- along the way is immaterial.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org

Of course parsing original written text means breaking it into lexemes (substrings) or phonemes in order to recognise language elements, but it provides no justification for insisting that those substrings play any further role.

You appear to be referencing both your personal definitions and your personal implementations, rather than general parsing theory. In computer science, parsing is the syntactic classification of source parts. The parts are unavoidably (and by definition), substrings. That's what it is.

This is getting ridiculous. You keep making obvious true statements that go nowhere near the crux of the matter.

Yes, parsing is (crudely) all about substrings. Compiling is not.

Or, you can instead map the integer token type to some internal type selector, the '3' substring of the original source is passed as an argument to that selector -- which might be C's atoi(), or Java's Integer.parseInt(), or C#'s int.TryParse() -- but it is a selector in TTM terms and it does receive a substring of the original source.

In this context I take SELECTOR to mean the language-defined element responsible for creating new values, as distinct from the internal compiler element parsing integer literals. In my language the former is called INTEGER() and the latter is explicitly coded as part of the lexer. Integer tokens can also be constructed by compile-time arithmetic, with no substrings involved. They share nothing.

A selector is the TTM term for obtaining a value from a defined set of values (i.e., a type). Other sources might describe it as value construction or whatever, but as this the "TTM Forum", it seems reasonable to use TTM terminology. As such, selection (aka construction) of values from source substrings (usually but not necessarily paired with token types in tokens and emitted by a lexxer) is unavoidable, whether or not the substrings undergo some semantic-preserving transformation, and whether or not the value thus selected is the end value or merely an intermediate that will be an argument to the final value selection.

I have defined my terms. I define SELECTOR as the invocation of an identifiable operator in the grammar of the language, such as this excerpt from the TD grammar:

<scalar selector inv>
::= <built in scalar literal>
| <possrep name> ( <argument exp commalist> )

The SELECTOR is the invocation described by this rule, not the means by which the <built in scalar literal> is constructed, which is described as follows:

INTEGER (signed integers): synonym INT; literals expressed as an optionally signed decimal integer;
usual arithmetic and comparison operators, with usual notation. The (implicit) example value is 0.

If you want to read those as equivalent, that's your choice, but I disagree. Strongly.

 

 




 

Andl - A New Database Language - andl.org
Quote from dandl on March 24, 2020, 11:04 pm

Of course parsing original written text means breaking it into lexemes (substrings) or phonemes in order to recognise language elements, but it provides no justification for insisting that those substrings play any further role.

You appear to be referencing both your personal definitions and your personal implementations, rather than general parsing theory. In computer science, parsing is the syntactic classification of source parts. The parts are unavoidably (and by definition), substrings. That's what it is.

This is getting ridiculous. You keep making obvious true statements that go nowhere near the crux of the matter.

Yes, parsing is (crudely) all about substrings. Compiling is not.

I didn't say compiling is all about substrings. I don't think I mentioned compiling, at least not by name.

I said parsing is all about substrings. Not crudely, it simply is. Parsing classifies parts of a source string, and the parts are -- by definition -- substrings. That's what it is.

I'm not sure how you can claim that the substrings don't play a further role; they play the only role. The classification of substrings, and subsequent application of those classifications to verify syntax and implement semantics is what a text-based computer language is.

Everything else is implementation. But implementations often carry substrings from source straight to run-time output. String literals, for example, often pass through unchanged. But even when elided, the token type is simply a surrogate -- a name -- for a substring or set of lexically equivalent substrings.

Or, you can instead map the integer token type to some internal type selector, the '3' substring of the original source is passed as an argument to that selector -- which might be C's atoi(), or Java's Integer.parseInt(), or C#'s int.TryParse() -- but it is a selector in TTM terms and it does receive a substring of the original source.

In this context I take SELECTOR to mean the language-defined element responsible for creating new values, as distinct from the internal compiler element parsing integer literals. In my language the former is called INTEGER() and the latter is explicitly coded as part of the lexer. Integer tokens can also be constructed by compile-time arithmetic, with no substrings involved. They share nothing.

A selector is the TTM term for obtaining a value from a defined set of values (i.e., a type). Other sources might describe it as value construction or whatever, but as this the "TTM Forum", it seems reasonable to use TTM terminology. As such, selection (aka construction) of values from source substrings (usually but not necessarily paired with token types in tokens and emitted by a lexxer) is unavoidable, whether or not the substrings undergo some semantic-preserving transformation, and whether or not the value thus selected is the end value or merely an intermediate that will be an argument to the final value selection.

I have defined my terms. I define SELECTOR as the invocation of an identifiable operator in the grammar of the language, such as this excerpt from the TD grammar:

<scalar selector inv>
::= <built in scalar literal>
| <possrep name> ( <argument exp commalist> )

The SELECTOR is the invocation described by this rule, not the means by which the <built in scalar literal> is constructed, which is described as follows:

INTEGER (signed integers): synonym INT; literals expressed as an optionally signed decimal integer;
usual arithmetic and comparison operators, with usual notation. The (implicit) example value is 0.

If you want to read those as equivalent, that's your choice, but I disagree. Strongly.

You appear to have limited "SELECTOR" to be a language construct specific to Tutorial D. Is it not a general TTM term for an operator that returns a value of a given type, given a literal denoting that value (which is presumably of some other type)?

My reading of TTM certainly suggests that it is.

I'm the forum administrator and lead developer of Rel. Email me at dave@armchair.mb.ca with the Subject 'TTM Forum'. Download Rel from https://reldb.org
PreviousPage 22 of 22