promo banner for Computing & Info Sciences

intro grammars/BNF

video: overview-before-grammars (2m22s)

A quick road-map and some explanation:

We've been looking at union types and compound types, because we're working towards implementing our own language. To represent a syntax tree, we're gonna need a union-of-compound-types (the composite pattern).
We have introduced racket (er, “beginning student”), mostly as an example of a language where there are extremely few syntax-rules, yet we can do everything we want.
We're talking about immutable data, to (a) show that you can certainly write a video-game without ever needing to modify a field (!), and (b) make it easier to write our language interpreter; adding mutation to the language would just makes it harder to reason about otherwise-straightforward code.

Reading: 
         Scott §§2.1.  (Read, then re-read closely.)


Let's play the gramma  game!
Can you turn the string
“<S>” into the string “hotdog hit android .”?
Rules: you can take as many steps as you like, where a step is one of:

1a.   “<S>” can be replaced with “<N> <V> .”
1b.   “<S>” can be replaced with “<N> <V> <N> .”
2a-h. “<N>” can be replaced with “cat” or “ufo” or “apple” or “fox”
     or “dog” or “hotdog” or “android” or “itec”
3a-g. “<V>” can be replaced with with “bite” or “run” or “jump” or “hit”
      or “love” or “spin” or “hear”


Write your answer as each string, with a “⇒” between them.
(You can also include the exact rule# you used too, if you like).

video: grammars-intro (26m41s)





That's a pretty crude sentence we generated.
Can we tweak (improve) these rules so that:
  (i) a noun can be optionally preceded by “a”, “an”, or “the”, and
  (ii) a verb can optionally have an “ed” added to the end?
(We won't enforce the right articles; “a android” will be allowed.)

Please write down your revised rules.

(It's tedious to keep writing “can be replaced with”; you can
just write → instead, as an abbreviation.
…In fact, I'll require you to use the abbreviation, since
it's tedious to keep reading that phrase.  Conciseness can be clarity!)

Likewise, let's abbreviate “or” with |.
(Btw, can we write all our rules w/o needing to say “or”?
It's already a shortcut — a bit of syntactic sugar to
make it easier for us to write our exact rules!)






Specifying Languages:
syntax vs semantics

Syntax: a grammar G1 (mongo the caveman's grammar):

<S> → <N> <V> .                                                 rule 1
<S> → <N> <V> <N> .                                             rule 2
<N> → cat | ufo | apple | fox | dog | hotdog | android | dept   rules 3a-h
<V> → bite | run | jump | hit | love | spin | hear              rules 4a-g

What is a sentence we can derive from <S>?  What's one that we can't?




G1 is a bit mongo-ish; how to add ...
  articles, plurals, and maybe some verb conjugation?

Here's G2:

<S>   → <NP> <CV> <NP> .                                                      rule 1
<S>   → <NP> <CV> .                                                         rule 2
<NP>  → <ART> <N>                                                          rule 3
<ART> → a | an | the                                                  rules 4a-c
<CV>  → <V> s | <V> ed                                                     rules 5a-b
<N>   → cat | ufo | apple | fox | dog | hotdog | android | mongo | itec rules 6a-i
<V>   → bite | run | jump | hit | love | spin | hear                    rules 7a-g


Example:
<S> ⇒ <NP> <CV> <NP> .                  by rule 1
    ⇒ <ART> <N> <CV> <NP> .               by rule 3
    ⇒ the <N> <CV> <NP> .               by rule 4c
    ⇒ the <N> <V> ed <NP> .             by rule 5b
    ⇒ the <N> hit ed <NP> .           by rule 7d
    ⇒ the hotdog hited <NP> .       by rule 6f
    ⇒ the hotdog hited <ART> <N> .    by rule 3
    ⇒ the hotdog hited a <N> .      by rule 4a
    ⇒ the hotdog hited a android. by rule 6g

Some vocabulary

Some vocabulary terms ¹:

nonterminal: A “replaceable” symbol in the grammar; in context-free grammars this means a symbol on the left-hand-side of “→”.
terminal: A “non-replaceable” symbol in the grammar; in context-free grammars this means a symbol that is not on the left-hand-side of any “→”.
→: denotes a grammar rule: in the string you're working with, you're allowed to replace what's on the the left-hand-side of a rule with the characters on the right-hand-side (splicing it into your string).
|: Or. It is just a short-hand: “Z → α | β” means there are actually two separate rules, “Z → α” and “Z → β”.
⇒: “derives in one step” — if γ and δ are strings, then “γ ⇒ δ” means that you can take γ, replace one of its non-terminals with the right-hand-side of one of the rules, and the result is the string δ.
⇒^*: “derives in many steps” — if γ₁ and γ_n are strings, then “γ₁ ⇒^* γ_n” means that there is a sequence ² of derivations γ₁ ⇒ γ₂ ⇒ γ₃ ⇒ … ⇒ γ_n.
derivation: An entire sequence of “⇒^*” steps, showing that one string derives another.
leftmost derivation: A derivation where each step happens to replace the leftmost nonterminal.
parsing: reading in a stream of characters/tokens, and building the corresponding parse tree (or, determine that the input string is not derivable from the grammar).
Later we'll talk about code to do parsing (at least, recursive descent parsing); for the moment we'll just do it instinctively.

Our textbook also uses the terms “token” (a nonterminal), and “lexeme” (a terminal string). Example: “DoubleLiteral” and “BinaryOp” are a tokens, while “+173.4” and “&&” are lexemes. (I've always heard 'token' used for what the book calls lexemes, myself.)

We'll not worry about the terms “concrete syntax tree” (showing every single step of the grammar rules) vs. “abstract syntax tree” (eliding internal branches that have only one child, like skipping the nodes NP ⇒ N ⇒ cat and just writing NP ⇒ cat). Later I'll subtly start using the latter, since those extra steps don't add anything to the data-representation of a syntax-tree.

Derivations also correspond to syntax-trees: video: syntax-tree-hotdog-hit-android (6m25s)

Group exercise: video: fieldDecl-example (15m21s)

What is a grammar for generating… a Java field-declaration? (Let's make a few test-cases first!)
Include: re-factoring rules to make them cleaner; ε,and the “blahMaybe” pattern.

How about an entire Java class-declaration? [not in video -- in lecture, instead]

self-check exercises:
Give a parse tree for the derivation
Give a leftmost-derivation for the parse tree ...

Design a grammar for … Java class-declaration? For declaring main (incl. “String[]” or “String...”)?

Design a grammar for strings matching the following regular expression for phone-numbers (or, one for zip codes): (\⟮\d{3}⟯\)|\d{3}-)?\d{3}-\d{4}

Recursive Rules

video: grammar-recur-ADJ-phrase (7m16s)

G3, a grammar that can generate infinite # of strings:

<S>    → <NP> <CV> <NP> .                                                   rule 1a
<S>    → <NP> <CV> .                                                       rule 1b
<NP>   → <ART_opt> <AP>                                                      rule 3
<ART>  → a | an | the                                                       rules 4a-c
<ART_opt> → ε | <ART>        <--- note: "0 or 1 ADJ"; this is a common CFG idiom
<AP>   → <ADJs> <N>                                                         rule 5      Cf.  <AP> → <ADJ> <AP> | <N>
<ADJs> → ε | <ADJ> <ADJs>     <--- note the circular def'n!                 rules 6a-b
<CV>   → <V> s | <V> ed                                                     rules 7a-b
<N>    → cat | ufo | apple | fox | dog | hotdog | android | mongo | itec rules 8a-i
<V>    → bite | run | jump | hit | love | spin | hear                     rules 9a-g
<ADJ>  → happy | pretty | nice | small | smelly                           rules 10a-e

<S> ⇒ <NP> <CV> <NP> .                                      
    ⇒^* the cat bites <NP> .                              took multiple steps here — hence the “⇒*”)
    ⇒ the cat bites <ART> <AP> .                        
    ⇒ the cat bites the <AP> .                           
    ⇒ the cat bites the <ADJs> <N> .                    by rule 5
    ⇒ the cat bites the <ADJ> <ADJs> <N> .              by rule 6b
    ⇒ the cat bites the nice <ADJs> <N> .
    ⇒ the cat bites the nice <ADJ> <ADJs> <N> .         by rule 6b
    ⇒ the cat bites the nice small <ADJs> <N> .
    ⇒ the cat bites the nice small <ADJ> <ADJs> <N> .   by rule 6b
    ⇒ the cat bites the nice small smelly <ADJs> <N> .
    ⇒ the cat bites the nice small smelly <N> .          by rule 6a
    ⇒ the cat bites the nice small smelly android .

[think "Sentence", "Verb Phrase", "Conjugated Verb", etc.]
[Technically, I'm leaving out spaces from the grammar.  How to fix?]


Is this string in the language?:
    a cat bites a apple.
YES (we can give a derivation for it)
but note that
    a cat bites a apples.
NO (...proof?...take ITEC420!)

making a grammar for Java identifiers

Here're some examples of what we want to be able to generate, using this grammar:

n
numStudents
xyz94
a9bC_today
_yeah
_

video: recursive-rule-java-id (16m15s)

And the rules we come up with:

  <Digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9            (rule 1)
<JavaLetter> → _ | A | B | … | Z | a | b | … | z       (rule 2)
<DigitOrJavaLetter> → <Digit> | <JavaLetter>                   (rules 3a,3b)

<JavaId> → <JavaLetter> <DigitsOrJavaLetters>            (rule 4)

<DigitsOrJavaLetters> → ε                            (rule 5a)
<DigitsOrJavaLetters> → <DigitOrJavaLetter>  <DigitsOrJavaLetters>   (rule 5b)

Finally, we test our answer by deriving some of our examples from earlier:

<JavaId> bnf=> <JavaLetter> <DigitsOrJavaLetters>     (by rule 4)
       ⇒ n <DigitsOrJavaLetters>              (by rule 1n)
       ⇒ n ε                                (rule 5a)
       ⇒ n


<JavaId> bnf=> <JavaLetter> <DigitsOrJavaLetters>                (by rule 4)
       ⇒ n <DigitsOrJavaLetters>                         (by rule 1n)
       ⇒ n <DigitOrJavaLetter> <DigitsOrJavaLetters>       (by rule 5b)
       ⇒ n <Digit> <DigitsOrJavaLetters>                   (by rule 3a)
       ⇒ n 5 <DigitsOrJavaLetters>                       (by rule 2e)
       ⇒ n 5 <DigitOrJavaLetter> <DigitsOrJavaLetters>     (by rule 5b)
       ⇒ n 5 <JavaLetter> <DigitsOrJavaLetters>            (by rule 3b)
       ⇒ n 5 _ <DigitsOrJavaLetters>                     (by rule 1_)
       ⇒ n 5 _ <DigitOrJavaLetter> <DigitsOrJavaLetters>   (by rule 5b)
       ⇒ n 5 _ <JavaLetter> <DigitsOrJavaLetters>          (by rule 3b)
       ⇒ n 5 _ m <DigitsOrJavaLetters>                   (by rule 1m)
       ⇒ n 5 _ m ε                                     (by rule 5a)

A (simple but ambiguous) Grammar for Expressions

Give a grammar for ArithExpr, which can generate strings like:

11 + 3
11 + 2 * 3
(11 + 2) * 3
(((3)))
(2 + (3*4))

(Assume that we already have a nonterminal <Number>.) Hints:

What occurs on each side of a +?
What occurs inbetween ( and )?

Reflecting on grammars

Again I'll empahsize: grammars are all about syntax, and explicitly avoid addressing any semantics. Linguist Naom Chomsky uses the example “Colorless green ideas sleep furiously.” as an example of a sentence which is syntactically perfect English, but semantically nonsensical.

As mentioned already, derivations and syntax trees are both essentially-equivalent ways of showing how the syntax of a sentence/program. Derivations have the snag that multiple derivations can all correspond to the same syntax-tree. For example, in the “the hotdog hit an android .” example, we had “NP V NP .” as our first step, at which point we have three different ways we can proceed that still all yield the same result. However, leftmost derivations do correspond uniquely to syntax trees (in CFGs), so that's what I'll ask for in homeworks.

What if we add further rules?
  G4 will be G3, plus the rules:
Exercise: add adverbs and adjectives.
  ADJ → quick | lazy | brown | hot
  ADV → ADJ ly

and changing G3's rules for `S` to be:
  S → NP CV NP . | NP CV . | NP CV NP ADV.

  Self-exercise:
   - Make a strings that is in G4 (but was not in G3).
   - Make a string that is not in G4.



Okay, armed with the ability to recur to make "1 or more" or "0 or more", now let's try i
coming up with a grammar for a Java class declaration:

  - example:
  class Puppy {
    int numSpots;
    String name = "fido";
    }



  ...




  To see the tree corresponding to our derivation
  go to
  mshang.ca/syntree/
  and enter: [ClassDecl class
           [JavaId [Puppy]]
           {
           [FieldOrMethodDecls [FieldOrMethodDecl  [FieldDecl [Type int] [JavaId [numSpots]] ;] ]
                               [FieldOrMethodDecls [FieldOrMethodDecl [FieldDecl [Type String] [JavaId [name]] [=] [Expr [StringLiteral "fido"]] [;]]]
                                                   [FieldOrMethodDecls [ε]] ]]
           }
  ]
or perhaps
#(struct:cdata #(struct:location 464 4 18005) #(struct:location 470 9 18509) <![CDATA[
  <BoolExp> [<BoolExp> [<NumExpr> [a+2]] [<NumComparator> [>]] [<NumExpr> [5] ]] [<BoolComparator> ||] [<BoolExp> [(] [<BoolExp> [<BoolExp> [!] [<BoolExp> false]]] [<BoolComparator> ==] [<BoolExpr> x]] [)]]]

        [ClassDecl class [JavaId [Puppy]] { [FieldOrMethodDecls [FieldOrMethodDecl [FieldDecl [Type
  int] [JavaId [numSpots]] ;] ][FieldOrMethodDecls [FieldOrMethodDecl
  [FieldDecl [Type String] [JavaId [name]] [=] "fido" ;]] [FieldOrMethodDecls [ε]] ]] } ]
  ]]>)




What is a grammar for...
 - declaring a Java identifier (w/o init'ing, but including access modifiers)
 - a Java identifier?
 - a Java assignment statement?
    (we'll leave "expr" not fully defined)


See:
Java syntax for 120
(which is incomplete and technically wrong in some places), and
some of the BNF from the Java Language Definition.

Here is another version which includes both grammar and accompanying diagrams.



Some limitations of CFG -- see ITEC420 (computability th'y).
Also, if we place certain limitations on the grammar productions,
  (e.g. "Only one non-terminal on the right-hand-side)
  we end up with "regular grammars", which capture exactly
  the same thing as regular expressions.

(A separate issue, btw:
 By constraining our grammar we can get nicer guarantees
 of algorithmic behavior -- parsing etc.)

¹ Again, I'm only discussing Context-Free Grammars (CFGs); some of these vocabulary terms are more general, and the definition here isn't always the more-general-definition. And these definitions are stressing the intuitive, rather than the formal definition. See your textbook for more formal definitions. ↩

² You can also have 0 steps in a sequence-of-derivations, so for any string γ, it's trivially true that γ ⇒^* γ. ↩

This page licensed CC-BY 4.0 Ian Barland
Page last generated Please mail any suggestions
(incl. typos, broken links)
to ibarlandradford.edu