Knuth's Common Words Problem Spec/Design
1 CWP (Common Words Problem)
- "Given a file of text, and a number k, print the k most common words."
- Used as an example of Literate Programming, the art of preparing programs for human readers. See ../../Design/design-doc.html.
- Our interest here: How to describe its spec + design?
2 Design of Knuth's Program for CWP
- Let us not question: Why do we need such a complex design?
- Instead: How do we describe this design?
- Instead: Did Knuth describe the design or the implementation?
3 CWP Spec
- The focus of this lecture is the design. But, we should be clear about the specs.
- Defs of text, words, … Straightforward.
- Def of "common" word.
- "Most Common"? (Based on #occurrences in the given file.)
- Mateti(2013) ../../../PDF/cwp-pmateti-2013.pdf Section 2.1
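One way to make the spec concrete is to write it as a checker that accepts or rejects a proposed answer, without saying how to compute one. The sketch below is only an illustration: the definition of a word (a maximal run of ASCII letters, lowercased), the requirement that the answer be listed in decreasing order of count, and the names words_of and meets_spec are all assumptions made here, not definitions taken from Mateti(2013) or from Knuth.

```python
import re
from collections import Counter

def words_of(text):
    """Assumed definition: a word is a maximal run of ASCII letters, lowercased."""
    return re.findall(r"[a-z]+", text.lower())

def meets_spec(text, k, answer):
    """True if `answer` is an acceptable list of the k most common words of `text`."""
    counts = Counter(words_of(text))
    # Exactly k words, or all distinct words if the text has fewer than k.
    if len(answer) != min(k, len(counts)):
        return False
    # No repeats, and every listed word must occur in the text.
    if len(set(answer)) != len(answer) or any(w not in counts for w in answer):
        return False
    if not answer:
        return True
    # Assumed: the answer is listed in non-increasing order of count.
    if any(counts[a] < counts[b] for a, b in zip(answer, answer[1:])):
        return False
    # No omitted word may be strictly more common than a listed one.
    cutoff = min(counts[w] for w in answer)
    return all(counts[w] <= cutoff for w in counts if w not in answer)
```

Note that ties make the answer non-unique: for the text "x y" and k = 1, both ["x"] and ["y"] pass this check. A spec has to say whether that is acceptable.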
4 CWP Design
- Design Description of CWP needs several levels.
- See ../../Design/correct-by-design.html for S |= D (Design D meets Spec S)
- S |= D1 |= D2 |= … |= Implementation
4.1 Phase 1: Build the bag of words in the file
- Our pseudo code permits many special characters as part of an identifier, as in Scheme.
- Scanner Design (not shown here) provides
- initialize-the-scanner-with-the-file();
- next-word-from-scanner()
- close-the-scanner-and-the-file();
Pseudo-code
- B is a bag of words occurring in the file scanned so far
- union with lhs notation is +=
- The := {} and += operations should be provided by the design of B
initialize-the-scanner-with-the-file();
B := {}    # empty bag
while not-at-eof do
    B += { next-word-from-scanner() };
od
close-the-scanner-and-the-file();
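For comparison, Phase 1 can be sketched in Python with collections.Counter playing the role of the bag B. The scanner here is a stand-in, not the Scanner Design referred to above: scan_words and build_bag are hypothetical names, and the word definition (runs of ASCII letters, lowercased) is an assumption.

```python
import io
import re
from collections import Counter

def scan_words(stream):
    """Stand-in scanner: yields the words of a text stream one at a time."""
    for line in stream:
        yield from re.findall(r"[a-z]+", line.lower())

def build_bag(stream):
    """Phase 1: build the bag of words occurring in the stream."""
    bag = Counter()                  # B := {}
    for word in scan_words(stream):  # while not-at-eof do
        bag[word] += 1               #     B += { next-word-from-scanner() }
    return bag

if __name__ == "__main__":
    print(build_bag(io.StringIO("The cat and the hat.\nThe end")))
```

Opening and closing the file is left to the caller here, for example with a `with open(...)` block around the call.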
4.2 Phase 2: Sort the Bag B into a Sequence Q
Sort the Bag into an ordered sequence based on word-counts
- A design provides (to be shown later) sort-the-bag-*
Q := sort-the-bag-word-count-based(B);
assert bag(Q) == B;
print Q[0 .. k-1];
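Phase 2 can be sketched the same way. The function names are hypothetical, and breaking ties alphabetically is an assumption added here so that the result is deterministic; the pseudo-code above does not say how ties are ordered.

```python
from collections import Counter

def sort_bag_by_count(bag):
    """Phase 2: sequence of (word, count) pairs, most common first."""
    return sorted(bag.items(), key=lambda pair: (-pair[1], pair[0]))

def bag_of(seq):
    """Abstraction back to a bag, used for the assertion bag(Q) == B."""
    return Counter(dict(seq))

if __name__ == "__main__":
    B = Counter("a b b c c c d".split())
    Q = sort_bag_by_count(B)
    assert bag_of(Q) == B
    k = 2
    for word, n in Q[:k]:            # print Q[0 .. k-1]
        print(word, n)
```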
4.3 Meta Comments
- The above can be claimed as a "canonical" design solution. That is, we cannot design in some other way, or without the two phases.
- We can refine the design into a distributed/concurrent one where B and Q are built simultaneously.
5 CWP Design Refinements Hierarchy
- Recall: D2 is-a-refinement-of D1, D1 |= D2
- D1 is an abstraction of D2
- See ../../Design/correct-by-design.html
- How do we progress from a simple bag of words B (Phase 1) to an n-ary hash trie? In five steps.
- The bag of words B can be read out in each design.
5.1 Bags refined as Tables
- Rows of (Word, nOccurs) pairs
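A minimal sketch of this first refinement, assuming a plain list of rows with linear search; WordTable and its method names are hypothetical. The bag() method is the "read out" of the abstract bag B that every refinement is expected to support.

```python
from collections import Counter

class WordTable:
    """Bag refined as a table: rows of [word, n_occurs]."""

    def __init__(self):              # B := {}
        self.rows = []

    def add(self, word):             # B += { word }
        for row in self.rows:
            if row[0] == word:
                row[1] += 1
                return
        self.rows.append([word, 1])

    def bag(self):
        """Abstraction function: the bag this table represents."""
        return Counter({word: n for word, n in self.rows})
```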
5.2 Tables refined as n-ary Trees
- Collection of (Word, nOccurs) as an n-ary tree
- ex: cwp-nary-tree.pdf, a node has up to 26 children
- Is a Prefix of a Word a Word? Yes.
- Sorted on the spelling
- Why? Efficient search to locate place of insertion
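The n-ary tree can be sketched as below, assuming lowercase words over a to z. Each node stands for the prefix spelled on the path from the root, and its count is nonzero only if that prefix occurred as a word, which is why a prefix of a word can itself be a word. The names Node, add, and bag are hypothetical.

```python
from collections import Counter

class Node:
    def __init__(self):
        self.count = 0        # occurrences of the word spelled by the path to this node
        self.children = {}    # letter -> Node, at most 26 entries

def add(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, Node())
    node.count += 1

def bag(node, prefix=""):
    """Read out the bag, visiting children in order of spelling."""
    out = Counter()
    if node.count:
        out[prefix] = node.count
    for ch in sorted(node.children):
        out += bag(node.children[ch], prefix + ch)
    return out
```

After adding "be", "bent", "be", the nodes for "b" and "ben" exist with count 0, while "be" has count 2 and "bent" has count 1.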
5.3 n-ary Trees refined as Plain Tries
- ex: cwp-plain-trie.pdf
- A more space efficient trie: Two links per node, not 26
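A sketch of a two-link node follows. It assumes the two links are "first child" and "next sibling", the usual way to store an n-ary tree with two pointers per node; the figure in cwp-plain-trie.pdf may name them differently. TrieNode, add, and bag are hypothetical names.

```python
class TrieNode:
    def __init__(self, ch):
        self.ch = ch
        self.count = 0
        self.child = None      # link 1: first child
        self.sibling = None    # link 2: next sibling

def add(root, word):
    node = root
    for ch in word:
        kid = node.child
        while kid is not None and kid.ch != ch:   # walk the sibling list
            kid = kid.sibling
        if kid is None:                           # not found: new child at the front
            kid = TrieNode(ch)
            kid.sibling = node.child
            node.child = kid
        node = kid
    node.count += 1

def bag(node, prefix=""):
    """Read out the bag as a dict of word -> count."""
    out = {}
    kid = node.child
    while kid is not None:
        word = prefix + kid.ch
        if kid.count:
            out[word] = kid.count
        out.update(bag(kid, word))
        kid = kid.sibling
    return out
```

The price of the saved space is a linear walk along the sibling list instead of direct indexing by letter.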
5.4 Plain Tries refined as Ringed Tries
- Gather siblings into a circular list, sorted reverse alphabetically
- Introduce a header node
- Parent points to header
- ex: cwp-ringed-trie.pdf
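The ring can be sketched as follows. This is an illustration under assumptions: the header is marked with an empty ch, the parent's link field points to the header, and the names RingNode, child, add, and ring_letters are hypothetical.

```python
class RingNode:
    def __init__(self, ch):
        self.ch = ch           # "" marks a header node
        self.count = 0
        self.link = None       # parent -> header of its ring of children
        self.sibling = self    # next node in the ring; a lone node points to itself

def child(parent, ch):
    """Find the child of `parent` with letter ch, inserting it if absent."""
    if parent.link is None:
        parent.link = RingNode("")            # new ring holding only the header
    header = parent.link
    prev, cur = header, header.sibling
    while cur is not header and cur.ch > ch:  # ring is sorted reverse alphabetically
        prev, cur = cur, cur.sibling
    if cur is header or cur.ch != ch:
        new = RingNode(ch)
        new.sibling = cur
        prev.sibling = new
        cur = new
    return cur

def add(root, word):
    node = root
    for ch in word:
        node = child(node, ch)
    node.count += 1

def ring_letters(parent):
    """Letters of the children of `parent`, in ring order after the header."""
    header, out = parent.link, []
    cur = header.sibling
    while cur is not header:
        out.append(cur.ch)
        cur = cur.sibling
    return out
```

After adding "bent", "cat", "ant", "bat" to an empty root, ring_letters(root) gives the first letters in reverse alphabetical order: c, b, a.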
5.5 Ringed Tries refined as Hash Tries
- Concrete data structure: An array of Items.
- Each item: a structure/record of 4 data members
- View it as a table of 4 columns, named
- link, ch, sibling, and count
- ex: cwp-hash-trie.pdf
6 Hashed Trie
- CWP final design uses a hash trie. ex: cwp-hash-trie.pdf
- Why a Hash Trie?
- Pro: Space efficiency
- Con: Code complexity increase
- Pro: Pointers (addresses) replaced with offsets
- To show off Literate Programming
- Recall: The := {} and += operations should be provided by the design of B and all its refinements.
- Construct an empty hash trie.
- Add a word into the hash trie. May run into a collision.
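The two operations can be sketched over the four columns link, ch, sibling, and count. This is a simplified illustration, not a transcription of Knuth's Pascal. Assumptions made here: letters are coded 1..26, a family of siblings with header slot h keeps the child for letter code c in slot h + c, slot 0 is the root, the probe that picks h is a placeholder hash, and a collision is resolved by moving the whole family to a free region. The class and method names are hypothetical.

```python
EMPTY, HEADER, ROOT = 0, 27, 28   # ch codes; 1..26 stand for 'a'..'z'

class HashTrie:
    def __init__(self, size=2000):            # construct an empty hash trie
        self.size = size                      # should be well above 28
        self.link = [0] * size                # node -> header of children; header -> parent
        self.ch = [EMPTY] * size
        self.sibling = [0] * size             # ring: header -> children -> header
        self.count = [0] * size
        self.ch[0] = ROOT                     # slot 0 is the root, never a header

    def _free_header(self, seed, letters):
        """Probe for h with slot h and all slots h + x (x in letters) free."""
        limit = self.size - HEADER
        for step in range(1, limit):
            h = 1 + (seed + step) % (limit - 1)   # placeholder hash probe
            if self.ch[h] == EMPTY and all(self.ch[h + x] == EMPTY for x in letters):
                return h
        raise MemoryError("hash trie is full")

    def _family(self, h):
        out, q = [], self.sibling[h]
        while q != h:
            out.append(q)
            q = self.sibling[q]
        return out

    def _move_family(self, p, extra):
        """Collision: relocate p's children so that letter code `extra` also fits."""
        old = self.link[p]
        kids = self._family(old)
        new = self._free_header(old, [self.ch[q] for q in kids] + [extra])
        delta = new - old
        for q in [old] + kids:
            r = q + delta
            self.ch[r], self.count[r], self.link[r] = self.ch[q], self.count[q], self.link[q]
            self.sibling[r] = self.sibling[q] + delta    # the ring shifts as a whole
            if q != old and self.link[q]:
                self.link[self.link[q]] = r              # child's own header -> moved node
        for q in [old] + kids:                           # release the old slots
            self.ch[q] = EMPTY
            self.link[q] = self.sibling[q] = self.count[q] = 0
        self.link[p] = new
        return new

    def child(self, p, c):
        """Slot of p's child with letter code c, inserted if absent."""
        h = self.link[p]
        if h == 0:                                       # p has no children yet
            h = self._free_header(p * 31 + c, [c])
            self.ch[h], self.link[h], self.sibling[h] = HEADER, p, h
            self.link[p] = h
        q = h + c
        if self.ch[q] == c:          # a slot holding code c always belongs to header q - c
            return q
        if self.ch[q] != EMPTY:      # slot is taken by some other family
            h = self._move_family(p, c)
            q = h + c
        self.ch[q], self.link[q], self.count[q] = c, 0, 0
        prev = h                     # keep the ring sorted reverse alphabetically
        while self.sibling[prev] != h and self.ch[self.sibling[prev]] > c:
            prev = self.sibling[prev]
        self.sibling[q], self.sibling[prev] = self.sibling[prev], q
        return q

    def add(self, word):             # B += { word }
        p = 0
        for letter in word:
            p = self.child(p, ord(letter) - ord("a") + 1)
        self.count[p] += 1

    def bag(self, p=0, prefix=""):
        """Read out the bag as a dict of word -> count."""
        out, h = {}, self.link[p]
        for q in (self._family(h) if h else []):
            word = prefix + chr(self.ch[q] + ord("a") - 1)
            if self.count[q]:
                out[word] = self.count[q]
            out.update(self.bag(q, word))
        return out
```

Pointers have become array offsets, and no per-node array of 26 links is stored. The cost is the extra code for probing and for relocating a family when a slot is already taken.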
7 Construct the List of (Words, nOccurs) sorted by nOccurs
- Sort the collection of words based on nOccurs (Phase 2) into a sequence.
- Many simple design solutions do exist. Knuth chose a complex design.
7.1 Background on Tree Traversals
- Consider, e.g., in-order binary tree traversal.
- Straightforward to write a recursive traversal.
- Straightforward to convert that into a non-recursive traversal, but using a stack of nodes (as a reminder list of nodes to be traversed).
- Suppose we "forbid" the use of a stack. How exactly do we forbid?
- We forbid the declaration of a stack using memory apart from the storage of the binary tree.
- In a binary tree of n nodes, n + 1 of its 2n links are nil/null, i.e., just over half.
- Clever algorithms now exist that build a stack with these null-links.
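One well-known algorithm of this kind is Morris traversal, sketched below for in-order traversal. It temporarily stores a "thread" back to an ancestor in a null right link and restores the link afterwards, so the tree is unchanged at the end. It is shown as an example of the idea only; it is not the traversal used in Knuth's CWP program. The Node class is a hypothetical minimal one.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def inorder_no_stack(root):
    """In-order traversal with no stack and no recursion (Morris traversal)."""
    out = []
    cur = root
    while cur is not None:
        if cur.left is None:
            out.append(cur.key)
            cur = cur.right
        else:
            # Find the in-order predecessor of cur: rightmost node of the left subtree.
            pred = cur.left
            while pred.right is not None and pred.right is not cur:
                pred = pred.right
            if pred.right is None:
                pred.right = cur       # borrow the null link as a reminder to come back
                cur = cur.left
            else:
                pred.right = None      # second visit: restore the null link
                out.append(cur.key)
                cur = cur.right
    return out

if __name__ == "__main__":
    tree = Node(4, Node(2, Node(1), Node(3)), Node(6, Node(5), Node(7)))
    print(inorder_no_stack(tree))
```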
7.2 Traversal of Hash Trie
- Knuth's CWP destructively traverses the hash trie
- We skip it. Beyond the scope of this course.
8 Criticism of Knuth's Solution
- Others criticized it as an example of software development.
- "Faberge Egg!"
- Equivalent shell script with pipes using standard Unix/Linux utilities (McIlroy, p480, CACM 1986)
- Our criticism is that Knuth's article was not "good documentation." It only described the implementation made to meet an intuitive problem statement. The design and spec were never documented.
9 Exercises
- Exercise: The figure shown in cwp-hash-trie.pdf is a hash trie. Draw its abstraction that we called a Ringed Trie.
- Exercise: Row 3021 has an error (ch should be 20 not 21). Trace how this row represents the word "bent".
- Exercise: Develop a Java class of the Hash Trie. [An implementation-level Pascal src code is in Knuth's paper.]
- Exercise: Develop a literate program edition of the above in Java. Use any of the tools mentioned at http://www.literateprogramming.com/
- Exercise: What is the class invariant of the Hash Trie?
10 References
- Donald E. Knuth, "A Solution to the Common Words Problem", Communications of the ACM, 1986. Knuth is a Turing Award winner. (local PDF ../../../PDF/cwp-knuth-cacm-1986.pdf of Knuth's paper, but search for another PDF link for a clean version). Required Reading.
- Reviews of Knuth's solution by David Hanson, John Gilbert, Communications of the ACM (Literate Programming column), July 1987, 594-599. Recommended Reading.
- Prabhaker Mateti, "Rigorous Re-Design of Knuth's Solution to the Common Words Problem", ../../../PDF/cwp-pmateti-2013.pdf 45pp, 2013. Reformatted, 2018, now 14pp: ../../../PDF/cwp-pmateti-highlighted-full-2018.pdf
- Full version: Recommended Reading.
- Boxed + highlighted portions: Required Reading.