Homework 4: Counting Bigrams

Assigned
  • February 14, 2024
Due
  • February 19, 2024 11:59pm
Collaboration
    Regular policies for collaboration apply to this homework assignment. Make sure you review and understand the policies on the syllabus before you discuss your work on this assignment with anyone else.
Submitting
    Submit your completed bigrams.c file and tests.txt to Gradescope under the corresponding assignment. You may submit as many times as you like up until the deadline.

Overview

This assignment is designed to give you practice with reading input from the keyboard and using arrays. You will write a program that reads input from the user until they type a newline character (i.e. they press enter). Your program will finish by displaying the number of times it saw each bigram in the user input, as well as the most common bigram.

A bigram is a sequence of two adjacent characters from in the user input. The input "abc" has two bigrams: "ab" and "bc". Bigrams include any adjacent characters – including spaces – so inputs like "test message" would include the bigrams "t " and " m".

Note that this assignment does not specify an upper limit on the length of user input; you should not set an upper limit in your implementation. Instead, design your solution so you do not need to store the entire input at any one time. All you need to keep track of is the count for each bigram you have observed. An array (perhaps even a 2D array) would be a good way to organize your count of bigrams.

To get started, download bigrams.tar.gz and run the following commands to extract and open the starter code on MathLAN:

$ cd ~/csc161/homework
$ tar xvzf ~/Downloads/bigrams.tar.gz
$ cd bigrams
$ code .

Requirements

As you complete your implementation, make sure your program meets the following requirements:

  1. The code places no upper limit on the length of user input (assuming there is no integer overflow)
  2. Input is case sensitive, meaning the bigrams "AB", "Ab", "aB", and "ab" all have distinct counts.
  3. The newline character is not part of the input your program processes, so it should not be contained within any bigrams.
  4. The most common bigram is printed at the end of the output. If more than one bigram is tied for most common you may print any of them.
  5. Only bigrams with a non-zero count should be printed.
  6. If the input does not contain any bigrams (i.e. it is shorter than two characters) the program should print a message to that effect instead of the most common bigram.

Your program can display bigrams in any order that is convenient for you.

Examples

The following examples show the expected output for some test cases. The input provided to the program is displayed just below the program’s invocation.

Example 1: A short input

$ ./bigrams
abc
"ab": 1
"bc": 1
Most common bigram: "ab"

Example 2: A long word

$ ./bigrams
antidisestablishmentarianism
"ab": 1
"an": 2
"ar": 1
"bl": 1
"di": 1
"en": 1
"es": 1
"hm": 1
"ia": 1
"id": 1
"is": 3
"li": 1
"me": 1
"ni": 1
"nt": 2
"ri": 1
"se": 1
"sh": 1
"sm": 1
"st": 1
"ta": 2
"ti": 1
Most common bigram: "is"

Example 3: A rhyming input with numbers and punctuation

$ ./bigrams
161 is (sometimes) fun
" (": 1
" f": 1
" i": 1
"(s": 1
") ": 1
"1 ": 1
"16": 1
"61": 1
"es": 1
"et": 1
"fu": 1
"im": 1
"is": 1
"me": 2
"om": 1
"s ": 1
"s)": 1
"so": 1
"ti": 1
"un": 1
Most common bigram: "me"

Example 4: An even longer input with capital letters and punctuation

$ ./bigrams
Sally sells seashells by the seashore.
" b": 1
" s": 3
" t": 1
"Sa": 1
"al": 1
"as": 2
"by": 1
"e ": 1
"e.": 1
"ea": 2
"el": 2
"he": 2
"ho": 1
"ll": 3
"ls": 2
"ly": 1
"or": 1
"re": 1
"s ": 2
"se": 3
"sh": 2
"th": 1
"y ": 2
Most common bigram: " s"

Example 5: One-character input

$ ./bigrams
a
The input did not contain any bigrams.

Hints

  1. I strongly suggest that you read and process user input one character at a time. The getchar() function is probably the easiest way to do this.
  2. A 2D array is a great way to store your bigram counts.

Grading

Your assignment will be graded on a scale of 0–100 points based on the following criteria:

  • The program reads input until it receives a newline character (25 points)
  • The program displays an accurate count of how often it saw each bigram (25 points)
  • The program displays the most common bigram in the input (25 points)
  • The submission includes a thorough testing plan (25 points)

Testing Plan

You no longer need to submit test transcripts with assignments for this class. Instead, you will fill in a file tests.txt (provided in your starter code) with a plan for how you will test your implementation. Your tests should cover edge cases, different types of input, and inputs that check the correctness of your most frequent bigram computation.

Note that your implementation does not have to pass all of your tests for you to receive full credit on the testing portion of the assignment. Designing a good set of tests is a good way to plan your implementation, so you may even choose to write tests first.

Code Quality

There will be potential deductions for any code quality violations. For now, there are six requirements for quality code:

  1. Indent the body of all loops and conditionals one level deeper than the code just outside the loop. Be consistent with your indentation style.
  2. Use curly braces for every loop or conditional body, regardless of the number of lines.
  3. Include comments that explain what your code does. There should be at least one comment by the start of each loop and ifelse block that explains what the purpose of that code is. The comment should not simply restate the code; it should add information for a human who reads the code and is confused.
  4. Give variables descriptive names that help a reader understand what they will be used for.
  5. Do not read from or display the value of any variable without first initializing it.
  6. Your code must compile with no errors and no warnings.