Let's Do a Coding Challenge
So for the last little bit I’ve been jobless. I wrote a whole post about it but haven’t posted it yet because after writing it I figured it was better to let that sit as one of those “write a letter but don’t send it until after you’ve had a good night’s sleep” kind of things. And so I wrote it, left it, and… yeah. There’s nothing particularly inflammatory or anything – at least, not as far as I can tell. However, I will be pulling out some things and writing posts on specific topics later.
Anyways!
Something I thought might be fun is to take some of the coding challenges I’ve gotten in the last few weeks and walk through the process of solving them. For shorter ones, I might also look into alternate solutions or ways to further optimize the solution I find.
As for what language I’ll be using for these, I’m going to be doing this using Go 1.19.
First Challenge: Pig Latin Encoder #
So here’s the first problem I’ll be tackling: I’m going to write some code that takes a string and “translates” each word into Pig Latin. I say “translates”, because what we’re really doing is using a cipher to encode some text, and I’ll be using “cipher” and “encode” from here on out. Mostly because that way I have a term for the thing we’re building ( ‘cipher’ ), and a different word for the operation it performs ( ’encode’ ).
To do that, I’ll be following these rules:
- If the word begins with a consonant, take the first letter of the word, move it to the end, and then add “ay”
- If the word begins with a vowel, just add “way” to the end of the word
Capitalization should be preserved, so “Hello world” should become “Ellohay orldway”; the same goes for punctuation.
Just so you’re all clear, here’s a few examples of what the encoding process should produce:
| English | Pig Latin |
|---|---|
| hello | ellohay |
| hello world | ellohay orldway |
| eat apples | eatway applesway |
| Hello, world! | Ellohay, orldway! |
So to sum up, here’s the “challenge”:
Write a Go package that can encode English text using the Pig Latin cipher.
However, I don’t want to stop there – so I’m adding the following bits to the challenge:
- Write a command-line tool that uses the library to encode the provided text, whether the text is in a file or provided as arguments.
- Write a web utility that can encode text sent to it.
- Add ROT13 as a cipher.
- Add the ability to decode text that has been encoded.
Sounds good? Good! Let’s dive in.
Also, quick note: I didn’t nail down that I’d be using “cipher” and “encode” instead of “translation” until I got a ways into writing this. Naming is hard, and I didn’t want to spend too much time thinking about naming until my subconscious figured it out for me. So if you see anything in here still saying “translation” or anything like that instead of using “cipher”, “encode”, or “decode” – make the appropriate substitution in your head.
Setting Up #
So, like any disciplined programmer, I’m going to do this TDD style. But where do I start, and where am I trying to get to?
So the first thing I need to figure out is folder structure. However, I only need to start with one folder: the one for the library. This is where the basic shared stuff will live – stuff that’s used by more than one cipher. I don’t need a folder for the web or command line tools yet, because I’m not touching those parts yet.
So I’m going to create the directory where I’ll be spending most of my time at first: a library folder. However, I’m not going to call it library or lib. Following some of Dave Cheney’s wisdom, I’m going to give this folder a name that reflects the package I’ll be creating and what it does. I’m not going to use pig-latin because package names can’t have a dash in them ( and also we want to have more ciphers later on ); instead I’m going to call it cipher.
Why A Library #
Really quick digression on why I’m designing this as a library, rather than just wrapping it up entirely in the command line tool or an HTTP-based API.
Well, there are two reasons for this. The first is that I want to build multiple “front-ends” for the library; I’ve already mentioned them: the command line tool and the HTTP API.
The other reason is that even if I was writing this for a work thing where only either a command-line tool OR an HTTP API was needed, I’d still write this library first in the manner you’ll see below. That’s because A) I don’t know if it’ll need more front-ends later, and B) writing code this way makes it easier to read and maintain later.
So to write this in a way that doesn’t have code snippets being copied and pasted all over the place, I’m putting the core functionality into a library. This library will provide an API that can be used within another application; I’ll show how that works when I write the command-line tool, and later the HTTP API.
If I do this right, I should be able to write this library to do Pig Latin, write our command line & web tools, and get them working – then once that’s all good, add ROT13 without having to change anything ( or much of anything ).
First Up: Tests! #
So the first thing I need is some tests. I’m doing this TDD style, after all.
Fun thing about my editor: if I open up a file that’s clearly a Go test file, it fills it in with a basic test. This one, in fact:
func TestPigLatin_Basics(t *testing.T) {
    tests := []struct {
        a, b, c int
    }{
        {1, 0, 1},
        {1, 1, 2},
    }

    for i, x := range tests {
        tt := x
        t.Run(fmt.Sprintf("test%v", i), func(t *testing.T) {
            c := tt.a + tt.b
            if c != tt.c {
                t.Errorf("something is seriously wrong, %v+%v != %v, got %v instead", tt.a, tt.b, tt.c, c)
            }
        })
    }
}
Which is actually handy for me, because I can ensure that my brand-new install of Go 1.19 is working:
$ go test -v ./...
? github.com/seanhagen/pig-latinizer [no test files]
? github.com/seanhagen/pig-latinizer/cmd [no test files]
=== RUN TestPigLatin_Basics
=== RUN TestPigLatin_Basics/test0
=== RUN TestPigLatin_Basics/test1
--- PASS: TestPigLatin_Basics (0.00s)
--- PASS: TestPigLatin_Basics/test0 (0.00s)
--- PASS: TestPigLatin_Basics/test1 (0.00s)
PASS
ok github.com/seanhagen/cipherator/cipher (cached)
Neat!
Now to immediately delete all that auto-generated code.
My first step is to get to a “proper” failing test. Easiest way to do that is to call a function that doesn’t exist and expect a type that also doesn’t exist. That looks like this:
func TestPigLatin_Basics(t *testing.T) {
    var tr *PigLatin
    var err error

    tr, err = NewPigLatin()
    assert.NotNil(t, tr)
    assert.NoError(t, err)
}
I’m using the wonderful stretchr/testify/assert package here, because it helps keep tests nice and readable. So what do I get when I run go test -v ./... again?
? github.com/seanhagen/cipherator [no test files]
? github.com/seanhagen/cipherator/cmd [no test files]
# github.com/seanhagen/cipherator/cipher [github.com/seanhagen/cipherator/cipher.test]
cipher/piglatin_test.go:10:10: undefined: PigLatin
cipher/piglatin_test.go:13:12: undefined: NewPigLatin
FAIL github.com/seanhagen/cipherator/cipher [build failed]
FAIL
Huzzah! Failure!
This might seem silly, but I’ve actually done something important: I’ve
verified our tests run properly. That previous run of go test
showed me that
I can get a passing test, but that could have been a false positive. Writing
this new test shows me that I can get a failing test. It also helps me
confirm the Go compiler is working properly; if it somehow compiled this and ran
it I’d have bigger problems to sort out. That said, the only time I run that
first test is usually after I install a new version of Go 😅.
Moving forward I’m probably not going to write many more super simple tests like this. No promises though.
Next up, I’m going to write another test that covers the two main pieces of our encoder: encoding a word that begins with a consonant, and one that begins with a vowel.
func TestPigLatin_Words(t *testing.T) {
    tests := []struct {
        input, output string
    }{
        {"hello", "ellohay"},
        {"eat", "eatway"},
    }

    for _, tt := range tests {
        t.Run(fmt.Sprintf("%v to %v", tt.input, tt.output), func(t *testing.T) {
            pl, err := NewPigLatin()
            assert.NotNil(t, pl)
            assert.NoError(t, err)

            got, err := pl.Encode(tt.input)
            assert.Equal(t, tt.output, got)
        })
    }
}
This of course fails:
# github.com/seanhagen/cipherator/cipher [github.com/seanhagen/cipherator/cipher.test]
cipher/piglatin_test.go:33:19: pl.Encode undefined (type *PigLatin has no field or method Encode)
FAIL github.com/seanhagen/cipherator/cipher [build failed]
Which makes sense: I don’t have an Encode method yet. So I’ll write some code so I do!
func (pl PigLatin) Encode(input string) (string, error) {

}
But… what do I put in the function? Well, in TDD I’m supposed to write as little code as required to get the test to pass. So here’s what I’ll put in:
func (pl PigLatin) Encode(input string) (string, error) {
    if input == "hello" {
        return "ellohay", nil
    }

    if input == "eat" {
        return "eatway", nil
    }

    return "", fmt.Errorf("don't know how to encode '%v' yet", input)
}
And hey, what do you know, it works:
=== RUN TestPigLatin_Basics
--- PASS: TestPigLatin_Basics (0.00s)
=== RUN TestPigLatin_Words
=== RUN TestPigLatin_Words/hello_to_ellohay
=== RUN TestPigLatin_Words/eat_to_eatway
--- PASS: TestPigLatin_Words (0.00s)
--- PASS: TestPigLatin_Words/hello_to_ellohay (0.00s)
--- PASS: TestPigLatin_Words/eat_to_eatway (0.00s)
PASS
ok github.com/seanhagen/cipherator/cipher 0.003s
Real Code Hours #
However, this isn’t really the solution, right? I can’t hard-code the correct output for each possible input; that kind of goes against the whole reason to program an algorithm. Well, now I can add another test case or two, and then use that as my excuse to write the “proper” code. So let me add world and apples as two more cases to my test table:
func TestPigLatin_Words(t *testing.T) {
    tests := []struct {
        input, output string
    }{
        {"hello", "ellohay"},
        {"eat", "eatway"},
        {"world", "orldway"},
        {"apples", "applesway"},
    }

    for _, tt := range tests {
        t.Run(fmt.Sprintf("%v to %v", tt.input, tt.output), func(t *testing.T) {
            pl, err := NewPigLatin()
            assert.NotNil(t, pl)
            assert.NoError(t, err)

            got, err := pl.Encode(tt.input)
            assert.Equal(t, tt.output, got)
        })
    }
}
This should fail, and does:
=== RUN TestPigLatin_Basics
--- PASS: TestPigLatin_Basics (0.00s)
=== RUN TestPigLatin_Words
=== RUN TestPigLatin_Words/hello_to_ellohay
=== RUN TestPigLatin_Words/eat_to_eatway
=== RUN TestPigLatin_Words/world_to_orldway
piglatin_test.go:36:
Error Trace: cipherator/cipher/piglatin_test.go:36
Error: Not equal:
expected: "orldway"
actual : ""
Diff:
--- Expected
+++ Actual
@@ -1 +1 @@
-orldway
+
Test: TestPigLatin_Words/world_to_orldway
=== RUN TestPigLatin_Words/apples_to_applesway
piglatin_test.go:36:
Error Trace: cipherator/cipher/piglatin_test.go:36
Error: Not equal:
expected: "applesway"
actual : ""
Diff:
--- Expected
+++ Actual
@@ -1 +1 @@
-applesway
+
Test: TestPigLatin_Words/apples_to_applesway
--- FAIL: TestPigLatin_Words (0.00s)
--- PASS: TestPigLatin_Words/hello_to_ellohay (0.00s)
--- PASS: TestPigLatin_Words/eat_to_eatway (0.00s)
--- FAIL: TestPigLatin_Words/world_to_orldway (0.00s)
--- FAIL: TestPigLatin_Words/apples_to_applesway (0.00s)
FAIL
FAIL github.com/seanhagen/cipherator/cipher 0.003s
FAIL
Now it’s time to write some real code to solve this problem. So the first step is to remove our “working” code from before, leaving us with pretty much an empty function:
func (pl PigLatin) Encode(input string) (string, error) {
    return "", fmt.Errorf("don't know how to encode '%v' yet", input)
}
Wait, How DO I Do This? #
So let’s take a quick break from code and think about how we can solve this problem. There are a few different ways I can do this, but let’s start with a simple version.
How about this: we start with three slices:

- one called data, that holds the input exploded into single UTF-8 characters
- another called output, that will hold the output to be returned
- a last one called currentWord, that holds the word currently being encoded
func (pl PigLatin) Encode(input string) (string, error) {
    var output, currentWord []string
    data := strings.Split(input, "")

    // code goes here

    return "", fmt.Errorf("don't know how to encode '%v' yet", input)
}
This isn’t working yet, because I’ve defined three variables but haven’t used them yet. We’re getting to that, though.
Now that I’ve got these three slices, what do I do with them? Well, how about looping through data, and putting each letter into currentWord. Do that until the code hits a non-letter character or reaches the end of data. When it hits a non-letter character or the end, take what’s in currentWord and “encode” it, then put the encoded data into output. At the end of the function, join output together into a single string and return it.
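Sketched out as code, that plan looks something like this – not working code yet, since none of the helpers exist and the “encode a word” step is still hand-waved:

for _, ch := range data {
    if pl.isLetter(ch) {
        currentWord = append(currentWord, ch)
        continue
    }
    // hit a non-letter: encode currentWord, append the encoded word
    // to output, keep the non-letter character as-is, then reset
    // currentWord for the next word
}
// after the loop: encode whatever is left in currentWord, append it,
// then strings.Join(output, "") and return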
I’m going to need some helper functions though; I need to be able to identify whether a character is a letter or not, and whether it’s a space or not. So I put down the encoding stuff for a second, and write some more tests! If that sounds frustrating, think about it this way: writing tests is a way to force us to write the code we wanted to write next anyways – but in a way that ensures it’s testable! Handy, that.
Trying to write tests after the fact usually ends up with very odd tests that have to do way too much work to test the unit at hand, and are usually pretty fragile. By doing TDD we get test coverage (good) as well as easily testable code (very good).
func TestPigLatin_IsLetter(t *testing.T) {
    tests := []struct {
        input  string
        expect bool
    }{
        {"a", true},
    }

    pl, err := NewPigLatin()
    assert.NotNil(t, pl)
    assert.NoError(t, err)

    for _, tt := range tests {
        t.Run(fmt.Sprintf("%v is letter %v", tt.input, tt.expect), func(t *testing.T) {
            got := pl.isLetter(tt.input)
            assert.Equal(t, tt.expect, got)
        })
    }
}
Of course, this fails at the moment:
# github.com/seanhagen/cipherator/cipher [github.com/seanhagen/cipherator/cipher.test]
cipher/piglatin_test.go:56:14: pl.isLetter undefined (type *PigLatin has no field or method isLetter)
FAIL github.com/seanhagen/cipherator/cipher [build failed]
FAIL
So let me get that passing. This one is pretty simple, not a lot to do here:
// isLetter ...
func (pl PigLatin) isLetter(in string) bool {
    if in >= "a" && in <= "z" || in >= "A" && in <= "Z" {
        return true
    }
    return false
}
After that, I go through a very similar process to write a test for isSpace and the code to get that test to pass. Once that’s done, I can switch back to the encoder and keep moving forward. I’m going to skip the test & code steps here and show you what I ended up with. There’s a bunch of code, but there are two important pieces to show you.
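For completeness, here’s my guess at roughly what that string-based isSpace helper looked like at this stage – the version that survives to the end ( shown further down ) wraps unicode.IsSpace instead:

// isSpace reports whether the given one-character string is whitespace
func (pl PigLatin) isSpace(in string) bool {
    return in == " " || in == "\t" || in == "\n"
}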
First up we’ve got the function that handles the encoding bit:
func (pl PigLatin) doTranslation(in []string) []string {
    if len(in) == 0 {
        return in
    }

    if pl.isUpper(in[0]) {
        in[0] = strings.ToLower(in[0])
        // guard against single-letter words like "I" before touching in[1]
        if len(in) > 1 {
            in[1] = strings.ToUpper(in[1])
        }
    }

    var toAppend []string

    if pl.isVowel(in[0]) {
        toAppend = vowAppend
    } else {
        toAppend = conAppend
        in = append(in[1:], in[0])
    }

    in = append(in, toAppend...)

    return in
}
And the function that gets called by a user to encode a string:
func (pl PigLatin) Encode(input string) (string, error) {
    var output, currentWord []string
    data := strings.Split(input, "")

    for _, ch := range data {
        // if it's a letter, append to currentWord
        if pl.isLetter(ch) {
            currentWord = append(currentWord, ch)
            continue
        }

        // if there isn't anything in current word and we're not on a letter, just append
        // the letter to output and continue on
        if len(currentWord) == 0 && !pl.isLetter(ch) {
            output = append(output, ch)
            continue
        }

        // encode the word
        currentWord = pl.doTranslation(currentWord)

        // append currentWord to output
        output = append(output, currentWord...)

        // add the current character ( ie, not a letter, like spaces or punctuation )
        output = append(output, ch)

        // and reset currentWord to an empty slice
        currentWord = []string{}
    }

    currentWord = pl.doTranslation(currentWord)
    output = append(output, currentWord...)

    return strings.Join(output, ""), nil
}
If you want to see the whole file, as well as the tests, you can take a look at the repo where I’ve got this code.
Alrighty! I’ve got some code that handles encoding English into Pig Latin. Now what?
Well, from here we could do a few things.
One is that I could move on to the command line tool and start writing that. I could also do the web utility first, for a bit of fun. However, I want to do some refactoring first; this code isn’t quite as good as it could be, in my opinion.
For example, my Encode method takes in a string as its only argument, and then everything internally is handled with strings. But it feels a bit weird to use a data type that can hold more than one character at a time (ie, strings) when I’d like to use a data type that can only hold one character at a time. That way I can change the helper methods from expecting a string to this new data type, and they’d make a bit more sense. Basically, being able to pass “apple” to isUpper feels a bit weird – I want the code to force the “one character at a time” restriction on my helper methods.
Also, I’m exploding a string and doing a bunch of slice management. That feels a bit heavy; there’s got to be a solution we can use to clean that up too, right?
Refactoring Our Way To… Something #
So let’s think about these two design goals we want to achieve. The first is to move away from the string data type to one that can only hold a single character at a time. The other is to see what we can do to move away from slices if possible; or failing that, to hide them behind our own data type or something from the standard library.

For the first goal, Go has the handy rune data type. I’m not going to go into the technical details here, but if you’re curious, check out this blog post on the go.dev site that goes over the difference between strings, runes, and characters.
So how does this change our code? Well, for one it actually lets me remove a line of code straight away:
data := strings.Split(input, "")
See, if I just loop over our input like so:

for _, r := range input {}

then the type of r is rune, not string – pretty handy!
Of course, this will require a bunch of changes to my code. So starting with tests, let’s do this!
Where to start? Well, I want to start by changing each of our helper methods so that they take a rune instead of a string as their argument. I force myself to make this change by updating the tests so that they’re passing in a rune. For example, we can change TestPigLatin_IsLetter so that the test table is set up like so:
tests := []struct {
    input  rune
    expect bool
}{
    {'a', true},
    {'B', true},
    {'', false},
    {' ', false},
    {'!', false},
    {'1', false},
}
Of course, making that change causes a few… issues:
cipher/piglatin_test.go:52:4: illegal rune literal
I expected an error, but maybe not this one – that’s pointing to this line:
tests := []struct {
    input  rune
    expect bool
}{
    {'a', true},
    {'B', true},
    {'', false},
    {' ', false},
    {'!', false},
    {'1', false},
}
Turns out, there is no “empty rune” in Go. Good to know – hopefully that will make things a bit easier and let me remove some code. But first I’ve got to remove all the “empty string” tests. Once that’s done, I continue this refactoring until I’ve replaced as much usage of string with rune as I can.
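That tracks once you remember what a rune actually is – it’s just an alias for int32, holding a Unicode code point. A quick illustration ( not code from the project ):

var r rune        // the zero value is 0, the NUL code point
a := 'a'          // a rune literal
// b := ''        // compile error: illegal rune literal
fmt.Println(r, a) // prints "0 97"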
Once that’s done, what’s changed? To be honest, not a ton. Some of our helper functions now just wrap functions from the unicode package:
// isLetter ...
func (pl PigLatin) isLetter(in rune) bool {
    return unicode.IsLetter(in)
}

// isVowel ...
func (pl PigLatin) isVowel(in rune) bool {
    for _, v := range vowels {
        if in == v {
            return true
        }
    }

    return false
}

// isUpper ...
func (pl PigLatin) isUpper(in rune) bool {
    if !pl.isLetter(in) {
        return false
    }

    return unicode.IsUpper(in)
}

// isSpace ...
func (pl PigLatin) isSpace(in rune) bool {
    return unicode.IsSpace(in)
}
That’s good, less code is always appreciated! Especially when the replacement comes from the standard library. Other than that, the only other real change is using a string builder to turn our output slice into a string at the end. It feels a bit icky having this at the end of my Encode function:
outWr := strings.Builder{}
for _, r := range output {
    outWr.WriteRune(r)
}

return outWr.String(), nil
Before implementing that change though, I’m going to tackle removing slices first. Don’t get me wrong; slices are fine, but what would be real nice is using stuff from the standard library so that I’m not working at such a low level. Or to word that better: I want the code inside each function to be working at the same level of abstraction. This means that if I want to use something like a string builder, the rest of the function shouldn’t mix in low-level primitives like slices. Of course, there are exceptions to every rule, and this one isn’t any different. The point, though, isn’t strict adherence to every single rule; the point is to create readable, understandable, and maintainable code. That means not forcing another developer to keep low-level concepts in their head at the same time as high-level concepts.
What kind of solution could we use so that we’re abstracting away these slices?
What about a token parser? What’s that, you ask? Simple! Well, simple-ish.
Parsing Parsing Parsing #
So what are tokens, actually? Well let’s start by getting a bit more specific. In this context, I’m specifically talking about “lexical tokens”. Basically, taking a sentence like this:
Hello there, world!
And turning it into a list like this:
| Token Type | Value |
|---|---|
| Word | Hello |
| Space | |
| Word | there |
| Symbol | , |
| Word | world |
| Symbol | ! |
As you can see, we’ve got three ’types’ of tokens: words, spaces, and symbols. So what we’re going to do seems pretty straightforward: take our input string, turn it into a series of tokens, and then tell each token to encode itself. Actually implementing that will be a bit more work. So, let’s dive in!
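Just to make “token” a bit more concrete, here’s a rough sketch of what a token type for this could look like – the names are my own invention, purely for illustration:

// hypothetical token representation – illustration only
type TokenType int

const (
    TokenWord TokenType = iota
    TokenSpace
    TokenSymbol
)

type Token struct {
    Type  TokenType
    Value string
}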
Note from future Sean: I didn’t take the idea of tokens as far as I imagined here – the code didn’t need to get that complicated to solve the problem. This is actually one of the benefits of doing TDD in my mind; by writing a test focused on the problem I didn’t get caught up in writing the wrong solution. Or at least not an over-engineered solution.
First, we start by taking a look around the standard library to see if there’s something we can use. Turns out, there is! It’s called text/scanner, and we can use it to turn a string into a series of simple tokens. The best part is that we can customize how it parses the input, meaning we can use it to split our input string up exactly as we want!
However, I’ve never used text/scanner before – so I’m going to have to play around with this package to see how it works. So let’s play around a bit! First, I copied one of the examples from the godoc page and modified it a bit:
package main

import (
    "fmt"
    "strings"
    "text/scanner"
)

func main() {
    str := "Hello, world! this is a test string 12345 ### $5 . what"
    var s scanner.Scanner
    s.Init(strings.NewReader(str))
    s.Filename = "example" // so positions print as "example:1:6" and so on
    for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
        fmt.Printf("%s: %s\n", s.Position, s.TokenText())
    }
}
Running that gives me this:
example:1:1: Hello
example:1:6: ,
example:1:8: world
example:1:13: !
example:1:15: this
example:1:20: is
example:1:23: a
example:1:25: test
example:1:30: string
example:1:37: 12345
example:1:43: #
example:1:44: #
example:1:45: #
example:1:47: $
example:1:48: 5
example:1:50: .
example:1:52: what
In other words, the default settings for text/scanner.Scanner almost turn Hello, world! this is a test string 12345 ### $5 . what into exactly what we need. At the moment, it’s dropping spaces. Lucky for us, fixing that is pretty straight-forward!
func main() {
    str := "Hello, world! this is a test string 12345 ### $5 . what"
    var s scanner.Scanner
    s.Init(strings.NewReader(str))
    s.Filename = "example"
    // Whitespace is a bitmask of characters to skip; flipping the
    // tab & space bits makes the scanner emit them as tokens instead
    s.Whitespace ^= 1<<'\t' | 1<<' '
    for tok := s.Scan(); tok != scanner.EOF; tok = s.Scan() {
        fmt.Printf("%s: '%s'\n", s.Position, s.TokenText())
    }
}
Adding that s.Whitespace line gives us this output:
example:1:1: 'Hello'
example:1:6: ','
example:1:7: ' '
example:1:8: 'world'
example:1:13: '!'
example:1:14: ' '
example:1:15: 'this'
example:1:19: ' '
example:1:20: 'is'
<cut for length>
Perfect!
So next step is to use this to parse the input string into a bunch of tokens. But how do I do that? What tests do I change, or add?
Let’s take a step back for a moment. There are two ways I could go about this. The first is that I could replace the loop we’ve got right now with one that just uses a text/scanner.Scanner. I wouldn’t have to change any tests at all, as I’d basically just be refactoring; the goal would be to re-implement this feature with new code that still passes the old tests.
However, when trying to implement the scanner version while changing as little as possible, I very quickly run into an issue. All my helper functions are built around rune, not string tokens. Right now, the code is set up to go through the input rune by rune to build each token manually. These ’tokens’ are just slices of runes that I pass into an encoding function.
How would I have to change the code to make the scanner version work? Well, let’s start by looking at the documentation for text/scanner – maybe there’s something there I can use? And because I’m writing this in the past, of course there is: Peek()!
Changing the scan test code to look like this:
func main() {
    str := "Hello, world! this is a test string 12345 ### $5 . what"
    var s scanner.Scanner
    s.Init(strings.NewReader(str))
    s.Filename = "example"
    s.Whitespace ^= 1<<'\t' | 1<<' '
    fmt.Printf("first peek: %c\n", s.Peek())
    for {
        n := s.Peek()
        tok := s.Scan()
        if tok == scanner.EOF {
            break
        }

        fmt.Printf("%s: '%s', peek: %c\n", s.Position, s.TokenText(), n)
    }
}
Produces this output:
example:1:1: 'Hello', peek: H (eof: false)
example:1:6: ',', peek: , (eof: false)
example:1:7: ' ', peek: (eof: false)
example:1:8: 'world', peek: w (eof: false)
example:1:13: '!', peek: ! (eof: false)
example:1:14: ' ', peek: (eof: false)
example:1:15: 'this', peek: t (eof: false)
<cut for length>
Neat! Even better, Peek() returns a rune! So I can peek before scanning a token to get the first rune from the next token, and use that with our helper functions, perhaps?
Another note from future Sean: I misread the docs on how Peek() works compared to Scan(); somehow my brain completely missed that what I wanted from it was a first line that looks like this: example:1:1: 'Hello', peek: , (eof: false). Thankfully, the way I misinterpreted this ended up working 😅 – don’t worry though, I eventually figure this out and fix the code so it’s not doing this specific silly thing any more. However, as I figured this out a few days after writing this section I’m just going to leave this in, mostly because trying to re-write this section just isn’t going to happen.
After making those changes, here’s what I’ve got now:
func (pl PigLatin) Encode(input string) (string, error) {
    var output []string

    pl.s.Init(strings.NewReader(input))
    pl.s.Filename = "original"
    pl.s.Whitespace ^= 1<<'\t' | 1<<' '

    for {
        ch := pl.s.Peek()
        tok := pl.s.Scan()
        if tok == scanner.EOF {
            break
        }

        if pl.isLetter(ch) {
            currentWord := pl.encodeStr(pl.s.TokenText())
            output = append(output, currentWord)
            continue
        } else {
            output = append(output, pl.s.TokenText())
        }
    }

    return strings.Join(output, ""), nil
}

func (pl PigLatin) encodeStr(in string) string {
    var d []rune
    for _, ch := range in {
        d = append(d, ch)
    }
    d = pl.encodeRunes(d)
    out := strings.Builder{}
    for _, ch := range d {
        out.WriteRune(ch)
    }
    return out.String()
}
Nice! Now there’s only one slice: the output slice, which goes back to being a slice of strings. I wrote encodeStr to transform the string from the scanner into the slice of runes expected by encodeRunes ( which used to be doTranslation ), but I’m not really a fan. However, this works well enough for now that I can move on to making the command line app/tool/thing. I’ll clean this up later when I come back around to do other ciphers.
If you want to see all the code as it is now, you can check out the ‘refactoring’ branch on GitHub.
Command Line Fun #
Okay, so I’ve got the library now, time to build out the command line tool.
I’m going to be using Cobra to build the command line tool. One of the things that Cobra gives me is really easy sub-commands. For those of you who aren’t that familiar with the command line, a sub-command is basically an argument to the command line tool that causes it to do different things.
For example, let’s pretend I wanted to build a simple command line tool that can upload an image or get a list of images you’ve uploaded; let’s call it img. I could use flags to do these things; img -u <filename> could upload a file, and img -l would list your uploaded images. However, flags aren’t super easy to remember, and figuring out what they do often requires heading to the documentation. Instead, I can use sub-commands so that instead of img -u <filename> you could write img upload <filename>; instead of img -l you could write img list.
I’m going to take advantage of this so that each cipher I want can be its own sub-command. This command is going to be called cipherator, because that’s hip and edgy. Next I have to decide which “thing” will be the next-level sub-commands: the cipher to use, or the operation to perform. In other words, do I want to have cipherator <cipher> <operation> or cipherator <operation> <cipher>?
Because I may end up with way more ciphers than operations, I think cipherator <cipher> <operation> makes more sense. That’s because this way I can have a “Pig Latin” command with two sub-commands ( encode and decode ), instead of an “encode” command with a sub-command for each cipher I want to support. Now to be clear, this is mostly a personal preference / organization thing. I’m not trying to say one of these is better than the other; just that I had to make a choice and I prefer cipherator <cipher> <operation>.
At the moment, I’ve only got one cipher: Pig Latin. As for operations, I’ve only got encode – ie, turning English into Pig Latin.
Now, how does the command get the input? I think the easiest thing is for it to have one flag: -f/--file <filename>. If this flag is present, it treats the argument to that flag as a text file to open and encode. If the flag isn’t present, it treats all other non-flag arguments as the strings to encode. In other words, if we have a file named “copy.txt” with the contents “hello world”, these two invocations of cipherator should produce the same output:

cipherator piglatin encode -f copy.txt
cipherator piglatin encode hello world
I think it’d be nice to also have a flag that lets us redirect the output to a file; let’s go with -o/--output <filename>. That way, we could do cipherator piglatin encode -f copy.txt -o pl.txt and have our copy.txt encoded and put into pl.txt.
However, working with files will come a bit later – I want to do some refactoring before adding the ability to read from a file or to output to a file. For now the command will just take input as additional arguments.
Enough Planning, Write A Command Line Tool Already #
Okay, okay.
So, full disclosure: this is going to be the first time I’ll be writing a Cobra-backed command line tool using TDD. I’ve written a handful of CLI apps using Cobra in the past, I’ve just never tested the CLI app or written one using TDD.
The first thing we need to do is set up our sub-commands. We need a ‘Pig Latin’ sub-command, and that needs an ’encode’ sub-command. However, I don’t want to have to write the code that deals with input & output over and over again. In fact, all the commands need the same two things:
- an io.Reader to read text from
- an io.Writer to write the output to
Interfaces in Go are pretty great, and these two provided by the io package are among the best. Almost every Go library or application that needs to deal with input and/or output probably uses ( or should be using ) these interfaces.
So what we’re going to do is define the two flags ( -f/--file for input, -o/--output for output ) on the root command, rather than on any of the sub-commands. Does this mean you could type in cipherator -f <filename>? Yes. Would it do anything? No. It might be a bit silly, but I’m going to run with this for now.
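As a rough sketch, registering those two persistent flags on the root command could look something like this – the variable and function names here are my own, not necessarily what ends up in the repo:

var inputFile, outputFile string

func setupRootFlags(root *cobra.Command) {
    // persistent flags are inherited by every sub-command
    root.PersistentFlags().StringVarP(&inputFile, "file", "f", "", "read the text to encode from this file")
    root.PersistentFlags().StringVarP(&outputFile, "output", "o", "", "write the encoded output to this file")
}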
First off, we’ve got to figure out some tests. To do that, we need to figure out what we’re testing. For our first test, we could check that when you run the base command – cipherator, in our case – it prints out the help text. Sounds good!
Here’s what we end up with:
func TestCmd_Root(t *testing.T) {
    output := bytes.NewBuffer(nil)
    expect := helpText

    rootCmd.SetOutput(output)
    err := rootCmd.Execute()
    assert.NoError(t, err)
    assert.Contains(t, output.String(), expect)
}
If you’re curious, I’m using assert.Contains instead of assert.Equal because the help text changes as soon as you add a sub-command. This way I’m just testing that the initial help text is being output as expected.
Then there’s the root command, so the test passes:
package main

import (
    "github.com/spf13/cobra"
)

const helpText = `Cipherator is a CLI tool for encoding and decoding English text using
a variety of "toy" ciphers such as Pig Latin.`

// rootCmd represents the base command when called without any subcommands
var rootCmd = &cobra.Command{
    Use:   "cipherator",
    Short: "Encode/decode text using various toy ciphers",
    Long:  helpText,
}

func main() {
    cobra.CheckErr(rootCmd.Execute())
}
Nice! So that’s the first step done. Next up, our Pig Latin command. Initially, it looks pretty similar to our root command:
package main

import "github.com/spf13/cobra"

const piglatinHelpText = `Cipherator is a CLI tool for encoding and decoding English text using
a variety of "toy" ciphers such as Pig Latin.`

var piglatinCmd = &cobra.Command{
    Use:   "piglatin",
    Short: "Encode/decode text using the Pig Latin cipher",
    Long:  piglatinHelpText,
}

func init() {
    rootCmd.AddCommand(piglatinCmd)
}
The test is basically the same as the root command:
func TestCmd_Piglatin(t *testing.T) {
    output := bytes.NewBuffer(nil)
    expect := piglatinHelpText

    piglatinCmd.SetOutput(output)
    err := piglatinCmd.Execute()

    assert.NoError(t, err)
    assert.Contains(t, output.String(), expect)
}
But this test fails:
=== RUN TestCmd_Piglatin
Cipherator is a CLI tool for encoding and decoding English text using
a variety of "toy" ciphers such as Pig Latin.
Usage:
cipherator [command]
Available Commands:
completion Generate the autocompletion script for the specified shell
help Help about any command
Flags:
-h, --help help for cipherator
Additional help topics:
cipherator piglatin Encode/decode text using the Pig Latin cipher
Use "cipherator [command] --help" for more information about a command.
piglatin_test.go:18:
Error Trace: /home/sean/Code/Go/src/github.com/seanhagen/cipherator/cmd/piglatin_test.go:18
Error: "" does not contain "Cipherator is a CLI tool for encoding and decoding English text using\na variety of \"toy\" ciphers such as Pig Latin."
Test: TestCmd_Piglatin
--- FAIL: TestCmd_Piglatin (0.00s)
=== RUN TestCmd_Root
--- PASS: TestCmd_Root (0.00s)
FAIL
FAIL github.com/seanhagen/pig-latinizer/cmd 0.005s
FAIL
Why is that?
Well, turns out Cobra needs some extra prodding to work the way you expect. I had to change a few things, but managed to get this working.
The first step was to use the root command, but set the arguments so it calls our sub-command.
func TestCmd_Piglatin(t *testing.T) {
    output := bytes.NewBuffer(nil)
    expect := piglatinLongHelpText

    rootCmd.SetArgs([]string{"piglatin"})
    rootCmd.SetOutput(output)
    err := rootCmd.Execute()

    assert.NoError(t, err)
    assert.Contains(t, output.String(), expect)
}
The second step, which highlights an issue, was to update the test for the root command to do something similar:
func TestCmd_Root(t *testing.T) {
    output := bytes.NewBuffer(nil)
    expect := rootHelpText

    rootCmd.SetArgs([]string{""})
    rootCmd.SetOutput(output)
    err := rootCmd.Execute()

    assert.NoError(t, err)
    assert.Contains(t, output.String(), expect)
}
Why did I have to go back and change the test for the root command?
Global variables.
Because the root command is defined outside of any function or struct like so:
var rootCmd = &cobra.Command{
It means that the root command variable in both tests is the same thing. Now, this might not seem like a big deal, but as I’ve already run into one weird issue because of global variables, I’m going to take a moment to refactor all this to not use them at all. I don’t want these tests to become flaky or hard to understand because I kept using the global variables Cobra sets up by default when you use cobra to generate the initial files for you.
Here’s where we’re at now:
// in piglatin_test.go
func TestCmd_Piglatin(t *testing.T) {
    output := bytes.NewBuffer(nil)
    expect := piglatinLongHelpText

    cmd := getPigLatinCommand()
    cmd.SetOutput(output)
    err := cmd.Execute()

    assert.NoError(t, err)
    assert.Contains(t, output.String(), expect)
}

// in root_test.go
func TestCmd_Root(t *testing.T) {
    output := bytes.NewBuffer(nil)
    expect := rootHelpText

    rootCmd := getRootCommand()
    rootCmd.SetOutput(output)
    err := rootCmd.Execute()

    assert.NoError(t, err)
    assert.Contains(t, output.String(), expect)
}

// in root.go
func getRootCommand() *cobra.Command {
    return &cobra.Command{
        Use:   "cipherator",
        Short: "Encode/decode text using various toy ciphers",
        Long:  rootHelpText,
    }
}

// in piglatin.go
func getPigLatinCommand() *cobra.Command {
    return &cobra.Command{
        Use:   "piglatin",
        Short: piglatinShortHelpText,
        Long:  piglatinLongHelpText,
    }
}

func setupPigLatinCommand(root, plc *cobra.Command) {
    root.AddCommand(plc)
    // set up any flags below here
}

// in main.go
func main() {
    root := getRootCommand()

    piglatin := getPigLatinCommand()
    setupPigLatinCommand(root, piglatin)

    cobra.CheckErr(root.Execute)
}
Huzzah! No more global variables, and tests can run in parallel without messing with each other. Very good.
Okay, so now it’s time to add the sub-command for encoding, right?
Well…
Detours & Refactoring #
Here’s the thing about letting your subconscious figure stuff out for you: sometimes it takes a little while. Remember how I said earlier that I felt that cipherator <cipher> <operation> made more sense? My subconscious figured out a better way to set this command up, but it requires switching to cipherator <operation> <cipher> to make it work.
The reason is that before, I was thinking that <cipher> and <operation> would all be sub-commands. But if I switch it around, then I can have something like this ( very much not real code, just an example ):
var exampleEncodeCommand = &cobra.Command{
    Use:   "encode <cipher>",
    Short: "Encode some text using the named cipher.",
    Long: `Use one of the built-in ciphers to encode some text.

Use the 'list-ciphers' command to see the list of built-in ciphers`,
    RunE: func(cmd *cobra.Command, args []string) error {
        useCipher := args[0]
        toEncode := args[1:]

        enc, err := cipher.GetEncoder(useCipher)
        if err != nil {
            return fmt.Errorf("'%v' is not a known cipher", useCipher)
        }

        enc.Encode(strings.Join(toEncode, " "))
        cmd.OutOrStdout().Write(enc.Bytes())

        return nil
    },
}
That cipher.GetEncoder(useCipher) call is the reason for the change to cipherator <operation> <cipher>.
Why is this version better, though? Simple: because it doesn’t require that the command line tool know anything about what ciphers are available. This way, if the caller of the API asks for a cipher not defined, we can return an error. This might seem a bit silly; there’s only one cipher at the moment!
When I add ROT13 though, why should that involve updating the command line tool? The only thing that should change is the cipher package. This should be the goal – because then I won’t forget to update both packages when a new cipher is added later.
Also: I’ve got some ideas on how to improve the encoder.
So, what does this mean? Well, I need to go back to our cipher package and make some changes. The main one will be to add the “get a cipher” function, which will take an argument that defines which encoder to return, and return either the encoder that was asked for – or an error.

Again, starting with a test in cipher_test.go:
func TestCipher_GetEncoder(t *testing.T) {
    var enc Encoder
    var err error

    enc, err = GetEncoder(EncoderTypePiglatin)

    assert.NoError(t, err)
    assert.IsType(t, &PigLatin{}, enc)
}
And then some implementation:
//go:generate go-enum -f=$GOFILE --marshal

package cipher

import "fmt"

// EncoderType ...
// ENUM(piglatin, rot13)
type EncoderType int32

// Encoder
type Encoder interface {
    Encode(string) (string, error)
}

// GetEncoder ...
func GetEncoder(t EncoderType) (Encoder, error) {
    switch t {
    case EncoderTypePiglatin:
        return NewPigLatin()
    }

    return nil, fmt.Errorf("%v is an unknown encoder type", t.String())
}
Here I’m using the fantastic go-enum package to generate some enum values. If you want to see what was generated, go take a look on GitHub. Other than that, pretty straight-forward.
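If you don’t feel like clicking through, here’s roughly the shape of what go-enum generates for that ENUM comment – a hand-written approximation, not the actual generated file:

const (
    // EncoderTypePiglatin is an EncoderType of type piglatin
    EncoderTypePiglatin EncoderType = iota
    // EncoderTypeRot13 is an EncoderType of type rot13
    EncoderTypeRot13
)

var _encoderTypeNames = map[EncoderType]string{
    EncoderTypePiglatin: "piglatin",
    EncoderTypeRot13:    "rot13",
}

// String implements fmt.Stringer
func (x EncoderType) String() string {
    return _encoderTypeNames[x]
}

// ParseEncoderType turns a name like "piglatin" back into an EncoderType
func ParseEncoderType(name string) (EncoderType, error) {
    for t, n := range _encoderTypeNames {
        if n == name {
            return t, nil
        }
    }
    return EncoderType(0), fmt.Errorf("%s is not a valid EncoderType", name)
}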
There is more I could do, but I need to stay focused on our current task: finishing the command line tool. There are other changes I could make to the cipher package, but those should wait until later.
Okay, so what’s next? Back to the command line package to write some tests. The new way this will work is by calling the command with the operation first and the cipher second, like this: cipherator encode piglatin <text>. So first up I need an encode command.
func TestCmd_EncodeNoFlags(t *testing.T) {
    encPig := cipher.EncoderTypePiglatin.String()

    tests := []struct {
        cipher string
        input  []string
        expect string
        error  bool
    }{
        {encPig, []string{"hello world"}, "ellohay orldway", false},
        {encPig, []string{"hello", " ", "world"}, "ellohay orldway", false},
        {encPig, []string{"h", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"}, "ellohay orldway", false},
        {"nope", []string{"hello world"}, "hello world", true},
    }

    for i, tt := range tests {
        t.Run(
            fmt.Sprintf("test %v cipher %s input %s expect %s error %v", i, tt.cipher, tt.input, tt.expect, tt.error),
            func(t *testing.T) {
                output := bytes.NewBuffer(nil)

                cmd := getEncodeCommand()
                args := append([]string{tt.cipher}, tt.input...)

                cmd.SetArgs(args)
                cmd.SetOutput(output)
                err := cmd.Execute()

                if tt.error {
                    assert.Error(t, err)
                } else {
                    assert.NoError(t, err)
                    assert.Equal(t, tt.expect, output.String())
                }
            },
        )
    }
}
After writing the test, I updated the command:
func getEncodeCommand() *cobra.Command {
    return &cobra.Command{
        Use:   "encode <cipher>",
        Short: encodeShortHelpText,
        Long:  encodeLongHelpText,
        RunE: func(cmd *cobra.Command, args []string) error {
            useCipher := args[0]
            toEncode := strings.Join(args[1:], "")

            c, err := cipher.ParseEncoderType(useCipher)
            if err != nil {
                return fmt.Errorf("unable to parse cipher name: %w", err)
            }

            enc, err := cipher.GetEncoder(c)
            if err != nil {
                return fmt.Errorf("unable to fetch cipher: %w", err)
            }

            output, err := enc.Encode(toEncode)
            if err != nil {
                return fmt.Errorf("unable to encode using the '%v' cipher: %w", c.String(), err)
            }

            _, err = cmd.OutOrStdout().Write([]byte(output))

            return err
        },
    }
}
And it all works! Neato. Can I build the command and run it?
$ go build -o ciph ./cmd
$ ./ciph encode piglatin hello world
Error: 0x5431a0
Apparently not!
Turns out the issue was this bit in main.go:
func main() {
    root := getRootCommand()

    enc := getEncodeCommand()
    setupEncodeCommand(root, enc)

    cobra.CheckErr(root.Execute)
}
That line needed to be this instead:
cobra.CheckErr(root.Execute())

The difference is subtle: the old version passed the Execute method itself to CheckErr, instead of calling Execute and passing along the error it returns – which is why CheckErr printed a function pointer as the “error”.
That done, I can run the command:
$ go build -o ciph ./cmd
$ ./ciph encode piglatin hello world
elloworldhay
Hrmmm. That’s not quite correct, but the fix turns out to be pretty easy. See, the arguments on the command line aren’t passed in as a single string. In other words, this:
$ ./ciph encode piglatin hello world
When the arguments finally end up inside the encode command, what I get is this: []string{"piglatin","hello","world"}.
Is there an easy fix for this?
Well, how about changing this:
toEncode := strings.Join(args[1:], "")
To this:
toEncode := strings.Join(args[1:], " ")
That works!
This means that the tests have to change though – specifically, by changing the tests to be just these three:
{encPig, []string{"hello world"}, "ellohay orldway", false},
{encPig, []string{"hello", " ", "world"}, "ellohay orldway", false},
{"nope", []string{"hello world"}, "hello world", true},
Okay, cool! At this point how am I doing according to the todo list of tasks?
- [x] Write a Go package that can encode English text using the Pig Latin cipher
- [x] Write a command-line tool that uses the library to encode the provided text, whether the text is in a file or provided as arguments
- [ ] Write a web utility that can encode text sent to it
- [ ] Add ROT13 as a cipher
- [ ] Add the ability to decode text that has been encoded
Groovy! Up next, web app thing!
CLI to HTTP #
So now it’s time to put this thing online. Being a web developer who is “Not Great™” at front end stuff at the best of times, this is going to be just an API. I’ll leave building a fancy page with HTML and JavaScript up to you.
So I’ll just be building a little HTTP-based API that has a route to do the encoding. I want the route to be laid out in a similar fashion as the arguments to the CLI tool, so the route is going to look like this:
POST /encode/<cipher>
The text to get encoded will be in the HTTP POST body. In other words, we should be able to do the following:
$ curl -X POST "https://ciphernator.site/encode/piglatin" --data-raw "hello world"
ellohay orldway
Before diving into testing, I’m going to think about how to organize this code. There are kind of two ways I could go about this. One would be to create a ‘web’ or ‘api’ package, and put everything for the web service in there. The other way would be to move what’s currently in cmd into cmd/cli, and then create a cmd/web. In that new folder, I could create the server binary – ie, this is where the main() function for the web server would live.
Because most of what I’m writing is going to be HTTP-related code, the second option feels a bit better to me. This way, all of the packages with a main() are grouped under the cmd folder, instead of being spread all over our repository. The idea of having every main() live inside the same folder within a repository is super pleasing to me; it’s a bit of organization that just feels good, you know?
Okay, so I’m using cmd/web for the HTTP server stuff. Are there any other packages I might need? I’m not sure, but that’s what writing tests is for! Diving in, eventually I end up with this:
func encodeHandler(w http.ResponseWriter, r *http.Request) {
    vars := mux.Vars(r)
    c, ok := vars["cipher"]
    if !ok {
        w.WriteHeader(http.StatusBadRequest)
        return
    }

    et, err := cipher.ParseEncoderType(c)
    if err != nil {
        w.WriteHeader(http.StatusBadRequest)
        return
    }

    enc, err := cipher.GetEncoder(et)
    if err != nil {
        w.WriteHeader(http.StatusInternalServerError)
        return
    }

    text, err := io.ReadAll(r.Body)
    if err != nil {
        w.WriteHeader(http.StatusInternalServerError)
        return
    }

    out, err := enc.Encode(string(text))
    if err != nil {
        w.WriteHeader(http.StatusInternalServerError)
        return
    }

    w.WriteHeader(http.StatusOK)
    _, _ = fmt.Fprint(w, out) // Fprint, not Fprintf – out isn't a format string
}
It works, but it doesn’t quite feel… great. There’s a lot of stuff going on there, and while there’s only one level of indentation with the if statements, it’s still a bit longer than I’d like. Thankfully, because there are unit tests, I can start changing this without having to worry about breaking stuff. So long as the tests keep passing I can keep moving forward with changes!
A little bit more work, and here’s what I’ve got now:
func encodeHandler(w http.ResponseWriter, r *http.Request) {
    defer r.Body.Close()

    enc, err := getRequestEncoder(r)
    if err != nil {
        w.WriteHeader(http.StatusInternalServerError)
        return
    }

    err = encodeRequest(w, r.Body, enc)
    if err != nil {
        w.WriteHeader(http.StatusInternalServerError)
        return
    }

    w.WriteHeader(http.StatusOK)
}

func getRequestEncoder(r *http.Request) (cipher.Encoder, error) {
    vars := mux.Vars(r)
    c, ok := vars["cipher"]
    if !ok {
        return nil, fmt.Errorf("'cipher' not a valid key in request vars")
    }

    et, err := cipher.ParseEncoderType(c)
    if err != nil {
        return nil, fmt.Errorf("unable to parse encoder type: %w", err)
    }

    return cipher.GetEncoder(et)
}

func encodeRequest(w io.Writer, r io.Reader, enc cipher.Encoder) error {
    text, err := io.ReadAll(r)
    if err != nil {
        return fmt.Errorf("unable to read request body: %w", err)
    }

    out, err := enc.Encode(string(text))
    if err != nil {
        return fmt.Errorf("unable to encode input: %w", err)
    }

    _, err = fmt.Fprint(w, out)
    return err
}
I’ve pulled the two actions being taken in the request handler into two separate functions. The first handles getting the cipher encoder, the second handles actually encoding the text we send in our request.
Making Some Improvements #
Let’s take a closer look at encodeRequest though; I think there’s still some work I can do to improve it. The biggest code smell to me is that I’m using io.ReadAll. At the very least I should be wrapping the request body in an io.LimitedReader so that we can try to protect ourselves against someone trying to crash our site by sending a never-ending stream of data. Also, I don’t like that I need to read all the data first, then cast it to a string in order to pass it into Encode. What would be really great is if the encoder worked a little bit more like the json Encoder type.
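Just to show what I mean about capping the body, the io.LimitedReader version is a small change – a sketch assuming an arbitrary 1 MiB limit:

// cap how much of the body we're willing to read; 1<<20 ( 1 MiB )
// is an arbitrary limit picked for illustration
text, err := io.ReadAll(io.LimitReader(r, 1<<20))
if err != nil {
    return fmt.Errorf("unable to read request body: %w", err)
}

( io.LimitReader returns a reader backed by an io.LimitedReader; inside an HTTP handler, http.MaxBytesReader is the same idea with some extra net/http integration. )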
Basically, what I’d like to end up with is something a bit more like this:
func encodeRequest(w io.Writer, r io.Reader, enc cipher.Encoder) error {
    if err := enc.WriteTo(w); err != nil {
        return err
    }

    return enc.ReadFrom(r)
}
Or even better, refactoring some of the other code as well, I could end up with something more like this:
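// purely illustrative – guessing at the shape, with a hypothetical
// cipher.New( <encoder type>, <writer> ) constructor
func encodeHandler(w http.ResponseWriter, r *http.Request) {
    enc, err := cipher.New(cipher.EncoderTypePiglatin, w)
    if err != nil {
        w.WriteHeader(http.StatusInternalServerError)
        return
    }

    if err := enc.Encode(r.Body); err != nil {
        w.WriteHeader(http.StatusInternalServerError)
        return
    }
}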
Or at least something like that. None of that is what I’m eventually going to end up writing; think of it as pseudo-code that just happens to look a lot like Go.
In any case, this will require changing some stuff in the library package. I’ve got to add a new constructor that will accept an io.Writer, and I’ve got to add a method that accepts an io.Reader. On top of that, there’s whatever other changes I’ll need to make so that it can read from that io.Reader and write the encoded text to that io.Writer.
After playing around for a while, I know how I want to proceed with these changes. Unfortunately it’s going to mean re-writing some stuff. If this was a library already being used by other folks I’d look for a different way to do this. However, I’m going to pretend we’re still in the “pre-release” phase of this project, and we haven’t shipped Version 1.0.0 quite yet.
So after doing some tests, reading some docs, and thinking about this for a bit, I think I know how I’m going to proceed.
Keeping The Old Stuff, But Improving It #
First up, for small bits of text it’d be nice to be able to do either of the following:
encoded, err := piglatin.Encode(input)
// OR
encoded, err := cipher.Encode(cipher.EncoderTypePigLatin, input)
If you know you’re only going to be dealing with smaller bits of text, and you only want Pig Latin, you can use the first. If you’re going to be dealing with various ciphers, but still small amounts of text, you can use the second.
Handling io.Writer Instead Of Strings #
Next up, handling io.Writer as an argument. There are two versions of this I’m going to implement. As a side note, from now on I’ll only be showing the piglatin package – there will be a similar function in the cipher package that has an additional argument specifying which cipher to use.
// version 1, write to a provided io.Writer
buf := bytes.NewBuffer(nil)
err := piglatin.EncodeTo(input, buf)

// version 2, create an encoder and then use it later
enc, err := piglatin.New(buf)
err = enc.Encode(input)
The reasons I want to have these three different ways to interact with the encoder are pretty simple. The first is that keeping Encode(string) won’t break previous tests. The second is that while creating and holding on to an encoder makes sense in a web application, it makes a bit less sense in a command line tool that is only “alive” for a (hopefully) short period of time. Anyways!
What I Ended Up With #
After about half an hour of work, I think I’m happy with where the code is at now. Let’s take a look!
First up there’s the changes to the Pig Latin encoder:
func New(wr io.Writer) (*Encoder, error) {
    return &Encoder{wr}, nil
}

func Encode(in string) (string, error) {
    out := bytes.NewBuffer(nil)
    err := EncodeTo(in, out)
    return out.String(), err
}

func EncodeTo(in string, wr io.Writer) error {
    pl, err := New(wr)
    if err != nil {
        return err
    }
    return pl.EncodeFromString(in)
}

func (spl *Encoder) EncodeFromString(in string) error {
    read := strings.NewReader(in)
    return spl.readInto(read, spl.output)
}
Nice! And best of all, this required barely any changes to our other packages – just the change from accepting a string to accepting an io.Reader in Encode.
Take a look at our handler now:
func encodeHandler(w http.ResponseWriter, r *http.Request) {
    enc, err := getRequestEncoder(w, r)
    if err != nil {
        w.WriteHeader(http.StatusInternalServerError)
        return
    }

    err = enc.Encode(r.Body)
    if err != nil {
        w.WriteHeader(http.StatusInternalServerError)
        return
    }

    w.WriteHeader(http.StatusOK)
}
Very nice! Taking another look at our list:
- [x] Write a Go package that can encode English text using the Pig Latin cipher
- [x] Write a command-line tool that uses the library to encode the provided text, whether the text is in a file or provided as arguments
- [x] Write a web utility that can encode text sent to it
- [ ] Add ROT13 as a cipher
- [ ] Add the ability to decode text that has been encoded
We’ve built that web utility, and now it’s time to add another cipher! Let’s see what we need to do to add ROT13!
Adding ROT13 #
Turns out, I don’t need to do much!
First up, our ROT13 function – which I shamelessly borrowed from a StackOverflow answer and then modified a bit:
func rot13(r rune) rune {
    capital := r >= 'A' && r <= 'Z'
    if !capital && (r < 'a' || r > 'z') {
        return r // Not a letter
    }

    r += 13
    if capital && r > 'Z' || !capital && r > 'z' {
        r -= 26
    }
    return r
}
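Since rot13 maps a single rune to a single rune, you can already try it out with strings.Map – this is just a usage sketch, not code from the repo:

// strings.Map applies a rune→rune function to every rune in a string
out := strings.Map(rot13, "Hello, world!")
fmt.Println(out) // Uryyb, jbeyq!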
After that I, uh, went a bit overboard. I’m not going to paste it in here because this post is already somewhat long; instead you can go look at the code on GitHub. I will walk you through it a bit here though.
Basically, I found that in the io package there are two other handy “XReader” interfaces, where “X” can be rune or byte. Because I’m working with runes, my preference would be to just use an io.RuneReader. However, not all io.Readers can be io.RuneReaders. So using some handy functions from the utf8 package I wrote some other encoders that work with bytes. Honestly, I could have just gone with the one that uses the io.Reader, but it was fun coding up the three different ways that these interfaces from the io package can all accomplish the goal of “encode each rune using ROT13”.
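To give a flavour of the io.RuneReader version without pasting the whole file, here’s roughly how such a loop could work – a condensed sketch, not the actual repo code:

// read runes one at a time, ROT13 each one, and write it back out
func rot13FromRunes(rr io.RuneReader, w io.Writer) error {
    buf := make([]byte, utf8.UTFMax)
    for {
        r, _, err := rr.ReadRune()
        if err == io.EOF {
            return nil
        }
        if err != nil {
            return err
        }
        n := utf8.EncodeRune(buf, rot13(r))
        if _, err := w.Write(buf[:n]); err != nil {
            return err
        }
    }
}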
When Not To Improve #
One thing that you may notice is that most of the encode functions look pretty similar to the Pig Latin one. In fact, let’s do a little test: which encoder is the following code from?
// New ...
func New(wr io.Writer) (*Encoder, error) {
	return &Encoder{wr}, nil
}

// Encode ...
func Encode(in string) (string, error) {
	out := bytes.NewBuffer(nil)
	err := EncodeTo(in, out)
	return out.String(), err
}

// EncodeTo ...
func EncodeTo(in string, wr io.Writer) error {
	pl := &Encoder{wr}
	return pl.EncodeFromString(in)
}

// EncodeFromString ...
func (spl *Encoder) EncodeFromString(in string) error {
	read := strings.NewReader(in)
	return spl.Encode(read)
}
Is this a chance to refactor out some common code? It is! However, that’s not the real question to ask right now: should we refactor out this common code?
Well, the func Encode(in string) (string, error) and func EncodeTo(in string, wr io.Writer) error functions can't be refactored away… can they? Could we do something like the following:
func Encode = cipher.Encode
Unfortunately no, we can’t alias functions the way we can with
types
in Go. Those two functions have to stay. We could create a type in the cipher
package that looks something like this:
type BaseEncoder struct {
	encodeFn func(io.Reader) error
}

func (be *BaseEncoder) EncodeFromString(in string) error {
	read := strings.NewReader(in)
	return be.encodeFn(read)
}

func (be *BaseEncoder) Encode(r io.Reader) error {
	return be.encodeFn(r)
}
And then use it in one of our encoders like so (I’m using the ROT13 encoder as an example):
type Encoder struct {
	*cipher.BaseEncoder
	wr io.Writer
}

func New(wr io.Writer) (*Encoder, error) {
	// handle setup
}
Unfortunately, that’s still not going to work. Remember the New(t EncoderType, wr io.Writer) (Encoder, error)
method in the cipher
package? Well, for that
to work we have to import each package that provides an encoder recognized by
our encoder types:
1func New(t EncoderType, wr io.Writer) (Encoder, error) {
2 switch t {
3 case EncoderTypePiglatin:
4 return piglatin.New(wr)
5 case EncoderTypeRot13:
6 return rot13.New(wr)
7 }
8
9 return nil, fmt.Errorf("%v is an unknown encoder type", t.String())
10}
See how I’m calling piglatin.New(wr)
on line 4, and rot13.New(wr)
on line 6?
I’m only able to do that because I’m importing both those packages like so:
import (
	"fmt"
	"io"
	"strings"

	"github.com/seanhagen/cipherator/cipher/piglatin"
	"github.com/seanhagen/cipherator/cipher/rot13"
)
So if you’re not familiar, Go doesn’t allow circular imports. Those are
situations where package A
imports package B
, and package B
imports
package A
. The import chain can be longer than that, too – it doesn’t matter
how many packages sit on the chain between A
and B
; if they both end up
importing each other the compiler tosses out an error.
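As a tiny illustration – with hypothetical packages, not ones from this repo – this layout refuses to compile:

// demo/a/a.go
package a

import "example.com/demo/b"

func Hello() { b.World() }

// demo/b/b.go
package b

import "example.com/demo/a" // compile error: import cycle not allowed

func World() { a.Hello() }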
In order to refactor out the common code, I need a package to put it in. The best place is the cipher package – but we can't import anything from the cipher package inside the piglatin or rot13 packages, because that would cause a circular import and make the compile fail. I could make a "base" encoder package that lives inside the cipher package, but then we'd be committing a naming no-no.
For now I’m just going to leave this. Copying the 18 lines of
code
that make up the first few functions in the rot13
or piglatin
packages isn’t
a big deal for now. Instead, I’m going to move on to the next task.
Decoding What Has Been Encoded #
Back to the list:
- ~~Write a Go package that can encode English text using the Pig Latin cipher~~
- ~~Write a command-line tool that uses the library to encode the provided text, whether the text is in a file or provided as arguments~~
- ~~Write a web utility that can encode text sent to it~~
- ~~Add ROT13 as a cipher~~
- Add the ability to decode text that has been encoded
Looks like the last thing left to do is to decode text.
We’re going to start off with another interface:
type Decoder interface {
	DecodeString(string) error
	Decode(io.Reader) error
}
Pretty much the same as the Encoder interface. I'm also creating a composite interface:
type Handler interface {
	Encoder
	Decoder
}
I’m calling it Handler
for now because I can’t think of a good name at the
moment, and Cipher
creates stutter when you realize it would be used as
cipher.Cipher
.
Next up, a choice: do I update New so that it returns Handler instead of Encoder?
I don’t know why I’m asking you, not only can’t you answer – I’ve already made
up my mind! I am going to make that change, because it won’t break backwards
compatibility. Anything expecting a cipher.Encoder
will still work with a
cipher.Handler
!
After that, time to update the rot13 package so that it can decode! To start off I'm going to create two helper functions similar to the encode helpers: Decode(string) (string, error) and DecodeTo(string, io.Writer) error. After those, I'll add a DecodeString(string) error and a Decode(io.Reader) error method to the rot13.Encoder type.
Decoding ROT13 #
I’m starting with ROT13 because it’s much easier to decode; just apply ROT13 again and you’re golden! That makes writing the decoder super simple:
package rot13

import (
	"bytes"
	"io"
	"strings"
)

// Decode ...
func Decode(in string) (string, error) {
	buf := bytes.NewBuffer(nil)
	err := DecodeTo(in, buf)
	return buf.String(), err
}

// DecodeTo ...
func DecodeTo(in string, wr io.Writer) error {
	rt := Encoder{wr}
	return rt.DecodeString(in)
}

// DecodeString ...
func (e *Encoder) DecodeString(in string) error {
	read := strings.NewReader(in)
	return e.Decode(read)
}

// Decode ...
func (e *Encoder) Decode(r io.Reader) error {
	return e.Encode(r)
}
That’s it, the entirety of the code required to decode ROT13.
But what about…
Decoding Pig Latin #
This is where things get a bit more fun. As a quick refresher, here are the rules for encoding English into Pig Latin:
- If the word begins with a consonant, take the first letter of the word, move it to the end, and then add "ay"
- If the word begins with a vowel, just add "way" to the end of the word
What does this produce for various inputs?
Input | Output |
---|---|
hello | ellohay |
eat | eatway |
by | ybay |
at | atway |
world | orldway |
apples | applesway |
I | Iway |
a | away |
way | ayway |
Well, the first thing we find out is that the shortest possible “word” in Pig Latin is four characters. The only single-letter words in the English language we care about are “I” and “a”; both are vowels and would have the full “-way” suffix attached when they get encoded. But do we need to care about the length of the token we’re processing if it’s a word and not a symbol? Not really; the rules to encode have nothing to do with the length of the word, just if it starts with a vowel or not.
Now, I could add a rule that handles “I” and “a” by saying “if the encoded text is four letters long, remove the last three”. Or I could just figure out the “proper” rule for decoding words that start with a vowel; that will work regardless of whether the word is “I”, “eat”, or “electroencephalographers”.
So how do we decode words that start with a vowel, like turning “eatway” into “eat” – and how do we differentiate between a word that starts with a vowel and one that starts with a ‘w’?
Well, turns out this is where I run into the first big challenge of this project.
Pig Latin and Word Collisions #
To sum this problem up, let's take a look at two words: eight and weight.
Let’s start by running both through a Pig Latin translator to see what they turn
into. First up is eight
; it starts with a vowel so we just add “way” to the
end and get eightway
. Next is weight; it starts with a consonant, so we move
that to the end and add “ay”, this gives us… eightway
.
Turns out there are other English words that translate into the same encoded Pig Latin word, too. So what does this mean for our ability to decode Pig Latin?
Well… hrmm. Before I make a decision, I want to figure something out.
Basically, I’d like to figure out rougly how many words in the English language
encode to the same Pig Latin word. I’ve already figured out a few; eight
&
weight
become eightway
, arm
and warm
both become armway
.
First up, I need a big-ass list of as many English words as I can find. Instead of just downloading a single list and using that, I decided to combine a few sources. The first was a word list I downloaded from wordgamedictionary.com, the second was a list I downloaded from this web page, and the last I scraped from bestwordlist.com with a quick Go script.
Then I used some command line tools to combine the lists, filter out duplicates, and remove "bad" words. In this case I don't care about swear words, but each list has words that aren't really words. For example, one "word" I found was hexaenoic␣acid, another was Heywood␣Jablome9. Another one I found was hey␣rube hey␣Rube, which seems to be how the data source handled alternate versions of a single word – have them on the same "line" with a space separating them.
So first up, split words on spaces, putting the split parts on new lines. For example hey␣rube hey␣Rube would become two lines: hey␣rube and hey␣Rube. Then I looped over the list again and further split words on characters like ␣. Then I downcased everything, sorted the list, and removed any duplicates.
Next, I wrote a little script that would read each word from our new "prime word list" and encode it. It would use the encoded word as a key into a map[string][]string, and put the original word in the slice. This way I end up with a map of each Pig Latin encoded word and all the words that encode to it. Then I printed out each encoded word that had more than one English base word and removed the obviously silly words like "winwards".
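The heart of that script looks something like this ( a sketch using the piglatin.Encode helper from earlier; the real script also handled the junk-word filtering ):

// findCollisions encodes every word and groups the originals by their
// encoded form, keeping only the groups with more than one source word
func findCollisions(words []string) map[string][]string {
	coll := map[string][]string{}
	for _, w := range words {
		enc, err := piglatin.Encode(w)
		if err != nil {
			continue // skip anything that won't encode
		}
		coll[enc] = append(coll[enc], w)
	}
	for enc, sources := range coll {
		if len(sources) < 2 {
			delete(coll, enc)
		}
	}
	return coll
}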
What did I find out?
Well, for one, at least one of these word lists came from scraping the web. How else did I end up with “winwards” and “wambulances”? I also found “okas wokas” which sounds like a line from a rage comic.
Anyways, what I ended up with was a list of 94 Pig Latin words that could be produced by more than one English word when the suffix is -way. There were a few commonalities between these words. Here, check out some examples and see if you can spot how these words relate to each other:
A | B | Pig Latin |
---|---|---|
artless | wartless | artlessway |
and | wand | andway |
easel | weasel | easelway |
ebbed | webbed | ebbedway |
in | win | inway |
ok | wok | okway |
orks | works | orksway |
In each case word B can be created by taking word A and slapping a 'w' on the front. Which makes sense, right? Words that start with a vowel get "way" thrown on the end as the suffix, and words that start with 'w' also end up ending in 'way' – the moved 'w' plus the '-ay' suffix. For every letter other than 'w', this is fine because the end of the word is <letter>-ay, like 'day', 'say', 'fay', 'pay', etc.
Okay, And? #
This doesn’t help me figure out a way to see when we’ve run into one of these
“ambiguous” Pig Latin words, though. Starting from a word like axesway
there
is still no way to tell if the original was axes
or waxes
.
Now, I could change how I encode Pig Latin. For example, I could change the code so that the suffix includes a dash – but it gets attached after the consonant has been moved. That way a word like ok becomes ok-way and wok becomes okw-ay. While this would make decoding super simple, I'm not a fan. Mostly because I kind of want the encoding to be hard to spot; seeing -ay and -way all over the encoded text would make it pretty easy to figure out how to decode it.
Are there other ways I could change how English gets encoded to Pig Latin? Well, looking at the Wikipedia page on Pig Latin it turns out our rules are one of a few variations on how to encode Pig Latin; maybe one of the other versions will work better?
For example, one alternative is to use a different suffix for words that start with a vowel. I’m using ‘way’ as the suffix, alternates are ‘yay’ or ‘hay’. Unfortunately, those will probably have about the same size collision space – or worse. I can’t tell until I produce some output though, so give me another hour and I’ll see how ‘yay’ and ‘hay’ perform compared to ‘way’.
After putting in some work, I’ve got the collisions I care about for each of these suffixes: ‘-hay’, ‘-way’, and ‘-yay’. They’re a bit long, so I threw them up in a gist if you want to see them. It turns out there are differences in how many collisions each suffix produces. ‘-way’ comes in first with 94 collisions, ‘-hay’ is next with 78, and ‘-yay’ is last with… 28!
I could probably even remove a few from the lists; I doubt many people are using 'yoctograms' or 'yorkish' that often. After removing a few more less likely words, I end up with these final totals: '-way' has dropped to 59, '-hay' dropped to 37, and '-yay' dropped to 15!
Does this help me make a decision on what to do about decoding Pig Latin? Well, not really. I did a bit more playing around while I was doing this, and found out there are a few suffixes that produce NO collisions, but they're suffixes like -aay or -eay. When it comes to making a decision on how to proceed, there are still two things I need to figure out: how accurate I want the decoding to be, and how much I want to stick to the original rules of the challenge.
Can I Get Accurate Decoding? #
A better question might be this: do I want perfect decoding or do I want best-effort decoding?
For an example of what “best-effort” could look like, take this sentence:
Our earnings perked up ears, the contract was just inked, hold on until our axes are ready.
In the table below you can see the words in the sentence that have ambiguous decodings for the suffixes -way, -hay, and -yay. I'm going to put an 'x' where the word doesn't have an ambiguous way to decode.
Original | -way | -hay | -yay |
---|---|---|---|
Our | x | hour, our | our, your |
earnings | x | x | earnings, yearnings |
perked | x | x | x |
up | x | x | x |
ears | ears, years | ears, hears | x |
the | x | x | x |
contract | x | x | x |
was | x | x | x |
just | x | x | x |
inked | inked, winked | x | x |
hold | x | hold, old | x |
on | x | x | x |
until | x | x | x |
our | x | hour, our | our, your |
axes | axes, waxes | x | x |
are | x | x | x |
ready | x | x | x |
What I could do is simply make it clear that a word has multiple ways it can get decoded. For example, if ‘-yay’ is the suffix, this could be how the text gets decoded:
[Our|Your] [earnings|yearnings] perked up ears, the contract was just inked, hold on until [our|your] axes are ready.
That would be an example of “best-effort” decoding.
Reach For Perfection! #
But what if I want “better-effort” decoding? I’m never going to get perfect decoding; if you take some random Pig Latin from elsewhere on the internet this code probably won’t be able to handle it unless it follows the exact same rules my code does. Even then, there will always be Pig Latin words that can be ambiguously decoded into an English word.
However, what about the encoded text produced by this encoder? Is there a way we can make that text perfectly decodable? And can we do it without changing how the text gets displayed? In other words, with no dashes or visible special characters?
Well, we’re using UTF-8 for our text – because that’s how Go encodes all strings. And the REALLY neat thing about UTF-8 ( or rather, Unicode ) is that it contains something called “non-printable characters”. You’re already familiar with these, in fact, you’re staring at too many to count right this very second! That’s because stuff like a space, tabs, and newlines are all “non-printable” characters. I can’t use a space, tab, or newline though – so what’s the big deal? Well, there are many other non-printable characters I can use!
Like, for example, codepoint U+200C – aka the “zero width non-joiner” character.
This is what our sentence looks like when it’s been encoded with the “zero width non-joiner”:
Ourway earningsway erkedpay upway earsway, hetay ontractcay asway ustjay inkedway, oldhay onway untilway ourway axesway areway eadyray.
If you copy that, head over to this handy site, paste the text into the textbox, and hit "Show me the characters", you should get something like this:
See all those U+200C? That's the zero-width character. Ignore the '·' characters; that's just what the site replaces spaces with so you can see each individual space.
Getting Tricky With It #
Okay, so we’ve got a character we can put into the encoded text in order to… what? How does “armway” with a zero-width character help us know if it should be decoded to “arm” or “warm”?
What if we used it to mark words that originally started with a vowel? For example, take the word inkway – it could be either ink or wink. But if I only put the zero-width character before the w when the original word was ink, then I'd be able to tell it's supposed to decode to ink and not wink!
Another way is I could just put it in front of the suffix. If we pretend _ is the zero-width character, then ink becomes ink_way, and wink becomes inkw_ay. I like this much better, because it opens up the ability to let the user choose which suffix they want when they encode. Or I could even let the user provide their own! We're not going to do that today though, maybe another day.
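Swapping the _ placeholder for the real U+200C, the two markings look like this as Go string literals ( just a sketch, not the repo's code ):

// the zero-width non-joiner as a string, for easy concatenation
const zeroWidth = "\u200c"

var (
	encodedInk  = "ink" + zeroWidth + "way" // ink_way, decodes to "ink"
	encodedWink = "inkw" + zeroWidth + "ay" // inkw_ay, decodes to "wink"
)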
Of course, if the word doesn’t have the zero-width character I’d have to fall back to the “best effort” decoding. I’m fine with this though, as it makes the decoder a bit more useful.
So, of the two options I came up with I like the second one the most. Doing it that way means a word without a zero-width character should be processed using the best-effort decoding, otherwise we get really accurate decoding!
Neat.
But I’m Lazy #
This is a neat idea; but now that I’ve got the idea of letting the user choose a suffix when encoding, I want to make sure the code doesn’t prevent me from doing that in the future.
That means one of two things: either I figure out a way to reliably determine when a word is ambiguous, or I lock down the suffixes to a hard-coded list – and also hard-code in ambiguous words. I'm really not a fan of the second one. So let's see if there's an easy-ish way to tell if a word is ambiguous. Heads up: this doesn't really go anywhere, but I'm leaving it in10.
Here's a sample of some of the ambiguous words I've found:
Pig Latin | Suffix | Possible Decodings |
---|---|---|
axesway | -way | axes, waxes |
orksway | -way | orks, works |
eastyay | -yay | east, yeast |
ouryay | -yay | our, your |
armedhay | -hay | armed, harmed |
eaterhay | -hay | eater, heater |
What about some words that aren’t ambiguous?
Pig Latin | Suffix | Decodes To |
---|---|---|
ellohay | -[w]ay | hello |
ellohay | -[y]ay | hello |
ellohay | -[h]ay | hello |
eatway | -way | eat |
eatyay | -yay | eat |
eathay | -hay | eat |
The suffix for the first three has the first letter in [] brackets because hello starts with a consonant, meaning the suffix is always -ay.
What can I figure out using these two pieces of info?
Let’s see:
- Words that start with a consonant don’t expose the full three-letter suffix
Well that’s not much to go on. What if I take a closer look at both ellohay
and eatway
, and see how they would decode under various conditions?
First up, what happens if the code assumes the first letter of the original word was a consonant? That means we remove the ay, and move the letter at the new end of the word to the front:
ellohay -> ello [h] -ay -> [h] ello = hello
eatway -> eat [w] -ay -> [w] eat = weat
I’ve been using little diagrams like these as a kind of manual encoding/decoding
process, but it feels like I might as well share them with you. The letter in
the []
brackets is the letter getting moved, and the -ay
is the suffix we’re
removing. The ->
separate each step, and the word after =
is the
output. Here’s each step of decoding ellohay
as a list:
- Start with ellohay
- Remove the -ay suffix; the last letter (h) is getting moved
- Move the h to the front of the word
- Output hello
Anyways, we can see "assume the first letter of the word was a consonant" works fine with hello and breaks with eat. This makes sense, because eat doesn't start with a consonant!
What about if we assume the word started with a vowel? In that case, the last three letters of the word are all suffix, so remove them:
ellohay -> ello -hay -> ello
eatway -> eat -way -> eat
Well, that’s not much better. What if we specify what the suffix is? For
example, if we specify the suffix is -way
:
ellohay -> elloh -ay -> ello[h] -> [h] ello = hello
eatway -> eat -way = eat
Well that’s handy. Turns out specifying the suffix makes it super easy. Or… does it?
What about one of our collision words, like ouryay? Let's say we know the suffix is -yay; how does the word decode?
ouryay -> our -yay = our
But how would we decode ouryay to get "your"? See what happens when we encode "your":
your -> [y]our -> our [y] +{y}ay -> our + y + ay = ouryay
Well shoot.
And if you think about it, this makes sense. Think back to the ROT13 encoding and decoding process. That worked out because the English alphabet is 26 characters. If you put the letters around a circle so that moving forward one from z lands you on a, adding 13 to any letter twice gets you back to that letter!
What if we wanted a different rotation scheme? Like ROT7, or ROT15? I’m going to call these ROTN, where N is the number of steps to take when encoding or decoding. And how does this relate to decoding Pig Latin?
Let’s take another look at the rule for encoding something using ROT13:
- When encoding, move 13 characters forward in the alphabet, wrapping around if you reach the end of the alphabet.
And decoding is the same, except now you move backwards.
For ROT13, encoding and decoding are the same thing, because half of 26 is 13 – you'll always end up where you started if you encode the same text twice. But for ROT7, encoding and decoding are different. Encoding something with ROT7 twice is the same as encoding it once with ROT14 – which, unlike double ROT13, doesn't get you back where you started. However, encoding and decoding for ROTN are technically the same rule, with one thing changing depending on whether you want to encode or decode. Here, check this out:
- When [encoding/decoding], move n characters [forward/backward] in the alphabet, wrapping around if you reach the [end/start] of the alphabet.
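Written as Go, that parameterized rule is tiny. Here's a sketch of a generalized rotN ( not something in the repo ), following the same ASCII-only rules as the rot13 function from earlier:

// rotN builds a rune function that rotates letters forward by n;
// to decode, rotate forward by 26-n instead of moving backward
func rotN(n int) func(rune) rune {
	return func(r rune) rune {
		switch {
		case r >= 'A' && r <= 'Z':
			return 'A' + (r-'A'+rune(n))%26
		case r >= 'a' && r <= 'z':
			return 'a' + (r-'a'+rune(n))%26
		}
		return r
	}
}

So strings.Map(rotN(7), text) would encode with ROT7, and strings.Map(rotN(19), encoded) would decode it, since 7 + 19 = 26.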
How does this relate to Pig Latin? Well, see how for ROTN there are really only two questions: are you encoding or decoding, and what's n? For us this means that inside a function that handles decoding, all we need to produce the correct output is n.
But with Pig Latin, there’s a big difference. Let’s take a look at the rules for encoding Pig Latin again real quick:
- If the word begins with a consonant, take the first letter of the word, move it to the end, and then add "ay"
- If the word begins with a vowel, just add "way" to the end of the word
So not only do we have two different rules, they apply to words, and not individual letters! For each word you need to know the suffix it was encoded with, and if the word originally started with a vowel or not. In other words, there is no way to decode some Pig Latin words unless you know both if they originally started with a vowel or not, and the suffix used to encode the word!
Is there any way around this? Nope.
Sometimes You Have To Compromise #
Well, I was really hoping I could figure out a way to decode any Pig Latin. Looks like I’m going to have to go with a compromise.
Here’s how the library is going to encode text, assuming the suffix is
configured to be -way
.
- when adding the suffix to the end of a word, a zero-width character will be
inserted into the encoded text that marks where the suffix begins;
-ay
for words starting in a consonant,-way
for words starting in a vowel
As for decoding:
1. The code first searches the word for the zero-width character. If it exists, move to step 2; otherwise move to step 3.
2. If the number of characters after the zero-width character is 2, the word originally started with a consonant; remove the last two characters, then move the new last character to the front of the word. If there were 3 characters after the zero-width character, the word started with a vowel; remove the last three characters.
3. Take the last three characters of the word. If they don't match the suffix -way, the word originally started with a consonant. If they do match, remove the suffix and mark the word as ambiguous by putting a [w] at the front of the word.
So what does this mean? Well, for encoding only one thing needs to change: inserting the zero-width character at the right point. What about decoding? If the text being decoded is from this library it should work fine.
What about text not generated by this library? In that case it'll do the best it can, and mark words where it wasn't sure if the word originally started with a vowel or a consonant. For example, if the suffix is -way and it's decoding andway, the output will be [w]and because the code can't tell whether the word originally started with a vowel or not.
In other words, any word whose first letter matches the first letter of the suffix and whose second letter is a vowel is potentially ambiguous. It's not perfect, but it's as close as we can get without refusing to process text not created by this library.
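That rule fits in a one-line predicate; a sketch ( hypothetical helper, not the repo's code ):

import "strings"

// maybeAmbiguous reports whether a best-guess decoded word could also be
// read with its first letter as part of the original word, e.g. "wand"
// vs "and" when the suffix is "way"
func maybeAmbiguous(word, suffix []rune) bool {
	return len(word) > 1 && len(suffix) > 0 &&
		word[0] == suffix[0] && strings.ContainsRune("aeiou", word[1])
}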
Providing The Suffix? #
I could extend the code so that the encode/decode functions require a kind of ‘key’; the number of steps to move with ROTN, or the suffix for Pig Latin. However, this post is already well over ten thousand words so I’m going to work on wrapping things up now.
That means just implementing this zero-width-character-based decoding with the fallback, and then wrapping this up.
So let’s do that!
Wrapping Up By Decoding Our Way To Victory! #
Right, so what does all this mean for what we’ve got already?
How does this change the encoding process, and how does this change our decoding process?
For encoding, not much has to change. The tests have to be updated first so that they’re expecting the zero-width character, then the code gets updated so that the tests pass again. For the Pig Latin encoder, it means the test table gets updated to look like this:
	tests := []struct {
		input, expect string
	}{
		{"hello", "elloh\u200cay"},
		{"eat", "eat\u200cway"},
		{"by", "yb\u200cay"},
		{"you", "ouy\u200cay"},
		{"at", "at\u200cway"},
		{"to", "ot\u200cay"},
		{"world", "orldw\u200cay"},
		{"apples", "apples\u200cway"},
		{"hello world", "elloh\u200cay orldw\u200cay"},
		{"Hello world", "Elloh\u200cay orldw\u200cay"},
		{"Hello, world!", "Elloh\u200cay, orldw\u200cay!"},
		{"I", "I\u200cway"},
	}
That \u200c is our zero-width non-joining character. It has to be put in because the UTF-8 character has to be in the expected output, even if we can't see it. That's because the code isn't testing whether the visible text matches what we see, it's testing whether A equals A. That includes the length of the string, which is different when you include the zero-width character.
Decoding hasn’t been written yet for Pig Latin, so we’re not so much “changing”
as simply “writing”. Starting with a simple case: the suffix is -way
and I’m
decoding ellohay
:
	tests := []struct {
		input, expect string
	}{
		{"ellohay", "hello"},
	}
In order to get the tests to stop complaining about missing functions and instead complain the code isn’t doing the right thing, I quickly fill in the decoding functions:
// Decode ...
func Decode(in string) (string, error) {
	buf := bytes.NewBuffer(nil)
	err := DecodeTo(in, buf)
	return buf.String(), err
}

// DecodeTo ...
func DecodeTo(in string, wr io.Writer) error {
	rt := Encoder{wr}
	return rt.DecodeString(in)
}

// DecodeString ...
func (e *Encoder) DecodeString(in string) error {
	read := strings.NewReader(in)
	return e.Decode(read)
}

// Decode ...
func (e *Encoder) Decode(r io.Reader) error {
	return nil
}
Nice and neat; I’ve only got one function to get working! After going through a bunch of red-green-refactor, here’s what I’ve got:
// Decode ...
func (e *Encoder) Decode(r io.Reader) error {
	scan := scanner.Scanner{}
	scan.Init(r)
	scan.Filename = "decoding"
	scan.Whitespace ^= 1<<'\t' | 1<<' '
	// tell the scanner to treat the zero width rune as part of a token, and not a separator
	scan.IsIdentRune = e.scannerIsIdentRune

	// start decoding tokens
	return e.scanTokens(scan, e.decodeToken)
}

// scanTokens ...
func (e *Encoder) scanTokens(scan scanner.Scanner, process func(string) error) error {
	for {
		ch := scan.Scan()
		if ch == scanner.EOF {
			break
		}

		if err := process(scan.TokenText()); err != nil {
			return err
		}
	}
	return nil
}

// scannerIsIdentRune ...
func (e *Encoder) scannerIsIdentRune(ch rune, i int) bool {
	if i <= 1 {
		// no numbers in the first two characters, or everything will probably explode
		return (ch == zeroWidth || unicode.IsLetter(ch)) && !unicode.IsDigit(ch)
	}
	return ch == zeroWidth || unicode.IsLetter(ch)
}

// decodeToken ...
func (e *Encoder) decodeToken(token string) error {
	if len(token) == 0 {
		return nil
	}

	word, suffix, hasZW := e.splitToken(token)

	if hasZW {
		return e.decodePerfect(word, suffix)
	}
	return e.decodeBestGuess(word)
}

// splitToken ...
func (e *Encoder) splitToken(token string) (word, suffix []rune, hasZW bool) {
	// split our token into the runes making up the word, and the runes making up the suffix
	for _, r := range token {
		if e.isZeroWidth(r) {
			hasZW = true
			continue
		}

		if !hasZW {
			word = append(word, r)
		} else {
			suffix = append(suffix, r)
		}
	}
	return
}

// decodeBestGuess ...
func (e *Encoder) decodeBestGuess(word []rune) error {
	wl := len(word)
	suffix := word[wl-3:]
	word = word[:wl-3]

	if suffix[0] == e.defaultSuffix[0] {
		word = append([]rune{'[', suffix[0], ']'}, word...)
	} else {
		word = append([]rune{suffix[0]}, word...)
	}

	return e.writeRunes(word)
}

// decodePerfect ...
func (e *Encoder) decodePerfect(word, suffix []rune) error {
	if len(suffix) == 3 {
		return e.writeRunes(word)
	}

	l := len(word)
	last := word[l-1]
	word = append([]rune{last}, word[:l-1]...)

	return e.writeRunes(word)
}

// suffixForWordStartingWithVowel ...
func (e *Encoder) suffixForWordStartingWithVowel(suffix []rune) bool {
	return len(suffix) == 3
}

// writeRunes ...
func (e *Encoder) writeRunes(input []rune) error {
	b := strings.Builder{}
	for _, r := range input {
		if _, err := b.WriteRune(r); err != nil {
			return fmt.Errorf("unable to write rune to output string: %w", err)
		}
	}

	_, err := io.WriteString(e.output, b.String())
	return err
}
Very cool, and the best part is that our test suite is all passing:
tests := []struct {
input, expect string
}{
{"ellohay", "hello"},
{"orldway", "[w]orld"},
{"orldw\u200Cay", "world"},
{"andway", "[w]and"},
{"andw\u200Cay", "wand"},
{"and\u200Cway", "and"},
}
--- PASS: TestPigLatinDecoding (0.00s)
--- PASS: TestPigLatinDecoding/Decode(string) (0.00s)
--- PASS: TestPigLatinDecoding/Decode(string)/test_0_decode_'ellohay'_to_'hello' (0.00s)
--- PASS: TestPigLatinDecoding/Decode(string)/test_1_decode_'orldway'_to_'[w]orld' (0.00s)
--- PASS: TestPigLatinDecoding/Decode(string)/test_2_decode_'orldw\u200cay'_to_'world' (0.00s)
--- PASS: TestPigLatinDecoding/Decode(string)/test_3_decode_'andway'_to_'[w]and' (0.00s)
--- PASS: TestPigLatinDecoding/Decode(string)/test_4_decode_'andw\u200cay'_to_'wand' (0.00s)
--- PASS: TestPigLatinDecoding/Decode(string)/test_5_decode_'and\u200cway'_to_'and' (0.00s)
Wrapping Up #
Okay, time to start wrapping things up. What is there still left to do?
- implement reading from a file in the command line tool
- implement writing output to a file in the command line tool
- implement decoding in the command line tool
- implement decoding in the web app
- add some way to limit how much data the web app will accept
- do some cleaning up & further refactoring
Let’s start tackling these, one by one.
CLI: Read From File #
Starting off with this test:
func TestCmd_EncodeReadFromFile(t *testing.T) {
	tests := []struct {
		id      int
		cipher  string
		success bool
	}{
		{1, "piglatin", true},
		{1, "rot13", false},
	}

	for i, tt := range tests {
		t.Run(fmt.Sprintf("test %v encode file %v with cipher %v", i, tt.id, tt.cipher), func(t *testing.T) {
			output := bytes.NewBuffer(nil)
			inputFile, outputFile := getInputAndOutputFiles(tt.id)

			expect, err := os.Open(outputFile)
			require.NoError(t, err)

			cmd := getEncodeCommand()
			args := []string{tt.cipher, "-f", inputFile}
			cmd.SetArgs(args)
			cmd.SetOutput(output)
			err = cmd.Execute()

			if tt.success {
				assert.NoError(t, err)
				assert.Equal(t, expect, output)
			} else {
				assert.Error(t, err)
			}
		})
	}
}

func getInputAndOutputFiles(id int) (string, string) {
	in := fmt.Sprintf("./testdata/%v-input.txt", id)
	out := fmt.Sprintf("./testdata/%v-output.txt", id)
	return in, out
}
This is pretty similar to the test I already had in cmd/cli/encode_test.go ( now renamed to TestCmd_EncodeReadFromStdin ); the main difference is how the arguments are constructed. Previously, the arguments were the words in a sentence that I wanted to encode. Now I'm mimicking someone writing the following:
$ cipherator encode piglatin -f input.txt
If that file contains “hello world”, we should get the output “ellohay orldway” – with the Unicode zero-width non-joining character in there, of course!
While working on this part, I run into a fun issue: newlines! Check out what happens when I add the test case {"hello\n", "elloh\u200cay\n"} to our test table:
=== RUN TestPigLatinEncoding/Encode(string)/test_12_encode_'hello_'_to_'elloh\u200cay_'
piglatin_test.go:49:
Error Trace: /home/sean/Code/Go/src/github.com/seanhagen/cipherator/cipher/piglatin/piglatin_test.go:49
Error: Not equal:
expected: "elloh\u200cay\n"
actual : "elloh\u200cay"
Diff:
--- Expected
+++ Actual
@@ -1,2 +1 @@
ellohay
-
Test: TestPigLatinEncoding/Encode(string)/test_12_encode_'hello_'_to_'elloh\u200cay_'
--- FAIL: TestPigLatinEncoding (0.00s)
What?
Turns out, that’s another configuration thing the scanner needs to be told
about. Basically, we need to tell it that newline characters ( these: \n
) are
tokens too. Pretty easy fix:
scan.Whitespace ^= 1<<'\t' | 1<<' ' | 1<<'\n'
But hold on – this isn’t the only place we set up a scanner, is it? Nope! We also set one up when decoding as well.
Why do I bring this up? Well, to point out another thing refactoring & tests are good for. First off, I wouldn’t have caught this without tests until someone brought it to my attention. While I try to pay attention to details, stuff like a string missing a newline are easy to miss. Secondly, this is one of those things where I’m trying to train myself to ask good questions. Asking if there are other scanners where we might need to make the same change is one of those questions.
However, the answer isn’t to just go over and make the same change there. Rather, the answer is to take a step back and see if it’s worth refactoring out the scanner setup into a function. That way there’s only one place that needs to change when I find bugs like that.
Here’s the places I set up a scanner in the piglatin
package:
// cipher/piglatin/piglatin.go
func (e *Encoder) encodeReaderIntoWriter(r io.Reader, w io.Writer) error {
	// set up the scanner
	scan := scanner.Scanner{}
	scan.Init(r)
	scan.Filename = "encoding"
	// include spaces, tabs, and newlines as 'tokens'
	scan.Whitespace ^= 1<<'\t' | 1<<' ' | 1<<'\n'

	return e.scanTokens(scan, e.encodeToken)
}

// cipher/piglatin/decode.go
// Decode ...
func (e *Encoder) Decode(r io.Reader) error {
	scan := scanner.Scanner{}
	scan.Init(r)
	scan.Filename = "decoding"
	scan.Whitespace ^= 1<<'\t' | 1<<' '
	// tell the scanner to treat the zero width rune as part of a token, and not a separator
	scan.IsIdentRune = e.scannerIsIdentRune

	// start decoding tokens
	return e.scanTokens(scan, e.decodeToken)
}
By refactoring the common code out into this:
// getScanner ...
func (e *Encoder) getScanner(r io.Reader) scanner.Scanner {
	scan := scanner.Scanner{}
	scan.Init(r)
	scan.Filename = "piglatin"

	// include spaces, tabs, and newlines as characters to include when scanning
	scan.Whitespace ^= 1<<'\t' | 1<<' ' | 1<<'\n'

	// tell the scanner to treat the zero width rune as part of a token, and not a separator
	scan.IsIdentRune = e.scannerIsIdentRune
	return scan
}
I can ensure that future me ( or anybody else coming to look at this code ) won't have to worry about missing a place where the scanner in the piglatin package gets created.
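For completeness, here's roughly what each call site shrinks to after the refactor ( a sketch; Decode shown, encodeReaderIntoWriter looks much the same ):

// Decode ...
func (e *Encoder) Decode(r io.Reader) error {
	return e.scanTokens(e.getScanner(r), e.decodeToken)
}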
Anyways, reading from a file works! On to the next task!
CLI: Output To File #
The process for this part is pretty much the same as for reading from a file: write a new test that specifies an output file using a flag, then checks that the output file contains what I expect.
Besides getting caught tracking down a silly bug for ten minutes, this goes pretty smoothly. Because the code is pretty similar to the “read from file”, I’m not going to paste it in here. If you’re really curious you can take a look at the repository.
CLI: ROT13 & Decoding #
This part should be very straightforward. The tests already take into account how to specify a cipher:
func TestCmd_EncodeReadFromStdin(t *testing.T) {
	encPig := cipher.EncoderTypePiglatin.String()

	tests := []struct {
		cipher string
		input  []string
		expect string
		error  bool
	}{
		{encPig, []string{"hello world"}, "ellohay orldway", false},
		{encPig, []string{"hello", " ", "world"}, "ellohay orldway", false},
		{"nope", []string{"hello world"}, "hello world", true},
	}

	//... code
}

func TestCmd_EncodeReadFromFile(t *testing.T) {
	tests := []struct {
		id      int
		cipher  string
		success bool
	}{
		{1, "piglatin", true},
		{1, "rot13", false},
	}
	//... code
}

func TestCmd_EncodeWriteOutputToFile(t *testing.T) {
	tests := []struct {
		cipher string
		input  string
		expect string
	}{
		{"piglatin", "hello", "elloh\u200cay"},
	}
	//... code
}
So to see if ROT13 already works, I can just update those tests to have some ROT13 test cases.
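For example, the stdin test picks up rows like these ( a sketch matching the test output below, with a hypothetical encRot := cipher.EncoderTypeRot13.String() alongside encPig ):

tests := []struct {
	cipher string
	input  []string
	expect string
	error  bool
}{
	// ...existing rows...
	{encRot, []string{"hello"}, "uryyb", false},
	{encRot, []string{"hello world"}, "uryyb jbeyq", false},
}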
And what do you know, it works:
--- PASS: TestCmd_EncodeReadFromStdin (0.00s)
--- PASS: TestCmd_EncodeReadFromStdin/test_0_cipher_piglatin_input_[hello_world]_expect_elloh\u200cay_orldw\u200cay_error_false (0.00s)
--- PASS: TestCmd_EncodeReadFromStdin/test_1_cipher_piglatin_input_[hello___world]_expect_elloh\u200cay___orldw\u200cay_error_false (0.00s)
--- PASS: TestCmd_EncodeReadFromStdin/test_2_cipher_nope_input_[hello_world]_expect_hello_world_error_true (0.00s)
--- PASS: TestCmd_EncodeReadFromStdin/test_3_cipher_rot13_input_[hello]_expect_uryyb_error_false (0.00s)
--- PASS: TestCmd_EncodeReadFromStdin/test_4_cipher_rot13_input_[hello_world]_expect_uryyb_jbeyq_error_false (0.00s)
PASS
ok github.com/seanhagen/cipherator/cmd/cli 0.004s
Awesome! Of course, it took more than just adding a few more test cases. There were a few things to fix and some code that had to be written, but all told the number of lines I added or tweaked was less than ten. Plus, the tests guided me right to where I needed to make the required changes!
If I’m sounding like an evangalist for TDD: good 😄
Anyways, this part is done, on to decoding in the web app!
WEB: Decoding #
While I was here, I re-wrote the handlers so that there's only one, and the path looks like this: /{operation}/{cipher}.
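Wiring that up is a single route registration; a sketch ( hypothetical newRouter helper, assuming gorilla/mux, which the handler code already uses via mux.Vars ):

func newRouter() *mux.Router {
	// one route covers every operation/cipher combination
	r := mux.NewRouter()
	r.HandleFunc("/{operation}/{cipher}", cipherRouteHandler)
	return r
}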
The handler looks like this now:
func cipherRouteHandler(w http.ResponseWriter, r *http.Request) {
	enc, err := getRequestEncoder(w, r)
	if returnIfErrorToHandle(w, err) {
		return
	}

	err = processRequest(r, enc)
	if returnIfErrorToHandle(w, err) {
		return
	}

	w.WriteHeader(http.StatusOK)
}
Isn’t that just so nice and pretty? So easy to read, and understand? There’s
probably a better name for the returnIfErrorToHandle
function, but other than
that I think it looks pretty great.
WEB: Limit Max Body Size That Will Be Read #
Right! So this is something I mentioned possibly doing earlier, and here’s where I actually handle implementing this.
So you might be wondering why I want to do this at all. Well, the truth is I don't really need to; this is just a coding challenge at the end of the day. But I learned of a way to do this recently and I want to make sure I understand how to test it.
The rationale behind wanting to do this is that it’s an area that was a bit of a blind spot for me until recently. When writing stuff for the web, one of the things that needs to be kept in mind is that whatever you’re writing is probably going to be accessible from the public internet. That means you need to handle some basic precautions.
In this case, I want to limit how much data someone can send before the server goes “alright, that’s enough data, I’m not accepting any more”.
So what am I going to use to test this? How about the full text of Shakespeare’s King Lear? How big is that?
Turns out: 156KB. Pretty tiny.
Let's be generous though, and say our service will accept data up to 1MB ( or 1024KB ) in size. That's six full copies of King Lear!
Next: how do we get our handler to set that limit?
Of course there’s a package we can use to do this! I was going to use
io.LimitedReader, but then I found
limitio
, a package from
nanmu42. It does
basically the same thing, but with one main difference: it can be configured to
return an error other than EOF when it hits the size limit. This is handy,
because it lets us know if the encoding finished, failed because of a real
error, or hit the size limit.
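If you'd rather stay in the standard library, http.MaxBytesReader can cap a request body in much the same way. A sketch ( hypothetical limitBody helper, not what the repo does ):

const maxBody = 1 << 20 // 1MB

func limitBody(w http.ResponseWriter, r *http.Request) {
	// reads past maxBody bytes will fail, and the connection gets closed
	r.Body = http.MaxBytesReader(w, r.Body, maxBody)
}

I'm still going with limitio here though, for that configurable error.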
And it turns out I know exactly where to put this:
func processRequest(r *http.Request, enc cipher.Handler) error {
	defer r.Body.Close()

	vars := mux.Vars(r)
	op, ok := vars["operation"]
	if !ok {
		return fmt.Errorf("'operation' not a valid key in request vars")
	}

	var err error
	switch op {
	case "encode":
		err = enc.Encode(r.Body)
	case "decode":
		err = enc.Decode(r.Body)
	default:
		err = fmt.Errorf("operation must be one of 'encode' or 'decode'")
	}

	return err
}
But first I’ve got to write a test! It ends up looking a lot like the previous
TestWeb_Routes
test I wrote earlier, so I’m not going to show it here.
I did run into an issue when doing this part though. Originally, I had written up this nice little helper function:
1func returnIfErrorToHandle(w http.ResponseWriter, err error) bool {
2	if err == nil {
3		return false
4	}
5
6	if errors.Is(err, limitio.ErrThresholdExceeded) {
7		http.Error(w, "error during operation", http.StatusRequestEntityTooLarge)
8	} else {
9		http.Error(w, "error during operation", http.StatusInternalServerError)
10	}
11
12	return true
13}
The idea was that if the limit was reached, line 6 would evaluate to true, and the status code would get set to HTTP status code 413, aka “Request Entity Too Large”. However, nothing I could do would get the test to pass – the status code was always 200.
Turns out, this is one of those “a spec whose first draft was initially written in 1989 has some annoying edge cases” things.
See, part of the HTTP spec is that the very very very first thing you send in a response is the status code.
You can test this out on any computer on the command line to prove this is true; here’s me doing it on my Linux computer:
$ curl -I https://example.com
HTTP/2 200
content-encoding: gzip
cache-control: max-age=604800
content-type: text/html; charset=UTF-8
date: Sat, 03 Sep 2022 02:52:05 GMT
content-length: 648
And on my Windows 11 computer:
HTTP/1.1 200 OK
Content-Encoding: gzip
Cache-Control: max-age=604800
Content-Type: text/html; charset=UTF-8
Date: Sat, 03 Sep 2022 02:50:23 GMT
Content-Length: 648
While Windows apparently can’t do HTTP/2 yet, you can still see that using
curl lets us see the very first thing sent back by the
server. On Windows it’s HTTP/1.1 200 OK
, on Linux it’s HTTP/2 200
. For both
of them, that 200
is the status code I’m concerned with at the moment. What I
wanted was to have that status code be something other than 200 when something
goes wrong – like 413, the status code for “you tried to send too much data”.
Because of the way HTTP works, the very very very first thing that has to be sent is that HTTP/... line. After that the server sends the headers; those are the things like Content-Length and Content-Encoding you see. So before any data can be sent back to the user, the status line and headers have to go out first. And because we don't want to read the entire request into a file ( or into memory ) before we start encoding, by the time the limit reader trips we've already sent that 200 status line – so we can't use its error to set the status code.
Now, it’s not the biggest deal that I wasn’t able to get that working the way I wanted. It’s still useful to keep in there; limiting how much our application accepts is a handy feature! It just means that if the limit is hit, we can’t set the status code appropriately. Instead, we just print out an error message in the data we send back. That way, we can ensure the user knows what happened if they try to send too much data – so long as they actually look at the encoded or decoded text!
Also, if I was really going to deploy this it would probably be sitting behind some kind of load balancer. That load balancer should be where stuff like this is really handled, but having the web app itself able to handle these situations is still good. Who knows what exploits are out there someone could use to bypass your load balancer, or trick it into allowing unlimited data through?
Lastly, there is also the Content-Length HTTP header, which should contain how many bytes the client wants to send. I'm not going to use that, for two reasons: I don't feel like writing some middleware right now, and it's pretty easy for a client to say they're only going to send 100KB and then send 100MB instead – although the server is well within its rights to straight up ignore everything after that first 100KB.
Final Clean-up & Refactor #
Alrighty!
Time for some final polishing and cleaning up. I’m going to go through each package, and clean things up. Refactoring, filling out GoDoc comments, trying to make sure the code is clear and readable – that kind of stuff.
Ciphers #
So first up, the cipher package! There's not really much to do here; this package is mostly just the interfaces, the EncoderType constants, and the New constructor. After filling out the GoDoc comments, let's take a look at the piglatin package.
One of the first things I wanted to do was refactor some of the code in encodeToken(token string) error out into smaller functions. That way the encodeToken function becomes readable again; it's very clear which code is for encoding single-length tokens, and which is for encoding everything else. Also, this is where tests saved my butt.
This was my first pass at the refactor:
func (e *Encoder) encodeToken(token string) error {
	if len(token) == 0 {
		return nil
	}

	// a builder to hold the encoded string we're building
	var build strings.Builder

	// if the token is only a single character it's probably either 'I',
	// 'a', 'A', or a special character.
	if len(token) == 1 {
		fmt.Printf("encoding single-char token: '%v'\n", token)
		err := e.encodeSingleCharToken(token, &build)
		if err != nil {
			return err
		}
	}

	fmt.Printf("encoding longer token: '%v'\n", token)
	return e.encodeLongToken(token, &build)
}
But the tests failed:
--- FAIL: TestPigLatinEncoding (0.00s)
--- FAIL: TestPigLatinEncoding/Encode(string) (0.00s)
--- FAIL: TestPigLatinEncoding/Encode(string)/test_8_encode_'hello_world'_to_'elloh\u200cay_orldw\u200cay' (0.00s)
piglatin_test.go:47:
Error Trace: /home/sean/Code/Go/src/github.com/seanhagen/cipherator/cipher/piglatin/piglatin_test.go:47
Error: Not equal:
expected: "elloh\u200cay orldw\u200cay"
actual : "elloh\u200cay \u200cayorldw\u200cay"
Diff:
--- Expected
+++ Actual
@@ -1 +1 @@
-ellohay orldway
+ellohay ayorldway
Test: TestPigLatinEncoding/Encode(string)/test_8_encode_'hello_world'_to_'elloh\u200cay_orldw\u200cay'
Can you tell why? Maybe if I include the output from those fmt.Printf statements:
encoding longer token: 'hello'
encoding single-char token: ' '
encoding longer token: ' '
encoding longer token: 'world'
Well hey, it’s almost like after encoding the short space, it doesn’t return!
A quick fix:
func (e *Encoder) encodeToken(token string) error {
	if len(token) == 0 {
		return nil
	}

	// a builder to hold the encoded string we're building
	var build strings.Builder

	// if the token is only a single character it's probably either 'I',
	// 'a', 'A', or a special character.
	if len(token) == 1 {
		return e.encodeSingleCharToken(token, &build)
	}

	return e.encodeLongToken(token, &build)
}
This isn’t the final form of the encodeToken
function, by the way. I just
wanted to share this little thing that happened while I was refactoring, mostly
to show that a) I make mistakes too, and b) tests are great.
Also, while refactoring, I found a bug! Turns out I'm not quite handling capitalization correctly. For example, what should DUKE OF ALBANY encode to? If you guessed UKEDAY OFWAY ALBANYWAY, you'd be right! However, that's not what my code is producing. Instead, it gives me this: UKEday oFway aLBANYway. Looks like I've got to add some code to handle words like this!
Of course, because I’m doing test-driven development the first step was a test case. And because my tests are using table-driven tests, I only need to add a single line! A few more changes, and that bug is squashed.
Closing Out #
I’ll be honest – I put this project down and then forgot about it for a while. One of the “fun” side-effects of ADHD. In any case, the code is in a decent enough state that I’m going to wrap this up here. All the tests are passing, and all my “additional” sub-challenges are sorted out.
If you want to see the code, you can check out the repository. I tried to create branches for each of the sections, but often got really far into the TDD-then-write-more-blog-post loop before remembering to create new branches. Sorry.
I really enjoyed writing this, and will probably do something similar for other coding challenges. These are fun ways to dive into technologies and techniques I’m unfamiliar with, as well as being a great way to practice TDD.
1. Or creating weird dependencies between a package concerned with being a command line tool and a package concerned with being an API and a package that's just supposed to be a library ↩︎
2. Remember, API doesn't mean "thing provided by a server that uses HTTP" or whatever – it literally means "Application Programming Interface"; so the methods provided by a library and the routes provided by a web service are both APIs, one is just remotely accessed ( kind of like an RPC, hey? ) ↩︎
3. If you're coming from other languages, slices are almost arrays, but not quite – but for our purposes you can think of them as arrays for now. ↩︎
4. Or myself from the future ↩︎
5. Okay, probably more, but these are the two I'm going to focus on. ↩︎
6. 😆 ↩︎
7. If you guessed "make it more like the encoder/decoder for JSON" you'd be right. ↩︎
8. Of course, I say this after having spent like two hours trying to write this section. ↩︎
9. Getting Bart prank-called by a list of words was not something I expected to ever happen, but here we are. ↩︎
10. Because I'm lazy. ↩︎