One of the fundamental algorithms in Computing Science is the LCS, or Longest Common Subsequence. This post will not attempt to explain the algorithm to any great degree. There are any number of references available to show how the algorithm works, as well as reference implementations you can obtain.
Here we will look at the implementation of a Golang version of LCS, which would ultimately allow us to generate a Unix “unidiff” style output for two given files. As you start looking at the available implementations of LCS, you quickly find that they are basically always implemented on pairs of strings. The algorithm itself is easy enough to generalize to objects and types beyond strings and characters, but… well, it’s just not done for you.
Usually at this point, the software engineering contingent that comes from a language with generics starts shouting about the lack of generics in Golang being a very big problem. While I agree that generics can be handy at times, I really hate them with a passion. Most of the time they are not required, and using them just because you can usually results in grossly inefficient and bloated code. Also, I’ve yet to see an implementation of generics that is not either written in a disgustingly incomprehensible syntax, or saddled with run-time typing issues: either tossing type safety out the window, or using run-time checks and throwing exceptions at run time. Oh yeah, I hate exceptions as well… but that’s another topic.
So, let’s get started by having a look at the LCS implementations as given in some of the references above. I won’t bother to identify which came from which (feel free to look up the references yourself). The first implementation looks like the following:
func LCS1(a, b string) (int, string) {
	arunes := []rune(a)
	brunes := []rune(b)
	aLen := len(arunes)
	bLen := len(brunes)
	lengths := make([][]int, aLen+1)
	for i := 0; i <= aLen; i++ {
		lengths[i] = make([]int, bLen+1)
	}

	// row 0 and column 0 are initialized to 0 already
	for i := 0; i < aLen; i++ {
		for j := 0; j < bLen; j++ {
			if arunes[i] == brunes[j] {
				lengths[i+1][j+1] = lengths[i][j] + 1
			} else if lengths[i+1][j] > lengths[i][j+1] {
				lengths[i+1][j+1] = lengths[i+1][j]
			} else {
				lengths[i+1][j+1] = lengths[i][j+1]
			}
		}
	}

	// read the substring out from the matrix
	s := make([]rune, 0, lengths[aLen][bLen])
	for x, y := aLen, bLen; x != 0 && y != 0; {
		if lengths[x][y] == lengths[x-1][y] {
			x--
		} else if lengths[x][y] == lengths[x][y-1] {
			y--
		} else {
			s = append(s, arunes[x-1])
			x--
			y--
		}
	}
	// reverse string
	for i, j := 0, len(s)-1; i < j; i, j = i+1, j-1 {
		s[i], s[j] = s[j], s[i]
	}
	return len(s), string(s)
}
As we can see, this version of LCS is rather specific to taking strings as input, and operates on runes internally. There are some nice properties to this implementation: it works on UTF-8 characters, so it is “internationalized” right from the get-go. A win for Golang in my opinion.
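To get a feel for it, here is a minimal usage sketch; the inputs are my own example (the classic textbook pair), not something from the original post:

package main

import "fmt"

func main() {
	// Prints 4 and one of the length-4 common subsequences
	// (with this backtracking order it should come out as "BCBA").
	n, s := LCS1("ABCBDAB", "BDCABA")
	fmt.Println(n, s)
}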
A possibly easier to read version might be something like this:
func Max(more ...int) int {
	max_num := more[0]
	for _, elem := range more {
		if max_num < elem {
			max_num = elem
		}
	}
	return max_num
}

func LCS2(str1, str2 string) (int, string) {
	len1 := len(str1)
	len2 := len(str2)

	table := make([][]int, len1+1)
	for i := range table {
		table[i] = make([]int, len2+1)
	}

	i, j := 0, 0
	for i = 0; i <= len1; i++ {
		for j = 0; j <= len2; j++ {
			if i == 0 || j == 0 {
				table[i][j] = 0
			} else if str1[i-1] == str2[j-1] {
				table[i][j] = table[i-1][j-1] + 1
			} else {
				table[i][j] = Max(table[i-1][j], table[i][j-1])
			}
		}
	}
	return table[len1][len2], Back(table, str1, str2, len1, len2)
}

// http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
// Back walks the table in table coordinates (i, j), so the characters
// being compared are str1[i-1] and str2[j-1].
func Back(table [][]int, str1, str2 string, i, j int) string {
	if i == 0 || j == 0 {
		return ""
	} else if str1[i-1] == str2[j-1] {
		return Back(table, str1, str2, i-1, j-1) + string(str1[i-1])
	} else {
		if table[i][j-1] > table[i-1][j] {
			return Back(table, str1, str2, i, j-1)
		} else {
			return Back(table, str1, str2, i-1, j)
		}
	}
}
As you can see, it is very similar to the first one, but this one uses recursion to reverse the final answer. It also has a helper function Max() to make part of the innermost loop more readable. Of course, nothing we’ve done here has helped in making this function more generic. If anything, we’ve made it worse by introducing helper functions that would need to be modified and re-implemented for each type we wish to use LCS on. One could forgive the extra helpers for the sake of readability if the performance was faster than, or about the same as, the slightly less readable version. As a future benchmark run will show, the difference between LCS1 and LCS2 is only about 4%, with LCS2 being slightly slower.
Let’s find a simple method to make this function more generic. I really want to use it on arrays of strings, to be able to compare two sets of files. In order to make this function generic, I’m first going to assume that we will be working with sequences of ints. As a method to interface (yes, the irony is not lost on me) with this new LCS3 function, we’ll define an interface like so:
type Interface interface {
	Keys() []int
}
This tells the compiler that we will be dealing with types that implement a Keys() function that returns a slice of ints. Note, in Go, interfaces are implicit, in the sense that any type that implements a Keys() []int function adheres to this interface, and can be used in places that expect this interface. That is, types that you, as the implementor of LCS3(), do not know about at implementation time.
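For example, a trivial type of my own (not part of the post’s code) satisfies this interface without any explicit declaration:

// IntSlice is a hypothetical example type: it satisfies Interface purely by
// having a Keys() []int method; no "implements" clause exists or is needed.
type IntSlice []int

func (p IntSlice) Keys() []int { return []int(p) }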
Let us translate the LCS function to use this new interface:
func LCS3(a, b Interface) (int, []int) {
	aKeys := a.Keys()
	bKeys := b.Keys()

	lengths := make([][]int, len(aKeys)+1)
	for i := 0; i <= len(aKeys); i++ {
		lengths[i] = make([]int, len(bKeys)+1)
	}

	// row 0 and column 0 are initialized to 0 already
	for i := 0; i < len(aKeys); i++ {
		for j := 0; j < len(bKeys); j++ {
			if aKeys[i] == bKeys[j] {
				lengths[i+1][j+1] = lengths[i][j] + 1
			} else if lengths[i+1][j] > lengths[i][j+1] {
				lengths[i+1][j+1] = lengths[i+1][j]
			} else {
				lengths[i+1][j+1] = lengths[i][j+1]
			}
		}
	}

	// read the substring out from the matrix
	s := make([]int, 0, lengths[len(aKeys)][len(bKeys)])
	for x, y := len(aKeys), len(bKeys); x != 0 && y != 0; {
		if lengths[x][y] == lengths[x-1][y] {
			x--
		} else if lengths[x][y] == lengths[x][y-1] {
			y--
		} else {
			s = append(s, aKeys[x-1])
			x--
			y--
		}
	}

	ReverseIntSlice(s)

	return len(s), s
}

func ReverseIntSlice(s []int) []int {
	for i, j := 0, len(s)-1; i < j; i, j = i+1, j-1 {
		s[i], s[j] = s[j], s[i]
	}

	return s
}
As you can see, it looks remarkably similar to LCS1(). The nice thing about this version is that it works exclusively with integers. Most compilers are exceedingly good at optimizing code that operates on arrays of integers, so I am hopeful that this will end up being faster. However, you say, we now need to translate every instance of this problem into an array of integers. Would this not be slower than LCS1() or LCS2() for simple string input? Also, how do you translate strings to []int to use with this? This all seems too complicated.
The key to making this usable for a set of strings lies in the interface we defined earlier. Let’s define a new type and function:
type RuneSlice []rune

func (p RuneSlice) Keys() []int {
	s := make([]int, 0, len(p))

	for _, r := range p {
		s = append(s, int(r))
	}
	return s
}
We define a type RuneSlice and a Keys() []int function for this type. We can now use instances of this type to call LCS3(). Note, this function is not “fast”. It creates a []int and appends a copy of each rune to this new slice. Let’s use this type to implement LCS3String():
func LCS3String(a, b string) (int, string) {
	_, runes := LCS3(RuneSlice(a), RuneSlice(b))
	str := make([]rune, 0, len(runes))
	for _, r := range runes {
		str = append(str, rune(r))
	}

	return len(str), string(str)
}
As you can see, a very small adaptor function can be created to build the necessary input and translate the output back to what is expected. But surely, this must be slower than either LCS1() or LCS2(). Fortunately, benchmarking shows that this version is about 5% faster than LCS1, the faster of the previous two.
So here we are, with a largely generic LCS function. How do we use this function to implement an LCS function that takes []string arguments? One thing to notice is that, as an optimization, instead of dealing with arrays of strings directly, we can use a checksum of each string. The implementation looks something like:
import "hash/fnv"

type StringSlice []string

func (p StringSlice) Keys() []int {
	f := fnv.New32a()

	hash := make([]int, 0, len(p))
	for i := range p {
		f.Reset()
		f.Write([]byte(p[i]))
		hash = append(hash, int(f.Sum32()))
	}
	return hash
}
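With that in place, the adaptor for two slices of strings can be just as small as LCS3String() was. The following is my own sketch (the post’s actual array-of-strings functions are not shown); it only reports the count of common lines, since mapping the hashed keys back to the original strings would require a bit of extra index bookkeeping:

// LCS3Strings is a hypothetical adaptor: it returns how many lines the two
// slices have in common (as a subsequence), using the hashed line keys.
func LCS3Strings(a, b []string) int {
	n, _ := LCS3(StringSlice(a), StringSlice(b))
	return n
}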
Note, the Keys() implementation above is not generic (again!). It explicitly uses the 32-bit version of the FNV-1a hash. If we wished, we could implement a HashKeys() version that also provided a method to retrieve the hash.Hash implementation to use, at which point the hash function would be generic, and could be replaced by other implementations.
Below is a quick benchmark run for the various implementations. Note, BenchmarkLCS4 and BenchmarkLCS5 are implementations that test finding the LCS of arrays of strings. Benchmarks 1-3 test the various LCS functions that work on a pair of strings (finding the longest common subsequence of runes/characters).
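For reference, each of these lives in a _test.go file and follows the usual testing.B pattern, roughly like the sketch below; textA and textB are placeholders of my own, since the post does not show the actual benchmark inputs:

import (
	"strings"
	"testing"
)

// Placeholder inputs; the original benchmark data is not shown in the post.
var textA = strings.Repeat("the quick brown fox jumps over the lazy dog ", 20)
var textB = strings.Repeat("the quick brown cat naps beside the lazy dog ", 20)

func BenchmarkLCS3(b *testing.B) {
	for i := 0; i < b.N; i++ {
		LCS3String(textA, textB)
	}
}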
go test -bench=. github.org/weingart/lcs
Output:
testing: warning: no tests to run
PASS
BenchmarkLCS1 10000 128831 ns/op
BenchmarkLCS2 10000 134044 ns/op
BenchmarkLCS3 10000 102317 ns/op
BenchmarkLCS4 10000 66528 ns/op
BenchmarkLCS5 10000 70554 ns/op
ok github.org/weingart/lcs 5.309s
The code is not complete, and could use some cleanup to be more generally useful, as well as some more love on the front that generates diffs, so that it can output textual diffs that are patch(1) compatible.
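As a starting point for that diff front, here is a rough sketch of my own (not the post’s code) that derives a simple line diff from the LCS3 and StringSlice pieces above; it ignores hunk headers and hash collisions, so it is nowhere near patch(1) compatible yet:

// SimpleDiff is a hypothetical helper: it walks both inputs against the
// common subsequence of hashed line keys returned by LCS3, emitting
// "-" for lines only in a, "+" for lines only in b, and spaces for common lines.
func SimpleDiff(a, b []string) []string {
	_, common := LCS3(StringSlice(a), StringSlice(b))
	aKeys := StringSlice(a).Keys()
	bKeys := StringSlice(b).Keys()

	var out []string
	i, j := 0, 0
	for _, k := range common {
		for i < len(a) && aKeys[i] != k {
			out = append(out, "- "+a[i])
			i++
		}
		for j < len(b) && bKeys[j] != k {
			out = append(out, "+ "+b[j])
			j++
		}
		out = append(out, "  "+a[i])
		i++
		j++
	}
	for ; i < len(a); i++ {
		out = append(out, "- "+a[i])
	}
	for ; j < len(b); j++ {
		out = append(out, "+ "+b[j])
	}
	return out
}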