Repeated Subsequence Problem

  • Nov 15, 2012

This week began by implemeting the two new crawler types that I had determined a need for last week. The implementation of the crawlers themselves went as straightforward as I had anticipated last week, very smooth. Looks like just as soon as I've written all the crawelrs I think we need for now, I'm finally getting good at it!

I also was able to finish up getting the SpreadExpressionStatementCrawler Construct return to work. This is used for tests on multiple ExpressionStatements, but will only return valid results that exist entirely within one instance of the selected construct. For example, one of the tests described by a mentor in Michelle's study was to look for a DoTogether with exactly four method invocations and exactly two characters executing those methods. This test would look for four consecutive ExpressionStatements where exactly two unique characters are making the calls, and then return DoTogethers.

The remainder of my time this week was spent working on a new library for the API: what I'm calling the Sequence API. This is intended to be used with the new ExpressionStatementChunkCrawler to locate common sequences of ExpressionStatements within the code and return their instances. I felt like this is going to be the common useage for the ChunkCrawler, and I guessed it would be non-trivial code to implement (and I was right, more on that in a bit). My plan was to try to implement a method that would only return code sequences where the method calls were exactly identical, perhaps later overloading the method to take a comparator, which could be used to locate code segments passing other rules (e.g. methods are the same, but characters calling them can be different).

This seemed like it would be a fairly explored space, so I was hoping that this would simple matter of using Google & Google Scholar to find some good implementations (I was even going to go for naive solutions, due to the small size of the input), but it ended up being more interesting than I had thought.

First the good news: the Longest Repeated Substring problem does, in fact, have a commonly used solution. It involves suffix trees, which I've never been exposed to, but don't seem horrendously difficult to learn and implement (although certainly not easy). I've also located a naive implementation on Stack Overflow that looks promising as well.

Here's the thing: I question whether we really want to return the LONGEST repeated substring that appears in the code.

Consider this example (each letter represents a different method invocation):


Theoretically, we are likely running this code test because we want to teach the student to extract repeated code into its own method. A longest repeated substring method would extract "ABCDEFG" as the sequence that should become its own method. However, it's arguably a better solution to extract "ABCD" into its own method, and then extract "EFG" to its own method. The way the code crawlers work will prevent this from happening, however (to prevent against returning "A", "AB", "ABC"...etc all as correct solutions).

I made the terrible development mistake of starting to worry about this before I had actually implemented any form of repeated substring (longest or not), and ended up trying to reinvent the wheel (or rather, piece together the wheel based on the wheels I've seen my neighbors make) and make a function that would take this into account. To be short: trying to implement a modification based on ephemeral requirements did not go well.

So moving forward, here is my plan: I'm going to naively implement a longest repeated substring API function, even if I still have questions about whether that's the best option. Then, I'm going to let the question sit for a little bit to prevent it from taking up any more of my time. It's not preventing anything from moving forward with code tests, so I'm not going to make it prevent me from moving forward.

To rebound this blog post from the valley of time-sink-malaise, the good news is that (with this small exception), Code Tests seem to be more or less technically capable of coding all the tests created for Michelle's mentor study! Which is great! I want to get local Save and Load working again early next week, and then every day grab a handfull of tests from the study and implement them (i.e. do some stress testing for the system). And then besides that, I'm excited to try to get the code stencils/highlighting working again, and get visualizations for these new consecutive statements, spread statements, and code chunks!


  • No comments yet

Log In or Sign Up to leave a comment.