## Sunday, November 18, 2012

### Joy of solving easy problems

Here's a problem from a recent Open Cup contest:

The similarity of two strings is defined as the length of their longest common subsequence. That, in turn, is defined as the longest string that can be obtained from both strings by dropping some characers. For example, the similarity of strings CATGT and GACTT is 3, as they have common subsequences of length 3, ATT and CTT, but no common subsequence of length 4.

You are given a string consisting of letters A, C, G, T (by far the most popular background for string problems these days :)). You need to find another string consisting of those four letters, and of the same length as the given string, that is least similar to the given string. Of course, there might be many such strings - for example, for CATGT discussed above there any many strings of the same length with similarity 1, like GAACC or TAAAC.

Can you solve it quickly?

This problem is not difficult, but I've enjoyed receiving "Accepted" from the judging system for this problem a lot. Most contests need easy problems, and it is so much better when those problems are not "easy because you just have to implement what's described in the problem statement", but "easy because you can just figure it out quickly".

#### 14 comments:

1. You are misunderstanding what a common subsequence is. For AAA...AA the answer is not 1, but the number of A characters.  As Petr has written - the longest string that can be obtained from both strings by dropping some characers. This doesn't mean the characters are consecutive(next to each other).

2. Alexander SolovetsMay 10, 2013 at 1:33 PM

Indeed I have missed that point. Thanks.

3. Alexander SolovetsMay 10, 2013 at 1:33 PM

I don't think this is correct solution. Consider the following test: ATCATCATCATCATCATCGGG. Your solution would output GGG...GG with the similarity equal  to 3. While AAA...AA as well as other letters would give 1.

But in general I think that the pattern is correct. But rather than taking the least frequent character, I think it is better to take one that is continuously repeated the least number of times. Thus for the mentioned example the output would be on of: AA...A, TT...T, CC...C. At least I failed to come up with the counter-example.

4. Yes, exactly.

5. Huh, this problem appeared also in the Polish Collegiate Programming Contest 2012 3 weeks ago (problem DNA).

6. I don't think that's a coincidence - I believe the Open Cup contest was ran on the problems from that contest.

7. Not entirely correct. For an input ACCTTT your algorithm would give AAAAAA while the correct answer would be GGGGGG. But the idea is probably right :)

8.  Actually it would be correct the letter G repeated the least number of times.

9. Create a string of just the same letter, e.g. xxx...xxx, where x is the character that occurs the least number of times in the given string. The reason is that, that at least the length of the longest common subsequence is the number of occurences of one of the characters A, C, T, G in the given string, and hence greater than or equal to the number of occurences of the character that occurs the minimum number of times.

10. ThankGod UkachukwuMay 10, 2013 at 1:33 PM

I want t o be as good as you. Can you highlight how one can solve this kind of problem

11.  Intuition:-

1. Think of the input string as superimposition of same character sub-strings ("AAA...", "CC.." etc). Let the sub-string sizes be K1 <= K2 <= K3 <= K4.
2. Remember that the second string is of the same length. Again view it in terms of same character sub-strings.  Now note that you cannot pick a string where ALL the same character sub-strings are smaller than K1 in size, so K1 is the least amount of overlap between the two strings.
3. You will probably see that there are many possible answers now, what Rahul/Petr suggested is just one of them.

12. Cool problem!

What about finding the lexicographically first string, what's the best running time you can achieve?

13. Why "that at least the length of the longest common subsequence is the number of occurences of one of the characters A, C, T, G in the given string" in Rahul's comment. I couldn't get this statement. Any intuition ?

14. let ia, ic, it, ig be the counts in put. Let oa, oc, ot, og be counts in output.

We know that ia+ic+it+ig = oa + oc + ot + og = len(string). Therefore, for at least one x, ox >= ix; hence in particular the answer >= min(ix).