Friday, May 30, 2014

Coming up with tough dynamic programming solutions

Let’s talk about the hard problem from TopCoder SRM 619.

Consider two sequences of N (up to 100) integers, each integer between 1 and M. We call them similar if it's possible to remove at most two elements from each sequence so that they become equal. How many pairs of similar sequences are there for the given N and M (modulo 109+7)?

It’s quite natural to expect that this problem can be solved by dynamic programming: add characters one by one to both strings, and remember something about the characters already added. But how do we figure out what to remember – in other words, what the dynamic programming state is?

It occurred to me that we can approach this question itself algorithmically. Let’s start with the trivial dynamic programming – when we remember the entire strings. In other words, the solution simply is: answer(N, M, [], []), where answer(N, M, prefix1, prefix2) = {1 if len(prefix1) == N and the largest common substring of prefix1 and prefix2 is at least N-2; 0 if len(prefix1) == N and the largest common substring is less than N-2; sum(answer(N, M, prefix1 + [c1], prefix2 + [c2]) for all numbers c1 from 1 to M and for all numbers c2 from 1 to M otherwise}.

This algorithm has O(M2N) states, so it’s way too slow for any reasonable constraints. But we can still implement it and run for very small values of M and N. And then we can notice that some states of our DP are actually equivalent. First, if len(prefix1) ==  N, then there are just two equivalence classes of states: where then answer is 1 and where the answer is 0. Then, for each state with len(prefix) == N-1 we can check how many pairs of (c1, c2) lead to a state with answer 1 (the rest lead to a state with answer 0), and this is the only thing that matters – so there will be at most M2+1 different states. In the general situation, the states with length A are classified based on the multiset of states with length A+1 that are reachable from them.

This allows to write a program that will find the equivalence classes of states, and thus might give us a clue on what do we really need to remember in the DP state! The program is available here, here’s its output for N=6, M=2:
For len 6 there are 2 representatives.
For len 5 there are 4 representatives.
For len 4 there are 10 representatives.
For len 3 there are 9 representatives.
For len 2 there are 4 representatives.
For len 1 there are 2 representatives.
For len 0 there are 1 representatives.

So we have just 32 different states out of the total 5461 states!

But how do we translate this knowledge into a better solution? Well, the program referenced above also prints out the equivalence classes of states with length 3:
aaa$baa bba$baa aba$aaa bbb$bab bbb$abb aaa$aba abb$bab abb$bbb baa$aba baa$aaa aab$abb bab$bbb baa$bba aba$baa bab$abb abb$aab 
aaa$bbb bbb$aaa 
aaa$bba bba$aaa aab$bbb bbb$aab 
aaa$bab bbb$aba aba$bbb bab$aaa 
bba$aab aab$bba 
bba$aba aab$bab aba$bba bab$aab 
bba$abb bba$bab bba$bbb aba$aab bbb$bba aba$abb aab$baa aaa$aab baa$aab abb$bba aab$aaa aab$aba bab$bba bab$baa baa$bab aba$bab bab$aba abb$aba 
bba$bba bbb$bbb aba$aba aaa$aaa aab$aab bab$bab baa$baa abb$abb 
bbb$baa aaa$abb baa$abb abb$baa baa$bbb abb$aaa

By looking at those equivalence classes, we can notice certain symmetries. First, replacing a with b and vice versa clearly doesn’t change the state. Also, swapping the first and the second string doesn’t change the state.

Now we can use this knowledge to obtain a faster solution. Instead of considering all M2N possible states, we can already merge the states that differ only by swapping the strings and permuting the alphabet. We modify our program (new version available here) and run it for larger values of N and M, hoping to gain more insight. The result of a run for N=7, M=3, and printing equivalence classes of length 4 is available here.

We can notice two things: first, even for N=7 and M=3 there are only a handful of truly different states; and then, there are much more situations where states are equivalent which we’re still missing.

And now comes the part that I still haven’t figured out: can we make the jump from the above output to the final idea required to solve this problem? The final idea is: we should only remember the last 3 characters from each string (again up to a permutation of the alphabet), and the answers for the following 9 questions: what is the largest common subsequence of prefix1[1..l1] and prefix2[1..l2] for l1 from len(prefix1)-2 to len(prefix1) and for l2 from len(prefix2)-2 to len(prefix2). Since the largest common subsequence of prefix1 and prefix2 must be at least len(prefix1)-2 (otherwise we’ll be unable to complete both sequences in such a way that the problem constraints are satisfied), there’re only a few such states possible. This idea is explained in more detail in the editorial.

Can you see how the output of our program can lead to such insight? Do you have ideas on how to complete this approach? Are there any interesting articles that use this approach that I should read? Having a disciplined way to come up with dynamic programming solutions sounds quite nice :)

Sunday, May 25, 2014

This week in competitive programming

First, let's return to the previous week's Friday and the Warm-up Round of Yandex.Algorthm 2014 (results, top 5 on the left). Yandex.Algorithm is a tournament that steps a bit away from the usual formats. First, each round is ran using the TCM/Time scoring system which I described in my earlier post about SnarkNews Winter Series. The core of the system is the ability to send each solution either with or without instant feedback, scoring slightly more if you send it without instant feedback, but of course losing it completely if it turns out to be flawed in the end. Second, the selection system for the onsite finals in Berlin is: 25 participants with the best cumulative score in the three elimination stage rounds qualify (details are on the rules page, under heading "Contest structure"). The system is in constrast with the TopCoder system I described in the last week's post: here one can qualify by consistently showing results in top 20ish, whereas TopCoder requires you to have just one very good day and place in the top 12. Of course, in most cases the selection system does not matter that much and the stronger competitors will qualify :)

Now, let's get back to the last week and Tuesday's TopCoder SRM 621 (problems, results, top 5 on the left). The round was tourist's 19th SRM win, and it also brought (a return to) "target" 3000+ rating to three contestants: lyrically, marek.cygan and K.A.D.R. Congratulations Gennady, Kazuhiro, Marek and Iaroslav! I did not take part, so I don't have anything to share about the problems.

And finally, Codeforces Round 248 (problems, results, top 5 on the left) took place on Saturday. Again, I didn't solve the round and can't talk about the problems. The last two problems turned out to be too hard, so while WJMZBMR solved one of them and won in his own league, the others had to fight it out on challenges, and KADR came out on top. Congratulations LiJie (is this the right way to address you?..) and Iaroslav!

Thanks for reading, and see you next week!

Monday, May 19, 2014

This week in competitive programming

The last week had two tournament rounds. On Saturday, TopCoder Open 2014 Round 2A took place (problems, results, top 5 on the left, my screencast). What does "2A" mean? TopCoder has changed the format for the TopCoder Open two years ago. In 2011, the online elimination rounds were structured in a relatively straightforward way: 2000 participants from the qualification round participated in Round 1, and 850 of them advanced to Round 2. Out of 850 participants of Round 2, the best 350 advanced to Round 3. In the same manner, 150 advanced to Round 4, 60 advanced to Round 5, and 24 advanced to the onsite finals.

In other words, in order to qualify for the onsite round one had to place in top 30% of all participants five times. That usually translated to carefully checking and rechecking one's solutions being the most important aspect of the competition for the top algorithmists. Taking risks and performing exceptionally was not rewarded too much.

The format has changed completely starting from 2012. Now 2500 participants from the qualification round (which is now called Round 1A/1B/1C) can participate in three more rounds: 2A, 2B and 2C. In each of those rounds almost 2500 people can participate, and only 50 advance to Round 3 - just the top 2%! If you fail to advance from Round 2A, though, you can participate in the Round 2B, and possibly 2C if still needed. The 150 best competitors determined this way participate in rounds 3A and 3B, with 12 best participants from each round going to the onsite finals - this time the top 8% from each round. The onsite finals have 24 participants, just like in 2011.

There are still five rounds that select 24 best algorithmists from 2000+ participants. In the best case, though, one can participate in just two of them and advance to the onsite round. In my view, such sharp elimination percentages push people to take risks and do their best, and the competition becomes more exciting. My own chances for making the onsite finals have probably reduced, since consistency was always my strength and played nicely with the old system. The new system also allows for more interesting and challenging problems since the cutoff is higher.

The medium problem in this year's Round 2A is quite interesting to reason about: there are 50 wolves on a line segment of length L, i-th wolf starting from point ai and going to point bi. This sounds relatively simple - but the wolves are not allowed to pass each other anywhere except the ends of the segment. So if two wolves want to swap, they have to go either to the leftmost point of the segment, or the rightmost point of the segment. What is the minimum total travel distance they can travel in order for all of them to reach their destinations?

On Sunday, the second Qualification Round of the Russian Code Cup took place (problems, results, top 5 on the left). As I understand, the new Russian Code Cup website ran smoothly this time and the competitors could concentrate on solving the problems - congratulations to the organizers on overcoming all difficulties!

Thanks for reading, and see you next week!

Monday, May 12, 2014

This week in competitive programming

This week had two TopCoder SRMs which brought very good but still very disappointing results for me - two second places :) But I enjoyed solving the problems anyway, and want to share the most interesting ones.

The week started with TopCoder SRM 619 on Monday (problems, results, top 5 on the left, my screencast). Nobody managed to solve the hard problem during the SRM, but it has a very simple and nice problem statement and thus is interesting to solve: consider two sequences of N (up to 100) integers, each integer between 1 and M. We call them similar if it's possible to remove at most two elements from each sequence so that they become equal. How many pairs of similar sequences are there for the given N and M (modulo 109+7)? How would you approach such problem?

TopCoder SRM 620 happened on Saturday (problems, results, top 5 on the left, my screencast). The medium problem turned out to be the most interesting, and produced the most challenges. You were given at most 50 strings with at most 50 characters each, and could sort the strings by any character (keeping the current order when the corresponding character is the same). You needed to come up with a way to obtain the given ordering of the strings from the initial one by zero or more such sort operations. The problem is actually quite simple, but there are many ways to make a wrong step.

Google Code Jam 2014 Round 1C was the last chance to advance to Round 2, and took place on Sunday (problems, results, top 5 on the left). Congratulations to all who advanced - you have just two more rounds before the onsite in Los Angeles!

Any finally, Codeforces Round 245 wrapped up the week (problems, results, top 5 on the left). I did not take part, but have noticed the heated and extensive discussion on the forums - it's always nice to see people take an interest in the problems and their tests beyond simply participating in the contest :)

Thanks for reading, and see you next week!

Monday, May 5, 2014

This week in competitive programming

The week started with the final round of the Kotlin Challenge (problems in Russian, results in Russian, top 5 on the left). The competition took place onsite in the JetBrains office in St Petersburg, and featured just 25 best competitors. In the end, the round was decided on the following problem: you are given an arithmetic expression containings integers, operations "+","-","*","/", parentheses, and at most three occurrences of variable "x". The expression has at most 500 characters. Is there a real value of x that leads to a division by zero in this expression?

On Saturday, Google Code Jam Round 1B gave the second chance to qualify to Round 2 (problems, results, top 5 on the left). I couldn't follow this round, because it was in the middle of ...

Challenge 24 (problems, results, top 5 on the left)! The final 24-hour round for the top 30 teams took place in Budapest from 9am Saturday to 9am Sunday. As usual, the round featured many different types of problems, requiring very diverse skills from the competitors. The winning team, PrecisionScrewdrivers from Poland, found an edge by getting the maximum score of 4000 in problem B which required understanding and improving a large program in a functional language - congratulations! (photo below from the official site)

Here's one of the problems I've enjoyed a lot (problem L): you are given a dump of the OpenStreetMap's road graph of the Earth - about 700 million nodes (id + latitude + longitude) and on the order of one billion edges, two gzipped text files taking about ten gigabytes in total. Then, you're given ten pairs of nodes in this graph and are asked to find any path from the first node to the second one. The time limit is on the order of minutes or even hours - you run this on your own machine and submit the paths. The difficulty of this problem changes greatly depending on the amount of RAM you have. In our case the laptop with most RAM had 8 gigabytes, so we had to work inside that limit. What would you do?