Sunday, March 29, 2015

This week in competitive programming

TopCoder SRM 654 took place in the early hours of Thursday (problems, results, top 5 on the left). Less than half a year after winning an SRM for the first time, Endagorion has now earned his third SRM victory, a feat accomplished by just 50 contestants in the history of TopCoder - congratulations! Next step would be to become the 32nd person to win four SRMs, of course :)

Russian Code Cup 2015 Qualification Round 1 happened on Saturday (problems in Russian, results, top 5 on the left). This was the strongest qualification round since top 200 have qualified and will thus be unable to participate in the next two qualification rounds. Congratulations to Gennady on another flawless victory!

The last problem is worth mentioning at least for its very simple statement: how many strings of length n are there with the given value of a polynomial hash? n is up to 106, the polynomial hash is computed with the given base p and modulo m, m is up to 104 (more precisely, the hash is equal to a0+a1*p+a2*p2+... mod m, where ai is the i-th character of the string, which consists only of lowercase English letters). You need to output the answer modulo 998244353.

Finally, Open Cup 2014-15 Grand Prix of America happened on Sunday as usual (results, top 5 on the left, results of another contest with the same problemset but different time limits). One of the tricky problems, problem G, was about constructing a long string s using a short string t. We start with just the string t. Now we insert another occurrence of t anywhere (possibly before the first or after the last character), and repeat this process until we obtain string s. In the sample input, s was "hhehellolloelhellolo" and t was "hello". Given the string s with at most 200 characters, what's the shortest string t that could've been used to construct it?

Thanks for reading, and see you next week! Please also find a photo with a spring - if a bit gloomy - feeling on the left :)

Sunday, March 22, 2015

This week in competitive programming

TopCoder SRM 653 has ignited this week's contests on Tuesday (problems, results, top 5 on the left). Egor and Kazuhiro were in their own league with amazingly fast solutions both for the medium and for the hard problem, but Egor has squeezed out the victory during the challenge phase - congratulations!

The most interesting part of this round, in my view, was coming up with a challenge for the easy problem. You were given a sequence of numbers with some numbers replaced by wildcards, and were guaranteed that the sequence before replacement by wildcards consisted of several consecutive segments of equal numbers, where each number is equal to the length of the corresponding segment, for example: 3, 3, 3, 4, 4, 4, 4, 2, 2, 2, 2 (3x3+4x4+2x2+2x2), then 3, *, 3, *, *, 4, 4, *, *, 2, * with wildcards. The problem asked to check if there's more than one way to reconstruct the numbers that were replaced by wildcards, and many people have simply counted the number of ways to reconstruct the numbers, and compared it with 1.

Many of those people have used the 32-bit integer type to count the number of ways, but it's not sufficient to count the number of ways for a sequence of length 100, so their solutions might fail if the total number of ways is k*232+1. I found one such solution during the challenge phase, and tried to create a testcase to fail it - but could not, and neither did the system test fail it. However, right after the SRM Misha 'Endagorion' has posted such testcase on Codeforces. Can you come up with a tricky testcase without following that link?

Codeforces Round 296 (problems, results, top 5 on the left) happened later that day. I've skipped the round, but want to congratulate piob on the amazing victory which he achieved in just 54 minutes out of two hours!

On Saturday, VK Cup 2015 Round 1 pioneered a (relatively) original competition format: 2 person teams (problems, results, top 5 on the left). Congratulations to Boris and Adam on the victory! The pre-round favorite team "Never Sorry" has led through most of the contest, but had to resubmit the solution for the hardest problem several minutes before the end of the round and dropped to fourth place. The reason for their resubmission? Their solution made an out-of-bounds access, namely tried to reach n+1-st character in an n-character string. Since they were using C++, this would have flown just fine if that was an ordinary string, but they had their own class since the string was constructed implicitly, and it had an explicit assertion for out-of-bounds accesses. Indeed, removing line 87 "assert(false);" from their first submission makes it pass the system test!

I have to admit that this example goes against my philosophy that more strict languages like Java or Pascal lead to higher probability of passing the system test because more bugs can be caught during the coding phase. Of course, this is just one example :)

VK Cup 2015 Round 1 online mirror was held several hours later with a slightly modified problemset (problems, results, top 5 on the left). Congratulations to Ivan on his first victory on Codeforces!

Now, let's come back to the problem I described last week and the new data structure. You are given a tree with at most 105 vertices, where each edge has an integer length, and a sequence of 105 updates and queries. Each update tells to color all vertices in the tree that are at most the given distance from the given vertex with the given color. Each query requires you to output the current color of a given vertex.

The data structure as described in the Russian post-match discussion and in another Codeforces comment is called "Centroid Decomposition of a Tree". We start by finding the centroid of the tree: a vertex such that it splits the tree into components of size at most N/2, where N is the number of vertices in the tree. One way to find such vertex is to pick an arbitrary root, then run a depth-first search computing the size of each subtree, and then move starting from root to the largest subtree until we reach a vertex where no subtree has size greater than N/2.

Let's mark the centroid with label 0, and remove it. After removing the centroid the tree separates into several parts of size at most N/2. Naturally, now we do the same recursively for each part, only marking the new centroids with label 1, then we get even more parts of size at most N/4, mark their centroids with label 2, and so on, until we reach parts of size 1. Since the size decreases at least twice with each step, the labels will be at most log(N).

The process of construction is displayed in the pictures on the left. The right subtree displays an interval tree analogy, while the left subtree shows that more unusual things can happen.

Now, consider any two vertices A and B in the tree and the path connecting them, and let's find the vertex C with the smallest label on that path. It's not hard to see that the path connecting A and B lies entirely in the part that vertex C was the centroid of in the above process, and that A and B lie in different parts that appear after removing C. So our path is concatenation of two paths: from C to A, and from C to B.

Finding C given A and B is also easy: let's just keep a link from each vertex to its "parent" in the above process (if our vertex has label K, the parent will have label K-1), and let's repeatedly follow this link either on A or on B, whichever currently has a higher label, until the two coincide.

Notice that we've chosen O(NlogN) paths in the tree (from each centroid to all vertices in the corresponding part) such that every path is a concatenation of two paths from that set, and we can find those two paths in O(logN) time. Such decomposition of paths turns out useful in many problems.

Now, how does one solve the problem in question? Well, whenever we need to color all vertices B at distance at most D from the given vertex A with color X, we will group possible B's by C - the vertex with the smallest label on the path from A to B, as descried above. To find all possible C's, we just need to follow the "decomposition parent" links from A, and there are at most O(logN) such C's. For each candidate C, we will remember that all vertices in its part with distance at most D-dist(A,C) from C need to be colored with color X.

When we need to handle the second type of query, in other words when we know vertex B but not A, we can also iterate over possible candidate C's. For each C, we need to find the latest update recorded there where the distance is at least dist(B, C). After finding the latest update for each C, we just find the latest update affecting B by comparing them all, and thus learn the current color of B.

Finally, in order to find the last update for each C efficiently, we will keep the updates for each C in a stack where the distance decreases and the time increases (so the last item in the stack is always the last update, the previous item is the last update before that one that had larger distance, and so on). Finding the latest update with at least the given distance is now a matter of simple binary search.

As usual, I'm expecting that some of you have already known this data structure for ages. Still, I'd love to hear what do you think about my explanation above! Also please tell if you've read a better explanation somewhere else.

And in any case, check back next week!

Tuesday, March 17, 2015

This week in competitive programming

TopCoder SRM 652 took place on Monday (problems, results, top 5 on the left). The round was soon after the flight home from the Hacker Cup, so I've skipped it and thus don't have much to tell about the problems. Adam "subscriber", who has already been featured in top 5 on this blog several times, has won his first SRM - great job!

Here are the solution ideas for the problems from the last week's summary. The first problem described there was about picking the right order to apply skill improvements. The solution is explained very well in the editorial (heading "521D - Shop"), but the basic steps one needed to make were:
  1. All multiplications should be applied in the end, in arbitrary order, and higher multiplier is better than lower multiplier.
  2. All assignments should be applied before all other operations, and we should use at most one assignment per skill. Since we only apply one, we can imagine that we have an addition instead of an assignment: highest value that can be assigned minus initial value.
  3. For each particular skill, higher addition is always better than lower addition, so we should sort the additions in decreasing order and imagine applying them in that order. In this case, for each addition we know for sure the value of the corresponding skill before and after the addition, and thus we can imagine that we have a multiplication instead (new value divided by old value)!
  4. Now all our operations are multiplications, and we should simply sort them in decreasing order.
The second problem was about Conway's look-and-say sequence, and the main solution idea is actually described in the linked Wikipedia article as the "cosmological theorem": sooner or later, the sequence separates into several parts that never interact again, and it turns out that the set of all possible strings that do not separate into several parts is very small - there are around 100 such strings - so we get a linear recurrence relation on a 100-element vector and can use fast matrix exponentiation to apply it many times quickly.

And the third problem was about reconstructing the smallest parallelepiped drawn on the grid that contains the two given cells. Here's the solution that avoids a bulk of case studies that I was referring to: when the two given points are close to each other, we can just try all small parallelpipeds until we find one that covers both; when they are far from each other (let's say at least 10 apart), it's not hard to see that there's a simple lower boundary on the answer that is achievable. More specifically, let the parallelepiped horizontal side be a, vertical side be b, and the diagonal side be c. y-coordinate can be increased at most b+c-2 times  (and similarly x-coordinate can be increased at most a+c-2 times) inside the parallelepiped, so b+c-2 must be at least y2-y1, and thus a+b+c (which is almot the answer) must be at least 5+y2-y1. Also, the difference between the coordinates doesn't change when we move diagonally, so it can only increase at most a+b-2 times, so the answer must be at least 5+(y2-x2)-(y1-x1). The only remaining step is to understand that the maximum of those lower bounds is always achievable if the two given points are sufficiently far apart.

Now, let's finish digging into the Open Cup archives! Open Cup 2014-15 Grand Prix of China happened 2 weeks ago (results, top 5 on the left). Let me describe the problem that I couldn't solve. You're given a simple undirected connected graph with at most 8 vertices. Let's say each edge has a random uniform real weight between 0 and 1. What's the expected value of the minimum spanning tree of this graph? It would seem that with 8 vertices any solution should work, but I've managed to implement one that times out :) Of course, the key to creating a fast solution lies in linearity of expectation. Can you see how to use it?

And finally, Open Cup 2014-15 Grand Prix of Tatarstan happened this Sunday (results, top 5 on the left). I've learned a new beautiful data structure at this contest - a rare event in my old age. Here's the problem that required it: you are given a tree with at most 105 vertices, where each edge has an integer length, and a sequence of 105 updates and queries. Each update tells to color all vertices in the tree that are at most the given distance from the given vertex with the given color. Each query requires you to output the current color of a given vertex.

I couldn't invent the data structure during the contest time, nor did I know it in advance, so I've only managed to come up with a O(N*sqrtN) solution for the case where all update distances are equal. But it turns out that the problem is solvable in O(N*logN*logN) for arbitrary updates. Do you know which data structure helps here?

Thanks for reading, and see you next week!

Saturday, March 7, 2015

This week in competitive programming

Codeforces Round 295 happened early on Monday (problems, results, editorial with challenges, top 5 on the left). Let me describe problem D which I didn't solve during the round: you have an array with 105 positive integers, denoting your various skill levels in an online game. You're also given up to 105 possible improvements for your skills. Each improvement is applicable to one particular skill, and is of one of three types: set the skill level to the given value, add the given value to the skill level, or multipy the skill level by the given value. The skill that can be improved, the type of improvement, and the improvement value are fixed for each possible improvement, so the only freedom you have is which improvements you will apply, and in which order. You goal is to achieve the maximum product of all your skill levels using at most m improvements (m is also given).

This problem requires careful reduction of complexity until it becomes simple. The first step, for example, is to notice that it only makes sense to apply the "multiplication" improvements after all others, and the order of their application does not matter. I've managed to do a few more steps during the contest, but stopped short of the solution because I couldn't find a way to properly handle the "assignment" improvements. Can you see the remaining steps?

Facebook Hacker Cup 2015 Final Round happened on Friday in Menlo Park (results, top 5 on the left). I've managed a very good start by getting the first submission and then skipping the tricky geometry problem (the "Fox Lochs" column in the above table). After submitting the 20-point with almost two hours left, I had two strategies to choose from. I could go back to the geometry problem, implement it carefully, test it a lot, but thus likely not solve anything else - looking at the final scoreboard, that would've earned me the second or third place in case my solution was correct - but given that even Gennady had failed this problem, that's far from certain. Or I could try to solve one of the two harder problems which seemed to require some thinking but looked tractable.

I've decided to go after the 25-point problems, and after some thinking came up with a O(N*sqrtN) solution for the fourth problem ("Fox Hawks"). The problem was: you're given a boolean expression with at most 200000 boolean variables, each appearing exactly once, for example "((1 & (2 | 3)) | 4)". What's the k-th lexicographically set of variable values that evaluate this expression to true?

It was not clear whether O(N*sqrt(N)) would pass within the time limit, so I hoped for the best and started implementing it. The implementation was a bit tricky and required more than an hour (including writing a simple but slow solution and comparing on a lot of small testcases). I've finally downloaded the input with about 30 minutes left in the contest - and my solution turned out a bit too slow (solved 12 testcases out of 20 in the time limit) :( Had I implemented it a bit more efficiently, or had I used a more powerful computer, I might've got it, and it turns out that would've earned me the victory. Well, better luck next time!

After discussing my solution with Slava "winger" Isenbaev after the contest, I've also realized that it's not hard to change it into a O(N*logN) approach. I've done that after the contest, and after about 20 minutes of coding and 5 minutes of debugging I got a working solution that solved all cases in about 20 seconds (out of 6 minutes). Can you guess what's the O(N*logN) approach knowing that it's an improvement of an O(N*sqrtN) one?

Now, let's continue covering the Open Cup contests from February. Before presenting a new round, let me give the solution ideas for the problems I posted last week.

The first problem was about finding the k-th lexicographically borderless word of length n (up to 64), where a word is borderless when it has no non-trivial borders. In order to solve this problem, let's learn to count borderless words first. The first idea is to notice that if a word of length n has any borders, then it must also have a border of length at most n/2 (because if a prefix is equal to a suffix and they intersect, then we can find a shorter prefix that is equal to a suffix - see the picture on the left). Because of this, when n is odd, the number of borderless words of length n is equal to the size of the alphabet times the number of borderless words of length n-1 - we can simpy put an arbitrary character in the middle of a word of length n-1. And when n is even, the number of borderless words of length n is equal to size of the alphabet times the number of borderless words of length n-1 minus the number of borderless words of length n/2: we can also put an arbitrary character into the middle of a word of length n-1, but we need to subtract cases where the new word has a new border of size n/2. Finding k-th lexicographically borderless word is done in a very similar manner: instead of just counting all borderless words, we can use the same approach to count all borderless words with the given prefix.

The second problem was about finding two deterministic finite automata with at most n+1 states that, when used together, accept the given word and only that word. The automaton with n+2 states that accepts only the given word is straightforward. How do we get rid of one state? Well, let's glue two adjacent states together. This will result in an automaton that accepts the given word, but also all words where the letter in a certain position can be repeated an arbitrary number of times. If we do this once for the first letter, and the second time for the last letter, we'll obtain the two automata that we need, unless all letters in our word are equal. An in case all letters in our word are equal, it's not hard to see that there's no solution.

Now, on to new tricky problems! Open Cup 2014-15 Grand Prix of Karelia happened 3 weeks ago (results, top 5 on the left). The most difficult problem F was about the Conway's look-and-say sequence starting with 2: we start with single digit 2, then repeatedly describe what we see. For example, on the first step, we see one "2", so we write "12". On the second step, we see one "1" and one "2", so we write "1112". On the third step, we see three "1"s and one "2", so we write "3112", and so on. How many digits does the n-th step contain (modulo p)? n is up to 1018.

Open Cup 2014-15 Grand Prix of Udmurtia happened 2 weeks ago (results, top 5 on the left). Let's talk about a relatively easy problem for a change.

Problem B of this round was concerned with drawing parallelepipeds on a grid. More specifically, a parallelepiped is drawn on a grid like this: we start with a rectangle, then add three diagonal segments of the same length, and then connect their ends as well - see the picture on the left. The parallelepiped has three parameters: the two sides of the rectangle, and the size of the diagonal. All three parameters must be at least 3.

A parallelepiped was drawn, but then all squares but two were erased, so you're given the coordinates of the two remaining squares, each up to a billion. What's the smallest possible total number of cells in the original drawing?

This problem looks quite nasty from the outside, and it feels like it can have a lot of tricky corner cases. But it turns out it's possible to write a solution that sidesteps those and solves everything in quite general manner. How would you approach this problem?

Thanks for reading, and check back next week!

Sunday, March 1, 2015

This week in competitive programming

There were three big competitions planned for this week: Challenge24 2015 Electronic Contest, TopCoder SRM 651 and Open Cup 2014-15 Grand Prix of China. However, the first two faced some serious issues, and the Open Cup is yet to happen tomorrow. Given that I've falled a bit behind on the Open Cup rounds, I will keep this week's summary without any news, and continue describing the Open Cup rounds from February.

Last week we discussed a very nice problem from the Grand Prix of Japan: count the number of pairs of strings (s, t) such that the length of s is N, the length of t is M, each character is from an alphabet of size A, and t is a substring of s.

Suppose the string t is fixed. How do we count the number of strings s of given length that contain t? We will count two things: the number an of strings s of each length n between 1 and N that do not contain t, and the number bn of strings s of each length n that contain t, but only as a suffix and not before. First approximation for an is simple: we have to take a string of length n-1 and add one character to it, so we have an=an-1*A. Of course, that formula is not correct because it doesn't take t into account at all. How could t appear in the length-n string, if the first n-1 characters didn't contain t? Well, it must appear at the rightmost position, so the correct formula is an=an-1*A-bn.

But how do we find bn? Well, the first approximation is also simple: we have to take a string without any appearances of t, and add t in the end: bn=an-M. However, this is also not correct because by adding t in the end, we might've also added some occurrences of t before the end, so we need to subtract something. Well, let's look at the first appearance of t that we added - let's say it ends p (p>0) characters from the end. This can happen only if t has a border (a string that is both its prefix and its suffix) of length M-p - see the picture on the left. And for each string of length n-p that ends with t, there's exactly one way to extend it to a "bad" string of length n. So the correct formula is: bn=an-M-sum(bn-p), where the sum is taken over all positive p such that there's a border of length M-p in t.

In other words, it turns out that knowing all border lengths of t is sufficient for finding the number of strings of any given length that contain it. At the same time, not all sets of border lengths are possible, since having even one border implies a lot of equalities between the string characters, and in case we have several, the string quickly converges to having all characters equal, and thus all border lengths at the same time. This idea inspires a reasonably simple backtracking: let's try to build all non-contradictory sets of border lengths for a string of length M by recursively trying to include/not include the longest border (of length M-1), then M-2, and so on. We will bail out from the recursion as soon as we have a contradiction. In order to check if we have one, we will build equivalence classes of characters according to borders that do exist, and check if they imply a border that we've declared as not existing.

This recursion runs almost instantaneously for M=50, and reports that there are only 2249 non-contradictory sets of border lengths. We have almost solved our problem: we can iterate over all possible sets of border lengths, and find the number of strings s that contain a string t with such set of border lengths using the dynamic programming described above. But note that we must also multiply that by the number of such strings t - so we also need to find the number of strings with the given set of border lengths.

The latter, on the surface, is also computed easily: it's simply A to the power of the number of equivalence classes defined by the borders. But as you've probably guessed, this will overcount: here we won't throw out the strings that have some additional borders. But since we have just 2249 different sets of borders, we can fix this by simply subtracting the numbers of strings for all supersets of our set of border lengths, obtaining a dynamic programming approach for counting t's. And that completes the solution!

The next round of the Open Cup, Northern Grand Prix, happened three weeks ago (results, top 5 on the left). Just like the Grand Prix of Japan, this round featured many interesting problems, but I'd like to highlight two. Problem B was also about string borders: one needed to count the k-th lexicographically borderless word of length n (up to 64), where a word is borderless when it has no non-trivial borders. Riding the wave of the solution explained above, I've tried to use the same approach by using inclusion-exclusion principle over sets of possible border lengths - but unfortunately, this approach was a bit too slow for the time limit. Can you see how to solve this problem faster?

Problem D considered languages defined by deterministic finite automata. More precisely, one needed to define a language containing just the given word, and no other words. When the given word has length n, it's easy to construct an automaton with n+2 states that accepts the given word and only the given word: a chain of n+1 states that have transitions using the characters of the word, the (n+1)-st state being final and all other transitions leading to (n+2)-nd "sink state" which is not final. This problem, however, limited the automata to n+1 states. It's not hard to see that it's impossible to create an automaton with n+1 states that accepts the given word of length n and only that word. Because of that, you were asked two output two automata with at most n+1 states that have the language with just one word in their intersection - in other words, such that both of them accept the given word of length n, and there's no other word such that both of them accept it. Can you figure this one out?

Thanks for reading, and check back next week!