## Tuesday, December 25, 2018

### A bmerry week

TopCoder SRM 739 was the first round of the Oct 8 - Oct 14 week (problems, results, top 5 on the left, analysis, my screencast). Nobody was able to solve all three problems correctly this time, but pashka's very fast solution to the hard problem earned him a comfortable first place. Congratulations on the win!

Here's that hard problem: you are given a number n between 3 and 500. You need to generate any simple (not necessarily convex) polygon with n vertices, where all vertices have integer coordinates between 1 and 25, inclusive, and which has the smallest possible area among such polygons.

AtCoder Grand Contest 028 took place on Saturday (problems, results, top 5 on the left, analysis, my screencast). With just a few rounds left in the year, the fight for the top 8 in the AtCoder Race Ranking is entering its deciding phase. Not so much for tourist, who is head and shoulders above everybody else, and who has further cemented his Race Ranking lead with a win here. Well done!

Problem D in this round was a cute combinatorics exercise. You are given 2n points on the circumference of a circle, where n is up to 300. We're going to form n pairs of points, such that each point is in exactly one pair, and draw straight line segments between the points in each pair. After all segments are drawn, we're going to count the number of connected components of segments (two segments are connected directly if they intersect). We have already formed k such pairs, and need to form the remaining n-k. You need to find the sum of the numbers of connected components over all possible ways to form the remaining n-k pairs.

Open Cup 2018-19 Grand Prix of Korea followed on Sunday (results, top 5 on the left, partial analysis). Just like last week, team Past Glory did not really need the whole 5 hours — congratulations on another convincing win!

Our team, on the other hand, used all available time, and solved the last problem with less than a minute remaining on the clock. You can see the diff between the previous incorrect submission and the last-minute correct one on the right. Just imagine: there's almost no time left, we've got a wrong answer, we've implemented a stress test and found a discrepancy, but there's no time to debug. What could be wrong in this code, anyway? Let's try swapping left and right — and voilà, the stress test passes, and there are still a few seconds left to submit :)

Problem B in this round left me with a feeling of "wow, how come such a problem hadn't been set before!": you are given a 50x50 grid where some cells are empty, and some are walls. There are also walls surrounding the grid. Some empty cells are marked as targets, and there's also a ball in one of the empty cells. You can move the ball in each of the four cardinal directions, but once it starts rolling, it rolls in the same direction until it hits a wall, at which point you can pick a new direction, and so on. Can you roll the ball in such a way that it passes through all targets in some order?

Codeforces Round 516 ran in parallel with the Open Cup (problems, results, top 5 on the left, analysis). With nobody able to solve everything, choosing the right problems to solve was a necessary part of winning, and mnbvmar found the right set of problems and solved them faster than everybody else. Congratulations on the victory!

This round also brought somewhat sad news: Bruce, who placed second and regained his nutella status, is retiring from rated contests on a high note. I've certainly enjoyed competing against Bruce for a really long time (and competing in a team with him once, as we solved a World Finals mirror together a few years ago!), so I'm still hoping to chat with him at the onsites that don't have a rating system :)

In my previous summary, I have mentioned an IPSC problem: you are given a prime number p up to 10^100, a number k, and n <= 100 triples of numbers a_i, b_i, m_i. Your goal is to count how many numbers x between 0 and p-1 satisfy the equation ((a_i*x + b_i) mod p) mod m_i = ((a_i*k + b_i) mod p) mod m_i for at least one i. Moreover, you need just the approximate count: an error of up to 1% is fine. There are 100 testcases to solve, and the duration of the contest is 5 hours.

A rather straightforward first step in solving this problem is to figure out how to solve each individual modular equation. I won't go into the details as it's not the most exciting part of the problem, so I will just state that we can find the number of solutions, test whether a given number is a solution, and describe all solutions in a way that allows, for example, sampling a uniformly random one.
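For concreteness, here is a minimal sketch of those three primitives (hypothetical helper names, not from the original solution). It relies on the fact that y = (a*x + b) mod p is a bijection on [0, p-1] when a is not divisible by the prime p, so we can simply enumerate the values y in [0, p-1] with y mod m = c and invert the affine map:

```python
import random

def count_solutions(a, b, p, c, m):
    # number of x in [0, p-1] with ((a*x + b) % p) % m == c;
    # equivalently, the number of y in [0, p-1] with y % m == c
    if c >= p:
        return 0
    return (p - 1 - c) // m + 1

def is_solution(x, a, b, p, c, m):
    return ((a * x + b) % p) % m == c

def random_solution(a, b, p, c, m, rng):
    # pick the j-th valid value y = c + j*m, then invert y = (a*x + b) mod p
    j = rng.randrange(count_solutions(a, b, p, c, m))
    y = c + j * m
    return (y - b) * pow(a, -1, p) % p
```

In the actual problem p has up to 100 digits, so a big-integer language (like Python here) or a bignum library is needed; the structure of the primitives stays the same.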

Now we have 100 huge sets of numbers whose sizes we know and whose members we can test and generate, and we need to estimate the size of their union with at most 1% error. The allowed error strongly hints at some kind of Monte Carlo approach. The most straightforward Monte Carlo approach would be: repeatedly take a random number between 0 and p-1, and check if it belongs to one of the sets. This gives us an estimate with an error that's almost certainly less than 1% of our sampling pool, in other words less than 1% of p.
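As a sketch (with a hypothetical membership test `in_union` standing in for checking x against all n equations):

```python
import random

def naive_estimate(p, in_union, trials, rng):
    # sample uniform numbers from [0, p-1] and count how many lie in the union;
    # scale the hit fraction back up by p
    hits = sum(in_union(rng.randrange(p)) for _ in range(trials))
    return hits * p / trials
```

Note that the error of this estimator is naturally measured against p, not against the answer itself.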

To see this more formally: in each trial we sample from a random variable that equals either 0 or p, taking the value p with probability a/p, where a is our answer, and then we report the mean of k such samples as our estimate. The central limit theorem tells us that such a mean tends to behave like a normal variable whose mean equals the mean of an individual sample, which is p*a/p = a, just as we need. The central limit theorem also tells us how to estimate the error: the standard deviation of that normal variable equals the standard deviation of an individual sample divided by the square root of k. The standard deviation of an individual sample is sqrt(p*p*a/p - a*a) = sqrt(a*(p-a)), and we can assume that a normal variable will practically never stray more than, say, 6 standard deviations from its mean, so our error will be at most 6*sqrt(a*(p-a)/k). Unfortunately, when a is very small relative to p, this is still very big relative to a.

At this point during the contest I tried to come up with some ad-hoc fixes using common sense, and did not succeed. However, I should have simply continued using more formal statistics! The problem with the above approach is that we're sampling from an individual distribution with too large a standard deviation — so let's come up with another one. Instead of sampling from all p possible numbers, let's use the fact that we can generate solutions for each equation: let's sample a random (i, x) pair where x is a solution to the i-th equation, in such a way that every valid (i, x) pair is equally likely. We can do that by first sampling i proportionally to the number of solutions of each equation, and then sampling a uniformly random solution x to the i-th equation.

Now we need to come up with a random variable whose mean equals our answer. Let t_i be the number of solutions of the i-th equation, and let t be the sum of all t_i. Given a sample (i, x), let's make our random variable equal to t if there's no equation with a smaller index j < i such that x is also a solution of the j-th equation, and 0 otherwise. It's not hard to see that for each distinct x from the union of all solutions, our random variable equals t for exactly one i, and thus the mean of our random variable is a*t/t = a, just as required. (Another way to achieve the same mean would be to make our random variable equal to t divided by the number of equations that x solves.)

The standard deviation of this variable is sqrt(a*(t-a)), and we know that t is at most a*n, so it is at most sqrt(a*a*(n-1)) <= a*sqrt(n). Now we need to pick k such that 6*a*sqrt(n)/sqrt(k) <= 0.01*a, or 600*sqrt(n) <= sqrt(k), or k >= 360000*n, so with n = 100 we get k >= 36 million. This is already doable in a few seconds per testcase, and in reality we need even fewer samples, as 6 standard deviations is a very conservative bound.
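Putting the pieces together, the improved estimator might look as follows (a sketch with hypothetical helpers: `t_sizes[i]` is the number of solutions t_i of the i-th equation, `sample_from(i)` returns a uniformly random solution of equation i, and `solves(j, x)` tests whether x solves equation j):

```python
import random

def union_estimate(t_sizes, sample_from, solves, trials, rng):
    t = sum(t_sizes)
    total = 0
    for _ in range(trials):
        # every valid (i, x) pair is equally likely: pick i with
        # probability t_sizes[i] / t, then a uniform solution x of equation i
        i = rng.choices(range(len(t_sizes)), weights=t_sizes)[0]
        x = sample_from(i)
        # count x only at the smallest equation index that it solves
        if not any(solves(j, x) for j in range(i)):
            total += t
    return total / trials
```

Each sample is either 0 or t, and equals t with probability a/t, so the mean is a, matching the analysis above.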

During the contest I could not come up with this sampling idea, and instead tried to return the result as a sum of estimates of the sizes of the first set, the second set minus the first set, the third set minus the first two sets, and so on, where each particular estimate is built using essentially the above approach. However, this meant I needed to decide how to allocate my samples between the n estimates, and I could not get that part right (from the above solution, it is clear that I simply needed to make the sample counts proportional to t_i — and indeed, my solution was accepted after the contest ended with this fix).

Thanks for reading, and check back for more!