Insights from testing and preparing contests, and why problemsetting trends need to change

I’ve been involved with testing a lot of Codeforces contests and coordinating contests for my uni, and have seen various kinds of issues crop up. However, I was able to come up with a stable system for my uni contests that helped make the experience as educational and error-free as possible, and I have found the lack of some of those things to be the main reason behind issues faced by Codeforces contests.

Given that it takes months to get a round ready, it is better to be more careful than necessary compared to getting your round unrated or having bad feedback in general.

I’d like to thank Ari, null_awe and prabowo for reviewing the post and discussing more about problem-setting.

The current Codeforces contest guidelines are here for problem-setters and here for testers, and they have some great sanity checks that will help you make a good contest. But I still think it is missing some things that leads to bad contests at times.

Issues

Let’s list out some issues that plague contests.

Wrong intended solution or tests.
Weak tests.
TL or ML too tight or too loose.
Some hacks are not accepted.
Wrong IO format.
Checker issues — too stringent, too loose, or simply wrong.
Guessable problems.
Problems that have appeared before.
Unbalanced contests — difficulty-wise.
Unbalanced contests — topic-wise.
Long queue, system unavailability.

Out of these, the most glaring issue is having a wrong intended solution itself, and leads to making the contest unrated immediately.

Discussion

I think the fundamental problem behind most of these issues is two-fold:

Lack of rigor while setting a problem.
Not adopting a security-based approach to testing (pentests, for example).

Whenever I coordinated contests at my uni, I sent out a couple of guideline documents for setters and testers, that built upon these two ideas. The guiding principle for the first of these reasons (that every author and tester should understand, in my opinion) is consistency (and proofs of it). Not just one type of consistency, but consistency of the intent, constraints, statement, checker, validator and tests — all at the same time.

NOTE: Before we go forward, let’s clarify what we mean by a formal proof. In this comment, I clarified that we are not talking about scary super-technical proofs (such as ones you might encounter in advanced set theory). A proof that is logically sound, where steps follow from basic logical reasoning and don’t have gaps, is good enough. I call it a “formal proof” just because people on Codeforces call anything that looks reasonable a “proof” (which is obviously wrong and misleading) — and that is not what I am talking about. If you’re not very familiar with the idea of a mathematical proof, don’t worry. It is not meant to be too hard. I would recommend reading the chapter on proofs in this book, to understand what kind of logical statements I am talking about, and the idea behind proofs generally.

Let’s understand this by asking ourselves the following questions:

What’s the intent of the problem (not in terms of ideas, but from a contest perspective)? Usually it is “solve this problem in xyz time complexity and/or abc memory complexity” or “solve this problem in these many queries”. Constraints on the input and output, time limit and memory limit arise from this naturally.
What do we want the contestants to solve? This corresponds to a precise and unambiguous statement of the problem, including the IO format, guarantees on special kinds of tests (if any) and so on. Unfortunately, there is no way to formally check that your Polygon materials are consistent with this problem statement, which is why it becomes necessary to ensure that the statements are such that your solution can be derived formally from the problem statement.
How do the contestants prove that they have solved the problem? In other types of reasoning/math contests, you have to provide a proof of correctness to show that you have solved the problem. This is also the approach taken by academia. However, unless you force participants to submit a solution in proof languages like Coq or Lean, it is impossible (or very hard) to enforce that they have a proof of correctness for their arguments and time complexity of their algorithms. This is where tests come in. “Strong” tests mean that any code that comes from a solution that doesn’t satisfy the intent of the problem fails to get accepted on those tests. Since the constraints are bounded, the set of all possible tests is finite, but still huge enough that adding them all is infeasible. So to combat that, a good practice is to add all possible tests for small constraints, some random max tests, and tests that break all* possible heuristic solutions.
How to verify that our testing procedure is correct? Of course, if you feed in garbage data (i.e., data that doesn’t satisfy the preconditions of a program), the output might be garbage. If there are inconsistencies in the tests, it is impossible to verify if the submitted solutions are correct (as in they solve the problem as stated). The same goes for tests that fall outside the constraints.

Solutions to the issues — the guidelines

We will now list the guidelines that help make contests error-free. Most of these will have both an author and a tester part. Keep in mind that these guidelines will be along the lines of a security perspective.

Note that there are some things that we can completely verify in a mathematical sense by proofs. But there are also things that we can’t be sure of. One of the reasons behind it is that the halting problem is undecidable. So, for those kinds of things, we enforce some kinds of limits on the runtime or the memory requirements, and since we also want to check the behavior of submitted solutions on large inputs, we miss out on a lot of tests due to the combinatorial explosion of the number of tests that happens for most problems.

Provable things

Firstly, let’s see what we can prove the correctness of. Keep in mind that a proof of consistency is always an “if and only if” argument.

Statement and solution

This is the part that comes even before setting the problem on Polygon or any other contest preparation system. Ensure that you have the following things:

Formal statement
Formal algorithm
Formal proof of correctness of algorithm

None of this should have actual numbers in it (keep everything in terms of what is defined in the input, and don’t think about the constraints for now). Don’t even implement it in a programming language; use constructs in the programming language as well as standard algorithms whose proof of correctness is well-known as a black-box.

Your proof and algorithm should depend only on the statement and logic. Don’t implicitly add in any other assumptions or make wrong assumptions in the proof (if you write the proof, it will feel off anyway, so it’s quite hard to make a mistake like this if you follow this guideline).

Some notes:

The algorithm and the proof of correctness are two different things. Saying that “it can be easily seen that this greedy works” or “finding max flow in this graph gives the answer” is not a proof.
Write out everything mathematically, even if it is obvious to you. This is the foundation of the whole problem, and if there are issues in this, is there even a point to preparing the problem, or even thinking about it?
I recommend reading this article and the links it refers to, to understand what I mean by a proof here. If you don’t know how to structure things in the form of a proof, perhaps you are not yet ready for writing a problem for a contest yet. However, it is never too late to learn, and your idea behind the problem might be correct. A formal proof gives you a way to communicate things with others unambiguously, which is important since natural language is imprecise. Writing formal proofs is not hard, and apart from being a tool to communicate, it cements your understanding as well.
It might be possible that there is a solution faster/better than your intended solution. Don’t be afraid to use it as the intended solution (in fact, try to find as many solutions as possible).

Doing this step properly eliminates the possibility of having wrong statements or wrong intended solutions. It also avoids situations like this where the intended solution was “fast enough” on the tests but the problem in general didn’t seem to be solvable fast enough. Incidentally, that post is a good example of what can go wrong in a real-life setting.

Constraints and test validation

Setting constraints is the part which requires getting your hands dirty. Let’s ignore that for now, and let’s say we have some constraints.

How do you ensure that your tests are correct? Think of it as a competitive programming problem again, though a very simple one. This is where a validator comes in. With a validator, you validate the tests based on what the problem states. The reasoning behind proving the correctness of the validator should go like this:

We show that the validator accepts a test if and only if the test is valid according to the problem statement. For the if-direction, [do something]. Now for the only-if-direction, [do something].

I recommend keeping the numbers in this proof generic enough, as is done in most math proofs. We will plug in the numbers later, when we discuss the constraints. Having this proof handy is quite important. Sometimes, when preparing a problem, you end up changing the constraints later on, but forget to update the validator. If you keep this proof as a comment in the validator itself, and inspect all the contest materials before committing changes, you will immediately catch inconsistencies in the statement and the validator.

Also, validators are important when hacks are enabled. If your validator is too loose, you risk some AC solutions getting hacked. And if your validator is too tight, some hacks might be invalid, and there are some hidden conditions in the problem that the hacker might be able to guess, which makes the contest unfair. For example, this contest had such an issue.

For choosing constraints, you need to write some code to verify that the TL and ML are fine and so on. However, this is not something you can prove even in a handwavy sense (unless you know the assembly that your code generates and the time every instruction takes in being executed in that order, on the worst possible test), so we will keep it for later.

Checker

This corresponds to checking that the contestant’s solution indeed outputs something that is valid, according to the problem statement.

Again, as in the validator, using a formal proof, ensure that the output satisfies all the conditions for it to be accepted as a valid output by the statement, and don’t add any extra checks. This is important because there have been CF contests where the checker checked something much stronger about the structure of the solution. A very simple example is “print a graph that is connected”. This doesn’t mention that the graph can’t have multiple edges or loops, but many people will just write a checker that will add these conditions anyway. This is another reason why adding links to precise definitions in the problem is a good thing.

Sometimes writing the checker is harder than solving the original problem itself, and there have been instances in certain CF contests where writing the checker was the second subtask of a problem! Keep this in mind while setting problems, since it is useless if you can’t write checkers. Sometimes, it might make sense to ask for only a partial proof of solution from a contestant — like a certain aspect that you can compute only if you use some technique, or something that is weaker than asking for the output, but is a reasonable thing to ask for (this changes the problem itself, though, and hence it doesn’t show up as an issue on the contestant-facing side of problem-setting).

Things we have to manage heuristically

As we mentioned earlier, almost everything about setting constraints and creating tests is something that we have to manage heuristically. Both of these things go hand-in-hand, which makes the whole process quite iterative. Some obvious guidelines that help in these cases are given in the next section.

Condensed list of guidelines for testers and authors

Common guidelines for setters and testers

Read this post and the links in this post that point to CF guidelines. Also read the links to issues in actual CF contests and think about what went wrong.
Keep the contest preparation process interactive and discuss as many solutions as possible, as well as potential changes to problems.

Guidelines for setters

Ensure that you know what testers are meant to help you with, and read guidelines meant for them as a starting point.
Have a formal write-up describing the problem, algorithm and the proof, as described in the post. The formal editorial for a problem should be ready before you prepare the problem.
Keep the problem statements short, clean and unambiguous. Use lots of space in unavoidable situations. Avoid using examples in statements that unnecessarily obstruct the flow of the reader, and write the statement in a way that doesn’t need you to rely on examples in the first place.
Ensure that your model solution is readable and is provably correct (and it doesn’t have any undefined behavior and so on). Preferably write it in a “safe” language, and turn on all debug flags while testing the main solution (this is something I have wanted on Polygon, but there’s no support for that unfortunately).
Write strict validators and checkers, and prove their correctness too (both in terms of semantics as well as formatting). When writing checkers, do NOT assume that the jury solution is correct. Rather, check the correctness of both the jury solution as well as the participant’s solution.
Have exhaustive test cases — ideally, use a bruteforce to generate all possible tests for some small sub-constraints to ensure correctness, corner cases, tricky large cases for TLE and random large cases. Ensure that you cover all possible special cases, and look for specific solutions that can give AC but do not work on the general case (like faulty greedy solutions and so on). Keep in mind unordered_map and unordered_set like peculiarities in the languages.
Ensure that the time limits are not too tight, and they are at least 2-3 times the slowest AC solution you have. Try to keep constraints in a way that the most optimized unintended solution either doesn’t pass (by a good margin — like twice the TL), or barely passes. In the latter case, you should be reconsidering whether you want this problem to appear on the contest at all.
Before relying on the testers, try to test your own problems according to the tester guidelines.
Maintain something like a spreadsheet for tracking status and feedback of problems, and current progress.

Guidelines for testers

Ensure that you know what setters are meant to do, and read guidelines meant for them as a starting point.
Ensure that the problem statements are water-tight and self-contained (not a lot of background knowledge necessary, and wherever necessary, links should be present) and provided examples clarify unavoidable ambiguities (which should ideally be non-existent).
After testing, confirm that the authors’ proofs are all correct. This is not limited to just proofs of correctness of the intended solution, but also for the checker, validator and so on. Just because you couldn’t solve a problem in the virtual contest testing doesn’t mean you mustn’t care about the other problems.
Try to hack the problem: submit heuristic solutions, both in the sense of potential WA as well as potential TLE. If they pass on all cases, try to come up with cases they can’t pass on, or prove that this heuristic is correct/fast enough. Such a situation is quite instructive whenever it arises (from experience, this happens usually when there is a greedy solution involved which doesn’t really need the full strength of the greedy decision).
Try to write the slowest possible reasonable solutions that a contestant can come up with, with the same complexity as the author’s solution. Also try to come up with highly optimized brute-force solutions (using caching, bitsets, precomputation of answers and so on) and see if they can pass the test cases. Look for test cases that these can’t pass, but the official solution passes. In case the official solution and the optimized brute-force both have similar speeds, discuss what to do next with the coordinator and the author.
If you come up with an alternative approach with a better or slightly worse complexity, notify the problem setter, and depending on what the intended difficulty of the problem is, discuss whether to make the problem harder/easier.
Give detailed feedback on the quality of problem statements, problem content, editorials, solutions, test cases, perceived hardness of the contest, and anything else that will help improve the perceived quality of the contest.

Handling other non-obvious issues.

At this point, there are the following issues we haven’t addressed.

Guessable problems.
Problems that have appeared before.
Unbalanced contests — difficulty-wise.
Unbalanced contests — topic-wise.
Long queue, system unavailability.

The first four issues are solved only by testing and being consciously aware of the fact that they need to be fixed.

If problems are guessable, it makes sense to ask contestants to construct something that shows that their guess is correct. For example, it might be easy to guess that the size of the answer for a problem is \(n/2\). This is very guessable, and in this case, it is better to ask for the construction of the answer as well.

For avoiding problems that have appeared before, since there is currently no repository of all competitive programming (or math contest) problems that is searchable by context, the only way is to invite a lot of experienced testers with different backgrounds.

For avoiding unbalanced contests (both in terms of difficulty and topics), it is important to have a large variety of testers — people with various amounts of experience as well as various types of topics that they are strong at.

The last type of issue is something unique to online contests that are on a large scale — for example, CF contests. This is the main reason why pretests exist in the first case. To avoid long queues, usually there are at most 2-3 pretests for problems like A and B. To make strong pretests that are also minimal, there are two ways of choosing tests (which can be combined too):

Choose an exhaustive test (on smaller constraints), and a random max-test.
Pick all possible tests that fail solutions that you encountered in testing.

This is one of the reasons why the first couple of problems are multi-test in nature.

Conclusion

If there’s anything I missed out, please let me know, since it’s been quite some time since I last coordinated a contest. Feel free to discuss on various aspects of problem-setting in this context, and I would request coordinators to also share their to-do list when they start coordinating a contest, right from the very beginning of the review stage.

Issues#

Discussion#

Solutions to the issues — the guidelines#

Provable things#

Statement and solution#

Constraints and test validation#

Checker#

Things we have to manage heuristically#

Condensed list of guidelines for testers and authors#

Common guidelines for setters and testers#

Guidelines for setters#

Guidelines for testers#

Handling other non-obvious issues.#

Conclusion#