Summary of related research
============================

HypoFuzz is built on, and inspired by, a wide range of research and practice
in software testing and verification.  This page summarises and comments on
selected parts of that literature, focussing on papers which are in some sense
prior art for fuzzing property-based tests for Python.

.. tip:: `See HypoFuzz in action here! <../example-dashboard/>`__

.. contents::
    :local:


Fuzzing background
--------------------

Fuzzers can generally be divided into two categories:

1. *Generational* fuzzers generate new inputs from scratch, using a random
   number generator to choose various options.  This covers everything from
   the simplest use of random numbers in a unit test, to highly sophisticated
   tools like CSmith :cite:`CSmith`.
2. *Mutational* fuzzers derive new inputs by mutating known inputs, and adding
   interesting examples to the 'pool' of known inputs.  Greybox mutational
   fuzzers reliably find security vulnerabilities in almost any
   previously-unfuzzed C code, but are only rarely applied to search for
   semantic or non-security bugs.

Hypothesis takes a hybrid approach, using (mostly) generational fuzzing to
find failing examples, and then mutating and replaying an internal
representation to find the minimal failing example.  HypoFuzz takes a smarter
approach, exploiting mutation-based example generation, running a variety of
instrumentation which is too expensive for sub-second unit tests, and adapting
the fuzzing logic to each test function as it learns.  More on this below.

*The Fuzzing Book* :cite:`fuzzingbook2019` is a fantastic introduction to and
overview of the field.  While many of the papers cited below may not be
relevant unless you're *implementing* a fuzzer like HypoFuzz, the book is a
great resource for anyone involved in software testing.


Fuzz / Fuzz Revisited / Fuzz 2020
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`Bart Miller's pioneering work on fuzzing `__ defined the field, and proved
that unguided random fuzzing works scarily well.  From 1990 :cite:`Fuzz1990`
to 1995 :cite:`FuzzRevisited`, and again in 2020 :cite:`Fuzz2020`, the
persistence of bugs which can be caught by such simple tools seems timeless.
Unfortunately, so does the very slow adoption of such tools - if you're
reading this sentence, you have unusual (and excellent!) taste in testing
technologies.


AFL (classic)
~~~~~~~~~~~~~~

`Pulling JPEGs out of thin air `__ made a splash: `AFL `__ was the first
fuzzing tool to reach mainstream awareness, and its success - measured in
important bugs rather than citations or benchmarks - revitalised the field.

The key insights were that lightweight instrumentation for coverage-guided
fuzzing would often outperform fancier but slower techniques, and that
usability counts - with almost no configuration and a robust design applicable
to any project, AFL saw much wider adoption, and therefore impact, than
previous tools.

Since 2017, `AFL++ has been maintained by the community `__ :cite:`AFL++`
with a variety of bugfixes, patches, and additional features - many of which
are covered below.


LibFuzzer
~~~~~~~~~~

`LibFuzzer `__ targets functions, rather than whole binaries, and typically
runs in-process.  :hydocs:`Hypothesis' .fuzz_one_input ` function is directly
inspired by the ``LLVMFuzzerTestOneInput`` entry point, though Hypothesis
tests have much more sophisticated support for `structured fuzzing `__.
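As a sketch of how that hook is meant to be driven, the harness below replays
byte buffers through a Hypothesis test in the style of a
``LLVMFuzzerTestOneInput`` target.  The ``.hypothesis.fuzz_one_input``
attribute is the real Hypothesis API mentioned above; the ``test_decode``
property and the fixed seed buffers are invented for illustration.

.. code-block:: python

    from hypothesis import given, strategies as st


    @given(st.binary())
    def test_decode(payload: bytes) -> None:
        # A stand-in property; a real test would exercise your own code.
        assert bytes(bytearray(payload)) == payload


    def fuzz_target(buffer: bytes) -> None:
        # Parses `buffer` as a Hypothesis test case and runs the test once,
        # raising if the property fails - just like a libFuzzer target.
        test_decode.hypothesis.fuzz_one_input(buffer)


    if __name__ == "__main__":
        # Replay some fixed buffers; an external fuzzer would supply
        # mutated buffers here instead.
        for seed in (b"", b"\x00\xff" * 8):
            fuzz_target(seed)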
Property-based testing
-----------------------

It's common to observe that property-based testing (PBT) is conceptually
related to fuzzing - see for example Dan Luu's `AFL + QuickCheck = ? `__ or
Nelson Elhage's `Property-Based Testing Is Fuzzing `__ and `Property Testing
Like AFL `__.  For an essay on the *differences*, see David MacIver's
`What is Property-Based Testing `__.

The core of Hypothesis is in fact a blackbox structure-aware fuzzer, and of
course HypoFuzz itself is a greybox fuzzer built on our shared IR layer.
Three things make HypoFuzz different from traditional fuzzers:

1. HypoFuzz is designed to work with many more targets than most fuzzers -
   we operate on *test suites*, not single binaries.
2. Because we're fuzzing property-based tests, HypoFuzz looks for semantic
   errors - not just crashes - and can check properties that are only
   expected to hold for a subset of valid inputs.
3. It's designed to fit into your development cycle, and be used by
   developers - so that the bugs get caught *before* the code ships.


Hypothesis
~~~~~~~~~~~

Hypothesis :cite:`MacIver2019` is implemented around a bytestring
representation for all test cases.  All "strategies" (data generators) can
transparently generate random instances via a PRNG, or replay past test cases
by substituting a recorded bytestring for the PRNG stream.

:cite:`MacIver2020` goes into more depth about the design of this IR layer,
and in particular how it enables efficient test-case reduction and
normalisation.  This is the key to reporting minimal and de-duplicated failing
examples, and makes using a fuzzer much more productive (and less
frustrating).

The IR layer has also proven invaluable as a clean and universal interface to
support other techniques such as targeted property-based testing
:cite:`TargetedPBT` - we get to automate (:cite:`AutomatingTargetedPBT`) the
setup for free, and support multi-dimensional optimisation into the bargain.
See :func:`hypothesis:hypothesis.target` for details.


'Fuzzer taming' with test-case reduction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Because Hypothesis presents a single `reduced `__ and normalised
:cite:`OneTestToRuleThemAll` failing input for each unique exception type and
location, HypoFuzz largely avoids the `fuzzer taming problem `__
:cite:`TamingCompilerFuzzers`.


'Strategies' are parser-combinators designed for structured fuzzing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Hypothesis users specify the allowed inputs to their test function by
composing "strategies", which are internally used to parse PRNG output or
replayed bytestrings into valid data.  Users may compose strategies with
arbitrary code, including code under test, and while in principle this leads
to an unrestricted grammar, in practice the structure is usually tractable
(`see here for some details `__).

Strategies are also designed such that, in the absence of user-defined
filters, most random bytestrings can be parsed into valid examples - which
makes it easy to support a hybrid generational/mutational fuzzer.

Some strategies also use `swarm testing `__ :cite:`SwarmTesting`, which
improves the diversity of "weird" examples generated without any user
interaction at all.  Increasing our usage of this and `other techniques `__
is an ongoing project for Hypothesis.
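The parser-combinator style looks like the sketch below.  The ``st.*``
functions are standard Hypothesis strategies; the ``Record`` type and the
swarm-style "enabled features" toggle are invented for illustration.

.. code-block:: python

    from dataclasses import dataclass

    import hypothesis.strategies as st


    @dataclass
    class Record:
        name: str
        tags: list[str]
        weight: float


    # Each strategy is a small parser from the underlying bytestream to values;
    # composing strategies composes the parsers, so mutating the bytestream
    # corresponds to structured mutation of the generated Record.
    records = st.builds(
        Record,
        name=st.text(min_size=1),
        tags=st.lists(st.sampled_from(["a", "b", "c"]), unique=True),
        weight=st.floats(allow_nan=False),
    )


    # A swarm-testing flavour: first draw which "features" are enabled for this
    # test case, then only generate tags from the enabled subset - improving
    # diversity across test cases without any user configuration.
    @st.composite
    def swarm_records(draw):
        enabled = draw(st.sets(st.sampled_from(["a", "b", "c"]), min_size=1))
        return draw(
            st.builds(
                Record,
                name=st.text(min_size=1),
                tags=st.lists(st.sampled_from(sorted(enabled)), unique=True),
                weight=st.floats(allow_nan=False),
            )
        )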
Other property-based fuzzers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(Java) junit-quickcheck + JQF + Zest + RLCheck
+++++++++++++++++++++++++++++++++++++++++++++++

Starting with the ``junit-quickcheck`` library, `JQF `__ :cite:`JQF` provides
an interface to run property-based tests with a variety of fuzzing backends,
including AFL, Zest :cite:`Zest` (adding validity metrics), and PerfFuzz.

`RLCheck `__ :cite:`RLCheck` is distinctive as a blackbox fuzzer, using
reinforcement learning to generate valid inputs according to some predicate.
While expressing constraints as predicates on a more general input description
is more natural for users, most PBT libraries require a constructive approach
to generation for acceptable performance - even when that is seriously
unintuitive.

(Rust) proptest + propfuzz + propverify
++++++++++++++++++++++++++++++++++++++++

The `proptest `__ library for Rust is directly inspired by Hypothesis.
Showing the power of a good intermediate representation, recent tools have
attempted to build on top of it to provide both `fuzzing `__ and
`formal verification `__ with (almost) the same user-facing API.

There are `plans for Hypothesis to support symbolic execution via `__
:pypi:`crosshair-tool`, and a promising proof-of-concept, but no firm timeline
unless someone volunteers to take on the project.

(C / C++) Trail of Bits' DeepState, Google's ``fuzztest``
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

`DeepState `__ :cite:`DeepState` provides a common interface to various
symbolic execution and fuzzing engines - write your tests once with a
Google Test-style API, and then run them with a variety of backends and at
various stages of your development cycle.

Google's `fuzztest `__ library is described as a tool that bridges the gap
between fuzzing and property-based testing, allowing you to write fuzz tests
side by side with regular unit tests.  ``fuzztest`` always runs with coverage
guidance, but is designed to be used as part of a testing (rather than a
standard fuzzing) workflow.

(Haskell) QuickFuzz
++++++++++++++++++++

QuickFuzz :cite:`QuickFuzz` uses the venerable QuickCheck :cite:`QuickCheck`
and file-format parsers from `Hackage `__ to implement an unguided
generational fuzzer.

(Coq) FuzzChick
++++++++++++++++

FuzzChick :cite:`FuzzChick` is a coverage-guided backend for QuickChick
:cite:`QuickChick`, a property-based testing library for the `Coq `__
theorem prover.


Mutation operators
-------------------

Structure-aware mutation with AFLSmart
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

AFLSmart :cite:`AFLSmart` proposes "smart mutation operators", specifically
adding, deleting, or replacing chunks of one seed input with corresponding
chunks of another input.  They find that this is a substantial improvement
over structure-naive coverage-guided fuzzing, and that (as you'd expect)
adding feedback offers a very large improvement over blackbox generational
fuzzing.

While they use `"Peach pits" `__ to define the input grammar - and as the
blackbox baseline - we can get the same structural information directly from
instrumentation in the Hypothesis internals, without any additional work for
users or implementors.  Doing so will also give Hypothesis `better ways to
explain why your test failed `__ essentially for free.

Note that *structure-aware mutation* is a different technique to what is often
called *structure-aware fuzzing* (e.g. `here `__) - the latter is simply a
parsing step to allow e.g. classic AFL to operate on structured data, and
Hypothesis gives us a well-tuned version of that for free.
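To make the chunk-level idea concrete, here is a minimal sketch of a "replace"
operator over two seeds, assuming each seed has already been split into
structural chunks (as a grammar or a Hypothesis-style IR would provide); the
chunking itself, and the rest of a real mutation engine, are elided.

.. code-block:: python

    import random


    def replace_chunk(seed: list[bytes], donor: list[bytes]) -> bytes:
        """Structure-aware mutation: swap one chunk of `seed` for a chunk of
        `donor`, then re-serialise.

        Both arguments are lists of byte-chunks aligned to structural
        boundaries; a structure-naive fuzzer would instead splice at
        arbitrary byte offsets and usually produce invalid inputs.
        """
        i = random.randrange(len(seed))
        j = random.randrange(len(donor))
        mutated = list(seed)
        mutated[i] = donor[j]
        return b"".join(mutated)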
Adaptive mutation operator selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`MOpt-AFL `__ :cite:`MOpt-AFL` finds that the effectiveness of mutation
strategies varies by target, and evaluates an adaptive particle-swarm
algorithm to customise the mutation logic accordingly.

:cite:`OneFuzzingStrategyToRuleThemAll` study "havoc" mode, in which multiple
randomly-selected mutation operators are applied in a single step.  They find
that this typically outperforms a one-operator-at-a-time approach, and that
dynamically tuning the operator weights with a (non-stationary)
multi-armed-bandit approach yields further large improvements.

TOFU :cite:`TOFU` varies the weighting of mutation operators with distance to
the goal, preferring large operations (add, delete, splice, etc.) while
distant and small ones (e.g. bitflips) when closer.


Inputs from Hell
~~~~~~~~~~~~~~~~~

:cite:`InputsFromHell` generates inputs matching a grammar, with a twist: by
observing the frequency with which various generation choices appear in a
sample, you can *invert* this distribution to instead generate dissimilar
inputs.  While partly subsumed by rare-branch-targeting tricks (under
scheduling inputs, below), this trick might also have some synergistic
effects.


Scheduling inputs
------------------

AFLFast, FairFuzz, and AlphaFuzz
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

AFLFast :cite:`AFLFast` and FairFuzz :cite:`FairFuzz` observe that some
branches are covered by a higher proportion of inputs than others - for
example, code which rejects invalid inputs is usually overrepresented.  When
AFLFast selects an input to mutate, it biases the choice towards inputs which
execute rare branches - and finds both an order-of-magnitude performance
improvement and more bugs than previous approaches.  Technically, the trick is
to model the probability of covering each branch from a random mutation of
each input as a Markov chain, and then use the inverse of the stationary
distribution as the choice weights.

AlphaFuzz :cite:`AlphaFuzz` observes that because mutation operators tend to
make local changes, modelling the lineage of each seed (again, as a Markov
chain) further improves on AFLFast by accounting for semantic diversity among
seeds that reach the same set of branches.  However, I doubt this would help
HypoFuzz, given our larger mutation steps and strong reduction and
normalisation of seeds.

FairFuzz shares the goal of increasing coverage of rare branches, but does so
by detecting regions of the input which may be required to reach them, and
disabling mutations of those regions.  Their evaluation finds that this
noticeably improves coverage on code with deeply nested conditionals, against
a baseline which includes an early version of AFLFast (the ``-explore``
schedule was added in 2.33, the evaluation uses 2.40, and the ``-fast``
schedule seems to be best).


Directed fuzzing
~~~~~~~~~~~~~~~~~

A `directed fuzzer `__, such as `AFLGo `__ :cite:`AFLgo`, prioritizes inputs
which are 'closer' to a target location.  This can be used to focus on
recently-changed code paths, areas flagged as bug-prone by static analysis,
functions seen in logged errors to reproduce a crash, and so on.  TOFU
:cite:`TOFU` also exploits input structure, and claims that this is
substantially responsible for its ~40% improvement over AFLGo.
:cite:`wang2020sok` survey the state of the art in directed greybox fuzzing
as of mid-2020.

HypoFuzz could get the control-flow graph from coverage.py, which tracks
possible branches in order to report un-covered ones, so the implementation is
straightforward.  The tradeoff between simplicity and
power-requiring-configuration is less obvious; we're inclined to initially
stick to zero-config direction towards recent patches and/or lines flagged by
e.g. :pypi:`flake8`, though the balance between directed and general
exploration might take some tuning.
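A minimal sketch of that zero-config flavour of direction - weighting seeds by
how much of their coverage falls on recently-changed lines - might look like
the following.  The ``changed_lines`` input (e.g. parsed from ``git diff``)
and the per-seed coverage maps are assumed to exist already; this is an
illustration of the idea, not HypoFuzz's implementation.

.. code-block:: python

    def directedness_score(
        seed_coverage: dict[str, set[int]],  # filename -> covered line numbers
        changed_lines: dict[str, set[int]],  # filename -> recently-changed lines
    ) -> float:
        """How much of this seed's coverage touches the recent diff.

        Seeds with higher scores would be selected for mutation more often,
        biasing exploration towards recently-changed code without any user
        configuration.
        """
        touched = sum(
            len(lines & changed_lines.get(fname, set()))
            for fname, lines in seed_coverage.items()
        )
        total = sum(len(lines) for lines in seed_coverage.values())
        return touched / total if total else 0.0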
Directed swarm testing :cite:`DirectedSwarmTesting` takes a slightly different
approach: it assumes that *some* randomly generated test cases will execute
the target code, and so the goal is to increase that proportion by biasing the
swarm configuration towards including 'trigger' features and omitting
'suppressors'.

SyML :cite:`SyMLPatternLearning` learns patterns among vulnerability-triggering
paths in known-buggy programs, and finds that the learned features are
predictive in unrelated programs.  Originally motivated by mitigating path
explosion in symbolic execution, the approach seems equally applicable to
directed fuzzing, and could be a substantial advantage for a centralised
platform with more programs (and bugs) to learn from.


Predictive fuzzing, scaling laws, & when to stop
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`Dr. Marcel Böhme `__ has done groundbreaking work characterising the
behaviour of fuzzers (as well as co-creating AFLFast, AFLSmart, and AFLGo!),
in order to understand the assurances that fuzzing can provide and to quantify
the residual risk :cite:`AssuranceInTestingRoadmap`.

`Pythia `__ :cite:`STADS` adds statistical predictions to AFL, including
bounds on the probability of finding a bug, estimated progress towards maximal
coverage, and a difficulty metric.  These metrics are obviously of interest to
users, and can also be used to schedule the targets with the highest expected
value - maximising the overall rate of progress.

Applying this scheduling insight to seeds rather than targets yields Entropic
:cite:`Entropic`, which prioritizes the seeds which maximise the rate of
discovery of new information about the behaviour of the fuzzed program.  This
shows `substantial improvement over baseline LibFuzzer `__, and is now heavily
used by `OSS-Fuzz `__.

Finally, :cite:`ExponentialCost` describes empirical scaling laws for fuzzers:
spending more CPU time finds a given set of bugs or coverage proportionally
faster, but finding *new* or *additional* bugs or coverage requires
exponentially more computation.  This means that spending a little effort on
very many targets is often worthwhile, but simply throwing more compute at a
given target is eventually of limited value.  On the other hand, improving the
fuzzer or diversifying its behaviour is correspondingly very valuable for
well-fuzzed targets!


Seed selection
---------------

Corpus distillation
~~~~~~~~~~~~~~~~~~~~

Corpus distillation is the technique of selecting an appropriately minimal
subset of a large initial corpus which covers the same set of branches in the
code under test (``afl-cmin``, if you've used that).  While traditionally
defined only for coverage, this is trivially extensible to other metrics: just
ensure that no discarded input would be kept if it were freshly discovered by
the fuzzer.
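A greedy sketch of that distillation step, under the simplifying assumption
that we only care about branch coverage, is shown below; the data layout is
invented for illustration, and real tools like ``afl-cmin`` use more careful
weighting.

.. code-block:: python

    def distill(corpus: dict[bytes, frozenset[str]]) -> list[bytes]:
        """Greedy corpus distillation: keep a small subset of inputs which
        jointly cover every branch that the full corpus covers.

        `corpus` maps each input to the set of branches it covers; shorter
        inputs win ties, echoing a preference for shortlex-minimal seeds.
        """
        uncovered = set().union(*corpus.values()) if corpus else set()
        keep: list[bytes] = []
        # Repeatedly take the input covering the most still-uncovered branches.
        while uncovered:
            best = max(
                corpus, key=lambda s: (len(corpus[s] & uncovered), -len(s))
            )
            if not corpus[best] & uncovered:
                break
            keep.append(best)
            uncovered -= corpus[best]
        return keep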
:cite:`Moonlight` evaluates a variety of approaches to designing input
corpora, given a typically much larger initial corpus (which might be
`scraped from the internet `__ or created with a generative fuzzer), and finds
that minimising both the number of inputs in the seed pool and their
cumulative size improves fuzzer performance - and that no single approach
dominates the others.

Reducing (:cite:`DeltaDebugging`, or ``afl-tmin``) and normalising
(:cite:`OneTestToRuleThemAll`) failing test cases is a well-known technique to
assist in debugging, and is supported - often called *shrinking* - by all good
property-based testing tools.

HypoFuzz uses Hypothesis' world-class test-case reduction to calculate the
minimal example for each feature of interest - covered branch, high score from
:func:`hypothesis:hypothesis.target`, etc. - and uses this as a basis for
further fuzzing as well as for reporting failing examples.  We are unaware of
previous work which uses this approach or evaluates it in comparison to
less-intensive distillation.  We expect that it works very well if-and-only-if
combined with generative and structure-aware fuzzing, which allows the
covering structure to be exploited without unduly standardising unrelated
parts of the input.


Nezha - efficient differential testing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`Nezha `__ :cite:`Nezha` provides efficient differential testing by taking the
product of the coverage for each input fed to multiple targets.  The original
AFL docs observed that a distilled corpus from one (e.g. jpeg) library would
often trigger bugs in another, since branches which handle edge cases select
for edge-case inputs that may be mishandled elsewhere; using joint rather than
independent coverage has similar advantages to ensemble fuzzing.

This is relatively easy to implement using :pypi:`coverage` dynamic contexts
and a context-manager or decorator API *within a given process*; supporting
differential coverage between Python versions or operating systems will
require some deeper changes to HypoFuzz's execution model.


Domain-specific targets with FuzzFactory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`FuzzFactory `__ :cite:`FuzzFactory` observes that coverage may not be the
only metric of interest, and extends the feedback mechanism in AFL to support
user-specified labels.  This essentially brings targeted property-based
testing (above) to fuzzing workflows, and provides prior art (outside
Hypothesis' implementation) for the multi-objective approach - finding that it
is often much more effective than optimising component metrics independently.


Virtual branches with IJON
~~~~~~~~~~~~~~~~~~~~~~~~~~~

`IJON `__ :cite:`IJON` adds custom feedback to AFL.  The ``IJON_SET`` macro
adds a 'virtual branch' based on the value passed, so that at least one input
exhibiting whatever custom behaviour will be retained in the seed pool
(HypoFuzz implements this with the :func:`hypothesis:hypothesis.event`
function).  The ``IJON_MAX`` macro is equivalent to
:func:`hypothesis:hypothesis.target`, similar to FuzzFactory above.

IJON is particularly notable for winning 29 out of 32 *Super Mario Bros*
levels, a feat more typical of dedicated reinforcement-learning systems, as
well as fuzzing a Trusted Platform Module, complex format parsers, mazes, and
a hash map.
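In Hypothesis terms, those two kinds of custom feedback look like the sketch
below.  :func:`hypothesis:hypothesis.event` and
:func:`hypothesis:hypothesis.target` are the real APIs referenced above; the
test body itself is a made-up example.

.. code-block:: python

    from hypothesis import event, given, target, strategies as st


    @given(st.lists(st.integers()))
    def test_custom_feedback(xs):
        # IJON_SET-style 'virtual branch': retain at least one seed for each
        # observed bucket of list lengths.
        event(f"length bucket {len(xs) // 10}")

        # IJON_MAX / FuzzFactory-style metric: guide the fuzzer towards
        # inputs with many distinct elements.
        target(float(len(set(xs))), label="distinct elements")

        # A stand-in property; a real test would check your own code.
        assert sum(xs) == sum(reversed(xs))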
Diversity
~~~~~~~~~~

A key point here is that fuzzing and testing tools should search for *diverse*
inputs, to avoid getting trapped in an "optimal" but non-bug-finding state.
For example, IJON optimized x-distance *at each distinct altitude* to avoid
dead ends.  Hypothesis tracks the :wikipedia:`pareto frontier ` of metrics
passed to :func:`hypothesis.target` (plus some internal metrics).

For observable dimensions where there is *not* a clear "best" direction, and
where there may be thousands of dimensions - such as the hit-count of each
branch - there are a variety of approaches.  AFL and related fuzzers
"bucketize" the hit-counts and then track uniqueness up to a 64k hash of this
vector, as a compromise between performance (driven by CPU cache sizes) and
collision rate (typically 10-15% for library-like targets, but up to 75% for
larger applications :cite:`CollAFL`).

HypoFuzz's approach of keeping the best (shortlex-minimal) seed covering each
branch is reminiscent of SugarSearch :cite:`SugarSearch`; that paper opens
with a lovely survey of `quality-diversity algorithms `__ - including
CVT-MAP-Elites :cite:`MAP-elites,CVT-MAP-elites`, which might be nice to try
for prioritization in high-dimensional spaces.

BeDivFuzz :cite:`BeDivFuzz` proposes measuring behavioural diversity using the
:wikipedia:`'Hill numbers' ` from ecology; HypoFuzz already selects seeds via
(a mixed distribution including) sampling seeds in inverse proportion to the
observed frequency of the rarest branch covered by each.


Coverage
---------

Before diving into the use of coverage information as feedback for test-case
generation in fuzzers, it's worth covering the use of code coverage in a
software development cycle.

*How to Misuse Code Coverage* :cite:`HowToMisuseCoverage` still resonates:
"I wouldn't have written four coverage tools if I didn't think they're
helpful.  But they're only helpful if they're used to *enhance* thought, not
*replace* it."  More than 20 years later, `code coverage best practices `__
from the Google Testing Blog gives similar advice.

*Coverage and its discontents* :cite:`CoverageDiscontents` explores the role
of coverage metrics in test-suite evaluation, and argues that there is an
underlying uncertainty as to what exactly measuring coverage should achieve,
how we would know whether it can, and what we as researchers and developers
can do about it.

`Verification, coverage and maximization: the big picture `__ aims to explain
how coverage is used to optimize the verification process, what it means to
auto-maximize coverage, and how people have tried to do it - from a background
in hardware design, which brings an instructively different perspective to
analogous problems (similar to Dan Luu's `AFL + QuickCheck = ? `__, above).


Reducing coverage overhead by rewriting the target
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Full-speed fuzzing :cite:`FullSpeedFuzzing` reduces the performance overhead
of coverage measurement by rewriting the target - because most executions do
not find new coverage, this allows you to instrument only a very small
proportion of executions.  While offering very impressive speedups, this
doesn't support differential metrics or non-coverage metrics, and the
rewriting trick would be rather difficult in Python.

Nonetheless, the `PLASMA-UMass `__ team have released `Slipcover `__, a
super-low-overhead coverage tool for Python based on just this principle - and
they explicitly list fuzzing as one of its applications.  Augmenting PyPy's
tracing JIT to report coverage information would probably also be fruitful,
and very fast given the JIT-friendly repeated-execution pattern of fuzzing.
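For contrast with those rewriting-based approaches, the sketch below shows
what a conventional trace-function collector looks like: a minimal arc
collector built on ``sys.settrace``, whose per-line callback is exactly the
overhead that rewriting avoids.  This is an illustration only, not how
:pypi:`coverage` or HypoFuzz is implemented.

.. code-block:: python

    import sys
    from collections.abc import Callable


    def trace_arcs(fn: Callable[[], object]) -> set[tuple[str, int, int]]:
        """Run `fn` and return the set of (filename, previous line, line) arcs.

        Real coverage tools do far more (reporting, untaken-branch analysis,
        multi-process aggregation); a fuzzer only needs this set, ideally
        collected with much lower overhead than a pure-Python trace function.
        """
        arcs: set[tuple[str, int, int]] = set()
        prev: dict[str, int] = {}

        def tracer(frame, event, arg):
            if event == "line":
                fname = frame.f_code.co_filename
                arcs.add((fname, prev.get(fname, -1), frame.f_lineno))
                prev[fname] = frame.f_lineno
            return tracer

        sys.settrace(tracer)
        try:
            fn()
        finally:
            sys.settrace(None)
        return arcs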
Faster coverage measurement for Python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:pypi:`coverage` typically slows instrumented programs by a factor of several
times - usually 2-5x, but as much as 70x is known on some workloads.  There
have been several proposals to improve this - e.g. `Python Coverage could be
fast `__ - and relatively small grants could make a very large impact.
Abandoning most of the features in :pypi:`coverage` (reporting, analysis of
untaken branches, aggregation across interpreters, etc.) to focus solely on
the branch-reporting logic used by a fuzzer `can also offer substantial
speedups `__.


Sensitive coverage metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~

*Be Sensitive and Collaborative: Analyzing Impact of Coverage Metrics in
Greybox Fuzzing* :cite:`SensitiveAndCollaborative` compares a range of
coverage metrics - from branch coverage, to n-gram coverage (chains of
branches, where standard branch coverage is the 2-gram case), full path
coverage, and several others.  Due to resource limits - time, memory,
compute - no metric dominates all others, suggesting that adapting the metric
per-target might be helpful.


Compressing coverage information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ankou :cite:`Ankou` measures coverage of the *number of times* each branch was
executed, i.e. order-insensitive path coverage.  To manage the very large
number of covering inputs, they use a dynamic distance-based metric to retain
only dissimilar inputs rather than all covering inputs.  By comparison, AFL
bucketizes branch hit-counts.


Miscellaneous
--------------

Ensemble fuzzing with seed sharing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

EnFuzz :cite:`EnFuzz` demonstrates that combining diverse fuzzers both
improves their joint performance (given equal resources), and makes that
performance much more robust.  The argument that this works by allowing
specialised fuzzers to build on each other's work, including iteratively, is
compelling.

Cupid :cite:`Cupid` demonstrates significant practical advances in ensemble
fuzzing, defining a *complementarity* metric (the expected value of the union
of the sets of branches covered by each fuzzer).  This allows fuzzers to be
selected for an ensemble efficiently, based only on 'solo' runs of each.
Because Cupid leaves seed scheduling to future work and is based on
target-independent characterisation, this technique is used to design HypoFuzz
'tactics' but not for runtime adaptation.

It's less clear how to leverage this for HypoFuzz, since there aren't many
other fuzzers targeting Hypothesis tests.  You could use :pypi:`python-afl`,
:pypi:`pythonfuzz`, or `python-hfuzz `__ on Hypothesis'
:hydocs:`.fuzz_one_input ` hook if you were careful enough about the database
location; we intend to evaluate this approach but don't expect an advantage
from adding structure-naive fuzzers.

We think the general lesson is more like that of swarm testing: diversity is
the key to effective fuzzing.  Knowing that in advance, though, we can build
our single fuzzer to execute a mixture of the relevant behaviours with the
desired distribution, and even make that distribution adaptive with respect to
each target.
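Returning to Cupid's metric: a toy version of the complementarity calculation
might estimate expected union coverage for a candidate ensemble from
independent solo runs of each fuzzer, as sketched below.  The data layout and
function names are invented for illustration and simplify the paper's actual
estimator.

.. code-block:: python

    import itertools
    from statistics import mean


    def expected_union_coverage(
        solo_runs: dict[str, list[set[str]]],  # fuzzer name -> branch sets per run
        ensemble: tuple[str, ...],
    ) -> float:
        """Average size of the union of covered branches, taking one sampled
        run per ensemble member and averaging over all combinations."""
        per_member_runs = [solo_runs[name] for name in ensemble]
        unions = [
            len(set().union(*combo))
            for combo in itertools.product(*per_member_runs)
        ]
        return mean(unions)


    def best_ensemble(
        solo_runs: dict[str, list[set[str]]], k: int
    ) -> tuple[str, ...]:
        """Pick the k fuzzers whose combination maximises expected union coverage."""
        return max(
            itertools.combinations(solo_runs, k),
            key=lambda combo: expected_union_coverage(solo_runs, combo),
        )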
Hybrid concrete/symbolic fuzzing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This literature review has largely ignored symbolic execution, because support
for Python is at a very early stage and does not scale to real-world programs.
For native code, *concolic execution* - tools which combine concrete and
symbolic execution of tests - dates back to DART :cite:`DART` and CUTE
:cite:`CUTE` in 2005, while Microsoft's SAGE :cite:`SAGE` found
`roughly one-third of all the bugs `__ discovered by file fuzzing during the
development of Windows 7 - running *after* static analysis and other fuzzers.

Inputs synthesised by symbolic or concolic approaches could provide the
initial seed pool for a classic mutational fuzzer, and provide a way to 'get
unstuck' on conditions which are hard to satisfy by chance.  There are
`plans for Hypothesis to support symbolic execution via `__
:pypi:`crosshair-tool`, and a promising proof-of-concept, but no firm timeline
unless someone volunteers to take on the project.


Scaling fuzzers up to many cores
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The `scaling behaviour of fuzzers is often neglected `__, which can make
academic evaluations run on single cores misleading, as users in industry run
campaigns on tens, hundreds, or even thousands of cores.  For example, classic
AFL quickly (at 5-20 cores) bottlenecks on ``fork()``, and adding more than 40
cores may *reduce total throughput*.  I/O bottlenecks from filesystem access
are also common in ensemble fuzzing campaigns.

:cite:`PAFL` finds that this problem is *worse* among more advanced fuzzers -
if you share seeds but not e.g. the branch hit-counts used by AFLFast, each
process must duplicate the discovery process.  PAFL adds a mechanism for
global sharing of guidance information as well as seeds, and additionally
focusses each process on fuzzing a subset of the branches in the program -
which diversifies the search process and effectively ensembles variants of a
single base fuzzer.

We plan to mitigate this in HypoFuzz by sharing ephemeral state between
instances and runs via the database.


Visualising fuzzer performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

HypoFuzz does not offer many configuration options, but users are effectively
co-developers of the fuzzer, because they provide the system under test, the
test functions, and the strategies which define possible inputs.  Providing
clear and detailed - but not overwhelming - information about what the fuzzer
is doing can therefore support a wider feedback loop of improvements to the
tests, and ultimately better bug-detection.

Brandon Falk's `some fuzzing thoughts `__ points out that a log-scale x-axis
is almost always the right way to view fuzzer progress graphs, especially
considering the well-known exponential scaling curve :cite:`ExponentialCost`.
Cornelius Aschermann's `on measuring and visualising fuzzer performance `__
suggests a range of other helpful visualisations, including the proportion of
inputs from various generation or mutation strategies which cover each known
branch.

*Evaluating Fuzz Testing* :cite:`EvaluatingFuzzTesting` investigates serious
problems in previous evaluations, and provides the now-canonical guidelines
for evaluating fuzzers.  It is essential reading if you wish to publish an
evaluation, or simply to decide whether some tweak was actually helpful
without getting the sign of the relationship wrong due to random noise.


References
-----------

*While not all the referenced papers are open access, they do all have freely
accessible PDFs.  Enjoy!*

.. bibliography:: literature.bib