Summary of related research¶
HypoFuzz is built on, and inspired by, a wide range of research and practice in software testing and verification. This page summarises and comments on selected parts of that literature, focussing on papers which are in some sense prior art for fuzzing property-based tests for Python.
Fuzzers can generally be divided into two categories:
Generational fuzzers generate new inputs from scratch, using a random number generator to choose various options. This covers everything from the simplest use of random numbers in a unit test, to highly sophisticated tools like CSmith [YCER11].
Mutational fuzzers derive new inputs by mutating known inputs, and adding interesting examples to the ‘pool’ of known inputs. Greybox mutational fuzzers reliabily find security vulnerabilities in almost any previously-unfuzzed C code, but are only rarely applied to search for semantic or non-security bugs.
Hypothesis takes a hybrid approach, using (mostly) generational fuzzing to find failing examples, and then mutating and replaying an internal representation to find the minimal failing example. HypoFuzz takes a smarter approach by exploiting mutation-based example generation, running a variety of instrumentation which is too expensive for sub-second unit tests, and adapting the fuzzing logic to each test function as it learns. More on this below.
The Fuzzing Book [ZGBohme+19] is a fantastic introduction to and overview of the field. While many of the papers cited below may not be relevant unless you’re implementing a fuzzer like HypoFuzz, the book is a great resource for anyone involved in software testing.
Bart Miller’s pioneering work on fuzzing defined the field, and proved that unguided random fuzzing works scarily well. From 1990 [MFS90] to 1995 [MKL+98], and again in 2020 [MZH20], the persistence of bugs which can be caught by such simple tools seems timeless. Unfortunately, so does the very slow adoption of such tools - if you’re reading this sentence, you have unusual (and excellent!) taste in testing technologies.
Pulling JPEGs out of thin air made a splash: AFL was the first fuzzer tool to reach mainstream awareness, and its success - measured in important bugs rather than citations or benchmarks - revitalised the field.
The key insights were that lightweight instrumentation for coverage guided fuzzing would often outperform fancier but slower techniques, and that usability counts - with almost no configuration and a robust design applicable to any project, AFL saw much wider adoption and therefore impact than previous tools.
LibFuzzer targets functions, rather than
whole binaries, and typically runs in-process.
function is directly inspired by the
LLVMFuzzOneInput entry point, though
Hypothesis tests have much more sophisticated support for structured fuzzing.
It’s common to observe that property-based testing (PBT) is conceptually related to fuzzing - see for example Dan Luu’s AFL + QuickCheck = ? or Nelson Elhage’s Property-Based Testing Is Fuzzing and Property Testing Like AFL. For an essay on the differences, see David MacIver’s What is Property-Based Testing.
The core of Hypothesis in in fact a blackbox structure-aware fuzzer, and of course HypoFuzz itself is a greybox fuzzer built on our shared IR layer. Three things make HypoFuzz different from tradional fuzzers.
HypoFuzz is designed to work with many more targets than most fuzzers - we operate on test suites, not single binaries.
Because we’re fuzzing property-based tests, HypoFuzz looks for semantics errors - not just crashes - and can check properties that are only expected to hold for a subset of valid inputs.
It’s designed to fit into your development cycle, and be used by developers - so that the bugs get caught before the code ships.
Hypothesis [MHDC19] is implemented around a bytestring representation for all test cases. All “strategies” (data generators) can transparently generate random instances via a PRNG, or replay past test-cases by substituting a recorded bytestring for the PRNG stream.
[MD20] goes into more depth about the design of this IR layer, and in particular how it enables efficient test-case reduction and normalisation. This is the key to reporting minimal and de-duplicated failing examples, and makes using a fuzzer much more productive (and less frustrating).
The IR layer has also proven invaluable as a clean and universal interface
to support other techniques such as targeted property-based testing
[LoscherS17] - we get to automate ([LoscherSagonas18])
the setup for free, and support multi-dimensional optimisation into the
hypothesis.target() for details.
Hypothesis users specify the allowed inputs to their test function by composing “strategies”, which are internally used to parse PRNG or replayed bytestrings into valid data. Users may compose strategies with arbitrary code, including code under test, but while in principle this leads to an unrestricted grammar the structure is usually tractable (see here for some details).
Strategies are also designed such that, in the absence of user-defined filters, most random bytestrings can be parsed into valid examples - which makes it easy to support a hybrid generational/mutational fuzzer.
Some also use swarm testing [GZE+12], which improves the diversity of “weird” examples generated without any user interaction at all. Increasing our usage of this and other techniques is an ongoing project for Hypothesis.
Starting with the
junit-quickcheck library, JQF
[PLS19] provides an interface to run property-based tests with a variety of fuzzing
backends including AFL, Zest [PLS+19a] (adding validity metrics), and PerfFuzz.
RLCheck [RLPS20] is distinctive as a blackbox fuzzer, using reinforcement learning to generate valid inputs according to some predicate. While expressing constraints as predicates on a more general input description is more natural for users, most PBT libraries require a constructive approach to generation for acceptable performance - even when seriously unintuitive.
The proptest library for Rust is directly inspired by Hypothesis. Showing the power of a good intermediate representation, recent tools have attempted to build on top of this to provide both fuzzing and formal verification with (almost) the same user-facing API.
There are plans for Hypothesis to support symbolic execution via crosshair-tool, and a promising proof-of-concept, but no firm timeline unless someone volunteers to take on the project.
DeepState [GG18] provides a common interface to various symbolic execution and fuzzing engines - write your tests once with a Google Test-style API, and then run them with a variety of backends and at various stages of your development cycle.
Google’s fuzztest library is described as
a tool that bridges the gap between fuzzing and property-based testing, allowing you
to write fuzz test side by side with regular unit tests.
fuzztest always runs
with coverage guidance, but is designed to be used as part of a testing (rather than
standard fuzzing) workflow.
AFLSmart [PhamBohmeSantosa+19] proposes using “smart mutation operators”, specifically adding, deleting, or replacing chunks of one seed input with corresponding chunks of another input. They find that this is a substantial improvement over structure-naive converage-guided fuzzing, and that (as you’d expect) adding feedback offers a very large improvement over blackbox generational fuzzing.
While they use “Peach pits” to define the input grammar - and as the blackbox baseline - we can get the same structural information directly from instrumentation in the Hypothesis internals without any additional work for users or implementors. Doing so will also give Hypothesis better ways to explain why your test failed essentially for free.
Note that structure-aware mutation is a different technique to what is often called structure-aware fuzzing (e.g. here) - the latter is simply a parsing step to allow e.g. classic AFL to operate on structured data, and Hypothesis gives us a well-tuned version of that for free.
[WJX+22] study “Havoc” mode, in which multiple randomly-selected mutation operators are applied in a single step. They find that this typically outperforms a one-operator-at-a-time approach, and that dynamically tuning the operator weights with a (non-stationary) multi-arm-bandit approach yields further large improvements.
TOFU [WLR20] varies the weighting of mutation operators with distance to the goal; preferring large (add, delete, splice, etc.) operations while distant and small (e.g. bitflip) when closer.
[SPH+20] generates inputs matching a grammar, with a twist: by observing the frequency with which various generation choices appear in a sample, you can invert this distribution to instead generate dissimilar inputs. While partly subsumed by rare-branch-targeting tricks (under scheduling inputs, below), this trick might also have some synergistic effects.
AFLFast [BohmePhamRoychoudhury19] and FairFuzz [LS18] observe that some branches are covered by a higher proportion of inputs than others - for example, code which rejects invalid inputs is usually overrepresented.
When AFL-Fast selects an input to mutate, it biases the choice towards inputs which execute rare branches - and finds both an order-of-magnitude performance improvement and more bugs than previous approaches. Technically, the trick is to represent the probability of covering each branch from a random mutation of each input as a Markov chain, and then using the inverse of the stationary distribution as our choice weights.
AlphaFuzz [ZSZC21] observes that because mutation operators tend to make local changes, modelling the lineage of each seed (again, as a Markov chain) further improves on AFL-Fast by accounting for semantic diversity among seeds that reach the same set of branches. However, I doubt this would help HypoFuzz, given our larger mutation steps and strong reduction and normalization of seeds.
FairFuzz shares the goal of increasing coverage of rare branches, but does so by
detecting regions of the input which may be required to do so and disabling
mutations of those regions. Their evaluation finds that this noticeably improves
coverage on code with deeply nested conditionals, against a baseline which includes
an early version of AFL-Fast (
-explore schedule added in 2.33, evaulation uses
-fast schedule seems to be best).
A directed fuzzer, such as AFL-go [BohmePNR17], prioritizes inputs which are ‘closer’ to a target location. This can be used to focus on recently-changed code paths, areas flagged as bug-prone by static analysis, functions seen in logged errors to reproduce a crash, etc. TOFU [WLR20] also exploits input structure, and claims that this is substantially responsible for it’s -40% improvement over AFL-go. [WZL+20] survey the state-of-the-art in directed greybox fuzzing as of mid-2020.
HypoFuzz could get the control-flow graph from coverage.py, which tracks possible branches in order to report un-covered branches, so the implementation is straightforward. The tradeoff between simplicity and power-requiring-configuration is less obvious; we’re inclined to initially stick to zero-config direction towards recent patches and/or lines flagged by e.g. flake8; though the balance between directed and general exploration might take some tuning.
Directed swarm testing [AGGC16] takes a slightly different approach: it is assumed that some randomly generated test cases will execute the target code, and so the goal is to increase that proportion by biasing the swarm configuration towards including ‘trigger’ features and omitting ‘suppressors’.
SyML [RZD+21] learn patterns among vulnerability-triggering paths in known-buggy programs, and find that the learned features are predictive in unrelated programs. Originally motivated by mitigating path explosion in symbolic execution, it seems equally applicable to directed fuzzing and could be a substantial advantage for a centralized platform where there are more programs (and bugs) to learn from.
Dr. Marcel Böhme has done groundbreaking work characterising the behaviour of fuzzers (as well as co-creating AFLfast, AFLsmart, and AFLgo!), in order to understand the assurances that fuzzing can provide and quantify the residual risk [Bohme19].
Pythia [Bohme18] adds statistical predictions to AFL, including bounds on the probability of finding a bug, estimated progress towards maximal coverage, and a difficulty metric. These metrics are obviously of interest to users, and can also be used to schedule those targets with the highest expected value - maximising the overall rate of progress.
Applying this scheduling insight to seeds rather than targets yields Entropic [BohmeF20a], which prioritizes those seeds which maximise the rate of discovery of new information about the behaviour of the fuzzed program. This shows substantial improvement over baseline LibFuzzer, and is now heavily used by OSS-Fuzz.
Finally, [BohmeF20b] describes empirical scaling laws for fuzzers - spending more CPU time finds a given set of bugs or coverage proportionally faster, but finding new or additional bugs or coverage requires exponentially more computation. This means that spending a little effort on very many targets is often worthwhile, but simply throwing more compute at a given target is eventually of limited value. On the other hand, improving the fuzzer or diversifing its behaviour is correspondingly very valuable for well-fuzzed targets!
Corpus distillation refers to the technique of selecting an appropriately minimal
subset of a large initial corpus which covers the same set of branches in the code
under test (
afl-cmin, if you’ve used that). While traditionally defined only
for coverage, this is trivially extensible to other metrics - just ensure that there
are no discarded inputs which would be kept if freshly discovered by the fuzzer.
[HGH+19] evaluates a variety of approaches to designing input corpora, given a typically much larger initial corpus (which might be scraped from the internet or created with a generative fuzzer), and finds that minimising both the number of inputs in the seed pool and their cumulative size improves fuzzer performance - and that no single approach dominates the others.
Reducing ([ZH02] or
afl-tmin) and normalising
([GHK17]) failing test-cases is a well-known as technique
to assist in debugging, and supported - often called shrinking - by all good
property-based testing tools. HypoFuzz uses Hypothesis’ world-class test case
reduction to calculate the minimal example for each feature of interest - covered
branch, high score from
hypothesis.target(), etc. - and uses
this as a basis for further fuzzing as well as reporting failing examples.
We are unaware of previous work which uses this approach or evaluates it in comparison to less-intensive distillation. We expect that it works very well if-and-only-if combined with generative and structure-aware fuzzing, to allow for exploitation of the covering structure without unduly standardising unrelated parts of the input.
While the original AFL docs observed that a distilled corpus from one e.g. jpeg library would often trigger bugs in another, as branches to handle edge cases select for edge-case inputs which may be mishandled by the other, using joint instead of independent coverage has similar advantages to that of ensemble fuzzing.
This is relatively easy to implement using coverage dynamic contexts and a context manager or decorator API within a given process; while we’d also like to support differential coverage between Python versions or operating systems that will require some deeper changes to HypoFuzz’s execution model.
This essentially brings targeted property-based testing (above) to fuzzing workflows, and provides prior art (outside Hypothesis’ implementation) of the multi-objective approach - finding that this is often much more effective than optimising component metrics independently.
IJON [AschermannSchumiloAbbasiHolz20] adds custom feedback to
IJON_SET macro adds a ‘virtual branch’ based on the value passed, so
that at least one input exhibiting whatever custom behaviour will be retained in
the seed pool (HypoFuzz implements this with the
IJON_MAX macro is equivalent to
similar to FuzzFactory above.
IJON is particularly notable for winning 29 out of 32 Super Mario Bros levels, a feat more typical of dedicated reinforcement learning systems, as well as fuzzing a Trusted Platform Module, complex format parsers, mazes, and a hash map.
A key point here is that fuzzing and testing tools should search for diverse inputs, to avoid getting trapped in a “optimal” but non-bug-finding state. For example, IJON optimized x-distance at each distinct altitude to avoid dead-ends.
Hypothesis tracks the pareto frontier of metrics passed
hypothesis.target() (plus some internal metrics). For observable dimensions
where there is not a clear “best” direction and may be thousands of dimensions,
such as the hit-count of each branch, there are a variety of approaches.
AFL and related fuzzers “bucketize” the hitcount and then track uniqueness up to a 64k hash of this vector, as a compromise between performance (driven by CPU cache sizes) and collision rate (typically 10-15% for library-like targets, but up to 75% for larger applications [GZQ+18]).
HypoFuzz’s approach of keeping the best (shortlex-minimal) seed covering each branch is reminiscent of SugarSearch [HZKM22]; that paper opens with a lovely survey of quality-diversity algorithms algorithms - including CVT-MAP-elites [MC15, VCM18], which might be nice to try for prioritization in high-dimensional spaces.
BeDivFuzz [NG22] proposes measuring behavioural diversity using the ‘Hill numbers’ from ecology; HypoFuzz already selects seeds via (a mixed distribution including) sampling seeds in inverse proportion to the observed frequency of the rarest branch covered by each.
Before diving in to the use of coverage information as feedback for test-case generation in fuzzers, it’s worth covering the use of code coverage in a software development cycle.
How to Misuse Code Coverage [Mar97] still resonates: “I wouldn’t have written four coverage tools if I didn’t think they’re helpful. But they’re only helpful if they’re used to enhance thought, not replace it.”. More than 20 years later, code coverage best practices from the Google Testing Blog gives similar advice.
Coverage and its discontents [GAG14] explores the role of coverage metrics in test-suite evaluation, and argues that there is an underlying uncertainty as to what exactly measuring coverage should achieve, how we would know if it can, and what we as researchers and developers can do about it.
Verification, coverage and maximization: the big picture aims to explain how coverage is used to optimize the verification process, what it means to auto-maximize coverage, and how people have tried to do it - from a background in hardware design, which brings an instructively different perspective to analogous problems. (similar to Dan Luu’s AFL + QuickCheck = ?, above)
Full-speed fuzzing [NH18] reduces the performance overhead of coverage measurement by rewriting the target - because most executions do not find new coverage, this allows you to instrument a very small proportion of executions.
While offering very impressive speedups, this doesn’t support differential metrics or non-coverage metrics, and the rewriting trick would be rather difficult in Python. Nonetheless, the PLASMA-UMass team have released Slipcover, a super-low-overhead coverage tool for Python based on just this principle - and explicitly list fuzzing as one of the applications.
Augumenting PyPy’s tracing JIT to report coverage information would probably also be fruitful, and very fast given the JIT-friendly repeated execution pattern of fuzzing.
coverage typically slows instrumented programs by a factor of several times, typically ranging from 2-5x but with as much as 70x known on some workloads. There have been several proposals to improve this - e.g. Python Coverage could be fast - and relatively small grants could make a very large impact.
Abandoning most of the features in coverage (reporting, analysis of untaken branches, aggregation across interpreters, etc.) to focus solely on the branch-reporting logic used by a fuzzer can also offer substantial speedups.
Be Sensitive and Collaborative: Analyzing Impact of Coverage Metrics in Greybox Fuzzing [WDS+19] compares a range of coverage metrics - from branch coverage, to n-gram-coverage (chains of branches, when standard branch coverage is 2-gram), full path coverage, and several others. Due to resource limits - time, memory, compute - no metric dominates all others, suggesting that adapting the metric per-target might be helpful.
Ankou [ManesKC20] measures coverage of the number of times each branch was executed, i.e. order-insensitive path coverage. To manage the very large number of covering inputs, they use a dynamic distance-based metric to retain only dissimilar inputs rather than all covering inputs. By comparison AFL bucketizes branch hit-counts.
EnFuzz [CJM+19] demonstrates that combining diverse fuzzers both improves their joint performance (given equal resources), and makes the performance much more robust. The argument that this works by allowing specialised fuzzers to build on each other’s work, including iteratively, is compelling.
Cupid [GulerGorzG+20] demonstrates significant practical advances in ensemble fuzzing, defining a complementarity metric (union of the expected value of the set of covered branches for each fuzzer). This allows for efficient selection of fuzzers to be ensembled based only on ‘solo’ runs of each. Because Cupid leaves seed scheduling to future work and is based on target-independent characterisation, this technique is used to design HypoFuzz ‘tactics’ but not for runtime adaptation.
It’s less clear how to leverage this for HypoFuzz, since there aren’t many other fuzzers targeting Hypothesis tests. You could use python-afl, pythonfuzz, or python-hfuzz on Hypothesis’ .fuzz_one_input hook if you were careful enough about the database location; we intend to evaluate this approach but don’t expect an advantage from adding structure-naive fuzzers.
We think the general lesson is more like that of swarm testing: diversity is the key to effective fuzzing. Knowing that in advance though, we can build our single fuzzer to execute a mixture of the relevant behaviours with the desired distribution, and even make that distribution adaptive with respect to each target.
This literature review has largely ignored symbolic execution, because support for Python is at a very early stage and does not scale to real-world programs.
For native code, concolic execution - tools which combine concrete and symbolic execution of tests - date back to DART [GKS05] and CUTE [SMA05] in 2005; while Microsoft’s SAGE [GLM+08] found roughly one-third of all the bugs discovered by file fuzzing during the development of Windows 7 - running after static analysis and other fuzzers.
Inputs synthesised by symbolic or concolic approaches could provide the initial seed pool for a classic mutational fuzzer, and provide a way to ‘get unstuck’ on conditions which are hard to satisfy by chance. There are plans for Hypothesis to support symbolic execution via crosshair-tool, and a promising proof-of-concept, but no firm timeline unless someone volunteers to take on the project.
The scaling behaviour of fuzzers is often neglected,
which can make academic evaluations running on single cores misleading as users
in industry run campaigns on tens, hundreds, or even thousands of cores.
For example, classic AFL quickly (5-20 cores) bottlenecks on
and adding more than 40 cores may reduce total throughput.
IO bottlenecks are also common in filesystem accesses for ensemble fuzzing campaigns.
[LJC+18] finds that this problem is worse among more advanced fuzzers - if you share seeds but not e.g. the branch hit-counts for AFL-Fast, each process must duplicate the discovery process. P-AFL adds a mechanism for global sharing of guidance information as well as seeds, and additionally focusses each process on fuzzing a subset of the branches in the program - which diversifies the search process and effectively ensembles variants of a single base fuzzer.
We plan to mitigate this in HypoFuzz, by sharing ephemeral state between instances and runs via the database.
HypoFuzz does not offer many configuration options, but users are effectively co-developers of the fuzzer because they provide the system under test, the test function, and the strategies which define possible inputs. Providing clear and detailed - but not overwhelming - information about what the fuzzer is doing can therefore support a wider feedback loop of improvement to the tests and ultimately better bug-detection.
Brandon Falk’s some fuzzing thoughts points out that a log-x-axis is almost always the right way to view fuzzer progress graphs, especially considering the well-known exponential scaling curve [BohmeF20b].
Cornelius Aschermann’s on measuring and visualising fuzzer performance suggests a range of other helpful visualisations, including the proportion of inputs from various generation or mutation strategies which cover each known branch.
Evaluating Fuzz Testing [KRC+18] investigates serious problems in previous evaluations, and provides the now-canonical guidelines for evaluating fuzzers. Essential reading if you wish to publish an evaluation, or simply decide whether some tweak was actually helpful without getting the sign of the relationship wrong due to random noise.
While not all the referenced papers are open access, they do all have freely accessible PDFs. Enjoy!
Mohammad Amin Alipour, Alex Groce, Rahul Gopinath, and Arpit Christi. Generating focused random tests using directed swarm testing. In Proceedings of the 25th International Symposium on Software Testing and Analysis, ISSTA 2016, 70–81. New York, NY, USA, 2016. Association for Computing Machinery. URL: https://rahul.gopinath.org/resources/issta2016/alipour2016focused.pdf, doi:10.1145/2931037.2931056.
Marcel Böhme. Assurance in software testing: a roadmap. In Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results, ICSE-NIER '19, 5–8. IEEE Press, 2019. URL: https://arxiv.org/abs/1807.10255, doi:10.1109/ICSE-NIER.2019.00010.
Marcel Böhme and Brandon Falk. Boosting fuzzer efficiency: an information theoretic perspective. In Proceedings of the ACM European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) 2020. Association for Computing Machinery, 2020. URL: https://mboehme.github.io/paper/FSE20.Entropy.pdf.
Marcel Böhme and Brandon Falk. Fuzzing: on the exponential cost of vulnerability discovery. In Proceedings of the ACM European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE) 2020. Association for Computing Machinery, 2020. URL: https://mboehme.github.io/paper/FSE20.EmpiricalLaw.pdf.
Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, and Abhik Roychoudhury. Directed greybox fuzzing. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS '17, 2329–2344. New York, NY, USA, 2017. Association for Computing Machinery. URL: https://mboehme.github.io/paper/CCS17.pdf, doi:10.1145/3133956.3134020.
Yang Chen, Alex Groce, Chaoqiang Zhang, Weng-Keen Wong, Xiaoli Fern, Eric Eide, and John Regehr. Taming compiler fuzzers. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, 197–208. New York, NY, USA, 2013. Association for Computing Machinery. URL: http://www.cs.utah.edu/~regehr/papers/pldi13.pdf, doi:10.1145/2491956.2462173.
Yuanliang Chen, Yu Jiang, Fuchen Ma, Jie Liang, Mingzhe Wang, Chijin Zhou, Xun Jiao, and Zhuo Su. Enfuzz: ensemble fuzzing with seed synchronization among diverse fuzzers. In USENIX Security Symposium. 2019. URL: https://www.usenix.org/system/files/sec19-chen-yuanliang.pdf.
Koen Claessen and John Hughes. Quickcheck: a lightweight tool for random testing of haskell programs. Proceedings of the Fifth ACM SIGPLAN International Conference on Functional Programming, 2000. URL: https://www.cs.tufts.edu/~nr/cs257/archive/john-hughes/quick.pdf, doi:10.1145/357766.351266.
Andrea Fioraldi, Dominik Maier, Heiko Eißfeldt, and Marc Heuse. Afl++ : combining incremental steps of fuzzing research. In 14th USENIX Workshop on Offensive Technologies (WOOT 20). USENIX Association, August 2020. URL: https://www.usenix.org/system/files/woot20-paper-fioraldi.pdf.
Shuitao Gan, Chao Zhang, Xiaojun Qin, Xuwen Tu, Kang Li, Zhongyu Pei, and Zuoning Chen. Collafl: path sensitive fuzzing. In 2018 IEEE Symposium on Security and Privacy (SP), volume, 679–696. 2018. URL: https://chao.100871.net/papers/oakland18.pdf, doi:10.1109/SP.2018.00040.
Patrice Godefroid, Nils Klarlund, and Koushik Sen. Dart: directed automated random testing. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '05, 213–223. New York, NY, USA, 2005. Association for Computing Machinery. URL: https://patricegodefroid.github.io/public_psfiles/pldi2005.pdf, doi:10.1145/1065010.1065036.
Patrice Godefroid, Michael Y Levin, David A Molnar, and others. Automated whitebox fuzz testing. In NDSS, volume 8, 151–166. 2008. URL: https://patricegodefroid.github.io/public_psfiles/ndss2008.pdf.
Peter Goodman and Alex Groce. DeepState: symbolic unit testing for c and c$\mathplus $$\mathplus $. In Proceedings 2018 Workshop on Binary Analysis Research. Internet Society, 2018. URL: https://www.ndss-symposium.org/wp-content/uploads/2018/07/bar2018_9_Goodman_paper.pdf, doi:10.14722/bar.2018.23009.
Gustavo Grieco, Mart\'ın Ceresa, Agust\'ın Mista, and Pablo Buiras. QuickFuzz testing for fun and profit. Journal of Systems and Software, 134:340–354, December 2017. URL: http://www.cse.chalmers.se/~mista/assets/pdf/jss17.pdf, doi:10.1016/j.jss.2017.09.018.
Alex Groce, Mohammad Amin Alipour, and Rahul Gopinath. Coverage and its discontents. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software, Onward! 2014, 255–268. New York, NY, USA, 2014. Association for Computing Machinery. URL: https://agroce.github.io/onwardessays14.pdf, doi:10.1145/2661136.2661157.
Alex Groce, Josie Holmes, and Kevin Kellar. One test to rule them all. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis - ISSTA 2017. ACM Press, 2017. URL: https://agroce.github.io/issta17.pdf, doi:10.1145/3092703.3092704.
Alex Groce, Chaoqiang Zhang, Eric Eide, Yang Chen, and John Regehr. Swarm testing. In Proceedings of the 2012 International Symposium on Software Testing and Analysis, ISSTA 2012, 78–88. New York, NY, USA, 2012. Association for Computing Machinery. URL: https://www.cs.utah.edu/~regehr/papers/swarm12.pdf, doi:10.1145/2338965.2336763.
Emre Güler, Philipp Görz, Elia Geretto, Andrea Jemmett, Sebastian Österlund, Herbert Bos, Cristiano Giuffrida, and Thorste Holz. Cupid: automatic fuzzer selection for collaborative fuzzing. December 2020. URL: https://www.ei.ruhr-uni-bochum.de/media/emma/veroeffentlichungen/2020/09/26/ACSAC20-Cupid_TiM9H07.pdf, doi:10.1145/3427228.3427266.
David Herel, Dominika Zogatova, Matej Kripner, and Tomas Mikolov. Emergence of novelty in evolutionary algorithms. In The 2022 Conference on Artificial Life. MIT Press, 2022. URL: https://doi.org/10.1162%2Fisal_a_00501, doi:10.1162/isal_a_00501.
Adrian Herrera, Hendra Gunadi, Liam Hayes, Shane Magrath, Felix Friedlander, Maggi Sebastian, Michael Norrish, and Antony L. Hosking. Corpus distillation for effective fuzzing: a comparative evaluation. 2019. arXiv:1905.13055.
George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. Evaluating fuzz testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS '18, 2123–2138. New York, NY, USA, 2018. Association for Computing Machinery. URL: https://www.cs.umd.edu/~mwh/papers/fuzzeval.pdf, doi:10.1145/3243734.3243804.
Andreas Löscher and Konstantinos Sagonas. Targeted property-based testing. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2017, 46–56. New York, NY, USA, 2017. Association for Computing Machinery. URL: https://proper-testing.github.io/papers/issta2017.pdf, doi:10.1145/3092703.3092711.
Leonidas Lampropoulos, Michael Hicks, and Benjamin C. Pierce. Coverage guided, property based testing. Proceedings of the ACM on Programming Languages, 3(OOPSLA):1–29, 2019. URL: https://lemonidas.github.io/pdf/FuzzChick.pdf, doi:10.1145/3360607.
Caroline Lemieux and Koushik Sen. Fairfuzz: a targeted mutation strategy for increasing greybox fuzz testing coverage. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, 475–485. New York, NY, USA, 2018. Association for Computing Machinery. URL: https://www.carolemieux.com/fairfuzz-ase18.pdf, doi:10.1145/3238147.3238176.
Jie Liang, Yu Jiang, Yuanliang Chen, Mingzhe Wang, Chijin Zhou, and Jiaguang Sun. Pafl: extend fuzzing optimizations of single mode to industrial parallel mode. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2018, 809–814. New York, NY, USA, 2018. Association for Computing Machinery. URL: http://wingtecher.com/themes/WingTecherResearch/assets/papers/fse18-pafl.pdf, doi:10.1145/3236024.3275525.
Chenyang Lyu, Shouling Ji, Chao Zhang, Yuwei Li, Wei-Han Lee, Yu Song, and Raheem Beyah. MOPT: optimized mutation scheduling for fuzzers. In 28th USENIX Security Symposium (USENIX Security 19), 1949–1966. Santa Clara, CA, August 2019. USENIX Association. URL: https://www.usenix.org/system/files/sec19-lyu.pdf.
David MacIver, Zac Hatfield-Dodds, and Many Contributors. Hypothesis: a new approach to property-based testing. Journal of Open Source Software, 4(43):1891, 2019. URL: https://doi.org/10.21105/joss.01891, doi:10.21105/joss.01891.
David R. MacIver and Alastair F. Donaldson. Test-Case Reduction via Test-Case Generation: Insights from the Hypothesis Reducer (Tool Insights Paper). In Robert Hirschfeld and Tobias Pape, editors, 34th European Conference on Object-Oriented Programming (ECOOP 2020), volume 166 of Leibniz International Proceedings in Informatics (LIPIcs), 13:1–13:27. Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/opus/volltexte/2020/13170, doi:10.4230/LIPIcs.ECOOP.2020.13.
Valentin J. M. Manès, Soomin Kim, and Sang Kil Cha. Ankou: guiding grey-box fuzzing towards combinatorial difference. In Proceedings of the International Conference on Software Engineering, 1024–1036. 2020. URL: https://www.jiliac.com/files/ankou-icse2020.pdf.
Brian Marick. How to misuse code coverage. 1997. URL: http://www.exampler.com/testing-com/writings/coverage.pdf.
Barton Miller, David Koski, Cjin Lee, Vivekananda Maganty, Ravi Murthy, Ajitkumar Natarajan, and Jeff Steidl. Fuzz revisited: a re-examination of the reliability of unix utilities and services. 1998. URL: http://www.paradyn.org/papers/fuzz-revisited.pdf.
Barton P. Miller, Louis Fredriksen, and Bryan So. An empirical study of the reliability of unix utilities. Commun. ACM, 33(12):32–44, December 1990. URL: http://www.paradyn.org/papers/fuzz.pdf, doi:10.1145/96267.96279.
Barton P. Miller, Mengxiao Zhang, and Elisa R. Heymann. The relevance of classic fuzz testing: have we solved this one? 2020. arXiv:2008.06537.
Stefan Nagy and Matthew Hicks. Full-speed fuzzing: reducing fuzzing overhead through coverage-guided tracing. 2018. arXiv:1812.11875.
Hoang Lam Nguyen and Lars Grunske. Bedivfuzz: integrating behavioral diversity into generator-based fuzzing. 2022. arXiv:2202.13114.
Rohan Padhye, Caroline Lemieux, and Koushik Sen. JQF: coverage-guided property-based testing in java. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis - ISSTA 2019. ACM Press, 2019. URL: https://cs.berkeley.edu/~rohanpadhye/files/zest-issta19.pdf, doi:10.1145/3293882.3339002.
Rohan Padhye, Caroline Lemieux, Koushik Sen, Mike Papadakis, and Yves Le Traon. Semantic fuzzing with zest. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis - ISSTA 2019. ACM Press, 2019. URL: https://people.eecs.berkeley.edu/~rohanpadhye/files/zest-issta19.pdf, doi:10.1145/3293882.3330576.
Rohan Padhye, Caroline Lemieux, Koushik Sen, Laurent Simon, and Hayawardh Vijayakumar. FuzzFactory: domain-specific fuzzing with waypoints. Proceedings of the ACM on Programming Languages, 3(OOPSLA):1–29, October 2019. URL: https://dl.acm.org/doi/pdf/10.1145/3360600, doi:10.1145/3360600.
Zoe Paraskevopoulou, Cătălin Hriţcu, Maxime Dénès, Leonidas Lampropoulos, and Benjamin C. Pierce. Foundational property-based testing. In Interactive Theorem Proving, pages 325–343. Springer International Publishing, 2015. URL: https://prosecco.gforge.inria.fr/personal/hritcu/publications/foundational-pbt.pdf, doi:10.1007/978-3-319-22102-1_22.
Sameer Reddy, Caroline Lemieux, Rohan Padhye, and Koushik Sen. Quickly generating diverse valid test inputs with reinforcement learning. In Proceedings of the 42st International Conference on Software Engineering. IEEE Press, 2020. URL: https://www.carolemieux.com/rlcheck_preprint.pdf.
Nicola Ruaro, Kyle Zeng, Lukas Dresel, Mario Polino, Tiffany Bao, Andrea Continella, Stefano Zanero, Christopher Kruegel, and Giovanni Vigna. SyML: Guiding Symbolic Execution Toward Vulnerable States Through Pattern Learning, pages 456–468. Association for Computing Machinery, New York, NY, USA, 2021. URL: https://conand.me/publications/ruaro-syml-2021.pdf.
Koushik Sen, Darko Marinov, and Gul Agha. Cute: a concolic unit testing engine for c. In Proceedings of the 10th European Software Engineering Conference Held Jointly with 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ESEC/FSE-13, 263–272. New York, NY, USA, 2005. Association for Computing Machinery. URL: http://mir.cs.illinois.edu/marinov/publications/SenETAL05CUTE.pdf, doi:10.1145/1081706.1081750.
Ezekiel Soremekun, Esteban Pavese, Nikolas Havrikov, Lars Grunske, and Andreas Zeller. Inputs from hell: learning input distributions for grammar-based test generation. IEEE Transactions on Software Engineering, PP:1–1, 08 2020. URL: https://publications.cispa.saarland/3167/7/inputs-from-hell.pdf, doi:10.1109/TSE.2020.3013716.
Vassilis Vassiliades, Konstantinos Chatzilygeroudis, and Jean-Baptiste Mouret. Using centroidal voronoi tessellations to scale up the multidimensional archive of phenotypic elites algorithm. IEEE Transactions on Evolutionary Computation, 22(4):623–630, 2018. doi:10.1109/TEVC.2017.2735550.
Jinghan Wang, Yue Duan, Wei Song, Heng Yin, and Chengyu Song. Be sensitive and collaborative: analyzing impact of coverage metrics in greybox fuzzing. In 22nd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2019), 1–15. Chaoyang District, Beijing, September 2019. USENIX Association. URL: https://www.usenix.org/conference/raid2019/presentation/wang.
Pengfei Wang, Xu Zhou, Kai Lu, Yingying Liu, and Tai Yue. Sok: the progress, challenges, and perspectives of directed greybox fuzzing. 2020. arXiv:2005.11907.
Zi Wang, Ben Liblit, and Thomas Reps. Tofu: target-oriented fuzzer. 2020. arXiv:2004.14375.
Mingyuan Wu, Ling Jiang, Jiahong Xiang, Yanwei Huang, Heming Cui, Lingming Zhang, and Yuqun Zhang. One fuzzing strategy to rule them all. In 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), volume, 1634–1645. 2022. URL: http://zhangyuqun.com/publications/icse2022b.pdf, doi:10.1145/3510003.3510174.
Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. Finding and understanding bugs in c compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '11, 283–294. New York, NY, USA, 2011. Association for Computing Machinery. URL: https://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf, doi:10.1145/1993498.1993532.
Andreas Zeller, Rahul Gopinath, Marcel Böhme, Gordon Fraser, and Christian Holler. The fuzzing book. In The Fuzzing Book. Saarland University, 2019. URL: https://www.fuzzingbook.org/ (visited on 2019-09-09 16:42:54+02:00).
Andreas Zeller and Ralf Hildebrandt. Simplifying and isolating failure-inducing input. IEEE Trans. Softw. Eng., 28(2):183–200, February 2002. URL: https://www.cs.purdue.edu/homes/xyzhang/fall07/Papers/delta-debugging.pdf, doi:10.1109/32.988498.
C. Aschermann, S. Schumilo, A. Abbasi, and T. Holz. IJON: exploring deep state spaces via fuzzing. In 2020 IEEE Symposium on Security and Privacy (SP), volume, 1597–1612. 2020. URL: https://www.syssec.ruhr-uni-bochum.de/media/emma/veroeffentlichungen/2020/02/27/IJON-Oakland20.pdf.
M. Böhme, V. Pham, and A. Roychoudhury. Coverage-based greybox fuzzing as markov chain. IEEE Transactions on Software Engineering, 45(5):489–506, 2019. URL: https://mboehme.github.io/paper/TSE18.pdf, doi:10.1109/TSE.2017.2785841.
A. Löscher and K. Sagonas. Automating targeted property-based testing. In 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), volume, 70–80. 2018. URL: https://proper-testing.github.io/papers/icst2018.pdf, doi:10.1109/ICST.2018.00017.
T. Petsios, A. Tang, S. Stolfo, A. D. Keromytis, and S. Jana. Nezha: efficient domain-independent differential testing. In 2017 IEEE Symposium on Security and Privacy (SP), 615–632. 2017. URL: https://www.ieee-security.org/TC/SP2017/papers/390.pdf.
V. Pham, M. Böhme, A. E. Santosa, A. R. Caciulescu, and A. Roychoudhury. Smart greybox fuzzing. IEEE Transactions on Software Engineering, 2019. URL: https://thuanpv.github.io/publications/TSE19_aflsmart.pdf, doi:10.1109/TSE.2019.2941681.