Use Cases
Using TrustworthySearch
In this section, we describe possible ways to use and evaluate TrustworthySearch within your continuous-integration and QA systems. We understand that different teams within your organization may have different needs and pain points when it comes to using simulation to drive improvements, and we believe TrustworthySearch can be used effectively in a variety of workflows. Overall, we want to decrease the time from starting a test to receiving actionable feedback on what to work on next.
Integration strategies for testing and continuous integration (CI)
There are multiple ways to use the TrustworthySearch API. Here are a few sample integration strategies.
- Test nightly builds by evaluating risk on a chosen set of scenarios (a minimal CI sketch follows this list)
- A/B test for differences between two ADAS/AV stacks on a given scenario
- Build an importance-sampling library of scenarios to use in fixed regression tests
- Build importance samplers for performing enhanced or "fuzzed" log replay, with perturbations that are learned to be dangerous for a given score function (or functions).
- Make the ego-vehicle policy part of the search space in order to analyze the stack's sensitivity to changes in particular parameters
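To make the nightly-build strategy concrete, here is a minimal sketch of a CI job that evaluates per-scenario risk and gates the build on the result. The client class, method names, scenario identifiers, and thresholds below are illustrative placeholders, not the actual TrustworthySearch API; consult the API reference for real endpoints and signatures.

```python
"""Hypothetical nightly-build risk gate. All names here
(TrustworthySearchClient, submit_risk_job, ...) are illustrative
stand-ins for whatever the real TrustworthySearch client exposes."""
import os
import sys


class TrustworthySearchClient:
    """Placeholder wrapper; the real API surface may differ."""

    def __init__(self, api_key: str):
        self.api_key = api_key

    def submit_risk_job(self, scenario_id: str, stack_build: str,
                        num_simulations: int) -> dict:
        # A real integration would call the TrustworthySearch service;
        # a stubbed result is returned here so the sketch runs as-is.
        return {"scenario": scenario_id, "build": stack_build,
                "risk": 1.2e-5, "failure_modes": []}


def main() -> int:
    client = TrustworthySearchClient(os.environ.get("TS_API_KEY", "demo"))
    build = os.environ.get("BUILD_ID", "nightly")
    # Evaluate risk on a fixed set of scenarios for tonight's build.
    scenarios = ["highway_merge", "unprotected_left", "pedestrian_crossing"]
    worst = 0.0
    for scenario in scenarios:
        result = client.submit_risk_job(scenario, build, num_simulations=10_000)
        print(f"{scenario}: estimated risk = {result['risk']:.2e}")
        worst = max(worst, result["risk"])
    # Fail the CI job if the worst estimated risk exceeds a team-chosen bar.
    return 1 if worst > 1e-4 else 0


if __name__ == "__main__":
    sys.exit(main())
```

Gating on the worst per-scenario risk keeps the pass/fail signal conservative; teams may instead prefer an aggregate risk estimate or per-scenario thresholds.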
You are likely already testing your system using one of these strategies. Next we'll discuss how you might go about determining whether TrustworthySearch provides value on top of your current system.
Measuring the benefits of TrustworthySearch
There are many ways to test whether TrustworthySearch improves on an existing testing or continuous-integration system.
- Consider a scenario you wish to test. For a given simulation budget, determine whether TrustworthySearch provides more useful failures and failure analysis than your existing testing approach. Relevant criteria include:
  - Coverage: which approach discovers more (or all) of the failure modes?
  - Prioritization: TrustworthySearch outputs a ranking of which failure modes are most important to fix first. Is this ranking better or worse than your current prioritization method?
  - Failure sampler: TrustworthySearch builds a parametric importance sampler over the discovered failure modes that can generate more "hard" test cases as a warm start for testing future versions of the ADAS/AV stack (the sampler can be thought of as a generator for a library of scenarios similar to the discovered failure modes; see the sampling sketch after this list). Is this sampler better or worse than your current way of using previous failures to inform the generation of similar scenarios?
- Consider an ADAS/AV stack that has been tested before and has a known problem on a certain scenario. Run TrustworthySearch on the stack to see whether it finds the problem and performs risk analysis on it more efficiently and/or more completely than current techniques (again with respect to coverage and prioritization of failure modes).
- Consider two ADAS/AV stacks, one of which has a known problem and the other of which claims to fix this problem without introducing new ones. Run TrustworthySearch on the second stack to see whether it is truly a Pareto improvement over the first. Determine whether this test is faster or more complete (with respect to coverage and prioritization) than current evaluation techniques.
- Consider two ADAS/AV stacks where you do not know which is better. Rather than performing risk analysis on each stack independently, run TrustworthySearch to find the regions where the two stacks maximally differ in performance, conditioned on both performances falling below some global threshold. In other words, find regions where the first stack performs worse than the second and vice versa (see the difference-search sketch after this list). Determine whether this process is more efficient and more complete (with respect to coverage and prioritization of the discovered regions) than current approaches to A/B testing.
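To illustrate the failure-sampler idea from the list above, the sketch below draws "hard" test cases from a simple Gaussian-mixture stand-in for a fitted sampler. The mixture parameters and the two scenario parameters are invented for illustration; the actual sampler returned by TrustworthySearch may use a different parametric family and interface.

```python
"""Illustrative warm-start generation from a fitted failure sampler.
The Gaussian mixture below is a stand-in; its weights, means, and
covariances are made up for the example."""
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fitted parameters over two scenario parameters,
# e.g., (cut-in gap in meters, lead-vehicle deceleration in m/s^2).
weights = np.array([0.7, 0.3])
means = np.array([[12.0, -4.5], [25.0, -6.0]])
covs = np.array([[[2.0, 0.0], [0.0, 0.3]],
                 [[4.0, 0.2], [0.2, 0.5]]])


def sample_hard_cases(n: int) -> np.ndarray:
    """Draw n scenario parameterizations near known failure modes."""
    components = rng.choice(len(weights), size=n, p=weights)
    return np.stack([rng.multivariate_normal(means[k], covs[k])
                     for k in components])


# Generate a small library of hard cases to seed the next regression run.
library = sample_hard_cases(100)
print(library[:5])
```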
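Similarly, the A/B difference search in the last item can be made concrete with a toy objective. The score functions, parameter ranges, and threshold below are invented, and TrustworthySearch performs this search adaptively; plain Monte Carlo is used here only to show what "regions where the stacks maximally differ" means.

```python
"""Toy A/B difference search over two hypothetical score functions
(lower score = worse performance). All functions and constants are
illustrative, not the TrustworthySearch API."""
import numpy as np

rng = np.random.default_rng(1)


# Stand-in score functions over two scenario parameters,
# e.g., (cut-in gap in meters, braking rate in m/s^2).
def score_stack_a(x):
    return x[0] - 0.5 * abs(x[1])


def score_stack_b(x):
    return x[0] - 0.8 * abs(x[1]) + 1.0


THRESHOLD = 10.0  # illustrative global performance threshold

samples = rng.uniform(low=[5.0, -8.0], high=[30.0, 0.0], size=(5000, 2))
records = []
for x in samples:
    sa, sb = score_stack_a(x), score_stack_b(x)
    # Condition on both stacks performing below the global threshold.
    if sa < THRESHOLD and sb < THRESHOLD:
        records.append((sa - sb, x))

# Largest positive gaps: stack A worse than B; most negative: vice versa.
records.sort(key=lambda r: r[0])
print("A much worse than B:", records[-3:])
print("B much worse than A:", records[:3])
```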