Internet-Draft MLRsearch July 2023
Konstantynowicz & Polak Expires 11 January 2024 [Page]
Benchmarking Working Group
Intended Status:
M. Konstantynowicz
Cisco Systems
V. Polak
Cisco Systems

Multiple Loss Ratio Search


This document proposes improvements to [RFC2544] throughput search by defining a new methodology called Multiple Loss Ratio search (MLRsearch). The main objectives for MLRsearch are to minimize the total test duration, search for multiple loss ratios and improve results repeatibility and comparability.

The main motivation behind MLRsearch is the new set of challenges and requirements posed by testing Network Function Virtualization (NFV) systems and other software based network data planes.

MLRsearch offers several ways to address these challenges, giving user configuration options to select their preferred way.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 11 January 2024.

Table of Contents

1. Purpose and Scope

The purpose of this document is to describe Multiple Loss Ratio search (MLRsearch), a throughput search methodology optimized for software DUTs.

Applying vanilla [RFC2544] throughput bisection to software DUTs results in a number of problems:

MLRsearch aims to address these problems by applying the following set of enhancements:

MLRsearch configuration options are flexible enough to support both conservative settings (unconditionally compliant with [RFC2544], but longer search duration and worse repeatability) and aggressive settings (shorter search duration and better repeatability but not compliant with [RFC2544]).

No part of [RFC2544] is intended to be obsoleted by this document.

2. Terminology

When a subsection is defining a term, the first paragraph acts as a definition. Other paragraphs are treated as a description, they provide additional details without being needed to define the term.

Definitions should form a directed acyclic graph of dependencies. If a section contains subsections, the section definition may depend on the subsection definitions. Otherwise, any definition may depend on preceding definitions. In other words, if the section definition were to come after subsections, there would be no forward dependencies for people reading just definitions from start to finish.

Descriptions provide motivations and explanations, they frequently reference terms defined only later. Motivations in section descriptions are the reason why section text comes before subsection text.

2.1. General notions

General notions are the terms defined in this section.

It is useful to define the following notions before delving into MLRsearch architecture, as the notions appear in multiple places with no place being special enough to host definition.

2.1.1. General and specific quantities

General quantity is a quantity that may appear multiple times in MLRsearch specification, perhaps each time in a different role. The quantity when appearing in a single role is called a specific quantity.

It is useful to define the general quantity, so definitions of specific quantities may refer to it. We say a specific quantity is based on a general quantity, if the specific quantity definition refers to and relies on the general quantity definition.

It is natural to name specific quantities by adding an adjective (or a noun) to the name of the general quantity. But existing RFCs typically explicitly define a term acting in a specific role, so the RFC name directly refers to a specific quantity, while the corresponding general quantity is defined only implicitly. Therefore this documents defines general quantities explicitly, even if the same term already appears in an RFC.

In practice, it is required to know which unit of measurement is used to accompany a numeric value of each quantity. The choice of a particular unit of measurement is not important for MLRsearch specification though, so specific units mentioned in this document are just examples or recommendations, not requirements.

When reporting, it is REQUIRED to state the units used.

2.1.2. Composite

A composite is a set of named attributes. Each attribute is either a specific quantity or a composite.

MLRsearch specification frequently groups multiple specific quantities into a composite. Description of such a composite brings an insight to motivations why this or other terms are defined as they are. Such insight will be harder to communicate with the specific quantities alone.

Also, it simplifies naming of specific quantities, as they usually can share a noun or adjective referring to their common composite. Most of relations between composites and their specific quantities can be described using plain English.

Perhaps the only exception involves referring to specific quantities as attributes. For example if there is a composite called 'target', and one of its specific quantities is 'target width' defined using a general quantity 'width', we can say 'width is one of target attributes'.

2.1.3. SUT

As defined in RFC 2285: The collective set of network devices to which stimulus is offered as a single entity and response measured.

While RFC 2544 mostly refers to DUT as a single (network interconnecting) device, section 19 makes it clear multiple DUTs can be treated as a single system, so most of RFC 2544 also applies to testing SUT.

MLRsearch specification only refers to SUT (not DUT), even if it consists of just a single device.

2.1.4. Trial

A trial is the part of test described in RFC 2544 section 23.

When traffic has been sent and SUT response has been observed, we say the trial has been performed, or the trial has been measured. Before that happens, multiple possibilities for upcoming trial may be under consideration.

2.1.5. Load

Intended, constant load for a trial, usually in frames per second.

Load is the general quantity implied by Constant Load of RFC 1242, Data Rate of RFC 2544 and Intended Load of RFC 2285. All three specify this value applies to one (input or output) interface, so we can talk about unidirectional load also when bidirectional or multi-port traffic is applied.

MLRsearch does not rely on this distinction, it works also if the load values correspond to an aggregate rate (sum over all SUT tested input or output interface unidirectional loads), as long as all loads share the same semantics.

Several RFCs define useful quantities based on Offered Load (instead of Intended Load), but MLRsearch specification works only with (intended) load. Those useful quantities still serve as motivations for few specific quantities used in MLRsearch specification.

MLRsearch assumes most load values are positive. For some (but not all) specific quantities based on load, zero may also be a valid value.

2.1.6. Duration

Intended duration of the traffic for a trial, usually in seconds.

This general quantity does not include any preparation nor waiting described in section 23 of RFC 2544. Section 24 of RFC 2544 places additional restrictions on duration, but those restriction apply only to some of the specific quantities based on duration.

Duration is always positive in MLRsearch.

2.1.7. Duration sum

For a specific set of trials, this is the sum of their durations.

Some of specific quantities based on duration sum are derived quantities, without a specific set of trials to sum their durations.

Duration sum is never negative in MLRsearch.

2.1.8. Width

General quantity defined for an ordered pair (lower and higher) of load values, which describes a distance between the two values.

The motivation for the name comes from binary search. The binary search tries to approximate an unknown value by repeatedly bisecting an interval of possible values, until the interval becomes narrow enough. Width of the interval is a specific quantity and the termination condition compares that to another specific quantity acting as the threshold. The threshold value does not have a specific interval associated, but corresponds to a 'size' of the compared interval. As size is a word already used in definition of frame size, a more natural word describing interval is width.

The MLRsearch specification does use (analogues of) upper bound and lower bound, but does not actually need to talk about intervals. Still, the intervals are implicitly there, so width is the natural name.

Actually, there are two popular options for defining width. Absolute width is based on load, the value is the higher load minus the lower load. Relative width is dimensionless, the value is the absolute width divided by the higher load. As intended loads for trials are positive, relative width is between 0.0 (including) and 1.0 (excluding).

Relative width as a threshold value may be useful for users who do not presume what is the typical performance of SUT, but absolute width may be a more familiar concept.

MLRsearch specification does not prescribe which width has to be used, but widths MUST be either all absolute or all relative, and it MUST be clear from report which option was used (it is implied from the unit of measurement of any width value).

2.1.9. Loss ratio

The loss ratio is a general quantity, dimensionless floating point value assumed to be between 0.0 and 1.0, both including. It is computed as the number of frames forwarded by SUT, divided by the number of frames that should have been forwarded during the trial.

If the number of frames that should have been forwarded is zero, the loss ratio is considered to be zero (but it is better to use high enough loads to prevent this).

Loss ratio is basically the same quantity as Frame Loss Rate of RFC 1242, just not expressed in percents.

RFC1242 Frame Loss Rate: Percentage of frames that should have been forwarded by a network device under steady state (constant) load that were not forwarded due to lack of resources.

(RFC2544 restricts Frame Loss Rate to a type of benchmark, for loads 100% of 'maximum rate', 90% and so on.)

2.1.10. Exceed ratio

This general quantity is a dimensionless floating point value, defined using two duration sum quantities. One duration sum is referred to as the good duration sum, the other is referred to as the bad duration sum. The exceed ratio value is computed as the bad duration sum value divided by the sum of the two sums. If both sums are zero, the exceed ratio is undefined.

As there are no negative duration sums in MLRsearch, exceed ratio values are between 0.0 and 1.0 (both including).

2.2. Architecture

MLRsearch architecture consists of three main components: the manager, the controller and the measurer.

The search algorithm is implemented in the controller, and it is the main focus of this document.

Most implementation details of the manager and the measurer are out of scope of this document, except when describing how do those components interface with the controller.

2.2.1. Manager

The manager is the component that initializes SUT, traffic generator (called tester in RFC 2544), the measurer and the controller with intended configurations. It then handles the execution to the controller and receives its result.

Managers can range from simple CLI utilities to complex Continuous Integration systems. From the controller point of view it is important that no additional configuration (nor warmup) is needed for SUT and the measurer to perform trials.

The interface between the manager and the controller is defined in the controller section.

One execution of the controller is called a search. Some benchmarks may execute multiple searches on the same SUT (for example when confirming the performance is stable over time), but in this document only one invocation is concerned (others may be understood as the part of SUT preparation).

Creation of reports of appropriate format can also be understood as the responsibility of the manager. This document places requirements on which information has to be reported.

2.2.2. Measurer

The measurer is the component which performs one trial as described in RFC 2544 section 23, when requested by the controller.

From the controller point of view, it is a function that accepts trial input and returns trial output.

This is the only way the controller can interact with SUT. In practice, the measurer has to do subtle decisions when converting the observed SUT behavior into a single trial loss ratio value. For example how to deal with out of order frames or duplicate frames.

On software implementation level, the measurer is a callable, injected by the manager into the controller instance.

The act of performing one trial (act of turning trial input to trial output) is called a measurement, or trial measurement. This way we can talk about trials that were measured already and trials that are merely planned (not measured yet). Trial input

The load and duration to use in an upcoming trial.

This is a composite.

Other quantities needed by the measurer are assumed to be constant and set up by the manager before search starts (see traffic profile), so they do not count as trial input attributes. Trial load

Trial load is the intended load for the trial.

This is a specific quantity based on load, directly corresponding to RFC 2285 intended load. Trial duration

Trial duration is the intended duration for the trial.

This is a specific quantity based on duration, so it specifies only the traffic part of the trial, not the waiting parts. Traffic profile

Any other configuration values needed by the measurer to perform a trial.

The measurer needs both trial input and traffic profile to perform the trial. As trial input contains the only values that vary during one the search, traffic profile remains constant during the search.

Traffic profile when understood as a composite is REQUIRED by RFC 2544 to contain some specific quantities (for example frame size). Several more specific quantities may be RECOMMENDED.

Depending on SUT configuration (e.g. when testing specific protocols), additional values need to be included in the traffic profile and in the test report. (See other IETF documents.) Trial ouput

A composite consisting of trial loss ratio and trial forwarding rate.

Those are the only two specific quantities (among other quantities possibly measured in the trial, for example offered load) that are important for MLRsearch. Trial loss ratio

Trial loss ratio is a specific quantity based on loss ratio. The value is related to a particular measured trial, as measured by the measurer. Trial forwarding rate

Trial forwarding rate is a derived quantity. It is computed as one minus trial loss ratio, that multiplied by trial load.

Despite the name, the general quantity this specific quantity corresponds to is load (not rate). The name is inspired by RFC 2285, which defines Forwarding Rate specific to one output interface.

As the definition of loss ratio is not neccessarily per-interface (one of details left for the measurer), using the definition above (instead of RFC 2285) makes sure trial forwarding rate is always between zero and the trial load (both including). Trial result

Trial result is a composite consisting of trial input attributes and trial output attributes.

Those are all specific quantites related to a measured trial MLRsearch needs.

While distinction between trial input and output is important when defining the interface between the controller and the measurer, it is easier to talk about trial result when describing how measured trials influence the controller behavior.

2.2.3. Controller

The component of MLRsearch architecture that calls the measurer and returns conditional throughputs to the manager.

This component implements the search algorithm, the main content of this document.

Contrary to Throughput as defined in RFC 1242, the definition of conditional throughput is quite sensitive to the controller input (as provided by the manager), and its full definition needs several terms which would otherwise be hidden as internals of the controller implementation.

The ability of conditional throughput to be less sensitive to performance variance, and the ability of the controller to find conditional throughputs for multiple search goals within one search (and in short overall search time) are strong enough motivations for the need of increased complexity.

2.2.4. Controller input

A composite of max load, min load, and a set of search goals.

The search goals (as elements of the set of search goals) are usually not named and unordered.

It is fine if all search goals of the set have the same value of a particular attribute. In that case, the common value may be treated as a global attribute (similarly to max and min load).

The set of search goals MUST NOT be empty. Two search goals within the set MUST differ in at least one attribute. The manager MAY avoid both issues by presenting empty report or de-duplicating the search goals, but it is RECOMMENDED for the manager to raise an error to its caller, as the two conditions suggest the test is improperly configured. Max load

Max load is a specific quantity based on load. No trial load is ever higher than this value.

RFC 2544 section 20 defines maximum frame rate based on theoretical maximum rate for the frame size on the media. RFC 2285 section 3.5.3 specifies Maximum offered load (MOL) which may be lower than maximum frame rate. There may be other limitations preventing high loads, for examples resources available to traffic generator.

The manager is expected to provide a value that is not greater than any known limitation. Alternatively, the measurer is expected to work at max load, possibly reporting as lost any frames that were not able to leave Traffic Generator.

From the controller point of view, this is merely a global upper limit for any trial load candidates. Min load

Min load is a specific quantity based on load. No trial load is ever lower than this value.

The motivation of this quantity is to prevent trials with too few frames sent to SUT.

Also, practically if a SUT is able to reach only very small forwarding rates (min load indirectly serves as a threshold for how small), it may be considered faulty (or perhaps the test is misconfigured). Search goal

A composite of 7 attributes (see subsections).

If not otherwise specified, 'goal' always refers to a search goal in this document.

The controller input may contain multiple search goals. The name Multiple Loss Ratio search was created back when goal loss ratio was the only attribute allowed to vary between goals.

Each goal will get its conditional throughput discovered and reported at the end of the search.

The definitions of the 7 attributes are not very informative by themselves. Their motivation (and naming) becomes more clear from the impact they have on conditional throughput. Goal loss ratio

A specific quantity based on loss ratio. A threshold value for trial loss ratios. MUST be lower than one.

Trial loss ratio values will be compared to this value, a trial will be considered bad if its loss ratio is higher than this.

For example, RFC 2544 throughput has goal loss ratio of zero, a trial is bad once a sigle frame is lost.

Loss ratio of one would classify each trial as good (regardless of loss), which is not useful. Goal initial trial duration

A specific quantity based on duration. A threshold value for trial durations. MUST be positive.

MLRsearch is allowed to use trials as short as this when focusing on this goal. The conditional throughput may be influenced by shorter trials, (measured when focusing on other search goals). Goal final trial duration

A specific quantity based on duration. A threshold value for trial durations. MUST be no smaller than goal initial trial duration.

MLRsearch is allowed to use trials as long as this when focusing on this goal. If more data is needed, repeated trials at the same load and duration are requested by the controller. Goal min duration sum

A specific quantity based on duration sum. A threshold value for a particular duration sum.

MLRsearch requires at least this amount of (effective) trials for a particular load to become part of MLRsearch outputs.

It is possible (though maybe not prectical) for goal min duration sum to be smaller than goal final trial duration.

In practice, the sum of durations actually spent on trial measurement can be smaller (when trial results are quite one-sided) or even larger (in presence of shorter-than-final trial duration results at the same load).

If the sum of all (good and bad) long trials is at least this, and there are no short trials, then the load is guaranteed to be classified as either an upper or a lower bound.

In some cases, the classification is known sooner, when the 'missing' trials cannot change the outcome.

When short trials are present, the logic is more complicated. Goal exceed ratio

A specific quantity based on exceed ratio. A threshold value for particulat sets of trials.

An attribute used for classifying loads into upper and lower bounds.

If the duration sum of all (current duration) trials is at least min duration sum, and more than this percentage of the duration sum comes from bad trials, this load is an upper bound.

If there are shorter duration trials, the logic is more complicated. Goal width

A specific quantity based on width. A threshold value for a particular width. MUST be positive.

This defines the exit condition for this search goal.

Relevant bounds (of the final target) need to be this close before conditional throughput can be reported. Preceding targets

A non-negative integer affecting the behavior of the controller.

How many additional non-final targets to add. Each next preceding target has double width and min duration sum geometrically closer to initial trial duration.

The usage of preceding targets is an important source of MLRsearch time savings (compared to simpler search algorithms).

Having this value configurable lets the manager tweak the overall search duration based on presumed knowledge of SUT performance stability.

2.2.5. Controller internals

Terms not directly corresponding to the controller's input nor output, but needed indirectly as dependencies of the conditional throughput definition.

Following these definitions specifies virtually all of the controller (MLRsearch algorithm) logic. Pre-initial trials

Up to three special trials executed at the start of the search. The first trial load is max load, subsequent trial load are computed from preceding trial forwarding rate.

The main loop of the controller logic needs at least one trial result, and time is saved if the trial results are close to future conditional throughput values.

The exact way to compute load for second and third trial (and whether even measure second or third trial) are not specified here, as the implementation details have negligible effect on the reported conditional throughput. Search target

A composite of 5 specific quantites (see subsections). Frequently called just target.

Similar to (but distinct from) the search goal.

Each search goal prescribes a final target, probably with a chain of preceding targets.

More details in the Derived targets section. Target loss ratio

Same as loss ratio of the corresponding goal. Target exceed ratio

Same as exceed ratio of the corresponding goal. Target width

Similar to goal width attribute. Doubled from goal width for each level of preceding target. Target min duration sum

Similar to goal min duration sum attribute. Geometrically interpolated between initial target duration and goal min duration sum. Target trial duration

When MLRsearch focuses on this target, it measures trials with this duration. The value is equal to the minimum of goal final trial duration and target min duration sum.

Also, this value is used to classify trial results as short (if trial duration is shorter than this) or long. Derived targets

After receiving the set of search goals, MLRsearch internally derives a set of search targets.

The derived targets can be seen as forming a chain, from initial target to final target. The chain is linked by a reference from a target to its preceding (towarsds initial) target.

The reference may be implemented as 6th attribute od target. Final target

The final target is the target where the most of attribute values are directly copied from the coresponding search goal. Final target width is the same as goal width, final target trial duration is the same as goal final trial duration, and final target min duration sum is the same as the goal min duration sum.

The conditional throughput is found when focusing on the final target. All non-final targets do not directly affect the conditional throughput, they are there just as an optimization. Preceding target

Each target may have a preceding target. Goal attribute Preceding targets governs how many targets are created in addition to the final target corresponding to the search goal.

Any preceding target has double width, meaning one balanced bisection is needed to reduce preceding target width to the next target width.

Preceding target min duration sum is exponentially smaller, aiming for prescribed initial target min duration sum.

Preceding target trial duration is either its min duration sum, or the corresponding goal's final trial duration, whichever is smaller.

As the preceding min duration sum is shorter than the next duration sum, MLRsearch is able to achieve the preceding target width sooner (than with the next target min duration sum).

This way an approximation of the conditional throughput is found, with the next target needing not as much time to improve the approximation (compared to not starting with the approximation). Initial target

Initial target is a target without any other target preceding it. Initial target min duration sum is equal to the corresponding goal's initial trial duration.

As a consequence, initial target trial duration is equal to its min duration sum. Trial classification

Any trial result can be classified according to any target along two axes.

The two classifications are independent.

This classification is important for defining the conditional throughput. Short trial

If the (measured) trial duration is shorter than the target trial duration, the trial is called long. Long trial

If the (measured) trial duration is not shorter than the target trial duration, the trial is called long. Bad trial

If the (measured) trial loss ratio is larger than the target loss ratio, the trial is called bad.

For example, if the target loss ratio is zero, a trial is bad as soon as one frame was lost. Good trial

If the (measured) trial loss ratio is not larger than the target loss ratio, the trial is called good.

For example, if the target loss ratio is zero, a trial is good only when there were no frames lost. Load stat

A composite of 8 quantities (see subsections) The quantites depend on a target and a load, and are computed from all trials measured at that load so far.

The MLRsearch output is the conditional througput, which is a specific quantity based on load. As MLRsearch may measure multiple trials at the same load, and those trials may not have the same duration, we need a way to classify a set of trial results at the same load.

As the logic is not as straightforward as in other parts of MLRsearch algorithm, it is best defined using the following derived quantities.

Load stat is the composite for one load and one target. Set of load stats for one load an all targets is commonly called load stats. Long good duration sum

Sum of durations of all long good trials (at this load, according to this target). Long bad duration sum

Sum of durations of all long bad trials (at this load, according to this target). Short good duration sum

Sum of durations of all short good trials (at this load, according to this target). Short bad duration sum

Sum of durations of all short bad trials (at this load, according to this target). Effective bad duration sum

One divided by tagret exceed ratio, that plus one. Short good duration sum divided by that. Short bad duration sum minus that, or zero if that would be negative. Long bad duration sum plus that is the effective bad duration sum.

Effective bad duration sum is the long bad duration sum plus some fraction of short bad duration sum. The fraction is between zero and one (both possibly including).

If there are no short good trials, effective bad duration sum becomes the duration sum of all bad trials (long or short).

If an exceed ratio computed from short good duration sum and short bad duration sum is equal or smaller than the target exceed ratio, effective bad duration sum is equal to just long bad duration sum.

Basically, short good trials can only lessen the impact of short bad trials, while short bad trials directly contribute (unless lessened).

A typical example of why a goal needs higher final trial duration than initial trial duration is when SUT is expected to have large buffers, so a trial may be too short to see frame losses due to a buffer becoming full. So a short good trial does not give strong information. On the other hand, short bad trial is a strong hint SUT would lose many frames at that load and long duration. But if there is a mix of short bad and short good trials, MLRsearch should not cherry-pick only the short bad ones.

The presented way of computing the effective bad duration sum aims to be a fair treatment of short good trials.

If the target exceed ratio is zero, the given definition contains positive infinty as an intermediate value, but still simplifies to a finite result (long bad duration sum plus short bad duration sum). Missing duration sum

The target min duration sum minus effective bad duration sum and minus long good duration sum, or zero if that would be negative.

MLRsearch may need up to this duration sum of additional long trials before classifing the load. Optimistic exceed ratio

The specific quantity based on exceed ratio, where bad duration sum is the effective bad duration sum, and good duration sum is the long good duration sum plus the missing duration sum.

This is the value MLRsearch would compare to target exceed ratio assuming all of the missing duration sum ends up consisting of long good trials.

If there was a bad long trial, optimistic exceed ratio becomes larger than zero. Additionally, if the target exceed ratio is zero, optimistic exceed ratio becomes larger than zero even on one short bad trial. Pessimistic exceed ratio

The specific quantity based on exceed ratio, where bad duration sum is the effective bad duration sum plus the missing duration sum, and good duration sum is the long good duration sum.

This is the value MLRsearch would compare to target exceed ratio assuming all of the missing duration sum ends up consisting of bad good trials.

Note that if the missing duration sum is zero, optimistic exceed ratio becomes equal to pessimistic exceed ratio.

This is the role target min duration sum has, it guarantees the two load exceed ratios eventually become the same. Otherwise, pessimistic exceed ratio is always bigger than the optimistic exceed ratio.

Depending on trial results, the missing duration sum may not be large enough to change optimistic (or pessimistic) exceed ratio to move to the other side compared to target exceed ratio. In that case, MLRsearch does not need to measure more trials at this load when focusing on this target. Target bounds

With respect to a target, some loads may be classified as upper or lower bound, and some of the bounds are treated as relevant.

The subsequent parts of MLRsearch rely only on relevant bounds, without the need to classify other loads. Upper bound

A load is classified as an upper bound for a target, if and only if both optimistic exceed ratio and pessimstic load exceed ratio are larger than the target exceed ratio.

During the search, it is possible there is no upper bound, for example because every measured load still has too high missing duration sum.

If the target exceed ratio is zero, and the load has at least one bad trial (short or long), the load becomes an upper bound. Lower bound

A load is classified as a lower bound for a target, if and only if both optimistic exceed ratio and pessimstic load exceed ratio are no larger than the target exceed ratio.

During the search, it is possible there is no lower bound, for example because every measured load still has too high missing duration sum.

If the target exceed ratio is zero, all trials at the load of a lower bound must be good trials (short or long).

Note that so far it is possible for a lower bound to be higher than an upper bound. Relevant upper bound

For a target, a load is the relevant upper bound, if and only if it is an upper bound, and all other upper bounds are larger (as loads).

In some cases, the max load when classified as a lower bound is also effectively treated as the relevant upper bound. (In that case both relevant bounds are equal.)

If that happens for a final target at the end of the search, the controller output may contain max load as the relevant upper bound (even if the goal exceed ratio was not exceeded), signalling SUT performs well even at max load.

If the target exceed ratio is zero, the relevant upper bound is the smallest load where a bad trial (short or long) has been measured. Relevant lower bound

For a target, a load is the relevant lower bound if two conditions hold. Both optimistic exceed ratio and pessimstic load exceed ratio are no larger than the target exceed ratio, and there is no smaller load classified as an upper bound.

This is a second place where MLRsearch is not symmetric (the first place was effective bad duration sum).

While it is not likely for a MLRsearch to find a smaller upper bound and a larger load satisfying first condition for the lower bound, it still may happen and MLRsearch has to deal with it. The second condition makes sure the relevant lower bound is smaller than the relevant upper bound.

In some cases, the min load when classified as an upper bound is also effectively treated as the relevant lower bound. (In that case both relevant bounds are equal.)

If that happens for a final target at the end of the search, the controller output may contain min load as the relevant lower bound even if the exceed ratio was 'overstepped', signalizing the SUT does not even reach the minimal required performance.

The manager has to make sure this is distingushed in report from cases where min rate is a legitimate conditional throughput (e.g. the exceed ratio was not overstepped at the min load). Relevant bounds

The pair of the relevant lower bound and the relevant upper bound.

Useful for determining the width of the relevant bounds. Any of the bounds may be the effective one (max load or min load).

A goal is achieved (at the end of the search) when the final target's relevant bounds have width no larger than the goal width. Candidate selector

A stateful object (a finite state machine) focusing on a single target, used to determine next trial input.

Initialized for a pair of targets: the current target and its preceding target (if any).

Private state (not shared with other selectors) consists of mode and flags. Public state (shared with all selectors) is the actual relevant bounds for both targets (current and precedinig).

After accepting a trial result, each selector can nominate one candidate (or no candidate) for the next trial measurement. Current target

This is the target this selector tries to achieve. Preceding target

The target (if any) preceding to the current target.

While this selector does not focus on the preceding target, the relevant bounds for the preceding target are used as hints when the current bound does not have enough of its relevant bounds. Candidate

The trial input (if any) this selecor nominates.

The trial duration attribute is always the current target trial duration. The trial load attribute depends on the selector state.

Candidates have defined ordering, to simplify finding the winner. If load differs, the candidate with lower load is preferred. If load is the same but duration differs, the candidate with larger duration is preferred. Selector mode

During its lifetime, selector proceeds through the following modes. In order, but some modes may be skipped or revisited.

Each mode has its own strategy of determining the candidate load (if any). Waiting

Not enough relevant bounds (even for the preceding target). In this mode, the selector abstains from nominating a candidate.

This selector leaves this mode when preceding target's selector is done. Halving

Candidate is in the middle of the relevant bounds of the preceding target.

If the relevant bounds are narrow enough already, this mode is skipped. As the preceding target had double width, just one halving load needs to be measured.

Selector uses a flag to avoid re-entering this mode once it finished measuring the halved load. Upgrading

This mode activates when one relevant bound for the current target is present and there is a matching relevant bound of the preceding target within the current target width. Candidate is the load of the matching bound from the preceding target.

At most one bound load is measured, depending on halving outcome. Private flags are used to avoid upgrading at later times once selector finished measuring the upgraded load. Extending

Refined already but the other relevant bound for the current target is still missing. Nominate new candidate according to external search. Initial target selectors skip all previous modes.

A private value is used to track the width to be used in next load extension (increasing geometrically). For initial target selectors, the starting width may be chosen based on pre-initial trial results.

If both relevant bounds are present at the current load, but the lower bound is far away (compared to tracked width), the candidate from this mode is preferred (as long as the load is larger than the candidate load of bisecting mode). Bisecting

Both relevant bounds for the current target are available, but they are too far from each other. Candidate is in the middle.

Contrary to halving, the candidate load does not need to be at the exact middle. For example if the width of the current relevant bounds is three times as large as the target width, it is advantageous to split the interval in 1:2 ratio (choosing the lower candidate load), as it can save one bisect. Done

Both relevant bounds for the current target are available, the width is no larger than the target width. No candidate.

If a selector reaches the done state, it is still possible later trials invalidate its relevant lower bound (by proving a lower load is in fact a new uper bound), making the selector transition into extending or bisecting mode. Active selector

Derived from a common goal, the earliest selector which nominates a candidate is considered to be the active selector for this goal. Candidates from other selectors of the same goal are ignored.

It is quite possible selectors focusing on other goals have already found a lower bound relevant to multiple targets in a chain. In that case, we want the most-initial of the target selectors (not already in done mode) to have the nomination.

Otherwise (when in extending mode and missun relevant upper bound) the closer-to-final selectors would nominate candidates at lower load but at too high duration sum, preventing some of the time savings. Winner

If the candidate previously nominated by a selector was the one that got measured, the candidate is called a winner.

A selector observing its previous candidate was a winer can use simplified logic when determining the mode, as it knows no other selectors may have changed the relevant loads unexpectedly.

2.2.6. Controller output

The output object the controller returns to the manager is a mapping assigning each search goal its conditional output (if it exists).

The controller MAY include more information (if manager accepts it), for example load stat at relevant bounds.

There MAY be several ways how to communicate the fact a conditional output does not exist (e.g. min load is classified as an upper bound). The manager MUST NOT present min load as a conditional output in that case.

If max load is a lower bound, it leads to a valid conditional output value. Conditional throughput

The conditional throughput is the average of trial forwarding rates across long good trials measured at the (offered load classified as) relevant lower bound (for the goal, at the end of the search). The average is the weighted arithmetic mean, weighted by trial duration.

If the goal exceed ratio is zero, the definition of the relevant bounds simplifies significantly. If additionally the goal loss ratio is zero, and the goal min duration sum is equal to goal final trial duration, conditional throughput becomes conditionally compliant with RFC 2544 throughput. If the goal final trial duration is at least 60 seconds, the conditional througput becomes unconditionally compliant with RFC 2544 throughput.

3. Problems

3.1. Long Test Duration

Emergence of software DUTs, with frequent software updates and a number of different packet processing modes and configurations, drives the requirement of continuous test execution and bringing down the test execution time.

In the context of characterising particular DUT's network performance, this calls for improving the time efficiency of throughput search. A vanilla bisection (at 60sec trial duration for unconditional [RFC2544] compliance) is slow, because most trials spend time quite far from the eventual throughput.

[RFC2544] does not specify any stopping condition for throughput search, so users can trade-off between search duration and achieved precision. But, due to exponential behavior of bisection, small improvement in search duration needs relatively big sacrifice in the throughput precision.

3.2. DUT within SUT

[RFC2285] defines: - DUT as - The network forwarding device to which stimulus is offered and response measured [RFC2285] (section 3.1.1). - SUT as - The collective set of network devices to which stimulus is offered as a single entity and response measured [RFC2285] (section 3.1.2).

[RFC2544] specifies a test setup with an external tester stimulating the networking system, treating it either as a single DUT, or as a system of devices, an SUT.

In case of software networking, the SUT consists of a software program processing packets (device of interest, the DUT), running on a server hardware and using operating system functions as appropriate, with server hardware resources shared across all programs and the operating system.

DUT is effectively "nested" within SUT.

Due to a shared multi-tenant nature of SUT, DUT is subject to interference (noise) coming from the operating system and any other software running on the same server. Some sources of noise can be eliminated (e.g. by pinning DUT program threads to specific CPU cores and isolating those cores to avoid context switching). But some noise remains after all such reasonable precautions are applied. This noise does negatively affect DUT's network performance. We refer to it as an SUT noise.

DUT can also exhibit fluctuating performance itself, e.g. while performing some "stop the world" internal stateful processing. In many cases this may be an expected per-design behavior, as it would be observable even in a hypothetical scenario where all sources of SUT noise are eliminated. Such behavior affects trial results in a way similar to SUT noise. We use noise as a shorthand covering both DUT fluctuations and genuine SUT noise.

A simple model of SUT performance consists of a baseline noiseless performance, and an additional noise. The baseline is assumed to be constant (enough). The noise varies in time, sometimes wildly. The noise can sometimes be negligible, but frequently it lowers the observed SUT performance in a trial.

In this model, SUT does not have a single performance value, it has a spectrum. One end of the spectrum is the noiseless baseline, the other end is a noiseful performance. In practice, trial results close to the noiseful end of the spectrum happen only rarely. The worse performance, the more rarely it is seen in a trial.

Focusing on DUT, the benchmarking effort should aim at eliminating only the SUT noise from SUT measurement. But that is not really possible, as there are no realistic enough models able to distinguish SUT noise from DUT fluctuations.

However, assuming that a well-constructed SUT has the DUT as its performance bottleneck, the "DUT noiseless performance" can be defined as the noiseless end of SUT performance spectrum. (At least for throughput. For other quantities such as latency there will be an additive difference.) By this definition, DUT noiseless performance also minimizes the impact of DUT fluctuations.

In this document, we reduce the "DUT within SUT" problem to estimating the noiseless end of SUT performance spectrum from a limited number of trial results.

Any improvements to throughput search algorithm, aimed for better dealing with software networking SUT and DUT setup, should employ strategies recognizing the presence of SUT noise, and allow discovery of (proxies for) DUT noiseless performance at different levels of sensitivity to SUT noise.

3.3. Repeatability and Comparability

[RFC2544] does not suggest to repeat throughput search. And from just one throughput value, it cannot be determined how repeatable that value is. In practice, poor repeatability is also the main cause of poor comparability, e.g. different benchmarking teams can test the same SUT but get different throughput values.

[RFC2544] throughput requirements (60s trial, no tolerance to single frame loss) force the search to fluctuate close the noiseful end of SUT performance spectrum. As that end is affected by rare trials of significantly low performance, the resulting throughput repeatability is poor.

The repeatability problem is the problem of defining a search procedure which reports more stable results (even if they can no longer be called "throughput" in [RFC2544] sense). According to baseline (noiseless) and noiseful model, better repeatability will be at the noiseless end of the spectrum. Therefore, solutions to the "DUT within SUT" problem will help also with the repeatability problem.

Conversely, any alteration to [RFC2544] throughput search that improves repeatability should be considered as less dependent on the SUT noise.

An alternative option is to simply run a search multiple times, and report some statistics (e.g. average and standard deviation). This can be used for "important" tests, but it makes the search duration problem even more pronounced.

3.4. Throughput with Non-Zero Loss

[RFC1242] (section 3.17) defines throughput as: The maximum rate at which none of the offered frames are dropped by the device.

and then it says: Since even the loss of one frame in a data stream can cause significant delays while waiting for the higher level protocols to time out, it is useful to know the actual maximum data rate that the device can support.

Contrary to that, many benchmarking teams settle with non-zero (small) loss ratio as the goal for a "throughput rate".

Motivations are many: modern protocols tolerate frame loss better; trials nowadays send way more frames within the same duration; impact of rare noise bursts is smaller as the baseline performance can compensate somewhat by keeping the loss ratio below the goal; if SUT noise with "ideal DUT" is known, it can be set as the loss ratio goal.

Regardless of validity of any and all similar motivations, support for non-zero loss goals makes any search algorithm more user-friendly. [RFC2544] throughput is not friendly in this regard.

Searching for multiple goal loss ratios also helps to describe the SUT performance better than a single goal result. Repeated wide gap between zero and non-zero loss conditional throughputs indicates the noise has a large impact on the overall SUT performance.

It is easy to modify the vanilla bisection to find a lower bound for intended load that satisfies a non-zero-loss goal, but it is not that obvious how to search for multiple goals at once, hence the support for multiple loss goals remains a problem.

3.5. Inconsistent Trial Results

While performing throughput search by executing a sequence of measurement trials, there is a risk of encountering inconsistencies between trial results.

The plain bisection never encounters inconsistent trials. But [RFC2544] hints about possibility if inconsistent trial results in two places. The first place is section 24 where full trial durations are required, presumably because they can be inconsistent with results from shorter trial durations. The second place is section 26.3 where two successive zero-loss trials are recommended, presumably because after one zero-loss trial there can be subsequent inconsistent non-zero-loss trial.

Examples include:

  • a trial at the same load (same or different trial duration) results in a different packet loss ratio.
  • a trial at higher load (same or different trial duration) results in a smaller packet loss ratio.

Any robust throughput search algorithm needs to decide how to continue the search in presence of such inconsistencies. Definitions of throughput in [RFC1242] and [RFC2544] are not specific enough to imply a unique way of handling such inconsistencies.

Ideally, there will be a definition of a quantity which both generalizes throughput for non-zero-loss (and other possible repeatibility enhancements), while being precise enough to force a specific way to resolve trial inconsistencies. But until such definition is agreed upon, the correct way to handle inconsistent trial results remains an open problem.

4. How the problems are addressed

Configurable loss ratio in MLRsearch search goals are there in direct support for non-zero-loss conditional throughput. In practice the conditional throughput results' stability increases with higher loss ratio goals.

Multiple trials with noise tolerance enhancement, as implemented in MLRsearch using non-zero goal exceed ratio value, also indirectly increases the result stability. That allows MLRsearch to achieve all the benefits of Binary Search with Loss Verification, as recommended in [RFC9004] (section 6.2) and specified in [TST009] (section 12.3.3).

The main factor improving the overall search time is the introduction of preceding targets. Less impactful time savings are achieved by pre-initial trials, halving mode and smart splitting in bisecting mode.

In several places, MLRsearch is "conservative" when handling (potentially) inconsistent results. This includes the requirement for the relevant lower bound to be smaller than any upper bound, the unequal handling of good and bad short trials, and preference to lower load when choosing the winner among candidates.

While this does no guarantee good search stability (goals focusing on higher loads may still invalidate existing bounds simply by requiring larger min duration sums), it lowers the change of SUT having an area of poorer performance below the reported conditional througput loads. In any case, the definition of conditional throughput is precise enough to dictate "conservative" handling of trial inconsistencies.

5. IANA Considerations

No requests of IANA.

6. Security Considerations

Benchmarking activities as described in this memo are limited to technology characterization of a DUT/SUT using controlled stimuli in a laboratory environment, with dedicated address space and the constraints specified in the sections above.

The benchmarking network topology will be an independent test setup and MUST NOT be connected to devices that may forward the test traffic into a production network or misroute traffic to the test management network.

Further, benchmarking is performed on a "black-box" basis, relying solely on measurements observable external to the DUT/SUT.

Special capabilities SHOULD NOT exist in the DUT/SUT specifically for benchmarking purposes. Any implications for network security arising from the DUT/SUT SHOULD be identical in the lab and in production networks.

7. Acknowledgements

Many thanks to Alec Hothan of OPNFV NFVbench project for thorough review and numerous useful comments and suggestions.

8. References

8.1. Normative References

Bradner, S., "Benchmarking Terminology for Network Interconnection Devices", RFC 1242, DOI 10.17487/RFC1242, , <>.
Mandeville, R., "Benchmarking Terminology for LAN Switching Devices", RFC 2285, DOI 10.17487/RFC2285, , <>.
Bradner, S. and J. McQuaid, "Benchmarking Methodology for Network Interconnect Devices", RFC 2544, DOI 10.17487/RFC2544, , <>.
Morton, A., "Updates for the Back-to-Back Frame Benchmark in RFC 2544", RFC 9004, DOI 10.17487/RFC9004, , <>.

8.2. Informative References

" CSIT Test Methodology - MLRsearch", , <>.
"MLRsearch 0.4.0, Python Package Index", , <>.
"TST 009", n.d., <>.

Authors' Addresses

Maciek Konstantynowicz
Cisco Systems
Vratko Polak
Cisco Systems