Effective Use of MVS Workload Manager Controls
This section gives an overview of the central decision-making part of the
workload management function. It describes how MVS adjusts the CPU and
storage access of work based on customer goals. The MVS component responsible
for gathering the data and making these decisions is the System Resource
Manager (SRM).
Performance monitors frequently use state sampling as a way of determining
where work spends its time. With MVS/ESA V5, SRM will periodically sample
the state of every dispatchable unit. These samples are accumulated for
each address space, and then further accumulated into the service class
period associated with the address space. The states that SRM cares about
are those states which reflect usage of, or delay for, a resource that
SRM can allocate. Those resources are the processor (CPU) and storage.
The states SRM samples include:
- Using the CPU
- Waiting for access to the CPU
- Waiting for resolution of a page fault (separate states are sampled
  depending on the type of page fault)
- Waiting for a swap-in to be started
- Waiting for a swap-in to complete
The collection of state samples is used in several ways. First, the velocity
achieved by an address space or service class period comes directly from
these samples. The velocity goal briefly mentioned earlier is simply a
percentage:
the number of samples where work is using the CPU
divided by
the number of samples where work wanted to use the CPU (either using it,
or waiting for it)
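As a rough illustration (this is not SRM code, and the function name is invented), the velocity calculation can be sketched as:

```python
def execution_velocity(using_samples: int, delay_samples: int) -> float:
    """Velocity as a percentage: samples where work is using the CPU,
    divided by samples where it was using or waiting to use the CPU."""
    demand = using_samples + delay_samples
    if demand == 0:
        return 0.0  # no CPU demand observed in this interval
    return 100.0 * using_samples / demand

# A period observed using the CPU in 30 samples and delayed in 70
# has an execution velocity of 30:
print(execution_velocity(30, 70))  # → 30.0
```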
Besides allowing a direct comparison to a velocity goal, SRM uses the
sampling to determine where work is spending its time. This information
allows SRM to determine what resource is the primary bottleneck for the
work and allows SRM to assess the impact of some possible action it might
take.
It would be foolish to make drastic tuning decisions based on a single
sample. So SRM keeps enough recent history to have a clear picture of delays.
When a service class period has a velocity goal, the amount of history
that SRM needs to keep is determined by the number of address spaces active
in that period. If only a single address space is in the period, SRM
keeps several minutes' worth of samples. With more address spaces, the
history may cover only the most recent 20 seconds or so. In addition to
gathering
data via sampling, SRM maintains information on the response time achieved
by work completing in each service class period. Again, SRM needs to avoid
making a drastic tuning decision based on just a few completions when managing
to a response time goal. Therefore, SRM maintains some history on the response
times for recently completed transactions. That history may extend as far
back as 20 minutes if necessary. But when there are a significant number
of completed transactions, SRM can use much more recent data for making
its tuning decisions.
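The idea of a variable history window can be sketched as follows. The thresholds and return values here are invented for illustration; the real point is simply that fewer address spaces (or fewer transaction completions) force SRM to look further back in time:

```python
def history_window_seconds(active_address_spaces: int) -> int:
    """Illustrative only: fewer address spaces in a period mean fewer
    samples per interval, so a longer history must be kept."""
    if active_address_spaces <= 1:
        return 180   # several minutes' worth of samples
    if active_address_spaces <= 5:
        return 60
    return 20        # enough samplers: recent data suffices

print(history_window_seconds(1))   # → 180
print(history_window_seconds(10))  # → 20
```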
Thus far, this paper has described that SRM has address space samples and
response time data, and it can look at that information over a variable
time period depending on the number of address spaces or the frequency
of completions. In addition, SRM projects the expected response time of
work currently 'in-flight'. Comparing the actual or projected accomplishments
for some piece of work to its goal is very straightforward. Frequently,
SRM looks at every service class period to compare the actual to the goal.
Since there are several types of goals, SRM needs some way to compare
how well or how poorly one service class period is doing relative to other
work. That comparison is possible through the use of a performance index (PI).
The performance index is a calculated value reflecting how well the
work in each service class period is meeting its defined goal over a time
interval. A PI of 1.0 indicates the service class period is exactly
meeting its goal. A PI greater than 1.0 indicates the service
class period is missing its goal, whereas a PI less than 1.0 indicates
the service class period is beating its goal. The PI thus makes it possible
to compare a period having a velocity goal with a period having an average
response time goal with a period having a percentile response time goal,
and determine which is farthest from its goal (has the largest PI).
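A sketch of how a PI makes dissimilar goal types comparable. The formulas for velocity and average response time goals follow directly from the definitions above (PI greater than 1.0 means missing the goal); percentile response time goals are more involved and are omitted here:

```python
def pi_velocity(goal_velocity: float, actual_velocity: float) -> float:
    # Missing a velocity goal means achieving less than the goal,
    # so the ratio exceeds 1.0 when the goal is missed.
    return goal_velocity / actual_velocity

def pi_response_time(goal_seconds: float, actual_seconds: float) -> float:
    # Missing a response time goal means taking longer than the goal.
    return actual_seconds / goal_seconds

# A period with a velocity goal of 40 achieving only 20 (PI = 2.0) is
# farther from its goal than a period with a 0.5-second response time
# goal averaging 0.75 seconds (PI = 1.5):
print(max(pi_velocity(40, 20), pi_response_time(0.5, 0.75)))  # → 2.0
```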
In addition to knowing whether a service class period is meeting its goal,
SRM must assemble some 'insight' before deciding to change resource
allocations. This is accomplished with exactly the same tool used by every
mathematician or analyst studying trends: a plot showing the relationship
between an independent variable, usually on the x-axis, and a dependent
variable, usually on the y-axis. SRM assembles many plots. One example for
swappable work in a service class period is: given a multiprogramming
level (MPL) (x), what is the maximum number of users ready to run (y)?
SRM uses this plot to predict the impact on the number of ready users when
assessing a change to an MPL target. Another plot for swappable work in a
service class period is: given the percentage (x) of ready users who are
actually swapped in, how many milliseconds (y) of swap-in delay occur for
the average transaction? This plot shows how response time might improve
if MPL slots were increased, or the expected impact to response time if
MPL slots were reduced.
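One way to picture such a plot is as a table of observed (x, y) points with interpolation used to predict the effect of a candidate change. The data points below are invented for illustration, and linear interpolation is an assumption, not a statement about SRM's actual curve fitting:

```python
def predict(plot: list[tuple[float, float]], x: float) -> float:
    """Linearly interpolate a y value for a given x from observed points."""
    pts = sorted(plot)
    if x <= pts[0][0]:
        return pts[0][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return pts[-1][1]

# Hypothetical plot: percentage of ready users swapped in (x) versus
# swap-in delay in milliseconds for the average transaction (y).
swap_delay = [(25.0, 400.0), (50.0, 150.0), (100.0, 0.0)]

# Predicted delay if 75% of ready users could be swapped in:
print(predict(swap_delay, 75.0))  # → 75.0
```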
On a timed basis, SRM looks for the service class period of highest importance
that is missing its goal. That period is called the receiver. Using the
samples of delays, SRM figures out what resource might be able to help
that receiver. When the resource is identified, SRM looks for other work
that can donate the resource. By analyzing the sampled state information
and plots and historical data for both the donor and receiver, SRM determines
whether the donation is a good idea or not.
The above paragraph is a very simplistic statement of the approach SRM
takes. SRM does not wait for a service class period to miss its goal before
taking action; it can select a receiver even when the PI is less than 1.0.
And when looking for a resource that could help the receiver, SRM attacks
delays in order of greatest expected benefit.
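The receiver-selection policy can be sketched as follows. The period records and field names are invented for illustration, and real SRM logic weighs far more state than this:

```python
def choose_receiver(periods):
    """Pick the most important period (lowest importance number) that is
    missing its goal, breaking ties by worst PI. If nothing is missing
    its goal, SRM may still pick a candidate (PI below 1.0)."""
    missing = [p for p in periods if p["pi"] > 1.0]
    candidates = missing or list(periods)
    return min(candidates, key=lambda p: (p["importance"], -p["pi"]))

periods = [
    {"name": "BATCH",   "importance": 5, "pi": 3.0},
    {"name": "BANKING", "importance": 1, "pi": 1.4},
    {"name": "TSO",     "importance": 2, "pi": 0.8},
]

# BANKING is chosen: it is missing its goal at the highest importance,
# even though BATCH has a larger PI:
print(choose_receiver(periods)["name"])  # → BANKING
```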
Suppose the receiver that is identified is a service class of CICS banking
transactions (BANKING). Before SRM can know how to help that service class,
SRM has to identify all the address spaces which are involved in handling
those banking transactions. CICS/ESA 4.1 will identify its regions by using
new programming interfaces introduced in MVS/ESA SP V5.
All terminal-owning regions (TORs), application-owning regions (AORs),
and file-owning regions (FORs) at the CICS/ESA V4.1 level are identified.
SRM creates internal service classes on each MVS image to define the topology
of the regions which are all responsible for those BANKING transactions.
Then the address space samples collected for those regions are used to
determine which delays are being experienced by the regions, and to assess
how SRM can help them.
The above sections (describing how SRM calculates a PI and chooses a receiver
based on goal importance and gathered data) intentionally ignored resource
groups. However, resource group minimums are honored first.
Dictating the processor cycles available to work can be in direct
conflict with the goals set for that same work or for other work. If some
work is running in a resource group with a minimum constraint specified,
and that work is demanding CPU capacity, SRM attempts to provide that
work with the specified minimum capacity, even at the expense of the goals
of other work.
Before choosing a receiver as described in the earlier section, SRM
will help any resource groups that are not meeting their specified minimum
service units per second, as long as some work in that resource group is
either missing its own goal, or has a discretionary goal.
Likewise, if a given resource group has exceeded its maximum allowed
capacity, the MVS Workload Manager will 'cap' that work even at the
expense of that work's goals. This can occur even if the work was just
selected as the receiver most in need of help.
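The ordering described above (minimums honored before goal-based receiver selection, maximums enforced regardless of goals) can be sketched as follows; the function and its thresholds are illustrative, not SRM internals:

```python
def classify_resource_group(consumed_su_per_sec, group_min, group_max):
    """Illustrative: decide whether a resource group needs help toward
    its minimum, must be capped at its maximum, or is left to
    goal-based management."""
    if consumed_su_per_sec < group_min:
        return "help-to-minimum"   # honored before receiver selection
    if group_max is not None and consumed_su_per_sec > group_max:
        return "cap"               # enforced even if goals are missed
    return "manage-by-goals"

# A group consuming 80 service units/second against a minimum of 100
# is helped first; one consuming 600 against a maximum of 500 is capped:
print(classify_resource_group(80, 100, 500))   # → help-to-minimum
print(classify_resource_group(600, 100, 500))  # → cap
```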