Effective Use of MVS Workload Manager Controls
This section gives an overview of the central decision-making part of the
workload management function. It describes how MVS adjusts the CPU and
storage access of work based on customer goals. The MVS component responsible
for gathering the data and making these decisions is the System Resource
Manager (SRM).
Performance monitors frequently use state sampling as a way of determining
where work spends its time. With MVS/ESA V5, SRM will periodically sample
the state of every dispatchable unit. These samples are accumulated for
each address space, and then further accumulated into the service class
period associated with the address space. The states that SRM cares about
are those states which reflect usage of, or delay for, a resource that
SRM can allocate. Those resources are the processor (CPU) and storage.
The states SRM samples include:
- Using the CPU
- Waiting for access to the CPU
- Waiting for resolution of a page fault (separate states are sampled
  depending on the type of page fault)
- Waiting for a swap-in to be started
- Waiting for a swap-in to complete
The collection of state samples is used in several ways. First, the velocity
achieved by an address space or service class period comes directly from
these samples. The velocity goal briefly mentioned earlier is simply a
percentage:
the number of samples where work is using the CPU
divided by
the number of samples where work wanted to use the CPU (either using it,
or waiting for it)
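As a rough illustration (this is not SRM code, and the function name is invented), the velocity calculation can be sketched as:

```python
def execution_velocity(using_samples: int, delay_samples: int) -> float:
    """Velocity as a percentage: samples where work is using the CPU,
    divided by samples where it was using or waiting to use the CPU."""
    demand = using_samples + delay_samples
    if demand == 0:
        return 0.0  # no CPU demand observed in this interval
    return 100.0 * using_samples / demand

# A period observed using the CPU in 30 samples and delayed in 70
# has an execution velocity of 30:
print(execution_velocity(30, 70))  # → 30.0
```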
Besides allowing a direct comparison to a velocity goal, SRM uses the
sampling to determine where work is spending its time. This information
allows SRM to determine what resource is the primary bottleneck for the
work and allows SRM to assess the impact of some possible action it might
take.
It would be foolish to make drastic tuning decisions based on a single
sample. So SRM keeps enough recent history to have a clear picture of delays.
When a service class period has a velocity goal, the amount of history
that SRM needs to keep is determined by the number of address spaces active
in that period. If only a single address space is in the period, SRM
keeps several minutes' worth of samples. With more address spaces, the
history may cover only the most recent 20 seconds or so. In addition to
gathering
data via sampling, SRM maintains information on the response time achieved
by work completing in each service class period. Again, SRM needs to avoid
making a drastic tuning decision based on just a few completions when managing
to a response time goal. Therefore, SRM maintains some history on the response
times for recently completed transactions. That history may extend as far
back as 20 minutes if necessary. But when there are a significant number
of completed transactions, SRM can use much more recent data for making
its tuning decisions.
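The idea of a variable history window can be sketched as follows. The thresholds and return values here are invented for illustration; the real point is simply that fewer address spaces (or fewer transaction completions) force SRM to look further back in time:

```python
def history_window_seconds(active_address_spaces: int) -> int:
    """Illustrative only: fewer address spaces in a period mean fewer
    samples per interval, so a longer history must be kept."""
    if active_address_spaces <= 1:
        return 180   # several minutes' worth of samples
    if active_address_spaces <= 5:
        return 60
    return 20        # enough samplers: recent data suffices

print(history_window_seconds(1))   # → 180
print(history_window_seconds(10))  # → 20
```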
Thus far, this paper has described that SRM has address space samples and
response time data, and it can look at that information over a variable
time period depending on the number of address spaces or the frequency
of completions. In addition, SRM projects the expected response time of
work currently 'in-flight'. Comparing the actual or projected accomplishments
for some piece of work to its goal is very straightforward. Frequently,
SRM looks at every service class period to compare the actual to the goal.
Since there are several types of goals, SRM needs some way to compare
how well or how poorly one service class period is doing relative to other
work. That comparison is possible through the use of a performance index (PI).
The performance index is a calculated value reflecting how well the
work in each service class period is meeting its defined goal over a time
interval. A PI of 1.0 indicates the service class period is exactly
meeting its goal. A PI greater than 1.0 indicates the service
class period is missing its goal, whereas a PI less than 1.0 indicates
the service class period is beating its goal. The PI thus makes it possible
to compare a period having a velocity goal with a period having an average
response time goal with a period having a percentile response time goal,
and determine which is farthest from its goal (has the largest PI).
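A sketch of how a PI makes dissimilar goal types comparable. The formulas for velocity and average response time goals follow directly from the definitions above (PI greater than 1.0 means missing the goal); percentile response time goals are more involved and are omitted here:

```python
def pi_velocity(goal_velocity: float, actual_velocity: float) -> float:
    # Missing a velocity goal means achieving less than the goal,
    # so the ratio exceeds 1.0 when the goal is missed.
    return goal_velocity / actual_velocity

def pi_response_time(goal_seconds: float, actual_seconds: float) -> float:
    # Missing a response time goal means taking longer than the goal.
    return actual_seconds / goal_seconds

# A period with a velocity goal of 40 achieving only 20 (PI = 2.0) is
# farther from its goal than a period with a 0.5-second response time
# goal averaging 0.75 seconds (PI = 1.5):
print(max(pi_velocity(40, 20), pi_response_time(0.5, 0.75)))  # → 2.0
```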
In addition to knowing whether a service class period is meeting its goal,
SRM must assemble some 'insight' before deciding to change resource
allocations. This is accomplished with exactly the same tool used by every
mathematician or analyst studying trends: a plot showing the relationship
between an independent variable, usually on the x-axis, and a dependent
variable, usually on the y-axis. SRM assembles many plots. One example for
swappable work in a service class period is: given a multiprogramming
level (MPL) (x), what is the maximum number of users ready to run (y)?
SRM uses this plot to predict the impact on the number of ready users when
assessing a change to an MPL target. Another plot for swappable work in a
service class period is: given the percentage (x) of ready users who are
actually swapped in, how many milliseconds (y) of swap-in delay occur for
the average transaction? This plot shows how response time might improve
if MPL slots were increased, or the expected impact to response time if
MPL slots were reduced.
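One way to picture such a plot is as a table of observed (x, y) points with interpolation used to predict the effect of a candidate change. The data points below are invented for illustration, and linear interpolation is an assumption, not a statement about SRM's actual curve fitting:

```python
def predict(plot: list[tuple[float, float]], x: float) -> float:
    """Linearly interpolate a y value for a given x from observed points."""
    pts = sorted(plot)
    if x <= pts[0][0]:
        return pts[0][1]
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return pts[-1][1]

# Hypothetical plot: percentage of ready users swapped in (x) versus
# swap-in delay in milliseconds for the average transaction (y).
swap_delay = [(25.0, 400.0), (50.0, 150.0), (100.0, 0.0)]

# Predicted delay if 75% of ready users could be swapped in:
print(predict(swap_delay, 75.0))  # → 75.0
```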
On a timed basis, SRM looks for the service class period of highest importance
that is missing its goal. That period is called the receiver. Using the
samples of delays, SRM figures out what resource might be able to help
that receiver. When the resource is identified, SRM looks for other work
that can donate the resource. By analyzing the sampled state information
and plots and historical data for both the donor and receiver, SRM determines
whether the donation is a good idea or not.
The above paragraph is a very simplistic statement of the approach SRM
takes. SRM does not wait for a service class period to miss its goal before
taking action; it can select a receiver even when the PI is less than 1.0.
And when looking for a resource that could help the receiver, SRM attacks
delays in order of greatest expected benefit.
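The receiver-selection policy can be sketched as follows. The period records and field names are invented for illustration, and real SRM logic weighs far more state than this:

```python
def choose_receiver(periods):
    """Pick the most important period (lowest importance number) that is
    missing its goal, breaking ties by worst PI. If nothing is missing
    its goal, SRM may still pick a candidate (PI below 1.0)."""
    missing = [p for p in periods if p["pi"] > 1.0]
    candidates = missing or list(periods)
    return min(candidates, key=lambda p: (p["importance"], -p["pi"]))

periods = [
    {"name": "BATCH",   "importance": 5, "pi": 3.0},
    {"name": "BANKING", "importance": 1, "pi": 1.4},
    {"name": "TSO",     "importance": 2, "pi": 0.8},
]

# BANKING is chosen: it is missing its goal at the highest importance,
# even though BATCH has a larger PI:
print(choose_receiver(periods)["name"])  # → BANKING
```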
Suppose the receiver that is identified is a service class of CICS banking
transactions (BANKING). Before SRM can know how to help that service class,
SRM has to identify all the address spaces which are involved in handling
those banking transactions. CICS/ESA 4.1 will identify its regions by using
new programming interfaces introduced in MVS/ESA SP V5.
All terminal-owning regions (TORs), application-owning regions (AORs),
and file-owning regions (FORs) at the CICS/ESA V4.1 level are identified.
SRM creates internal service classes on each MVS image to define the topology
of the regions which are all responsible for those BANKING transactions.
Then the address space samples collected for those regions are used to
determine which delays are being experienced by the regions, and to assess
how SRM can help them.
The above sections (describing how SRM calculates a PI and chooses a receiver
based on goal importance and gathered data) intentionally ignored resource
groups. However, resource group minimums are honored first.
Dictating the processor cycles available to work can be in direct
conflict with the goals set for that same work or for other work. If some
work is running in a resource group with a minimum constraint specified,
and that work is demanding CPU capacity, SRM attempts to provide that
work with the specified minimum capacity, even at the expense of the goals
of other work.
Before choosing a receiver as described in the earlier section, SRM
will help any resource groups that are not meeting their specified minimum
service units per second, as long as some work in that resource group is
either missing its own goal, or has a discretionary goal.
Likewise, if a given resource group has exceeded its maximum allowed
capacity, the MVS Workload Manager will 'cap' that work even at the
expense of that work's goals. This can occur even if the work was just
selected as the receiver most in need of help.
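The ordering described above (minimums honored before goal-based receiver selection, maximums enforced regardless of goals) can be sketched as follows; the function and its thresholds are illustrative, not SRM internals:

```python
def classify_resource_group(consumed_su_per_sec, group_min, group_max):
    """Illustrative: decide whether a resource group needs help toward
    its minimum, must be capped at its maximum, or is left to
    goal-based management."""
    if consumed_su_per_sec < group_min:
        return "help-to-minimum"   # honored before receiver selection
    if group_max is not None and consumed_su_per_sec > group_max:
        return "cap"               # enforced even if goals are missed
    return "manage-by-goals"

# A group consuming 80 service units/second against a minimum of 100
# is helped first; one consuming 600 against a maximum of 500 is capped:
print(classify_resource_group(80, 100, 500))   # → help-to-minimum
print(classify_resource_group(600, 100, 500))  # → cap
```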