Effective Use of MVS Workload Manager Controls
In preparing an initial set of goals, the performance analyst can draw on
several different sources. The following three inputs will be discussed:
- Current IPS/ICS setup
- Service level objectives based on business needs
- Historical data
An installation could use its current IPS/ICS configuration as a basis
for its MVS Workload Manager service class configuration and as a basis
for its Workload Manager classification rules. For example, most installations
already have a multi-period performance group set up for the production
TSO work. Each period is defined with a particular duration. These installations
should find it very easy to set up a service class to mirror this type
of TSO performance group. Other ways the IPS/ICS may help are:
Service definition coefficients
A service unit is a measure of resources consumed by an address space.
Each installation can tailor, or weight, the different components of service
by specifying weighting factors. Those weighting factors are the service
definition coefficients (CPU, IOC, MSO, and SRB). These coefficients are
specified in the IPS, and are needed in a WLM service definition as well.
An installation could simply continue using the same coefficients. That
might be wise if the installation uses billing tools based on service units
and those tools cannot be changed. Note that such tools may need to be changed
for goal mode anyway, since presumably they are sensitive to performance group
number (PGN). If so, define report classes that correspond to those PGNs, rather
than service classes, so that the tools can continue to be used in goal mode.
Define service classes primarily for SRM management purposes, and report
classes primarily for reporting purposes.
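As a minimal sketch of the weighting, total service can be modelled as the sum of each raw service component multiplied by its coefficient. The raw values below, and the `Coefficients` and `total_service` names, are hypothetical illustrations, not an IBM interface. The example also shows how heavily an MSO coefficient of 3.00 can inflate the total, which bears on the MSO discussion that follows.

```python
# A minimal sketch of how service definition coefficients weight raw service.
# Assumption: the raw (unweighted) service for each component is already known;
# the numbers below are hypothetical, not taken from any real system.
from dataclasses import dataclass


@dataclass(frozen=True)
class Coefficients:
    cpu: float
    ioc: float
    mso: float
    srb: float


def total_service(raw: dict[str, float], c: Coefficients) -> float:
    """Weighted service: each raw component times its coefficient, summed."""
    return (raw["cpu"] * c.cpu
            + raw["ioc"] * c.ioc
            + raw["mso"] * c.mso
            + raw["srb"] * c.srb)


# Hypothetical raw service consumed by one address space.
raw = {"cpu": 1000.0, "ioc": 200.0, "mso": 50000.0, "srb": 100.0}

with_mso = Coefficients(cpu=10.0, ioc=5.0, mso=3.0, srb=10.0)
without_mso = Coefficients(cpu=10.0, ioc=5.0, mso=0.0, srb=10.0)

print(total_service(raw, with_mso))     # 162000.0: storage dominates the total
print(total_service(raw, without_mso))  # 12000.0: only CPU, I/O and SRB count
```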
If there are no restrictions on the coefficients, this could be a good
time to change them. The MSO coefficient has generally been carried at
the default value of 3.00. This was appropriate when storage was a very
scarce resource, but for most installations today storage should not
contribute so heavily to the total service. For several years the
recommendation has been to change MSO to either 0.000 or 0.0001, and it is
recommended that this be done during a system's migration to goal mode.
Another change to consider now is to reset the CPU and SRB coefficients
to 1.0. That way the service units per second defined for resource groups
will be expressed directly in terms of CPU and SRB service units (since
multiplying by a coefficient of 1 does not change the resulting product).
Many installations use CPU=10.0, IOC=5.0, SRB=10.0 today. Each type
of service would continue to contribute the same proportion of total service
if these three coefficients were divided by 10, giving CPU=1.0, IOC=0.5,
SRB=1.0, assuming of course that the MSO coefficient is set to 0.
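The claim that dividing CPU, IOC and SRB by the same factor preserves each component's proportion can be checked with a short Python sketch. Only the coefficient values come from the text; the raw service values and the helper names are hypothetical.

```python
# Sketch: dividing every coefficient by the same factor leaves each component's
# share of total service unchanged. Raw service values are hypothetical.
import math


def rescale(coeffs: dict[str, float], divisor: float) -> dict[str, float]:
    """Divide every coefficient by the same divisor."""
    return {name: value / divisor for name, value in coeffs.items()}


def shares(coeffs: dict[str, float], raw: dict[str, float]) -> dict[str, float]:
    """Fraction of total weighted service contributed by each component."""
    weighted = {name: raw[name] * coeffs[name] for name in coeffs}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


current = {"cpu": 10.0, "ioc": 5.0, "srb": 10.0, "mso": 0.0}
proposed = rescale(current, 10.0)
print(proposed)  # {'cpu': 1.0, 'ioc': 0.5, 'srb': 1.0, 'mso': 0.0}

raw = {"cpu": 1000.0, "ioc": 200.0, "srb": 100.0, "mso": 50000.0}
before, after = shares(current, raw), shares(proposed, raw)
print(all(math.isclose(before[n], after[n]) for n in current))  # True
```

The total number of service units shrinks by the same factor, which is why durations must change too.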
Durations
A service class goal is valid for a specified duration, just as performance
group periods can specify durations. Since the duration is expressed in
service units consumed, if the service unit coefficients are changed, the
durations specified for multi-period service classes must be changed
correspondingly. Durations are probably used in PGNs only for TSO and batch
work, so those are also the most likely service classes to have durations.
Realize that a change to the meaning of a service unit can significantly
alter the meaning of "TSO trivial" unless the corresponding duration is
adjusted as well.
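A hedged sketch of the corresponding duration adjustment, assuming every non-zero coefficient was changed by one common factor (as in the CPU=10.0 to CPU=1.0 example): the durations then scale by that same factor. The period durations of 400, 1000 and 4000 service units are hypothetical, and the helper names are illustrative only.

```python
# Sketch: when every non-zero coefficient is scaled by one common factor, period
# durations (in service units) must be scaled by the same factor. The duration
# values are hypothetical TSO periods, not recommendations.


def common_factor(old: dict[str, float], new: dict[str, float]) -> float:
    """Return new/old if all coefficients changed by one factor; else raise."""
    factors = {new[k] / old[k] for k in old if old[k] != 0}
    zero_became_nonzero = any(new[k] != 0 for k in old if old[k] == 0)
    if len(factors) != 1 or zero_became_nonzero:
        raise ValueError("coefficients were not scaled uniformly; "
                         "review each duration individually")
    return factors.pop()


def rescale_durations(durations: list[int], factor: float) -> list[int]:
    """Scale each period duration, keeping at least 1 service unit."""
    return [max(1, round(d * factor)) for d in durations]


old = {"cpu": 10.0, "ioc": 5.0, "srb": 10.0, "mso": 0.0}
new = {"cpu": 1.0, "ioc": 0.5, "srb": 1.0, "mso": 0.0}

factor = common_factor(old, new)
print(rescale_durations([400, 1000, 4000], factor))  # [40, 100, 400]
```

If the coefficients did not all change by one factor (for example, MSO dropped from 3.00 to 0 while the others stayed the same), the sketch raises an error rather than guessing, because the effect on each duration then depends on how much of that work's service came from storage.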
Insight from RMF reports
If the installation understands what is running in each domain and each
performance group, its RMF reports can help in preparing the WLM policy.
RMF Version 5 reports the velocity experienced by each domain and each PGN,
which will be very valuable in selecting the velocity goal for a group of
started tasks or long-running batch work.
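Execution velocity is the percentage of samples in which work was using a resource, out of all samples in which it was either using or delayed for one. The sketch below, with hypothetical sample counts, shows the calculation and why the samples of several started tasks should be pooled before computing a velocity for the group, rather than averaging individual velocities. No RMF record layout is assumed.

```python
# Sketch of the execution velocity calculation from state samples:
#   velocity = 100 * using / (using + delay)
# The sample counts are hypothetical.


def velocity(using: int, delay: int) -> float:
    total = using + delay
    if total == 0:
        raise ValueError("no using or delay samples")
    return 100.0 * using / total


def group_velocity(samples: list[tuple[int, int]]) -> float:
    """Velocity of several address spaces treated as one group: pool samples."""
    return velocity(sum(u for u, _ in samples), sum(d for _, d in samples))


stcs = [(30, 70), (45, 5)]                 # (using, delay) per started task
print([velocity(u, d) for u, d in stcs])   # [30.0, 90.0]
print(group_velocity(stcs))                # 50.0, not the simple average 60.0
```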
As a source of goals, however, the current IPS/ICS setup discussed above is
of limited value. For instance, studying the IPS/ICS may not give any insight
into CICS response times, and it certainly will not provide velocity goals
for started tasks or IMS response times.
The installation's existing service objectives provide a good starting
point for the goals to be defined to WLM. Possibly some batch classes are
already documented as requiring a given turnaround time. Probably interactive
work is already documented as requiring a specified response time.
There are three items to remember when looking at service level objectives
as a source of information for the WLM policy:
- WLM's response time goals describe system response time, without including
network delays at the start or the end of the transaction.
- The objectives document is probably concise and simple, specifying just a
few goals. Contrast that with the large number of PGNs in the installation's
current IPS member. The WLM policy should emulate the service level objectives
by having a few goals, but can mirror the IPS by using report classes where
extra reporting data is needed. This shift of emphasis from a large number
of PGNs to a small number of service class goals is one reason why, in most
cases, no tool can generate an effective sample WLM policy from the IPS.
Installations should approach WLM with a different mind-set than the one
they brought to the IPS.
- A service objective frequently defines the worst case. For example, a
customer's service agreement may state that turnaround for batch class A is
30 minutes while the installation is delivering a 5-minute turnaround today.
It would probably be unwise to simply pass the 30-minute service objective
to WLM, because end users may become irritated by such a drastic change from
their expectations. Instead, use the service objectives as guidelines for
setting goals, and place more emphasis on historical data.
RMF Workload Activity reports are a good source of information on the
average response times currently being achieved for each TSO period. If an
installation expects roughly the same percentage of TSO transactions to
complete in each period in goal mode as completed in each period in
compatibility mode, then that historical data is a good starting point for
the response time goal of each TSO service class period.
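The approach can be sketched as follows, assuming the per-period counts of ended transactions and the average response times have already been read from an RMF Workload Activity report. The numbers and the `PeriodHistory` structure are hypothetical.

```python
# Sketch: derive starting response time goals for TSO service class periods from
# compatibility-mode history. Values are hypothetical stand-ins for numbers read
# off an RMF Workload Activity report.
from dataclasses import dataclass


@dataclass
class PeriodHistory:
    period: int
    ended: int             # transactions that ended in this period
    avg_response_s: float  # average response time of those transactions


def completion_percentages(history: list[PeriodHistory]) -> dict[int, float]:
    """Percentage of all transactions that completed in each period."""
    total = sum(p.ended for p in history)
    return {p.period: 100.0 * p.ended / total for p in history}


def starting_goals(history: list[PeriodHistory]) -> dict[int, float]:
    """Use each period's historical average response time as its first goal."""
    return {p.period: p.avg_response_s for p in history}


tso = [PeriodHistory(1, 8000, 0.2),
       PeriodHistory(2, 1500, 1.5),
       PeriodHistory(3, 500, 8.0)]

print(completion_percentages(tso))  # {1: 80.0, 2: 15.0, 3: 5.0}
print(starting_goals(tso))          # {1: 0.2, 2: 1.5, 3: 8.0}
```

The historical averages are reasonable first goals only if goal mode is expected to reproduce roughly the same completion percentages per period.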
The RMF V5 compatibility mode Workload Activity report is also a good
source of the velocity achieved by PGNs and domains, which gives insight
into the velocity goals to specify for service classes of started tasks
and long-running batch work.
In goal mode, a batch transaction is a batch job. Most installations
already have historical data on average batch elapsed time. Realize that
if an installation uses PERFORM= on the JCL of individual job steps, and
supports that via OPGN specifications in the ICS, then the RMF Workload
Activity reports from compatibility mode should not be used to determine
average response times for PGNs of batch work. Each step with a unique
PERFORM value constitutes an individual transaction in compatibility mode,
so that historical RMF data is not expressed in terms of whole batch jobs.
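A small Python illustration of the distortion, using hypothetical step records in which each step ran under its own PERFORM= value: averaging step-level transactions gives a different answer than averaging whole jobs. The job IDs, step names and timings are invented for the example.

```python
# Sketch: why step-level compatibility-mode transactions do not describe batch
# jobs. Each record is one step that ran under its own PERFORM= value; all data
# is hypothetical.
from collections import defaultdict
from statistics import mean

# (job id, step name, step elapsed minutes)
step_records = [
    ("JOB00101", "EXTRACT", 2.0),
    ("JOB00101", "REPORT", 3.0),
    ("JOB00102", "SORT", 10.0),
]

per_step_average = mean(elapsed for _, _, elapsed in step_records)

job_elapsed: dict[str, float] = defaultdict(float)
for job, _, elapsed in step_records:
    job_elapsed[job] += elapsed   # ignores time between steps and JES queue time
per_job_average = mean(job_elapsed.values())

print(per_step_average)  # 5.0
print(per_job_average)   # 7.5
```

Because summing steps still omits JES queue time and gaps between steps, a goal-mode job response time would typically be larger than even the per-job figure here.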
Possibly an installation already has average response time data for
interactive OLTP transactions from CICS Performance Monitor or IMSPARS.
These could be the source of average response time goals for the WLM service
definition. In addition, reference 1 documents a way to use RMF in
compatibility mode to find the same data for IMS or CICS transactions.
It should be fairly obvious that one cannot specify what goal to achieve
for a particular type of work without knowing how a transaction is defined
for that type of work.
Many customers care about end-user response time at a terminal, but there
is nothing the base control program can do to affect network delays. Nor
is there any way to synchronize the clocks of intelligent terminals and
controllers with the Sysplex Timer, and distributing CPU cycles and storage
resources based on unsynchronized clocks would be a very bad idea. Therefore,
the goals specified in WLM's policies for interactive work are host response
times.
TSO Transactions
A TSO transaction for WLM's policy is identical to a TSO transaction as
reported via RMF for the past 20 years. The installation still has a choice
via IEAOPTxx whether a CLIST should be counted as a single transaction
or a collection of individual transactions.
JES Transactions
A batch job is one transaction. It starts when the job is submitted (when
the JES reader processes the job), and completes when the initiator finishes
executing the job. That means it does include the time queued by JES waiting
for an initiator, but it does not include output processing.
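The transaction boundaries just described can be sketched with hypothetical timestamps. The `BatchJob` fields are illustrative names, not SMF record fields; the point is that response time runs from reader processing to the end of execution, so it includes the initiator queue time but excludes output processing.

```python
# Sketch of the JES transaction boundaries described above. Timestamps are
# hypothetical; the field names are illustrative, not SMF record fields.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class BatchJob:
    read_in: datetime      # JES reader processed the job (transaction starts)
    selected: datetime     # an initiator selected the job
    ended: datetime        # initiator finished executing the job (transaction ends)
    output_done: datetime  # output processing finished (not part of the transaction)


def response_time(job: BatchJob) -> timedelta:
    return job.ended - job.read_in


def queue_time(job: BatchJob) -> timedelta:
    return job.selected - job.read_in


job = BatchJob(read_in=datetime(1995, 6, 1, 9, 0, 0),
               selected=datetime(1995, 6, 1, 9, 4, 0),
               ended=datetime(1995, 6, 1, 9, 10, 30),
               output_done=datetime(1995, 6, 1, 9, 15, 0))

print(response_time(job))  # 0:10:30, which includes the 4 minutes spent queued
print(queue_time(job))     # 0:04:00
```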
CICS Transactions
A CICS transaction begins when the initial CICS region receives a message,
generally from VTAM, and ends when that region returns the result to VTAM.
If the transaction is routed to another CICS region (AOR, FOR, etc.), the
time spent processing in those other regions accumulates in the response
time of the original transaction.
This transaction information is passed to WLM beginning with CICS/ESA
V4.1.
IMS Transactions
An IMS transaction begins when the control region receives a message from
VTAM, and ends when the control region passes the response back to the
network.
A customer has the option to have inserted programs (program-to-program
switch) treated either as new transactions or as a continuation of the
existing transaction.
IMS transaction information is passed to WLM beginning with IMS/ESA
V5.1.
Distributed DB2 Transactions
A goal can be specified for a distributed DB2 transaction. This is a request
arriving remotely across the network via Distributed Relational Database
Architecture. Once again, the start time is the arrival time from the network
and the transaction ends at the commit point when DB2 completes the distributed
request.
This transaction information is passed to WLM beginning with DB2 V4.1.
Avoid repetitive experimentation with velocity goals in the hope of forcing
a specific dispatching priority, or dispatching priority order. MVS Version
5 has changed the dispatcher significantly to provide equal access for
work at the same dispatching priority. Previous concepts of what work can
or cannot share the same priority may no longer be valid.
Also, the MVS Workload Manager makes resource allocation decisions dynamically
throughout the day. It recognizes workload peaks and valleys over much smaller
intervals of time than installations are used to controlling. Therefore,
the MVS Workload Manager may determine that a given workload can run with
a lower dispatching priority during one part of the day than during the
rest of the day and still meet its goal.
Using RMF Monitor II, one can see the dispatching priority and storage
isolation targets change during the day. Remember that the absolute value
of the dispatching priority at any given time is not relevant; only its
relationship to other priorities is. SRM dynamically adjusts that
relationship by continually assessing what resource is needed by the
highest-importance work that is missing its goals, and which work can
donate some of that resource. If the resource that is needed or donated
is CPU, the relative dispatching priorities will change.
One could consider the job of the MVS Workload Manager as 'to work towards
goals' rather than 'to meet all goals'. Just because one is able to specify
a particular goal does not mean that the resources are available to meet
that goal. The MVS Workload Manager is continually monitoring how well
goals are being met. It will attempt to achieve higher-importance goals
before lower-importance goals. If a goal is set too aggressively, the MVS
Workload Manager may repeatedly try to help the work associated with the
goal, only to determine that nothing more can be done for it (or that
stealing the resources it needs from other work is not a good trade, given
that work's goal and importance). In such cases SRM proceeds to help other
work, but cycles have been wasted trying to help work that will never meet
its goal.
One requirement IBM has heard repeatedly is for a simple way for installations
to define goals to MVS, and WLM is designed to provide that. It is unlikely
an installation will have hundreds of different business goals. Therefore,
defining service classes only where different types of work need them keeps
the definitions to MVS simple.
Besides the argument for simplicity, there is a good technical reason
to keep the number of service classes down. SRM decides how to allocate
system resources based on sampling. As more work is combined into a service
class, more entities contribute samples, so SRM does not need to go as far
back in time to obtain a statistically significant set of samples. This
means SRM can make better decisions more quickly, and therefore be more
responsive, rather than depending on sampled data from several minutes ago.
For example, many started tasks spend over 80% of their time in an intentional
timer wait. So less than 20% of the samples from those address spaces would
give any insight for SRM into how to manage the dispatching priority or
storage access of the work.
It may take over 10 minutes to gather sufficient samples from a single
address space that was active only 20% of the time. However, if 20 similar
started tasks are combined in the same service class, SRM would have a
sufficient number of samples within 30 seconds to decide whether an
adjustment in resources is appropriate. This allows SRM to be much more
responsive.
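The two figures in the example are consistent with each other: 20 address spaces observed for 30 seconds give the same 600 address-space-seconds as one address space observed for 600 seconds (10 minutes). The sketch below makes that arithmetic explicit. The 250 ms sampling interval and the target of 480 useful samples are assumptions chosen only so that the single-space case comes out at the 10 minutes mentioned in the text; they are not SRM internals.

```python
# Worked arithmetic for the sampling example above. Assumptions (not SRM
# internals): each address space is sampled once every 250 ms, and a decision
# needs 480 "useful" samples, i.e. samples taken while the work was active.
SAMPLE_INTERVAL_MS = 250   # hypothetical sampling interval
SAMPLES_NEEDED = 480       # hypothetical target for a significant sample set


def seconds_to_collect(spaces: int, active_pct: int) -> float:
    """Seconds until `spaces` address spaces, each active `active_pct` percent
    of the time, yield SAMPLES_NEEDED useful samples."""
    # useful samples per second = spaces * (1000 / interval_ms) * (active_pct / 100);
    # rearranged so that a single division is done on exact integers.
    return (SAMPLES_NEEDED * SAMPLE_INTERVAL_MS * 100) / (spaces * active_pct * 1000)


print(seconds_to_collect(1, 20))   # 600.0 seconds, i.e. 10 minutes
print(seconds_to_collect(20, 20))  # 30.0 seconds
```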
What should be a typical number of service classes? Most customers will
probably use around 25 service classes. This allows 4 or 5 for started
task service classes, a handful for different batch classes, and plenty
for the various interactive work being run by IMS or CICS or TSO or APPC,
etc.
All work with a discretionary goal competes for processor access behind
all other work with a velocity or response time goal.
Therefore, a discretionary goal works best if a significant percentage
of the processor cycles is available to work with discretionary goals.
This is due to the behavior of the mean-time-to-wait processor algorithm,
which keeps I/O-intensive work at a higher priority than CPU-intensive
work.
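The mean-time-to-wait idea can be illustrated with a deliberately simplified sketch: work that uses the processor only briefly before waiting (typically I/O-bound) is ranked ahead of work that holds the processor longer (CPU-bound). This illustrates the principle only and is not SRM's actual algorithm; the address space names and measurements are hypothetical.

```python
# Simplified sketch of the mean-time-to-wait principle. Not SRM's actual
# algorithm; names and numbers are hypothetical.
from dataclasses import dataclass


@dataclass
class Observation:
    name: str
    cpu_ms: float  # processor time used during the interval
    waits: int     # times the work voluntarily waited (for example, for I/O)

    @property
    def mean_time_to_wait(self) -> float:
        """Average processor time used between waits."""
        return self.cpu_ms / max(self.waits, 1)


def mttw_order(work: list[Observation]) -> list[str]:
    """Highest priority first: shortest mean time to wait."""
    return [w.name for w in sorted(work, key=lambda w: w.mean_time_to_wait)]


discretionary = [
    Observation("CPUHOG", cpu_ms=900.0, waits=3),    # 300 ms per wait
    Observation("IOBOUND", cpu_ms=100.0, waits=50),  # 2 ms per wait
]
print(mttw_order(discretionary))  # ['IOBOUND', 'CPUHOG']
```

When few cycles are left over for discretionary work, the CPU-intensive work at the bottom of such an ordering is the first to be starved.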
SYSOTHER is one of the 3 pre-defined service classes. It is assigned to
certain work which a customer does not even care enough about to classify.
For example, if a WLM policy does not have any rules to assign a service
class to APPC transactions, when WLM sees one it will be associated with
SYSOTHER and therefore have a discretionary goal.
There is no real problem with having work fall into the SYSOTHER service
class. It is just an indication that the work was probably not discussed
and prioritized along with the other work that was classified.
SYSOTHER will be assigned for unclassified work from the following subsystem
types: APPC, Distributed DB2, JES, OPEN/MVS and TSO/E. Started tasks are
not associated with SYSOTHER. See the section entitled 'Assigning Service
Classes to Started Tasks and System Spaces'.
If classification rules do not exist for CICS or IMS transactions (called
CICS and IMS subsystem types in the WLM ISPF application), the transactions
will not be managed towards the SYSOTHER discretionary goal. Instead, the
serving address spaces will continue to be managed according to the goal
specified for them as batch jobs or started tasks. See the discussion of
servers later in this paper. (2)
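The defaulting behavior described in the last two paragraphs can be summarized in a small sketch. The subsystem labels follow the text; the rule table, work names and the use of `None` for "transactions not managed to their own goal" are hypothetical, and real WLM classification rules are far richer than an exact-name lookup. Started tasks are deliberately not modelled.

```python
# Sketch of the defaulting behaviour for unclassified work. Subsystem labels
# follow the text; rules and work names are hypothetical.
from typing import Optional

DEFAULTS_TO_SYSOTHER = {"APPC", "Distributed DB2", "JES", "OPEN/MVS", "TSO/E"}
TRANSACTION_SUBSYSTEMS = {"CICS", "IMS"}


def classify(subsystem: str, work_name: str,
             rules: dict[str, dict[str, str]]) -> Optional[str]:
    """Return a service class name, or None when the transactions are not
    managed to a goal of their own (their serving regions keep their goals)."""
    service_class = rules.get(subsystem, {}).get(work_name)
    if service_class is not None:
        return service_class
    if subsystem in DEFAULTS_TO_SYSOTHER:
        return "SYSOTHER"  # discretionary goal
    if subsystem in TRANSACTION_SUBSYSTEMS:
        return None        # regions stay managed as batch jobs or started tasks
    raise ValueError(f"{subsystem}: default not modelled in this sketch")


rules = {"JES": {"PAYROLL": "BATCHHI"}}
print(classify("JES", "PAYROLL", rules))  # BATCHHI
print(classify("JES", "ADHOC", rules))    # SYSOTHER
print(classify("CICS", "TRN1", rules))    # None
```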
Defining goals to MVS Workload Manager is very easy using the ISPF application.
But it is highly recommended that an installation have a clear definition
of what it wants to tell WLM before starting the ISPF application. There
are three good reasons for this recommendation:
- The importance to assign to the goals of different service classes will
probably involve discussion with different organizations or functions within
the installation. Those discussions and the resulting agreement on the
ranking of work are essential to a good service definition. See reference 1
for an example worksheet that might help installations in these discussions.
Without a visual comparison of one goal against the next, these discussions
probably could not take place.
- Naming conventions should be established before starting to define any
information to WLM. With a little advance thought and some conventions, the
installation will probably choose more meaningful names than it would on the
spur of the moment in the ISPF application.
- WLM's ISPF application tries to catch mistakes at the time they are made.
For example, if service class BATCH_A is supposed to be associated with
workload TEST and with resource group DEPT87, the ISPF application requires
the workload and resource group to be defined before they can be referenced.
These relationships are easy to see on paper, and entering them into the
ISPF application is then a trivial data-entry task.
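A paper plan can be checked for exactly the kind of mistake the ISPF application would reject. The sketch below is hypothetical throughout: the data layout is invented, and the naming rule (a letter followed by up to seven letters, digits or underscores) stands in for whatever convention the installation adopts.

```python
# Sketch of checking a paper service definition before typing it into the ISPF
# application. The data layout and naming rule are hypothetical conventions.
import re

NAME_RULE = re.compile(r"[A-Z][A-Z0-9_]{0,7}")


def check_plan(plan: dict) -> list[str]:
    """Return a list of naming and cross-reference problems (empty if clean)."""
    problems = []
    for kind in ("workloads", "resource_groups", "service_classes"):
        for name in plan[kind]:
            if not NAME_RULE.fullmatch(name):
                problems.append(f"{kind}: {name!r} breaks the naming convention")
    for name, refs in plan["service_classes"].items():
        if refs["workload"] not in plan["workloads"]:
            problems.append(f"{name}: workload {refs['workload']} not defined")
        group = refs.get("resource_group")
        if group is not None and group not in plan["resource_groups"]:
            problems.append(f"{name}: resource group {group} not defined")
    return problems


plan = {
    "workloads": {"TEST"},
    "resource_groups": set(),  # DEPT87 was forgotten
    "service_classes": {"BATCH_A": {"workload": "TEST",
                                    "resource_group": "DEPT87"}},
}
print(check_plan(plan))  # ['BATCH_A: resource group DEPT87 not defined']
```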
WLM allows goals to be specified for CICS or IMS transactions, as well
as for work running in TSO or APPC address spaces, batch jobs, and started
tasks. Since the CICS and IMS regions are run as batch jobs or started
tasks, they will of course be associated with a service class. Do not assign
the same service class to these regions as is assigned to the transactions
that the regions serve.
Similarly, with DB2 V4.1, it is possible to assign goals to distributed
DB2 transactions. Isolate that work into its own service class, rather
than mixing it with any address space work.
The primary reason for this recommendation is that the sampling data,
plots, and projections SRM assembles for each service class period can
become very distorted if work of such diverse structure is combined in
a service class.
Footnotes:
(2) Note that there is a very brief period of time during a mode switch or during
a policy activation where SRM and WLM control blocks are still being created.
If SMF writes type 30 records during this time when the control blocks
are not present, it will record the address space as being associated with
service class SYSOTHER. So even if an installation fully classifies all
work, that installation might see an occasional job associated with SYSOTHER
in type 30 records.