
This article is a review of BMC ProactiveNet Performance Manager (BPPM) version 8.6 and its key sub-components.
The key sub-components are:
- ProactiveNet Analytics
- ProactiveNet Event Management (formerly Mastercell)
- ProactiveNet Performance Manager (i.e. PATROL)
Versions Reviewed
Component | Version
BPPM Event Manager | 8.6
BPPM Analytics | 8.6
PATROL Central | 7.8.10
PATROL Central Operator – Web Edition | 7.8.10
PATROL Agent | 3.9.00.1i
PATROL for UNIX Servers | 9.10.00.02
Key Capabilities
Event Management
BPPM Event Management (previously known as Mastercell or BEM) is the
component that replaces PATROL Enterprise Manager or PEM (previously
known as CommandPost).
BPPM introduces a programming language called MRL.
MRL is not as flexible as PERL or REX, both of which can be used in PEM, but
MRL does include many in-built features, such as policies, that make the
design of rules slightly easier.
PEM performed event management using up to 5
transformers or scripts written in PERL. PEM was effectively a tool box
in which all the intelligence was provided by the PERL scripts, which
enriched the events using a number of lookup files.
Which product is better, PEM or BPPM? BPPM is arguably the better
event management platform. Although MRL is frustrating to work with,
the in-built capabilities mean that you don’t have to develop everything
from scratch. Overall, BPPM is a good event management platform.
Threshold Management
PATROL Configuration Manager (PCM) is one of the best threshold
management tools in the industry. The threshold management
capabilities of BPPM (aka ProactiveNet) are poor in comparison. BMC
state that they will include PCM functionality in the next release of
BPPM.
The limitations of threshold management in BPPM are numerous:
- BPPM has no local thresholds that can be applied across multiple servers.
- Local thresholds can only be defined via the GUI.
- Local thresholds can’t be migrated from one environment to another.
- Migration of global thresholds can be performed using an export/import utility, but it is not simple.
- The GUI for managing thresholds is cumbersome and not intuitive.
On the plus side, the different types of thresholds in BPPM are very
powerful. BPPM has Absolute, Intelligent, Signature and Predictive
thresholds. These thresholds are statistically based and will generate
events when a statistical anomaly is detected. The product will
automatically calculate trends using linear regression and variations
based upon hourly, daily or weekly patterns. However, the statistics
will not eliminate threshold management, as BMC have sometimes claimed.
Many thresholds are Boolean in nature (either good or bad) and are
therefore not appropriate for statistical analysis. Statistical analysis
is only appropriate for about 20% to 30% of thresholds, and the analysis
consumes a lot of CPU cycles.
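To make the idea of a statistically based threshold concrete, here is a minimal Python sketch of an hour-of-day baseline. It is not BPPM's algorithm, which is not described in this review; the function names, the sample data and the factor of 3 standard deviations are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_hourly_baseline(samples):
    """Group (hour_of_day, value) samples and return {hour: (mean, stdev)}."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: (mean(vals), pstdev(vals)) for hour, vals in by_hour.items()}


def is_anomaly(baseline, hour, value, k=3.0):
    """True when value is more than k standard deviations from the hourly mean."""
    if hour not in baseline:
        return False  # no history for this hour, so nothing to compare against
    mu, sigma = baseline[hour]
    if sigma == 0:
        return value != mu
    return abs(value - mu) > k * sigma


# Hypothetical CPU utilisation history: busy at 09:00, quiet at 03:00.
history = [(9, 40.0), (9, 42.0), (9, 38.0), (9, 41.0),
           (3, 5.0), (3, 6.0), (3, 4.0)]
baseline = build_hourly_baseline(history)

print(is_anomaly(baseline, 9, 41.5))  # False: normal for a weekday morning
print(is_anomaly(baseline, 3, 41.5))  # True: the same value is abnormal at 03:00
```

The same value is normal at one hour and anomalous at another, which is what a fixed threshold cannot express. A Boolean parameter (up or down) gains nothing from this treatment.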
Ease of Implementation
BPPM is undeniably a complex product. Far too complex in my opinion.
There are many other much simpler solutions such as HP SiteScope or CA
Nimsoft which can be implemented much faster. In addition, the BMC
Product Set has gradually got more and more complex over the years. The
solution is really three products bundled together:
- MasterCell which BMC purchased about 7 years ago.
- ProactiveNet which BMC purchased about 4 years ago.
- PATROL which BMC purchased about 20 years ago.
MasterCell is a great event management product. ProactiveNet has
perhaps been oversold by BMC, and its value is overstated. The
autonomous thresholds can only be applied to 20% to 30% of parameters
anyway. PATROL was originally a great product, but it has become bloated
and complex after years of poor product management.
As an illustration of how complex the BPPM solution has become, consider the following table:
Component / Feature | Old Solution with PEM | New BPPM Solution (version 8.6)
Number of Servers | 3 (DEV, DR and PROD) | 11 (3 DEV, 3 TEST, 5 PROD)
Number of Connections to the Agents | 2 (PEM and RT Server) | 3 (BIIP3, BPPM Adaptor, RT Server)
Number of Adaptors | 1 (RT Server) | 3 (RT Server, BPPM Adaptor, BIIP3)
Dynamic Policy Files (for Rules) | 5 Rule Files | 12 Rule Files
Forms for Threshold Management | 1 (PCM) | 2 (TEST and PROD BPPM Servers)
Extensibility
The PATROL agent has always been very extensible. There is a rich
API and many different ways to write an interface. PATROL Central
has no API and therefore cannot be extended. Both BPPM and PEM are
very extensible and can be extended through a variety of scripting
languages such as PHP or PERL.
Blackout
BMC has never provided a web form that allows staff in the Operations
Bridge to black out servers or services for upcoming outages due to
planned maintenance. The customer in this review had to
write its own Web GUI for Blackout. This is an Apache and PHP solution
that allows the shift operators to configure blackouts. It required 25
days of development to alter the blackout web form and migrate this
functionality from PEM to BPPM.
Administration
Routine Daily Admin Tasks
For an environment of 500 Agents, BPPM requires from 0.5 to 1 FTE to
keep the lights on - depending on the experience of the person. Typical
daily tasks include the following:
- Restarting Agents. For an environment of 500 Agents, you can expect
that 1 agent will crash per day. The most common cause is probably
history file corruption. History files can grow to beyond 4 GB if not
managed.
- Checking the Consoles. Most environments will end up with a
hierarchy of BPPM Event cells. The Administrator needs to log into each
Console to verify that:
- events are being de-duplicated properly;
- events are being propagated correctly from one cell to the next;
- incidents are being raised correctly, if Automatic Incident Generation (AIG) is configured.
- Managing Thresholds. The Administrator will get on average one
request per day to change a threshold or verify that a threshold is in
place. For example, an ORACLE DBA may say that there was a SEV2
incident last night related to table locking: "Could you please check that instance DW_PROD is monitored for locking?"
It can take from 30 minutes to 2 hours to investigate each request and
write an email suggesting and agreeing the new threshold, and perhaps
longer if a meeting is required.
- Managing Rules. Changes to the BPPM Rules occur about once per
month and need to be performed under change control. Rule changes
require a code change to the MRL, and the cells will need to be bounced.
- Commissioning and Decommissioning Agents. Agent commissioning
occurs every few months and may involve up to 20 virtual hosts
associated with one physical machine. The commissioning process is
fairly involved (in fact all the Admin steps are complex). See below.
- Deploying KMs. When the support teams deploy new infrastructure
software such as WebSphere or ORACLE, the associated PATROL Knowledge
Module (KM) will also need to be deployed. Each deployment may take 1-3
hours and will require change control. Input will be required from the
SME. For example, the ORACLE DBA may be required to type in the system password for ORACLE during the KM Configuration process.
PATROL Agent Commissioning
The Agent commissioning process for configuring monitoring for a new server consists of the steps shown below:
Step Number | Step | Description
1 | Ping Host | Ping the host to verify that the hostname is correct.
2 | Install Agent | Install the Agent using the Solaris package.
3 | Update Event Rules | Edit the BPPM enrichment file abc_host.csv.
4 | Apply to PROD Cell | Import abc_host.csv into the PROD cell.
5 | Apply to TEST Cell | Import abc_host.csv into the TEST cell.
6 | Update PING Test (primary) | Update the PING Test configuration on the Primary Server to ensure the host is up.
7 | Update PING Test (secondary) | Update the PING Test configuration on the Secondary Server to ensure the host is up.
8 | Configure UNIX km | Use PCM to give the Agent the standard configuration for the UNIX km.
9 | Update BIIP3 | Update the BIIP3 config so that the Agent can talk to the Event Management Cell.
10 | Agent Restart | Restart the Agent to ensure that the Agent configuration takes effect.
11 | Update PCO Web Console | Update the PCO Web Console so that the Agent appears in the PATROL console.
12 | Update Work request | Update the Work request to indicate the job is complete.
If additional Monitoring is required for ORACLE or WEBLOGIC or some
other Application, then there are additional configuration steps that
are required.
Programming Languages
There are two languages to learn with BPPM:
- MRL or Mastercell Rule Language. This is a fairly unique programming language.
- PSL or PATROL Script Language. This language is similar to PERL. The complexity lies in the functions that need to be learned.
Summary of Administration
Administration of BPPM is overly complex. The product has evolved
over the course of the last 20 years. As each new component has been
added via acquisition, the product has become increasingly complex and
time consuming to administer.
Architectural Considerations
Any Solution Design for BPPM should consider the following key questions:
Question | Details
How does the design allow for rule tracing? | Using the trace log is not practical due to the volume of events. A good solution is to assign a Unique ID to each rule and then configure each rule to add an entry to a new slot called “matching_rules”.
How does the design specify rule execution order? | It is often difficult to design rules because of confusion about rule execution order. It is good practice to split all mrl files into mrl files for new rules and mrl files for refine rules, so you get new_mcxp.mrl and refine_mcxp.mrl. The files should then be grouped in the .load file by stage, so you have refine rules followed by new rules … etc.
Does the DEV environment have the same number of cells as the TEST and OAT environments? | Don’t be tempted to have fewer cells in the DEV environment in order to limit the number of zones (servers) required. This is a mistake. Rule execution order is greatly affected by the propagation (or not) of slots between cells and the configuration of mcell.propagate.
Does the design specify the configuration of mcell.propagate? | The design should specify the configuration of all mcell config files, including mcell.propagate, mcell.dir etc.
Is BIIP3 included in the Design? | BIIP3 is essential in order to forward to the cells any PATROL events that are not event class 11 or 39. These events are explicitly generated by the PSL event_trigger() function. It is impossible for BPPM Analytics (ProactiveNet) to collect these events because they have no associated metric.
Threshold Management | If thresholds are being migrated from PCM to BPPM, how will the thresholds be migrated from one BPPM server to another? Has the export / import process been thoroughly tested (because it has serious issues)? I would advise migrating the thresholds to BPPM as a Phase II activity, or waiting for BPPM v9.
Export Thresholds from PCM | Does the design specify using a tool for extracting all the thresholds from PCM into a spreadsheet? (I have a PERL tool to do this).
Testing | Does the Design provide for at least a month of end-to-end testing once the rules have been completed?
Monitoring the Monitoring | Does the Design incorporate monitoring of the monitoring? Will an event be generated if the BIIP3 Adapter fails?
Event Storm | If the BIIP3 Adaptor loses connection to multiple agents every half an hour and then regains the connection 30 seconds later, this will create 200 new AGENT_DOWN events (mc_adapter_control). The de-dup rule will not work because the AGENT_UP event closes the AGENT_DOWN event. What rule is going to prevent this event storm?
Time-out Policies | Does the Design specify timeout policies for all the main top level event classes such as MC_CELL.. and EVENT? Does the cell start reasonably quickly with 2000 events? What about 20,000 events?
DDE Enrichment | Does the Design fully specify the Enrichment files that will be used?
DDE Synchronization | Are the DDE config files pulled or pushed into the cells? How are the DDE cfg files synchronized between cells?
Blackout | Has a Web site been included in the Design for Blackout by the Operations Bridge? BPPM does have a “Schedule downtime” facility, but this is entirely inappropriate for operators and does not account for BIIP3 events.
Blackout Dev | If a blackout GUI is a requirement, has a month of Development been allocated (using something like Apache and PHP)?
BPPM Analytics | Does the Design discuss the possibility of implementing BPPM Analytics as a second phase?
Reporting | Does the design include Event Reporting to drive Continuous Improvement? Key reports are total events grouped by: Day, Week, Month; Object Class; Application; Service; Support Group.
Reporting DEV | If reporting is a requirement, does the Design include time to implement the BMC reporting tool or 2 weeks of development using PHP and mquery?
AIG | Does the Design include Automatic Incident Generation (AIG)? Semi-automatic incident generation is an option, whereby an operator creates a ticket by right clicking on an event. Is this option considered and discussed in the design?
Failover | Is failover considered? How is the configuration replicated? Replicated DISK?
Training | Does the project plan include time for training the staff in the Operations Bridge? What about 2nd level support?
Go-live | Is the Go-Live big bang or Phased? Phased is preferred for risk mitigation but will require operators to run two consoles in parallel.
Audible Alarm | Is an Audible alarm a requirement? If so, then this will require a few days of development to configure a web page that uses a sound file and “mquery –s COUNT”.
BPPM Classes
BPPM has a number of event classes, as shown below, which all inherit from the CORE_EVENT class.
CORE_EVENT
- EVENT
- MC_CELL_EVENT
- MC_UPDATE_EVENT
- MC_SMC_ROOT
- MC_MCCS
- MC_CLIENT_BASE
- MC_CLIENT_CONTROL
- MC_CLIENT_ERROR
- MC_ADAPTOR_BASE
- MC_ADAPTER_CONTROL
- WIN_EVENTLOG
- LOGFILE_BASE
- SNMP_TRAP
- PEM_EV
- PATROL_EV
- PPM_EV
- MC_CELL_CONTROL
- MC_CELL_START
- MC_CELL_STOP
- MC_CELL_TICK
- MC_CELL_STATBLD_START
- MC_CELL_STATBLD_STOP
- MC_CELL_DB_CLEANUP
- MC_CELL_CONNECT
- MC_CELL_CLIENT
- MC_CELL_DESTINATION_UNREACHABLE
- MC_CELL_HEARTBEAT_EVT
- MC_CELL_RESOURCES
- MC_CELL_ACTION_RESULT
- MC_CELL_PUBLISH_RESULT
- IAS_EVENT
- IAS_START
- IAS_STOP
- IAS_SYNCH_EVENT
- IAS_REINIT
- IAS_LOGIN
- IAS_ERROR
Mastercell Rule Language (MRL)
Mastercell Rule Language (or MRL) is the language used to develop
event management rules within BPPM. The administrator can develop 11
different types of rules as shown in the table in section "Rule Phases"
below. The language is simple and relatively easy to learn in terms of
both the syntax and the in-built functions. The most difficult concept
to grasp is the execution order, as explained below. One of the most
common problems is to misunderstand the execution order
and find that the rules are not executing in the desired sequence. The
other cause of frustration is the lack of common statements such as
looping structures (do, while, for, until), which one takes for granted in
other languages. It is possible to iterate over a list structure
using the listwalk() function call. The New rule phase also has limited
capability to loop over events using the Updates clause. Fortunately,
the need to loop is fairly rare.
The biggest problem with MRL is the slow cycle time when debugging
code. Compared to PHP or PERL, it takes ten times as long to stop,
compile and restart. Debugging cycles are therefore 10 times as long, and
productivity is similarly affected. True, it is not necessary to write
pages and pages of code, but typically one will write about 8-15 pages
of MRL for each project. 8 pages of PHP (tested and debugged) takes 1
to 2 days. 8 pages of MRL (tested and debugged) takes 2-4 weeks. In
addition, one should allow for an additional month of end-to-end
testing before production go-live to test the rules with real events,
to allow all possible scenarios to play out and all the bugs
to emerge. These rules of thumb apply to companies of 5,000 to 10,000
employees. For larger organizations, you should allow more time.
Execution Order
- Rules are processed in order according to their rule phase as shown below.
- Rules are executed in the order in which they appear in the .load file.
- Rules are executed in the order in which they appear in the mrl file.
- Policies are executed in order of the specified “execution order”.
Rule Phases
Rules are executed in the order shown below.
Execution Order | Rule Phase | Description
1 | Refine | A Refine rule verifies the validity of incoming events and collects additional data for an event before it is sent through the remaining rule phases where further processing takes place.
2 | Filter | Filter rules limit the number of incoming events by discarding those events that need no additional processing or analysis. Filter rules compare incoming events to the event condition formulas (ECFs) contained in the rule to determine if an event is discarded or proceeds to further processing. An incoming event is processed through each Filter rule until a Filter rule discards the event, or all Filter rules are exhausted. An event must match all the Filter rules to be accepted.
3 | Regulate | Use Regulate rules to handle time frequency accumulations of events or repetitive occurrences of events. An event is considered a repetition of another if the event has the same values for all the slots that are defined with the dup_detect=yes facet in the BAROC definition of its event class.
4 | New | Use New rules to execute an action when a new event is received, for example increasing the severity level for an event or updating an existing event with new event data. New rules determine if an event becomes permanent and is placed in the repository.
5 | Abstract | Abstract rules create high-level, or abstract, events based on low-level events. A new event starts at the New rules phase, skipping the Filter and Regulate rules phases. With Abstract rules, you can keep low-level events with cells in the lower level of the cell hierarchy, abstract the data from low-level events into high-level events, and propagate them to a higher-level cell. A high-level cell in the hierarchy can consolidate abstract events from several low-level cells and prevent a large number of abstracted technical events for which no consolidating rules apply.
6 | Correlate | Correlate rules build an effect-to-cause relationship between an event that occurs as a result of another event. Correlate rules execute whenever a cause or an effect event is received. The relationship between correlated events can be broken.
7 | Execute | The Execute rule performs a specified action when a slot value has changed in the repository. The specified action, which is either internal to the cell or running an external executable, is based on the characteristics of one or more events.
8 | Threshold | The Threshold rule counts the number of events that match the criteria you specify. If the number of these events exceeds the amount allowed within a time frame, the Threshold rule executes. An event is considered a repetition of another if the event has the same values for all the slots that are defined with the dup_detect=yes facet in the BAROC definition of its event class.
9 | Propagate | A cell uses Propagate rules to forward events or messages to one or more destination cells or gateways. For example, a Propagate rule can escalate an event from a lower-level cell to a higher-level cell in an environment.
10 | Timer | Use Timer rules to create timed triggers to call a rule. Timer rules are evaluated when a timer expires.
11 | Delete | The purpose of Delete rules is to perform actions before an event is discarded from the repository, such as a rule that suppresses data that has no meaning without an event instance. Delete rules are evaluated whenever an event is deleted from the repository or when events are deleted using the Delete flag in the mposter command.
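One way to reason about the ordering rules in the "Execution Order" section is to treat them as a three-part sort key: rule phase first, then position of the file in the .load file, then position of the rule within its mrl file. The Python sketch below models that reading. It is my interpretation for illustration, not cell code; the rule names and the dictionary layout are hypothetical, and only the file names new_mcxp and refine_mcxp come from the design questions earlier in this review.

```python
# Phase names in the order given in the Rule Phases table.
PHASE_ORDER = ["refine", "filter", "regulate", "new", "abstract", "correlate",
               "execute", "threshold", "propagate", "timer", "delete"]


def execution_order(rules, load_order):
    """Sort rules by (phase, position of file in .load, position in mrl file)."""
    return sorted(
        rules,
        key=lambda r: (PHASE_ORDER.index(r["phase"]),
                       load_order.index(r["file"]),
                       r["position"]),
    )


# Hypothetical rules spread over two mrl files.
load_order = ["refine_mcxp", "new_mcxp"]
rules = [
    {"name": "raise_severity", "phase": "new", "file": "new_mcxp", "position": 2},
    {"name": "set_location", "phase": "refine", "file": "refine_mcxp", "position": 1},
    {"name": "dedup_update", "phase": "new", "file": "new_mcxp", "position": 1},
]

print([r["name"] for r in execution_order(rules, load_order)])
# ['set_location', 'dedup_update', 'raise_severity']
```

Writing the expected order down in this form before coding the MRL makes it easier to spot a rule that depends on a slot which has not yet been set.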
PATROL Configuration Manager (PCM)
PATROL Configuration Manager (PCM) is a configuration tool used for
PATROL agents. The tool is mainly used for configuring Thresholds and
is very effective at this task.
Operation
PCM is similar in concept to the Windows registry editor. The Main
Form consists of two TreeView panes as shown below. The left TreeView
is used to configure hosts, which are arranged in groups such as ORACLE
(shown below). The right-hand TreeView is used to manage the rules,
which can also be arranged into groups. The RuleSets are linked to the
Hosts by dragging RuleSets from right to left. The RuleSets are
dragged and dropped onto the leaves marked "LinkedRuleSets". The user
then invokes a command called "Apply RuleSets". The RuleSets are
applied to each Agent in the same order as they appear in the hierarchy
on the left. RuleSets linked to lower level nodes take precedence and
"override" higher level group RuleSets.

Typical Use Case
The use of PCM typically follows a four-step process. Administrators must perform the following:
- Select an Agent as a master and configure this Agent using the PATROL Central Operator (PCO) Console.
- Copy the configuration into PCM.
- Apply the configuration to other similar Agents using PCM.
- Restart the Agents in order for the configuration to take effect.
Weakness
The key weaknesses of this configuration process are the following:
- PCM and PCO are separate tools. Ideally, the configuration tool
(PCO) and the configuration distribution tool (PCM) should be the same
product. This would eliminate step 2 above.
- Step 4 should not be necessary. Restarting the Agents is easily
done using PCM, but the problem is that all active events are
regenerated. This means that all agents must be blacked out for up to
an hour before any restart; otherwise staff in the Operations Bridge
will see hundreds of duplicate events that they have already handled
over the last few hours.
Desired State Management
The key benefit of PCM is that it can be used to manage a Desired
State for each Agent. If you apply the configuration once or a thousand
times, the result is exactly the same. The hierarchy allows one to set
global or default configuration using the higher nodes in the left
TreeView and then to override the configuration with local (host
specific) configuration using the lower nodes. This hierarchy works
extremely well.
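The two properties described here, override by lower nodes and the same result no matter how many times the configuration is applied, can be illustrated with a short Python sketch. This models the concept only and is not how PCM is implemented; the configuration keys and values are hypothetical.

```python
def effective_config(rulesets_top_down):
    """Merge RuleSets from the highest group node down to the host node.

    Later (lower-level) RuleSets override earlier (higher-level) ones.
    """
    config = {}
    for ruleset in rulesets_top_down:
        config.update(ruleset)
    return config


def apply_config(agent_state, desired):
    """Set every desired key on the agent; repeating the call changes nothing."""
    agent_state.update(desired)
    return agent_state


# Hypothetical RuleSets: a global default and a host-specific override.
global_defaults = {"FSCapacity.alarm": 95, "CPUUtil.alarm": 90}
host_override = {"FSCapacity.alarm": 80}

desired = effective_config([global_defaults, host_override])
agent = {}
once = dict(apply_config(agent, desired))
twice = dict(apply_config(agent, desired))

print(desired)        # {'FSCapacity.alarm': 80, 'CPUUtil.alarm': 90}
print(once == twice)  # True
```

Because the result depends only on the desired state, an administrator can re-apply RuleSets after any doubt about an Agent's configuration without worrying about side effects, other than the restart issue described under "Weakness".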
Policies
The Policies feature within BPPM Event Management is generally a well
executed feature within the product and has sufficient flexibility to meet
most customers' needs. The Dynamic Data Enrichment (DDE) policies
allow the user to manage the rules externally using Comma Separated
Value (CSV) files.
The key thing to keep in mind is that the DDE policies
match based on Best Fit and not First Match. So for example, if you
want to match on a hostname called "fred*" (the star is a wild card),
then an entry for frederick will match before fred* even if fred* appears first in
the csv file. The rules are loaded into a hash memory structure within
the product. The benefit of "Best Fit" is that the execution time for
finding a match is predictable, irrespective of the number of lines in
the CSV file (and there could be thousands). The disadvantage of "Best
Fit" is that the matching can be out of sequence and counter-intuitive.
Best Practice in this case is to keep the CSV files simple. Each
Enrichment file should also have only one purpose. For example, the
customer used in this review originally started with 5 enrichment files
in their old PATROL Enterprise Manager (PEM) environment. After
implementing BPPM, the customer ended up with 11 DDE enrichment files.
The total number of lines was lower, but the number of files was higher.
When migrating from PEM to BPPM, the enrichment files should be
"Normalized" by minimizing the number of lookup columns in order to
reduce the probability of out-of-order rule matching.
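The difference between First Match and Best Fit is easiest to see side by side. The Python sketch below is an illustration under an assumption: it ranks matching patterns by the number of literal (non-wildcard) characters, which reproduces the fred*/frederick behaviour described above. The product's actual matching algorithm is not documented in this review, and the group names are hypothetical.

```python
from fnmatch import fnmatchcase


def first_match(rows, hostname):
    """Return the value of the first pattern, in file order, that matches."""
    for pattern, value in rows:
        if fnmatchcase(hostname, pattern):
            return value
    return None


def best_fit(rows, hostname):
    """Return the value of the matching pattern with the most literal characters."""
    candidates = [
        (len(pattern.replace("*", "")), value)
        for pattern, value in rows
        if fnmatchcase(hostname, pattern)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]


# Hypothetical lookup rows; the wildcard row comes first in the file.
rows = [("fred*", "GroupA"), ("frederick", "GroupB")]

print(first_match(rows, "frederick"))  # GroupA
print(best_fit(rows, "frederick"))     # GroupB
print(best_fit(rows, "freddy"))        # GroupA
```

Anyone migrating PEM lookup files that relied on line order should test each file against a sample of real hostnames, because the result can differ from what the file order suggests.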
BMC Standard Policies
Policy | Description
Closure | A closure policy closes a specified event when a separate specified event is received.
Blackout Policy | A blackout policy might be used during a maintenance window or holiday period.
Component Based Enrichment | Enriches the definition of an event associated with a component by assigning selected component slot definitions to the event slots.
Enrichment | Enriches the definition of an event associated with a component by assigning selected component slot definitions to the event slots.
Correlation | Correlation relates one or more cause events to an effect event, and can close the effect event. The cell maintains the association between these cause-and-effect events.
Escalation | Escalation raises or lowers the priority level of an event after a specified period of time. A specified number of event recurrences can also trigger escalation of an event. For example, if the abnormally high temperature of a storage device goes unchecked for 10 minutes or if a cell receives more than five high-temperature warning events in 25 minutes, an escalation event management policy might increase the priority level of the event to critical.
Notification | Notification sends a request to an external service to notify a user or group of users of the event. A notification event management policy might notify a system administrator by means of a pager about the imminent unavailability of a mission-critical piece of storage hardware.
Propagation | Propagation forwards events to other cells or to integrations to other products.
Recurrence | Recurrence combines duplicate events into one event that maintains a counter of the number of duplicates.
Remote | Remote action automatically calls a specified action rule provided the incoming event satisfies the remote execution policy’s event criteria.
Suppression | Suppression specifies which events the receiving cell should delete. Unlike a blackout event management policy, the suppression event management policy maintains no record of the deleted event.
Threshold | Threshold specifies a minimum number of duplicate events that must occur within a specific period of time before the cell accepts the event. For events allowed to pass through to the cell, the event severity can be escalated or de-escalated a relative number of levels or set to a specific level. If the event occurrence rate falls below a specified level, the cell can take action against the event, such as changing the event to closed or acknowledged status.
Timeout | Timeout changes an event status to closed after a specified period of time elapses.
Component Based Blackout | Specifies which events the receiving cell should classify as unimportant and therefore not process. The events are logged for reporting purposes. A Component Based Blackout event management policy might specify that the cell ignore events generated from a component or device based on component selection criteria for this policy.
Typical DDE Enrichment Files
CSV File Name | Description | Lookup Columns | Data Columns
Host.csv | Assign Location and HostType (DEV, TEST or PROD) based on host name | HostName | Location, Physical Server, HostType
HostSuppress.csv | Filter out events based on hostname (e.g. when a new Agent is installed) | HostName | HostSuppress (YES,NO)
Application.csv | Assign an application name to each event | ApplicationClass, Parameter | Application
ObjectSuppress.csv | Filter out troublesome parameters based on Event class | ApplicationClass, Parameter, EventClass | ObjectSuppress (YES,NO)
ApplicationSupress.csv | Filter out events based on application | Application | ApplicationSuppress (YES,NO)
HostBlackout.csv | Blackout hosts for planned outages based on timeframe | HostName, PhysicalServer, Location | TimeFrame
Service.csv | Assign Service Name to all events | Host, Instance, HostType | Service, SupportGroup
ServiceSuppress.csv | Filter out events based on service | Service | ServiceSuppress (YES,NO)
ServiceBlackout.csv | Blackout services for planned outages during a particular time frame | Service | TimeFrame
ServiceDowngrade.csv | Downgrade severity for particular services | Service | SeverityCode (e.g. 12333)
TextMessage | Change message text for certain parameters | ApplicationName, Parameter, EventClass | NewMesaage
Note: A SeverityCode of 12333 downgrades MAJOR (4) and CRITICAL (5) to MINOR (3).
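The note above implies that the SeverityCode is read positionally: digit N of the code gives the new severity for an incoming event of severity N. The Python sketch below shows that reading. The positional interpretation is inferred from the single example in the note, and the function name is hypothetical.

```python
def remap_severity(severity_code, severity):
    """Return the new severity: digit N (1-based) of the code maps severity N."""
    if not 1 <= severity <= len(severity_code):
        return severity  # outside the range covered by the code: leave unchanged
    return int(severity_code[severity - 1])


code = "12333"
for incoming in range(1, 6):
    print(incoming, "->", remap_severity(code, incoming))
# 1 -> 1
# 2 -> 2
# 3 -> 3
# 4 -> 3
# 5 -> 3
```

Severities 1 to 3 pass through unchanged, while MAJOR (4) and CRITICAL (5) both become MINOR (3), matching the note.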
Issues
PATROL Agent Restart
If the PATROL agent’s configuration is changed, then the agent
usually requires a restart. Unfortunately, the PATROL Agent regenerates
all active events (any parameter that exceeds a threshold) when the
agent is restarted. This means that an agent must be blacked out
whenever it is restarted.
PATROL Agent History Corruption
The Agent History file will always get corrupted if it
exceeds 4 GB. There is a 4 GB file size limit on Solaris. The
history file will frequently exceed this limit on busy servers running
messaging services such as Tuxedo or MQ (simply because there is a lot
to monitor). The history file may also get corrupted for other reasons.
When the history file is corrupted, the Agent will generate an event for every
attempt to store a parameter value. This problem can generate hundreds
of events every few minutes from just one host. This number of events can
easily overload a cell and a BIIP3 Adaptor (see BIIP3 Cache File Corruption below).
With 500 UNIX Agents, you should expect one agent to get a corrupt history file about every 2 weeks.
BIIP3 Cache File Corruption
If the BIIP3 cache file is corrupted, the BIIP3 Adaptor can get stuck on one
event and keep generating the event. I have seen 4 million repeated
events in a cell due to this problem.
BIIP3 cache file corruption may be caused by overload (see PATROL Agent History Corruption above).
I have seen this problem occur twice within 3 months.
The workaround is to clear the cache file and restart the BIIP3 Adaptor.
BIIP3 Agent Connection Drops
In certain situations, the BIIP3 Adaptor may lose connection with
all the agents every half an hour. The Agent will then regain the connection
almost immediately. This causes a flapping AGENT_DOWN and
AGENT_UP condition that is not de-duplicated, because the AGENT_UP
clears the AGENT_DOWN event. This issue can generate thousands of
events and thousands of new Incidents (assuming Automatic Incident
Generation is implemented).
The best workaround is to create a new rule for MC_ADAPTER_CONTROL
(AGENT_DOWN) events and set them initially to severity INFO. If the
Agent is truly down, then the second agent down event (which occurs 3
minutes later) should be configured in the rule to set the severity back
to WARNING or ALARM.
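The logic of this workaround is sketched below in Python. In a real implementation it would be written as an MRL rule in the cell; Python is used here only to show the decision. The class and method names are hypothetical, and the sketch assumes that an AGENT_UP event always arrives between two flapping AGENT_DOWN events.

```python
class AgentDownDamper:
    """Hold the first AGENT_DOWN at INFO; raise severity only if it repeats."""

    def __init__(self):
        self.down_seen = set()  # hosts with an unconfirmed or confirmed AGENT_DOWN

    def on_agent_down(self, host):
        """Return the severity to assign to this AGENT_DOWN event."""
        if host in self.down_seen:
            return "WARNING"  # second AGENT_DOWN with no AGENT_UP in between
        self.down_seen.add(host)
        return "INFO"

    def on_agent_up(self, host):
        """AGENT_UP clears the state, so the next AGENT_DOWN starts at INFO again."""
        self.down_seen.discard(host)


damper = AgentDownDamper()

# Flapping connection: down, up 30 seconds later, down again half an hour later.
print(damper.on_agent_down("hostA"))  # INFO
damper.on_agent_up("hostA")
print(damper.on_agent_down("hostA"))  # INFO

# Genuine outage: a second AGENT_DOWN arrives with no AGENT_UP in between.
print(damper.on_agent_down("hostA"))  # WARNING
```

With this approach a flapping connection only ever produces INFO events, which can be excluded from incident generation, while a genuine outage still escalates on the second event.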
The problem is also solved by restarting the BIIP3 Adapter. I
therefore suggest that all customers schedule a restart of the BIIP3
adaptors once per day. No events are lost because the BIIP3 adapter
(and the PATROL Agent) caches all events.
I have seen this problem about once per month with a population of 500 agents.
BPPM Threshold Migration
The migration of both global and local thresholds from one BPPM
Analytics instance to another must be performed by hand. There is an
export / import mechanism for global thresholds, but as of July 2012,
this mechanism is unreliable. There is no import / export mechanism for
local (host specific) thresholds.
BPPM Local Instance thresholds
BPPM Analytics does not support instance specific thresholds. In
other words, you cannot set a default threshold for FSCapacity across
all file systems, then set an instance specific threshold that
applies only to the root file system, and then apply this instance
specific threshold to all hosts. The instance specific threshold must
be individually defined on every host. If there are 500 hosts, this
becomes unfeasible. There is no script or API that can be used to
automate this task.
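The layered lookup that is missing can be sketched as follows. This is a Python illustration of the desired behaviour, not of anything BPPM provides: the function `resolve_threshold`, the table names, the host names and the percentage values are all hypothetical.

```python
# Hypothetical threshold tables, ordered from most general to most specific.
DEFAULTS = {"FSCapacity": 90.0}                     # applies to every instance
INSTANCE_OVERRIDES = {("FSCapacity", "/"): 80.0}    # root file system, all hosts
HOST_OVERRIDES = {("dbhost01", "FSCapacity", "/var"): 95.0}  # one host only


def resolve_threshold(host, metric, instance):
    """Return the most specific threshold defined for this host/metric/instance."""
    if (host, metric, instance) in HOST_OVERRIDES:
        return HOST_OVERRIDES[(host, metric, instance)]
    if (metric, instance) in INSTANCE_OVERRIDES:
        return INSTANCE_OVERRIDES[(metric, instance)]
    return DEFAULTS[metric]


print(resolve_threshold("web01", "FSCapacity", "/"))        # instance override
print(resolve_threshold("web01", "FSCapacity", "/home"))    # global default
print(resolve_threshold("dbhost01", "FSCapacity", "/var"))  # host override
```

The middle layer is the one that BPPM Analytics lacks: a single entry that covers the root file system on all 500 hosts instead of 500 separate definitions.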
BPPM – Missing Hosts
With this release of BPPM, the PATROL Agents are connected to BPPM
Analytics using the BPPM Adapter. When you use the Graphing facility to
graph parameters in BPPM, some of the hosts do not appear, even
though they are connected via the Adapter. At the time of this writing,
this case is open with BMC and is unresolved.
BPPM does not support Custom Event Catalogues
PATROL Events that are triggered using the event_trigger() PSL
function are not supported by BPPM Analytics (ProactiveNet). This
forces all customers (who use PATROL agents) to implement both the BIIP3
Adapter (for event_trigger() events) and the BPPM Adapter for all
standard PATROL metrics (that have an underlying parameter).
This means that the adapter layer in a BPPM implementation is quite
complex. There are three Adapters attached to every agent on three
separate ports. The Adapters are the RTServer, the BIIP3 Adapter, and
the BPPM Adapter.
This complexity makes the implementation fragile, difficult to administer and fundamentally unreliable.
LOG monitoring
It is difficult to define catch-all rules using the standard BMC Log
monitoring KM. For example, it is possible to create a catch-all rule
that triggers on the search string "ALARM". You then give this
definition a custom origin, which might be something like
"LOG.BANKING_app_log.alarm". You then create a custom event message
that inserts the line from the log file into the text of the message.
This can be done with the syntax "%1-". The problem occurs at the
event management layer. All events that match this rule will get rolled
up into one event as duplicates, despite the fact that each event
represents a different line from the log file and a different problem.
The workaround is to change the de-duplication rules at the event
management layer. Be careful: if the rules are improperly defined,
you can make the product vulnerable to an event storm, which may only
manifest itself a month or two later.
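The effect of the de-duplication key can be sketched in Python. This is only an illustration: the function `deduplicate`, the field names and the sample log lines are hypothetical, and it assumes that the default rule keys on host and origin while the changed rule also includes the message text.

```python
from collections import Counter


def deduplicate(events, key_fields):
    """Count events per de-duplication key; each distinct key is one open event."""
    return Counter(tuple(event[f] for f in key_fields) for event in events)


ORIGIN = "LOG.BANKING_app_log.alarm"
# Made-up log lines, all matching the catch-all search string "ALARM".
events = [
    {"host": "host01", "origin": ORIGIN, "msg": "ALARM: payment queue full"},
    {"host": "host01", "origin": ORIGIN, "msg": "ALARM: database connection lost"},
    {"host": "host01", "origin": ORIGIN, "msg": "ALARM: payment queue full"},
]

# Keyed on host and origin only: every log line collapses into one event.
by_origin = deduplicate(events, ("host", "origin"))
# Keyed on host, origin and message: distinct problems stay distinct.
by_message = deduplicate(events, ("host", "origin", "msg"))

print(len(by_origin), len(by_message))
```

The second key is also where the event storm risk comes from: if the log lines contain timestamps or counters, every line becomes a unique key and nothing is de-duplicated at all.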
Monitoring of the monitoring is insufficient.

Typical Project
Project Background
The review was conducted after an upgrade project in which every
component within an old PATROL environment was upgraded. The project
was driven by the customer's internal audit organization, which reviewed the
company's products and determined that PATROL Enterprise Manager (PEM)
was no longer supported and therefore the whole environment should be
upgraded.
Project Phases
The project consisted of a number of separate projects which could
have been undertaken individually. The customer chose to perform all
three projects simultaneously, which increased the risk, complexity and
length of the overall project.
Phase | Description
Phase 1 | Solution Design
Phase 2 | Upgrade of the PATROL Agents and Knowledge Modules
Phase 3 | Replacement of PEM with BPPM Event Manager
Phase 4 | Introduction of BPPM Analytics
Project Timescales
The Solution Design phase was conducted in late 2011 and the
implementation was started immediately after the New Year in 2012.
Phase 3 of the solution was finally put into production on Thursday 28th
June 2012.
Phase 4 of the project has not yet been completed. Phase 4 was
removed from the project scope when the customer fell behind on
delivery. Currently, there are no plans to complete this phase of the
project.
The customer contracted several months of consultancy from BMC
Software. BMC performed the initial Solution Design and much of the
initial configuration of the event management rules.
Resources
The resources assigned to the project consisted of the following:
Resource | Time Allocation
BMC Consultant | ~3 months
Customer SME | 7 months full time
Independent Consultant | 4 months
Customer UNIX Engineers (2 Engineers) | 4 months
Customer Infrastructure Architect | 1 month
Customer Project Manager | 2 months
Customer Delivery Manager | 2 months
Management Involvement (Project Sponsor + Resource Manager) | 1 month
Total | 24 months
Lessons Learned
The project overran initial estimates, both in terms of time and cost. The following issues were encountered:
Issue
|
Description
|
Solution Design
|
The Event Management Rules had to be completely redesigned, which
delayed the project by about a month. The customer’s old rules used
First Match, whereas BPPM only supports Best Fit. The complexity of
the customer’s rules was not properly analysed or understood during the
design phase.
|
Documentation
|
The design of the event management rules was not properly
documented. When it became evident that the design had to be changed,
the lack of documentation slowed understanding and meant that some
thinking had to be repeated and the design documented properly.
|
Thresholds
|
The customer spent over a month trying to migrate their thresholds
from PATROL to BPPM. This task was complex due to the different format
of the thresholds. The customer also experienced many issues with the
migration tools, which did not work properly. Managing thresholds in
BPPM is not as easy as managing thresholds in PATROL (using PATROL
Configuration Manager). In the end the customer abandoned the attempt
to introduce BPPM Analytics. The Autonomous alerts only covered 20% of
the thresholds anyway, so the benefit of BPPM Analytics was not
compelling.
|
Testing
|
The customer underestimated the time required for comprehensive
testing. Testing should have been planned earlier, started earlier and
resourced appropriately. At least a full month of end-to-end testing
was required.
|
Technical Lead
|
Technical Leadership was lacking through some parts of the project.
Initially, the BMC Consultant was the technical lead. Towards the end,
an independent consultant was the technical lead. There were issues of
continuity.
|
Project Phases
|
The project consisted of 4 project phases. Phase 2 and Phase 4 were
optional and were not required in order for the customer to meet its audit
deadline. In the end, Phase 4 was abandoned.
|
Summary and Conclusion
Component Rating (1-5 Stars)
BMC ProactiveNet Performance Manager (BPPM) is really 3 products
bundled into one suite. It still makes sense to rate each component
individually.
Product |
Summary |
Score 1-5 |
BMC BPPM v8.6 Analytics (formerly ProactiveNet) |
The product appears to have reasonably good quality control. The
graphing is good. The threshold management features are poor, but BMC
says this is being fixed in the next release. I am not convinced by the
whole concept of using statistics. Statistical analysis uses a lot of
CPU, which makes scalability an issue. Only about 30% of monitored
metrics are appropriate for statistical analysis. BMC's claim that
this product removes the need for threshold management is an exaggeration,
and 70% of thresholds will still need to be managed using absolute
value (i.e. standard) thresholds. |
3
|
BMC BPPM v8.6 Event Mgmt (formerly Mastercell) |
This product is one of the strongest event management products
around. There are challenges with using the MRL rule language, but
generally this product works well. I question BMC's bundling of this
product with ProactiveNet and would like to see the product available as
a stand-alone component. Developing and debugging rules is time
consuming and difficult. Only time will tell if this product continues
to be a good event management platform. |
3 |
BMC PATROL 7.8.10 |
Twenty years ago, PATROL was the best monitoring solution of its
type. Since then the product has become bloated and overly complex.
PCM was a great addition and makes the management of thresholds
relatively easy and repeatable. The product has not changed much in
about 8 years. Four years ago, BMC were going to retire the product.
Today PATROL is an integral part of BMC's BPPM strategy. The KMs and
the breadth of monitoring save this product from a lower rating. |
3 |
Rating according to Capabilities (Score 1-10)
Component/Capability | Previous Version (with PEM) | Latest Version (BPPM v8.6)
Event Management | 3 | 4
Threshold Management | 5 | 2
Analytics / Graphs | 3 | 5
Ease of Implementation | 3 | 2
Extensibility / Interfaces | 4 | 4
Operator Form for Blackout | 1 | 1
Average Score | (3.2) | (3)
Components | PATROL and associated KMs; PATROL Central Operator; PATROL Enterprise Manager (PEM) | PATROL and associated KMs; PATROL Central Operator; BPPM Event Management; BPPM Analytics (ProactiveNet)
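The Average Score row can be reproduced with a few lines of Python. The scores are copied from the capability table above; the variable and function names are only illustrative.

```python
# Capability scores copied from the table above.
previous = {  # previous version (with PEM)
    "Event Management": 3,
    "Threshold Management": 5,
    "Analytics / Graphs": 3,
    "Ease of Implementation": 3,
    "Extensibility / interfaces": 4,
    "Operator Form for Blackout": 1,
}
latest = {  # latest version (BPPM v8.6)
    "Event Management": 4,
    "Threshold Management": 2,
    "Analytics / Graphs": 5,
    "Ease of Implementation": 2,
    "Extensibility / interfaces": 4,
    "Operator Form for Blackout": 1,
}


def average(scores):
    """Unweighted mean of the capability scores."""
    return sum(scores.values()) / len(scores)


print(round(average(previous), 1), round(average(latest), 1))
```

The average is unweighted, so the gain in Analytics / Graphs is more than cancelled out by the drop in Threshold Management and Ease of Implementation.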
Conclusion
The score for BPPM has not improved with this revision. The product
is more complex and more difficult to implement, and thresholds are more
difficult to administer. The improvement in capability associated with
anomaly detection is not convincing, was not proven to this customer, and
is only relevant for 30% of parameters. BMC must work hard to improve
administration and ease of implementation.
The combination of BPPM Analytics (ProactiveNet), BPPM Event
Management (Mastercell) and PATROL has the potential to be a
market-beating product. However, the investment required is significant. Time
will tell if BMC delivers on this vision.
Disclosure: I am a real user, and this review is based on my own experience and opinions.
I believe BMC is going to lead in the Enterprise customer base. As mentioned above, BMC have the ability to simply integrate and view everything in the Manager of Managers. This is critical in environments where there are multiple existing and legacy toolsets. A single view is key to empowering the resources that need only critical information displayed, so that a fast and effective response is guaranteed.