App:Taskmon

Overview

TaskMon may run as either the primary control program on a PICS node or as a child of ACC to provide node management services. Unattended nodes within the PICS Display Network are always run with TaskMon as the primary control program. Nodes run outside the PICS Display Network and nodes requiring extra security use the ACC/TaskMon combination.

TaskMon increases PICS' robustness by autonomously controlling the operation of its node. A PICS node usually does not require communication with any other node in order to carry out its most basic functions (for example, to analyze previously retrieved historical data). However, without outside contact the node may operate in a degraded mode because it lacks access to outside services such as real-time data, the static database, or archive data. TaskMon may be directed to shut down PICS if a task does not have access to data it requires.

Responsibilities

  • Node configuration.
  • Node operational state control.
  • PICS task management.
  • Service table management.
  • Real-time point provider table management.
  • Inter-TaskMon communication.
  • System point provider.
  • PICS termination.

A PICS node is called a configured node if it is defined in the PCF. If it is not, it is called an alien node. TaskMon treats these two types of nodes differently in a few situations. For presentation purposes, only configured nodes are described except in the Alien Node section. In the following sections, a reference to the Alien Node section is given wherever there is a difference.

Configuration

When TaskMon starts, it uses the registry, a PICS configuration file (PCF) and a local ini file to determine how to manage the node. The registry provides the PICS system, section, class and subsystem names along with the node’s side (A or B). The PCF contains detailed information about every subsystem. The file name is formed by taking the registry's system name and adding '.PCF' to the end of it. The detailed content of this is described in SysIni, which creates it. The local ini file is named 'TmLocal.ini'. It contains definitions of other PICS tasks for the node. This ini file generally isn't needed due to changes to TaskMon's handling of a user starting a PICS task instead of having TaskMon start it.
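The PCF file name rule above can be sketched in a few lines of C. This is only an illustration: `build_pcf_name` is a hypothetical helper, not a TaskMon function, and the registry read that supplies the system name is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: forms "<system>.PCF" from the registry's system name.
   Returns 0 on success, -1 if the result does not fit in out. */
static int build_pcf_name(const char *system_name, char *out, size_t out_size)
{
    assert(system_name != NULL && out != NULL);
    int n = snprintf(out, out_size, "%s.PCF", system_name);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

The length check matters because the system name comes from outside the program; a name that does not fit is reported rather than silently truncated.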

While the PCF contains a great deal of information about the PICS system, there are two things in it that are used in conjunction with the registry variables: the section name and the class. The PCF's section name overrides the name in the registry. (See Alien Node.) If they differ, TaskMon writes the PCF section name into the registry. If the subsystem is configured in the PCF to use the registry's class name, TaskMon will use the definition of the class in the PCF for most of the configuration of the node, which includes the list of tasks for that node. Thus, a node can have its configuration changed by simply changing the class name in the registry and then restarting PICS. This feature is mainly used by display nodes.

TaskMon will use the PCF information to configure the node for operation. It will generate the list of tasks it will start and the order to start them. This list will include the tasks defined in TmLocal.ini. TaskMon determines if the node is to run in peer mode or in a primary/backup subsystem pair.

The PCF file to load is derived from either the subsystem name or the class name. When ACC starts TaskMon, it can directly tell TaskMon which PCF file to use.

Operational States

Once the PCF and INI files are loaded, TaskMon moves from the modeless state (gray icon) to the arbitrate state (yellow icon) and begins to make the node operational. (See Alien Node.) At this point TaskMon knows enough to begin broadcasting its network status message, which makes the node known in the PICS network. If the node is to operate as a peer, it can immediately move out of the arbitrate state to the peer/loading state (blue-yellow icon). Otherwise, it waits until it hears from its partner or times out waiting for it. If the partner is already primary, this node becomes backup. If the other node is also arbitrating, one will be chosen as primary and the other will become backup, depending on the type of failover mode defined for the subsystem. Most nodes are configured to have no preference for which node is to be primary. In that case, the A-side will become primary. If the wait for a partner times out, this node becomes primary. Due to the timing of the messages from both nodes, there is a slim chance both sides could decide to become primary at the same time. In that case, the A-side becomes primary and the B-side reboots. If this node is to be primary, it enters the primary/loading state (green-yellow icon); otherwise it enters the backup/loading state (pink-yellow icon).
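The arbitration rules above can be summarized as a small decision function. The sketch below is an assumption-laden illustration: every type and name in it is hypothetical, and it models only the common case in which the subsystem has no preference for which side is primary.

```c
#include <assert.h>
#include <stdbool.h>

/* All names here are hypothetical; they are not taken from TaskMon's headers. */
typedef enum { SIDE_A, SIDE_B } Side;
typedef enum { ST_ARBITRATE, ST_PEER_LOADING,
               ST_PRIMARY_LOADING, ST_BACKUP_LOADING } NodeState;
typedef enum { PARTNER_SILENT, PARTNER_ARBITRATING,
               PARTNER_PRIMARY } PartnerView;

/* Next state for a node that is currently arbitrating.
   Returns ST_ARBITRATE when the node should keep waiting. */
static NodeState arbitrate(bool peer_mode, Side my_side,
                           PartnerView partner, bool timed_out)
{
    if (peer_mode)
        return ST_PEER_LOADING;            /* peers do not wait for a partner */
    switch (partner) {
    case PARTNER_PRIMARY:                  /* partner already primary */
        return ST_BACKUP_LOADING;
    case PARTNER_ARBITRATING:              /* no preference: A-side wins */
        return my_side == SIDE_A ? ST_PRIMARY_LOADING : ST_BACKUP_LOADING;
    case PARTNER_SILENT:                   /* nothing heard yet */
        return timed_out ? ST_PRIMARY_LOADING : ST_ARBITRATE;
    }
    return ST_ARBITRATE;
}
```

Other failover modes and the rare double-primary conflict are not modeled here.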

In the loading state, the node's configured tasks are started. Upon completion of this, the node enters its operational state of Peer (blue icon), Primary (green icon), or Backup (pink icon).

The final operational state is the Termination state (red icon). TaskMon transitions into this state when it is commanded to or when it determines that its node cannot fulfill its function. While in this state it informs the PICS network, via its broadcast status message, that it is terminating, and stops all PICS tasks on its node before using the termination method configured for the node.

Task Management

Task management can be considered TaskMon’s main responsibility. The PCF and local INI file define which tasks are to be started. The task definitions specify the dependencies on other tasks. (See SysIni.) TaskMon loads all the task information into its task table, which is accessible to users via the evTmApi interface. TaskMon then creates a start order for the tasks and begins starting the tasks in that order, waiting for each task to reach its operational state before starting the next one. Once all the tasks are operational, TaskMon broadcasts a Windows message saying the node is operational and updates its status.
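One way to derive a start order from task dependencies is a simple topological sort. The sketch below is not TaskMon's algorithm, which the text does not describe; it only illustrates the idea. `TaskDef`, `MAX_TASKS`, `MAX_DEPS`, and `build_start_order` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 16   /* hypothetical limits for this sketch */
#define MAX_DEPS   4

typedef struct {
    const char *name;
    int deps[MAX_DEPS];  /* indices (0..count-1) of tasks this one depends on */
    int dep_count;
} TaskDef;

/* Fills order[] with task indices so that every task appears after all of
   its dependencies. Returns how many tasks were ordered; a result smaller
   than count means the remaining tasks form a dependency cycle. */
static int build_start_order(const TaskDef *tasks, int count, int *order)
{
    assert(tasks != NULL && order != NULL);
    assert(count >= 0 && count <= MAX_TASKS);
    int placed[MAX_TASKS] = {0};
    int n = 0;
    int progress = 1;
    while (n < count && progress) {
        progress = 0;
        for (int i = 0; i < count; i++) {
            if (placed[i])
                continue;
            int ready = 1;
            for (int d = 0; d < tasks[i].dep_count; d++) {
                if (!placed[tasks[i].deps[d]]) { ready = 0; break; }
            }
            if (ready) {
                placed[i] = 1;
                order[n++] = i;
                progress = 1;
            }
        }
    }
    return n;
}
```

Walking the resulting array forward gives a start order; walking it backward gives a shutdown order in which dependents stop before the tasks they rely on.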

TaskMon monitors all PICS tasks for the duration of their execution in two ways. First, it monitors the task’s process handle. When the handle becomes signaled, the task has exited. Second, it exchanges Windows messages with the task. Failure to follow the protocol of these messages will cause TaskMon to terminate the task. TaskMon's response to a task's failure depends on that task's definition.

During a task's start up, it must register with TaskMon then tell TaskMon it is ready. There are evTmApi functions for these operations. Registering tells TaskMon the window handle of the task which allows TaskMon to begin posting health queries to the task. For the task, registering maps TaskMon's shared memory and returns the node's system ID to it. Telling TaskMon it is ready allows TaskMon to move forward with the next task to start. There is a time limit for a task to register and one for it to become ready. (The time limit for becoming ready can be extended to some extent.) Once registered, TaskMon will post a health query message to the task. The task also has a time limit on posting its response. After receiving the response, TaskMon will wait a period before asking for the task's health again.

The task can post its health to TaskMon at any time. This is mainly done when the task detects it can no longer perform its function and will exit. This makes TaskMon more responsive to changes in its node.

A task will normally unregister from TaskMon when it is told to exit by TaskMon or when it performs an orderly shutdown on its own. There are two ways a task can unregister with TaskMon. The first is the normal way, when the task gives its health to TaskMon. If TaskMon didn't tell the task to exit, it will send a terminate message to the task asking it to exit. In the second way, the task gives TaskMon a special value for its health. This puts the task back in the state of TaskMon waiting for it to register. A task would use this method if it wanted to destroy the window it is using to communicate with TaskMon and create a new one in its place.

Although the preceding paragraphs imply that all tasks defined for a node are started when PICS starts on the node, a task can be defined to not be automatically started. In that case, either another task will need to command TaskMon to start it or it will be started by a user in the same manner other Windows tasks are started.

As stated earlier, TaskMon responds differently to a failed task according to its definition. If it is defined as critical, then PICS on that node will be terminated. If it is restartable, it will be restarted. However, a restartable task will not be restarted if it doesn’t reach the ready state before exiting. If it is neither of these, the task's exiting is ignored.

When TaskMon decides it must terminate PICS on its node, it updates its status to failed and begins to shut down all PICS tasks, in the opposite order from which they were started, by posting each a message asking it to exit. If a task doesn’t exit within a time limit, TaskMon will kill that task. During shutdown, TaskMon still broadcasts its status message to the PICS network. This message allows a backup node to immediately switch to primary and all nodes to remove any services or providers that are registered by the terminating node.

All time limits, and how long TaskMon will wait before posting a health query to a task, are configured individually by task, although there are defaults for any that are not explicitly given in the task's definition.

Service Table Management

TaskMon provides management for services its node provides to the rest of PICS and for the services provided by other nodes. A task which will be a service provider, such as the archivers, registers its services with TaskMon. The registration includes its name, IP addresses, ports and the conditions under which the service is to be available. For example, a service may be available only to its partner (A or B side), only when its node is primary, or on multiple NICs. The task can also unregister or modify characteristics of its service.

TaskMon periodically broadcasts to the PICS network a message specifying the services on its node, allowing all TaskMons to keep a table of the services provided by all PICS nodes. TaskMon will broadcast a Windows message when a service is added, deleted, or changed. This allows the tasks on the node to act appropriately in a changing situation, such as a fail-over of a node.

Real-Time Provider Table Management

TaskMon provides management to keep track of the various real-time point providers in the system. The purpose of the table is to allow the real-time agent on the node to determine when and if it needs to proxy points that no longer have a provider. A task which will provide points registers its provider numbers with TaskMon. A task can also unregister its provider numbers at any time.

TaskMon periodically broadcasts to the PICS network a message specifying the providers on its node, allowing all TaskMons to keep a table of the providers in the system. TaskMon will broadcast a Windows message when a provider is added or removed from the table. Note that a provider is considered to be available if any node is registered to provide it, whether that node is primary or backup. This allows a fail-over to proceed without the real-time agent thinking the provider is no longer available.
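The "available if any node provides it" rule can be sketched as a lookup over the provider tables received from each node. The layout below is an assumption for illustration: one row of 256 entries per node, stored back to back, with 0xffff marking "this node does not have this provider" as in the provider advertisement. `MAX_PROV` stands in for MAX_TMAPI_PROV, and `provider_available` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PROV    256      /* stand-in for MAX_TMAPI_PROV (256 providers) */
#define NO_PROVIDER 0xFFFFu  /* entry value meaning "not provided here" */

/* tables points at node_count rows of MAX_PROV task ids, one row per node.
   A provider counts as available if any node, primary or backup, lists it. */
static bool provider_available(const uint16_t *tables, size_t node_count,
                               unsigned provider)
{
    assert(tables != NULL && provider < MAX_PROV);
    for (size_t n = 0; n < node_count; n++) {
        if (tables[n * MAX_PROV + provider] != NO_PROVIDER)
            return true;
    }
    return false;
}
```

Because the check does not look at node state, a provider registered by both sides of a pair stays available while the primary fails over to the backup.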

Inter-TaskMon Communication

All inter-TaskMon communication uses UDP messages. TaskMon periodically broadcasts three different UDP messages to be read by all other TaskMons: a status message, a service advertisement, and a provider advertisement. There are four other UDP messages sent by TaskMon: a broadcast MMI command, a duplicate node message which is sent only to the offending node, and two messages involving alien nodes (see Alien Node). Each message has a header which, besides the message ID, specifies the PICS system, identifying information for the source and destination node, and some other information.

The status message contains the name of the subsystem, its state, a boss number, and four bytes of status for each task running on its system. The boss number, along with the node's side, is used to resolve the conflict of two nodes of a subsystem thinking they are primary. The service advertisement contains a list of services being provided. The provider advertisement contains a table of the 256 possible providers with an indication of whether the node provides each one or not. Each message contains a sequence number. The sequence number is used to determine if messages are being missed from any node.
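Detecting missed messages from a sequence number is a one-line calculation when done in unsigned arithmetic, which handles wrap-around of the 32-bit counter naturally. The sketch below assumes the sender increments the sequence by exactly one per message; `missed_count` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Number of messages skipped between the last sequence number seen from a
   node and the one just received. Unsigned subtraction wraps modulo 2^32,
   so 0xFFFFFFFF followed by 0 counts as "nothing missed". A repeated or
   reordered message yields a very large value, which a caller would need
   to treat separately. */
static uint32_t missed_count(uint32_t last_seq, uint32_t new_seq)
{
    return (uint32_t)(new_seq - last_seq - 1u);
}
```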

When TaskMon receives its first status message from another node, it populates its subsystem table with that node's information. Future status messages update the entry. When a node's state changes, TaskMon broadcasts a Windows message giving the old and new state of the node. Service and provider advertisements populate the service and provider tables.

Once TaskMon receives a status message from a node, it requires that all three periodic messages (status, service advertisement, and provider advertisement) continue to be received within a time limit. If they are not, TaskMon marks the node as lost, removes all services and providers by that node from their tables, and broadcasts the appropriate Windows messages. When a node has been out of communication long enough, it reverts to the state of expired. After an even longer time, it reverts to the state of unknown.

Currently, there is only one form of the MMI command in use. It is used to set the termination type for a node or group of nodes. It can optionally tell the affected TaskMons to terminate PICS on their node at that time. All nodes aid in the propagation of MMI commands by resending them once a second a few times.

If TaskMon detects a node with the same name/side as a known node but with a different IP address, TaskMon will send out a duplicate node message to the new node and throw away the message received from it. When TaskMon receives a duplicate node message directed at itself, it terminates PICS on its node.

System Point Provider

Each TaskMon provides all the PICS subsystem's points to the real-time agent. The information used to provide the points comes from the status messages (or lack of them) received from other nodes. The real-time agent treats these points differently than other points by not sending them out to the network. As a result, tasks that subscribe to these points are seeing them from the point of view of their node. This view will be the same as that of the other nodes as long as the connectivity between the nodes is consistently good, since loss of communication with a node can cause that node to appear to be missing.

The exception to the local TaskMon providing the points is nodes whose real-time agent is RtClient. TaskMon still provides the points to RtClient, but they are discarded. Instead, RtClient uses the data it receives from RtServer. Since nodes that use RtClient normally have a bridge between them and the PICS network, their TaskMon doesn't get status messages from the nodes and therefore can't provide the correct values. In most PICS configurations, all ACC nodes use RtClient.

PICS Termination

There are three reasons TaskMon will terminate PICS on its node:

  • Menu command
  • A node command issued either locally or remotely
  • A critical task reported an error, unregistered, or exited.

Once TaskMon decides to terminate PICS, it will event log the triggering reason for the termination and then begin commanding each PICS task to exit. The tasks are commanded to exit in the opposite order they were started. Each task must exit within a time limit or TaskMon will kill it. TaskMon again event logs the termination reason along with the type of termination it is using.

There are four termination types:

  • Log off user
  • Power-off computer
  • Reboot computer (Default)
  • Exit to Windows.

If the termination type is 'Exit to Windows', TaskMon immediately does that. Otherwise, it displays a dialog which gives the type and reason for the termination and starts a timer before terminating. For attended nodes, a user can click the OK button to cause immediate termination or the 'Exit to Windows' button, which changes the termination method before exiting. An attended node also has a longer time limit before TaskMon actually performs the termination method.

Alien Node

Alien nodes are those nodes which are not explicitly named in the PCF. Since they are not named, TaskMon does not know what tasks to load for them or how to control them. However, the PCF can have an alien base subsystem defined for it. If it does, the characteristics of this subsystem are applied to the node. If not, TaskMon terminates PICS on that node. Since the alien base is defined like any other PICS node, it can, and usually will, be defined to use the node's registry class as discussed in the Configuration section.

In alien nodes, the registry subsystem name and side are not used. Instead, TaskMon negotiates with the other TaskMons for them. At the end of the negotiation, TaskMon writes the name and side into the registry. Note that the next time this node starts PICS, it could get a different name and side. The names are always 'Alien_ddd', where the ddd is the ID for the node. Unlike regular nodes, the section name, if valid, is used. Thus, unlike configured PICS nodes, the A-side and the B-side of an alien subsystem can be in different sections. From this point on, the node is treated as any other PICS node.

The alien name negotiation occurs while TaskMon is in the modeless state. This negotiation consists of the alien TaskMon requesting a specific ID and side via a broadcast UDP message. It initially tries ID 255 A-side. If some other TaskMon knows of a node already using the ID and side, it rejects the request by sending a UDP message to the alien with an ID and side that is available for use. The alien will then request that ID and side. Again, other TaskMons can reject this by sending a reject message. This continues until all possible alien IDs and sides have been rejected or one is not rejected by anyone. If they are all rejected, TaskMon will terminate PICS. Assuming the alien does not receive a rejection, it will send the request once a second a total of 10 times. At that time, the alien assumes that ID and side, writes its name and side to the registry, and then transitions into its arbitrate state. There is one drawback with this method. If TaskMon starts on two alien nodes within a second of each other, they can both be given the same ID and side. They will then be duplicates of each other, which is resolved as described in the Inter-TaskMon Communication section.

Tray Icon

TaskMon's system tray icon shows its current mode of operation for quick reference.

Icon                       Mode          Meaning
Image:Tmgray.png           Modeless      just started and nothing is known
Image:Tmcyan.png           Modeless      getting configuration file (not imp.)
Image:Tmyellow.png         ARBITRATING   waiting to hear from partner
Image:Tmyellowgreen.png    PRIMARY       loading
Image:Tmyellowpink.png     BACKUP        loading
Image:Tmyellowblue.png     PEER          loading
Image:Tmblue.png           PEER          operational
Image:Tmgreen.png          PRIMARY       operational
Image:Tmpink.png           BACKUP        operational
Image:Tmred.png            Terminating

Local INI File

TaskMon allows extra node configuration to be included via an INI file named TmLocal.ini. This file can contain a [Task Defaults] section along with various task sections. These sections are described with the system configuration compiler, SysIni. [Task Defaults] is found in PicsSys.ini and the task sections are found in PicsTask.ini. All fields in each section are allowed here. Note that the task defaults start with those from PicsSys.ini and only those fields defined here override them.

The use of this file has been largely removed by modifying TaskMon to allow the node's class in the registry to define most of the details for this node and by using the information in the PCF to determine how to control a self-started PICS task.

Here's a sample which ensures the common end-user applications are available on a display node (DEU):

[Task Defaults]
Health Check Poll Rate=+5s
Health Check Timeout=+3s
Load Path=bin\
Priority Class=Normal
Thread Priority=Normal
Register Timeout=+5s
Ready Timeout=+10s
Exit Timeout=+5s
Critical=No
Auto Start=No
Auto Restart=No
Multiple Instance=No
Security Rights=User,Operator,Technician,Guest

[Display]
Label=Recall Display Program
Executable=redisp.exe
Depend1=Static
Depend2=Realtime

[Gradient_Plot]
Label=gPlot
Executable=gPlot.exe
Depend1=Static
Depend2=Realtime

[pgPlot]
Label=pgPlot
Executable=pgPlot.exe
Depend1=Static
Depend2=Realtime

[AlarmLog]
Label=Alarm Log
Executable=AlarmLog.exe
Multiple Instance=Yes
Depend1=Static
Depend2=Realtime

[AlarmSort]
Label=Alarm Sorter
Executable=AlarmSort.exe
Depend1=Static
Depend2=Realtime

[Megawatt_Display]
Label=Megawatt Display
Executable=MWDispPics.exe
Depend1=Megawatt_Historian

Network Message Formats

All TaskMon messages start with a common header and most are broadcast UDP messages. The "Duplicate Node Message" and the "Reject Alien Request" messages are sent to a specific IP address.

Header

The header is validated to ensure the message can and should be processed.

typedef struct TM_UDP_HEADER_s
{
   char              szSystem[32];  // PICS Name of the system
   WORD              wLen;          // Length of msg
   WORD              wProtVer;      // Current protocol version number (SP_MSG_PROTOCOL_VER)
   TM_MSG_ID         mid;           // Msg Id
   SYSTEMID          sidSrc;        // System Id of the source
   SYSTEMID          sidDest;       // System Id of the destination (0 if for everyone)
   TM_SECTION_BITS   sBits;         // Section information
   QWORD             qTime;         // qTime for this node's pcf
   WORD              wSsLocal;      // Subsystem index of local class used
} TM_HEADER;
  • szSystem must match TaskMon's PICS system name or the message is discarded. This allows TaskMons from multiple PICS systems to share the network.
  • wLen must match the length of the UDP message.
  • wProtVer will cause a PICS termination on this node if it is greater than TaskMon's value.
  • mid determines the format of the information after the header.
  • sidSrc defines which node sent this message, including what its current state and status are.
  • sidDest defines which nodes this is intended for. A message can be directed to an individual subsystem or to all subsystems. It can be further restricted to only one side or to only nodes with the specific state of Primary, Backup, or Peer.
  • sBits gives the section the source is in along with information needed in its system point.
  • qTime is the PICS time, in 64-bit format, of the creation of the node's PICS configuration file. Although not used at present, it is intended to allow a node to determine if it needs to get a new PCF.
  • wSsLocal allows SysView to display the correct configuration information for a node.
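The first three header checks can be sketched as follows. This is an illustration under stated assumptions, not TaskMon's code: `HeaderPrefix` is a hypothetical cut-down struct holding only the fields being checked (with fixed-width types standing in for WORD), `MY_PROTOCOL_VER` is a made-up value standing in for SP_MSG_PROTOCOL_VER, the system name comparison is assumed to be case-sensitive, and a length mismatch is assumed to cause a discard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MY_PROTOCOL_VER 3u  /* hypothetical stand-in for SP_MSG_PROTOCOL_VER */

/* Hypothetical: only the leading header fields examined by this sketch. */
typedef struct {
    char     szSystem[32];
    uint16_t wLen;
    uint16_t wProtVer;
} HeaderPrefix;

typedef enum { HDR_OK, HDR_DISCARD, HDR_TERMINATE } HeaderVerdict;

static HeaderVerdict check_header(const HeaderPrefix *h, size_t udp_len,
                                  const char *my_system)
{
    assert(h != NULL && my_system != NULL);
    /* Message belongs to another PICS system sharing the network. */
    if (strncmp(h->szSystem, my_system, sizeof h->szSystem) != 0)
        return HDR_DISCARD;
    /* Declared length must equal the size of the received datagram. */
    if ((size_t)h->wLen != udp_len)
        return HDR_DISCARD;
    /* A newer protocol on the network terminates PICS on this node. */
    if (h->wProtVer > MY_PROTOCOL_VER)
        return HDR_TERMINATE;
    return HDR_OK;
}
```

Checking the system name first means traffic from an unrelated PICS system can never trigger the protocol-version termination; whether TaskMon orders its checks this way is not stated in the text.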

Status Message - mid = TM_MSG_TM_STATUS (2)

This message is broadcast every half second (default time). "[System Definition]Status Message Rate" can modify this time. A node is declared to have lost communication and expired if 4 are missed (default count). "[System Definition]Node To Live Count" can modify that count.
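With the defaults above, a node would be declared lost after roughly 4 x 500 ms = 2 seconds of silence. The sketch below shows one way to express that check; it assumes "missed" is judged purely by elapsed time since the last status message, which the text does not spell out, and `node_is_lost` is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* rate_ms corresponds to "Status Message Rate" (default 500 ms) and
   live_count to "Node To Live Count" (default 4). Times are millisecond
   tick counts; unsigned subtraction keeps the result correct across a
   wrap of the tick counter. */
static bool node_is_lost(uint32_t now_ms, uint32_t last_rx_ms,
                         uint32_t rate_ms, uint32_t live_count)
{
    assert(rate_ms > 0u && live_count > 0u);
    uint32_t elapsed = (uint32_t)(now_ms - last_rx_ms);
    return elapsed > rate_ms * live_count;
}
```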

typedef struct TM_TM_STATUS_s
{  // MsgId == TM_MSG_TM_STATUS
   TM_HEADER  h;
   DWORD      dwSequence;        // Incremented before sending the message
   DWORD      dwBoss;            // Conflict resolution when two nodes believe they are primary.
   WORD       wTasks;            // Number of entries in Task
   TM_STATUS_TASK Task[256];     // Task status information
} TM_TM_STATUS;
  • dwSequence allows each TaskMon to detect lost UDP messages from a node.
  • dwBoss allows two nodes in conflict over which is primary to determine which should terminate. The smaller terminates. If they are the same value, the B-side terminates.
  • wTasks is the number of entries in Task. Although there can be 256 tasks on a node, only the first wTasks entries in Task are sent.
  • Task contains information about the current state and configuration of a task on the node. See TM_STATUS_TASK in SysTasMon.h for details.
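The dwBoss rule can be written as a small predicate answering "should this node be the one to terminate?". The names `Side` and `should_terminate` are hypothetical; the rule itself (smaller boss number terminates, B-side terminates on a tie) is the one stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { SIDE_A, SIDE_B } Side;  /* hypothetical */

/* Called when this node and its partner both believe they are primary.
   Returns true if this node is the one that should terminate. */
static bool should_terminate(uint32_t my_boss, Side my_side,
                             uint32_t partner_boss)
{
    if (my_boss != partner_boss)
        return my_boss < partner_boss;  /* the smaller value terminates */
    return my_side == SIDE_B;           /* tie: the B-side terminates */
}
```

Since the two sides of a pair never share a side, exactly one of them gets `true` for any pair of boss numbers.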

Service Message - mid = TM_MSG_TM_SERV (1)

This message is broadcast every other status message.

typedef struct TM_TM_SERV_s
{  // MsgId == TM_MSG_TM_SERV
   TM_HEADER   h;
   DWORD       dwSequence;        // Incremented before sending the message
   DWORD       dwServices;        // Number of services in this message
   TMAPI_SERV  serv[dwServices];  // Array of service descriptions
} TM_TM_SERV;
  • dwSequence allows each TaskMon to detect lost UDP messages from a node.
  • dwServices is the number of services included in this message.
  • serv[] contains the details of each service being offered. See TMAPI_SERV in tmapi.h for details.

Provider Message - mid = TM_MSG_TM_PROVIDER (4)

This message is broadcast every other status message.

typedef struct TM_TM_PROVIDER_s
{  // MsgId == TM_MSG_TM_PROVIDER
   TM_HEADER      h;
   DWORD          dwSequence;               // Incremented before sending the message
   WORD           wTaskId[MAX_TMAPI_PROV];  // Array of provider definitions
} TM_TM_PROVIDER;
  • dwSequence allows each TaskMon to detect lost UDP messages from a node.
  • wTaskId[] contains the task id of the provider. A value of 0xffff means this node does not have this provider.

Duplicate Node Message - mid = TM_MSG_TM_DUP_NODE (3)

This message is sent whenever a message is received from a duplicate node. It is sent to the duplicate node only.

typedef struct TM_TM_DUP_NODE_s
{  // MsgId == TM_MSG_TM_DUP_NODE
   TM_HEADER  h;
   char       szValidSubSystem[32];      // PICS Name of the valid subsystem
   DWORD      dwValidIp;                 // Ip of valid subsystem
   DWORD      dwOffendingIp;             // Ip of offending subsystem
} TM_TM_DUP_NODE;
  • szValidSubSystem is the name of the subsystem this node is in conflict with. (The side is in the sidDest of the header.)
  • dwValidIp is the IP of the node that is being used. This is used for logging.
  • dwOffendingIp is the IP of the invalid subsystem.

Node Command Message - mid = TM_MSG_TM_CMD (7)

This message is broadcast once a second for 5 seconds (default time). "[System Definition]Node Command Life" can modify this time.

typedef struct TM_TM_CMD_s
{  // MsgId == TM_MSG_TM_CMD
   TM_HEADER  h;
   SYSTEMID   sidSrc;        // System Id of the original sender of this message
   DWORD      dwSequence;    // Sequence number generated by original sender
   long       iMsOfLife;     // Milliseconds left to keep this message alive
   long       iReserved;     // set to zero for future compatibility
   TM_CMD_SET_MSG cmdSet;
} TM_TM_CMD;
  • sidSrc allows all nodes to know which node originated this command.
  • dwSequence comes from the originator of the message. The combination of sidSrc and dwSequence allows TaskMon to determine if a received node command has already been received.
  • iMsOfLife starts with the "Node Command Life" value and is decremented by one second every time it is broadcast. When the time reaches zero or less, the message will be discarded.
  • cmdSet is an array of up to 40 commands. Each command has four parts: who, side, command, and name. Who has four values: all, class, subsystem, or section. Side is A, B, or both. The currently supported commands set the PICS termination type to one of its four types either for later or immediate use. The name indicates which subsystem class, PICS section, or subsystem. Name is not used when the command is for all nodes.

When the command is received, TaskMon determines if the command applies to its node. If it does, TaskMon acts on the command. Regardless, the command is remembered and rebroadcast for the remainder of the command's life span.

This command is interesting in that it is generated and broadcast by one node and then rebroadcast by every node that receives it, each time with a shorter remaining lifetime. Rebroadcasting lets the command reach nodes that missed an early copy, which can happen with UDP messages or when a TaskMon starts on a node while the command is active, and the ever-decreasing lifetime ensures the message does not live indefinitely.

Alien ID Request Message - mid = TM_MSG_TM_ALIEN (8)

This message is broadcast once a second for 10 seconds.

typedef struct TM_TM_ALIEN_s
{  // MsgId == TM_MSG_TM_ALIEN
   TM_HEADER      h;
   SYSTEMID       sidDesire;     // Desired system id. (only SystemID and SideID have meaning)
   WORD           wCountDown;    // Number of times this message will be resent
} TM_TM_ALIEN;
  • sidDesire has the ID and side value filled in with the desired values for the node
  • wCountDown starts at nine and decreases to zero.

Reject Alien ID Request Message - mid = TM_MSG_TM_ALIEN_REJECT (9)

This message is sent directly to the requesting alien when there already is a node with the requested ID and side.

typedef struct TM_TM_ALIEN_REJECT_s
{  // MsgId == TM_MSG_TM_ALIEN_REJECT
   TM_HEADER      h;
   SYSTEMID       sidDesire;     // Reflected from TM_TM_ALIEN message
   SYSTEMID       sidTry;        // Try this SystemID and SideID
} TM_TM_ALIEN_REJECT;
  • sidDesire is copied from the request message
  • sidTry contains an ID and side to try next. A side of 3 means there are no alien ID-side combinations available.

Debugging

When working in the PICS development environment, with a DEBUG BUILD of TaskMon, some additional command line parameters are available:

/tray
Tells the debug build to put the program icon in the system tray instead of on the taskbar. By default, TaskMon's debug build opens the main window when run and places the program icon on the taskbar (very useful when debugging TaskMon itself).
/DebugBreak
Equivalent to setting DebugBreak=TRUE in the [Debug] section of TaskMon.ini
/NoTimeOut
Equivalent to setting NoTimeOut=TRUE in the [Debug] section of TaskMon.ini
DEBUG
Equivalent to setting Debug=TRUE in the [Debug] section of TaskMon.ini
DEBUG
task1[,task2,etc]
Equivalent to setting Debug=TRUE in the [Debug] section of TaskMon.ini and setting one (or more) Task=task1 entries in the [Debug] section.

When the Debug, DebugBreak and/or NoTimeOut options are set from the command line, the associated setting in TaskMon.ini is ignored.
