Job Tracking on a Grid - the Logging and Bookkeeping and Job Provenance Services1

CESNET technical report number 9/2007
also available in PDF, PostScript, and XML formats.

Luděk Matyska, Aleš Křenek, Miroslav Ruda, Jiří Sitera, Daniel Kouřil, Michal Voců, Jan Pospíšil, Miloš Mulač, Zdeněk Salvet
26.2.2007

1   Abstract

Keeping track of a job within a complex Grid environment is a difficult task that cannot be easily delegated to inspection of data from Grid infrastructure monitoring. The job centric monitoring service is used to provide data about actual job status independently on jobs crossing internal administrative domains. It is also a valuable source of data for system administrators helping to improve the infrastructure behavior. The Logging and Bookkeeping (L&B) developed within the EGEE project provides a distributed scalable job centric monitoring service able to deal with hundreds of thousands of jobs on large Grids. To provide the necessary scalability and not to slow down the processing of jobs within a middleware, the service is based on a non-blocking asynchronous model. This means that the order of events sent to L&B by individual parts of the middleware (user interface, scheduler, computing element, etc.) is not guaranteed. A robust on-the-fly processing is used to derive a meaningful job state from events arriving in random order. The L&B may thus temporarily provide information that looks inconsistent with the knowledge user has from some other source (for example, he got an independent notification about the job state). The report provides details of the L&B internal design, and the way of correct interpretation of the L&B results (the job state) is also discussed. While L&B is dealing with active jobs only, the Job Provenance (JP) is designed to permanently store information about all jobs that run on a Grid. All the relevant information, including computing environment specification and basic input data, needed to re-submit the job in the same environment is stored and made available for a later perusal. Users can annotate stored records, providing yet another metadata layer useful, for example, for job grouping and data mining over the JP. As the data are never removed from the JP (it can only grow), its architecture is different from the one used for the L&B, separating clearly the storage and querying capabilities. We present an implementation developed within the EU EGEE project.

2   Introduction

Before we start describing the services developed by our group we would like to provide the gentle reader with a context of our work, namely the area of job-centric grid monitoring and proposed or used solutions.

2.1   Job-Centric Grid Monitoring

The Grid is a complex environment consisting of very large number of hardware and software components. To be able to deal with such an environment, users are shielded from the complexity and usually see the Grid through a portal or some other user interface. However, the hidden complexity may be a source of unexpected and unwanted surprises, when submitted jobs do not behave as expected. Both users and Grid administrators need services that monitor the Grid and provide basic information about the state of its components. While this Grid infrastructure monitoring provides very valuable information especially for system administrators, users are usually mostly interested in the state and history of their jobs. This information is not easily available through the Grid monitoring systems that usually follow the physical topology of the Grid and its hierarchy - cluster, institution, region, wide area Grid - that is not related to the job behavior. Jobs move freely through the Grid, being processed by schedulers and passing internal administrative domains of individual resource providers. To see an actual status of a job, individual information pieces, coming from different administrative domains, must be collected and properly interpreted.

The Job centric Grid monitoring [Kou04] deals with the problem by providing a specific service that tracks not the Grid components themselves, but the moving object - the submitted job. This service collects information from all the components that somehow "touch" the job during its lifetime within the Grid. This job-related information is stored in one place that serves further as an authoritative source of job state information. As all the information about a particular job is stored in one place, the user does not need to follow the job manually over the Grid, digesting partial information from different components. The service can also hide all the authentication and authorization complexity usually associated with accessing the Grid monitoring data. The service purges all irrelevant (but sometimes highly sensitive) information that may be part of the Grid monitoring output and leaves only data about a particular job - data the job owner is usually automatically authorized to inspect.

The real time job centric monitoring system is primarily used to track the job status during the job's lifetime within the Grid. Either through polling or through some notification service, users can usually access not only the job status - for example scheduled, running etc. - but also auxiliary information associated with a job or its actual state. Information which cluster or even which computing element is processing my job, what has been a reason for a resubmission, which resource broker is handling the job, are just examples of information available this way. The job centric monitoring system can be also used to provide more aggregate information, like how many of my jobs are in a particular state, which jobs are not finished yet etc. It is also useful to detect problems that may be otherwise difficult to catch, like a computing element with too high job failure rate, very slow resource broker, available this way. This kind of information can help users to better navigate through a complex Grid and system administrators can use the same service to improve the infrastructure.

The information collected by the job centric monitoring system is usually purged when the job finishes and leaves the Grid. However, its usefulness is not limited to the job lifetime. When associated with more data about the job - like the input and output files - and also with data about the environment where the job runs - version of operating system, libraries, and eventually application used, the run time environment setup and variables - a Job Provenance is created. This way, a history of job execution is made available for different purposes. Users (and if authorized, also Grid system administrators) can perform data mining, to search for unusual patterns in job or infrastructure behavior, to collect jobs with common features or to look for a specific job and its input parameters and files. Probably the most important feature of the Job Provenance is the support for job-resubmission. As all the relevant information is available - the input files, job description, environment setting etc. - the job can be re-run either "as is" or some of the parameters can be changed (or even the job binary can be replaced, for example, in case an implementation or algorithm error has been found) and the modified job resubmitted again. Associated with extensive search and collection capabilities this is a very powerful tool to support efficient use of even very large Grids with high number of jobs.

Both the real time and long term job centric monitoring can be extended through a user-provided additional information - the annotations. This is information that is stored in the appropriate databases as a result of explicit users' action and is usually used to improve otherwise too technical (automatically produced by the Grid components) job track. The annotations can be as simple as a brief description of a job (its "title") or they can be more complex and usable for marking jobs with similar properties or features. The annotations are especially important within a provenance, as they help users to keep the information better organized.

The data stored in job centric monitoring systems can be also used to improve the infrastructure. Being part of a general Grid monitoring, they provide an "independent" job (and user) oriented view on the Grid, serving as an auxiliary and complemental information providers - the cross check. The history data available in a Job Provenance is also the source for a statistical evaluation of the Grid behavior. Users perceive the Grid through the way their jobs are processed (successfully completed, failed, lost, resubmitted due to infrastructure failures etc.), so the statistical evaluation of the rich job history available in Job Provenance gives a view that is very close to end user perception of the Grid performance.

Within an EU DataGrid project we started to develop the Logging and Bookkeeping (L&B) service as the real time job centric monitoring system for large scale Grids. The work further extended into a design and implementation of the Job Provenance service, available within the EU EGEE and EGEE II projects. In the following sections, we describe architecture and implementation of both services in a detail, as a particular incarnation of the generic job centric Grid monitoring system principles.

2.2   Related Work

General Grid monitoring is an active research area focused primarily on monitoring Grid infrastructure and its resources. The data provided by such Grid monitoring systems are used by resource brokering and scheduling, for accounting purposes, and generally for management of resources and performance evaluation. The comprehensive survey [Ger04] describes 26 different monitoring tools and infrastructures, however, none of these is focused on tracking specifically Grid jobs.

2.2.1   A Grid Monitoring Architecture

The general Grid Monitoring Architecture (GMA) [Tie01] has been proposed by the GGF Performance Working Group. The GMA defines basic components of a general monitoring system: producers that consist of sensors and that provide the initial monitoring information; consumers that receive the monitoring data in form of events; and a directory that provides a lookup service and manage connections between producers and consumers. This basic picture can be enhanced by re-publishers, that implement both producer and consumer interfaces and process the monitoring events (they can filter, cache, or aggregate the information, make a digest etc.). GMA proposal also identifies several types of interaction between these components, namely publish/subscribe, query/response, and notification.

Most of current monitoring tools including the L&B implementation follow the GMA standard. Basic GMA model was studied and extended in many papers. R-GMA [Fis01] applies model of a relational databases to monitoring systems, CODE [Smi02] and CGMA [Kra05] propose new component called manager or mediator, Mercury [Bal01] studies possibility of building lightweight monitoring system without central repository. Different possibilities of registry implementation were also studied - in [Cec05] distributed registry using Content-based Publish/Subscribe Systems [Car01] was proposed, LDAP based implementations of registry can be found in CODE [Smi02] or MDS2 [Cza01], relational database model is used in R-GMA [Fis01], distributed web-services are building blocks in the MDS3 [Fos05].

2.2.2   General Monitoring Systems

Two extensive surveys of monitoring systems (performance analysis systems survey [Ger04], taxonomy of Grid monitoring systems [ZS05]) provide basic overview of already implemented monitoring systems.

Taxonomy of Grid monitoring tools ([ZS05]) classifies different monitoring systems based on criteria, which are also relevant to job monitoring - type of provision of monitored information, characteristics of GMA component model, type of entities that are primarily monitored and whether studied systems allow data republishing and hierarchies of components. They compare classical "cluster monitoring tools" like Ganglia [MCC04], monitoring tools included in Grid toolkits (MDS from Globus [Cza01], HawkEye from Condor, general GMA implementations like Mercury [Bal01] and CODE [Smi02], tools providing unified way for accessing various monitoring data sources (GridRM [BS03] and MonALISA [New03]) and several monitoring tools developed for one particular usage - Pablo+Autopilot [Rib98] and NetLogger [TG03] for performance analysis, NWS [WSH99] for non-intrusive performance monitoring and forecasting of distributed resources, primarily intended for networks and hosts.

For interoperability of monitoring systems, common schema of monitoring events and compatible data models would be very helpful. However, GMA, being architecture proposal, does not pose such requirements. GGF proposed simple producer-consumer protocol [SGQ01], which is used for example in CODE. Concerning event types, several GGF groups defined schemas based on their research area - Network Measurements Working Group provides schema for network measurements, CIM-based Grid Schema Working Group was working on schema based on the Common Information Model (CIM). In the DataTag project, a GLUE schema [ASV03] was proposed to allow interoperability between US Globus based projects and the EU DataGrid project. Condor is using ClassAd schema-free expressions [RLS98] for description of resources and jobs, and XML-encoded version of ClassAds is used in HawkEye. In the NetLogger, text-based Universal Logger Message format (ULM [AD97]) is used.

Mercury [Bal01], developed in the GridLab project, is C-based GMA implementation with focus to resource and performance monitoring. Mercury implements only producers and consumers, no registry is used. Mercury has a modular infrastructure where individual sensors are implemented as loadable modules thus providing easy extensibility. Mercury also features a hierarchical design where lower level monitoring components can aggregated under higher level components. However, no specialized republishers are provided; Mercury provides only infrastructure, which allows implementation of such features.

The Relational Grid Monitoring Architecture (R-GMA [Fis01]) extends the GMA with a relational model known from database world to express data types and queries in a Grid environment. R-GMA motivation is to support "publication of static and dynamic data, a global view of this data, and a query mechanism capable of dealing with latest-state, continuous, and history queries" [Coo03]. R-GMA distinguishes two basic types of producers: (i) the database producer publishing set of static relations maintained in a database, and (ii) the stream producer, publishing more frequently changed streams of tuples. Consumer is defined by an SQL query, together with requirement on a type of query: continuous, history, or latest-state. New schema component is defined - it maintains definitions of relations and attributes and supports dynamic management of new relations. R-GMA defines also republishers - the implementation of views in database systems. A republisher is defined by one or more queries and publishes answers to those queries. The registry maintains information about producers (relations and views) and consumers (queries) and it is used by the mediator component, which dynamically creates "query plan" for queries required by consumers.

MonALISA [New03] is a Jini-based framework for host and network monitoring. Monitoring servers contain monitoring agents, which collect monitoring data from local sources (batch systems, Ganglia, SNMP) and store this data in a local database. Clients are using Jini discovery service to locate informations providers and download client code for specific data providers. Both real-time and historical queries are supported. System is modular, monitoring agents can be managed using administration tools.

Both R-GMA and MonALISA provides general framework, which can be compared with event gathering part (the logging) of the L&B. However, the monitoring systems usually expect data flows where the reliability of the monitoring infrastructure is not of the utmost importance - a loss of few events is quickly recovered by new arriving events. Another aspect is the security - only recently the monitoring data are becoming a security concern, the monitoring systems prioritize performance over security and data are sent with little or no authentication, authorization and encryption. In case of complex systems with lot of internal processing (like in the R-GMA), the performance of the actual implementation, especially if Java or some scripting language is used, may be far from optimal (see R-GMA performance results in [ZFS03]).

2.2.3   Job Monitoring

Simple job monitoring - collecting information concerning status and characteristics of jobs - is a natural part of job and resource management systems (RMS). Traditional batch execution systems (PBS [HT96], LSF [Zho92], Maui [JSC01]) support this type of monitoring, with several fundamental restrictions - central job database is used, usually without emphasis on fault-tolerant event transmission, history of forepassed jobs is limited. These systems use the RMS service as a primary source of job related events, with no or very limited support for information exchange between individual RMS installations. They all work with an implicit model of a single administrative domain.

Globus Grid Toolkit 4 includes information service MDS3 [Fos05], which provides monitoring information about Grid resources. Running jobs are published in MDS3, too, but no care is taken to provide an aggregate information coming from different Grid components. Globus project designed also the GRAM protocol [Cza88], which provides an abstraction for remote process queuing, execution and monitoring. Globus project provides GRAM server, which supports variety of batch systems. However, GRAM provides access to local resource (for example a cluster) and it must be supplemented with a system that manages (and monitors) jobs on a Grid wide level. Condor [LLM88] or gLite WMS [Lau04] are widely used to interconnect GRAM enabled resources, taking also care of the corresponding job monitoring tasks.

In addition to standard monitoring of running jobs, UNICORE [Str05] provides distributed logging service - information about submitted and running jobs are stored into log files. However, job logs are spread across multiple locations and there is no service providing a unified view on the job.

Both Globus and UNICORE support monitoring of jobs only when the job is managed by their own infrastructure. As soon as it is passed to a different system (for example the local batch system), only summary information about the job life is provided to monitoring infrastructure. Information about jobs is available only for "alive" jobs - jobs waiting in the queue or currently running.

Condor is a specialized job and resource management system for compute-intensive jobs [TTL05], which extends traditional RMS in several areas - support for high-throughput and opportunistic computing. Large number of maintained jobs, together with higher possibility of failures, motivated Condor developers to a more scalable and robust design of monitoring and management subsystems. Moreover, with wider adoption of Grids, Condor system was extended to support job submission to different Grids. Currently Condor supports submission to Globus, UNICORE and of course also Condor based Grids and is still able to monitor job status. Condor provides short history information (equivalent to the L&B history), no permanent information with timescale of Job Provenance is provided.

Specific monitoring area, which is often confused with the job centric monitoring, is the application performance monitoring. Here, information about the actual job run is collected and later used for performance evaluation. The data flow coming from an application can be several orders of magnitude higher than data flows considered within the job centric Grid monitoring. While there is a potential overlap in case of very large scale distributed jobs that span several administrative domains within the Grid, the application monitoring deals with a clearly identified job on a clearly identified resource (or resources). There has been attempts to use the GMA based systems to transport the application monitoring data. In [BG03], monitoring tool which allows performance monitoring and steering of application is described. This report also deals with a problem of different job identifications and their respective mapping, without a unique Grid wide job identifier used by all the components and sub-grids. In [Bon03] jobs are wrapped by an executable which monitors the application and the data are sent through a standalone monitoring infrastructure (R-GMA). Similar approach can be found in Grid Analysis Environment (GAE [Ali04]). Its Job Monitoring Service monitors executables running on a given host and provides web service interface for job monitoring and also for job steering. Monitoring data can be also published into MonALISA monitoring tool.

All the application monitoring systems expect direct user interaction during the monitoring - they provide a steering interface to control the monitoring and to collect data on demand and only when really needed. They are not designed as general fully automated systems monitoring all the Grid jobs, in fact only a very small fraction of jobs is monitored in a high detail, as large amount of produced information restricts its general usability. If the data are transmitted over the network in the real time, specialized transport infrastructure must be used (for example no real-time processing, no encryption, very lightweight components etc.).

As we can see from the job centric point of view, some important features are missing in all the surveyed tools:

L&B and Job Provenance services described in detail in the next sections are designed to provide all these features.

2.2.4   Job Provenance

The EU Provenance project (Enabling and Supporting Provenance in Grids for Complex Problems) unites most European groups working in the area of (job) provenances. Papers presented at the recent Provenance and Annotation workshop give a very good overview of this subject. In [Che05] a broad definition of provenance is proposed - the provenance of a piece of data is the process that led to the data - and proof of concept work on provenance in a Service Oriented Architecture is presented. Most current provenance architectures are concerned with incorporation of provenance capabilities into a Web Service-based Grid environments [Alp05]. They are primarily interested in either the workflow evolution or in the origin and history of a piece of data. The usually considered scenario of execution of a composite service by trusted workflow engine provides the ability to collect and archive the provenance about the transformation of data during invocation of web services. The dependence on the SOA restricts the usability of results when a "legacy", non-WS-based services are potential generators of provenance data, as the encapsulation of such services is still in its infancy and not widely available in all Grid environments.

In EGEE related projects, number of HEP experiments have their own workflow systems that also provide some provenance and annotation (metadata) capabilities. The ATLAS experiment uses AMI database application framework for production physics bookkeeping. AMI also manages "tasks" - the transformations that can be applied to datasets and their configuration parameters - and stores links between tasks and datasets. AliEn (ALICE Environment) is a Grid framework that takes care of job splitting and execution, manages datasets, and keeps track of the basic provenance of each executed job or file transfer. The AliEn File Catalog also provides interface for attaching new annotation (metadata) database tables to its standard directory tables. SAM (Sequential data Access via Meta-data) used in the DZero Experiment is a data handling system designed to store and retrieve files and associated metadata, including a complete record of the processing which has used the files. The major drawback of those systems is that they do not typically record activities of other middleware services and do not provide full provenance data due to their position at the top of the job submission (service invocation) chain. On the contrary, the Job Provenance is designed as an essential Grid service, defining a unified but still flexible framework for collecting and keeping provenance data coming from different middleware components.

3   Just in Time Information Provision - the Logging and Bookkeeping Service

3.1   Purpose and requirements

The Logging and Bookkeeping service (L&B for short) was initially developed in the EU DataGrid project as a part of the Workload Management System (WMS). The development continues in the EGEE and EGEE-II projects, where L&B became an independent part of the gLite middleware [Lau04].

L&B's primary purpose is tracking WMS jobs as they are processed by individual Grid components, not counting on the WMS to provide this data. The information gathered from individual sources is collected, stored in a database and made available at a single contact point. The user get a complete view on her job without the need to inspect several service logs (which she may not be authorized to see in the entirety or she may not be even aware of their existence).

While L&B keeps track of submitted and running jobs, the information is kept by the L& Bservice also after the job has been finished (successfully completed its execution, failed, or has been canceled for any reason). The information is usually available several days after the last event related to the job arrived, to give user an opportunity to check the job final state and eventually evaluate failure reasons.

As L&B collects also information provided by the WMS, the WMS services are no longer required to provide job-state querying interface. Most of the WMS services can be even designed as stateless - they process a job and pass it over to another service, not keeping state information about the job anymore. During development and deployment of the first WMS version this approach turned to be essential in order to scale the services to the required extent [Ave04].

L&B must collect information about all important events in the Grid job life. These include transitions between components or services, results of matching and brokerage, waiting in a queue systems or start and end of actual execution. We decided to achieve this goal through provision of an API (and the associated library) and instrumenting individual WMS services and other Grid components with direct calls to this API. But as L&B is a centralized service (there exists a single point where all information about a particular job must eventually arrive), direct synchronous transfer of data could have prohibiting impact on the WMS operation. The temporary unavailability or overload of the remote L&B service must not prevent (nor block) the instrumented service to perform as usual. An asynchronous model with a clear asynchronous delivery semantics, see Section, is used to address this issue.

As individual Grid components has only local and transient view about a job, they are able to send only information about individual events. This raw, fairly complex information is not a suitable form to be presented to the user for frequent queries. It must be processed at the central service and users are presented primarily this processed form. This form derives its form from the job state and its transition, not from the job events themselves. The raw information is still available, in case more detailed insight is necessary.

While the removal of state information from (some of) the WMS services helped to achieve the high scalability of the whole WMS, the state information is still essential for the decisions made within the resource broker or during the matchmaking process. For example, decision on job resubmission is usually affected by the number of previous resubmission attempts. This kind of information is currently available in the L&B only, so the next "natural" requirement has been to provide an interface for WMS (and other) services to the L&B to query for the state information. However, this requirement contains two contradictions: (i) due to the asynchronous event delivery model, the L&B information may not be up to date and remote queries may lead to unexpected results (or even inconsistent one - some older information may not be available for one query but may arrive before a subsequent query is issued), and (ii) the dependence on a remote service to provide vital state information may block the local service if the remote one is not responding. These problems are addressed by providing local view on the L&B data, see Section.

3.2   Concepts

3.2.1   Jobs and events

To keep track of user jobs on the Grid, we first need some reliable way to identify them. This is accomplished by assigning a unique identifier, which we call jobid ("Grid jobid"), to every job before it enters the Grid. A unique jobid is assigned, making it the primary index to unambiguously identify any Grid job. This jobid is then passed between Grid components together with the job description as the job flows through the Grid; the components themselves may have (and usually do) their own job identifiers, which are unique only within these components.

Every Grid component dealing with the job during its lifetime may be a source of information about the job. The L&B gathers information from all the relevant components. This information is obtained in the form of L&B events, pieces of data generated by Grid components, which mark important points in the job lifetime (for example, passing of job control between the Grid components are important milestones in job lifetime independently on the actual Grid architecture); see Appendix A for a complete list. We collect those events, store them into a database and simultaneously process them to provide higher level view on the job's state. The L&B collects redundant information - the event scheme has been designed to be as redundant as possible - and this redundancy is used to improve resiliency in a presence of component or network failures, which are omnipresent on any Grid.

The L&B events themselves are structured into attribute = value pairs, the set of required and optional attributes is defined by the event type (or scheme). For the purpose of tracking job status on the Grid and with the knowledge of WMS Grid middleware structure we defined an L&B schema with specific L&B event types [ZS05]. The schema contains a common base, the attributes that must be assigned to every single event. The primary key is the jobid, which is also one of the required attributes. The other common attributes are currently the timestamp of the event origin, generating component name and the event sequence code (see Section).

While the necessary and sufficient condition for a global jobid is to be Grid-wide unique, additional desired property relates to the transport of events through the network: All events belonging to the same job must be sent to the same L&B database. This must be done on a per message basis, as each message may be generated by a different component. The same problem is encountered by users when they look for information about their job - they need to know where to find the appropriate L&B database too. While it is possible to devise a global service where each job registers its jobid together with the address of the appropriate database, such a service could easily become a bottleneck. We opted for another solution, to keep the address of the L&B database within the jobid. This way, finding appropriate L&B database address becomes a local operation (at most parsing the jobid) and users can use the same mechanism when connecting to the L&B database to retrieve information about a particular job (users know its jobid). To simplify the situation even further, the jobid has the form of an URL, where the protocol part is "https", server and port identify the machine running the appropriate L&B server (database) and the path contains base64 encoded MD5 hash of random number, timestamp, PID of the generating process and IP address of the machine, where the jobid was generated. Jobid in this form can be used even in the web browser to obtain information about the job, provided the L&B database runs a web server interface. This jobid is reasonably unique - while in theory two different job identifications can have the same MD5 hash, the probability is low enough for this jobid to represent a globally unique job identification.

As described in the previous section, information about jobs are gathered from all the Grid components processing the job in the form of L&B events. The gathering is based on the push model where the components are actively producing and sending events. The push model offers higher performance and scalability than the pull model, where the components are to be queried by the server. In the push model, the L&B server does not even have to know the event sources, it is sufficient to listen for and accept events on defined interface.

The event delivery to the destination L&B server is asynchronous and based on the store-and-forward model to minimize the performance impact on component processing. Only the local processing is synchronous, the L&B event is sent synchronously only to the nearest L&B component responsible for event delivery. This component is at the worst located in the same local area network (LAN) and usually it runs on the same host as the producing component. The event is stored there (using persistent storage - disk file) and confirmation is sent back to the producing component. From the component's point of view, the send event operation is fast and reliable, but its success only means the event was accepted for later delivery. The L&B delivery components then handle the event asynchronously and ensure its delivery to the L&B server even in the presence of network failures and host reloads.

It is important to note that this transport system does not guarantee ordered delivery of events to the L&B server; it does guarantee reliable and secure delivery, however. The guarantees are statistical only, as the protocol is not resilient to permanent disk or node crashes nor to the complete purge of the data from local disk. Being part of the trusted infrastructure, even the local L&B components should run on a trusted and maintained machine, where additional reliability may be obtained for example by a RAID disk subsystem.

3.2.2   Event processing

As described in the previous section, L&B gathers raw events from various Grid middleware components and aggregates them on a single server on a per-job basis. The events contain a very low level detailed information about the job processing at individual Grid components. This level of detail is valuable for tracking various problems with the job and/or the components, and as complementary events are gathered (for example, each job control transfer is logged independently by two components), the information is highly redundant. Moreover, the events could arrive in wrong order, making the interpretation of raw information difficult and not straightforward. Users, on the other hand, are interested in a much higher view, the overall state of their job.

For these reasons the raw events undergo complex processing, yielding a high level view, the job state, that is the primary type of data presented to the user. Various job states form nodes of the job state diagram (Figure). See Appendix B for a list of the individual states.

[Figure]

Figure 1: L&B job state diagram

L&B defines a job state machine that is responsible for updating the job state on receiving a new event. The logic of this algorithm is non-trivial; the rest of this section deals with its main features.

Transitions between the job states happen on receiving events of particular type coming from particular sources. There may be more distinct events assigned to a single edge of the state diagram. For instance, the job becomes Scheduled when it enters batch system queue of a Grid computing element. The fact is witnessed by either Transfer/OK event reported by the job submission service or by Accept event reported by the computing element. Receiving any one of these events (in any order) triggers the state change.

This way, the state machine is highly fault-tolerant - it can cope with delayed, reordered or even lost events. For example, when a job is in the Waiting state and the Done event arrives, it is not treated as inconsistency but it is assumed that the intermediate events are delayed or lost and the job state is switched to the Done state directly.

The L&B events carry various common and event-type specific attributes, for example, timestamp (common) or destination (Transfer type). The job state record contains, besides the major state identification, similar attributes, for example, an array of timestamps indicating when the job entered each state, or location - identification of the Grid component which is currently handling the job. Updating the job state attributes is also the task of the state machine, employing the above mentioned fault tolerance - despite a delayed event cannot switch the major job state back it still may carry valuable information to update the job state attributes.

3.2.3   Event ordering

As described above, the ability to correctly order arriving events is essential for the job state computation. As long as the job state diagram was acyclic (which was true for the initial WMS release), each event had its unique place in the expected sequence hence event ordering could always be done implicitly from the context. However, this approach is not applicable once job resubmission yielding cycles in the job state diagram was introduced.

Event ordering that would rely on timestamps assigned to events upon their generation, assuming strict clock synchronization over the Grid, turned to be a naive approach. Clocks on real machines are not precisely synchronized and there are no reliable ways to enforce synchronization across administrative domains.

To demonstrate a problem with desynchronized clocks, that may lead to wrong event interpretation, let us consider a simplified example in Tab. Table. We assume that the workload manager (WM) sends the job to a computing element (CE) A, where it starts running but the job dies in the middle. The failure is detected and the job is resubmitted back to the WM which sends it to CE B then. However, if A's clock is ahead in time and B's clock is correct (which means behind the A's clock), the events in the right column are treated as delayed. The state machine will interpret events incorrectly, assuming the job has been run on B before sending it to A. The job would always (assuming the A's events arrive before B's events to the L&B) be reported as "Running at A" despite the real state should follow the Waiting...Running sequence. Even the Done event can be sent by B with a timestamp that says this happened before the job has been submitted to A and the job state will end with a discrepancy - it has been reported to finish on B while still reported to run on A.

1. WM: Accept 6. WM: Accept
2. WM: Match A 7. WM: Match B
3. WM: Transfer to A 8. WM: Transfer to B
4. CE A: Accept 9. CE B: Accept
5. CE A: Run 10. CE B: Run
... A dies

Table 1: Simplified L&B events in the CE failure scenario

Therefore we are looking for a more robust and general solution. We can discover severe clock bias if the timestamp on an event is in a future with respect to the time on an L&B server, but this is generally a dangerous approach (the L&B server clock could be severely behind the real time). We decided not to rely on absolute time as reported by timestamps, but to introduce a kind of logical time that is associated with the logic of event generation. The principal idea is arranging the pass through the job state diagram (corresponding to a particular job life), that may include loops, into an execution tree that represents the job history. Closing a loop in the pass through the state diagram corresponds to forking a branch in the execution tree. The scenario in Tab. Table is mapped to the tree in Figure. The approach is quite general - any finite pass through any state diagram (finite directed graph) can be encoded in this way.

[Figure]

Figure 2: Job state sequence in the CE failure scenario, arranged into a tree. Solid lines form the tree, arrows show state transitions.

Our goal is augmenting L&B events with sufficient information that

If such information is available, the execution tree can be reconstructed on the fly as the events arrive, and even delayed events are sorted into the tree correctly. An incoming event is considered for job state computation only if it belongs to the most recent branch.

The situation becomes even more complicated when the shallow resubmission WM advanced feature is enabled. In this mode WM may resubmit the job before being sure the previous attempt is really unsuccessful, potentially creating multiple parallel instances of the job. The situation maps to several branches of the execution tree that evolve really in parallel. However, only one of the job instances becomes active (really running) finally; the others are aborted. Because the choice of active instance is done later, it may not correspond to the most recent execution branch. Therefore, when an event indicating the choice of active instance arrives, the job state must be recomputed, using the corresponding active branch instead the most recent one.

Section describes the current implementation of event ordering mechanism based on ideas presented here.

3.2.4   Queries and notifications

According to the GMA classification the user retrieves data from the infrastructure in two modes, called queries and notifications in L&B.

Querying L&B is fairly straightforward - the user specifies query conditions, connects to the querying infrastructure endpoint, and receives the results. For "single job" queries, where jobid is known, the endpoint (the appropriate L&B server) is inhered from the jobid. More general queries must specify the L&B server explicitely, and their semantics is intentionally restricted to "all such jobs known here". We trade off generality for performance and reliability, leaving the problem of finding the right query endpoint(s), the right L&B servers, to higher level information and service-discovery services.

If the user is interested in one or more jobs, frequent polling of the L&B server may be cumbersome for the user and creates unnecessary overload on the sever. A notification subscription is therefore available, allowing users to subscribe to receive notification whenever a job starts matching user specified conditions. Every subscription contains also the location of the user's listener; successful subscription returns time-limited notification handle. During the validity period of the subscription, the L&B infrastructure is responsible for queuing and reliable delivery of the notifications. The user may even re-subscribe (providing the original handle) with different listener location (for example moving from office to home), and L&B re-routes the notifications generated in the meantime to the new destination. The L&B event delivery infrastructure is reused for the notification transport.

3.2.5   Local views

As outlined in Section WMS components are, besides logging information into L&B, interested in querying this information back in order to avoid the need of keeping per-job state information. However, despite the required information is present in L&B, the standard mode of L&B operation is not suitable for this purpose due to the following reasons:

The problem can be overcome by introducing local view on job data. Besides forwarding events to the server where events belonging to a job are gathered from multiple sources, L&B infrastructure can store the logged events temporarily on the event source, and perform the processing described in Section. In this setup, the logging vs. query semantics can be synchronous - it is guaranteed that a successfully logged event is reflected in the result of an immediately following query, because no network operations are involved. Only events coming from this particular physical node (but potentially from all services running there) are considered, thus the locality of the view. On the other hand, certain L&B events are designed to contain redundant information, therefore the local view on processed data (job state) becomes virtually complete on a reasonably rich L&B data source like the Resource Broker node.

3.3   Current L&B implementation

The principal components of the L&B service and their interactions are shown in Figure (gathering and transferring L&B events) and Figure (L&B query and notification services).

[Figure]

Figure 3: L&B components involved in gathering and transferring the events (large image)

3.3.1   L&B API and library

Both logging events and querying the service are implemented via calls to a public L&B API. The complete API (both logging and queries) is available in ANSI C binding, most of the querying capabilities also in C++. These APIs are provided as sets of C/C++ header files and shared libraries. The library implements communication protocol with other L&B components (logger and server), including encryption, authentication etc.

We do not describe the API here in detail; it is documented in L&B User's Guide [MCC04], including complete reference and both simple and complex usage examples.

Events can be also logged with a standalone program (using the C API in turn), intended for usage in scripts.

The query interface is also available as a web-service provided by the L&B server (Section).

3.3.2   Logger

The task of the logger component is taking over the events from the logging library, storing them reliably, and forwarding to the destination server. The component should be deployed very close to each source of events - on the same machine ideally, or, in the case of computing elements with many worker nodes, on the head node of the cluster2.

Technically the functionality is realized with two daemons:

3.3.3   Server

L&B server is the destination component where the events are delivered, stored and processed to be made available for user queries. The server storage backend is implemented using MySQL database.

Incoming events are parsed, checked for correctness, authorized (only the job owner can store events belonging to a particular job), and stored into the database. In addition, the current state of the job is retrieved from the database, the event is fed into the state machine (Section), and the job state updated accordingly.

On the other hand, the server exposes querying interface (Figure, Section). The incoming user queries are transformed into SQL queries on the underlying database engine. The query result is filtered, authorization rules applied, and the result sent back to the user.

While using the SQL database, its full query power is not made available to end users. In order to avoid either intentional or unintentional denial-of-service attacks, the queries are restricted in such a way that the transformed SQL query must hit a highly selective index on the database. Otherwise the query is refused, as full database scan would yield unacceptable load. The set of indices is configurable, and it may involve both L&B system attributes (for example job owner, computing element, timestamps of entering particular state, ...) and user defined ones.

The server also maintains the active notification handles (Section), providing the subscription interface to the user. Whenever an event arrives and the updated job state is computed, it is matched against the active handles3. Each match generates a notification message, an extended L&B event containing the job state data, notification handle, and the current user's listener location. The event is passed to the notification inter-logger via persistent disk file and directly (see Figure). The daemon delivers events in the standard way, using the specified listener as destination. In addition, the server generates control messages when the user re-subscribes, changing the listener location. Inter-logger recognizes these messages, and changes its routing of all pending events belonging to this handle accordingly.

3.3.4   Proxy

L&B proxy is the implementation of the local view concept (Section). When deployed (on the Resource Broker node in the current gLite middleware) it takes over the role of the local-logger daemon - it accepts the incoming events, stores them in files, and forwards them to the inter-logger.

In addition, the proxy provides the basic principal functionality of L&B server, that is, processing events into job state and providing a query interface, with the following differences:

3.3.5   Sequence codes for event ordering

As discussed in Section, sequence codes are used as logical timestamps to ensure proper event ordering on the L&B server. The sequence code counter is incremented whenever an event is logged and the sequence code must be passed between individual Grid components together with the job control. However, a single valued counter is not sufficient to support detection of branch forks within the execution tree. When considering again the Computing Element failure scenario described in Section, there is no way to know that the counter value of the last event logged by the failed CE A is 5 (Table).

1:x WM: Accept 4:x WM: Accept
2:x WM: Match A 5:x WM: Match B
3:x WM: Transfer to A 6:x WM: Transfer to B
3:1 CE A: Accept 6:1 CE B: Accept
3:2 CE A: Run 6:2 CE B: Run
... A dies

Table 2: The same CE failure scenario: hierarchical sequence codes. "x" denotes an undefined and unused value.

Therefore we define a hierarchical sequence code - an array of counters, each corresponding to a single Grid component class handling the job (currently the following gLite components: Network Server, Workload Manager, Job Controller, Log Monitor, Job Wrapper, and the application itself). Table shows the same scenario with a simplified two-counter sequence code. The counters correspond to the WM and CE component classes and they are incremented when each of the components logs an event. When WM receives the job back for resubmission, the CE counter becomes irrelevant (as the job control is on WM now), and the WM counter is incremented again.

The state machine keeps the current (highest seen) code for the job, being able to detect a delayed event by simple lexicographic comparison of the sequence codes. Delayed events are not used for major state computation, then. Using another two assumptions (that are true for the current implementation):

it is safe to qualify events with lower WM counter (than the already received one) to belong to inactive branches, hence ignore them even for update of job state attributes.

3.4   User interaction

[Figure]

Figure 4: L&B queries and notifications (large image)

So far we focused on the L&B internals and the interaction between its components. In this section we describe the interaction of users with the service.

3.4.1   Event submission

The event submission is mostly implicit, that is, it is done transparently by the Grid middleware components on behalf of the user. Typically, whenever an important point in the job life is reached, the involved middleware component logs an appropriate L&B event. This process is not directly visible to the user.

A specific case is the initial registration of the job. This must be done synchronously, as otherwise subsequent events logged for the same job may be refused with a "no such job" error report. Therefore submission of a job to the WMS is the only synchronous event logging that does not return until the job is successfully registered with the L&B server.

However, the user may also store information into the L&B explicitly by logging user events - tags (or annotations) of the form "name = value". Authorization information is also manipulated in this way, see L&B User's Guide for details.

3.4.2   Retrieving information

From the user point of view, the information retrieval is the most important interaction with the L&B service.

The typical L&B usage are queries on the high-level job state information. L&B supports not only single job queries, it is also possible to retrieve information about jobs matching a specific condition. The conditions may refer to both the L&B system attributes and the user annotations. Rather complex query semantics can be supported, for example, Which of my jobs annotated as "apple" or "pear" are already scheduled for execution and are heading to the "garden" computing element? The L&B User's Guide [TG03] provides a series of similar examples of complex queries.

As another option, the user may retrieve raw L&B events. Such queries are mostly used for debugging, identification of repeating problems, and similar purposes. The query construction refers to event attributes rather than job state.

The query language supports common comparison operators, and it allows two-level nesting of conditions (logically and-ed and or-ed). Our experience shows that it is sufficiently strong to cover most user requirements while being simple enough to keep the query cost reasonable. Complete reference of the query language can be found in L&B User's Guide.

3.4.3   Caveats

L&B is designed to perform well in the unreliable distributed Grid environment. An unwelcome but inevitable consequence of this design are certain contra-intuitive features in the system behavior, namely:

3.5   Security issues

The events passed between the L&B components as well as the results of their processing provide detailed information about the corresponding job and its life. Being used by the users to check status of their jobs and also by other Grid components to control the job, the information on jobs has to be reliable and reflect the real jobs' utilization of the Grid. Also, some user communities (for example biomedicine researchers) often process sensitive data and require the information about their processing is kept private so that only the job owner can access not only the result of the computation but also all information about the job run. Last but not least, according to legislation of some countries the information on users' jobs can be treated as the user private data, which requires an increased level of protection. L&B therefore must pay special attention to security aspects and access control to the data.

All L&B components communicate solely over authenticated channels. The TLS protocol [DA99] is used as the authentication mechanism and each of the L&B uses an X.509 public key certificate to establish a mutually authenticated connection. The users usually use their proxy certificates [Tue04] when accessing the L&B server and retrieving information about jobs. The proxy certificate can also contain other information necessary to create a secure connection, for example, information used for authorization. The L&B security layer is implemented using the the Generic Security Service API [Lin00], which makes it easier to port the application to an environment using mechanism other than PKI (for example Kerberos).

Apart from providing an authentication mechanism, the TLS protocol also allows the communicating parties to exchange an encryption key that is used to encrypt all subsequent communication. The L&B components encrypt all network communication to keep the messages private. Therefore, together with the access control rules implemented by the L&B server, the infrastructure provides very high level of privacy protection.

By default, access to a job information is only allowed to the user who submitted the job (the job owner). The job owner can also assign an access control list to her job in the L&B specifying other users who are allowed to read the data from L&B. The ACLs are represented in the GridSite GACL format [Cor03], which is a simplified version of common Extensible Access Control Markup Language (XACML) [OAS03]. The ACLs are stored in the L&B database along with the job information and are checked at each access to the data. The GridSite XML policy engine is used for policy evaluation. The ACLs are under control of the job owner, who can add and remove entries in the ACL arbitrarily using the L&B API or command-line tools. Each entry of an ACL can specify either a user subject name or a name of a VOMS group. The VOMS [Alf05] is a VO attribute provider service, which is maintained by the EGEE project. It allows to assign a user with groups and roles membership and issues to the users attribute certificates containing information about their current attributes. These attribute certificate are embedded in the user proxy certificate and checked by the L&B server at each user request handling.

Besides of using the ACLs, the L&B administrator can also specify a set of privileged users with access to all job records on a particular L&B server. These privileged users can for example collect information on usage and produce a monitoring data based on the L&B information.

Since the hostname of the L&B server is part of the job identification, it is easy for the user to check that the correct L&B server was contacted and no server spoofing took place and thus the data received from the server can be trusted. The L&B server on the other hand has no means of checking that the logged events originated from an authorized component. Everyone on the Grid possessing a valid certificate from a trusted CA can send an event to the L&B and let it store and process the event and possibly change the status of the corresponding job. This way a malicious user or service can confuse the L&B server by a forged events. This behavior is not a critical issue in the current model and the way in which L&B is used, however, we are designing a solution addressing this weakness. We plan to use the VOMS attributes issued to a selected components. These VOMS attributes must be presented when critical events are logged to the L&B server.

3.6   Advanced use

The usability of the L&B service is not limited to the simple tasks described earlier. It can be easily extended to support real-time job monitoring (not only the notifications) and the aggregate information collected in the L&B servers is a valuable source of data used for post-mortem statistical analysis of jobs and also the Grid infrastructure behavior. Moreover, L&B data can be used to improve scheduling decisions.

3.6.1   L&B and real time monitoring

The L&B server is extended to provide quickly and without any substantial load on the database engine the following data:

  1. number of jobs in the system grouped by internal status (Submitted, Running, Done, ...),

  2. number of jobs that reached final state in the last hour,

  3. associated statistics like average, maximum, and minimum time spent by jobs in the system,

  4. number of jobs that entered the WMS system in the last hour.

L&B server can be regularly queried to provide this data to give an overview about both jobs running on the Grid and also the behavior of the Grid infrastructure as seen from the job (or end user) perspective. Thus L&B becomes a data source for various real-tim Grid monitoring tools.

3.6.2   R-GMA feed

The L&B server also supports streaming the most important data - the job state changes - to another monitoring system. It works as the notification service, sending information about job state changes to a specific listener that is the interface to a monitoring interface. As a particular example of such a generic service, the R-GMA feed component has been developed. It supports sending job state changes to the R-GMA infrastructure that is part of the Grid monitoring infrastructure used in the EGEE Grid.

Currently, only basic information about job state changes is provided this way, taking into account the security limitation of the R-GMA.

3.6.3   L&B Job Statistics

Data collected within the L&B servers are regularly purged, complicating thus any long term post-mortem statistical analysis. Without a Job Provenance, the data from the L&B must be copied in a controlled way and made available in an environment where even non-indexed queries can be asked.

Using the L&B Job Statistics tools, one dump file per job is created when the job reaches a terminal state. These dump files can be further processed to provide and XML encoded Job History Record that contains all the relevant information from the job life. The Job History Records are fed into a statistical tools to reveal interesting information about the job behavior within the Grid.

This functionality is being replaced by the direct download of all the relevant data from the Job Provenance.

3.6.4   Computing Element reputability rank

Production operation of the EGEE middleware showed that misbehaving computing elements may have significant impact on the overall Grid performance. The most serious problem is the "black hole" effect - a CE that accepts jobs at a high rate but they all fail there. Such CE usually appears to be free in Grid information services so the resource brokers keep to assign further jobs to it.

L&B data contain sufficient information to identify similar problems. By processing the incoming data the information was made available as on-line auxiliary statistics like rate of incoming jobs per CE, rate of job failure, average duration of job etc. The implementation is lightweight, allowing very high query rate. On the RB the statistics are available as ClassAd functions, allowing the user to specify that similarly misbehaving CE's should be penalized or completely avoided when RB decides where jobs get submitted.

4   Permanent Information Storage and Retrieval - the Job Provenance Service

4.1   Purpose and requirements

The Job Provenance (JP) service is primarily designed to provide a permanent storage and advanced querying interface to the data about Grid jobs and the environment they were run in. This information is to be used for statistical purposes, lookup for patterns in the Grid behavior and also for job re-submission.

The Job Provenance extends the data model specified by the L&B service with additional information about each job - most specifically the input and output data files - and also information about the run time environment.

The Job Provenance must fulfill rather contradictory requirements. It must keep detailed information about each job, the environment the job run in and the affected files, as possible. On the other hand, being a permanent service, the job records must be kept reasonably small to fit into reasonable sized storage system. Given the expected number of jobs on large scale Grids - for example, the EGEE already reports on the JRA2: Quality Assurance page 20k jobs per day, that is 7.5M jobs per year, a number of jobs before the large experiments will be deployed - the JP must also support very efficient searching and querying features. Another problem is associated with the long life span of the JP service. It must be expected that the data formats will change over the time, while the JP is expected to deal with old and new data formats in a uniform way. They can be achieved via extensibility of the JP data model.

As the data collection serviced by the JP will extensively grow, it is impossible to rely only on the primary data when navigating through it. Users must be able to add annotations to individual job records and these annotations serve two primary purposes - to help in organizing the JP data and to be a source of additional information, not provided directly by the automated collection of primary data. Even annotations must follow the WORM (write once read many times) semantics, as they are always added on top of the already stored data, never re-writing the old annotations. Work with the most recent set of annotations as well as ability to inspect the history of annotations must be supported.

Figure depicts interaction between Job Provenance and other Grid middleware components (on the example of the gLite infrastructure).

[Figure]

Figure 5: Data flow into gLite Job Provenance

4.2   Concepts

4.2.1   Data gathering

The primary data organization in JP is on a per job basis, a concept taken from the L&B data organization. Every data item stored in JP is associated to a particular Grid job. As the overall storage capacity requirements may become enormous, we store only volatile data which are neither stored reliably elsewhere nor are reproducible by the job. The data gathered from the gLite middleware fall into the following categories:

In addition, the service allows the user to add arbitrary annotations to a job in a form of "name = value" pairs. These annotations can be recorded either during the job execution or at any time afterward. Besides providing information on the job (for example, it was a production-phase job of a particular experiment) these annotations may carry information on relationships between the job and other entities like external datasets, forming the desired data provenance record.

Once a piece of data is recorded for a job, it can be never updated or replaced. New values can be recorded4 but the old values are always preserved. Consequently the recorded history cannot be lost.

4.2.2   Data representation

The JP concept distinguishes between two views on the processed data. The data are stored in the JP in the raw representation. Two input modes are assumed, depending mainly on the size and structure of the data:

Most data manipulation is done using the logical view. Any piece of information stored in JP is represented as a value of a particular named attribute. Tags (user annotations) map to attributes in a straightforward way, name and value of the tags becoming name and value of an attribute. An uploaded file is usually a source of multiple attributes, which are automatically extracted via plugins. JP defines a file-type plugin interface API. The task of the plugin is parsing a particular file type and providing calls to retrieve attribute values.

To avoid naming conflicts even with future attributes, an attribute name always falls into a namespace. Currently we declare three different namespaces: for JP system attributes (for example job owner or registration time), attributes inherited from L&B, and unqualified user tags.

We keep the scheme symmetric, which means that none of the currently declared attribute namespaces is privileged in any sense. However, it may present a vulnerability - a malicious user may try to override a JP system attribute using the user annotation interface. Therefore each attribute value carries a further origin classification: currently system, user (recorded as tag), and file (found in an uploaded file).

Finally, as JP does not support updating data intentionally (see Section), multiple values of an attribute are allowed. The order in which the values were recorded can be reconstructed from timestamps attached to each value, getting the "attribute update" semantics if it is required.

Attributes, representing the logical view, is the only way to specify queries on JP. However, once the user knows an actual jobid, bulk files can be retrieved in the raw form, too (assumed to be useful in the case of input sandboxes reused for job re-execution).

4.2.3   Layered architecture

JP is formed of two classes of services: the permanent Primary Storage (JPPS) accepts and stores job data while possibly volatile and configurable Index Servers (JPIS) provide an optimized querying and data-mining interface to the end-users.

The expected large amount of stored data yields a requirement on maximal compactness of the storage. The raw data should be compressed, and the set of metadata kept with each job must be minimal. JP defines a set of primary attributes which are maintained by JPPS for each job. Jobid is the only mandatory primary attribute, other suggested ones are the job owner, submission time, and the virtual organization. All other attributes are retrieved from the raw data only when requested.

The restricted set of primary attributes prohibits user queries to be served by the JPPS directly (with the only exception of the known jobid). Due to the expected low selectivity of primary attributes such queries would result in processing large number of job records, overloading the server when the queries became frequent.

These contradictory requirements (compactness vs. performance) had to be resolved at another component layer. The main idea is preprocessing the huge JP dataset with several queries, each of them covering a superset of one expected class of user queries (for example jobs of particular VO, submitted in certain period). If these super-queries are chosen carefully, they retrieve only a small fraction of the primary data. Their results can thus be stored (or cached) in a richer form, including various indices, hence being suitable for fast response to user queries. Querying JPPS in this way and maintaining the cache of the query result is the task of Index Servers in the JP architecture.

Relationship of JPPS and JPIS is a many-to-many - a single JPIS can query multiple JPPS's and vice versa, a single JPPS is ready to feed multiple JPIS's.

The query conditions restrict the dataset in terms of the matching job records. Similarly, the query specifies a set of attributes to retrieve, reducing also considerably the amount of retrieved data per each matching job.

Index Servers query JPPS in two modes: in batch mode JPIS is populated with all JPPS content matching the query. In this way JPIS can be created from scratch, despite the ope-ration is rather heavy for JPPS. On the other hand, the incremental mode allows JPIS to subscribe with JPPS to receive new matching records as well as updates to already stored records whenever new data arrive to JPPS. This mode allows existing JPIS to be kept up to date while it is still lightweight for JPPS.

Finally, the described layer of JPIS's needn't be the only one. The architecture (despite not the current implementation) allows building another layer of JPIS's with many-to-many relationship with the previous layer instances, combining their data, providing support for other specific user queries etc.

4.3   Prototype implementation

The JP prototype implementation6 follows the described architecture.

4.3.1   Primary Storage

A single instance of JPPS, shown in Figure, is formed by a front-end, exposing its operations via a web-service interface 7 and a back-end, responsible for actual data storage and providing the bulk file transfer interface. The front-end metadata (the primary attributes for each job, authorization information, and JPIS subscription data) are stored in a relational database (currently MySQL).

The back-end uses Globus gridftp server enriched with authorization callbacks accessing the same database to check whether a user is allowed to upload or retrieve a given file. Both the front- and back-ends share a filesystem so that the file-type plugins linked into the front-end access their files via POSIX I/O.

JPPS operations fall into the following categories:

Primary Storage covers the first set of requirements specified for a Job Provenance (see Section), that is, storing a compact job record, allowing the user to add annotations, and providing elementary access to the data.

4.3.2   Index Server

Index Servers are created, configured, and populated semi-dynamically according to particular user community needs. The configuration is formed by:

The set of attributes and the conditions specify the set of data that is retrieved from JPPS, and they reflect the assumed pattern of user queries.

The current JPIS implementation keeps the data also in a MySQL database. Its schema is flexible, reflecting the server configuration (columns are created to hold particular attribute value, as well as indices).

The JPIS interface operations fall into the following categories:

4.3.3   Scalability and deployment

Having evaluated a random sample of L&B data on approx. 1000 jobs, we claim that the usual size of a complete L&B data dump varies from 2 kB to 100 kB, with very rare exceptions (less than 1 %) of sizes up to 5 MB. However, these plain text files contain repeating patterns and they can be compressed with the ratio of 1:4-1:20 in the typical cases and even higher for the large files. Therefore the assumption of 10 kB compressed L&B dump per job is a fairly safe upper limit. Unfortunately, we were not able to do a similar assessment for job sandbox sizes. Expecting the sandboxes to contain only miscellaneous input files we work with the hypothesis of 100 kB-1 MB sandbox size.

The current statistics for the entire infrastructure of the EGEE project report the rate of 20,000 jobs per day, while the middleware performance challenges aim at one million jobs per day.

job rate \ size 10 kB L&B 100 kB sandbox 1 MB sandbox
current 20 k/day 73 GB/year 730 GB/year 7.3 TB/year
challenge 1M/day 3.6 TB/year (36 TB/year) (360 TB/year)

Table 3: Expected aggregate storage size (whole EGEE)

job rate \ size 10 kB L&B 100 kB sandbox 1 MB sandbox
current 20 k/day 2.3 kB/s 23 kB/s 230 kB/s
challenge 1 M/day 115 kB/s (1.15 MB/s) (11.5 MB/s)

Table 4: Expected aggregate incoming data rate (whole EGEE)

Table and Table use the discussed numbers to derive the per-year storage size and per-second incoming data rate requirements on Job Provenance. The sandbox numbers for the 1 M job challenge are shown in parentheses because of being rather hypothetical - in order to achieve the required job throughput at WMS side, the jobs must be submitted in fairly large collections (chunks of approx. 100-10,000 individual jobs) that share a single input sandbox8. Therefore the real aggregate storage and throughput requirements on JP can be reduced by the factor of at least 100.

Despite these figures are aggregate for the whole huge EGEE infrastructure, they clearly show that the requirements could be met even with a single reasonably sized server.

JP is designed to support many-to-many relationship of JPPS and JPIS instances. Therefore there are no strict design requirements on the number and structure of installations. However, for practical reasons (some emerge from Section), it is desirable to keep just small number of well known JPPS's permanent services. Typical setup can be one JP per a larger virtual organization, or even one JP shared by several smaller ones. The outlined numbers show that this approach should not face technical limits. On the other hand, JPIS's are expected to be set up and configured semi-dynamically, according to the varying needs of even small user communities.

4.4   Use patterns

4.4.1   Storing data

Propagation of data from other middleware components to JP is done transparently. The user may affect it indirectly by specifying the destination JPPS and gathered data extent (for example whether to store the job to JP at all, or which sandbox files to keep) via parameters in the job description. These settings may be overridden by the WMS or CE policies.

The user stores data to JP directly when recording annotations (Section and Section).

4.4.2   Single job processing

When full information on a particular job is required (for example for the job re-execution), the JPPS instance which keeps the job data must be contacted. If it is known (for example the only JPPS serving particular VO), the data retrieval is straightforward using JPPS interfaces, as the jobid is the primary key to access JPPS data.

However, if JPPS for the job is not known, it must be looked up using JPIS query. Depending on the amount of the user's knowledge of the job details with respect to JPIS configurations (for example JPIS configured to request information on jobs submitted in a certain time interval is aware of the user's job only if its submission time falls into this interval) it may be necessary to query multiple JPIS's to find the particular job.

4.4.3   Job information retrieval

Besides preserving the job data the principal purpose of the JP is to provide job information according to some criteria, freeing the user of the burden to keep complete records on her jobs.

As discussed in Section the searches cannot be served directly by the JPPS. Therefore, the search must be done with querying a particular JPIS which configuration matches the user query:

Again, it may be necessary to query multiple JPIS's and concatenate the partial results. Currently we do not address the potentially non-trivial problem of finding suitable JPIS's. It falls out of the scope of the JP level, and should be preferably solved at the service discovery level.

4.5   Security

The data stored in the JP are in fact potentially more sensitive as they also include information about the inputs and are kept for eternity. All the interaction between components is authenticated and only encrypted channels are used to transfer data. The basic security model is inherited from the L&B, thus user and server certificates are used for encryption and TLS is used for channel encryption.

The data in JPPS and JPIS are not encrypted as this would create a problem with permanent depository of encryption keys. We also cannot use users' public keys to encrypt the data as this would complicate sharing and also endanger the data in case of private key loss. Instead, with a very limited number of JPPS's deployed, we trust the JP servers. Each JPPS keeps list of authorized components (L&B, Resource Brokers, ...) that are allowed to upload data to the JP server.

The sensitive nature of data requires also strong authorization support. While currently only implicit ACLs (only the job owner has access to the data) are supported, we plan to use the VOMS based authorization service to provide a fine grained (at the user/group level) of authorization control. In the same way as in the L&B, users will be able to specify who is authorized to access the data stored in the JP. In the current model, we plan to support read-only sharing, the annotations should be always stored by the job owner only. However, a way to transfer ownership of the data to another person must be also developed, to cover employees leave and even a death.

5   Summary and future work

We have presented a job centric monitoring framework suitable for tracking jobs in a large scale Grid. Information from many different Grid middleware components, generated in form of events is collected in a central database based system and made available to users in a uniform way. To support scalability and robustness, the asynchronous delivery model with potential out of order delivery of individual events is used.

The state of each job is derived from the arriving events using a job state machine. The event processing is robust and is adapted to deal with delayed and lost messages as well as with the out of order event delivery. The execution tree model is used to support cycles in the job life state diagram (for example caused by job re-submission), increasing even the robustness of the approach. Actual job state is available as a result for specific query or user can register to receive notification about job state changes. As the information is stored in a database, potentially full SQL querying power is available.

Apart from the information about live jobs, a provenance of job related data has also been developed. It stores a copy of events collected during the job lifetime together with job input files and information about the environment the job run in. This data can be used to re-construct the running environment and re-submit the job. They are also very important as a source for Grid performance evaluation and different statistical purposes.

As data about jobs and their runs are potentially highly sensitive, we use a strong security model. All the network passing communication is mutually authenticated and the transport channels are encrypted. Support for fine grain user controlled authorization based on VOMS is available, too.

The EGEE gLite Logging and Bookkeeping and the Job Provenance services are actual implementations of the job centric monitoring approach presented in this report. The L&B is extensively used in the EGEE production Grid with currently more than 50 installations world-wide. The JP is in an experimental deployment.

The development will continue with the main focus on the following five areas:

6   Appendix A. L&B Event Types

Transfer:

Start, success, or failure of job transfer to another component.

Accepted:

Accepting job (successful counterpart to Transfer).

Refused:

Refusing job (unsuccessful counterpart to Transfer).

EnQueued:

The job has been enqueued in an inter-component queue.

DeQueued:

The job has been dequeued from an inter-component queue.

HelperCall:

Helper component is called.

HelperReturn:

Helper component is returning the control.

Running:

Job wrapper started.

Resubmission:

Result of resubmission decision.

Done:

Execution terminated (normally or abnormally).

Cancel:

Cancel operation has been attempted on the job.

Abort:

Job aborted by system.

Clear:

Job cleared, output sandbox removed

Purge:

Job is purged from bookkeeping server.

Match:

Matching CE found.

Pending:

No matching CE found yet.

RegJob:

New job registration.

Chkpt:

Application-specific checkpoint record.

Listener:

Listening network port for interactive control.

CurDescr:

Current state of job processing (optional event).

UserTag:

User tag - arbitrary name=value pair.

ChangeACL:

Management of ACL stored on bookkeeping server.

Notification:

Management of notification service.

ResourceUsage:

Resource (CPU, memory etc.) consumption.

ReallyRunning:

User payload started.

Suspend:

Job execution (queuing) was suspended.

Resume:

Job execution (queuing) was resumed.

7   Appendix B. L&B Job States

Submitted:

Entered by the user to the User Interface or registered by Job Partitioner.

Waiting:

Accepted by WMS, waiting for resource allocation.

Ready:

Matching resources found.

Scheduled:

Accepted by LRMS queue.

Running:

Executable is running.

Done:

Execution finished, output is available.

Cleared:

Output transfered back to user and freed.

Aborted:

Aborted by system (at any stage).

Canceled:

Canceled by user.

Unknown:

Status cannot be determined.

Purged:

Job has been purged from bookkeeping server (for LB->RGMA interface).

References

[AD97] Abela J., Debeaupuis T.: Universal Format for Logger Messages. Internet Draft, IETF, 1997.
[Alf05] Alfieri R. et al.: From gridmap-file to VOMS: managing authorization in a Grid environment. Future Generation Computer Systems pp. 549-558, 2005.
[Ali04] Ali A. et al: Job Monitoring in an Interactive Grid Analysis Environment. Computing in High Energy Physics, 2004.
[Alp05] Alpdemir M. N. et al.: Contextualised Workflow Execution in myGrid. In: Proc. European Grid Conference, Springer-Verlag, LNCS 3470, pp. 444-453, 2005.
[ASV03] Andreozzi S., Sgaravatto M., Vistoli C.: Sharing a conceptual model of Grid resources and services. In: Proc. Conference for Computing in High Energy and Nuclear Physics (CHEP03), 2003.
[Ave04] Avellino G. et al.: The DataGrid Workload Management System: Challenges and Results. Journal of Grid Computing 2(4): 353-367, 2004.
[BS03] Baker M. A., Smith G. C.: GridRM: an extensible resource monitoring system. In: Proc. of the IEEE International Cluster Computing Conference. pp. 207-214, 2003.
[Bal01] Balaton Z. et al.: From Cluster Monitoring to Grid Monitoring Based on GRM. In: In proceedings 7th EuroPar2001 Parallel Processings, Manchester, UK. pp. 874-881, 2001.
[BG03] Balaton Z., Gombas G.: Resource and Job Monitoring in the Grid. In: Proc. of the Euro-Par 2003 International Conference, Klagenfurt, 2003.
[Bon03] Bonacorsi D. et al.: Scalability Tests of R-GMA Based Grid Job Monitoring System for CMS Monte Carlo Data Production. IEEE NSS Conference, Oregon, USA, 2003.
[Car01] Carzaniga A. et al.: Design and evaluation of a wide-area event notification service. ACM Transactions on Computer Systems 19(3): 332-383, 2001.
[Cec05] Ceccanti A. et al.: Towards Scalable and Interoperable Grid Monitoring Infrastructure. In: Proc. First CoreGRID Integration Workshop. pp. 10-18, 2005.
[Che05] Chen L. et al.: A proof of concept: Provenance in a Service Oriented Architecture. In: Proceedings of the fourth UK e-Science All Hands Meeting, Nottingham, UK, 2005.
[Coo03] Cooke A. et al.: R-GMA: An information integration system for grid monitoring. In: Proc. of the 11th International Conference on Cooperative Information Systems, 2003.
[Cor03] Cornwall L. et al.: EU DataGrid and GridPP authorization and access control. In: Proc. UK e-Science All Hands Meeting. pp. 382-384, 2003.
[Cza88] Czajkowski K. et al.: A resource management architecture for metacomputing systems. In: Proc. IPPS/SPDP Workshop on Job Scheduling Strategies for Parallel Processing. pp. 62-82, 1988.
[Cza01] Czajkowski K.: Grid Information Services for Distributed Resource Sharing. In: Proc. Tenth IEEE International Symposium on High-Performance Distributed Computing (HPDC-10), 2001.
[DA99] Dierks T., Allen C.: The TLS Protocol Version 1.0. RFC 2246, IETF, 1999.
[EGE04] EGEE Design Team: Design of the EGEE Middleware Grid Services.
EU deliverable DJRA1.2, CERN, 2004, available online.
[Fis01] Fisher S.: Relational Model for Information and Monitoring. Technical Report GWD-Perf-7-1, GGF, 2001.
[Fos05] Foster I.: Globus Toolkit Version 4: Software for Service-Oriented Systems. In: IFIP International Conference on Network and Parallel Computing. Springer-Verlag, LNCS 3779, pp. 2-13, 2005.
[Ger04] Gerndt M. et al.: Performance Tools for the Grid: State of the Art and Future. Tech. Rep. Lehrstuhl fuer Rechnertechnik und Rechnerorganisation, Technische Universitaet Muenchen (LRR-TUM), 2004. Available online.
[HT96] Henderson R., Tweten D.: Portable Batch System: External Reference Specification. NASA, Ames Research Center, 1996.
[JSC01] Jackson D., Snell Q., Clement M.: Core algorithms of the Maui scheduler. In: Proc. 7th Workshop on Job Scheduling Strategies for Parallel Processing, 2001.
[Kou04] Kouřil D. et al.: Distributed Tracking, Storage, and Re-use of Job State Information on the Grid. In: Computing in High Energy and Nuclear Physics (CHEP04), 2004.
[Kra05] Krajíček et al.: Capability Languages in C-GMA. In: Proc. Cracow Grid Workshop (CGW05), 2005.
[Lau04] Laure E.: Middleware for the next generation Grid infrastructure. In: Computing in High Energy Physics and Nuclear Physics (CHEP 2004), 2004.
[Lin00] Linn J.: Generic Security Service Application Program Interface Version 2, Update 1. RFC 2743, IETF, 2000.
[LLM88] Litzkow M., Livny M., Mutka M.: Condor - A Hunter of Idle Workstations. In: Proc. 8th International Conference of Distributed Computing Systems, 1988.
[MCC04] Massie M., Chun B., Culler D.: Ganglia Distributed Monitoring System: Design, Implementation, and Experience. Parallel Computing 30: 817-840, 2004.
[New03] Newman, H. B. et al.: MonALISA: a distributed monitoring service architecture. In: Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, CA, 2003.
[OAS03] OASIS: eXtensible Access Control Markup Language (XACML), Version 1.0. 2003.
[RLS98] Raman R., Livny M., Solomon M.: Matchmaking: Distributed Resource Management for High Throughput Computing. In: Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, 1998.
[Rib98] Ribler R. et al.: Autopilot: adaptive control of distributed applications. In: Proceedings of the Seventh IEEE Symposium on High-Performance Distributed Computing. pp. 172-179, 1998.
[Smi02] Smith W.: A System for Monitoring and Management of Computational Grids. In: Proc. 2002 International Conference on Parallel Processing, 2002.
[SGQ01] Smith W., Gunther D., Quesnel D.: A simple XML producer-consumer protocol. Grid Working Document GWD-Perf-8-2, Global Grid Forum, Performance Working Group, 2001.
[Str05] Streit A. et al.: UNICORE - From Project Results to Production Grids. Grid Computing : New Frontiers of High Performance Computing, pp. 357-376, 2005.
[TTL05] Thain D., Tannenbaum T., Livny M.: Distributed computing in practice: the Condor experience. Concurrency - Practice and Experience 17(2-4): 323-356, 2005.
[Tie01] Tierney B.: A Grid Monitoring Service Architecture. Global Grid Forum Performance Working Group, 2001.
[TG03] Tierney B., Gunter D.: NetLogger: a toolkit for distributed system performance tuning and debugging. In: Proceedings of the IFIP/IEEE Eighth International Symposium on Integrated Network Management (IM 2003). pp. 97-100, 2003.
[TGX05] Townend P., Groth P., Xu J.: A provenance-aware weighted fault tolerance scheme for service-based applications. In: Proc. 8th IEEE International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC 2005), 2005.
[Tue04] Tuecke S. et al.: Internet X.509 Public Key Infrastructure (PKI) proxy certificate profile. RFC 3820, 2004.
[WSH99] Wolski R., Spring N., Hayes J.: The network weather service: a distributed resource performance forecasting service for metacomputing. J. Future Generation Comput. Syst. 15(5/6): 757-768, 1999.
[ZS05] Zanikolas S., Sakellariou R.: A taxonomy of grid monitoring systems. Future Gener. Comput. Syst. 21(1): 163-188, 2005.
[ZFS03] Zhang X., Freschl J. L., Schopf J. M.: A Performance Study of Monitoring and Information Services for Distributed Systems. In: 12th IEEE International Symposium on High Performance Distributed Computing (HPDC-12 03), 2003.
[Zho92] Zhou S.: LSF: Load sharing in large-scale heterogenous distributed systems. In: Proceedings of the Workshop on Cluster Computing, 1992.

 

Footnotes:

  1. The systems presented in this report were developed within the EU EGEE project (INFSO-RI-508833) whose financial support is highly acknowledged.

  2. In this setup logger also serves as an application proxy, overcoming networking issues like private address space of the worker nodes, blocked outbound connectivity etc.

  3. The current implementation enforces specifying an actual jobid in the subscription hence the matching has minimal performance impact.

  4. It seems to make sense only for the annotations, not the middleware data, and the current implementation makes this restriction. However, it can be relaxed without principal impact.

  5. The current implementation supports gsiftp:// only but other protocols can be easily added.

  6. a part of the gLite middleware 3.x

  7. Described in detail in [EGE04], documented web service definitions can be found at our WSDL page

  8. This statement is based on informal discussions with WMS developers. The targeted WMS instance throughput at the time of this manuscript preparation was 1 sandbox-less job per second, that is 84.6 k such jobs per day (achieving the 1 M job rate with WMS clustering only), while the sandbox handling overhead is considered to be unsustainable at this rate.

další weby:fond rozvojemetacentrumCzechLightpřenosyvideoservereduroameduID.cz