Untitled

There’s a really cool site, wikipedia.org, that I don’t know how I went so long without stumbling over. The concept is really neat: it’s a publicly maintained encyclopedia, and the information is remarkably good. I also understand that the idea works well as a business tool for storing information.

Also, if anyone wants to call me, I don’t bring my phone with me to school, so it is best to call after 8 EST.

I thought I might post my paper on Grids seeing as how I spent some time and actually produced something.

An Examination of Grids

Due: November 13, 2004

Assignment 2

Written By:

Matthew Warren – mfwarren

Felix Motzoi – fmotzoi

Table of Contents

Part One: Developing the case for Grids.

      1. What is a Grid.

      2. Who will use the Grid?

      3. Benefits of the Grid.

Part Two: The Grid Middleware.

      1. Globus Toolkit.

      2. Open Grid Services Architecture.

      3. General Design

        1. User Applications, and tools

        2. Collective Services

        3. Resources

        4. Connectivity

        5. Fabric

Part Three: Building a Grid.

      1. Resources and Service management.

      2. The Grid Interface

      3. Reliability

      4. Monitoring

      5. Security

      6. Data Management.

      7. Programming the Grid.

        1. RPC

        2. Task parallelism

        3. Message Passing

        4. Java

Part One: the Case for Grids.

What is a Grid?

The term “the Grid” was coined in the mid 1990s to denote a proposed distributed computing infrastructure for advanced science and engineering [1]. Since that time considerable effort has been directed at realizing this vision. In particular, companies like IBM have been developing the Globus Toolkit and have even gone so far as to make Grids a major focus of their business [2].

In the Internet as it exists today, communication is handled through a collection of standard protocols. These protocols allow any user to search for any publicly available information. The Internet has proved to be a very valuable resource, but it has limitations. Suppose, for instance, that two organizations want to work together on a project. The Internet would provide a way to transfer files between the two sites. With a little more infrastructure, in the form of servers, both organizations could connect to the same point and work together on the project. Now suppose we also want to share CPU resources, different hardware resources, applications, and large databases, and we want to do all of this transparently, with automatic resource allocation, security, and even billing for services rendered. Finally, say that we want to be able to arbitrarily allow new partners into this “Virtual Organization.” We quickly realize that implementing this with just the existing Internet protocols and standards would be exceedingly difficult.

The Grid is a powerful concept in the sense that what it attempts to accomplish is the virtualization of computer resources and services. Furthermore, the Grid vision is to do this over the vastness of the Internet on heterogeneous systems. This virtualization is a very important aspect of the Grid: it creates an opportunity for organizations to evaluate other methods of cutting infrastructure costs or creating new revenue streams. The Grid will allow a business to outsource computation at peak times rather than owning the resources to handle this itself. On the other hand, if an organization has more computing resources than it requires, the extra CPU cycles could be sold (potentially with restrictions on who may buy them, how much, and which resources).

Another question arises in the discussion of what is a Grid. How is a Grid different from a Cluster? A cluster is usually a small collection of homogeneous systems connected on the same network. Many of the problems addressed by clusters are similar to the ones for Grids, such as consistency, distribution of resources and communication. But the Grid is much broader. Indeed the Grid may include one or more clusters, as resources being shared.

Who will use the Grid?

The first people using grids are in the scientific and computational fields. These groups generally have large data stores and many computationally expensive experiments that they would like to be able to run. Only the lucky few have access to the computing power they would like to have, so the ability to locate remote computation services on the Grid is used extensively by this community. With a Grid infrastructure, scientists can avoid having to ship their data and programs on tapes to another site for computation, or having to locate a site that is willing to provide services over the Internet. The Grid allows for an abstraction so that the researcher doesn’t need to know how or where his requests are fulfilled. A Grid will reduce the barrier to entry for scientists and researchers who want to do a simulation or computation but don’t have the resources to pay for their own supercomputer. Furthermore, a grid may be able to provide more robust software that can work with a broader range of sensors and scientific equipment.

Another segment that will be interested in the possibilities that grids offer is corporations. Grids allow a corporation to intimately share resources with partners and competitors. Through the use of rules governing the availability of specific resources, a company could allow a contractor to use its simulator while denying that resource to competitors. Grids may also allow new types of businesses to start up around data stores, where interested businesses may request use of the data to perform data-mining operations. With several data warehouses available, correlations between events could be discovered more quickly than is now possible.

The ability of the Grid to unite various global institutions, such as weather networks, will create higher quality information. Sensor readings from around the world could be compared to gain a global perspective on meteorological events. That in turn could allow for more accurate predictions.

Finally, consumers will benefit from a grid architecture because of the transparency it provides. The added services provided by a grid middleware will allow innovative and revolutionary new applications to be designed that simplify data sharing and communication.



Benefits of the Grid

A grid will provide a more efficient and more powerful way for people to deal with computers. How? First, by providing on-demand access to resources: a company will no longer have to purchase all the hardware it needs, but will be able to outsource it. Second, by making it easy for two companies to work together and share their resources and services. Third, by virtualizing computers and their resources and services, which makes it very easy to create new application-level services that are platform independent, operating system independent, and location independent. Virtualization has proved very beneficial in the past (e.g., virtual memory, virtual machines, and virtual file systems).

Part Two: The Grid Middleware

Globus Toolkit

The open source Globus Toolkit version 2 (GT2) became the de facto standard for implementing Grid services after around 1997 [3]. With a focus on interoperability and usability, GT2 allowed for the deployment of thousands of Grids worldwide. It provides solutions to common problems such as authentication, resource discovery, and resource access. With these in place, GT2 allowed Grid applications to be developed very rapidly. While some elements of GT2, such as the GridFTP protocol [4], were reviewed by standards bodies, most of the GT2 standards were not subject to review.

Development of the Globus Toolkit is done by the Globus Alliance (www.globus.org), a collection of corporations and learning institutions that work on the research and development of Grids. The alliance produces open-source software that is central to science and engineering activities totalling nearly a half-billion dollars internationally and is the substrate for significant Grid products offered by leading IT companies [11]. The toolkit is currently in version 3, with version 4 slated to be released in January 2005 [10].

Open Grid Services Architecture

In 2002 the Open Grid Services Architecture (OGSA) emerged [5]. This was a community-developed standard that allowed for multiple implementations, including GT 3.0, which was released in 2003. The OGSA developed and extended the work done on GT2. It provides a framework within which one can develop a range of interoperable and portable services. The open nature of the standards allows competitors into the market while still allowing for interoperability.

General Design Standards

The basic stack of services that the Grid provides can be thought of as being:

      1. User Application and tools

      2. Collective Services

      3. Resources

      4. Connectivity

      5. Fabric

This Grid architecture is based on the “hourglass model” [6], where the narrow neck defines a small set of core protocols (in this case including TCP and HTTP) that works with a large number of applications at the top and a wide variety of hardware at the bottom.

User Applications (Distributed clients):

Applications are built on a Grid system by calling and using the services implemented by the other parts of the Grid infrastructure. These include name services, resource discovery, and data access, among others. Of course, these applications may also use various other libraries and technologies that are not exclusive to the Grid, such as CORBA and Web Services.

Collective Services (Look-up servers):

These are services that are not directly associated with any one resource, but instead with many. Some examples include:

    • Naming – performs a name translation so that resources can be given abstract names.

    • Directory services – provides the ability to find resources that are available.

    • Scheduling – takes care of the allocation of resources to users

    • Monitoring and diagnostics – detects hacking attempts and resource failures.

    • Performance tweaking – determines whether replication or caching could be used to improve performance.
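
The first three services in the list above can be illustrated with a minimal sketch. Everything here is hypothetical: the class names, the record fields, and the "most free capacity wins" scheduling rule are assumptions made for illustration, not part of the Globus Toolkit or of any Grid standard.

```python
from dataclasses import dataclass


@dataclass
class ResourceRecord:
    """One entry in a hypothetical directory service."""
    address: str      # concrete location, hidden behind the abstract name
    kind: str         # e.g. "cpu" or "storage"
    free_units: int   # spare capacity, in whatever unit suits the kind


class Directory:
    """Toy collective service: naming, directory lookup, and scheduling."""

    def __init__(self):
        self._records = {}  # abstract name -> ResourceRecord

    def register(self, name, record):
        self._records[name] = record

    def resolve(self, name):
        """Naming: translate an abstract name into a concrete address."""
        return self._records[name].address

    def find(self, kind):
        """Directory service: list the names of resources of one kind."""
        return sorted(n for n, r in self._records.items() if r.kind == kind)

    def schedule(self, kind, units):
        """Scheduling: reserve capacity on the resource with the most room.

        Returns the chosen resource name, or None if nothing fits.
        """
        candidates = [(r.free_units, n) for n, r in self._records.items()
                      if r.kind == kind and r.free_units >= units]
        if not candidates:
            return None
        _, name = max(candidates)
        self._records[name].free_units -= units
        return name
```

A client only ever handles abstract names such as "cpu-b"; the address behind a name can change without the client noticing, which is the point of the naming service.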

Resources (Back-end servers):

This layer provides a way for a user to use remote services and resources. It is the responsibility of this layer to define protocols for initiation, monitoring, control, accounting, and billing. This layer talks directly with the Fabric layer to access the resources. The two basic jobs of a resource layer protocol are to provide information about a resource and to manage the resource. Typical resources include:

  • CPU cycles

  • Storage space

  • Network management (e.g. reserve bandwidth and enquiry)

  • Code indexing and history management (check-in, check-out)

  • Relational Databases
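
The two jobs of a resource layer protocol (information and management) might be sketched as an abstract interface with one concrete resource. The method names describe, allocate, and release are assumptions chosen for this example; they do not come from any Grid specification.

```python
from abc import ABC, abstractmethod


class GridResource(ABC):
    """Hypothetical resource-layer interface: one information call
    and a pair of management calls."""

    @abstractmethod
    def describe(self):
        """Information: return a dict describing the resource's state."""

    @abstractmethod
    def allocate(self, amount):
        """Management: reserve `amount` units; return True on success."""

    @abstractmethod
    def release(self, amount):
        """Management: hand back `amount` previously reserved units."""


class StorageSpace(GridResource):
    """Storage space measured in megabytes."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.used_mb = 0

    def describe(self):
        return {"kind": "storage", "free_mb": self.capacity_mb - self.used_mb}

    def allocate(self, amount):
        # Refuse requests that would exceed the capacity.
        if self.used_mb + amount > self.capacity_mb:
            return False
        self.used_mb += amount
        return True

    def release(self, amount):
        # Never let the usage counter go negative.
        self.used_mb = max(0, self.used_mb - amount)
```

CPU cycles or reserved bandwidth could implement the same interface with different units, so the layers above need only one way of talking to any resource.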

Connectivity (Transport/Network layer):

The connectivity layer defines a set of core protocols used for authentication and communication. Typically, communication is going to be provided by the standard TCP/IP stack and applications like DNS. Because of the difficulty of developing new forms of secure communication, existing standards will probably be used as much as possible. The typical features sought after in a Grid with regard to security are:

  • Single sign-on. For ease of use, this allows a user to use resources from many different domains without having to go through multiple login procedures.

  • Delegation. The ability for a user to allow an executing program to assume the access rights of the user that runs it.

  • Integration with local security solutions. Provides a seamless translation between the Grid and local environments.

  • User-based trust relationships. A user who uses multiple resources at once must be able to do so without requiring the administrative domains to interact.
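
Single sign-on and delegation can be illustrated with a deliberately simplified sketch. It swaps in HMAC-signed tokens with a single shared demo key, which is an assumption made only to keep the example short; a real Grid would rely on public-key certificates so that domains do not have to share a secret. All function names are hypothetical.

```python
import hashlib
import hmac

SECRET = b"demo-key-shared-by-all-domains"  # assumption for the sketch only


def sign(payload):
    """Return a hex signature for the payload string."""
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def login(user):
    """Single sign-on: issue one token that every domain can verify."""
    return f"{user}|{sign(user)}"


def delegate(token, program):
    """Delegation: derive a token that lets `program` act for the user."""
    payload = f"{token}|{program}"
    return f"{payload}|{sign(payload)}"


def verify(token):
    """Return the user the token speaks for, or None if it is invalid."""
    parts = token.split("|")
    if len(parts) == 2:              # login token: user|signature
        user, sig = parts
        return user if hmac.compare_digest(sig, sign(user)) else None
    if len(parts) == 4:              # delegated: user|sig|program|sig2
        inner = "|".join(parts[:2])
        payload = "|".join(parts[:3])
        if hmac.compare_digest(parts[3], sign(payload)):
            return verify(inner)     # the inner login token must also hold
    return None
```

The user logs in once, and any resource that can run verify accepts either the login token or a delegated token derived from it. The sketch assumes user and program names do not contain the "|" separator.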

Fabric (Data Link/Physical layer):

The resources that are provided over the Grid need an interface and abstraction so that they can be used. Of course, directly implementing the ability for the grid to know how to interact with every piece of hardware would be overkill. Instead, the Fabric layer will interact with the local operating system to request use of the services it controls. This allows the Grid to take advantage of the virtualization taking place at the local level, so logical devices such as clusters can be used, which will in turn use their own internal protocols like NFS transparently to the Grid.
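
As a small illustration of the Fabric layer delegating to the local operating system, the sketch below asks the OS for the CPU count and free disk space through the Python standard library. The LocalFabric class and its method names are hypothetical; only os.cpu_count and shutil.disk_usage are real library calls.

```python
import os
import shutil


class LocalFabric:
    """Hypothetical fabric adapter: asks the local OS about its resources."""

    def __init__(self, path="."):
        self.path = path  # any path on the volume being offered to the grid

    def cpu_count(self):
        # os.cpu_count() may return None when the count cannot be determined.
        return os.cpu_count() or 1

    def free_bytes(self):
        # shutil.disk_usage reports total, used, and free space in bytes.
        return shutil.disk_usage(self.path).free

    def describe(self):
        """Summary that a resource-layer protocol could publish upward."""
        return {"cpus": self.cpu_count(), "free_bytes": self.free_bytes()}
```

Whether the path is a local disk or an NFS mount makes no difference to the caller, which mirrors the transparency described above.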

Part Three: Building a Grid

Resources and Services

The services and resources of the grid are accumulated as the network of available server computers grows. Though grids can be developed as sets of largely homogeneous intranets there is a large supply of independent heterogeneous servers worldwide. These resource servers must co-operate to provide transparency to the service layer through the use of the grid protocols. Thus, as new resource servers register onto collective servers, they must do so in a transparently homogeneous manner and provide meta-data to the repositories so as to distinguish between the abilities and fabric of the different resource servers. The degree to which the repository distinguishes between these servers is essential to the sustainability of the grid network. The issue is to increase flexibility by allowing similar computers to communicate without using a broker, or any sort of redundant converting. On the other hand, there is a limit to the degree of freedom for representing data since a degree of conformity is necessary for any future consistency, scalability, and reasonable development of the network.

Protocols need to be established for each of the diagnostic and resource retrieval processes. The specs for A2 [8] list the role of the service registry in identifying which services are available. There are, of course, other servers which identify resource servers by object naming, by CPU schedules, and by space and bandwidth availability. Each of these resources is listed in the corresponding collective server, along with any other types of resources available. The service registry is itself a sort of service (loosely speaking) in that it manages other servers. The grid defines a common protocol for each of these resource types as well as any diagnostic or meta-data for the resource. For instance, a space server could be cross-listed with how much space is available, the availability of the server, and the type of files which are stored on the server. Furthermore, as outlined earlier, fabric inconsistencies need to be sorted out at this level so that a consistent view can be passed up to the grid interface level. Even though a space server may have little in common with a CPU resource, a consistent hierarchy for diagnostic specifics needs to be established within the protocols so as to avoid their becoming too complicated or large.

Issues can appear relating to inter-resource management. Resources may need to be replicated between servers. Specific resource protocols must therefore be developed for each type of resource. However, this is not usually a big issue since most resources already have these sorts of management standards. For instance, SQL servers can cooperate according to an established database replication methodology. There are also heterogeneous resources which may need to coordinate their efforts. This could theoretically be done via the grid interface and service registry, which is often the case, but certain resources offer complementary usage and so need to have their diagnostic abilities tightly integrated. Specifically, a bandwidth server may need to be tightly linked to one or more storage servers so that space and bandwidth are considered together (i.e. whichever network/storage server pair has both resources freely available should be chosen).
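
The network/storage pairing rule can be sketched as a small selection function. The dictionary keys and the scoring rule (pick the pair whose scarcer resource has the most headroom) are assumptions for illustration; the text only requires that both resources be available.

```python
def pick_pair(pairs, need_mb, need_mbps):
    """Choose a network/storage pair that satisfies both requirements.

    `pairs` is a list of dicts with hypothetical keys "name", "free_mb",
    and "free_mbps". Both requirements must be positive numbers.
    Returns the chosen pair's name, or None if no pair qualifies.
    """
    best_name, best_score = None, 0.0
    for pair in pairs:
        # Headroom ratio of whichever resource is scarcer for this request.
        score = min(pair["free_mb"] / need_mb,
                    pair["free_mbps"] / need_mbps)
        # A score below 1.0 means at least one requirement is not met.
        if score >= 1.0 and score > best_score:
            best_name, best_score = pair["name"], score
    return best_name
```

Treating the two resources as a single score is what keeps a server with plenty of disk but a saturated link from being selected.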

Finally, a common connectivity language needs to be developed for all kinds of servers. As described in the connectivity chapter, remote procedure calls need to follow an established format and these calls need to be performed with the correct authority, independent of administrative layers of the grid or network domain.

The Client-to-Service Grid Interface

The client/service interface depends on a centralized service registry. Once the client has identified the type of service that it requires, it queries the registry via the grid interface protocol and then is free to connect to the collective service. This preliminary level is fairly straightforward, except for the issues surrounding a centralized registry. Many of the scalability issues disappear with replicated registries, and inconsistencies between them are rare and largely inconsequential. Therefore, the registry service is fairly straightforward to develop and scale.

Each collective service registers with the service registry according to a transparent standard. Similarly to the resource listing in each service, the registry lists services and encounters the same constraints. Without unnecessarily repeating these details, the constraints relate to fabric, connectivity, and other server-specific meta-data being recorded.

The primary difference between listing services and listing resources is that comparatively few diagnostic algorithms need to be implemented, other than the evident load-balancing. One other simplification is that services don’t usually offer tightly related abilities. Once two complementary resources have been identified, the collective services do not need to worry about being involved, since the details are worked out at a lower level. The collective service only reenters should a new resource server need to be chosen. It may however be necessary to replicate services as the network scales, and to have them share their findings.

Once a server has been assigned to the client, it is ready to perform resource queries. The server is summoned according to the protocol indicated by the service registry, and any fabric/connectivity implementation specifics are converted as necessary. The server responds according to the diagnostic requirements of the query and returns whichever resource server best fits the criteria, or a null response if no server is available. Though diagnostic knowledge is not typically returned (unless it is explicitly sought), the fabric and connectivity specifics of the server are passed along with the server’s identity. For grids which are especially consistent, less is specified about the resource’s abilities, the knowledge being implicit.
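
A resource query of this kind might look like the following sketch. The server record fields and the "most free capacity" criterion are assumptions; the text says only that the best-fitting server is returned together with its fabric and connectivity specifics, or a null response if none is available.

```python
def query(servers, kind, min_free):
    """Return the best-fitting resource server, or None if none qualifies.

    `servers` is a list of dicts with hypothetical keys "id", "kind",
    "up", "free", "fabric", and "protocol".
    """
    matches = [s for s in servers
               if s["kind"] == kind and s["up"] and s["free"] >= min_free]
    if not matches:
        return None  # the null response
    best = max(matches, key=lambda s: s["free"])
    # Diagnostic details such as the free capacity stay inside the service;
    # only the identity plus fabric/connectivity specifics go to the client.
    return {"id": best["id"],
            "fabric": best["fabric"],
            "protocol": best["protocol"]}
```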

Once a resource has been identified, the client is ready to interact with the end server.

Monitoring

Grid administrators should regularly perform queries to ensure things are working correctly. Inadequacies of the network can then be repaired by changing existing protocols or by introducing new service or resource servers to the network. Aspects that may be monitored for each resource include average usage, peak usage, availability, and failures. Services, resources, or sub-networks can be repaired or upgraded accordingly.
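
The four monitored quantities can be computed from periodic probe results, as in this sketch. The sample format (a usage fraction per probe, or None when the probe failed) is an assumption made for the example.

```python
from statistics import mean


def summarize(samples):
    """Summarize probe results for one resource.

    `samples` is a list of usage fractions between 0.0 and 1.0, with None
    standing in for a probe that got no answer.
    """
    answered = [s for s in samples if s is not None]
    return {
        "average_usage": mean(answered) if answered else None,
        "peak_usage": max(answered) if answered else None,
        # Fraction of probes that the resource answered.
        "availability": len(answered) / len(samples) if samples else None,
        "failures": len(samples) - len(answered),
    }
```

For example, the samples [0.25, 0.75, None, 0.5] summarize to an average usage of 0.5, a peak of 0.75, an availability of 0.75, and one failure.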

On the other hand, the transparency of the system and the sheer enormity make it very difficult to monitor everything. The large number of similar or identical services complicates any attempt at diagnosis, and the services must be able to cooperate to do so. Other aspects increase the difficulty such as geographic disparity, fabric and connectivity constraints, and independence of similar services.

Reliability

To increase the reliability of the grid it is necessary to replicate all the layers of the network. Replicating the service registry is the most important, followed by individual collective services, and finally the resource hosts. Of course, the more these layers are replicated, the more difficult it is to maintain network consistency, and the busier the network becomes (this happens anyway as the network scales). As pointed out earlier, the registry and collective services are easier to replicate since inconsistencies between replicas are largely inconsequential. Replicating resources is far more challenging and requires tighter protocols.
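
From the client's point of view, a replicated registry amounts to trying replicas until one answers. The sketch below models each replica as a plain callable and uses a hypothetical ReplicaDown exception to stand in for a network failure.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot be reached (hypothetical)."""


def lookup_with_failover(replicas, service):
    """Ask each registry replica in turn until one of them answers.

    `replicas` is a list of callables mapping a service name to an
    address. Returns the first answer, or None if every replica is down.
    """
    for replica in replicas:
        try:
            return replica(service)
        except ReplicaDown:
            continue  # try the next replica
    return None
```

Because the first replica to respond wins, a slightly stale replica can return an outdated address; that is the kind of inconsistency the registry is assumed to tolerate.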

Security

The primary security feature of the grid is the authentication of users and clients on the system. Once users have been authenticated, they are free to use different services which obtain identity data from the registry and pass it down to the resource hosts. It is necessary for servers to ensure users have the correct certificates in order to avoid misuse of the system.

The more challenging and rarer security necessity is allowing hosts to interact with each other. It is difficult for resources to access each other’s domain, specifically across heterogeneous networks, and so these factors must be implemented for each network type. It is also possible to allow for transparency at the service/resource layer by standardizing security details.

Data Management.

The implementation of how knowledge is stored depends largely on the type of resource being stored. Relational databases, file systems, naming hierarchies, or even distributed object models may be used. The details of how these are implemented should be transparent to the client machines to increase consistency. Specifically, the grid should provide access transparency for accessing resources on different resource servers, location transparency for locating resources on distant servers, failure and mobility transparency for failing resource servers (resources need to be replicated for this), and finally replication, performance, and concurrency transparency.

In order to maintain consistency of data, appropriate invocation semantics (maybe / at-least-once / at-most-once) should be established for resource accesses that fail or are interrupted. These semantics help ensure the data itself does not become corrupted.
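
The difference between at-least-once and at-most-once semantics shows up when a reply is lost and the client retries. In this sketch the lost replies are simulated by a counter, and the request-id cache is one common way (assumed here, not taken from the paper) to get at-most-once behaviour.

```python
class Counter:
    """A resource whose operation is not idempotent."""

    def __init__(self):
        self.value = 0
        self._seen = {}  # request id -> cached reply

    def increment(self):
        """Plain handler: every delivery of the request is executed."""
        self.value += 1
        return self.value

    def increment_once(self, request_id):
        """At-most-once handler: a repeated request id replays the old reply."""
        if request_id not in self._seen:
            self._seen[request_id] = self.increment()
        return self._seen[request_id]


def call_with_retry(handler, lost_replies):
    """At-least-once client: resend the request until a reply gets through.

    `lost_replies` is how many replies the simulated network drops.
    """
    while True:
        reply = handler()
        if lost_replies == 0:
            return reply
        lost_replies -= 1  # reply was lost, so the client retries
```

With two lost replies, retrying the plain handler leaves the counter at 3, while retrying increment_once with the same request id leaves it at 1. Under "maybe" semantics the client would not retry at all and would simply not know whether the increment happened.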

In case data does become corrupted, it is a good idea to have back-ups and replicas available to restore from.

Programming the Grid.

RPC/RMI

Most of the features of the grid can be implemented via remote procedure calls (RPC) and remote method invocations (RMI). Stubs and skeletons can be specified and later implemented for all of the major layers of the network. The language used is not very important, but maintaining a consistent data representation is essential.
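
The stub/skeleton split can be shown in-process, with JSON standing in for the wire format and a direct function call standing in for the network. All class names here are hypothetical, and a real skeleton would restrict which methods may be invoked.

```python
import json


class Skeleton:
    """Server side: unmarshal the request, call the real object,
    and marshal the reply."""

    def __init__(self, target):
        self._target = target

    def handle(self, request):
        call = json.loads(request)
        method = getattr(self._target, call["method"])
        return json.dumps({"result": method(*call["args"])})


class Stub:
    """Client side: looks like the remote object but only marshals calls."""

    def __init__(self, transport):
        self._transport = transport  # callable: request string -> reply string

    def __getattr__(self, method):
        # Invoked only for names not found on the stub itself,
        # i.e. for the remote object's methods.
        def remote_call(*args):
            request = json.dumps({"method": method, "args": args})
            return json.loads(self._transport(request))["result"]
        return remote_call


class Adder:
    """The real object living on the server."""

    def add(self, a, b):
        return a + b
```

Calling Stub(Skeleton(Adder()).handle).add(2, 3) goes through both marshalling steps and returns 5. Both sides agree on JSON as the data representation, so either could be rewritten in another language.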

Task parallelism

Concurrency can greatly improve the performance of many applications built on the grid. Horizontal layering of tasks can take advantage of all the usual distributed benefits. The vertical layering of the network enables a different kind of parallelism in that each layer only needs to worry about very specific abilities. Finally, horizontal replicas of services and resources can increase the performance of grid management tasks.
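
Horizontal task parallelism can be sketched with the standard library's thread pool, where each worker stands in for a remote resource server. The simulate function and the chunking of the input are assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate(chunk):
    """Stand-in for an expensive computation on one slice of the data."""
    return sum(x * x for x in chunk)


def run_parallel(chunks, workers=4):
    """Run `simulate` on every chunk concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(simulate, chunks))
```

For instance, run_parallel([[1, 2], [3], [4, 5]]) returns [5, 9, 41]. On a grid, each call to simulate would be dispatched to whichever resource server the scheduler selects.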

Message Passing

Though different resources may be implemented differently, either persistent or transient messaging is required to communicate between client and server. Transient messaging would be used for the most part due to the expected availability and reliability of servers. Persistent messaging could be used for servers to communicate with each other as they go online and offline, which may also be very useful to small footprint devices like PDAs and cellphones.
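
The contrast between transient and persistent messaging can be sketched with a toy in-memory broker. The Broker class and its methods are hypothetical; a real system would store persistent messages on disk rather than in a dictionary.

```python
from collections import defaultdict, deque


class Broker:
    """Toy message broker supporting transient and persistent delivery."""

    def __init__(self):
        self._online = set()
        self._mailboxes = defaultdict(deque)  # receiver -> queued messages

    def connect(self, name):
        self._online.add(name)

    def disconnect(self, name):
        self._online.discard(name)

    def send_transient(self, to, message):
        """Deliver only if the receiver is online right now; else drop."""
        if to not in self._online:
            return False
        self._mailboxes[to].append(message)
        return True

    def send_persistent(self, to, message):
        """Store the message until the receiver next comes online."""
        self._mailboxes[to].append(message)
        return True

    def receive(self, name):
        """Drain the mailbox of an online receiver."""
        if name not in self._online:
            return []
        box = self._mailboxes[name]
        messages = list(box)
        box.clear()
        return messages
```

A device that is offline most of the time, such as a PDA, loses transient messages but finds its persistent ones waiting when it reconnects.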

Of course, the choice between synchronous and asynchronous, and between connectionless and connection-oriented messaging, depends almost entirely on the type of message. Synchronous messaging would only be used in cases where the client needs to block and has nothing better to do, while connections are mostly useful for extended data transfers such as for storage or DB servers.

Java

Languages which already implement cross-platform consistency are very useful for a heterogeneous network. Many of the consistency issues discussed earlier can be circumvented by developing on top of the JVM for a relatively small decrease in performance.

[1] Foster, I. and Kesselman, C. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1999

[2] http://www-1.ibm.com/grid/pr_428.shtml

[3] Foster, I. and Kesselman, C. Globus: A metacomputing infrastructure toolkit. International Journal of Supercomputer Applications 11(2), 115-129, 1998

[4] Allcock, W., Bester, J., Bresnahan, J., Chervenak, A., Liming, L., and Tuecke, S. GridFTP: Protocol Extension to FTP for the Grid. Global Grid Forum, 2001

[5] Foster, I., Kesselman, C., Nick, J.M., and Tuecke, S. Grid services for distributed systems integration. IEEE Computer 35(6), 37-46, 2002

[6] Realizing the Information Future: The Internet and Beyond. National Academy Press, Washington, DC, 1994

[7] Foster, I., Kesselman, C., The Grid: Blueprint for a New Computing Infrastructure: Second Edition, Morgan Kaufmann Publishers, San Francisco, CA

[8] http://db.uwaterloo.ca/~tozsu/courses/cs454/f04/Assignments/Assignment2-F04.pdf

[9] Foster, I., Kesselman, C., and Tuecke, S. The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International J. Supercomputer Applications, 15(3), 2001. http://www.globus.org/research/papers/anatomy.pdf

[10] http://www-unix.globus.org/toolkit/docs/development/4.0-drafts/GT4Facts/index.html

[11] http://www.globus.org/about/default.asp