Managing Large WWW Sites

Vassilis Prevelakis
vp@unipi.gr
Network Management Centre
University of Piraeus, Greece

Abstract

The trend towards ever larger WWW sites with hundreds of thousands of pages maintained by teams of developers has made apparent the need for tools to manage such large-scale efforts. In this paper we present a mechanism for organising and manipulating groups of nodes and links in WWW sites. These constructs are used to create dynamic views of the data in a given site according to various organisational and presentational criteria. Moreover, proven techniques for the management of hypertext networks (e.g. versioning, variants, etc.) can be readily implemented through the use of this model. Additionally, the model allows the creation of user views through the combination of these constructs via a number of operations. After the description of our model we present a prototype system developed at the University of Geneva that demonstrates how the mechanisms included in our model can be integrated into an existing http server. We then examine how the same mechanisms can be employed in the management of large WWW sites.

Keywords: WWW, versioning, configuration management, parallel activities, perspectives.

1. Introduction

Unlike other hypertext systems where there exists a global scheme that guides the organisation of the material, the WWW is characterised by neighbourhoods: groups of pages that are authored and maintained by different people. This is the reason behind the striking diversity of the current Web.

Individuals, companies and organisations contribute their collections of pages to the global network thus forming the entity that we recognise as the WWW. These collections of pages, however, are not always small. With the entry of large companies and organisations into the WWW, the small neighbourhoods were joined by mega-sites containing thousands of pages. For example, the maintainers of the Microsoft site (www.microsoft.com) estimated that by the end of 1996 their site would host more than one million pages (Microsoft, 1996).

The shift from small to large collections of pages has predictably created the same problems that characterise the co-ordination of the work of large numbers of people and the management of the produced material. Large documentation projects and software development environments have already demonstrated these problems (Prevelakis, 1993) and provided a testing ground for the evaluation of various remedies (Berstein, 1990).

Additionally, as a hypertext system, the WWW shares the information retrieval problems that are so familiar to hypertext users, namely disorientation, serialisation of the information and searching (Halasz, 1988). These problems are exacerbated by the rigidity imposed on the Web by the hypertext links being embedded into the nodes. This makes the entire system inflexible and difficult to maintain (Car, 1994 and Fielding, 1994).

Another problem with the fixed structure is that it is quite difficult to produce alternative organisations of the data in various parts of the Web. In other words, if someone follows a link to a page stored at a remote site, then the links emanating from that page are those chosen by the person maintaining the remote page. It is not possible to add extra links (e.g., in order to provide a uniform interface) or to remove or redefine existing links (e.g., to create a guided tour) without copying the file locally and editing it.

Let us now examine the issues behind the creation and maintenance of the pages located in a large Web site.

1.1 Requirements for the management of large Web sites

Since the primary activity of a Web authoring environment is editing and refining of ideas, the system must provide adequate support for versioning. Versions can be used to track change in the system; thus, in cases where modifications introduce inconsistencies it is important to be able to know exactly what has changed and at what stage in the authoring process. Users should be able to use the versioning facilities to reverse the effects of actions such as deleting or modifying a page.

Systems such as the Workshop System (Clemm, 1988), and the SEPIA authoring system (Streitz, 1992), use this concept of activity oriented versioning. In most cases the user initiates a long transaction (Bernstein, 1987) with the intention of carrying out a specific task. During the execution of this transaction a number of files or objects may be changed. Since these changes have all been performed as part of a specific task (also referred to as a logical change) they are automatically grouped together. If the user backs out that logical change, then the system will undo all the individual physical changes without further user intervention.

Another requirement is configuration management, that is, the ability to work on variants of a basic layout. For example, we may wish to alter the displayed pages depending on user-supplied information, where we have a warm climate Web edition with advertisements for swim trunks and a cold climate edition with ads for fur coats (Nicol, 1995). Moreover, we may wish to consider the geographical location of the user that is sending a request for a page in our site. In this way we can provide information about the local distributor of our product or perhaps country-specific information such as availability or localisations.

We, therefore, require that given a specific configuration, the system should be able to extract the required information (both common and variant) thus constructing one new view of the information contained in the site.

The size of the large Web sites implies that authoring activities are based more often than not on teams of people rather than a single individual. Therefore, the authoring environment must provide the facilities that enable all members of the team to work without interfering with each other (non-interference). This implies the ability to create private workspaces and work in isolation while retaining the option of merging this work with that of the other members of the team. Individuals should be able to create their own customised views of the system and thus experiment with alternative collections of pages.

Annotations are also a valuable feature of an authoring environment both in a personal context when used only by the person who created them in the first place, and in a wider context when distributed to other members of the team. In fact, annotations may be even considered value-added information that enhance the rating of the original document (Roscheisen, 1995).

For annotations to be useful, two conditions must be satisfied. Firstly, users should have the ability to choose which annotations they wish to consult. Secondly, they should have control over whether the contents referred to by the annotations are allowed to change as this may invalidate the annotations. It is as if we open a book and find notes that we had scribbled in the margins but the text is no longer the same because the book has been edited a couple of times since the original annotations were added. Since most people have visual memories, even changes in the appearance of a page can be distracting. Imagine a reference book or dictionary that looked different each time the reader opened it.

An important factor in maintaining quality standards in any environment is the review process. When a new document is authored or an existing one modified, the changes must be reviewed by someone outside the development or authoring team (Brooks, 1975).

For a review process to be carried out effectively and efficiently, the reviewers must be able to see the material as it would appear in the production environment (view sharing). Further, they must be able to trace the person responsible for the material (attribution of changes) and finally, they must be able to submit comments or suggestions to the authors in a way that can be easily correlated with the reviewed material. In shared workspaces (Bentley, 1996) this kind of transfer of data from one member to the next can be effected without interfering with the work of other team members.

In this article we present a model for organising Web subnetworks that attempts to comply with the above mentioned requirements. The model is based on perspectives - graph structures which can be combined and operated on in various ways. In the next section we provide definitions for perspectives and perspective operations such as addition, masking, and projection. Section 3 outlines the architecture and implementation of a prototype system based on perspectives. Finally, section 4 discusses the role of perspectives in a Web authoring environment.

2. The perspective model and framework

Most hypertext systems assume that the quantum of information is a node with links expressing relationships between nodes. By contrast, in our model this quantum is a set of nodes and links called a perspective. Perspectives are used to group nodes containing related information. For example:

Viewpoints, arguments, ideas etc. that may stand on their own or be combined with other supporting material, e.g. examples, footnotes, or background information.
Combining the general with the specific. In producing a product brochure, we need to merge information on the product family with information unique to the particular product. By keeping the general information on one perspective and then using separate perspectives for each different product, we can obtain the brochure by combining the general perspective with the perspective specific to the product.

Nodes consist of attributes such as data and links; thus, the links inside a perspective are stored as part of the nodes of the perspective. Perspectives can be combined in various ways to form new perspectives via a number of operations that manipulate the nodes.

When two perspectives are combined, nodes from the first perspective overlay the nodes from the second. In this way an operation on two perspectives can be defined as a series of operations between pairs of nodes (one from each perspective). Note that perspective operations cannot involve pairs of nodes from the same perspective. In figure 1, the pairs are chosen by the topological placement of the nodes in the perspective plane, but in most other cases the matching is done on the basis of some unique node identification (e.g. node name, number etc.).

Figure 1: An overlay of perspectives

By changing the perspectives comprising a given overlay, we can create different views of the underlying hypertext (see figure 2). In the product brochure example mentioned earlier, the family information would be placed in the Base Perspective, while each of the Overlay Perspectives would contain information specific to a different product within the family.

The layering concept was introduced in a system called PIE (Goldstein, 1980). Although PIE was a software development environment for Smalltalk, it encompassed many features and characteristics that would later be associated with repositories (Bernstein, 1994).

Figure 2: Different overlays give different views

In this paper we will give brief descriptions of the operations that are relevant to our discussion, while formal definitions of perspectives and their operations may be found elsewhere (Prevelakis, 1996).

Addition operation: creates the union of the sets of attributes of each pair of nodes.

Figure 3: Addition Operation.

In figure 3 we can see that all the nodes and their attributes from the P₁ and P₂ perspectives were transferred to P₃. Since the links are maintained as attributes, they are carried over in the same way as the other node contents.
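The addition operation can be illustrated with a minimal sketch. The representation used here is an assumption made for illustration and is not taken from the prototype: a perspective is a dictionary from node identifiers to sets of (attribute name, value) pairs, links are ordinary attributes, and a node that appears in only one perspective is paired with an empty node. The names add_nodes and add are hypothetical.

```python
def add_nodes(a, b):
    """Addition on one pair of nodes: the union of their attribute sets."""
    return a | b


def add(p1, p2):
    """Addition of two perspectives, applied pair by pair.

    Assumption: a node present in only one perspective is paired with
    an empty node, so it is carried over unchanged.
    """
    empty = frozenset()
    return {
        node_id: add_nodes(p1.get(node_id, empty), p2.get(node_id, empty))
        for node_id in p1.keys() | p2.keys()
    }


# Links are stored as attributes, so they travel with the node contents.
p1 = {"intro": frozenset({("data", "Product family overview"), ("link", "specs")})}
p2 = {
    "intro": frozenset({("link", "pricing")}),
    "pricing": frozenset({("data", "Price list")}),
}
p3 = add(p1, p2)
```

In this sketch p3 contains both nodes, and the "intro" node carries the data attribute together with both links.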

Masking operation: nodes in the second (top) perspective replace nodes in the first (lower) perspective.

Unlike the addition operation, the masking operation allows the replacement of node contents. Removal of nodes and their attributes is achieved through the use of "empty" (or placeholder) nodes that have no attributes. Usually the user interface refrains from displaying these "empty" nodes so that their behaviour is similar to the "white-outs" used in the Sun TFS (Hendricks, 1988).

Figure 4: Masking Operation.
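As a rough sketch of the masking operation, the following assumes that a perspective is a dictionary from node identifiers to attribute sets, that a node absent from the top perspective lets the lower node show through, and that an empty attribute set plays the role of the "empty" placeholder node. The function names mask and visible are hypothetical.

```python
def mask(lower, top):
    """Masking: a node in the top perspective replaces the node with the
    same identifier in the lower perspective.

    Assumption: nodes absent from the top perspective show through from
    the lower one; an empty node in the top perspective hides the lower
    node.
    """
    result = dict(lower)
    result.update(top)
    return result


def visible(view):
    """Drop empty placeholder nodes, as a user interface might."""
    return {node_id: attrs for node_id, attrs in view.items() if attrs}


base = {
    "home": frozenset({("data", "Old welcome text")}),
    "draft": frozenset({("data", "Unfinished page")}),
    "about": frozenset({("data", "About us")}),
}
overlay = {
    "home": frozenset({("data", "New welcome text")}),  # replaces the node
    "draft": frozenset(),                               # empty node: removal
}
view = visible(mask(base, overlay))
```

Here the view keeps "about" from the base, shows the replaced "home" node and no longer contains "draft".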

Additional operations allow for the removal of nodes from the user view and for the selective manipulation of node contents on the basis of a user specified criterion. The latter operation is used for the processing of queries. There is also an operation similar to the masking operation that instead of replacing node contents, applies differential changes. In this way nodes can be changed rather than replaced.

An important characteristic of the definition of the perspective operations is that they apply to each pair of nodes independently. Thus, if we are only interested in one node we can evaluate the operation only for that node ignoring neighbouring nodes in the perspectives participating in the operation.

This property makes the processing of perspective operations highly parallelisable. Thus perspectives containing thousands of nodes can be stored across different servers and perspective expressions with hundreds of operations can be evaluated with negligible communications overheads between the servers (Prevelakis, 1996).

3. Prototype System

In order to experiment with the practical side of the perspective model and also to demonstrate the effectiveness of the model and framework in the context of the WWW, we implemented a prototype Web server. To speed up the development and simplify the design, the following decisions were made:

Use off-the-shelf browsers such as the Netscape Navigator.
Implement a stateless server. The server maintains information only on the nodes and perspectives that are known locally; there is no information about perspectives held on other servers. Additionally, user requests carry with them the required state in the form of a perspective expression.
Employ standard protocols where possible. Thus, http is used for the communications between the browser and the perspective server. However, a non-standard form of a URL has been introduced to submit perspective-related requests to the server. This URL form and the format of the submitted expressions are described in detail elsewhere (Prevelakis, 1997).

To use the enhanced features of the system, the reader must first select a view that will be used throughout the session. The desired view is defined by submitting an expression that describes how various perspectives that exist on the server will be combined to create the view. This expression is usually not visible to the user, since it is hidden within a special URL. Once the expression is submitted to the server, it is used for all subsequent requests, thus creating the illusion that the view is the entire document.

The user can designate the expression to be used for all the subsequent requests by selecting an existing expression, creating a new expression from scratch, or by combining existing expressions to create a new one. In most cases the first approach is used, since it is quite straightforward to create a page with a number of expressions as anchors that can generate the non-standard URLs when selected by the user (see figure 5).

Figure 5: Special URLs allow the selection of views.

4. The role of perspectives in a Web site management system

Given the definition of perspectives we can see that they can be used as a means of organising the data in an authoring system. Users can have their own, private, views of the system, while being able to request different organisations by selecting appropriate perspectives. Users can remove perspectives that contain information that is of no interest to them and concentrate on the perspectives that they need.

Similarly, we can restrict access to certain parts of the system by either refusing access to the perspective containing the information or by forcing the user to use a masking perspective that effectively removes the privileged information from the user view.

We can have a system where all the changes made during a long transaction are stored in one perspective. In this way users can 'back out' changes by simply removing the corresponding perspectives from their overlays.

Another advantage of a perspective-based organisation is that there is no need for locking of nodes or any kind of write protection. Read protection can be implemented via special masking overlays that the users must use to access the system.

In the rest of this section we will present three examples demonstrating how perspectives can be used to satisfy the key requirements we identified in the introduction: versioning, configuration management and parallel activities.

4.1 Versioning

Perspectives are rarely completely independent. We expect that users start with a base perspective P₁ and then make a series of changes and enhancements, all of which can be expressed by another perspective P₁'. They can then view only the changes they have made (P₁'), the result of the changes on P₁ (P₁ ^o P₁'), or both P₁ and the changes they have made to it (P₁ + P₁ ^o P₁'). For example, if P₁ is a software version, P₁' represents the changes, P₁ ^o P₁' is the new version and P₁ + P₁ ^o P₁' is the base version plus all the changes to get to the new version (see figure 6).

Figure 6: Versioning using differential perspectives

Note that each time we only need to keep track of the changes P₁', P₂', ... The rest can be computed, including arbitrary expressions on the perspectives.
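The idea of keeping only the change perspectives and computing the rest can be sketched as follows. This is an illustration under stated assumptions rather than the prototype's implementation: the operation that applies changes is approximated by plain replacement of whole nodes, node contents are simple strings, and the names overlay and view are hypothetical.

```python
from functools import reduce


def overlay(lower, top):
    """Stand-in for the change-applying operation: nodes in the top
    perspective replace nodes with the same identifier (assumption)."""
    return {**lower, **top}


def view(base, changes):
    """Compute a view from a base perspective and an ordered list of
    change perspectives, applied left to right."""
    return reduce(overlay, changes, base)


base = {"home": "Welcome", "news": "Old news"}
change1 = {"news": "Fresh news"}
change2 = {"home": "Welcome back", "contact": "Mail us"}

current = view(base, [change1, change2])

# Backing out a logical change: leave its perspective out and recompute.
without_change2 = view(base, [change1])
without_change1 = view(base, [change2])
```

Only base, change1 and change2 are stored. Every version, including one that omits an earlier change while keeping a later one, is derived on demand.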

4.2 Configuration Management

In a similar manner, configuration management operations can be performed with the help of perspectives. We can envisage a situation where there is a base perspective containing the country-independent pages of a product description, while the country-dependent parts are placed in individual perspectives. So starting from page version 3.0 we can construct the version 3 release 2 system for Canada using the expression P_v3.0 ^o P_fix1 ^o P_fix2 ^o P_Canada.

We may also use this technique to construct guided tours where, depending on the requirements of the user, different paths through the Web site may be constructed. In figure 7 we see that pages 4, 5 and 9 in the original tour were replaced. Such replacements may be achieved through the use of the masking operation, while the addition operation may be used to add material to the user view.

Figure 7: Create different configurations by mixing and matching pages

4.3 Parallel Activities

Let us take a small workgroup as an example. This team consists of authors, layout artists, reviewers, etc. All these people have to work on the same pages and make alterations as they go about their work. The objective of the system is to create a stable environment so that work done by one person is not immediately visible to the others. In this way, changes can be tested before they are released to the rest of the group. When the various members of the team are ready to share their pages or other work, the process of integrating the new elements with the existing data should be as painless as possible.

By combining long transactions with perspective construction, we have a new perspective only when the author is satisfied with the changes. Even then, the other members of the team may choose not to include this new perspective in their view of the system, thus ignoring the new changes. When everybody is happy with the modification they can all include it in their views. If, later on, there are second thoughts about the change, the perspective can be removed and the system tested without the offending changes.

Annotations may also be placed on separate perspectives so that users can choose whether to view the base text, the comments or both.

4.4 Scalability, distribution

Any mechanism that claims to be usable in large sites must be scalable. In other words, it must remain efficient across the whole range from sites containing a few hundred nodes to sites that contain hundreds of thousands. Scalability is inherent in the way the perspective operations were defined. If, for example, we specify a view that involves merging five perspectives with 10000 nodes each, then in order to retrieve a single node we would only need to perform at most four operations (one for each pair of perspectives). If, however, we wanted to generate every single node in the new view, we would, of course, have to perform thousands of operations. Since the Web is used by following links to individual nodes, the "lazy evaluation" of the nodes in the view is highly appropriate. At the end of the session when the view is destroyed, the system would have created only the few nodes requested by the user.
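The arithmetic of this example can be checked with a small sketch. It assumes that the five perspectives are combined by a left-to-right chain of masking operations, that node contents are plain strings, and that a missing node is represented by None. The names mask_node and lookup are hypothetical.

```python
def mask_node(lower, top):
    """Masking on one pair of nodes. None stands for 'this perspective
    has no such node' (assumption), in which case the lower node shows."""
    return lower if top is None else top


def lookup(node_id, perspectives):
    """Evaluate the chain for a single node only.

    Returns the resulting node and the number of pairwise operations
    that were needed to produce it.
    """
    node = perspectives[0].get(node_id)
    operations = 0
    for perspective in perspectives[1:]:
        node = mask_node(node, perspective.get(node_id))
        operations += 1
    return node, operations


# Five perspectives with 10000 nodes each.
layers = [
    {f"page{n}": f"layer {k}, page {n}" for n in range(10_000)}
    for k in range(5)
]

node, operations = lookup("page42", layers)
```

Retrieving "page42" takes four pairwise operations regardless of how many other nodes the perspectives contain. The remaining 9999 nodes of the view are never computed unless they are requested.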

Figure 8: Each WWW server is responsible for the nodes contained in the perspectives stored in that server

Another benefit of the way perspective expressions are evaluated is that we can have perspectives that span physical sites (see figure 8). In this case a user retrieves three nodes from three different servers using a single perspective expression. The servers themselves will not need to communicate with each other to satisfy the requests since each one requires information that is always available locally. Thus if the number of nodes in a given server grows beyond the available capacity, the web site may be partitioned between two servers without affecting the way the users view the site or creating communication overheads between the old and new servers.
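One way to picture such a partitioning is sketched below. It rests on an assumption that the text does not spell out: the site is split by node, so that every perspective's entry for a given node is stored on the same server. The expression is approximated by a left-to-right masking chain over named perspectives, and the names evaluate, partition and placement are hypothetical.

```python
def evaluate(perspectives, expression, node_id):
    """Evaluate a masking chain for one node using only the perspectives
    held in the given store (a whole site or a single server)."""
    node = None
    for name in expression:
        top = perspectives.get(name, {}).get(node_id)
        if top is not None:
            node = top
    return node


def partition(perspectives, placement, server_count):
    """Split every perspective by node: placement maps a node identifier
    to the index of the server that stores it."""
    servers = [{} for _ in range(server_count)]
    for name, nodes in perspectives.items():
        for node_id, content in nodes.items():
            servers[placement[node_id]].setdefault(name, {})[node_id] = content
    return servers


site = {
    "base": {"a": "A v1", "b": "B v1", "c": "C v1"},
    "fixes": {"b": "B v2"},
    "canada": {"c": "C for Canada"},
}
expression = ["base", "fixes", "canada"]
placement = {"a": 0, "b": 1, "c": 1}
servers = partition(site, placement, 2)

# Each node is resolved by its own server, without consulting the other.
distributed = {
    node_id: evaluate(servers[placement[node_id]], expression, node_id)
    for node_id in placement
}
single = {node_id: evaluate(site, expression, node_id) for node_id in placement}
```

Under this assumption the distributed and single-server results coincide, because each per-node evaluation touches only data stored alongside that node.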

5. Conclusion, future plans

In this paper we have presented perspectives as a mechanism for organising and manipulating groups of nodes and links in a WWW network. The description proceeded in three stages: a presentation of the model, an outline of the prototype system that is used as a testbed for the evaluation of perspectives, and a discussion of how the model copes with the authoring requirements.

We have outlined the use of perspectives: to provide alternative configurations for a WWW subnetwork, to serve as a container of changes in a versioning system, and to support the execution of parallel activities within the workgroup without any interference between members of the team. Our experience, so far, has indicated that the overlay mechanism is intuitive, yet powerful enough to satisfy all the requirements that we set out in the beginning. We, thus, believe that the perspective framework can provide an effective core for the management of large Web sites.

An important area where further work is needed is the evaluation of the perspective expressions. In particular we will be investigating the effects of caching on the performance of a perspective-based web server. We are also interested in transferring some of the work involving the handling of the perspective expressions to the clients through the use of Java-based applets.

Although allowing users to reconfigure their network is quite useful, the trend is to allow the system to do the reconfiguration on its own. On some systems the reconfiguration is carried out as a response to a query by the user (Watters, 1990) while on others the user supplies (or selects) the criterion and the system constructs a new perspective that arranges the nodes accordingly (Pintado, 1990).

Finally, in co-operation with the University of Piraeus, Greece, we hope to have a production system running by the middle of next year.

6. References

Bentley, Richard, Uwe Busbach, and Klaas Sikkel, "The Architecture of the BSCW Shared Workspace System," Proceedings of the ERCIM Conference on CSCW and the Web, Sankt Augustin, Germany (Feb 1996).
Bernstein, Philip A. and Umeshwar Dayal, "An Overview of Repository Technology," Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, Sep 1994, pp. 705-713.
Bernstein, Philip A., "Database System Support for Software Engineering," 9th International Conference on Software Engineering, pp. 166-178, Monterey, California (March 1987).
Brooks, Frederick P., Jr., "The Mythical Man-Month: Essays on Software Engineering," Addison Wesley (1975).
Car, Les, Wendy Hall, Hugh Davis, and Rupert Hollom, "The Microcosm Link Service and its Applications to the World Wide Web," Proceedings of the First International World-Wide Web Conference, pp. 25-34, Geneva, Switzerland (May 1994).
Clemm, Geoffrey, "The Workshop System - A Practical Knowledge-Based Software Environment," Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, ACM SIGPLAN Notices, 13, 5, pp. 55-64 (Nov 1988).
Fielding, Roy T., "Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web," Proceedings of the First International World-Wide Web Conference, pp. 147-156, Geneva, Switzerland (May 1994).
Goldstein, Ira P. and Daniel B. Bobrow, "Descriptions for a Programming Environment," Proceedings of the 1st Annual Conference of the National Association on Artificial Intelligence, Stanford, CA, Aug 1980, pp. 187-194.
Halasz, Frank, "Reflections on NoteCards: Seven issues for the next generation of hypermedia systems," Communications of the ACM, 31, 7, pp. 836-852 (July 1988).
Hendricks, David, "The Translucent File Service," Proceedings EUUG Autumn 1988, Cascais, Portugal, October 1988, pp. 87-93.
Microsoft Corporation, "Managing Web Content Using Microsoft Visual SourceSafe," (March 1996).
Nicol, Gavin and Kent Summers, "Internet Publishing: Past, Present and Future," Electronic Book Technologies (April 1995).
Pintado, Xavier and Dennis Tsichritzis, "Satellite: Hypermedia Navigation by Affinity," Proceedings 1st European Conference on Hypertext (ECHT'90), pp. 274-287, Paris, France (Nov 1990).
Prevelakis, Vassilis and Dennis Tsichritzis, "Dynamic Version and Configuration Management Using Perspectives," Proceedings of the 4th Panhellenic Conference on Information Technology, pp. 369-387, Patras, Greece (Dec 1993).
Prevelakis, Vassilis, "A Model for the Organisation and Dynamic Reconfiguration of Information Networks," PhD Dissertation, University of Geneva, Geneva, Switzerland, 1996.
Prevelakis, Vassilis, "Views, Perspectives and the WWW," 6th Hellenic Conference on Informatics, Athens, Greece 1997.
Roscheisen, Martin and Terry Winograd, "Generalized Annotations for Shared Commenting, Content Rating and Other Collaborative Usages," Proceedings of the Workshop on World Wide Web and Collaboration, Cambridge, Mass. (Sep 1995).
Streitz, Norbert, Jorg Haake, Jorg Hannemann, Andreas Lemke, Wolfgang Schuler, Helge Schutt, and Manfred Thuring, "SEPIA: A Co-operative Hypermedia Authoring Environment," Proceedings of the 4th ACM Conference on Hypertext, pp. 11-22, Milano, Italy (Nov 1992).
Watters, Carolyn and Michael A. Shepherd, "A Transient Hypergraph-Based Model for Data Access," ACM Transactions on Information Systems, 8, 2, pp. 77-102 (April 1990).