Evaluating Automatically Generated Hypertext Versions of Scholarly Articles

James Blustein
Department of Computer Science
University of Western Ontario
London, Ontario, N6A 5B7
E-mail: jamie@csd.uwo.ca or jamie@acm.org
Created: 15 February 1998
Revised: 20 March 1998


Abstract

In this paper I present an experimental approach to the evaluation of a type of hypermedia application. My overall objective is to develop and evaluate ways of automatically incorporating hypermedia links into pre-existing scholarly journal articles. The focus of this paper is the evaluation method. My method allows the results to be applied to documents other than those tested.

To properly convert ordinary documents into useful hypermedia two constraints must be satisfied: the links must be useful to the readers and the risk of disorientation introduced by the new structure imposed by the links must be minimized. I describe a rule-based approach for making links. In my experiment I use two methods to detect when the rules should be applied. The effectiveness of the links is tested by people performing realistic tasks. Readers judge the quality of links (and thereby the quality of the rules used to forge them) and the overall effectiveness of the hypermedia.

Keywords Hypertext, Task-based Evaluation, Automatic Link Generation, Electronic Publishing

Document Structure This document is composed of a main text and two subsidiary texts (about the information retrieval methods and the experimental process used in the experiment). Links to those texts appear in the main text below.


  1. Abstract
  2. Introduction
    1. Motivation
    2. Hypermedia Evaluation
    3. What is Needed
    4. Some Potential Difficulties
  3. Details of Method
  4. Link Types
    1. Structural Links
    2. Definition Links
    3. Semantic Links
  5. How The Links Are Made
  6. Linking With Information Retrieval
  7. Evaluation Experiment
    1. Real-World Task
    2. Questions To Be Answered
    3. Generalizability of Results
    4. General Method
    5. Reader Training
    6. The Reader's Task
      1. After the Reading Task
      2. Statistical Analysis
  8. Summary
  9. References
Also: information retrieval methods and experimental process.


Introduction

Recently there have been calls for hypertext evaluation methods based on tasks [Dil96,CR96]. Here I describe the method I developed to evaluate ways of automatically incorporating links into scholarly journal articles. My evaluation focuses on people using the hypertext to perform real-world tasks. My method allows the results to be applied to documents other than those tested.


Motivation

Many electronic journals and electronic versions of paper journals already exist. It is only a matter of time before there is widespread use of hypermedia links (as seen in the World Wide Web). Therefore we should try to determine what types of links should be made and how to make them automatically.

I believe that articles with hypermedia links can be more useful than versions without links. Hypermedia is not always a suitable application, however. People sometimes prefer hypermedia even when they do more poorly with it [NL94]. They sometimes seem to do better with it even when they do not [Cha94]. We must therefore evaluate hypermedia with people performing real-world tasks in the way that we expect the systems will be used. We must measure users' performance, not merely their satisfaction. In any hypermedia evaluation we need to consider the people who will use the hypermedia and the tasks they will perform with it. The statistical model used in the evaluation should also reflect the underlying theory.

Hypermedia Evaluation

Task-based evaluation of hypermedia systems is somewhat unusual. It is easier to work with abstract models of user needs than with users [BWT+96]. There is a growing literature about the use of objective measures of the properties of networks of hypermedia documents, e.g. Botafogo, et al. [BRS92]. Too often hypermedia systems are tested only for acceptability to users rather than adequacy for tasks [Wri91]. Many systems are tested only for information retrieval characteristics. When systems are tested with people it is often as part of an iterative design process. The SuperBook experiments [ERG89] are probably the best known of this class. The SuperBook team pre-determined both their users' goals and methods. The result is a browsing system that is better suited to searching for certain types of information in electronic documents than in similar paper ones.

What is Needed

Although people read texts for many reasons, I am only concerned with the strategies they pursue when they read for information. These activities stretch from the directed, such as processing queries to find specific information, e.g. searching to find keywords, to the undirected, e.g. some types of browsing.

We know that readers of scholarly papers typically do not read those papers in a linear fashion [Wri93,DRM89,Ol94]. I want to make it easier to read such papers in a nonlinear fashion by making hypermedia versions that support the ways we believe people use paper versions while providing additional useful features.

The point of making hypermedia versions of scholarly articles is to help readers get the information encoded in the articles. What are a reader's goals in reading a scholarly article? Readers scan articles to quickly determine if an article's content is of interest to them [DRM89]. They browse articles to find particular portions that interest them. When readers search for specific information in an article they are querying. Egan et al. [BAB+87,ERG89] and Instone et al. [ITL93] evaluated how particular hypermedia can aid querying. Egan et al. tested their SuperBook hypermedia system's ability to help readers extract facts from a computer manual converted to hypermedia form. Instone et al. employed user testing to guide their redesign of a hypermedia version of a paper encyclopedia.

Some Potential Difficulties

Studies in cognitive psychology have shown that text comprehension -- how people read and understand ordinary text -- is a complex process, and that changes in the way text is presented change the way people think [Car89a]. Different types of text (e.g. poetry, narrative and discourse) create different expectations in readers [Cha94a]. Readers, especially those unfamiliar with computers, become confused when a document's structure is inconsistent with their expectations of its contents. Because hypermedia allows for unfamiliar types of text structuring, this problem can be considerably worse for its readers than for readers of ordinary text. One common difficulty readers have with hypermedia documents is knowing when they have read enough. McDonald and Stevenson reported that even expert users became confused in reading online hypermedia when they could not determine the length of the document they were reading [MS96].

The effects of the user interface (UI) cannot easily be separated from the effects of the structure [Wri93a,ERG89]. A poor user interface can render an otherwise excellent program unusable, but not even the best user interface imaginable can rescue a program that is not suitable for its users. In the case of a hypermedia version of a text document, the structure of links that makes up the hypermedia is more important than the UI that presents the links and text to the reader. My research concentrates on the process of link creation and the evaluation of hypermedia generated by the application of rules by rote. Recognizing the importance of user interfaces, I test my automatically generated hypermedia using a standard user interface and control for the effects of the UI in my evaluation experiment.

Details of Method

Link Types

Article readers need to: determine if a scholarly article interests them, focus on particular parts of that article, and locate information in it. I am creating three types of links to support those activities: structural, definitional and semantic.

Although the method I present is designed to be fully automatic it might be best implemented as a tool that suggests links to a human operator. The person could then decide which links to include.

Structural Links

To reduce readers' confusion, the hypermedia I create mimics printed articles in some important ways. My documents are all in one chunk so readers can gauge their location. Structural links mimic structural connections in the original text, e.g., citations, cross-references, and tables of contents. The documents each begin with a table of contents generated from the section headings. Each section begins with a link to the part of the table of contents for that section. Each subsection heading contains links to the enclosing section, e.g. section 1.2.3 would have links to sections 1 and 1.2.
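The rule for subsection headings can be sketched in a few lines. This is only an illustration: it assumes section numbers are dotted strings such as "1.2.3", and the function name enclosing_sections is hypothetical rather than taken from the system described here.

```python
def enclosing_sections(section_number: str) -> list[str]:
    """Return the numbers of every section that encloses the given one.

    Assumes dotted section numbers such as "1.2.3" (an illustrative
    convention, not necessarily the one used by the original system).
    """
    parts = section_number.split(".")
    # Every proper prefix of the number names an enclosing section.
    return [".".join(parts[:i]) for i in range(1, len(parts))]


print(enclosing_sections("1.2.3"))
```

A top-level section such as "4" has no enclosing sections, so the function returns an empty list for it.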

Definition Links

Definition links connect the use of special terms to their definition in the document. These links make it easier for readers to quickly determine the meaning of terms used in the article without having to read the entire document first.

Semantic Links

Semantic links connect related but separate passages. I make two main kinds of semantic links: links from summary sections to summarized sections, and links connecting related but distant passages. For example, a sentence in the abstract may have a link leading to the beginning of a passage which occurs later in the document and is summarized in the abstract. If created properly, semantic links can provide a method for readers to use documents more effectively than without them.

How The Links Are Made

Structural links

Structural links are made by applying simple pattern-matching rules.
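As an illustration of what such a pattern-matching rule might look like, the sketch below finds bracketed citation keys of the kind used in this paper (e.g. [NL94,Cha94]). The regular expression and the function name find_citation_keys are assumptions made for the example; the actual rules are not given in the text.

```python
import re

# Assumed citation style: one or more keys inside square brackets,
# separated by commas, each key ending in a two-digit year and an
# optional lowercase letter (e.g. [Dil96a], [BWT+96], [NL94,Cha94]).
KEY = r"[A-Za-z+]+\d{2}[a-z]?"
CITATION = re.compile(r"\[(" + KEY + r"(?:,\s*" + KEY + r")*)\]")


def find_citation_keys(text):
    """Return every citation key found in text, in order of appearance."""
    keys = []
    for match in CITATION.finditer(text):
        keys.extend(key.strip() for key in match.group(1).split(","))
    return keys


print(find_citation_keys("Calls for task-based methods [Dil96,CR96] and [BWT+96]."))
```

Each key found this way could become the source of a structural link whose destination is the matching entry in the reference list.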

Definition links

Technical terms are often italicized at the point where they are defined in a document. In such documents, defined terms are identified by their distinctive presentation style, and links are forged from each use of a term to its definition.

For documents where that convention is not followed the definitions are identified manually.
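The italics convention can be sketched as follows, assuming the article is available as HTML in which italics are marked with <i> or <em> tags. That markup, and the function name italicized_terms, are assumptions for the example only.

```python
import re

# Assumption: italics appear as <i>...</i> or <em>...</em> in the source.
ITALIC = re.compile(r"<(i|em)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def italicized_terms(html):
    """Return the distinct italicized phrases in html, in order of first use."""
    terms = []
    for match in ITALIC.finditer(html):
        # Collapse internal whitespace and ignore case so repeats are merged.
        term = " ".join(match.group(2).split()).lower()
        if term and term not in terms:
            terms.append(term)
    return terms


sample = "A <i>semantic link</i> joins passages. Its <em>head</em> is the destination."
print(italicized_terms(sample))
```

Italics are also used for emphasis and titles, so a list produced this way would still need the kind of human review suggested earlier.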

Another approach is to identify terms that occur frequently in the document but are rare in common vocabulary. Links could then be made from the use of terms to their first appearance or to a passage that contains indicative phrases such as `by foo we mean ...'.
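A minimal sketch of the frequency-based idea, under stated assumptions: the list of common words is supplied by the caller, words are simple runs of letters, and the threshold min_count is arbitrary. None of these details come from the method described here.

```python
import re
from collections import Counter


def candidate_terms(document, common_words, min_count=3):
    """Return words that occur often in document but are not common vocabulary.

    common_words is a set standing in for a general-vocabulary list;
    min_count is an arbitrary illustrative threshold.
    """
    words = re.findall(r"[a-z]+", document.lower())
    counts = Counter(words)
    return sorted(
        word for word, count in counts.items()
        if count >= min_count and word not in common_words
    )


text = "Hypermedia links help readers. Hypermedia links the text. The hypermedia the links."
print(candidate_terms(text, {"the", "links", "help", "text", "readers"}))
```

The first appearance of each candidate term could then serve as the destination of its definition links.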

Semantic links

I make two types of semantic links: scattered discussion links connect passages that discuss the same topic but are not adjacent, and summary links connect sentences in the abstract and conclusion of articles to the sections they summarize. The scattered discussion links are from either sentences or groups of words in sentences, to sentences, paragraphs, or sections. The source (also known as tail) and destination (also known as head) of the links are always indicated.

The following section provides more detail of the method I am using to make the semantic links. Please note that my discussion of that method is brief as I intend mostly to discuss the evaluation method.

Linking With Information Retrieval

I have written programs to detect the heads and tails of links automatically. Semantic links are based on the relatedness of the vocabulary used in the passages considered for linking. I am using the principles of information retrieval (IR) to detect the heads and tails of semantic links. In my prototype system I manually encode the links.

I will be comparing links made using two competing IR systems -- Bellcore's LSI [FDD+88,DDF+90] and Cornell's SMART [Buc92] -- against each other, and against a version without semantic links. Both systems have the same goals but different methods. Each method is used to create one hypermedia version of each document. Details of my evaluation method appear in a separate section below. By using two systems I hope to avoid developing a system-specific evaluation method.
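To make the idea of vocabulary relatedness concrete, here is a minimal sketch that scores passages by the cosine similarity of their raw word counts and picks the best destination for a source sentence. It is a generic stand-in, not an implementation of LSI or SMART, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(passage):
    """Count the words in a passage (a deliberately crude bag of words)."""
    return Counter(re.findall(r"[a-z]+", passage.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_destination(source, candidates):
    """Return (index, score) of the candidate passage most similar to source."""
    source_vector = term_vector(source)
    scores = [cosine(source_vector, term_vector(c)) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return best, scores[best]


sentence = "Links are evaluated by readers."
passages = ["The weather is cold today.", "Readers rate the links they followed."]
print(best_destination(sentence, passages))
```

A real system would also weight terms and apply a similarity threshold so that weakly related passages are not linked; those choices are left open here.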

A brief description of IR and the methods I am using is available in a separate document.

Evaluation Experiment

Hypermedia must be evaluated carefully. Readers sometimes prefer hypermedia to traditional documents even when it is more difficult for them to use [NL94,Cha94]. I believe that hypermedia like mine should be evaluated not only for user satisfaction but also for how useful it is. When we ask how useful hypermedia is, we must take into account the tasks it is being used for, the people who are performing those tasks, and the context in which they perform them. My view is supported by recent calls for hypermedia evaluation methods based on tasks. Chen and Rada [CR96] analysed 23 experiments with various hypermedia systems to determine what factors were important in determining users' success with hypermedia. They concluded with a call for a taxonomy of tasks to better compare different hypermedia systems and models. Dillon called for a `contextually determined view of usability' [Dil96a] as part of an approach to make better hypermedia.

Real-World Task

My evaluation focuses on people using the hypermedia to perform real-world tasks. Specifically I am concentrating on making links help researchers scan and browse scholarly articles. As they read each article subjects will write a brief summary of it; and the links they follow will be recorded automatically. After they have finished reading each article subjects will rate the quality of each link they followed. The summary is a real-world task that also serves as a comprehension test. Because readers of scholarly articles often write summaries of articles for future reference I do not expect the task to significantly change the way they use the article.

Questions To Be Answered

I expect my evaluation to provide answers to several interesting questions:


I have three major hypotheses:

My final hypothesis requires some explanation. The uses of semantic links are not as easy to grasp as the other types. Structural links and definition links are already present to some degree in printed papers, but semantic links are not. Some additional training may be required before semantic links can be best appreciated by readers.

Generalizability of Results

The combination of my link making and evaluation methods allows the results of the evaluation to be applied to other documents than just those tested. Because the links will be generated by rules the evaluation of a link will reflect on the rule used to generate the link. With this approach we can test rules for many documents. By determining which links readers found useful, and which links good readers liked we can determine which rules were helpful and which should be changed. Whether the rules produce useful links or not we can deduce their applicability to other documents.

Once I have made the links following rules, I will perform tests to evaluate how useful such links are to readers. If the links are useful then I'll be able to conclude that the rules used to make them could be successfully applied to other documents. If the links are not useful then I'll need to know why they were not. In either case I expect to be able to tell by debriefing the people who use the hypermedia documents.
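Because every link is produced by a rule, ratings of links can be rolled up into ratings of rules. The sketch below shows one simple way to do that, taking the mean rating per rule. The rule names and ratings are invented for illustration, and the mean is only one possible summary.

```python
from collections import defaultdict
from statistics import mean


def mean_rating_by_rule(ratings):
    """Average the link ratings for each rule.

    ratings is an iterable of (rule_name, rating) pairs, one pair per
    rated link; the rule names used below are hypothetical.
    """
    by_rule = defaultdict(list)
    for rule, rating in ratings:
        by_rule[rule].append(rating)
    return {rule: mean(values) for rule, values in by_rule.items()}


# Invented ratings, purely for illustration.
example = [("summary", 8), ("summary", 6), ("scattered-discussion", 3)]
print(mean_rating_by_rule(example))
```

A rule whose links are consistently rated low is a candidate for revision before it is applied to other documents.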

General Method

I am evaluating two things: hypermedia with and without semantic links, and two methods of creating such links. I will test three types of text: fully-linked hypermedia with semantic links created using SMART (C); fully-linked hypermedia with semantic links created using LSI (L); and primitive hypermedia, i.e. hypermedia with only structural links (P). The text with only structural links will serve as a control for the other two.

Table 1: Reading orders for the experiment

Primitive HM first      LSI HM first            Cornell HM first
S#  Reading Order       S#  Reading Order       S#  Reading Order
 1  P1 C2 L3            13  L1 C2 P3            25  C1 L2 P3
 2  P1 C3 L2            14  L1 C3 P2            26  C1 L3 P2
 3  P2 C1 L3            15  L2 C1 P3            27  C2 L1 P3
 4  P2 C3 L1            16  L2 C3 P1            28  C2 L3 P1
 5  P3 C1 L2            17  L3 C1 P2            29  C3 L1 P2
 6  P3 C2 L1            18  L3 C2 P1            30  C3 L2 P1
 7  P1 L2 C3            19  L1 P2 C3            31  C1 P2 L3
 8  P1 L3 C2            20  L1 P3 C2            32  C1 P3 L2
 9  P2 L1 C3            21  L2 P1 C3            33  C2 P1 L3
10  P2 L3 C1            22  L2 P3 C1            34  C2 P3 L1
11  P3 L1 C2            23  L3 P1 C2            35  C3 P1 L2
12  P3 L2 C1            24  L3 P2 C1            36  C3 P2 L1

Outcomes might be significantly affected by the documents themselves rather than just the links in them. I will attempt to eliminate this effect by using three documents (1, 2 and 3) presented in different orders to each subject. The Latin square [CC57] in Table 1 shows the various combinations that will be used with subjects to test my hypotheses. The first subject (S# 1) will read the primitively linked version of document 1 (P1), followed by the hypermedia version of document 2 created with rules implemented using SMART (C2), followed by the hypermedia version of document 3 created with rules implemented using LSI (L3). All the documents are of approximately the same length, of the same type (survey articles), and on the same topics of discourse (computer science or library and information science). I will be using graduate students from those departments as experimental subjects, so the reading level and field of discourse will be appropriate.
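The 36 reading orders in Table 1 pair each of the six orders of the three hypermedia conditions with each of the six orders of the three documents. The sketch below regenerates the table in the same sequence. The function name and the explicit list of condition orders are choices made for this example.

```python
from itertools import permutations

# Condition orders in the sequence used by Table 1:
# primitive (P) first, then LSI (L) first, then Cornell SMART (C) first.
CONDITION_ORDERS = [
    ("P", "C", "L"), ("P", "L", "C"),
    ("L", "C", "P"), ("L", "P", "C"),
    ("C", "L", "P"), ("C", "P", "L"),
]


def reading_orders():
    """Return the reading orders for subjects 1 to 36 as strings like 'P1 C2 L3'."""
    orders = []
    for conditions in CONDITION_ORDERS:
        # permutations() yields the document orders lexicographically:
        # (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), (3, 2, 1).
        for documents in permutations((1, 2, 3)):
            orders.append(" ".join(f"{c}{d}" for c, d in zip(conditions, documents)))
    return orders


for subject, order in enumerate(reading_orders(), start=1):
    print(subject, order)
```

Each subject sees every condition once and every document once, so no document is presented twice to the same person.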

Reader Training

Research indicates that so-called active readers -- those who think about how they are reading as well as what they are reading -- do better with hypermedia than other, more passive readers [Cha94,CR96,RT96a]. To make subjects more like active readers I will supply them with a list of general questions to consider when reading the documents, e.g. `What is the author's main point? Do they make it well?'. I will get the exact questions from the Learn Write Centre or the Educational Resources Office at The University of Western Ontario. Subjects will be free to ignore the questions. It may be argued that an attempt to make subjects behave more like active readers will affect the results of the experiment. I maintain that active reading skills, like familiarization with the UI, are a necessary precondition for effective use of a hypermedia system.

All documents will be presented with the same WWW browser (so the user interface will be identical). Subjects will be trained and will practice reading, scrolling, using the `Back' button, etc. on a neutral text.

The Reader's Task

I will ask experimental subjects to complete the following tasks:
  1. Subjects will be told that they should imagine themselves as researchers interested in the topic of the article and that they should write a summary of it, to include in an annotated bibliography for instance.

  2. The links that each subject follows will be recorded automatically. For further details about the link-tracking method and the interface see the subsidiary text about link-tracking. The document will be in one of the three forms listed above.

  3. Each subject will write a brief summary of the article they are reading.

  4. If the subject followed any links in the document, I will show them the source and destination locations of each link they followed, in order, and ask them to rate the quality of the link on a nine-point scale. The links will have been created by rules, so I will be asking subjects to rate the usefulness of the links (and hence the underlying rules) after they have used the links. Care must be taken to ensure that subjects' ratings reflect the appropriateness and usefulness of a link's destination and not whether there should have been a link where there was one.

    The subject will answer some basic comprehension questions about the article. The subject will also complete a short multiple choice questionnaire rating their experience with the article.

  5. Steps 2 to 4 will be repeated for three different documents, each of which will be composed of a different article and hypermedia condition (see Table 1). Because they will read a document without links created with either of the test methods, subjects will act as their own controls.
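The recording and rating steps above imply a simple record for each link followed. The sketch below is one hypothetical way to hold that data. The class names, the anchor strings, and the assumption that the nine-point scale runs from 1 to 9 are all illustrative and are not taken from the link-tracking system itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LinkVisit:
    """One followed link; rating stays None until the subject rates it."""
    source: str
    destination: str
    rating: Optional[int] = None


class LinkLog:
    """Keeps the links a subject followed, in the order they were followed."""

    def __init__(self) -> None:
        self.visits: List[LinkVisit] = []

    def record(self, source: str, destination: str) -> None:
        self.visits.append(LinkVisit(source, destination))

    def rate(self, index: int, rating: int) -> None:
        # Assumption: the nine-point scale is coded as the integers 1 to 9.
        if not 1 <= rating <= 9:
            raise ValueError("rating must be between 1 and 9")
        self.visits[index].rating = rating

    def unrated(self) -> List[LinkVisit]:
        return [visit for visit in self.visits if visit.rating is None]


log = LinkLog()
log.record("abstract, sentence 2", "section 4")
log.record("section 4", "section 7")
log.rate(0, 7)
print(len(log.unrated()))
```

Keeping the visits in order allows the links to be shown back to the subject in the sequence in which they were followed, as step 4 requires.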

After the Reading Task

After reading each document, readers will be asked to rate the article and the hypermedia using questions from a generic user-evaluation questionnaire for interactive systems [QUIS94]. Answers to the questions will show which type of hypermedia each subject prefers.

Independent judges familiar with the article's field will score subjects' answers to the comprehension questions. Judges will rate the answers as a percentage correct based on the content of the article, not as a measure of what they might expect the subject to know. Similarly, subjects' summaries will be scored for accuracy and completeness.

Statistical Analysis

The Latin square design in Table 1 allows me to use a powerful repeated measures design to test my hypotheses. Various types of analysis of variance can be computed to test them. The rules for forming the links will be the independent variables. The outcomes of the experiment will be the dependent variables.

Another possible experimental design would have the pool of subjects partitioned into groups, each of which would read one of the possible article/link-type combinations. The repeated measures design, however, is less prone to erroneously rejecting the null hypothesis because every subject acts as their own control.
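To show how each subject acting as their own control enters the arithmetic, here is a minimal sketch of a one-way repeated measures analysis of variance. It is a simplification: it treats hypermedia condition as the only factor and ignores the order and document effects that the full design would also model. The scores are made up.

```python
def repeated_measures_anova(scores):
    """One-way repeated measures ANOVA.

    scores holds one row per subject and one column per condition.
    Returns (F statistic, condition degrees of freedom, error degrees of freedom).
    """
    n = len(scores)        # number of subjects
    k = len(scores[0])     # number of conditions
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    condition_means = [sum(row[j] for row in scores) / n for j in range(k)]
    subject_means = [sum(row) / k for row in scores]

    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_conditions = n * sum((m - grand_mean) ** 2 for m in condition_means)
    # Variation between subjects is removed from the error term; this is
    # where each subject serves as their own control.
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_error = ss_total - ss_conditions - ss_subjects

    df_conditions = k - 1
    df_error = (k - 1) * (n - 1)
    f_statistic = (ss_conditions / df_conditions) / (ss_error / df_error)
    return f_statistic, df_conditions, df_error


# Hypothetical comprehension scores; columns are conditions P, L and C.
hypothetical_scores = [
    [5, 7, 6],
    [4, 6, 7],
    [6, 8, 8],
    [5, 7, 7],
]
print(repeated_measures_anova(hypothetical_scores))
```

The resulting F statistic would be compared with the critical value for the two degrees of freedom returned; that lookup is omitted here.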


Summary

I am developing methods to create and evaluate hypertext versions of scholarly articles. The evaluation focuses on people using the articles in realistic tasks. My evaluation method allows the results to be applied to documents other than those tested.

Article readers need to: determine if a scholarly article interests them, focus on particular parts of that article, and locate information in it. I am creating three types of links to support those activities: structural, definitional and semantic. Structural links mimic structural connections in the original text, e.g., citations and cross-references. Definition links connect the use of special terms to their definition in the document. Semantic links connect related but separate passages.

Links will be generated by rules. The evaluation of a link reflects on the rules used to generate the link. With this approach we can test rules for many documents. Whether the rules produce useful links, or not, we can deduce their applicability to other documents.

Experimental subjects each read three hypertext documents: one with only simple links, and two with simple links and semantic links formed by one of the two methods. The document with only the simple links acts as a control for the others. No document will be presented twice to any subject. Because the order of presentation can be a significant factor, a Latin square (see Table 1) is used to balance the order of presentation. As they read each article subjects write a brief summary of it, and the links they follow are recorded. The summary is a real-world task that also serves as a comprehension test. Readers of scholarly articles often write summaries of articles for future reference. The task is not intrusive as, for instance, talking aloud would be. After they have finished reading each article subjects rate the quality of each link they followed. Since links are made by applying rules, care must be taken to ensure that the ratings reflect the usefulness of the links.

I am using the method to test several hypotheses. For instance: readers will comprehend the articles presented with hypertext better than those without; readers will prefer definition links to structural links, and structural links to semantic links. My evaluation method allows the results to be applied to documents other than those tested.

The method I present takes into account the people who will use the hypermedia documents, and the tasks I expect them to use the documents for. I measure users' performance and not merely their satisfaction, because we know that users sometimes prefer systems that are detrimental to their performance. I believe my method will demonstrate how a class of hypermedia documents should be linked and evaluated.


References

William O. Beeman, Kenneth T. Anderson, Gail Bader, James Larkin, Anne P. McClard, Patrick McQuillan, and Mark Shields. Superbook: An automatic tool for information exploration: hypertext? In Hypertext '87 Papers, pages 175 - 188, The University of North Carolina, Chapel Hill, North Carolina, 13 - 15 November 1987. Association for Computing Machinery.
Rodrigo A. Botafogo, Ehud Rivlin, and Ben Shneiderman. Structural Analysis of Hypertexts: Identifying Hierarchies and Useful Metrics. ACM Transactions on Information Systems, 10(2): 142 - 180, 1992.
See also: Yamada et al. [YHS95], Smeaton and Morrissey [SM95] for application of some of the metrics. Cf. Smith [Sm96] and Blustein et al. [BWTS97].
James Blustein, Robert E. Webber, and Jean Tague-Sutcliffe. Methods for evaluating the quality of hypertext links. Information Processing & Management, 33(2):255 - 271, 1997. Note: The abstract is available online.
Chris Buckley. Smart version 11.0 available via anonymous FTP at <URL:ftp://ftp.cs.cornell.edu/pub/smart/smart.11.0.tar.Z>, 12 July 1992.
Patricia Ann Carlson. Hypertext and intelligent interfaces for text retrieval. In Edward Barrett, editor, The Society of Text: Hypertext, Hypermedia, and the Social Construction of Information. The MIT Press, 1989. Cited passage: p. 73.
Davida Charney. The effect of hypertext on processes of reading and writing. In Cynthia L. Selfe and Susan Hilligoss, editors, Literacy and Computers: The Complications of Teaching and Learning with Technology, chapter 10, pages 238 - 263. The Modern Language Association of America, 1994. Cited passage: p. 245.
Chaomei Chen and Roy Rada. Interacting with hypertext: A meta-analysis of experimental studies. Human-Computer Interaction, 11(2):125 - 156, 1996.
William G. Cochran and Gertrude M. Cox. Experimental Designs. Wiley, second edition, 1957. Third printing, 1962.
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391 - 407, September 1990.
Andrew Dillon. TIMS: A framework for the design of usable electronic text. In Herre van Oostendorp and Sjaak de Mul, editors, Cognitive Aspects of Electronic Text Processing, volume LVIII of Advances in Discourse Processes, chapter 5, pages 99 - 119. Ablex Publishing Corporation, 1996. Cited passage: p. 102.
Andrew Dillon, John Richardson, and Cliff McKnight. Human factors of journal usage and design of electronic texts. Interacting with Computers, 1(2):183 - 189, 1989.
Dennis E. Egan, Joel R. Remde, Louis M. Gomez, Thomas K. Landauer, Jennifer Eberhardt, and Carol C. Lochbaum. Formative design-evaluation of SuperBook. ACM Transactions on Information Systems, 7(1):30 - 57, January 1989.
George W. Furnas, Scott Deerwester, Susan T. Dumais, Thomas K. Landauer, Richard A. Harshman, Lynn A. Streeter, and Karen E. Lochbaum. Information retrieval using a singular value decomposition model of latent semantic structure. In SIGIR '88, Grenoble, France, 1988.
Keith Instone, Barbee Mynatt Teasley, and Laura Marie Leventhal. Empirically-based re-design of a hypertext encyclopedia. In Stacey Ashlund, Kevin Mullet, Austin Henderson, Erik Hollnagel, and Ted White, editors, Proceedings of INTERCHI 1993, pages 500 - 506. Addison-Wesley, 24 - 29 April 1993.
Jean M. Mandler and Nancy S. Johnson. Remembrance of things parsed: Story structure and recall. Cognitive Psychology, 9:111 - 151, 1977.
Sharon McDonald and Rosemary J. Stevenson. Disorientation in hypertext: the effects of three text structures on navigation performance. Applied Ergonomics, 27(1):61 - 68, 1996.
Jan Olsen. Electronic Journal Literature: Implications for Scholars. Mecklermedia, 1994.
Jakob Nielsen and Jonathan Levy. Measuring usability: Preference vs. performance. Communications of the ACM, 37(4), April 1994.
QUIS 5.5b: The questionnaire for user interaction satisfaction. Available for license from University of Maryland's Office of Technology Liaison, 1994. © University of Maryland Human-Computer Interaction Laboratory.
Jean-François Rouet and André Tricot. Task and Activity Models in Hypertext Usage. In Herre van Oostendorp and Sjaak de Mul, editors, Cognitive Aspects of Electronic Text Processing, volume LVIII of Advances in Discourse Processes, chapter 11, pages 239 - 264. Ablex Publishing Corporation, 1996. Cited passage: pp. 256 - 257.
Ben Shneiderman. Reflections on authoring, editing, and managing hypertext. In Edward Barrett, editor, The Society of Text: Hypertext, Hypermedia, and the Social Construction of Information. The MIT Press, 1989.
Alan F. Smeaton and Patrick J. Morrissey. Experiments on the automatic construction of hypertext from texts. The New Review of Hypermedia and Multimedia, pp. 23 - 39, 1995.
Pauline A. Smith. Towards a practical measure of hypertext usability. Interacting with Computers, 8(4):365 - 381, 1996.
Note: Focuses on performance of specific information seeking tasks. Goal of developing a set of measures which would enable the usability of a hypermedia system to be assessed in terms of: (1) effectiveness with which users find information in hypertext; (2)  degree to which they become lost; (3) their confidence in their ability to find the right information.
Patricia Wright. Cognitive Overheads and Prostheses: Some Issues in Evaluating Hypertexts. In Janet H. Walker, editor, Hypertext '91: Third ACM Conference on Hypertext Proceedings, 1991.
Patricia Wright. To jump or not to jump: Strategy selection while reading electronic texts. In C. McKnight, A. Dillon, and J. Richardson, editors, Hypertext: A psychological perspective. Ellis Horwood, 1993. Cited passage: pp. 146 - 147.
Shoji Yamada, Jung-Kook Hong, and Shigeharu Sugita. Development and evaluation of Hypermedia for Museum Education: Validation of Metrics. ACM Transactions on Computer-Human Interaction, 2(4):284 - 307, December 1995.


Copyright © J. Blustein, 1998.