In this paper I present an experimental approach to the evaluation of a type of hypermedia application. My overall objective is to develop and evaluate ways of automatically incorporating hypermedia links into pre-existing scholarly journal articles. The focus of this paper is the evaluation method. My method allows the results to be applied to other documents than just those tested.
To properly convert ordinary documents into useful hypermedia two constraints must be satisfied: the links must be useful to the readers and the risk of disorientation introduced by the new structure imposed by the links must be minimized. I describe a rule-based approach for making links. In my experiment I use two methods to detect when the rules should be applied. The effectiveness of the links is tested by people performing realistic tasks. Readers judge the quality of links (and thereby the quality of the rules used to forge them) and the overall effectiveness of the hypermedia.
Keywords: Hypertext, Task-based Evaluation, Automatic Link Generation, Electronic Publishing
Document Structure: This document is composed of a main text and two subsidiary texts (about the information retrieval methods and the experimental process used in the experiment). Links to those texts appear in the main text below.
Recently there have been calls for hypertext evaluation methods based on tasks [Dil96,CR96]. Here I describe the method I developed to evaluate ways of automatically incorporating links into scholarly journal articles. My evaluation focuses on people using the hypertext to perform real-world tasks. My method allows the results to be applied to other documents than just those tested.
Many electronic journals and electronic versions of paper journals already exist. It is only a matter of time before there is widespread use of hypermedia links (as seen in the World Wide Web). Therefore we should try to determine what types of links should be made and how to make them automatically.
I believe that articles with hypermedia links can be more useful than versions without links. Hypermedia is not always a suitable application, however. People sometimes prefer hypermedia even when they do more poorly with it [NL94]. They sometimes seem to do better with it even when they do not [Cha94]. We must therefore evaluate hypermedia with people performing real-world tasks in the way that we expect the systems will be used. We must measure users' performance, not merely their satisfaction. In any hypermedia evaluation we need to consider the people who will use the hypermedia and the tasks they will perform with it. Obviously, the statistical model used in the evaluation should reflect the underlying theory.
Although people read texts for many reasons, I am only concerned with the strategies they pursue when they read for information. These activities stretch from the directed, such as processing queries to find specific information, e.g. searching to find keywords, to the undirected, e.g. some types of browsing.
We know that readers of scholarly papers typically do not read those papers in a linear fashion [Wri93,DRM89,Ol94]. I want to make it easier to read such papers in a nonlinear fashion by making hypermedia versions that support the ways we believe people use paper versions while providing additional useful features.
The point of making hypermedia versions of scholarly articles is to help readers get the information encoded in the articles. What are a reader's goals in reading a scholarly article? Readers scan articles to quickly determine if an article's content is of interest to them [DRM89]. They browse articles to find particular portions that interest them. When readers search for specific information in an article they are querying. Egan et al. [BAB+87,ERG89] and Instone et al. [ITL93] evaluated how particular hypermedia can aid querying. Egan et al. tested their SuperBook hypermedia system's ability to help readers extract facts from a computer manual converted to hypermedia form. Instone et al. employed user testing to guide their redesign of a hypermedia version of a paper encyclopedia.
Studies in cognitive psychology have shown that text comprehension -- how people read and understand ordinary text -- is a complex process, and that changes in the way text is presented change the way people think [Car89a]. Different types of text (e.g. poetry, narrative and discourse) create different expectations in readers [Cha94a]. Readers, especially those unfamiliar with computers, become confused when a document's structure is inconsistent with their expectations of its contents. Because hypermedia allows for unfamiliar types of text structuring, this problem can be considerably worse for its readers than for readers of ordinary text. One common difficulty readers have with hypermedia documents is knowing when they have read enough. McDonald and Stevenson reported that even expert users became confused in reading online hypermedia when they could not determine the length of the document they were reading [MS96].
The effects of the user interface (UI) cannot easily be separated from the effects of the structure [Wri93a,ERG89]. A poor user interface can render an otherwise excellent program unusable, but not even the best user interface imaginable can rescue a program that is not suitable for its users. In the case of a hypermedia version of a text document, the structure of links that makes up the hypermedia is more important than the UI that presents the links and text to the reader. My research concentrates on the process of link creation and the evaluation of hypermedia generated by the application of rules by rote. Recognizing the importance of user interfaces, I test my automatically generated hypermedia using a standard user interface and control for the effects of the UI in my evaluation experiment.
Article readers need to: determine if a scholarly article interests them, focus on particular parts of that article, and locate information in it. I am creating three types of links to support those activities: structural, definitional and semantic.
Although the method I present is designed to be fully automatic it might be best implemented as a tool that suggests links to a human operator. The person could then decide which links to include.
Definition links connect the use of special terms to their definition in the document. These links make it easier for readers to quickly determine the meaning of terms used in the article without having to read the entire document first.
Semantic links connect related but separate passages. I make two main kinds of semantic links: links from summary sections to summarized sections, and links connecting related but distant passages. For example, a sentence in the abstract may have a link leading to the beginning of a passage which occurs later in the document and is summarized in the abstract. If created properly, semantic links can let readers use documents more effectively than they could without them.
Structural links are made by applying simple pattern-matching rules.
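As a minimal sketch of what such pattern-matching rules might look like (not the programs used in this work), the following finds bracketed citation keys in the style used in this paper and references to numbered tables, figures, or sections. The names `CITATION`, `CROSS_REF` and `find_structural_links` are hypothetical, and the patterns are assumptions about one document style; resolving each match to its destination is left out.

```python
import re

# Assumed citation style: bracketed keys such as [NL94] or [FDD+88,DDF+90].
KEY = r"[A-Za-z+]+\d{2}[a-z]?"
CITATION = re.compile(r"\[(" + KEY + r"(?:,\s*" + KEY + r")*)\]")

# Assumed cross-reference style: a label followed by a number, e.g. "Table 1".
CROSS_REF = re.compile(r"\b(Table|Figure|Section)\s+(\d+)\b")


def find_structural_links(text):
    """Return (kind, target, offset) tuples for each link source found in text."""
    links = []
    for match in CITATION.finditer(text):
        # One bracket may hold several keys; each key becomes its own link.
        for key in re.split(r",\s*", match.group(1)):
            links.append(("citation", key, match.start()))
    for match in CROSS_REF.finditer(text):
        target = f"{match.group(1)} {match.group(2)}"
        links.append(("cross-reference", target, match.start()))
    # Report links in the order in which a reader would meet them.
    return sorted(links, key=lambda link: link[2])
```

Each tuple records only the source of a link; a second pass over the reference list and the table captions would be needed to find the destinations.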
Technical terms are often italicized at the point where they are defined in a document. For those documents definition terms are identified by their unusual presentation style and links are forged from the use of those terms to their definition.
For documents where that convention is not followed the definitions are identified manually.
Another approach is to identify terms that occur frequently in the document but are rare in common vocabulary. Links could then be made from the use of terms to their first appearance or to a passage that contains indicative phrases such as `by foo we mean ...'.
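The two ideas in this alternative approach can be sketched as follows. This is only an illustration under stated assumptions: `COMMON_WORDS` is a tiny hypothetical stand-in for a real general-vocabulary list, `min_count` is an arbitrary threshold, and the cue pattern covers only the single phrase form quoted above.

```python
import re
from collections import Counter

# Hypothetical stand-in for a general-vocabulary word list; a real list
# would be far larger and probably weighted by frequency.
COMMON_WORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "we", "by",
    "with", "that", "for", "it", "mean", "text",
}

# Indicative phrase of the form "by <term> we mean".
DEFINITION_CUE = re.compile(r"\bby\s+(\w+)\s+we\s+mean\b", re.IGNORECASE)


def candidate_terms(text, min_count=3):
    """Words that are frequent in the document but absent from COMMON_WORDS."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        word for word, n in counts.items()
        if n >= min_count and word not in COMMON_WORDS
    )


def defined_terms(text):
    """Terms introduced by the indicative phrase, in order of appearance."""
    return [match.group(1).lower() for match in DEFINITION_CUE.finditer(text)]
```

A term that appears in both lists is a stronger candidate for a definition link than one that appears in only one of them.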
I make two types of semantic links: scattered discussion links connect passages that discuss the same topic but are not adjacent, and summary links connect sentences in the abstract and conclusion of articles to the sections they summarize. The scattered discussion links lead from either sentences or groups of words in sentences to sentences, paragraphs, or sections. The source (also known as tail) and destination (also known as head) of each link are always indicated.
The following section provides more detail of the method I am using to make the semantic links. Please note that my discussion of that method is brief as I intend mostly to discuss the evaluation method.
I have written programs to detect the heads and tails of links automatically. Semantic links are based on the relatedness of the vocabulary used in the passages considered for linking. I am using the principles of information retrieval (IR) to detect the heads and tails of semantic links. In my prototype system I manually encode the links.
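To illustrate what linking by vocabulary relatedness can mean, here is a minimal sketch that compares passages by the cosine of their raw term-frequency vectors. It is neither LSI nor SMART, both of which use more refined weighting and representations; the function names and the `threshold` value are assumptions made for the example. Adjacent passages are skipped because scattered discussion links join passages that are not adjacent.

```python
import math
import re
from collections import Counter


def term_vector(passage):
    """Bag-of-words term frequencies for one passage."""
    return Counter(re.findall(r"[a-z]+", passage.lower()))


def cosine(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


def propose_links(passages, threshold=0.5):
    """Return (tail, head, score) for non-adjacent passages that look related."""
    vectors = [term_vector(p) for p in passages]
    links = []
    for i in range(len(vectors)):
        for j in range(i + 2, len(vectors)):  # i + 2 skips the adjacent passage
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                links.append((i, j, score))
    return links
```

In this sketch the earlier passage is always the tail of the link; a real system would have to decide the direction, and the threshold, from evidence.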
I will compare links made using two competing IR systems, Bellcore's LSI [FDD+88,DDF+90] and Cornell's SMART [Buc92], against each other and against a version without semantic links. Both systems have the same goals but use different methods. Each method is used to create one hypermedia version of each document. Details of my evaluation method appear in a separate section below. By using two systems I hope to avoid developing a system-specific evaluation method.
A brief description of IR and the methods I am using is available in a separate document.
Hypermedia must be evaluated carefully. Readers sometimes prefer hypermedia to traditional documents even when it is more difficult for them to use [NL94,Cha94]. I believe that hypermedia like mine should be evaluated not only for user satisfaction but also for how useful it is. When we ask how useful hypermedia is, we must take into account the tasks it is being used for, the people who are performing those tasks, and the context in which they perform them. My view is supported by recent calls for hypermedia evaluation methods based on tasks. Chen and Rada [CR96] analysed 23 experiments with various hypermedia systems to determine what factors were important in determining users' success with hypermedia. They concluded with a call for a taxonomy of tasks to better compare different hypermedia systems and models. Dillon called for a `contextually determined view of usability' [Dil96a] as part of an approach to make better hypermedia.
My evaluation focuses on people using the hypermedia to perform real-world tasks. Specifically, I am concentrating on making links that help researchers scan and browse scholarly articles. As they read each article, subjects will write a brief summary of it, and the links they follow will be recorded automatically. After they have finished reading each article, subjects will rate the quality of each link they followed. The summary is a real-world task that also serves as a comprehension test. Because readers of scholarly articles often write summaries of articles for future reference, I do not expect the task to significantly change the way they use the article.
I have three major hypotheses:
The combination of my link making and evaluation methods allows the results of the evaluation to be applied to other documents than just those tested. Because the links will be generated by rules, the evaluation of a link will reflect on the rule used to generate the link. With this approach we can test rules for many documents. By determining which links readers found useful, and which links good readers liked, we can determine which rules were helpful and which should be changed. Whether or not the rules produce useful links, we can deduce their applicability to other documents.
Once I have made the links following rules, I will perform tests to evaluate how useful such links are to readers. If the links are useful then I'll be able to conclude that the rules used to make them could be successfully applied to other documents. If the links are not useful then I'll need to know why they were not. In either case I expect to be able to tell by debriefing the people who use the hypermedia documents.
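The step from link ratings to rule evaluations can be sketched as a simple aggregation. This is an illustration only: it assumes each rated link is tagged with the name of the rule that made it, that ratings use the nine-point scale described in the procedure, and that a mean is an acceptable summary. The rule names in the usage lines are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def rule_scores(ratings):
    """Average the ratings of links grouped by the rule that generated them.

    ratings is an iterable of (rule_name, rating) pairs, where rating is an
    integer on the nine-point scale (1 to 9).
    """
    by_rule = defaultdict(list)
    for rule, rating in ratings:
        if not 1 <= rating <= 9:
            raise ValueError(f"rating {rating!r} is outside the 1-9 scale")
        by_rule[rule].append(rating)
    return {rule: mean(values) for rule, values in by_rule.items()}


# Hypothetical rule names and ratings, for illustration only.
example = rule_scores([("summary", 8), ("summary", 6), ("scattered", 3)])
```

A low average does not say why a rule failed, which is why the debriefing described above is still needed.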
I am evaluating two things: hypermedia with and without semantic links, and two methods of creating such links. I will test three types of text: fully-linked hypermedia with semantic links created using SMART (C); fully-linked hypermedia with semantic links created using LSI (L); and primitive hypermedia, i.e. hypermedia with only structural links (P). The text with only structural links will serve as a control for the other two.
Outcomes might be significantly affected by the documents themselves rather than just the links in them. I will attempt to eliminate this effect by using three documents (1, 2 and 3) presented in different orders to each subject. The Latin square [CC57] in Table 1 shows the various combinations that will be used with subjects to test my hypotheses. The first subject (S# 1) will read the primitively linked version of document 1 (P1), followed by the hypermedia version of document 2 created with rules implemented using SMART (C2), followed by the hypermedia version of document 3 created with rules implemented using LSI (L3). All the documents are of approximately the same length, type (survey articles) and topic of discourse (computer science or library and information science). I will be using graduate students from those departments as experimental subjects, so the reading level and field of discourse will be appropriate.
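One way to build such a square is by cyclic rotation, sketched below. This is an assumption-laden illustration and not a reproduction of Table 1: it keeps the documents in a fixed reading order and rotates only the link conditions, whereas the actual table may also vary the order of the documents. The function name `latin_square` is hypothetical; the first row matches the sequence described above for the first subject.

```python
CONDITIONS = ["P", "C", "L"]  # primitive, SMART, LSI
DOCUMENTS = [1, 2, 3]


def latin_square(conditions, documents):
    """Pair each document with a condition so that every condition appears
    exactly once in each row (subject) and each column (document)."""
    n = len(conditions)
    if len(documents) != n:
        raise ValueError("need as many documents as conditions")
    return [
        [f"{conditions[(row + col) % n]}{documents[col]}" for col in range(n)]
        for row in range(n)
    ]


for number, sequence in enumerate(latin_square(CONDITIONS, DOCUMENTS), start=1):
    print(f"S# {number}: {' '.join(sequence)}")
```

With more than three subjects the rows would simply be reused in turn, so that each sequence is read by about the same number of people.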
Research indicates that so-called active readers -- those who think about how they are reading as well as what they are reading -- do better with hypermedia than other, more passive readers [Cha94,CR96,RT96a]. To make subjects more like active readers I will supply them with a list of general questions to consider when reading the documents, e.g. `What is the author's main point? Do they make it well?'. I will get the exact questions from the Learn Write Centre or the Educational Resources Office at The University of Western Ontario. Subjects will be free to ignore the questions. It may be argued that an attempt to make subjects behave more like active readers will affect the results of the experiment. I maintain that active reading skills, like familiarization with the UI, are a necessary precondition for effective use of a hypermedia system.
All documents will be presented with the same WWW browser (so the user interface will be identical). Subjects will be trained and will practice reading, scrolling, using the `Back' button, etc. on a neutral text.
Subjects will be told that they should imagine themselves as researchers interested in the topic of the article and that they should write a summary of it, to include in an annotated bibliography for instance.
The links that each subject follows will be recorded automatically. For further details about the link-tracking method and the interface see the subsidiary text about link-tracking. The document will be in one of three forms listed above.
Each subject will write a brief summary of the article they are reading.
If the subject followed any links in the document, I will show them the source and destination locations of each link they followed, in order, and ask them to rate the quality of the link on a nine-point scale. The links will have been created by rules, so I will be asking subjects to rate the usefulness of the links (and hence the underlying rules) after they have used the links. Care must be taken to ensure that subjects' ratings reflect the appropriateness and usefulness of a link's destination and not whether there should have been a link where there was one.
The subject will answer some basic comprehension questions about the article. The subject will also complete a short multiple choice questionnaire rating their experience with the article.
Steps 2 to 4 will be repeated for three different documents, each of which will be composed of a different article and hypermedia condition (see Table 1). Because they will read a document without links created with either of the test methods, subjects will act as their own controls.
After reading each document, readers will be asked to rate the article and the hypermedia using questions from a generic user-evaluation questionnaire for interactive systems [QUIS94]. Answers to the questions will show which type of hypermedia each subject prefers.
Independent judges familiar with the article's field will score subjects' answers to the comprehension questions. Judges will rate the answers as a percentage correct based on the content of the article, not as a measure of what they might expect the subject to know. Similarly, subjects' summaries will be scored for accuracy and completeness.
The Latin square model in Table 1 allows me to use the powerful repeated measures design for my hypotheses. Various types of analysis of variance can be computed to test the hypotheses. The rules for forming the links will be the independent variables. The outcomes of the experiment will be the dependent variables.
Another possible experimental design would have the pool of subjects partitioned into groups, each of which would read one of the possible article/link-type combinations. The repeated measures design, however, is less prone to erroneously rejecting the null hypothesis because every subject acts as their own control.
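To make the repeated measures idea concrete, here is a minimal sketch of the F statistic for a one-way repeated measures analysis of variance, in which variation between subjects is removed from the error term. It is a simplification and not the full analysis: it ignores the document and order factors that the Latin square balances, and the function name and data layout are assumptions.

```python
def repeated_measures_f(scores):
    """One-way repeated measures ANOVA.

    scores[s][c] is the score of subject s under condition c.
    Returns (F, df_conditions, df_error).
    """
    n = len(scores)                      # number of subjects
    k = len(scores[0]) if scores else 0  # number of conditions
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a full table with at least 2 subjects and 2 conditions")

    grand = sum(sum(row) for row in scores) / (n * k)
    cond_means = [sum(row[c] for row in scores) / n for c in range(k)]
    subj_means = [sum(row) / k for row in scores]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_cond = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    # Subject-to-subject variation is taken out of the error term.
    ss_error = ss_total - ss_cond - ss_subj

    df_cond = k - 1
    df_error = (k - 1) * (n - 1)
    if ss_error <= 0:
        raise ValueError("no residual variation; F is undefined")
    f = (ss_cond / df_cond) / (ss_error / df_error)
    return f, df_cond, df_error
```

The returned F would then be compared with the critical value of the F distribution for the two degrees of freedom.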
I am developing methods to create and evaluate hypertext versions of scholarly articles. The evaluation focuses on people using the articles in realistic tasks. My evaluation method allows the results to be applied to other documents than just those tested.
Article readers need to: determine if a scholarly article interests them, focus on particular parts of that article, and locate information in it. I am creating three types of links to support those activities: structural, definitional and semantic. Structural links mimic structural connections in the original text, e.g., citations and cross-references. Definition links connect the use of special terms to their definition in the document. Semantic links connect related but separate passages.
Links will be generated by rules. The evaluation of a link reflects on the rules used to generate the link. With this approach we can test rules for many documents. Whether the rules produce useful links, or not, we can deduce their applicability to other documents.
Experimental subjects each read three hypertext documents: one with only simple links, and two with simple links and semantic links formed by one of the two methods. The document with only the simple links acts as a control for the others. No document will be presented twice to any subject. Because the order of presentation can be a significant factor, a Latin square (see Table 1) is used to balance the order of presentation. As they read each article, subjects write a brief summary of it, and the links they follow are recorded. The summary is a real-world task that also serves as a comprehension test. Readers of scholarly articles often write summaries of articles for future reference. The task is not intrusive as, for instance, talking aloud would be. After they have finished reading each article, subjects rate the quality of each link they followed. Since links are made by applying rules, care must be taken to ensure that the rating reflects the usefulness of the link.
I am using the method to test several hypotheses. For instance: readers will comprehend the articles presented with hypertext better than those without; readers will prefer definition links to structural links, and structural links to semantic links. My evaluation method allows the results to be applied to other documents than just those tested.
The method I present takes into account the people who will use the hypermedia documents, and the tasks I expect them to use the documents for. I measure users' performance and not merely their satisfaction, because we know that users sometimes prefer systems that are detrimental to their performance. I believe my method will demonstrate how a class of hypermedia documents should be linked and evaluated.
Copyright © J. Blustein, 1998.