Xml: Study of Controlling Overlap

Amit  Dhamija

Xml: Study of Controlling Overlap

by Amit Dhamija*,

- Published in Journal of Advances in Science and Technology, E-ISSN: 2230-9659

Volume 3, Issue No. 4, Feb 2012, Pages 0 - 0 (0)

Published by: Ignited Minds Journals

ABSTRACT

The direct application of standard ranking techniques to retrieveindividual elements from a collection of XML documents often produces a resultset in which the top ranks are dominated by a large number of elements takenfrom a small number of highly relevant documents. This paper presents andevaluates an algorithm that re-ranks this result set, with the aim ofminimizing redundant content while preserving the benefits of elementretrieval, including the benefit of identifying topic-focused componentscontained within relevant documents. The test collection developed by theInitiative for the Evaluation of XML Retrieval (INEX) forms the basis for theevaluation.

KEYWORD

XML, ranking techniques, retrieval, redundant content, element retrieval, topic-focused components, test collection, INEX

1. INTRODUCTION

The representation of documents in XML provides an opportunity for information retrieval systems to take advantage of document structure, returning individual document components when appropriate, rather than complete documents in all circumstances. In response to a user query, an XML information retrieval system might return a mixture of paragraphs, sections, articles, bibliographic entries and other components. This facility is of particular benefit when a collection contains very long documents, such as product manuals or books, where the user should be directed to the most relevant portions of these documents.

Figure : A journal article encoded in XML.

Figure provides an example of a journal article encoded in XML, illustrating many of the important characteristics of XML documents. Tags indicate the beginning and end of each element, with elements varying widely in size, from

Available online at www.ignited.in Page 2

one word to thousands of words. Some elements, such as paragraphs and sections, may be reasonably presented to the user as retrieval results, but others are not appropriate. Elements overlap each other articles contain sections, sections contain subsections, and subsections contain paragraphs. Each of these characteristics affects the design of an XML IR system, and each leads to fundamental problems that must be solved in an successful system. Most of these fundamental problems can be solved through the careful adaptation of standard IR techniques, but the problems caused by overlap are unique to this area [4,11] and form the primary focus of this paper. The article of figure may be viewed as an XML tree, as illustrated in figure . Formally, a collection of XML documents may be represented as a forest of ordered, rooted trees, consisting of a set of nodes N and a set of directed edges E connecting these nodes. For each node x N, the notation x: parent refers to the parent node of x, if one exists, and the notation x: children refers to the set of child nodes

Figure : Example XML tree.

of x. Since an element may be represented by the node at its root, the output of an XML IR system may be viewed as a ranked list of the top-m nodes. The direct application of a standard relevance ranking technique to a set of XML elements can produce a result in which the top ranks are dominated by many structurally related elements. A high scoring section is likely to contain several high scoring paragraphs and to be contained in an high scoring article. For example, many of the elements in figure would receive a high score on the keyword query “text index compression algorithms”. If each of these elements are presented to a user as an individual and separate result, she may waste considerable time reviewing and rejecting redundant content. One possible solution is to report only the highest scoring element along a given path in the tree, and to remove from the lower ranks any element containing it, or contained within it. Unfortunately, this approach destroys some of the possible benefits of XML IR. For example, an outer element may contain a substantial amount of information that does not appear in an inner element, but the inner element may be heavily focused on the query topic and provide a short overview of the key concepts. In such cases, it is reasonable to report elements which contain, or are contained in, higher ranking elements. Even when an entire book is relevant, a user may still wish to have the most important paragraphs highlighted, to guide her reading and to save time [6]. This paper presents a method for controlling overlap. Starting with an initial element ranking, a re-ranking algorithm adjusts the scores of lower ranking elements that contain, or are contained within, higher ranking elements, reflecting the fact that this information may now be redundant. For example, once an element representing a section appears in the ranking, the scores for the paragraphs it contains and the article that contains it are reduced. The inspiration for this strategy comes partially from recent work on structured documents retrieval, where

Available online at www.ignited.in Page 3

terms appearing in different fields, such as the title and body, are given different weights [20]. Extending that approach, the re-ranking algorithm varies weights dynamically as elements are processed. The remainder of the paper is organized as follows: After a discussion of background work and evaluation methodology, a baseline retrieval method is presented in section 4. This baseline method represents a reasonable adaptation of standard IR technology to XML. Section 5 then outlines a strategy for controlling overlap, using the baseline method as a starting point. A re-ranking algorithm implementing this strategy is presented in section 6 and evaluated in section 7. Section 8 discusses an extended version of the algorithm.

2. BACKGROUND

This section provides a general overview of XML information retrieval and discusses related work, with an emphasis on the fundamental problems mentioned in the introduction. Much research in the area of XML retrieval views it from a traditional database perspective, being concerned with such problems as the implementation of structured query languages [5] and the processing of joins [1]. Here, we take a “content oriented” IR perceptive, focusing on XML documents that primarily contain natural language data and queries that are primarily expressed in natural language. We assume that these queries indicate only the nature of desired content, not its structure, and that the role of the IR system is to determine which elements best satisfy the underlying information need. Other IR research has considered mixed queries, in which both content and structural requirements are specified [2, 6, 14, 17, 23]. Term and Document Statistics - In traditional information retrieval applications the standard unit of retrieval is taken to be the “document”. Depending on the application, this term might be interpreted to encompass many different objects, including web pages, newspaper articles and email messages. When applying standard relevance ranking techniques in the context of XML IR, a natural approach is to treat each element as a separate “document”, with term statistics available for each [16]. In addition, most ranking techniques require global statistics (e.g. inverse document frequency) computed over the collection as a whole. If we consider this collection to include all elements that might be returned by the system, a specific occurrence of a term may appear in several different “documents”, perhaps in elements representing a paragraph, a subsection, a section and an article. It is not appropriate to compute inverse document frequency under the assumption that the term is contained in all of these elements, since the number of elements that contain a term depends entirely on the structural arrangement of the documents [13, 23]. Retrievable Elements - While an XML IR system might potentially retrieve any element, many elements may not be appropriate as retrieval results. This is usually the case when elements contain very little text [10]. For example, a section title containing only the query terms may receive a high score from a ranking algorithm, but alone it would be of limited value to a user, who might prefer the actual section itself. Other elements may reflect the document's physical, rather than logical, structure, which may have little or no meaning to a user. An effective XML IR system must return only those elements that have sufficient content to be usable and are able to stand alone as independent objects [15, 18]. Standard document components such as paragraphs, sections, subsections,

Available online at www.ignited.in Page 4

and abstracts usually meet these requirements; titles, italicized phrases, and individual metadata fields often do not. Evaluation Methodology - Over the past three years, the Initiative for the Evaluation of XML Retrieval (INEX) has encouraged research into XML information retrieval technology [7, 8]. INEX is an experimental conference series, similar to TREC, with groups from different institutions completing one or more experimental tasks using their own tools and systems, and comparing their results at the conference itself. Over 50 groups participated in INEX 2004, and the conference has become as influential in the area of XML IR as TREC is in other IR areas. The research described in this paper, as well as much of the related work it cites, depends on the test collections developed by INEX. Overlap causes considerable problems with retrieval evaluation, and the INEX organizers and participants have wrestled with these problems since the beginning. While substantial progress has been made, these problem are still not completely solved. Kazai et al. [11] provide a detailed exposition of the overlap problem in the context of INEX retrieval evaluation and discuss both current and proposed evaluation metrics. Many of these metrics are applied to evaluate the experiments reported in this paper, and they are briey outlined in the next section.

3. BASELINE RETRIEVAL METHOD

This section provides an overview of baseline XML information retrieval method currently used in the Multi Text IR system, developed by the Information Retrieval Group at the University of Waterloo [3]. This retrieval method results from the adaptation and tuning of the Okapi BM25 measure [21] to the XML information retrieval task. The Multi Text system performed respectably at INEX 2004, placing in the top ten under all of the quantization functions, and placing first when the quantization function emphasized exhaustively. To support retrieval from XML and other structured document types, the system provides generalized queries of the form: rank X by Y where X is a sub-query specifying a set of document elements to be ranked and Y is a vector of sub-queries specifying individual retrieval terms. For our INEX 2004 runs, the sub-query X specified a list of retrievable elements as those with tag names as follows: abs app article bb bdy bm fig fm ip1 li p sec ss1 ss2 vt This list includes bibliographic entries (bb) and _gure captions (fig) as well as paragraphs, sections and subsections. Prior to INEX 2004, the INEX collection and the INEX 2003 relevance judgments were manually analyzed to select these tag names. Tag names were selected on the basis of their frequency in the collection, the average size of their associated elements, and the relative number of positive relevance judgments they received. Automating this selection process is planned as future work. For INEX 2004, the term vector Y was derived from the topic by splitting phrases into individual words, eliminating stop words and negative terms (those starting with “-”), and applying a stemmer. For example, keyword field of topic

166

became the four-term query

Available online at www.ignited.in Page 5

where the “$" operator within a quoted string stems the term that follows it. Our implementation of Okapi BM25 is derived from the formula of Robertson et al. [21] by setting parameters k2 = 0 and Given a term set Q, an element x is assigned the score Where

Figure : Impact of k1 on inex-2002 mean average precision with b = 0:75 (INEX 2003 CO topics).

Prior to INEX 2004, the INEX 2003 topics and judgments were used to tune the b and k1 parameters, and the impact of this tuning is discussed later in this section. For the purposes of computing document-level statistics (D, Dt and lavg) a document is defined to be an article. These statistics are used for ranking all element types. Following the suggestion of Kamps et al. [10], the retrieval results are filtered to eliminate very short elements, those less than 25 words in length. The use of article statistics for all element types might be questioned. This approach may be justified by viewing the collection as a set of articles to be searched using standard document-oriented techniques, where only articles may be returned. The score computed for an element is essentially the score it would receive if it were added to the collection as a new document, ignoring the minor adjustments needed to the document-level statistics. Nonetheless, we plan to examine this issue again in the future. In our experience, the performance of BM25 typically benefits from tuning the b and k1 parameters to the collection, whenever training queries are available for this purpose. Prior to INEX 2004, we trained the MultiText system using the INEX 2003 queries. As a starting point we used the values b = 0:75 and k1 = 1:2, which perform well on TREC adhoc collections and are used as default values in our system. The results were surprising. Figure shows the result of varying k1 with b = 0:75 on the MAP values under three quantization functions. In our experience, optimal values for k1 are typically in the range 0.0 to 2.0. In this case, large values are required for good performance. Between k1 = 1:0 and k1 = 6:0 MAP increases by over 15% under the strict quantization. Similar improvements

Available online at www.ignited.in Page 6

are seen under the generalized and sog quantizations. In contrast, our default value of b = 0:75 works well under all quantization functions (figure ). After tuning over a wide range of values under several quantization functions, we selected values of k = 10:0 and b = 0:80 for our INEX 2004 experiments, and these values are used for the experiments reported in section 7.

Figure : Impact of b on inex-2002 mean average precision with k1 = 10 (INEX 2003 CO topics).

4. EVALUATION

None of the metrics described in section 3.3 is a close fit with the view of overlap advocated by this paper. Nonetheless, when taken together they provide insight into the behaviour of the re-ranking algorithm. The INEX evaluation packages (inex_eval and inex_eval_ng) were used to compute values for the inex-2002 and inex-2003 metrics. Values for the XCG metrics were computed using software supplied by its inventors [11]. Figure plots the three variants of inex-2002 MAP metric together with the XCG metric. Values for these metrics

Figure : Impact of _ on inex-2003 MAP (INEX 2004 CO topics; assessment set I).

are plotted for values of between 0.0 and 1.0. Recalling that the XCG metric is designed to penalize overlap, while the inex-2002 metric ignores overlap, the conflict between the metrics is obvious. The MAP values at one extreme and the XCG value at the other extreme represent retrieval performance comparable to the best systems at INEX 2004 [8, 12]. Figure plots values of the inex-2003 MAP metric for two quantizations, with and without consideration of overlap.

Available online at www.ignited.in Page 7

Once again, conflict is apparent, with the influence of substantially lessened when overlap is considered.

5. EXTENDED ALGORITHM

One limitation of the re-ranking algorithm is that a single weight _ is used to adjust the scores of both the ancestors and descendants of reported elements. An obvious extension is to use different weights in these two cases. Furthermore, the same weight is used regardless of the number of times an element is contained in a reported element. For example, a paragraph may form part of a reported section and then form part of a reported article. Since the user may now have seen this paragraph twice, its score should be further lowered by increasing the value of the weight. Motivated by these observations, the re-ranking algorithm may be extended with a series of weights Where is the weight applied to a node that has been a descendant of a reported node j times. Note that an upper bound on M is h, the maximum height of any XML tree in the collection. However, in practice M is likely to be relatively small (perhaps 3 or 4). Figure presents replacements for the Up and Down routines of _gure 6, incorporating this series of weights. One extra _eld is required in each node, as follows: The value of x: j is initially set to zero in all nodes and is incremented each time Down is called with x as its argument. When computing the score of node, the value of x: j selects

Figure : Extended tree traversal routines.

the weight to be applied to the node by adjusting the value of xt in equation 1, as follows: where ft and gt are the components of corresponding to term t. A few additional changes are required to extend Up and Down. The Up routine returns immediately (line 2) if its argument has already been reported, since term frequencies have already been adjusted in its ancestors. The Down routine does not report its argument, but instead re-computes its score and adds it back into S.

Available online at www.ignited.in Page 8

A node cannot be an argument to Down more than M+1 times, which in turn implies an overall time complexity of O((nM + mh) log n). Since the time complexity is also O(nh log n).

6. CONCLUDING DISCUSSION

When generating retrieval results over an XML collection, some overlap in the results should be tolerated, and may be beneficial. For example, when a highly exhaustive and fairly specific (3,2) element contains a much smaller (2,3) element, both should be reported to the user, and retrieval algorithms and evaluation metrics should respect this relationship. The algorithm presented in this paper controls overlap by weighting the terms occurring in reported elements to reflect their reduced importance. Other approaches may also help to control overlap. For example, when XML retrieval results are presented to users it may be desirable to cluster structurally related elements together, visually illustrating the relationships between them. While this style of user interface may help a user cope with overlap, the strategy presented in this paper continues to be applicable, by determining the best elements to include in each cluster. At Waterloo, we continue to develop and test our ideas for INEX 2005. In particular, we are investigating methods for learning the _ and _j weights. We are also re-evaluating our approach to document statistics and examining appropriate adjustments to the k1 parameter as term weights change [20].

7. REFERENCES

1. N. Bruno, N. Koudas, and D. Srivastava. Holistic twig joins: Optimal XML pattern matching. In Proceedings of the 2002 ACM SIGMOD International Conference on the Management of Data, pages 310{321, Madison, Wisconsin, June

2002.

2. D. Carmel, Y. S. Maarek, M. Mandelbrod, Y. Mass, and A. Sofler. Searching XML documents via XML fragments. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 151{158, Toronto, Canada, 2003. 3. C. L. A. Clarke and P. L. Tilker. MultiText experiments for INEX 2004. In INEX 2004 Workshop Proceedings, 2004. Published in LNCS

3493 [8].

4. A. P. de Vries, G. Kazai, and M. Lalmas. Tolerance to irrelevance: A user-effort oriented evaluation of retrieval systems without predefined retrieval unit. In RIAO 2004 Conference Proceedings, pages 463{473, Avignon, France, April 2004. 5. D. DeHaan, D. Toman, M. P. Consens, and M. T. Ozsu. A comprehensive XQuery to SQL translation using dynamic interval encoding. In Proceedings of the 2003 ACM SIGMOD International Conference on the Management of Data, San Diego, June

2003.

6. N. Fuhr and K. GroBjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 172{180, New Orleans, September 2001.

Available online at www.ignited.in Page 9

7. N. Fuhr, M. Lalmas, and S. Malik, editors. Initiative for the Evaluation of XML Retrieval. Proceedings of the Second Workshop (INEX 2003), Dagstuhl, Germany, December 2003. 8. N. Fuhr, M. Lalmas, S. Malik, and Zoltan Szlavik, editors. Initiative for the Evaluation of XML Retrieval. Proceedings of the Third Workshop (INEX 2004), Dagstuhl, Germany, December 2004. Published as Advances in XML Information Retrieval, Lecture Notes in Computer Science, volume 3493, Springer, 2005. 9. K. and J. . Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems,

20(4):422{446, 2002.

10. J. Kamps, M. de Rijke, and . Length normalization in XML retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 80{87, She_eld, UK, July 2004. 11. G. Kazai, M. Lalmas, and A. P. de Vries. The overlap problem in content-oriented XML retrieval evaluation. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 72{79, Sheffield, UK, July 2004. 12. G. Kazai, M. Lalmas, and A. P. de Vries. Reliability tests for the XCG and inex-2002 metrics. In INEX 2004 Workshop Proceedings, 2004. Published in

LNCS 3493 [8].

13. , M. Junkkari, P. Arvola, and T. Aalto. TRIX 2004 | Struggling with the overlap. In INEX 2004 Workshop Proceedings, 2004. Published in LNCS 3493 [8]. 14. S. Liu, Q. Zou, and W. W. Chu. Configurable indexing and ranking for XML information retrieval. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 88{95, Sheffield, UK, July 2004. 15. Y. Mass and M. Mandelbrod. Retrieving the most relevant XML components. In INEX 2003 Workshop Proceedings, Dagstuhl, Germany, December 2003. 16. Y. Mass and M. Mandelbrod. Component ranking and automatic query refinement for XML retrieval. In INEX 2004 Workshop Proceedings, 2004. Published in LNCS 3493 [8]. 17. P. Ogilvie and J. Callan. Hierarchical language models for XML component retrieval. In INEX 2004 Workshop Proceedings, 2004. Published in LNCS

3493 [8].

18. J. Pehcevski, J. A. Thom, and A. Vercoustre. Hybrid XML retrieval re-visited. In INEX 2004 Workshop Proceedings, 2004. Published in LNCS

3493 [8].

19. B. Piwowarski and M. Lalmas. Providing consistent and exhaustive relevance assessments for XML retrieval evaluation. In Proceedings of the 13th ACM Conference on Information and Knowledge Management, pages 361{370, Washington, DC, November 2004.

Available online at www.ignited.in Page 10

20. S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. In Proceedings of the 13th ACM Conference on Information and Knowledge Management, pages 42{50, Washington, DC, November 2004. 21. S. E. Robertson, S. Walker, and M. Beaulieu. Okapi at TREC-7: Automatic ad hoc, filtering, VLC and interactive track. In Proceedings of the Seventh Text REtrieval Conference, Gaithersburg, MD, November 1998. 22. A. Trotman and . NEXI, now and next. In INEX 2004 Workshop Proceedings, 2004. Published in LNCS 3493 [8]. 23. J. Vittaut, B. Piwowarski, and P. Gallinari. An algebra for structured queries in bayesian networks. In INEX 2004 Workshop Proceedings, 2004. Published in LNCS 3493 [8].