Paltoglou, G., Salampasis, M., Satratzemi, M.: Hybrid Results Merging, In Proc. 16th Conference on Information and Knowledge Management. Lisbon, 6-9 November 2007
Abstract -: The problem of results merging in distributed information retrieval environments has been approached from two different directions in research. Estimation approaches attempt to calculate the relevance of the returned documents through ad-hoc methodologies (weighted score merging, regression, etc.), while download approaches download the documents locally, partially or completely, in order to estimate their relevance “first hand”. Both have their advantages and disadvantages. It is assumed that download algorithms are more effective, but they are very expensive in terms of time and bandwidth. Estimation approaches, on the other hand, usually rely on document relevance scores being returned by the remote collections in order to achieve maximum performance. In addition, regression algorithms, which have proved to be more effective than weighted score merging, rely on a significant number of overlap documents in order to function effectively, practically requiring multiple interactions with the remote collections. The new algorithm that is introduced reconciles the above two approaches, combining their strengths while minimizing their weaknesses. It is based on downloading a limited, selected number of documents from the remote collections and estimating the relevance of the rest through regression methodologies. The proposed algorithm is tested in a variety of settings and its performance is found to be better than that of estimation approaches, while approximating that of download approaches.
Paltoglou, G., Salampasis, M., Satratzemi, M.: Results Merging Algorithm Using Multiple Regression Models, In Proc. 29th European Conference on Information Retrieval, 2007, p. 173-184. (Acceptance Rate: 19%).
Abstract -: This paper describes a new algorithm for merging the results of remote collections in a distributed information retrieval environment. The algorithm makes use only of the ranks of the returned documents, thus making it very efficient in environments where the remote collections provide minimal cooperation. Assuming that the correlation between the ranks and the relevancy scores can be expressed through a logistic function, and using sampled documents from the remote collections, the algorithm assigns local scores to the returned ranked documents. Subsequently, using a centralized sample collection and through linear regression, it assigns global scores, thus producing a final merged document list for the user. The algorithm’s effectiveness is measured against two state-of-the-art results merging algorithms and its performance is found to be superior to them in environments where the remote collections do not provide relevancy scores.
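The two-stage idea in this abstract (ranks to local scores through a logistic function, then local to global scores through linear regression) can be illustrated with a minimal sketch. It is not the paper's implementation: the parameter names `a`, `b`, `slope` and `intercept`, the exact logistic form, and the helper names are all assumptions made for illustration, and in practice the parameters would be fitted from sampled documents rather than supplied by hand.

```python
import math

def logistic_score(rank, a, b):
    """Assumed logistic form: maps a rank (1 = best) to a local score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(a + b * rank))

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def merge(results, params):
    """results: collection -> doc ids in rank order.
    params: collection -> (a, b, slope, intercept), all hypothetical.
    Returns (doc_id, global_score) pairs, best first."""
    merged = []
    for name, docs in results.items():
        a, b, slope, intercept = params[name]
        for rank, doc_id in enumerate(docs, start=1):
            local = logistic_score(rank, a, b)
            merged.append((doc_id, slope * local + intercept))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```

Here `fit_line` stands in for the regression step: given pairs of local and centralized-sample scores for the same documents, it would yield the `slope` and `intercept` that `merge` consumes.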
Paltoglou, G., Salampasis, M., Satratzemi M. Modeling information sources as integrals for effective and efficient source selection, Information Processing and Management Journal Elsevier (accepted).
Abstract -: In this paper, a new source selection algorithm for uncooperative distributed information retrieval environments is presented. The algorithm functions by modeling each information source as an integral, using the relevance score and the intra-collection position of its sampled documents in reference to a centralized sample index, and selects the collections that cover the largest area in the rank-relevance space. Based on the above novel metric, the algorithm explicitly focuses on addressing the two goals of source selection: high recall, which is important for source recommendation applications, and high precision, which is important for distributed information retrieval, aiming to produce a high-precision final merged list. For the latter goal in particular, the new approach steps away from the usual practice of DIR systems of explicitly declaring the number of collections that must be queried and instead focuses solely on the number of retrieved documents in the final merged list, dynamically calculating the number of collections that are selected and the number of documents requested from each. The algorithm is tested on a wide range of testbeds in both recall-oriented and precision-oriented settings and its effectiveness is found to be equal to or better than that of other state-of-the-art algorithms.
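As a rough illustration of "area in the rank-relevance space", the sketch below places each collection's sampled documents at intra-collection positions 1, 2, 3, ... with their centralized-sample scores as heights, and approximates the area with the trapezoidal rule. The abstract does not give the actual formulation, so the trapezoidal rule, the unit spacing, and the names `area_under_scores` and `rank_sources` are assumptions.

```python
def area_under_scores(scores):
    """Trapezoidal area under scores placed at positions 1, 2, ..., n.

    Scores are sorted best first, so position i is the i-th best sampled
    document of the collection (unit spacing is an assumption).
    """
    ordered = sorted(scores, reverse=True)
    return sum((s1 + s2) / 2.0 for s1, s2 in zip(ordered, ordered[1:]))

def rank_sources(samples):
    """samples: collection name -> scores of its sampled documents.

    Returns collection names ordered by decreasing area.
    """
    areas = {name: area_under_scores(scores) for name, scores in samples.items()}
    return sorted(areas, key=areas.get, reverse=True)
```

Under this reading, a collection with several well-scored sampled documents covers more area than one with a single top-scoring document, which matches the intuition of favouring sources that are likely to hold many relevant documents.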
G. Paltoglou, M. Salampasis, and M. Satratzemi. A Comparison of Centralized and Distributed Information Retrieval Approaches. In Proceedings of the 12th Pan-Hellenic Conference on Informatics
Abstract -: Distributed Information Retrieval (DIR) has been suggested to offer a prospective solution to a number of issues concerning information retrieval in the WWW. On the other hand, previous studies have indicated that centralized approaches offer the best solution for optimal quality of results (i.e. effectiveness). In this paper, we revisit those claims and investigate whether, and under which conditions, DIR can offer a new paradigm for both efficient and effective information retrieval.
Paltoglou G., Salampasis, M., Satratzemi, M., Evangelidis, G.: Using linkage information to approximate the distribution of relevant documents in DIR, In Proc. 11th Panhellenic Conference on Informatics (2007).
Abstract -: In this paper, a method for addressing the source selection problem in a distributed information retrieval environment is presented. The method proposes a solution to the source selection problem by making use of the outlink distribution extracted from a sampling collection in order to locate the most authoritative collections for a particular query. Experiments carried out with the algorithm show that its performance exceeds that of the uniform approach, with much more economical means. Specifically, the link-based method is much more efficient in terms of collection utilization than the uniform strategy, which can be applied under the same conditions. A key feature of the algorithm is that it can be combined and work in parallel with content-based source selection algorithms, thus enhancing performance in information seeking environments containing linkage information.
Georgios Paltoglou, Michail Salampasis and Fotis Lazarinis: Indexing and retrieval of a Greek corpus. In Proceedings of the 2nd International ACM Workshop Improving Non-English Web Searching (iNEWS08) CIKM08
Abstract -: Greek is one of the most difficult languages to handle in Web Information Retrieval (IR) related tasks. Its difficulty stems from the fact that it is grammatically, morphologically and orthographically more complex than the lingua franca of IR, English. In this paper, we address a significant number of issues that originate from the Greek language. We use a number of techniques to determine the correct encoding that is used by web pages written in Greek. We test the effect of using a Greek stopword list in a realistic and controlled Web environment. We employ a character mapping scheme in order to overcome the problem of the diversity of diacritics used in the language, such as accents and diaeresis. We utilize word distance and fuzzy similarity metrics in order to make up for the different forms in which nouns, verbs and articles appear because of conjugations and inflections, and additionally handle Greeklish queries, a transliterated form of Greek. The conducted experiments present some effective ways to increase the accuracy in Greek IR tasks.
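A character mapping of the kind mentioned above can be approximated with Unicode decomposition, as in the following sketch. This is a generic stand-in rather than the paper's mapping scheme: the function name `normalize_greek` is hypothetical, and folding the final sigma into the regular sigma is an added assumption.

```python
import unicodedata

def normalize_greek(text):
    """Strip diacritics (accents, diaeresis) and fold case for index terms."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category "Mn").
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    # Assumption: treat final sigma and regular sigma as the same character.
    return stripped.lower().replace("ς", "σ")
```

Applying the same function to both documents and queries means that a query typed without accents can still match accented index terms.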
G. Paltoglou, M. Salampasis, and M. Satratzemi. Simple adaptations of data fusion algorithms for source selection. In Proceedings of 31st European Conference on Information Retrieval (ECIR 2009)
Abstract -: Source selection deals with the problem of selecting the most appropriate information sources from the set of, usually non-intersecting, available document collections. On the other hand, data fusion techniques (also known as metasearch techniques) deal with the problem of aggregating the results from multiple, usually completely or partly intersecting, document sources in order to provide a wider coverage and a more effective retrieval result. In this paper we study some simple adaptations to traditional data fusion algorithms for the task of source selection in uncooperative distributed information retrieval environments. The experiments demonstrate that the performance of data fusion techniques at source selection tasks is comparable with that of state-of-the-art source selection algorithms and they are often able to surpass them.
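One plausible reading of such an adaptation, offered only as a sketch, is to treat each collection as a "voter" and sum the scores of its sampled documents in a ranking obtained from a centralized sample index, in the spirit of the CombSUM fusion rule. The abstract does not specify which fusion algorithms were adapted or how, so the function name `combsum_sources` and the input layout are assumptions.

```python
from collections import defaultdict

def combsum_sources(ranked_sample, doc_to_source):
    """CombSUM-style source scoring.

    ranked_sample: (doc_id, score) pairs from a centralized sample index.
    doc_to_source: doc_id -> name of the collection it was sampled from.
    Returns (source, total_score) pairs, best first.
    """
    totals = defaultdict(float)
    for doc_id, score in ranked_sample:
        totals[doc_to_source[doc_id]] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The top entries of the returned list would then be the collections selected for querying.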