| CS-2006-01 | ||||
|---|---|---|---|---|
| Title | SI-WF 2Q: WF2Q Approximation with Small Constant Execution Overhead - Extended Version | |||
| Authors | Martin Karsten | |||
| Abstract | This paper presents a novel priority queue data structure and access operations for timer maintenance in the context of traffic shaping and scheduling in packet networks. The data structure and operations are used to construct an efficient General Processor Sharing (GPS) approximation scheduler. In contrast to existing proposals, this scheduler has constant and near-optimal delay and fairness properties, /and/ can be implemented with bounded amortized O(1) algorithmic complexity, /and/ has very low absolute execution overhead. The paper presents the data structure, the scheduling algorithm, and studies its execution complexity. The scheduling properties are analyzed and shown to relate nicely to existing work. Some illustrative simulation results are presented to reaffirm those properties. | |||
| Date | January 2006 | |||
| Report | SI-WF 2Q: WF2Q Approximation with Small Constant Execution Overhead - Extended Version (PDF) | |||
| CS-2006-02 | ||||
|---|---|---|---|---|
| Title | An Investigation on Sieve and Detour Effects Affecting the Interaction of Infrared Radiation with Plant Leaves | |||
| Authors | Gladimir V. G. Baranoski and Denise Eng | |||
| Abstract | The retrieval of plant biophysical and biochemical properties from high spectral resolution data represents an active area of research within the remote sensing field. Scientific studies in this area are usually supported by computational simulations of light attenuation processes within foliar tissues. In heterogeneous organic materials, like plant leaves, sieve and detour effects can affect these processes, and ultimately change the light gradients within these tissues and their spectral signatures. Although these effects have been extensively examined for applications involving the interactions of visible radiation with plant leaves, little is known about their role in the infrared domain. In this paper, we describe the procedural basis for their incorporation in the modelling of infrared radiation transport (in the range of 750 to 2500nm) within plant leaves. We also assess their impact on the predictability of simulation solutions relating the directionality of the incident radiation and the internal arrangement of the tissues to changes on foliar spectral signatures in this domain. Our investigation is grounded by observations involving modeled results and quantitative and qualitative data reported in the literature | |||
| Date | November 2, 2006 | |||
| Report | An Investigation on Sieve and Detour Effects Affecting the Interaction of Infrared Radiation with Plant Leaves (PDF) | |||
| CS-2006-03 | ||||
|---|---|---|---|---|
| Title | Optimal lower bounds for rank and select indexes | |||
| Authors | Alexander Golynski | |||
| Abstract | We develop a new lower bound technique for data structures. We show an optimal $\Omega(n \lg\lg n / \lg n)$ space lower bounds for storing an index that allows to implement rank and select queries on a bit vector $B$ provided that $B$ is stored explicitly. These results improve upon [Miltersen, SODA'05]. We show $\Omega((m/t) \lg t)$ lower bounds for storing rank/select index in the case where $B$ has $m$ $1$-bits in it (e.g. low $0$-th entropy) and the algorithm is allowed to probe $t$ bits of $B$. We simplify select index given in [Raman et al., SODA'02] and show how to implement both rank and select queries with an index of size $(1 + o(1)) (n \lg\lg n / \lg n) + O(n / \lg n)$ (i.e. we give an explicit constant for storage) in the RAM model with word size $\lg n$. | |||
| Date | February 10, 2006 | |||
| Report | Optimal lower bounds for rank and select indexes (PDF) | |||
| CS-2006-04 | ||||
|---|---|---|---|---|
| Title | Comparisions on Different Approachs to Assign Missing Attribute Values | |||
| Authors | Jiye Li and Nick Cercone | |||
| Abstract | A commonly-used and naive solution to process data with missing attribute values is to ignore the instances which contain missing attribute values. This method may neglect important information within the data, significant amount of data could be easily discarded, and the discovered knowledge may not contain significant rules. Some methods, such as assigning the most common values or assigning an average value to the missing attribute, may make good use of all the available data. However the assigned value may not come from the information which the data originally derived, thus noise is brought to the data. We introduce a new approach RSFit on processing data with missing attribute values based on rough sets theory. By matching attribute-value pairs among the same core or reduct of the original data set, the assigned value preserves the characteristics of the original data set. We compare our approach with "closest fit approach globally" and "closest fit approach in the same concept". Experimental results on UCI data sets and a real geriatric care data set show our approach achieves comparable accuracy on assigning the missing values while significantly reduces the computation time. | |||
| Date | January 2006 | |||
| Report | Comparisions on Different Approachs to Assign Missing Attribute Values (PDF) | |||
| CS-2006-05 | ||||
|---|---|---|---|---|
| Title | A Scalable Peer-to-peer Protocol Enabling Efficient and Flexible Search | |||
| Authors | Reaz Ahmed and Raouf Boutaba | |||
| Abstract | Efficient
        discovery
        of
        information,
        based
        on
        partial
        knowledge,
        is
        a
        challenging
        problem
        faced
        by
        many
        large
        scale
        distributed
        systems.
        This
        paper
        presents
        a
        peer-to-peer
        search
        protocol
        that
        addresses
        this
        problem.
        The
        proposed
        system
        provides
        an
        efficient
        mechanism
        for
        advertising
        a
        binary
        pattern,
        and
        discovering
        it
        using
        any
        subset
        of
        its
        1-bits.
        A
        pattern
        (e.g.,
        Bloom
        filter)
        summarizes
        the
        properties
        (e.g.,
        keywords
        or
        service
        description)
        associated
        with
        a
        shared
        object
        (e.g.,
        document
        or
        service). The proposed system has a partially decentralized architecture involving superpeers and adopts a novel structured search mechanism derived from the theory of Error Correcting Codes (ECC). Better resilience to peer failure is achieved by utilizing replication and redundant routing paths. The number of routing hops and the number of links maintained by each superpeer scales logarithmically with the number of superpeers. The concept presented in this paper is supported with theoretical analysis, and simulation results. | |||
| Date | January 2006 | |||
| Report | A Scalable Peer-to-peer Protocol Enabling Efficient and Flexible Search (PDF) | |||
| CS-2006-06 | ||||
|---|---|---|---|---|
| Title | The R-Acyclic Semiunification Problem | |||
| Authors | Brad Lushman, Gordon V. Cormack | |||
| Abstract | We recast Kfoury and Wells' formulation of the acyclic semiunification problem (ASUP) in graph- theoretic terms and prove equivalence between the two formulations. We then relax and simplify the graph-theoretic formulation; we call the resulting problem the R-acyclic semiunification problem (R-ASUP), which we show to be a strict superset of ASUP. We prove that the ASUP solution procedure terminates and produces most general solutions for R-ASUP (and hence for ASUP) in the same sense as Robinson's unification algorithm. We thus extend the class of semiunification instances known to be decidable. | |||
| Date | March 13, 2006 | |||
| Report | The R-Acyclic Semiunification Problem (PDF) | |||
| CS-2006-07 | ||||
|---|---|---|---|---|
| Title | FIX: Feature-based Indexing Technique for XML Documents | |||
| Authors | Ning Zhang, M. Tamer Ozsu, Ihab F. Ilyas, and Ashraf Aboulnaga | |||
| Abstract | In this paper, we study the problem of indexing an XML database. Existing XML indexing techniques focus on clustering methods based on the combinatorial structural properties of an XML document. These techniques cluster tree nodes into an index tree or graph based on their similarities in ancestor-descendant or sibling relationships. Index look-up then amounts to pattern matching on the clustered tree or graph. In this paper, we propose a feature-based indexing technique, called FIX, based on the spectral graph theory. The basic idea is that for each twig pattern in a collection of XML documents, we calculate a vector of features based on its structural properties. These features are used as a key for the patterns and stored in a B-tree or a multidimensional index tree. Given an XPath query, its feature vector is first calculated and looked up in the index. Then a further refinement phase is performed to fetch the final results. We experimentally study the indexing technique over two scenarios: a large collection of relatively smaller documents, and a single large document. Our experiments show that FIX provides great pruning power and could gain an order of magnitude performance improvement for many XPath queries over existing evaluation techniques. | |||
| Date | March 14, 2006 | |||
| Report | FIX: Feature-based Indexing Technique for XML Documents (PDF) | |||
| CS-2006-08 | ||||
|---|---|---|---|---|
| Title | A
      <I>More</I>
      Direct
      Algorithm
      for
      Type
      Inference
      in
      the
      Rank-2
      Fragment of the Second-Order Lambda-Calculus | |||
| Authors | Brad Lushman, Gordon V. Cormack | |||
| Abstract | We present an algorithm for rank-2 type inference in the second-order $\lambda$-calculus. Our algorithm differs from the well-known algorithm of Kfoury and Wells in that it employs only a quadratically fewer type variables and inequalities. Our algorithm consists of a translation from a $\lambda$-term to an instance of $R$-ASUP (a decidable superset of ASUP) in which the variables correspond more directly to features in the original term. We claim that our construction, being simpler and more direct, is more amenable to proof and extension. | |||
| Date | March 29, 2006/Revised July 27, 2007 | |||
| Report | A <I>More</I> Direct Algorithm for Type Inference in the Rank-2 Fragment (PDF) | |||
| CS-2006-09 | ||||
|---|---|---|---|---|
| Title | Robust Numerical Valuation of European and American Options under the CGMY Process | |||
| Authors | Iris R. Wang, Justin W.L. Wan and Peter A. Forsyth. | |||
| Abstract | We develop an implicit discretization method for pricing European and American options when the underlying asset is driven by an infinite activity Levy process. For processes of finite variation, quadratic convergence is obtained as the mesh and time step are refined. For infinite variation processes, better than first order accuracy is achieved. The jump component in the neighbourhood of log jump size zero is specially treated by using a Taylor expansion approximation and the advection term is dealt with using a semi-Lagrangian scheme. The resulting Partial Integro-Differential Equation (PIDE) is then solved using preconditioned BiCGSTAB method coupled with a fast Fourier transform. Proofs of fully implicit timestepping stability and monotonicity are provided. The convergence properties of the BiCGSTAB scheme are discussed and compared with a fixed point iteration. Numerical tests on convergence and performance of European and American options under processes of finite and infinite variation are presented. | |||
| Date | April 7, 2006 | |||
| Report | Robust Numerical Valuation of European and American Options under the CGMY Process (PDF) | |||
| CS-2006-10 | ||||
|---|---|---|---|---|
| Title | Succinct Encodings for XPath Location Steps | |||
| Authors | JC)rC)my Barbay and S. Srinivasa Rao | |||
| Abstract | We consider in this paper the problem of encoding XML documents in small space while still supporting XPath Location steps efficiently. We model XML documents as multi-labeled trees, and propose for those an encoding which takes space close to the lower bound suggested by information theory, while still supporting the search for the ancestors, descendants and children matching a given label efficiently. To achieve this goal, we extend the previous results from Golynski et al. on strings over large alphabets, and from Barbay et al. on binary relations, to support the rank and select operators in several orders at once; and we extend the results from Geary et al. on ordinal trees to support the DFUDS order, in addition to the other operators. | |||
| Report | Succinct Encodings for XPath Location Steps (PDF) | |||
| CS-2006-11 | ||||
|---|---|---|---|---|
| Title | Adaptive Search Algorithm for Patterns, in Succinctly Encoded XML | |||
| Authors | JC)rC)my Barbay | |||
| Abstract | We propose an adaptive algorithm for context queries (queries expressed as preorder and ancestor-descendant relations on labeled nodes), which can be used to find patterns in XML documents. Our algorithm takes advantage of the correlation between terms of the query without any preprocessed information, and it runs in time (kd(lg lg min(n,s)+lg lg(r))) in the RAM model, where k is the number of terms in the query, d is the non-deterministic complexity of the query on the multi-labeled tree (i.e. the minimum number of operations required to check the answer to the query), n is the number of nodes in the tree, s is the number of relations between nodes and labels, and r is the maximal number of nodes matching a label on any rooted path in the tree. | |||
| Date | April 21, 2006 | |||
| Report | Adaptive Search Algorithm for Patterns, in Succinctly Encoded XML (PDF) | |||
| CS-2006-12 | ||||
|---|---|---|---|---|
| Title | Age and Geographic Inferences of the LiveJournal Social Network | |||
| Authors | Ian McKinnon, Rob Warren | |||
| Abstract | Online social networks are often a by-product of blogging and other online media sites on the Internet. Services such as LiveJournal allow their users to specify who their "friends" are, and thus a social network is formed. This paper will explore the relationship between users with the intent of being able to make a prediction of a users age and country of residence based on the information given by their friends on this social network. | |||
| Date | June, 2006 | |||
| Report | Age and Geographic Inferences of the LiveJournal Social Network (PDF) | |||
| CS-2006-13 | ||||
|---|---|---|---|---|
| Title | Predicting Missing Attribute Values based on Frequent Itemset and RSFit | |||
| Authors | Jiye Li and Nick Cercone | |||
| Abstract | How to process missing attribute values is an important data preprocessing problem in data mining and knowledge discovery tasks. A commonly-used and naive solution to process data with missing attribute values is to ignore the instances which contain missing attribute values. This method may neglect important information within the data and a significant amount of data could be easily discarded. Some methods, such as assigning the most common values or assigning an average value to the missing attribute, make good use of all the available data. However the assigned value may not come from the information which the data originally derived from, thus noise is brought to the data. We introduce an integrated approach ItemRSFit to effectively predict missing attribute values by combining frequent itemset and RSFit approaches together. Frequent itemset is generated from the association rules algorithm and it displays the correlations between different items in a transaction data set. Using frequent itemset as a knowledge base to predict missing attribute values is shown to have a high prediction accuracy. However this approach alone cannot predict all the existing missing attributes. RSFit is a newly developed approach to predict missing attribute values based on the similarities of attribute-value pairs by only considering attributes contained in the core or the reduct of the data set. The RSFit approach provides a faster prediction and can be used for the cases that are not covered by the itemset approach. Empirical studies on UCI data sets and a real world data set demonstrate a significant increase of predicting accuracy obtained from this new integrated approach. | |||
| Date | April 25, 2006 | |||
| Report | Predicting Missing Attribute Values based on Frequent Itemset and RSFit (PDF) | |||
| CS-2006-14 | ||||
|---|---|---|---|---|
| Title | Seeking Empirical Evidence for Self-Organized Criticality in Open Source Software Evolution | |||
| Authors | Jingwei Wu, Richard C. Holt | |||
| Abstract | We examine 11 open source software systems and present empirical evidence for the existence of fractal structures in software evolution. In our study, fractal structures are measured as power laws through the lifetime of a software system. We describe two specific power law related phenomena: the probability distribution of software changes decreases as a power function of change sizes; and the time series of software change exhibits long range correlations with power law behavior. The existence of such spatial (across the system) and temporal (over the system lifetime) power laws suggests that Self-Organized Criticality (SOC) occurs in the evolution of open source systems. As a consequence, SOC may be useful as a conceptual framework for understanding software evolution dynamics (the cause and mechanism of change or growth). We give a qualitative explanation of software evolution based on SOC. We also discuss some implications of SOC to current software practices. | |||
| Date | April 26, 2006 | |||
| Report | Seeking Empirical Evidence for Self-Organized Criticality in Open Source Software Evolution (PDF) | |||
| CS-2006-15 | ||||
|---|---|---|---|---|
| Title | Collaborative and Coordinated Product Configuration | |||
| Authors | Marcilio Mendonca, Toacy Oliveira, Donald Cowan | |||
| Abstract | Product configuration is a key activity of product engineering that regards the constrained combination and parameterization of product line assets as a means to achieve correct software specification. Current product configuration approaches frequently rely on the application engineer to translate user requirements into correct configuration choices. This process is error-prone and risky as requirements may lead to conflicting decisions at configuration time. Indeed, we deem that an important aspect of product configuration has long been neglected: its collaborative nature. In our research, we advocate that product configuration is enhanced by a collaborative perspective, providing that conflicting scenarios are properly handled. We propose an approach to support collaborative and co-ordinated product configuration by promoting processes to first-order elements for the explicit guidance of configuration decisions. We provide insights on important coordination issues and introduce an algorithm to derive process models from annotated feature models to illustrate the approach's feasibility. Keywords: Product Configuration, Software Processes, Software Product Lines, Collaborative Software Configuration. | |||
| Date | May 17, 2006 | |||
| Report | Collaborative and Coordinated Product Configuration (PDF) | |||
| CS-2006-16 | ||||
|---|---|---|---|---|
| Title | Geometrical Mesh Improvement Properties of Delaunay Terminal Edge Refinement | |||
| Authors | Bruce
      Simpson,
      David
      Cheriton
      School
      of
      Computer
      Science,
      University
      of
      Waterloo,
      Waterloo,
      Ontario,
      Canada,
      N2L
      3G1 and Maria-Cecilia Rivars, Department of Computer Science, Universidad de Chile, Blanco Encalada 2120, Santiago, Chile | |||
| Abstract | "The use of edge based refinement in general, and Delaunay terminal edge refinement in particular are well established for planar meshing, but largely on a heuristic basis. In this paper, we present a series of theoretical results on the geometric mesh improvement properties of these methods. The discussion is based on refining a mesh to meet a specified angle tolerance." | |||
| Date | July 25, 2006 | |||
| Report | Geometrical Mesh Improvement Properties of Delaunay Terminal Edge Refinement (PDF) | |||
| CS-2006-17 | ||||
|---|---|---|---|---|
| Title | A Program Extractor Suite for C and C++: Choosing the Right Tool for the Job | |||
| Authors | Jingwei Wu, Richard C. Holt | |||
| Abstract | This report describes a suite of program extractors, which we developed by adopting and extending existing program parsing and extraction techniques or tools. This suite is called CX because it is mainly targeted at extracting facts from C and C++ programs. This suite is currently composed of four extractors: CPPX, BFX, LDX and CTSX. The main goal of creating CX is to provide a convenient set of program extractors, which complement each other and work in a systematic manner. The benefits of this extractor suite will be discussed in terms of two practical applications: (1) creating program comprehension pipelines to support various understanding tasks, and (2) building an open source software evolution database (EvolDB) to support empirical research on software evolution. | |||
| Date | June 3, 2006 | |||
| Report | A Program Extractor Suite for C and C++: Choosing the Right Tool for the Job (PDF) | |||
| CS-2006-18 | ||||
|---|---|---|---|---|
| Title | A Survey of Data Management in Peer-to-Peer Systems | |||
| Authors | Rolando Blanco, Nabeel Ahmed, David Hadaller, L. G. Alex Sung, Herman Li, and Mohamed Ali Soliman | |||
| Abstract | Peer-to-Peer (P2P) systems provide a decentralized infrastructure for resource sharing. In particular, file sharing was the initial motivation behind many of the first successful P2P systems. As P2P systems evolve to support sharing of structured and semantically rich data, several data management issues must be addressed. Specifically, P2P systems need to deal with data location, data integration, data querying, and data consistency issues. Data management is further complicated in P2P systems due to the volatility and scale of these systems. In this survey we propose a reference architecture for P2P systems that focuses on the data management activities required to support sharing of structured data. Based on this reference architecture, we present and classify technologies that have been proposed to solve the data management issues particular to P2P systems. | |||
| Date | June 20 , 2006 | |||
| Report | A Survey of Data Management in Peer-to-Peer Systems (PDF) | |||
| CS-2006-19 | ||||
|---|---|---|---|---|
| Title | Incremental Maintenance of Global Aggregates | |||
| Authors | Nabeel Ahmed, David Hadaller, and Srinivasan Keshav | |||
| Abstract | Providing local access to global information can improve the efficiency of many distributed applications. Examples include applications that aggregate sensor values, search in Peer-to-Peer systems, or perform Top-K queries in stream-oriented databases. Efficient computation of such aggregates is difficult due to the massive scale and dynamics of such systems and has led to the proposal of several approximate techniques based on randomized gossip algorithms. However, maintenance of such aggregates has not been adequately addressed in the literature. Changes in node state, therefore, require a full and expensive re-computation of the global aggregate. This paper makes three contributions to this field. First, we propose a variant of the well-known FM aggregation scheme that allows us to support incremental maintenance of aggregates. Second, we propose the concept of significance thresholds and illustrate their benefits. Finally, we present a detailed performance evaluation of our techniques and find that we can reduce computation time by 60% compared to recomputing the aggregate, as is traditionally done. | |||
| Date | June 21, 2006 | |||
| Report | Incremental Maintenance of Global Aggregates (PDF) | |||
| CS-2006-20 | ||||
|---|---|---|---|---|
| Title | "On Pattern Expression Languages" | |||
| Authors | Cezar Campeanu and Nicolae Santean | |||
| Abstract | In
        this
        paper
        we
        show
        that
        the
        family
        of
        pattern
        expression
        languages
        is
        closed
        under
        the
        intersection
        with
        regular
        languages.
        Since
        this
        family
        is
        not
        closed
        under
        complement
        but
        is
        closed
        under
        reverse,
        a
        natural
        question
        arises,
        that
        is,
        whether
        particular
        languages
        such
        as
        those
        containing
        words
        of
        type
        $ww^R$
        are
        pattern
        expression
        languages
        or
        not.
        We
        give
        a
        proof
        for
        a
        negative
        answer
        to
        this
        question,
        and
        we
        provide
        several
        examples
        of
        languages
        which
        can
        not
        be
        specified
        by
        pattern
        expressions. Keywords: pattern, pattern automata, pattern expressions, regex, regular expressions | |||
| Date | December 11, 2006 | |||
| Report | "On Pattern Expression Languages" (PDF) | |||
| CS-2006-21 | ||||
|---|---|---|---|---|
| Title | A Robust Overlay for Distributed Range Query Processing | |||
| Authors | AndrC)
      Allavena
      Qiang
      Wang
      Ihab
      Ilyas
      Srinivasan
      Keshav University of Waterloo {aallavena, q6wang, ilyas, keshav}@uwaterloo.ca | |||
| Abstract | Large-scale data-centric services are often handled by clusters of computers that include hundreds of thousands of computing nodes. However, traditional distributed query processing techniques fail to handle the large-scale distribution, peer-to-peer communication and frequent disconnection. In this paper, we introduce LOT, a robust, fault-tolerant and highly distributed overlay network for large-scale peer-to-peer query processing. LOT is based on a robust tree overlay for distributed systems. It uses virtualization, replication, geographic-based clustering and flexible state definition as basic design principles. We show how we map these principles to desirable performance goals. Moreover, we provide a light-weight maintenance mechanism for updating state information. Analysis and simulations show that our approach is superior to other well-known alternatives in its query processing performance and handling of churn. | |||
| Date | July 14, 2006 | |||
| Report | A Robust Overlay for Distributed Range Query Processing (PDF) | |||
| CS-2006-22 | ||||
|---|---|---|---|---|
| Title | Fast, Efficient and Robust In-Network Computation | |||
| Authors | AndrC)
      Allavena
      and
      Srinivasan
      Keshav University of Waterloo {aallaven,keshav}@uwaterloo.ca} | |||
| Abstract | Today,
        companies
        such
        as
        eBay,
        Amazon,
        Google,
        and
        IBM
        routinely
        operate
        clusters
        with
        more
        than
        10,000
        servers
        located
        in
        data
        centres
        around
        the
        world.
        Developing
        applications
        that
        efficiently
        use
        the
        resources
        of
        such
        a
        distributed
        cluster
        is
        challenging.
        This
        task
        is
        made
        eased
        by
        a
        middleware
        layer
        that
        provides
        application
        programmers
        with
        the
        illusion
        of
        dynamically
        updated
        "global
        state".
        Maintaining
        global
        state
        can
        be
        viewed
        as
        repeatedly
        computing
        an
        arbitrary
        function
        over
        a
        set
        of
        time-varying
        values,
        where
        the
        values
        are
        held
        at
        each
        node,
        and
        every
        node
        needs
        to
        know
        the
        resultant
        function.
        Many
        well-known
        problems
        in
        distributed
        systems,
        such
        as
        load
        balancing,
        leader
        election,
        barrier
        synchronisation,
        distributed
        file
        system
        maintenance,
        and
        co-ordinated
        intrusion
        detection
        can
        be
        cast
        in
        this
        form. We present and rigorously analyse an algorithm that uses a a tree of virtual nodes to compute nearly arbitrary functions of global state. Our scheme is fast: running time is $\Theta(ln n)$. It is efficient: nodes exchange O(n ln n) messages. Most of these messages are within a data centre and therefore are relatively cheap. It is accurate: the computed value does not have inherent errors either due to double-counting, as with standard gossip, or due to stochastic counting, as in the Flajolet-Martin approach. Finally, it is fault tolerant: The algorithm fails with probability O( 1 / n^{c(1+r)} )$ where c is a constant and 0 <= r \leq (\ln n)^Cst a user-chosen reliability parameter. We therefore believe that our work is the basis for building robust and efficient distributed systems in a variety of problem domains. | |||
| Date | July 14, 2006 | |||
| Report | Fast, Efficient and Robust In-Network Computation (PDF) | |||
| CS-2006-23 | ||||
|---|---|---|---|---|
| Title | Weighted Queries on Binary Relations and Multi-Labeled Trees | |||
| Authors | Jeremy Barbay, Aleh Veraskouski | |||
| Abstract | The common way to search large indexed databases is through conjunctive queries on binary relations, such as those associating keywords to web page references. We extend this model by adding positive weights to the terms of the query. In this context we give and analyze two algorithms to answer such queries in a pertinent way. We extend this approach to solve weighted queries on tree-structured objects such as file-systems or XML documents. | |||
| Date | July 27, 2006 | |||
| Report | Weighted Queries on Binary Relations and Multi-Labeled Trees (PDF) | |||
| CS-2006-24 | ||||
|---|---|---|---|---|
| Title | Succinct Indexes for Strings, Binary Relations and Multi-labeled Trees | |||
| Authors | Jeremy Barbay, Meng He, J. Ian Munro, and S. Srinivasa Rao | |||
| Abstract | We
        define
        and
        design
        succinct
        indexes
        for
        several
        abstract
        data
        types
        (ADTs).
        The
        concept
        is
        to
        design
        auxiliary
        data
        structures
        called
        succinct
        indexes
        that
        occupy
        asymptotically
        less
        space
        than
        the
        information-theoretic
        lower
        bound
        on
        the
        space
        required
        to
        encode
        the
        given
        data,
        and
        support
        an
        extended
        set
        of
        operations
        using
        the
        basic
        operators
        defined
        in
        the
        ADT. As opposed to succinct encodings, The main advantage of succinct indexes is that we make assumptions only on the ADT through which the main data is accessed, rather than the way in which the data is encoded. This allows more freedom in the encoding of the main data.In this paper, we present succinct indexes for various data types, namely strings, binary relations and multi-labeled trees. Given the support for the interface of the ADTs of these data types, we can support various useful operations efficiently by constructing succinct indexes for them. When the operators in the ADTs are supported in constant time, our results are comparable to previous results, while allowing more flexibility in the encoding of the given data. Using our techniques, we design the first succinct encoding that represents a string of length $n$ over alphabet $[\sigma]$ using $nH_k+o(n\lg \sigma)$ bits \footnote{We use $\lg n$ to denote $\lceil \log_2 n \rceil$.} that support access / rank / select operations in $o((\lg\lg \sigma)3)$ time. We also design the first succinct text index using $nH_k+o(n\lg\sigma)$ bits that supports pattern matching queries in $O(m\lg\lg\sigma + occ\lg^{1+\epsilon}n\lg\lg\sigma)$ time, for a given pattern of length $m$. Previous results on these two problems either have a $\lg\sigma$ factor instead of $\lg\lg\sigma$ in terms of running time, or are not compressible, but our results do not have such problems. | |||
| Date | July 27, 2006 | |||
| Report | Succinct Indexes for Strings, Binary Relations and Multi-labeled Trees (PDF) | |||
| CS-2006-25 | ||||
|---|---|---|---|---|
| Title | Top-k Query Processing in Uncertain Databases | |||
| Authors | Mohamed A. Soliman, Ihab F. Ilyas, Kevin Chen-Chuan Chang | |||
| Abstract | Top-k processing in uncertain databases is semantically and computationally different from traditional top-k processing. The interplay between score and uncertainty information makes traditional processing techniques inapplicable to uncertain databases. In this paper we introduce new probabilistic formulations for top-k queries. Our formulations are based on marriage of traditional top-k semantics with possible worlds semantics. In the light of these formulations, we construct a framework that encapsulates a novel probabilistic model and efficient query processing techniques to tackle the challenges raised by uncertain data settings. We prove that our techniques are optimal in terms of the number of accessed tuples. Our experiments show the efficiency of our techniques under different data distributions with orders of magnitude improvement over naive materialization of possible worlds. | |||
| Date | August 2, 2006 /Updated August 14, 2007 | |||
| Report | Top-k Query Processing in Uncertain Databases (PDF) | |||
| CS-2006-26 | ||||
|---|---|---|---|---|
| Title | Multi-Query Optimization of Sliding Window Aggregates by Schedule Synchronization | |||
| Authors | Lukasz Golab, Kumar Gaurav Bijay, and M. Tamer Ozsu | |||
| Abstract | Data
        stream
        systems
        process
        persistent
        queries,
        typically
        posed
        over
        sliding
        windows
        and
        re-evaluated
        periodically
        as
        the
        windows
        slide
        forward.
        Due
        to
        their
        long-running
        nature,
        a
        number
        of
        similar
        persistent
        queries
        may
        run
        in
        parallel
        at
        any
        given
        time,
        therefore
        multi-query
        optimization
        is
        particularly
        important.
        In
        traditional
        multi-query
        optimization,
        one
        of
        the
        primary
        goals
        is
        to
        detect
        common
        parts
        across
        multiple
        queries
        issued
        at
        the
        same
        time
        and
        perform
        the
        common
        task
        only
        once.
        Another
        related
        goal
        is
        to
        check
        if
        a
        new
        query
        may
        be
        answered
        using
        the
        answer
        of
        a
        related
        query
        which
        has
        been
        computed
        previously
        (and
        is
        stored
        as
        a
        materialized
        view).
        In
        this
        paper,
        we
        argue
        that
        in
        the
        context
        of
        periodically
        re-evaluated
        queries,
        multi-query
        optimization
        requires
        an
        additional
        step
        beyond
        common
        sub-expression
        matching.
        This
        is
        because
        queries
        that
        have
        been
        identified
        as
        similar
        may
        be
        re-evaluated
        with
        different
        frequencies
        and
        therefore
        may
        be
        scheduled
        at
        different
        times.
        Thus,
        the
        additional
        step
        must
        attempt
        to
        synchronize
        the
        re-execution
        times
        of
        similar
        queries
        in
        order
        to
        take
        advantage
        of
        computation
        sharing. The solution presented in this paper focuses on periodically-evaluated aggregates over sliding windows of various lengths, which are a common class of persistent queries used for monitoring purposes. The proposed solution assumes that users specify an upper bound on the interval between re-evaluations of their queries and is based upon the following insight: it may be cheaper to re-execute some queries more often if their re-execution schedules can be synchronized with those of similar queries, thereby amortizing the computation costs. We also show that additional schedule synchronization is possible when the system is forced to lengthen the desired re-execution intervals during periods of overload. Our solutions are based upon a variant of the earliest-deadline-first algorithm that views persistent queries are periodic tasks. Experimental results show significant improvements in system throughput due to increased resource sharing. | |||
| Date | August, 2006 | |||
| Report | Multi-Query Optimization of Sliding Window Aggregates by Schedule Synchronization (PDF) | |||
| CS-2006-27 | ||||
|---|---|---|---|---|
| Title | Sliding Window Query Processing over Data Streams | |||
| Authors | Lukasz Golab | |||
| Abstract | Database
        management
        systems
        (DBMSs)
        have
        been
        used
        successfully
        in
        traditional
        business
        applications
        that
        require
        persistent
        data
        storage
        and
        an
        efficient
        querying
        mechanism.
        Typically,
        it
        is
        assumed
        that
        the
        data
        are
        static,
        unless
        explicitly
        modified
        or
        deleted
        by
        a
        user
        or
        application.
        Database
        queries
        are
        executed
        when
        issued
        and
        their
        answers
        reflect
        the
        current
        state
        of
        the
        data.
        However,
        emerging
        applications,
        such
        as
        sensor
        networks,
        real-time
        Internet
        traffic
        analysis,
        and
        on-line
        financial
        trading,
        require
        support
        for
        processing
        of
        unbounded
        data
        streams.
        The
        fundamental
        assumption
        of
        a
        data
        stream
        management
        system
        (DSMS)
        is
        that
        new
        data
        are
        generated
        continually,
        making
        it
        infeasible
        to
        store
        a
        stream
        in
        its
        entirety.
        At
        best,
        a
        sliding
        window
        of
        recently
        arrived
        data
        may
        be
        maintained,
        meaning
        that
        old
        data
        must
        be
        removed
        as
        time
        goes
        on.
        Furthermore,
        as
        the
        contents
        of
        the
        sliding
        windows
        evolve
        over
        time,
        it
        makes
        sense
        for
        users
        to
        ask
        a
        query
        once
        and
        receive
        updated
        answers
        over
        time. This dissertation begins with the observation that the two fundamental requirements of a DSMS are dealing with transient (time-evolving) rather than static data and answering persistent rather than transient queries. One implication of the first requirement is that data maintenance costs have a significant effect on the performance of a DSMS. Additionally, traditional query processing algorithms must be re-engineered for the sliding window model because queries may need to re-process expired data and "undo" previously generated results. The second requirement suggests that a DSMS may execute a large number of persistent queries at the same time, therefore there exist opportunities for resource sharing among similar queries. The purpose of this dissertation is to develop solutions for efficient query processing over sliding windows by focusing on these two fundamental properties. In terms of the transient nature of streaming data, this dissertation is based upon the following insight. Although the data keep changing over time as the windows slide forward, the changes are not random; on the contrary, the inputs and outputs of a DSMS exhibit patterns in the way the data are inserted and deleted. It will be shown that the knowledge of these patterns leads to an understanding of the semantics of persistent queries, lower window maintenance costs, as well as novel query processing, query optimization, and concurrency control strategies. In the context of the persistent nature of DSMS queries, the insight behind the proposed solution is that various queries may need to be refreshed at different times, therefore synchronizing the refresh schedules of similar queries creates more opportunities for resource sharing. | |||
| Date | August, 2006 | |||
| Report | Sliding Window Query Processing over Data Streams (PDF) | |||
| CS-2006-28 | ||||
|---|---|---|---|---|
| Title | "Deterministic Simulation of a NFA with k-symbol Lookahead" | |||
| Authors | Bala Ravikumar and Nicolae Santean | |||
| Abstract | We investigate deterministically simulating (i.e., solving the membership problem for) nondeterministic finite automata (NFA), relying solely on the NFA's resources (states and transitions). Unlike the standard NFA simulation, involving an algorithm which stores at each step all the states reached nondeterministically while reading the input, we consider deterministic finite automata (DFA) with lookahead, which choose the "right" NFA transitions based on a fixed number of input symbols read ahead. This concept, known as lookahead delegation, arose in a formal study of web services composition and its subsequent practical applications. Here we answer several related questions, such as "when is lookahead delegation possible?" and "how hard is it to find a delegator with a given lookahead buffer size?". In particular, we show that only finite languages have the property that all of their NFA's have delegators. This implies, among others, that delegation is a machine property, rather than a language property. We also prove that the existence of lookahead delegators for unambiguous NFA is decidable, thus partially solving an open problem. Finally, we show that finding delegators (even for a given buffer size) is hard in general, and is efficient for unambiguous NFA, and we give an algorithm and a compact characterization for NFA delegation in general. | |||
| Date | August, 2006 | |||
| Report | "Deterministic Simulation of a NFA with k-symbol Lookahead" (PDF) | |||
| CS-2006-29 | ||||
|---|---|---|---|---|
| Title | Query Processing and Optimization in Native XML Databases | |||
| Author | Ning Zhang | |||
| Abstract | XML has emerged as a semantic markup language for documents as well as the de facto language for data exchange over the World Wide Web. Declarative query languages, such as XPath and XQuery, are proposed for querying over large volumes of XML data. A number of techniques have been proposed to evaluate XML queries more efficiently. Many of these techniques assume a tree model of XML documents and are, therefore, also applicable to other data sources that can be explicitly or implicitly translated into a similar data model. The focus of this thesis is on efficient evaluation and optimization of path expressions in native XML databases. Specifically, the following issues are considered: storage system design, design of physical operators and efficient execution algorithms, and the cost-based query optimizer. The proposed storage system linearizes the tree structure into strings that can be decomposed into disk pages. Simple statistics are kept in the page headers to facilitate I/O-efficient navigation. Based on this storage system, a hybrid approach is developed to evaluate path expressions that exploit the advantages of navigational and join-based approaches. Path expressions that only contain "local" axes (child, parent, attribute, self, following-sibling, and preceding-sibling) are evaluated by means of the proposed "Next-of-Kin" (NoK) operator. A general path expression that contains both local axes and global ones (ancestor, descendant, ancestor-or-self, descendant-or-self, following, and preceding) is decomposed into NoK subtrees whose intermediate results are structurally joined to produce the final result. Experiments show that the navigational operator can be an order of magnitude faster than join-based approaches in some cases, but slower in others. Thus a cost-based query optimizer is necessary to choose the optimal execution plan based on estimates of the cost of each operator. The cost of an operator heavily depends on the cost model and its input. The inputs to the cost model are usually the cardinalities of path expressions. In this thesis, a synopsis structure called XSEED is proposed to estimate the cardinality of a path expression. An XSEED synopsis can be constructed by compressing an XML document to a small kernel first, and then more information can be added to the synopsis. XSEED results in more accurate cardinality estimation than previous approaches and is easier to construct, easier to update, and can incorporate query feedback. Efficient query execution exploits indexes. The last component of the thesis is a feature-based structural index, called FIX, to expedite tree matching on a large tree. This is based on the observation that the navigational operation is expensive and applying it to every tree node is very inefficient. FIX extracts distinctive features from subtrees and uses them as the index keys. Similarly, for each incoming query, the features of the query tree are extracted and used as search keys to retrieve the candidate results. Thus, it is sufficient to match only on these candidate results. The experimental results show that the pruning power of FIX is very high -- more than 99% for structure-rich data sets, and more than 20% for data sets with less structural variety. | |||
| Date | August, 2006 | |||
| Report | Query Processing and Optimization in Native XML Databases (PDF) | |||
| CS-2006-30 | ||||
|---|---|---|---|---|
| Title | "On the definition of stochastic lambda-transducers" | |||
| Authors | Stavros Konstantinidis and Nicolae Santean | |||
| Abstract | We
        propose
        a
        formal
        definition
        for
        the
        general
        notion
        of
        stochastic
        transducer,
        called
        stochastic
        lambda-transducer.
        Our
        definition
        is
        designed
        with
        two
        objectives
        in
        mind:
        (i)
        to
        extend
        naturally
        the
        established
        notion
        of
        stochastic
        automaton
        with
        output
        -
        as
        defined
        in
        the
        classic
        books
        of
        Paz
        (1971)
        and
        Starke
        (1972)
        -
        by
        permitting
        pairs
        of
        input-output
        words
        of
        different
        lengths;
        (ii)
        to
        be
        compatible
        with
        the
        more
        general
        notion
        of
        weighted
        transducer
        so
        that
        one
        can
        apply
        tools
        of
        weighted
        transducers
        to
        address
        certain
        computational
        problems
        involving
        stochastic
        transducers.
        The
        new
        transducers
        can
        be
        used
        to
        model
        stochastic
        input-output
        processes
        that
        cannot
        be
        modeled
        using
        classic
        stochastic
        automata
        with
        output. Keywords: Probabilistic transducer, Probabilistic automaton, Stochastic transducer, Stochastic automaton, Stochastic transduction, Weighted transducer, Transducer, Automaton | |||
| Date | September 5, 2006 | |||
| Report | "On the definition of stochastic lambda-transducers" (PDF) | |||
| CS-2006-31 | ||||
|---|---|---|---|---|
| Title | Generalized Labeled LCA Queries | |||
| Authors | Jeremy Barbay, Ehsan Chiniforooshan, Alexander Golynski, Jui-Yi Kao, Aleh Veraskouski | |||
| Abstract | Schema-free queries permit to search XML documents without knowing their schema. Among them, lowest common ancestor (LCA) queries were introduced in several variants on labeled trees. We define threshold LCA queries to generalize all those variants, and to extend them to the case where weights are assigned to each term of the query. We study how to solve those in the context where access to the document is streamed, and in the context where the document is accessed through a precomputed index. We propose space-efficient algorithms for both contexts, using space O(h+s) independant from the size of the document, where h is the height of the input document and s is the number of different labels. We describe two distinct algorithms in the streamed model, both of which read the input stream exactly once and run in linear time, and one algorithm in the indexed model, which provably runs in sublinear time for many instances, characterized by a difficulty measure. | |||
| Date | Revised July 2007 | |||
| Report | Generalized Labeled LCA Queries (PS) | Generalized Labeled LCA Queries (PDF) | ||
| CS-2006-32 | ||||
|---|---|---|---|---|
| Title | Graph Isomorphism Completeness for Perfect Graphs and Subclasses of Perfect Graphs | |||
| Authors | C. Boucher and D. Loker | |||
| Abstract | A problem is said to be GI-complete if it is provably as hard as graph isomorphism; that is, there is a polynomial-time Turing reduction from the graph isomorphism problem. It is known that the GI problem is GI-complete for some special graph classes including regular graphs, bipartite graphs, chordal graphs and split graphs. In this paper, we prove that deciding isomorphism of double split graphs, the class of graphs exhibiting a 2-join, and the class of graphs exhibiting a balanced skew partition are GI-complete. Further, we show that the GI problem for the larger class including these graph classes--that is, the class of perfect graphs--is also GI-complete. | |||
| Date | September 6, 2006 | |||
| Report | Graph Isomorphism Completeness for Perfect Graphs and Subclasses of Perfect Graphs (PDF) | |||
| CS-2006-33 | ||||
|---|---|---|---|---|
| Title | Expected Approximation guarantees for the Demand Matching Problem | |||
| Author | C. Boucher, D. Loker | |||
| Abstract | The objective of the demand matching problem is to obtain the subset $M$ of edges which is feasible and where the sum of the profits of each of the edges is maximized. The set $M$ is feasible if for each vertex $v$ the total demand of edges in $M$ incident to $v$ is at most $b_v$. In the case where each of the edges has one unit profit, the problem becomes finding the subset of largest size and hence, is called the cardinality problem. Shepherd and Vetta [SV06] demonstrate that the integrality gap for the general demand matching problem for bipartite graphs is between $2.5$ and $2.764$, and between $3$ and $3.264$ non-bipartite graphs. We demonstrate that an expected $2.5$-approximation guarantee and $3$-approximation guarantee is achieveable for bipartite graphs and non-bipartite graphs and give some connections to the independent set and weighted independent set problem. | |||
| Date | September 18, 2006 | |||
| Report | Expected Approximation guarantees for the Demand Matching Problem (PDF) | |||
| CS-2006-34 | ||||
|---|---|---|---|---|
| Title | Metro: An Analysis Toolkit for Template Semantics | |||
| Authors | Joanne
      M.
      Atlee,
      Nancy
      A.
      Day, Jianwei Niu, Eunsuk Kang, Yun Lu, David Fung, Leonard Wong | |||
| Abstract | We describe the Metro toolkit, which supports software modelling and analysis for requirements notations that have configurable semantics. Metro is based on a formalism, called template semantics, which structures the operational semantics of a family of notations as a predefined parameterized template that is instantiated with user-provided parameter values. Thus, the semantics of a single notation can be expressed succinctly as a set of parameter values to this template. The Metro toolkit takes as input a specification and a set of template-parameter values, and it produces an analyzable model. The toolkit can either translate the specification into the input language of an existing model checker (e.g., SMV), or compile the specification into a more primitive form (e.g., logic, BDDs) that is suitable for analysis. MagicDraw is used as a front-end for editing specifications and animating SMV-generated counterexamples. | |||
| Date | September 22 , 2006 | |||
| Report | Metro: An Analysis Toolkit for Template Semantics (PDF) | |||
| CS-2006-35 | ||||
|---|---|---|---|---|
| Title | Formal Verification of the A-7E Software Requirements using Template Semantics | |||
| Authors | Eunsuk Kang and Nancy A. Day | |||
| Abstract | Template semantics is a template-based approach to ease the process of identifying the essential differences among model-based notations. In this approach, a template captures semantics that are common among notations and allows users to specify only the distinctive features of a notation. In this paper, we illustrate the method of describing requirements in Software Cost Reduction (SCR) using the Metro toolkit, which is the framework for the modelling and analysis of notations in template semantics. Furthermore, we demonstrate the usage of Metro to verify the A-7E software requirements and compare our verification effort to an alternative method of requirements analysis, which does not use template semantics. | |||
| Date | September 22 , 2006 | |||
| Report | Formal Verification of the A-7E Software Requirements using Template Semantics (PDF) | |||
| CS-2006-36 | ||||
|---|---|---|---|---|
| Title | Improved Bounds for the Online Steiner Tree Problem in Graphs of Bounded Edge-Asymmetry | |||
| Author | Spyros Angelopoulos | |||
| Abstract | In this paper we consider the Online Steiner Tree problem in weighted directed graphs of bounded edge-asymmetry $\alpha$. The edge-asymmetry of a directed graph is defined as the maximum ratio of the cost (weight) of antiparallel edges in the graph. The problem has applications in multicast routing over a network with non-symmetric links. We improve the previously known upper and lower bounds on the competitive ratio of any deterministic algorithm due to Faloutsos et al. In particular, we show that a better analysis of a simple greedy algorithm yields a competitive ratio of $O(\min \{k, \frac{\alpha \log k} {\log \log \alpha}\})$, where $k$ denotes the number of terminals requested. On the negative side, we show a lower bound of $\Omega(\min \{k^{1-\epsilon}, \frac{\alpha \log k}{\log \log k} \})$ on the competitive ratio of every deterministic algorithm for the problem, for any arbitrarily small constant $\epsilon$. | |||
| Date | October 2, 2006 | |||
| Report | Improved Bounds for the Online Steiner Tree Problem in Graphs of Bounded Edge-Asymmetry (PDF) | |||
| CS-2006-37 | ||||
|---|---|---|---|---|
| Title | Optimal Superblock Instruction Scheduling for Multiple-Issue Processors using Constraint Programming | |||
| Authors | Abid M. Malik, Tyrel Russell, Michael Chase, and Peter van Beek | |||
| Abstract | Modern processors have multiple pipelined functional units and can issue more than one instruction per clock cycle. This puts great pressure on the instruction scheduling phase in a compiler to expose maximum instruction level parallelism. Instruction level parallelism (ILP) at the local level using basic blocks is limited. Compilers increase ILP by doing instruction scheduling at the global level using larger regions, which are created by combining basic blocks. Superblocks are one of the most commonly used scheduling regions for global instruction scheduling. Superblock scheduling is NP-complete, and is done sub-optimally in production compilers using heuristic approaches. In this work, we present a constraint programming approach to the superblock instruction scheduling problem that is both optimal and fast enough to be incorporated into production compilers. We experimentally evaluated our optimal scheduler on the SPEC2000 integer and floating point benchmarks. On this benchmark suite, the optimal scheduler was very robust and scaled to the largest superblocks. Depending on the architectural model, between 99.991% to 99.999% of all superblocks were solved to optimality. The scheduler was able to routinely solve the largest superblocks, including superblocks with up to 2600 instructions. This compares favourably to the recent best work by Shobaki and Wilken on optimal superblock scheduling using dynamic programming and enumeration. | |||
| Date | October 17, 2006 | |||
| Report | Optimal Superblock Instruction Scheduling for Multiple-Issue Processors using Constraint Programming (PDF) | |||
| CS-2006-38 | ||||
|---|---|---|---|---|
| Title | Study of the Meiotic Recombination Hotspot Diffusion Paradox | |||
| Authors | Jérémy Barbay | |||
| Abstract | Zoologists and evolutionists ponder about an apparent paradox in the current model of how sexual reproduction happens, basing their conclusions on extensive simulations. We show through a mathematical analysis that the results of those simulations can be predicted for a larger class of models, and we deduce from this analysis one of the key features of the model which yield the paradox. Based on this analysis, we define another mathematical model which solves this paradox, and we check our results through simulation. | |||
| Date | October 10, 2006 | |||
| Report | Study of the Meiotic Recombination Hotspot Diffusion Paradox (PS) | Study of the Meiotic Recombination Hotspot Diffusion Paradox (PDF) | ||
| CS-2006-39 | ||||
|---|---|---|---|---|
| Title | Caspian: A QoS-Aware Deployment Approach for Channel-based Component-based Applications | |||
| Authors | Abbas Heydarnoori | |||
| Abstract | With significant advances in software development technologies in recent years, it is now possible to have complex software applications that include a large number of heterogeneous software components distributed over a large network of computers with different computational capabilities. To run such applications, their components must be instantiated on proper hardware resources in their target environments so that requirements and constraints are met. This process is called software deployment. For large, distributed, component-based applications with many constraints and requirements, it is difficult to do the deployment process manually and automated tools and techniques are required. This report presents a graph-based approach for this purpose that is not dependent on any specific component technology and does the deployment planning with respect to the communication resources, i.e. channels, required by application components and communication resources available on the hosts in the target environment. In our approach, component-based applications and distributed environments are modeled with the help of graphs. Deployment of an application is then defined as the mapping of the application graph to the target environment graph. | |||
| Date | November 6, 2006 | |||
| Report | Caspian: A QoS-Aware Deployment Approach for Channel-based Component-based Applications (PDF) | |||
| CS-2006-40 | ||||
|---|---|---|---|---|
| Title | Unaligned Binary Codes for Index Compression in Schema-Independent Text Retrieval Systems | |||
| Authors | Stefan Buettcher and Charles L. A. Clarke | |||
| Abstract | We
        examine
        index
        compression
        techniques
        for
        schema-independent
        inverted
        files
        used
        in
        text
        retrieval
        systems.
        Schema-independent
        inverted
        files
        contain
        full
        positional
        information
        for
        all
        index
        terms
        and
        allow
        the
        structural
        unit
        of
        retrieval
        to
        be
        specified
        dynamically
        at
        query
        time,
        rather
        than
        statically
        during
        index
        construction.
        Schema-independent
        indices
        have
        different
        characteristics
        than
        document-oriented
        indices,
        and
        this
        difference
        can
        affect
        the
        effectiveness
        of
        index
        compression
        algorithms
        greatly. Our experimental results show that unaligned binary codes that take into account the special properties of schema-independent indices achieve better compression rates than methods designed for compressing document indices and that they can reduce the size of the index by around 15% compared to byte-aligned index compression. Moreover, we present a number of performance-enhancing techniques that may be used to very efficiently decode unaligned codes. Thus, their more compact index representation does not carry the cost of a substantially decreased query processing performance. This contradicts most earlier results on the decoding performance of unaligned codes. | |||
| Date | October 11, 2006 | |||
| Report | Unaligned Binary Codes for Index Compression in Schema-Independent Text Retrieval Systems (PDF) | |||
| CS-2006-41 | ||||
|---|---|---|---|---|
| Title | BSim: A System for Three-Dimensional Visualization of Brownian Motion | |||
| Authors | Tenn Francis Chen and Gladimir V. G. Baranoski | |||
| Abstract | Brownian motion is considered to be of fundamental importance for theoretical and applied scientific research. To date, the computational renderings of this phenomenon available in the scientific literature have been limited to two-dimensional case studies. This paper describes an interactive computer graphics system for the three-dimensional visualization of this phenomenon. The visual simulations are based on the formulas provided in Einstein's seminal paper on this topic, which are carefully revisited to introduce the phenomenon's underlying physics. The system's predictive capability, which is a key attribute for educational and scientific applications, is demonstrated by the qualitative agreement between simulation results and actual Brownian motion observations. | |||
| Date | November 22, 2006 (Revised April 29, 2009) | |||
| Report | BSim: A System for Three-Dimensional Visualization of Brownian Motion (PDF) | |||
| CS-2006-42 | ||||
|---|---|---|---|---|
| Title | Second-Tier Cache Management Using Write Hints | |||
| Authors | Xuhui Li, Ashraf Aboulnaga, Kenneth Salem, Aamer Sachedina, Shaobo Gao | |||
| Abstract | Storage servers, as well as storage clients, typically have large memories in which they cache data blocks. This creates a two-tier cache hierarchy in which the presence of a first-tier cache (at the storage client) makes it more difficult to manage the second-tier cache (at the storage server). Many techniques have been proposed for improving the management of second-tier caches, but none of these techniques use the information that is provided by {\em writes} of data blocks from the first tier to help manage the second-tier cache. In this paper, we illustrate how the information contained in writes from the first tier can be used to improve the performance of the second-tier cache. In particular, we argue that there are different reasons why storage clients write data blocks to storage servers (e.g., cleaning dirty blocks vs. limiting the time to recover from failure). These different types of writes can provide strong indications about the current state and future access patterns of a first-tier cache, which can help in managing the second-tier cache. We propose that storage clients inform the storage servers about the types of writes that they perform by passing {\em write hints}. These write hints can then be used by the server to manage the second-tier cache. We focus on the common and important case in which the storage client is a database system running a transactional (OLTP) workload. We describe, for this case, the different types of write hints that can be passed to the storage server, and we present several cache management policies that rely on these write hints. We demonstrate using trace driven simulations that these simple and inexpensive write hints can significantly improve the performance of the second-tier cache. | |||
| Date | January 18, 2007 | |||
| Report | Second-Tier Cache Management Using Write Hints (PDF) | |||
| CS-2006-43 | ||||
|---|---|---|---|---|
| Title | Semantic Prefetching of Correlated Query Sequences | |||
| Authors | Ivan T. Bowman, Kenneth Salem | |||
| Abstract | We present a system that optimizes sequences of related client requests by combining small requests into larger ones, thus reducing per-request overhead. The system predicts upcoming requests and their parameter values based on past observations, and prefetches results that are expected to be needed. We describe how the system makes its predictions and how it uses them to optimize the request stream. We also characterize the benefits with several experiments. | |||
| Date | November 21, 2006 | |||
| Report | Semantic Prefetching of Correlated Query Sequences (PDF) | |||
| CS-2006-44 | ||||
|---|---|---|---|---|
| Title | Correcting Sample Selection Bias by Unlabeled Data | |||
| Authors | Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, Bernhard Sholkopf | |||
| Abstract | We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by marching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice. | |||
| Date | January 18, 2007 | |||
| Report | Correcting Sample Selection Bias by Unlabeled Data (PDF) | |||
| CS-2006-45 | ||||
|---|---|---|---|---|
| Title | Skyline and Top-k Processing in Web Bargaining | |||
| Authors | Mohamed A. Soliman, Ihab F. Ilyas, Nick Koudas | |||
| Abstract | Skyline and top-k queries are gaining increasing importance in many emerging applications. Current skyline and top-k query processing techniques work on deterministic object attributes and known scores. However, in many practical scenarios these settings are inapplicable. In this paper we focus on web interaction scenarios where each interaction is a data object with a set of possible outcomes (scores). Obtaining exact interaction scores is expensive as it involves complex arbitration between interacting parties. Moreover, the outcome of each interaction might redefine other interactions scores. We demonstrate that the search space for solving such problems is very large. Based on this we formulate and present skyline and top-k processing algorithms that can efficiently reduce the search space. We present the results of a thorough experimental evaluation quantifying the relative performance of the algorithms we propose herein with respect to costly exact solutions. Our results indicate that our techniques can efficiently reduce the space and identify precise solutions. | |||
| Date | November 23, 2006 | |||
| Report | Skyline and Top-k Processing in Web Bargaining (PDF) | |||
| CS-2006-46 | ||||
|---|---|---|---|---|
| Title | List Update with Locality of Reference: MTF Outperforms All Other Algorithms | |||
| Authors | Spyros Angelopoulos, Reza Dorrigiv, and Alejandro Lopez-Ortiz | |||
| Abstract | It
        has
        been
        observed
        that
        in
        practice,
        typical
        request
        sequences
        for
        the
        list
        update
        problem
        exhibit
        a
        certain
        degree
        of
        locality
        of
        reference.
        We
        first
        extend
        the
        locality
        of
        reference
        model
        for
        the
        paging
        problem
        due
        to
        Albers
        et
        al~[STOC
        2002/JCSS
        2005]
        to
        the
        domain
        of
        list
        update;
        this
        addresses
        the
        open
        question
        of
        defining
        an
        appropriate
        model
        for
        capturing
        locality
        of
        reference
        in
        the
        context
        of
        list
        update
        [Hester
        and
        Hirschberg
        ACM
        Comp.
        Surv.
        1985].
        We
        then
        apply
        this
        model
        in
        conjunction
        with
        a
        recent
        technique
        for
        comparing
        online
        algorithms,
        namely
        bijective
        analysis~[SODA
        2007]
        and
        analyze
        well
        known
        online
        algorithms
        for
        list
        update. Using this framework, we prove that Move-to-Front (MTF) is the unique optimal algorithm for list update. This holds for both the standard cost function of Sleator and Tarjan~[C. ACM 1985] and the refined cost function proposed independently by Mart\'{\i}nez and Roura~[TCS 2000] and Munro~[ESA 2000]. Our work resolves an open conjecture of Mart\'{\i}nez and Roura, namely proposing a measure which can successfully separate MTF from all other algorithms. | |||
| Date | November 22, 2006 | |||
| Report | List Update with Locality of Reference: MTF Outperforms All Other Algorithms (PDF) | |||
| CS-2006-47 | ||||
|---|---|---|---|---|
| Title | MAXSM: A MultiHeuristic Approach to XML Schema Matching | |||
| Authors | Mirza Beg, Laurent Charlin, Joel So | |||
| Abstract | In this paper, we propose an automatic schema matching approach called MAXSM, designed specifically for matching schemas in the context of enterprise data integration. By focusing on enterprise data integration scenarios, it becomes viable to maintain a repository of established schema mappings and reuse that repository to enhance the accuracy of mapping candidates produced by MAXSM. As a consequence of focusing on the enterprise integration space, we further focus MAXSM on XML schema matching, specifically, we consider XML Schema Definition (XSD) schemas. MAXSM is a multi-heuristic schema matching approach which employs, amongst other heuristics, a novel heuristic using WordNet for determining natural-language semantic similarity between node labels. MAXSM defines and describes how transitive mappings can be discovered from known mappings and used to seed production of candidate mappings. We present a cogent tree spanning approach to traverse and search the two schemas more effectively. While this paper does not document an implementation of MAXSM, we do discuss a high-level implementation architecture as well as design of experiments to verify its efficacy. | |||
| Date | December 11, 2006 | |||
| Report | MAXSM: A MultiHeuristic Approach to XML Schema Matching (PDF) | |||
| CS-2006-48 | ||||
|---|---|---|---|---|
| Title | Succinct Data-Structures for Labeled Graphs | |||
| Authors | Jérémy Barbay | |||
| Abstract | Succinct
      data-structures
      support
      efficient
      search
      queries
      while using space asymptotically optimal. Such data-structures are known for structures such as binary strings, ordinal and cardinal trees, graphs, binary relations, labeled and multi-labeled trees. We consider graphs where each node is associated to an arbitrary number of labels. We show that some categories of such graphs have a succinct encoding which supports efficiently label-based navigation and search operators. | |||
| Date | December 11, 2006 | |||
| Report | Succinct Data-Structures for Labeled Graphs (PS) | Succinct Data-Structures for Labeled Graphs (PDF) | ||