AF: Small: Efficient Algorithms for Querying Noisy Distributed/Streaming Datasets


PI: Qin Zhang
NSF CCF-1525024 (June 2015 - May 2019)


Abstract

This project aims to study the design of efficient query algorithms for noisy datasets in distributed and streaming applications. Noisy data is universal in today's world. Imprecise and varying references to the same real-world entities are ubiquitous in scientific and commercial databases. This noise poses significant obstacles to accurate data analytics. As an example of "noisy data," consider YouTube videos. YouTube tracks the views of individual videos. However, there are often many similar videos of the same event, so answering a basic question such as "How many people viewed this event?" is challenging using current techniques. This project will provide new techniques and insights to combat the noisy nature of large datasets, and hence will enhance our ability to process the ever-increasing quantity of business and scientific data. The products of this project will be integrated into a trilogy of graduate and undergraduate courses on algorithms, databases, and data mining. The PI will disseminate research outcomes by giving talks at conferences/workshops, universities, and industrial labs, as well as through online media.

More technically, this project tries to answer the following question: can we run distributed and streaming algorithms directly on noisy datasets, resolve the noise "on the fly", and retain communication and space efficiency comparable to that of the noise-free setting? The PI plans to study statistical, relational, and graph problems. This project has the potential to impact a wide range of active research areas in theoretical computer science, including distributed and streaming algorithms, group testing, compressed sensing, communication complexity, clustering, and locality sensitive hashing.


Papers

  1. MinJoin: Efficient Edit Similarity Joins via Local Hash Minima
    with H. Zhang
    Proc. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 19) (oral presentation), pages 1093-1103. Anchorage, AK, U.S.A., August 2019.

  2. A Practical Algorithm for Distributed Clustering and Outlier Detection [poster]
    with J. Chen and E. Sadeqi-Azer
    Proc. Annual Conference on Neural Information Processing Systems (NeurIPS 18), pages 2253-2262. Montreal, Canada, December 2018.

  3. Smooth q-Gram, and Its Applications to Detection of Overlaps among Long, Error-Prone Sequencing Reads [conf. talk]
    H. Zhang, Q. Zhang and H. Tang   (by contribution)
    Proc. ACM International Conference on Information and Knowledge Management (CIKM 18), pages 267-276. Turin, Italy, October 2018.

  4. Distinct Sampling on Streaming Data with Near-Duplicates [conf. talk] [poster]
    with J. Chen
    Proc. ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 18), pages 369-382. Houston, TX, U.S.A., June 2018.

  5. Distributed Statistical Estimation of Matrix Products with Applications
    with D. P. Woodruff
    Proc. ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS 18), pages 383-394. Houston, TX, U.S.A., June 2018.

  6. EmbedJoin: Efficient Edit Similarity Joins via Embeddings [conf. talk]
    with H. Zhang
    Proc. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 17) (oral presentation), pages 585-594. Halifax, Nova Scotia, Canada, August 2017.

  7. Distributed Partial Clustering (preliminary full version, 19 pages) [conf. talk]
    with S. Guha and Y. Li
    Proc. ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 17), pages 143-152. Washington D.C., U.S.A., July 2017.
    Best Paper Award
    Invited to special issue for SPAA 2017 papers in ACM Transactions on Parallel Computing (TOPC)

  8. Bias-Aware Sketches [conf. talk] [poster]
    with J. Chen
    Proc. International Conference on Very Large Databases (VLDB 17), pages 961-972. Munich, Germany, August-September 2017.

  9. Communication-Optimal Distributed Clustering [talk] [poster]
    with J. Chen, H. Sun and D. Woodruff
    Proc. Annual Conference on Neural Information Processing Systems (NIPS 16), pages 3720-3728. Barcelona, Spain, December 2016.

  10. Edit Distance: Sketching, Streaming and Document Exchange (preliminary full version, 30 pages) [conf. talk]
    with D. Belazzougui
    Proc. IEEE Symposium on Foundations of Computer Science (FOCS 16), pages 51-60. New Brunswick, NJ, U.S.A., October 2016.

  11. Streaming Algorithms for Robust Distinct Elements [conf. talk] [poster]
    with D. Chen
    Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD 16), pages 1433-1447. San Francisco, CA, U.S.A., June 2016.

  12. On Sketching Quadratic Forms (preliminary full version, 47 pages)
    with A. Andoni, J. Chen, R. Krauthgamer, B. Qin and D. P. Woodruff
    Proc. Innovations in Theoretical Computer Science (ITCS 16), pages 311-319. Cambridge, MA, U.S.A., January 2016.

  13. Lower Bounds for Number-in-Hand Multiparty Communication Complexity, Made Easy
    with J. M. Phillips and E. Verbin
    SIAM Journal on Computing (SICOMP), volume 45, issue 1, pages 174-196, February 2016 [Link].

  14. Communication-Efficient Computation on Distributed Noisy Datasets [conf. talk]
    Q. Zhang
    Proc. ACM Symposium on Parallelism in Algorithms and Architectures (SPAA 15), pages 313-322. Portland, Oregon, U.S.A., June 2015.

Educational and Other Development

A Trilogy of Courses:

Code libraries for streaming/distributed algorithms:

Personnel


Acknowledgement

This project is supported by the National Science Foundation (NSF) under the project: AF: Small: Efficient Algorithms for Querying Noisy Distributed/Streaming Datasets. Any opinions, findings, and conclusions or recommendations expressed in this project are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.