Messages
- July. 5, 2018: Webpage created
- Sept. 5, 2018: Homework 1 is out. Due on Oct. 14, 2018, 11:59pm EST
- Sept. 5, 2018: Project 1 Report due on Oct. 21, 2018, 11:59pm EST
- Oct. 22, 2018: Homework 2 is out. Due on Nov. 18, 2018, 5pm EST
- Oct. 22, 2018: Project 2 Report due on Dec. 7, 2018, 11:59pm EST
Course summary
In this course we will talk about sublinear algorithms, which has its roots in the study of Big Data that occur more and more frequently in various applications, e.g., analyses of financial transactions, internet traffic, social networks, genome sequences, etc. Concretely, we will talk about:
1. Sublinear space algorithms. In particular, data stream algorithms, namely, algorithms that solve a problem by making one pass over the data set while using small memory. These algorithms are important in many application areas such as databases and networking, where data arrives at a high speed and there is no time and/or need to store it for offline processing.
2. Sublinear time algorithms, that is, algorithms that do not even read the whole input when outputting the answers.
3. Sublinear communication algorithms. The data is stored in multiple machines, who want to jointly compute functions defined on the union of the data sets via communication.
4. Random topics.
Participants are expected to have a good background in algorithm design and probability, and have good programming skills.
The evaluation will be based on homework assignments and individual project/presentation. The list of questions will be handed out in the middle of the course.
Detailed list of topics is available in the course plan below.
Lecturer
Qin Zhang
Email: qzhangcs@indiana.edu
Office hours: by appointment
Time and place
2:30pm - 3:45pm Monday/Wednesday
Luddy Hall 002.
Textbooks
-
There is no textbook for the class.
Lectures are based on this notes by Chakrabarti, and recent papers.
-
Background on Randomized Algorithms:
- [MR] Randomized Algorithms by Motwani and Raghavan
- [MU] Probability and Computing by Mitzenmacher and Upfal
-
Similar courses:
- Algorithmic Techniques for Massive Data (Andoni, Columbia)
- Sublinear (and Streaming) Algorithms (Beame, U. Washington)
- Data Stream Algorithms (Chakrabarti, Dartmouth)
- Algorithms for Big Data (Chekuri, UIUC)
- Algorithms for Modern Data Models (Goal, Stanford)
- Sketching, Streaming and Sub-linear Space Algorithms (Indyk, MIT)
- Sub-linear Algorithms (Indyk and Rubinfeld, MIT)
- Algorithms for Big Data (Nelson, Harvard)
- Data Streams and Massive Data (McGregor, UMass)
- Algorithms for Big Data (Woodruff, CMU)
Course schedule
1 |
Aug. 20 |
0: Introduction |
New models for Big Data |
|
slides |
|
1 |
Aug. 22 |
|
New models for Big Data (cont.) |
|
same |
|
2 |
Aug. 27 |
|
Interesting problems |
|
slides |
|
2 |
Aug. 29 |
|
Basic probabilistic tools |
|
slides |
Read Chapter 3, 4 of [MU] |
3 |
Sep. 3 |
|
|
|
|
Labor day. No class |
3 |
Sep. 5 |
1: Sublinear in space |
Point Queries |
|
slides |
Read Lecture 1 of these notes |
4 |
Sep. 10 |
|
Misra-Gries, Space-Saving |
[MAA] |
same |
Space-saving in this paper |
4 |
Sep. 12 |
|
Space-Saving, FM-sketch |
[FM] |
same |
Read Lecture 2 of these notes |
5 |
Sep. 17 |
|
Improvement on FM sketch |
[BYJKST] |
same |
Read Section 3 of these notes |
5 |
Sep. 19 |
|
Linear sketch |
|
same |
Read these slides |
6 |
Sep. 24 |
|
Linear sketch |
|
same |
Read these slides |
6 |
Sep. 26 |
|
Count-Min, Count-sketch |
[CM] [CCF] |
same |
Read Section 4 of these notes |
7 |
Oct. 1 |
|
Alternative for L2 point query |
[GKMS] |
same |
Read these slides |
7 |
Oct. 3 |
|
L0 sampling |
[JST] |
same |
Read these slides |
8 |
Oct. 8 |
Project 1 presentation |
|
|
|
|
8 |
Oct. 10 |
Project 1 presentation |
|
|
|
|
9 |
Oct. 15 |
Guest talk |
|
|
|
|
9 |
Oct. 17 |
Guest talk |
|
|
|
|
10 |
Oct. 22 |
2. Sublinear in comm. |
Connectivity |
[AGM-1] |
slides |
Read [AGM-1] for details |
10 |
Oct. 24 |
|
Min-cut, Bipartiteness, MST |
[AGM-2] |
same |
Read [AGM-2] for details |
11 |
Oct. 29 |
|
Sparsification |
[AGM-2] |
same |
Read [AGM-2] for details |
11 |
Otc. 31 |
3. Sublinear in time |
Average degree |
[Fei] |
slides |
Read Section 3.1 in [CS] |
12 |
Nov. 5 |
|
Average degree (cont.) |
[GR] |
same |
See this hand-written note by Ronitt Rubinfeld |
12 |
Nov. 7 |
|
Connected Components Minimum spanning tree |
[CRT] |
same |
Read Section 3.2 in [CS] |
13 |
Nov. 12 |
4. Selected topics |
Distributed Join |
[GWWZ] |
slides |
|
13 |
Nov. 14 |
|
Sampling on distributed streams |
[CMYZ] |
slides |
|
14 |
Nov. 19 |
|
|
|
|
Thanksgiving Break. No classes |
14 |
Nov. 21 |
|
|
|
|
Thanksgiving Break. No classes |
15 |
Nov. 26 |
|
Distributed Statistical Estimation of Matrix Products |
|
|
|
15 |
Nov. 28 |
|
Randomized Composable Coresets for Matching and Vertex Cover |
|
|
|
16 |
Dec. 3 |
Project 2 presentation |
|
|
|
|
16 |
Dec. 5 |
Project 2 presentation |
|
|
|
|
Grading
The final grade will be curved.
Assignments 30%
There will be two homework assignments. Assignments will be posted in the middle of the course. The answers should be typeset in LaTeX and submitted via Canvas; here is a template to start with.
Projects 70%
There are two programming assignments. One for topics in Part 1 of this course, and one for topics in Part 2 and Part 3.
For the first project, one needs to implement at least 2 streaming algorithms for one of the two problems: distinct elements or heavy hitters (point query). One can implement the algorithms taught in class, or any other algorithms in the literature (please cite the reference papers). Next, one needs to design experiments to compare the performance of the implemented algorithms, in terms of space usages, accuracy of the answer, and the running time.
For the second project, one can pick any one of the algorithms discussed in Part 2&3 of the class, implement and test it. One should also compare the performance of the algorithm with a "baseline algorithm".
Please see below a set of datasets that can be used for the testing.
One needs to give a presentation of the experimental results, and write a report. Each programming assignment counts 35% of the grade: the presentation counts 15%, and the report counts 20%.
Here is a list of datasets that you can use:
Stanford Large Network Dataset Collection
US Census
Data dumps
UCI datasets
SuiteSparse Matrix Collection
Yahoo Webscope Program
StatLib
Bibliography
Literature
(will add more as we go along)
-
Book chapters, Surveys:
-
Related papers:
Note: [...] will be referenced in the course plan. (...) is one of problems discussed in the paper.
-
Part 1
- [AMS] (Frequent moments) The space complexity of approximating the frequency moments by Alon, Matias and Szegedy.
- [Ind-1] (Frequent moments) Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation by Indyk.
- [CM] (Heavy-hitters) An Improved Data Stream Summary: The Count-Min Sketch and its Applications by Cormode and Muthukrishnan.
- [JST] (Lp sampling) Tight Bounds for Lp Samplers, Finding Duplicates in Streams, and Related Problems by Jowhari, Saglam and Tardos.
- [CCM] (Entropy) A Near-Optimal Algorithm for Estimating the Entropy of a Stream. by Chakrabarti, Cormode and McGregor.
- [GK] (Quantile) Space-Efficient Online Computation of Quantile Summaries. by Greenwald and Khanna.
- [GM] (Heavy-hitters) CR-Precis: A Deterministic Summary Structure for Update Data Streams. by Ganguly and Majumder.
- [BYJKST] (Distinct elements) Counting distinct elements in a data stream by Bar-Yossef et al.
- [MAA] (Heavy-hitters) An integrated effcient solution for computing frequent and top-k elements in data streams. by Metwally, Agrawal and Abbadi.
- [CCF] (Heavy-hitters)Finding Frequent Items in Data Streams. by Charikar, Chen and Farach-Colton.
- [FM] (Distinct elements) Probabilistic counting algorithms for data base applications. by Flajolet and Martin.
- [GKMS] (Heavy-hitters) Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. by Gilbert, Kotidis, Muthukrishnan and Strauss
-
Part 2,3
- [SGP] (Page-rank) Estimating PageRank on Graph Streams. by Das Sarma, Gollapudi, Panigrahy.
- [AGM-1] (Connectivity, etc.) Analyzing Graph Structure via Linear Measurements by Ahn, Guha and McGregor.
- [AGM-2] (Graph sparcification, etc.) Graph Sketches: Sparsfiers, Spanners, and Subgraphs by Ahn, Guha and McGregor.
- [FKGSZ] (Matching, etc.) On Graph Problems in a Semi-streaming Model by Feigenbaum et al.
- [McG] (Matching) Finding Matchings in the Streaming Model by McGregor
- [Bas] (Spanner) Streaming Algorithm for Graph Spanners - Single Pass and Constant Processing Time per Edge by Baswana
- [BFLMS] (Counting triangle) Counting Triangles in Data Streams by Buriol et. al.
- [Fei] (Average degree) On Sums of Independent Random Variables with Unbounded Variance, and Estimating the Average Degree in a Graph by Feige
- [GR] (Average degree) Approximating Average Parameters of Graphs by Goldreich and Ron
- [CRT] (Minimum spanning tree) Approximating the minimum spanning tree weight in sublinear time. by Chazelle, Rubinfeld and Trevisan
- [GLZ] (Clustering) Distributed Partial Clustering by Guha, Yi and Zhang.
- [CSZ] (Clustering) A Practical Algorithm for Distributed Clustering and Outlier Detection. by Chen, Sadeqi Azer and Zhang.
Course policies
For assignments, students may discuss answers with anyone, including problem approaches and proofs. But all students must write their own proofs, and write-ups. The names of all people that you have talked to should be listed at the beginning of the first page. If a solution comes from existing papers/web/books, they must be properly cited, and you must write the solution in a way that demonstrates your understanding (simply copying the solution will be considered as plagiarism). All deadlines are firm. No late assignments will be accepted unless there are legitimate circumstances.
For projects, you may discuss your project with anyone as well, but if this contributes to your final product, they must be acknowledged. Any outside materials used must be referenced appropriately.
For more details, see Indiana University Code of Student Rights, Responsibilities, and Conduct.
Prerequisites
One is expected to know basics on algorithm design and analysis as well as probability. E.g., have taken B403 ``Introduction to Algorithm Design and Analysis" or equivalent courses.