Messages
 Feb. 23, 2017: Webpage created
 Aug. 28, 2017: Project instruction posted
 Aug. 28, 2017: Project proposal due on Sept. 22, 2017, 11:59pm EST
 Aug. 28, 2017: Data Collection Report (if you choose direction 2) due on Oct. 2, 2017, 11:59pm EST
 Sept. 6, 2017: Homework 1 is out. Due on Oct. 15, 2017, 11:59pm EST
 Aug. 28, 2017: Intermediate Report due on Oct. 22, 2017, 11:59pm EST
 Oct. 3, 2017: Homework 2 is out. Due on Nov. 5, 2017, 5pm EST
 Aug. 28, 2017: Final Report due on Nov. 26, 2017, 11:59pm EST
 Aug. 28, 2017: Project presentations on Nov. 27, Nov. 29, Dec. 4, Dec 6 2017, in class
Course summary
In this course we will talk about sublinear algorithms, which has its roots in the study of Big Data that occur more and more frequently in various applications, e.g., analyses of financial transactions, internet traffic, social networks, genome sequences, etc. Concretely, we will talk about:
1. Sublinear space algorithms. In particular, data stream algorithms, namely, algorithms that solve a problem by making one pass over the data set while using small memory. These algorithms are important in many application areas such as databases and networking, where data arrives at a high speed and there is no time and/or need to store it for offline processing.
2. Sublinear time algorithms, that is, algorithms that do not even read the whole input when outputting the answers.
3. Sublinear communication algorithms. The data is stored in multiple machines, who want to jointly compute functions defined on the union of the data sets via communication.
4. Random topics.
Participants are expected to have a good background in algorithm design and probability, and have good programming skills.
The evaluation will be based on homework assignments and individual project/presentation. The list of questions will be handed out in the middle of the course.
Detailed list of topics is available in the course plan below.
Lecturer
Qin Zhang
Email: qzhangcs@indiana.edu
Office hours: Wed. 45pm at LH430A
Associate Instructor:
Ruiyu Zhu
Email: rynzhu@gmail.com
Office hours: Mon. 45pm at Lindley Abyss
Time and place
2:30pm  3:45pm Monday/Wednesday
BH (Ballantine Hall) 005.
Textbooks

There is no textbook for the class.
Lectures are based on this notes by Chakrabarti, and recent papers.

Background on Randomized Algorithms:
 [MR] Randomized Algorithms by Motwani and Raghavan
 [MU] Probability and Computing by Mitzenmacher and Upfal

Similar courses:
 Algorithmic Techniques for Massive Data (Andoni, Columbia)
 Sublinear (and Streaming) Algorithms (Beame, U. Washington)
 Data Stream Algorithms (Chakrabarti, Dartmouth)
 Algorithms for Big Data (Chekuri, UIUC)
 Algorithms for Modern Data Models (Goal, Stanford)
 Sketching, Streaming and Sublinear Space Algorithms (Indyk, MIT)
 Sublinear Algorithms (Indyk and Rubinfeld, MIT)
 Algorithms for Big Data (Nelson, Harvard)
 Data Streams and Massive Data (McGregor, UMass)
 Algorithms for Big Data (Woodruff, CMU)
Course schedule
1 
Aug. 21 
0. Introduction 
New models for Big Data 

slides 

1 
Aug. 23 

New models for Big Data (cont.) 

same 

2 
Aug. 28 

Interesting problems 

slides 

2 
Aug. 30 

Basic probabilistic tools 

slides 
Read Chapter 3, 4 of [MU] 
3 
Sep. 4 




Labor day. No class 
3 
Sep. 6 
1. Sublinear in space 
Poiny Queries 

slides 
Read Lecture 1 of these notes 
4 
Sep. 11 

MisraGries, SpaceSaving 
[MAA] 
same 
Spacesaving in this paper 
4 
Sep. 13 

SpaceSaving, FMsketch 
[FM] 
same 
Read Lecture 2 of these notes 
5 
Sep. 18 

Improvement on FM sketch 
[BYJKST] 
same 
Read Section 3 of these notes 
5 
Sep. 20 

Linear sketch 

same 
Read these slides 
6 
Sep. 25 

Linear sketch 

same 
Read these slides 
6 
Sep. 27 

CountMin, Countsketch 
[CM] [CCF] 
same 
Read Section 4 of these notes 
7 
Oct. 2 

Alternative for L2 point query 
[GKMS] 
same 
Read these slides 
7 
Oct. 4 

L0 sampling 
[JST] 
same 
Read these slides 
8 
Oct. 9 
2. Sublinear in comm. 
Connectivity 
[AGM1] 
slides 
Read [AGM1] for details 
8 
Oct. 11 

Mincut, Bipartiteness, MST 
[AGM2] 
same 
Read [AGM2] for details 
9 
Oct. 16 

Sparsification 
[AGM2] 
same 
Read [AGM2] for details 
9 
Oct. 18 

Clustering 
[GLZ] 
slides 
Read [GLZ] for details 
10 
Oct. 23 
3. Sublinear in time 
Average degree 
[Fei] 
slides 
Read Section 3.1 in [CS] 
10 
Oct. 25 

Average degree (cont.) 
[GR] 
same 
See this handwritten note by Ronitt Rubinfeld 
11 
Oct. 30 

Minimum spanning tree 
[CRT] 
same 
Read Section 3.2 in [CS] 
11 
Nov. 1 
4. Random topics 
Distributed Join 
[GWWZ] 
slides 

12 
Nov. 6 

Sampling on distributed streams 
[CMYZ] 
slides 

12 
Nov. 8 

Distinct elements on noisy streams 
[CZ] 
slides 

13 
Nov. 13 
5. Homework discussion 




13 
Nov. 15 
6. Guest presentations 
Haoyu and Chao 



14 
Nov. 20 




Thanksgiving Break. No classes 
14 
Nov. 22 




Thanksgiving Break. No classes 
15 
Nov. 27 
7. Student presentations 
Dmitrii and Ramji 



15 
Nov. 29 

Boli and Nikolai 



16 
Dec. 4 

Gotit and Yan 



16 
Dec. 6 

Adithya and Yadi 



Grading
The final grade will be curved.
Assignments 40%
There will be 2 homework assignments. Assignments will be posted in the middle of the course. The answers should be typeset in LaTeX and submitted via Canvas; here is a template to start with.
Projects 60%
The project consists of three components: 1. Write a proposal. 2. Write a report. 3. Make a presentation. 4. Grade others' presentations. The proposal and report should be typeset in LaTeX.
The specifics of the project will be very flexible. During the course many problems will be introduced in various computational models for Big Data. A few of them will be discussed in detail, and the rest will only be mentioned briefly. For the project, you can for example:
1. Pick a problem that is only briefly mentioned in the class and make a survey of its stateofart results.
2. Pick some algorithms that are mentioned in the class, implement them and compare with other algorithms that you can think of. (Some datasets that you can use will be posted soon)
3. Propose new algorithms for problems in models that are discussed in the course. You can either analyze them theoretically (that is, prove some bounds on space/time/communication), or implement them and compare with existing algorithms.
The grade of the projects will depend on how difficult the task is (e.g., proposing good new algorithms will generally be more difficult than understanding/implementing existing ones), and how well it is done.
See here for some detailed instructions.
Literature
(will add more as we go along)

Book chapters, Surveys:

Related papers:
Note: [...] will be referenced in the course plan. (...) is one of problems discussed in the paper.

Statistics
 [AMS] (Frequent moments) The space complexity of approximating the frequency moments by Alon, Matias and Szegedy.
 [Ind1] (Frequent moments) Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation by Indyk.
 [CM] (Heavyhitters) An Improved Data Stream Summary: The CountMin Sketch and its Applications by Cormode and Muthukrishnan.
 [JST] (Lp sampling) Tight Bounds for Lp Samplers, Finding Duplicates in Streams, and Related Problems by Jowhari, Saglam and Tardos.
 [CCM] (Entropy) A NearOptimal Algorithm for Estimating the Entropy of a Stream. by Chakrabarti, Cormode and McGregor.
 [GK] (Quantile) SpaceEfficient Online Computation of Quantile Summaries. by Greenwald and Khanna.
 [GM] (Heavyhitters) CRPrecis: A Deterministic Summary Structure for Update Data Streams. by Ganguly and Majumder.
 [BYJKST] (Distinct elements) Counting distinct elements in a data stream by BarYossef et al.
 [MAA] (Heavyhitters) An integrated effcient solution for computing frequent and topk elements in data streams. by Metwally, Agrawal and Abbadi.
 [CCF] (Heavyhitters)Finding Frequent Items in Data Streams. by Charikar, Chen and FarachColton.
 [FM] (Distinct elements) Probabilistic counting algorithms for data base applications. by Flajolet and Martin.
 [GKMS] (Heavyhitters) Surfing wavelets on streams: Onepass summaries for approximate aggregate queries. by Gilbert, Kotidis, Muthukrishnan and Strauss
 [CZ] (Noisy datasets) Streaming Algorithms for Robust Distinct Elements. by Chen and Zhang

Graphs
 [SGP] (Pagerank) Estimating PageRank on Graph Streams. by Das Sarma, Gollapudi, Panigrahy.
 [AGM1] (Connectivity, etc.) Analyzing Graph Structure via Linear Measurements by Ahn, Guha and McGregor.
 [AGM2] (Graph sparcification, etc.) Graph Sketches: Sparsfiers, Spanners, and Subgraphs by Ahn, Guha and McGregor.
 [FKGSZ] (Matching, etc.) On Graph Problems in a Semistreaming Model by Feigenbaum et al.
 [McG] (Matching) Finding Matchings in the Streaming Model by McGregor
 [Bas] (Spanner) Streaming Algorithm for Graph Spanners  Single Pass and Constant Processing Time per Edge by Baswana
 [BFLMS] (Counting triangle) Counting Triangles in Data Streams by Buriol et. al.
 [Fei] (Average degree) On Sums of Independent Random Variables with Unbounded Variance, and Estimating the Average Degree in a Graph by Feige
 [GR] (Average degree) Approximating Average Parameters of Graphs by Goldreich and Ron
 [CRT] (Minimum spanning tree) Approximating the minimum spanning tree weight in sublinear time. by Chazelle, Rubinfeld and Trevisan

Geometry

Strings

Numerical linear algebra

Others (sliding windows, distributed streaming)
 [DGIM] (Sliding windows) Maintaining Stream Statistics over Sliding Windows by Datar, Gionis, Indyk, and Motwani
 [CMYZ] (Distributed sampling)
Optimal Sampling From Distributed Streams
by Cormode, Muthukrishnan, Yi and Zhang
 [CCG] (Data outsourcing and verification)
Annotations in Data Streams
by Chakrabarti, Cormode, and McGregor
 [CMY] (Distributed monitoring)
Algorithms for Distributed Functional Monitoring
by Cormode, Muthukrishnan and Yi
 [YZ] (Distributed monitoring)
Optimal Tracking of Distributed Heavy Hitters and Quantiles
by Yi and Zhang
 [GWWZ] (Distributed join)
The Communication Complexity of Distributed SetJoins
by Van Gucht, Williams, Woodruff and Zhang
Course policies
For assignments, students may discuss answers with anyone, including problem approaches and proofs. But all students must write their own proofs, and writeups. The names of all people that you have talked to should be listed at the beginning of the first page. If a solution comes from existing papers/web/books, they must be properly cited, and you must write the solution in a way that demonstrates your understanding (simply copying the solution will be considered as plagiarism). All deadlines are firm. No late assignments will be accepted unless there are legitimate circumstances.
For projects, you may discuss your project with anyone as well, but if this contributes to your final product, they must be acknowledged. Any outside materials used must be referenced appropriately.
For more details, see Indiana University Code of Student Rights, Responsibilities, and Conduct.
Prerequisites
One is expected to know basics on algorithm design and analysis as well as probability. E.g., have taken B403 ``Introduction to Algorithm Design and Analysis" or equivalent courses.