• Oct. 3, 2013: Webpage draft created
  • Jan. 2, 2014: Project instruction posted
  • Jan. 15, 2014: Assignment 1 is out, due on Feb. 5, 2014, 4pm EST
  • Jan. 29, 2014: Project proposal due on Feb. 12, 2014, 4pm EST (You are allowed to modify/change the project proposal after that, but shound be before Feb. 26, 2013).
  • Jan. 29, 2014: Data Collection Report due on Mar. 5, 2014, 4pm EST
  • Jan. 29, 2014: Intermediate Report due on Mar. 26, 2014, 4pm EST
  • Jan. 29, 2014: Final Report due on Apr. 30, 2014, 4pm EST
  • Jan. 29, 2014: Project presentations on Apr. 21, 23, 28, 30, 2014, in class
  • Feb. 9, 2014: Assignment 2 is out, due on Feb. 24, 2014, 4pm EST
  • Feb. 25, 2014: Assignment 3 is out, due on Mar. 19, 2014, 4pm EST
  • Mar. 25, 2014: Assignment 4 is out, due on Apr. 9, 2014, 4pm EST

Course summary

In this course we will discuss state-of-the-art algorithmic tools for data mining, including:
1. Finding Similar items (e.g., Locality Sensitive Hashing)
2. Mining Frequent items (e.g., Count-Min Sketch)
3. Clustering (e.g., Spectral Clustering)
4. Link Analysis (e.g., PageRank)
We will also discuss how to implement solutions in modern computational models for Big Data, including:
1. Data streams
2. MapReduce
3. ActiveDHT (DHT: distributed hash table)
4. Distributed monitoring*
Detailed list of topics is available in the course schedule below.

This is an advanced undergraduate course. Participants are expected to have a background in basic probability, big-O analysis, and programming (no specific language requirement).

The evaluation will be based on a set of exercises and individual project/presentation.


Qin Zhang
Office hours: By email appointment

AI: Prasanth Velamala
Office hours: Thursdays, 2pm-3pm, LH112

Time and place

4:00PM-5:15PM Mon, Wed
Lindley Hall, Room 008.


    There is no official textbook. Below are a few reference books that you can use.

  • Main reference book:
      [MMDS] Mining Massive Data Sets by Anand Rajaraman and Jeff Ullman. The digital version of the book is free, but you may wish to purchase a hard copy.
  • Books on Randomized Algorithms:
    • [MR] Randomized Algorithms by Motwani and Raghavan
    • [MU] Probability and Computing by Mitzenmacher and Upfal
  • Related courses:

Course schedule

(subject to adjustments as we go along).

 Week   Date   Section   Content   Slides   Comments 
  1     Jan. 13     0. Introduction   Big Data, Data Mining,
  Problems and Computational Models,
  Course Plan
  1     Jan. 15       Basic Probability   same   A note  
  2   Jan. 22       Basic Probability (cont.)   same   same  
  3   Jan. 27     1. Finding Similar Items     Jaccard Similarty
  slides   Phillips' note-1
  3   Jan. 29         Locality Sensitive Hashing (LSH)     same   Phillips' note
  4   Feb. 3         LSH (cont.)   same  
  4   Feb. 5         LSH (cont.)   same  
  5   Feb. 10         Distance Functions   same   Phillips' note
  5   Feb. 12   2. Clustering     Hierachical Clustering     slides   Phillips' note
  6   Feb. 17     Assignment-based Clustering: k-center    same   Some proofs  
  6   Feb. 19     k-mean    same   Phillips' note
  7   Feb. 24     k-median    same  
  7   Feb. 26     Spectural Clustering     same   Phillips' note
  8   Mar. 3     Clustering Social Networks     same  
  8   Mar. 5   3. Mining Frequent Items     Frequent Items in Data Stream
  9   Mar. 10       Space-saving   same  
  9   Mar. 12       Count-Min
  Compressive Sensing
  10   Mar. 17           Spring break, no class  
  10   Mar. 19         Spring break, no class  
  11   Mar. 24     Frequent Itemsets     same   Phillips' note
  11   Mar. 26   4. Link Analysis   Markov Chain Basics   slides   Phillips' note
  12   Mar. 31         Class cancelled,
  instructor out of the town  
  12   Apr. 2     Metropolis Algorithm, Webpage Similarity   same  
  13   Apr. 7     PageRank, HITS, Salsa   same   Leskovec's slides
  13   Apr. 9   5. Models for Big Data   MapReduce, PageRank Implementation 
  slides   Phillips' slides (modified)  
  14   Apr. 14     ActiveDHT, PageRank Implementation    same   Stoica's slides about DHT  
  Goel's slides about Active DHT  
  14   April. 16   6. Student Presentations   Chen (Online Advertising),  
  Chen (LSH for NN Search),  
  Jazayeri (Clustering for Finding Communities)
  15   April. 21     Anderson, Biggs, Damico    
  15   April. 23     Denison, Kutkoski, Read    
  16   April. 28     Lowden (Spectual Clustering for Brain Data),  
  Monaco, Tadros  
  16   April. 30     Schnaitman (Movie Recommendation Sys.),  


  • Assignments 50%

    Assignments will be distributed via oncourse in the middle of the course. The answers should be typeset in LaTeX (highly recommended; here is a template to start with) or Word.

  • Projects 50%

    The project consists of three components: 1. Write a proposal. 2. Make a presentation. 3. Write a report. The proposal and report should be typeset in LaTeX (highly recommended) or Word.
    See here for detailed instructions.

Course policies

For assignments, students may discuss answers with anyone, including problem approaches and proofs. But all students must write their own proofs, and write-ups. The names of all people that you have talked to should be listed at the beginning of the first page. If a solution comes from existing papers/web/books, they must be properly cited, and you must write the solution in a way that demonstrates your understanding (simply copying the solution will be considered as plagiarism). All deadlines are firm. No late assignments will be accepted unless there are legitimate circumstances.
For projects, you may discuss your project with anyone as well, but if this contributes to your final product, they must be acknowledged. Any outside materials used must be referenced appropriately.
For more details, see Indiana University Code of Student Rights, Responsibilities, and Conduct.


    A student who is comfortable with basic probability, big-O analysis, and programming (no specific languange requirement) will be qualified for the class. For example, one has taken (Math) M365 "Introduction to Probability and Statistics", (Math) M301 "Linear Algebra and Applications", (CS) C241 "Discrete Structures for Computer Science", (CS) B403 "Introduction to Algorithm Design and Analysis", or equivalent courses.