A few days ago a friend asked me the following question: how can one efficiently extract specific lines from a large text file, possibly compressed by Gzip? He mentioned that he had tried some R functions such as read.table(skip = ...), but found that reading the data was too slow. Hence he was looking for alternative ways to extract the data.
This is a common task when preprocessing large data sets: during data exploration we often want to peek at a small subset of the data to gain some insight before loading all of it.
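To make the task concrete, here is a minimal sketch in base R that pulls a given set of line numbers out of a text file by streaming it through a connection in chunks, so the whole file never has to sit in memory. The helper name `extract_lines` and its `chunk_size` argument are hypothetical and chosen only for illustration. This is one possible approach, not necessarily the fastest one.

```r
# Hypothetical helper (not from any package): return the lines of `path`
# whose 1-based line numbers appear in `wanted`, in increasing order.
extract_lines <- function(path, wanted, chunk_size = 10000L) {
  wanted <- sort(unique(as.integer(wanted)))
  if (length(wanted) == 0L) return(character(0))

  # gzfile() reads gzip-compressed files and also plain uncompressed ones
  con <- gzfile(path, open = "rt")
  on.exit(close(con))

  out <- character(0)
  offset <- 0L           # number of lines consumed so far
  last <- max(wanted)    # no need to read past the last requested line

  while (offset < last) {
    chunk <- readLines(con, n = min(chunk_size, last - offset))
    if (length(chunk) == 0L) break  # file ended before the last requested line

    # Requested line numbers that fall inside this chunk, as chunk-local indices
    idx <- wanted[wanted > offset & wanted <= offset + length(chunk)] - offset
    out <- c(out, chunk[idx])
    offset <- offset + length(chunk)
  }
  out
}

# Usage: build a small gzip-compressed file, then extract three of its lines
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "wt")
writeLines(paste("row", 1:100), con)
close(con)

extract_lines(tmp, c(3, 50, 99), chunk_size = 20L)
```

Because reading stops at the largest requested line number, extracting lines near the top of a large file is cheap. Lines near the end still require decompressing everything before them, since a Gzip stream cannot be read from an arbitrary position.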
Dr. Hadley Wickham is the Chief Scientist of RStudio and Assistant Professor of Statistics at Rice University. He is the developer of the famous R package ggplot2 for data visualization and the author of many other widely used packages such as plyr and reshape2. On Sep 13, 2013 he gave a talk at the Department of Statistics, Purdue University, and afterwards I (Yixuan) had a conversation with him (Hadley) about his experience with and interest in data visualization, data tidying, R programming, and other related topics.