As the world moves deeper into a digital era that generates immense amounts of data, businesses are constantly looking for feasible, practical ways to analyze that information so the flood of data can be put to meaningful use for growth and development. Data is being collected at an unprecedented pace, arrives from a wide gamut of sources, and is available as soon as it is generated. Big Data is a broad term covering initiatives and technologies that deal with massive, diverse and continuously changing data. It has changed the way organizations do business, gain insight, deal with their customers and make decisions, offering both a synergy with and an extension to existing processes. Big data is also changing the way businesses approach product development, human resources and operations. It touches every aspect of society, including retail, mobile services, life sciences, financial services and the physical sciences. It can be touted as both the biggest opportunity and the biggest challenge for the statistical sciences, because if the numbers are crunched accurately, Big Data can offer huge rewards.
Companies may know the types of result they are seeking, but those results can be difficult to obtain, or may require significant data mining before specific answers emerge. For statisticians, the challenge is dealing with data that is not only big but also very different in kind. They must contend with the “look-everywhere effect” and extract meaningful information from a huge haystack of data. There are also challenges with the algorithms themselves: they often do not scale as expected and can become extremely slow once gigabyte-scale datasets are involved. To improve speed and theoretical accuracy, these algorithms need to be improved, or new algorithms need to be designed. An algorithm must be capable of handling next-generation functional data, and should be able to search the data for hidden relationships and patterns.
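The “look-everywhere effect” mentioned above can be sketched with a short simulation (a minimal illustration; the feature counts, sample sizes and thresholds are assumptions, not figures from this article): when a thousand pure-noise features are screened at the conventional 5% significance level, roughly 5% of them look “significant” by chance alone, while a Bonferroni-style correction suppresses nearly all of those false discoveries.

```python
import numpy as np
from statistics import NormalDist

# Minimal simulation of the "look-everywhere effect": every feature
# below is pure noise, so every apparent discovery is a false positive.
rng = np.random.default_rng(0)

n_tests = 1000    # hypothetical number of features screened at once
n_samples = 50    # observations per feature

data = rng.standard_normal((n_tests, n_samples))

# Approximate one-sample test statistic: sample mean over its standard error.
z = data.mean(axis=1) / (data.std(axis=1, ddof=1) / np.sqrt(n_samples))

# Conventional two-sided 5% cutoff: |z| > 1.96.
false_hits = int(np.sum(np.abs(z) > 1.96))

# Bonferroni correction: divide the significance level by the number of tests.
bonf_cut = NormalDist().inv_cdf(1 - 0.025 / n_tests)
bonf_hits = int(np.sum(np.abs(z) > bonf_cut))

print(f"{false_hits} of {n_tests} noise features pass the naive cutoff")
print(f"{bonf_hits} survive the Bonferroni-corrected cutoff")
```

With a thousand tests at a 5% level, around fifty false discoveries are expected from noise alone; the corrected threshold removes essentially all of them, which is exactly the kind of adjustment large-scale screening algorithms must build in.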
Another challenge is the analysis of too many correlations, several of which may be bogus yet appear statistically significant, and the sheer magnitude of big data can amplify such errors. Big data is quite efficient at detecting subtle correlations; however, it is left to the judgment of the user which correlations are meaningful, and this is not always an easy task. Statistical analysis cannot be a wholesale replacement for scientific inquiry, and users must start with a basic understanding of the data. Moreover, once users understand how a big-data system works, it can easily be gamed. Good examples are “spamdexing” and “Google bombing”, where companies artificially elevate a website’s search placement. At other times the results of an analysis are not intentionally gamed, but are simply less robust than expected. Much big data comes from the web, which is itself big data, and this increases the chance of errors reinforcing one another.
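How easily bogus correlations arise can likewise be shown numerically (an illustrative simulation with assumed sizes, not a result from the text): if an analyst screens thousands of unrelated series against a target, the best-looking correlation is large purely by chance, even though nothing in the data is actually related.

```python
import numpy as np

# Sketch of data-mined spurious correlation: the best of thousands of
# random candidate series will correlate strongly with any target.
rng = np.random.default_rng(42)

n_candidates = 5000   # hypothetical pool of unrelated signals
n_obs = 30            # a short series, as in many business datasets

target = rng.standard_normal(n_obs)
candidates = rng.standard_normal((n_candidates, n_obs))

# Pearson correlation of each candidate with the target, via z-scores.
t = (target - target.mean()) / target.std()
c = (candidates - candidates.mean(axis=1, keepdims=True)) \
    / candidates.std(axis=1, keepdims=True)
corrs = (c @ t) / n_obs

best = float(np.max(np.abs(corrs)))
print(f"best |correlation| among pure noise: {best:.2f}")
```

A single pair of 30-point random series rarely correlates strongly, but the maximum over five thousand pairs routinely exceeds 0.5, which is why a correlation found by searching everywhere needs independent validation before it is treated as meaningful.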
Undoubtedly, big data is a valuable tool, and it has made a critical impact in a select few realms. However, it has proved its worth mainly in analysing common things, falling short in the analysis of less common information and not living up to the perceived hype. Big data is here to stay, but it is not a silver bullet, and we need to be realistic about its potential and limitations.