What programming language for Big Data should I use?


It is common on a Big Data project to understand the problem domain, to know what infrastructure to use, and even to decide which framework will process all that data, and then face a difficult decision: which language should I use? When it comes to handling large volumes of data, R, Python, Scala, and Java will meet your needs in most cases. But which should you choose, and why, or when? Here is an analysis of each to help guide your decision.


R is often called "a statistical language". If you need an esoteric statistical model for your calculations, you will probably find it on CRAN. For analysis and plotting, it is hard to beat ggplot2. And if you need more power than your machine can offer, you can use the SparkR bindings to run Spark from R.

However, R may require some adjustment before you become productive in it. As a language designed for data analysis, it is awkward for most general-purpose tasks. You might build a model in R, but you would consider translating that model to Scala or Python for production, and you would be unlikely to want to write a cluster control system in it.


If your data scientists don't like R, they probably know Python by heart. Python has been very popular in academia for over a decade, especially in areas such as Natural Language Processing (NLP). As a result, if you have a project that requires NLP work, you will face a wealth of choices, including the classic NLTK, topic modeling with Gensim, or the extremely fast and accurate spaCy. Similarly, Python performs very well beyond its comfort zone when it comes to neural networks, with TensorFlow and Theano; then there is scikit-learn for machine learning, as well as NumPy and Pandas for data analysis.
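As a minimal illustration of the kind of text processing those libraries industrialize, here is a bag-of-words term count in plain standard-library Python; a real project would reach for NLTK, Gensim, or spaCy instead, and the function name here is just illustrative:

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase, tokenize on word characters, and count term frequencies:
    the very first step of most classic NLP pipelines."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

counts = bag_of_words("Big data is big. Data beats opinions.")
print(counts.most_common(2))  # [('big', 2), ('data', 2)]
```

Libraries like spaCy replace the naive regex tokenizer with trained, language-aware models, but the term-frequency representation they feed downstream is the same idea.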

Compared with R, Python is a traditional object-oriented language, which means that most developers will be comfortable enough working with it, whereas a first exposure to R or Scala can be quite intimidating. A minor quibble is Python's requirement that code be correctly indented. This divides people between those who say "this is great for ensuring readability" and those of us who believe that in 2016 we shouldn't have to fight an interpreter to run a program just because one line has a character out of place.
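To see why indentation divides people: in Python, whitespace is the block delimiter, so shifting a single line in or out changes what the program means. A contrived sketch:

```python
def total_with_bug(values):
    total = 0
    for v in values:
        total += v
        return total  # indented inside the loop: returns after the FIRST item

def total_fixed(values):
    total = 0
    for v in values:
        total += v
    return total      # dedented one level: returns after the whole loop

print(total_with_bug([1, 2, 3]))  # 1
print(total_fixed([1, 2, 3]))     # 6
```

In a brace-delimited language the misplaced `return` would be equally possible, but the interpreter would never have rejected the program outright over a stray tab either; that is the trade-off being argued about.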


Of the four languages mentioned here, Scala is the most admired for its type system. Running on the JVM, Scala is the most successful marriage of the functional and object-oriented paradigms, and it is making great strides in the financial world and at companies that need to operate on massive data volumes (such as Twitter and LinkedIn). It is also the language behind Spark and Kafka.

Because it runs on the JVM, Scala gets immediate access to the Java ecosystem for free, but it also has a wide variety of "native" libraries for handling data at scale (in particular Twitter's Algebird and Summingbird). It also includes a REPL that is very useful for interactive development and analysis, as in Python and R.
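The idea behind a library like Algebird is that any aggregation whose combine step is associative (a monoid) can be computed piecewise across a cluster and merged afterwards in any grouping. A toy standard-library Python sketch of that property, with illustrative names that are not Algebird's API:

```python
from functools import reduce

IDENTITY = (0, 0)  # the monoid's identity element: zero items, zero sum

def merge(a, b):
    # Associative combine for a (count, sum) pair: a simple monoid.
    return (a[0] + b[0], a[1] + b[1])

def aggregate(partition):
    # Each worker folds its own slice of the data independently...
    return reduce(merge, ((1, x) for x in partition), IDENTITY)

partitions = [[1, 2, 3], [4, 5], [6]]             # data split across "workers"
partials = [aggregate(p) for p in partitions]     # computable in parallel
count, total = reduce(merge, partials, IDENTITY)  # ...then merged once
print(count, total / count)                       # 6 items, mean 3.5
```

Because `merge` is associative, the cluster scheduler is free to combine partial results in whatever order they arrive, which is exactly what makes this style of aggregation scale.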

The downside: the Scala compiler is a little slow, to the point that it takes us back to the days of the classic XKCD "Compiling" strip. Still, Scala has good REPL support and web-based notebooks in the Jupyter and Zeppelin mold.


Java is unglamorous, but it might be appropriate for your big data project. Consider Hadoop MapReduce: Java. HDFS? Written in Java. Even Storm, Kafka, and Spark run on the JVM (in Clojure and Scala), meaning that Java is a first-class citizen of these projects. Then there are new technologies such as Google Cloud Dataflow (today Apache Beam), which until recently supported only Java.
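To make the MapReduce model concrete, here is the canonical word count expressed with plain Python functions. Hadoop runs the map and reduce phases as distributed Java tasks with a shuffle in between, but the shape of the computation is the same; the function names here are illustrative:

```python
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) pairs, as a Hadoop mapper would.
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    # Sum all counts for one key, as a Hadoop reducer would.
    return word, sum(counts)

def word_count(lines):
    # The "shuffle": group the emitted pairs by key before reducing.
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_phase(line):
            grouped[word].append(one)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(word_count(["big data", "big plans"]))  # {'big': 2, 'data': 1, 'plans': 1}
```

The distributed version partitions `lines` across machines and routes each key to one reducer, but each worker's code stays this small.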

The main charges against Java are its harsh verbosity and the lack of a REPL (present in R, Python, and Scala) for iterative development. However, the new lambda support in Java 8 goes a long way toward rectifying this. Java will never be as compact as Scala, but Java 8 really makes it less painful to use.


