Big data software consultants
Big data has been defined as having one or more of the following characteristics:
- Volume – The sheer quantity of data
- Variety – The type and nature of the data
- Velocity – The speed at which the data is generated or needs to be processed
Big data software consultants may need to perform many tasks on big data collections, including data capture and storage, analysis, and visualization. Three languages have emerged (or are in the process of emerging) to tackle big data problems: Python, Scala, and Julia.
Python is by far the most common of the three among big data software consultants and has been around the longest (since 1991). Scala, perhaps the second most popular language for building data-heavy projects, was released in 2004. It is a general-purpose programming language that runs on the JVM and offers a strong static type system and support for functional programming. Julia is the newest of the three, released in 2012. While it offers some intriguing perks, widespread adoption is still uncertain.
Here is a breakdown of some pros and cons for each language, starting in part I with Python.
Perhaps the biggest upside to Python is how widely it is used by big data software consultants. Likely because of its shallow learning curve and maturity, many have abandoned lower-level languages (such as C or Java) or proprietary languages with a high price tag (e.g., MATLAB) in favor of Python. Because of this popularity, there is a well-supported library for just about every data science task one may encounter, particularly for applications like data visualization and deep learning.
Python has several mature tools for visualizing and interacting with data, including the following:
Matplotlib, perhaps the most commonly used, is very mature, well documented, and flexible enough to make practically any kind of static chart anyone would want to produce.
Plotly produces publication-quality, interactive online charts. It appears to have all the capabilities of Matplotlib, but is also interactive. The library has a robust, well-documented Python API; at the time of writing, the free hosted version required users to register an account. Figure 2 shows an example of an interactive scatter plot from Plotly. The figure contains tools in the upper right-hand corner for zooming, panning, and selecting points. Plotly also lets the developer specify what happens when the user hovers over or selects a point.
The popular data visualization application Tableau includes a Python API, and the TabPy library lets Tableau run Python scripts and use their results in its visualizations.
Lastly, the visualization application ParaView has a robust Python API. ParaView was built to visually analyze massive data sets over a distributed memory system. The data it visualizes is more niche, typically 3D data such as particle streams, magnetic fields, and fire simulations. Figure 3 shows an example of the types of visualizations that ParaView is capable of generating.
Part of Python’s flexibility and easy syntax comes from it being a dynamically typed language. This, however, makes debugging harder, because many errors that a compiler would catch become run-time errors that are more difficult to track down.
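As a small illustration of that trade-off, here is a minimal sketch in which a string slips into a list of numbers. The `average` function and the `readings` data are hypothetical, made up for this example. Nothing complains when the code is loaded; the mistake only surfaces when the function actually runs.

```python
def average(values):
    """Return the arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


# Hypothetical data: the last reading is a string by mistake.
readings = [3.0, 4.5, "6.0"]

try:
    print(average(readings))
except TypeError as exc:
    # No compile step flagged the bad element; the failure only
    # appears at run time, when sum() reaches the string.
    print(f"run-time error: {exc}")
```

In a statically typed language, the equivalent mistake would normally be rejected before the program ever ran.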
Python is an interpreted language, and as such it generally runs slower than a compiled language such as C or Java. For many applications this won’t matter. However, in applications where speed is crucial, Python may not be the best-suited tool.
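One rough way to get a feel for interpreter overhead is to time a hand-written loop against the built-in `sum`, which performs the same additions inside CPython’s compiled code. This is only a sketch: the function names are made up for the example, and the actual timings will vary by machine and Python version.

```python
import timeit


def loop_sum(n):
    """Add 0..n-1 one step at a time in interpreted Python code."""
    total = 0
    for i in range(n):
        total += i
    return total


def builtin_sum(n):
    """Same result, but the loop runs inside the built-in sum()."""
    return sum(range(n))


if __name__ == "__main__":
    n = 1_000_000
    for fn in (loop_sum, builtin_sum):
        seconds = timeit.timeit(lambda: fn(n), number=5)
        print(f"{fn.__name__}: {seconds:.3f} s for 5 runs")
```

Both functions return the same value; any difference in the printed times comes from where the loop is executed.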
Two Major Branches
Python has two major branches, with many systems that seem to be forever stuck on version 2.7, while the rest of the Python world has moved on and is on version 3.7.
Python code can be difficult to parallelize because of the Global Interpreter Lock (GIL). In the standard CPython interpreter, the GIL allows only one thread at a time to execute Python bytecode, so CPU-bound threads cannot run in parallel. Because of this, programmers not experienced in parallelizing Python code may implement a multi-threaded process and get something that is just as slow as (or slower than) the serial code. There are a few workarounds, but it is debatable whether they are an improvement (for example, see Jython).
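The effect can be seen with a minimal sketch like the one below, which runs the same CPU-bound task serially and then on a thread pool. The prime-counting workload, the function names, and the problem sizes are all assumptions chosen for illustration. On a standard CPython build, the threaded version typically takes about as long as the serial one.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound work: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run_serial(limits):
    """Process each task one after another in the main thread."""
    return [count_primes(limit) for limit in limits]


def run_threaded(limits, workers=4):
    """Process the tasks on a thread pool; the GIL still serializes them."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_primes, limits))


if __name__ == "__main__":
    limits = [20_000] * 4  # four identical, hypothetical tasks
    for label, runner in (("serial", run_serial), ("threaded", run_threaded)):
        start = time.perf_counter()
        results = runner(limits)
        elapsed = time.perf_counter() - start
        print(f"{label:>8}: {elapsed:.2f} s, results={results}")
```

Both runs produce identical results; only the elapsed times are worth comparing, and those depend on the machine and interpreter build.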
Many programmers seem to have strong opinions about the syntax of Python. In particular, there are two things that distinguish Python from similar languages:
- Blocks of code, such as loops and functions, are set apart by indentation, not brackets or other characters. This forces the programmer to make the code look tidy, but can lead to subtle logic errors if a line is indented at the wrong level.
- Python has no real block comment. In most cases one can use triple quotation marks, as in """Example string""", but this creates a string literal and is not a true block comment.
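Both points can be seen in a short sketch. The two functions below are hypothetical and differ only in the indentation of one line, yet they compute different results; the triple-quoted text between them and the `print` calls is an ordinary string expression that Python accepts but does nothing with.

```python
def total_inside(values):
    total = 0
    for v in values:
        total += v
        total *= 2  # indented: runs on every pass through the loop
    return total


def total_outside(values):
    total = 0
    for v in values:
        total += v
    total *= 2  # dedented: runs once, after the loop finishes
    return total


"""
This triple-quoted text is a string literal, not a comment.
It has no effect when the program runs.
"""

print(total_inside([1, 2, 3]))   # prints 22
print(total_outside([1, 2, 3]))  # prints 12
```

Neither version raises an error, which is what makes this kind of mistake easy to miss.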
Python is a powerful tool for small or medium-sized data or for applications where speed isn’t critical. Its easy syntax makes it great for rapid prototyping, and it has definite strengths in data visualization. However, for applications where speed and scalability are crucial, it may not be the best choice.