Big data software consultants

Introduction

Big data has been defined as having one or more of the following characteristics: 

  • Volume – The sheer quantity of data
  • Variety – The type and nature of the data
  • Velocity – The speed at which the data is generated or needs to be processed

Big data software consultants perform many tasks on big data collections, including data capture and storage, analysis, and visualization. For big data applications, three languages have emerged (or are in the process of emerging) to tackle these problems: Python, Scala, and Julia.

Python is by far the most common of the three among big data software consultants and has been around the longest (since 1991). Scala, perhaps the second most popular language for building data-heavy projects, was released in 2004. It is a general-purpose programming language that runs on the JVM and offers a strong static type system and support for functional programming. Julia is the newest of the three, released in 2012. While it offers some intriguing perks, widespread adoption is still uncertain.

Here is a breakdown of some pros and cons for each language, starting in part I with Python. 

Python

Pros

Widely Used

Perhaps the biggest upside to Python is how widely it is used by big data software consultants. Likely because of its shallow learning curve and maturity, many have abandoned lower-level languages (such as C or Java) or proprietary languages with a high price tag (e.g., MATLAB) and embraced Python. Because of this popularity, there is a well-supported library for just about every data science task one may encounter, particularly for applications like data visualization and deep learning.

Visualization

Python has several mature tools for visualizing and interacting with data including the following:

Matplotlib

Matplotlib, perhaps the most commonly used, is very mature, well documented, and flexible enough to make practically any kind of static chart anyone would want to produce. 

Figure 1: Example Matplotlib chart, from https://matplotlib.org/3.1.0/gallery/lines_bars_and_markers/filled_step.html#sphx-glr-gallery-lines-bars-and-markers-filled-step-py

Plotly

Plotly produces publication-quality, interactive charts for the web. It covers most of what Matplotlib can do, with interactivity added. The library has a robust, well-documented Python API. Older releases required users to register an account to use the free hosted service; recent versions of the open-source library can render charts offline without one. Figure 2 shows an example of an interactive scatter plot from Plotly. The figure contains tools in the upper right-hand corner for zooming, panning, and selecting points. Plotly also lets the programmer specify what happens when the user hovers over or selects a point.

Figure 2: Example Plotly scatter plot

Tableau

The popular data visualization application Tableau offers Python integration, including TabPy, which allows Python scripts to be executed from within Tableau.

ParaView

Lastly, the visualization application ParaView has a robust Python API. ParaView was built to visually analyze massive data sets over a distributed memory system. The data it visualizes is more niche, typically 3D data such as particle streams, magnetic fields, and fire simulations. Figure 3 shows an example of the kind of visualization ParaView can generate.

Figure 3: Example of visualization produced by ParaView, taken from: https://www.paraview.org/gallery/

Cons

Dynamic Typing

Part of Python's flexibility and easy syntax comes from its being a dynamically typed language. This, however, makes debugging harder, because many errors that a compiler would catch instead become run-time errors that are more difficult to track down.
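As a minimal sketch of this point, the snippet below passes a list containing a stray string to a function that expects numbers. The function names (average, describe) and the sample readings are hypothetical and chosen only for illustration. Nothing complains when the code is loaded; the problem surfaces only when the addition is actually attempted.

```python
def average(values):
    """Return the arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def describe(readings):
    """Format the mean, or report the error raised while computing it."""
    # Nothing checks the element types until sum() actually runs.
    try:
        return f"mean = {average(readings):.2f}"
    except TypeError as exc:
        return f"run-time error: {exc}"


good = describe([1.5, 2.5, 3.5])
bad = describe([1.5, "2.5", 3.5])  # a string slipped in, e.g. from a text file

print(good)
print(bad)
```

In a statically typed language, the second call would typically be rejected before the program ever ran.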

Speed

Python is an interpreted language, and as such it is generally slower than a compiled language such as C or Java. For many applications this won’t matter. However, in applications where speed is crucial, Python may not be the best-suited tool.
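One rough way to see the interpreter overhead is to time the same computation written as an explicit Python loop and as a call to the built-in sum(), whose loop runs in compiled code inside the standard CPython interpreter. This is only a sketch: the function names and the problem size N are arbitrary choices, and the timings will differ from machine to machine.

```python
import timeit


def total_loop(n):
    """Add the integers 0..n-1 with an explicit Python-level loop."""
    total = 0
    for i in range(n):
        total += i
    return total


def total_builtin(n):
    """Same result, but the loop runs inside the built-in sum()."""
    return sum(range(n))


N = 200_000  # arbitrary problem size for the comparison

# Run each version 20 times and record the total elapsed seconds.
loop_seconds = timeit.timeit(lambda: total_loop(N), number=20)
builtin_seconds = timeit.timeit(lambda: total_builtin(N), number=20)

print(f"explicit loop: {loop_seconds:.3f} s")
print(f"built-in sum:  {builtin_seconds:.3f} s")
```

On a typical CPython installation the explicit loop is noticeably slower, which is why performance-sensitive Python code tends to push its inner loops into built-ins or compiled libraries.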

Two Major Branches

Python has two major branches: many systems seem to be forever stuck on version 2.7, while the rest of the Python world has moved on to version 3.7.

GIL

Python code can be difficult to parallelize because of the Global Interpreter Lock (GIL). The GIL allows only one thread at a time to execute Python bytecode, which protects the interpreter's objects from simultaneous access or modification. Because of this, programmers not experienced in parallelizing Python code may implement a multi-threaded process and get something that is just as slow as (or slower than) the serial code. There are a few workarounds, but it is debatable whether they are an improvement (for example, see Jython).
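The effect can be observed with a small experiment, sketched below under the assumption of a standard CPython build with the GIL enabled. The helper names (count_down, run_serial, run_threaded, timed) and the workload size are hypothetical. The same CPU-bound work is run serially and then on four threads; both produce identical results, and the threaded version usually shows little or no speedup.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """CPU-bound busy loop; returns the number of iterations performed."""
    steps = 0
    while n > 0:
        n -= 1
        steps += 1
    return steps


def run_serial(chunks):
    """Process every chunk one after another on the calling thread."""
    return [count_down(n) for n in chunks]


def run_threaded(chunks):
    """Process every chunk on its own thread; the GIL still serializes them."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(count_down, chunks))


def timed(func, chunks):
    """Return (result, elapsed_seconds) for func(chunks)."""
    start = time.perf_counter()
    result = func(chunks)
    return result, time.perf_counter() - start


chunks = [300_000] * 4  # four equal pieces of pure-Python work
serial_result, serial_seconds = timed(run_serial, chunks)
threaded_result, threaded_seconds = timed(run_threaded, chunks)

print(f"serial:   {serial_seconds:.3f} s")
print(f"threaded: {threaded_seconds:.3f} s")
```

Threads remain useful for I/O-bound work, where a thread waiting on the network or disk releases the GIL so that others can run.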

Other Thoughts

Many programmers seem to have strong opinions about the syntax of Python. In particular, there are two things that distinguish Python from similar languages:

  1. Blocks of code, such as loops and functions, are set apart by indentation rather than brackets or other characters. This forces the programmer to make the code look pretty, but it can lead to subtle logic errors if a line is indented at the wrong level.
  2. Python has no real block comment. In most cases one can use a triple-quoted string: """Example
    of
    a
    multi-line
    string""", but it is a string literal, not a true block comment.
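Both points can be illustrated with a short sketch. The function names here are made up for the example. The first two functions differ only in the indentation of one line, yet return different results, and neither raises an error. The third shows that a triple-quoted string is kept by the interpreter as a string (here, as the function's docstring) rather than discarded like a comment.

```python
def running_total(values):
    """Add up all the values."""
    total = 0
    for value in values:
        total += value
    return total  # runs once, after the loop finishes


def running_total_misindented(values):
    """Same code, but the return statement is indented one level too deep."""
    total = 0
    for value in values:
        total += value
        return total  # inside the loop: returns during the first pass


def documented():
    """
    A triple-quoted string is an ordinary string literal.
    Placed first in a function, it becomes the docstring.
    """
    return 42


print(running_total([1, 2, 3]))                # 6
print(running_total_misindented([1, 2, 3]))    # 1
print("string literal" in documented.__doc__)  # True
```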

Concluding Thoughts

Python is a powerful tool for small or medium-sized data sets and for applications where speed isn’t critical. Its easy syntax makes it great for rapid prototyping, and it has definite strengths in data visualization. However, for applications where speed and scalability are crucial, it may not be the best choice.

