Originally posted in 2015, this article was updated on April 18th, 2018
There are thousands of Big Data tools out there, all of them promising to save you time and money and to help you uncover never-before-seen business insights. And while all that may be true, navigating this world of possible tools can be tricky when there are so many options. This becomes especially difficult when you’re not certain how big data tools differentiate themselves from each other. As a result, many who are interested in utilizing big data tools have the same questions.
Which one is right for your skill set?
Which one is right for your project?
But you don’t have to fret over your decision. To save you some time and help you pick the right tool the first time, we’ve compiled a list of a few of our favorite big data tools in the areas of extraction, storage, cleaning, mining, visualizing, analyzing and integrating. Check it out.
Data Storage and Management
If you’re going to be working with Big Data, you need to think about how you store it. Part of how Big Data got the distinction as “Big” is that it became too much for traditional systems to handle. What once required gigabytes now scales up to terabytes and beyond. A good data storage provider should offer you an infrastructure on which to run all your other big data analytics tools as well as a place to store and query your data.
The name Hadoop has become synonymous with big data. Hadoop is an open-source software framework for distributed storage and processing of very large datasets on computer clusters. All that means you can scale your data up and down without having to worry about hardware failures. Hadoop provides massive amounts of storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Hadoop is not for the data beginner. To truly harness the software’s power, you really need to know Java. It might be a commitment, but Hadoop is certainly worth the effort – since tons of other companies and technologies run off of it or integrate with it. Getting familiar with the Hadoop ecosystem can prove valuable in more ways than one.
Get started with Hadoop: Cloudera has some great Hadoop training courses.
Speaking of which, Cloudera is essentially a brand name for Hadoop with some extra services stuck on. They can help your business build an enterprise data hub, to allow people in your organization better access to the data you are storing.
While it does have an open source element, Cloudera is mostly an enterprise solution to help businesses manage their Hadoop ecosystem. Essentially, they do a lot of the hard work of administering Hadoop for you. They will also deliver a certain amount of data security, which is highly important if you’re storing any sensitive or personal data.
Get started: Cloudera has a long list of webinars covering all different types of use.
MongoDB is the modern, start-up approach to databases. Think of it as an alternative to relational databases. It’s good for managing data that changes frequently or data that is unstructured or semi-structured.
Common use cases include storing data for mobile apps, product catalogs, real-time personalization, content management and applications delivering a single view across multiple systems. Again, MongoDB is not for the data newbie. As with any database, you do need to know how to query it using a programming language.
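To make the idea of semi-structured data a little more concrete, here is a minimal sketch in plain Python rather than in MongoDB itself. The product records, their field names and the `find` helper are all hypothetical and invented for illustration; a real MongoDB database would be queried through one of its drivers.

```python
import json

# Hypothetical product catalog: each record is a free-form "document".
# The two records do not share the same set of fields, which is awkward
# to model with a single fixed relational table.
products = [
    {"name": "T-shirt", "price": 15.0, "sizes": ["S", "M", "L"]},
    {"name": "E-book", "price": 9.5, "format": "PDF", "pages": 120},
]


def find(documents, **criteria):
    """Return the documents whose fields equal every given criterion.

    Illustrative helper only; it is not part of any database driver.
    """
    return [
        doc for doc in documents
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


ebooks = find(products, format="PDF")
print(json.dumps(ebooks, indent=2))
```

The point of the sketch is only that records with different shapes can live side by side and still be filtered by field, which is the kind of data the paragraph above describes.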
Getting started: MongoDB has their own “University” where you can learn to use their services and even get a certification.
Talend is another great open source company that offers a number of data products. Here we’re focusing on their Master Data Management (MDM) offering, which combines real-time data, applications, and process integration with embedded data quality and stewardship.
Because it’s open source, Talend is completely free, making it a good option no matter what stage of business you are in. And it saves you having to build and maintain your own data management system – which is a tremendously complex and difficult task.
Get started: Talend has a good (if slightly ’90s-looking) set of tutorials for getting started.
Start from the beginning
Whether you’re an enterprise sized company or a small business, if you’re completely new to Big Data, databases might not be the best place to start. They are relatively complex and do require a certain amount of coding knowledge to operate (unlike many of the other tools mentioned below).
However, if you really want to work in or with Big Data analytics, knowing the basics of databases and being able to talk intelligently about them is a must. This General Assembly Class is a great place to start. You’ll get a comprehensive overview of the technologies powering big data, including the history of databases and storage, the difference between relational and document databases, the challenges of big data and the tools it necessitates, as well as an introduction to Hadoop. With this knowledge at your disposal, you’ll be able to tackle other challenges and move on to learning about other big data platforms, data analytics software, big data systems and more.
Big Data Cleaning Tools
Before you can really mine your business data for insights you need to clean it up. When first collected, data can appear quite disorganized and difficult to interpret. In other words, it’s messy. Even though it’s always good practice to create a clean, well-structured data set, it’s not always possible, no matter what type of big data you have. Data sets come in all shapes and sizes (some good, some not so good!), especially when you’re getting them from the web. The companies below will help you refine and reshape the data into a data set your business can use.
OpenRefine (formerly Google Refine) is an open source big data cleaning tool that is dedicated to cleaning messy data. You can explore huge data sets easily and quickly even if the data from your business is a little unstructured.
As far as big data analytics software goes, OpenRefine is pretty user-friendly, though a good knowledge of data cleaning principles certainly helps you get the most out of it. The nice thing about OpenRefine is that it has a huge community with lots of contributors, meaning that the software is constantly getting better and better. And you can ask the (very helpful and patient) community questions if you get stuck. You can check out their GitHub repository, where you can also find the OpenRefine wiki.
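For a sense of what “cleaning” means in practice, here is a small sketch using only Python’s standard library. The sample data, the column names and the cleaning rules are assumptions made up for this example; the tools below do far more than this.

```python
import csv
import io

# Hypothetical messy export: stray whitespace, inconsistent casing,
# a duplicate row and a missing value.
RAW = """name,city,revenue
  Acme Ltd ,london,1200
acme ltd,London ,1200
Globex,  PARIS,
Initech,Berlin,950
"""


def clean_rows(text):
    """Trim whitespace, normalize casing, drop duplicates and incomplete rows."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        revenue = row["revenue"].strip()
        if not revenue:      # skip rows with no revenue figure
            continue
        key = (name, city)
        if key in seen:      # skip duplicates found after normalization
            continue
        seen.add(key)
        cleaned.append({"name": name, "city": city, "revenue": int(revenue)})
    return cleaned


for record in clean_rows(RAW):
    print(record)
```

Of the four input rows, only two survive: the second Acme row is a duplicate once casing and whitespace are normalized, and the Globex row has no revenue value.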
DataCleaner recognises that data manipulation is a long and drawn out task. Data visualization tools can only read nicely structured, “clean” data sets. DataCleaner does the hard work for you and transforms messy semi-structured data sets into clean readable data sets that all of the visualization companies can read.
DataCleaner also offers data warehousing and data management services for your business. The company offers a 30-day free trial and then after that a monthly subscription fee. You can find out more about their plans here.
Getting started: DataCleaner has a thorough set of documentation and videos. For their commercial plans, they also offer in-person or webinar training
Not to be confused with data extraction (which will be covered later), data mining is the process of discovering insights within a database as opposed to extracting data from web pages into databases. The aim of data mining is to make predictions and decisions on the data your business has at hand.
IBM SPSS Modeler
The IBM SPSS Modeler offers a whole suite of solutions dedicated to data mining. Their five products provide a range of advanced algorithms and techniques that include text analytics, entity analytics, decision management and optimization.
SPSS Modeler is a heavy-duty solution that is well suited for the needs of big businesses. It can run on virtually any type of database and you can integrate it with other IBM SPSS products such as SPSS collaboration and deployment services and the SPSS Analytic server.
Oracle data mining
Another big hitter in the data mining sphere is Oracle. As part of their Advanced Analytics Database option, Oracle data mining allows its users to discover insights, make predictions and leverage their Oracle data. You can build models to discover customer behavior, target best customers and develop profiles.
The Oracle Data Miner GUI enables data analysts, business analysts and data scientists to work with data inside a database using a rather elegant drag and drop solution. It can also create SQL and PL/SQL scripts for automation, scheduling and deployment throughout the enterprise.
Getting started: For all the resources you could ever need, head to their support page
Teradata recognizes the fact that, although big data is awesome, if you don’t actually know how to analyze and use it, it’s worthless. Imagine having millions upon millions of data points without the skills to query them. That’s where Teradata comes in. They provide end-to-end solutions and services in data warehousing, big data and analytics and marketing applications. This all means that you can truly become a data-driven business, which will lead to new levels of success.
Teradata also offers a whole host of services including implementation, business consulting, training and support.
Getting started: Have a look at their support documentation
If you’re stuck on a data mining problem or want to try solving the world’s toughest problems, check out Kaggle. Kaggle is the world’s largest data science community. Companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models.
While data mining is all about sifting through your data in search of previously unrecognized patterns, data analysis is about breaking that data down and assessing the impact of those patterns over time. Analytics is about asking specific questions and finding the answers in big data. You can even ask questions about what will happen in the future! This is the step where big data shows just how valuable it can be.
Qubole is a big data platform which simplifies, speeds and scales big data analytics workloads against data stored on AWS, Google, or Azure clouds. They take the hassle out of infrastructure wrangling. Once the IT policies are in place, any number of data analysts can be set free to collaboratively “click to query” with the power of Hive, Spark, Presto and many others in a growing list of data processing engines.
Qubole is an enterprise level solution. They offer a free trial that you can sign up for at this page. The flexibility of the program really does set it apart from the rest, and it is also the most accessible of the big data platforms.
Getting started: Learn more about Qubole on their resources page
BigML is attempting to simplify machine learning. They offer a powerful Machine Learning service with an easy-to-use interface for you to import your data and get predictions out of it. You can even use their models for predictive analytics.
A good understanding of modeling is certainly helpful, but not essential, if you want to get the most from BigML. They have a free version of the tool that allows you to create tasks that are under 16 MB, as well as a pay-as-you-go plan and a virtual private cloud that meets enterprise-grade requirements.
Getting started: You can quickly learn how BigML works in this quick four video series
Statwing takes data analysis to a new level providing everything from beautiful visuals to complex analysis. They have a particularly cool blog post on NFL data! It’s so simple to use that you can actually get started with Statwing in under 5 minutes.
Although it isn’t free to use, the pricing plan is rather elegant. The basic package is $50 a month, which you can cancel at any time. This allows you to use unlimited datasets of up to 50 MB in size each. There are other enterprise plans that give you the ability to upload bigger datasets.
Getting started: There are lots of cool tutorial videos on their homepage
Data visualization companies will make your data come to life. Part of the challenge for any data scientist is conveying the insights from that data to the rest of your business. For most of your colleagues, MySQL databases and spreadsheets aren’t going to cut it. Visualizations are a bright and easy way to convey complex data insights. And the best part is that most of them require no coding whatsoever!
Tableau is a data visualization tool with a primary focus on business intelligence. You can create maps, bar charts, scatter plots and more without the need for programming. They recently released a web connector that allows you to connect to a database or API thus giving you the ability to get live data in a visualisation.
Tableau has five products available with varying degrees of support and functionality. If you’re new to vizzing (as they call it) we recommend Tableau Public, the free version of their visualization tooling. Exploring that tool should give you an idea of which of the other Tableau products you’d rather pay for.
Get started: Tableau has a lot of functionality, so definitely check out their tutorials before diving in.
CartoDB is a data visualization tool that specialises in making maps. They make it easy for anyone to visualize location data – without the need for any coding. CartoDB can manage a myriad of data files and types, and they even have sample datasets that you can play around with while you’re getting the hang of it.
If you have location data, CartoDB is definitely worth a look. It may not be the easiest system to use, but once you get the hang of it, it is incredibly powerful. They offer an enterprise package which allows for collaboration on projects as well as controlled access.
Getting started: They have an extensive library of documentation to help you become a mapping expert.
Chartio allows you to combine data sources and execute queries in-browser. You can create powerful dashboards in just a few clicks. Chartio’s visual query language allows anyone to grab data from anywhere without having to know SQL or other complicated model languages. They also let you schedule PDF reports so you can export and email your dashboard as a PDF file to anyone you want.
The other cool thing about Chartio is that it often doesn’t require a data warehouse. This means that you’re going to get up and running faster and that your cost of implementation is going to be lower and more predictable.
Getting started: Check out the Chartio tutorials to get started
If you are wanting to build a graph, Plot.ly is the place to go. This handy platform allows you to create stunning 2d and 3d charts (you really need to see it to believe it!). Again, all without needing programming knowledge.
The free version allows you to create one private chart and unlimited public charts, or you can upgrade to the enterprise packages to make unlimited private and public charts, with the added option of vector exports and saving custom themes.
Getting started: You can find everything you need to get started in Plotly’s full range of tutorials
Our final visualization tool is Datawrapper. It’s an open source tool that creates embeddable charts in minutes. Because it’s open source, it will be constantly evolving as anyone can contribute to it. They have an awesome chart gallery where you can check out the kind of stuff people are doing with Datawrapper.
Similar to many of the other companies in this section, it has both a free tool as well as a paid option with the paid option being a pre-set up, customised package of Datawrapper.
Data integration platforms are the glue between each program. If you want to connect the data you’ve extracted using Import.io with Twitter or you want to share on Facebook the visualisation you’ve made with Tableau automatically, then the integration services below are the tools for you.
Blockspring is a unique program in the way that they harness all of the power of services such as IFTTT and Zapier in familiar platforms such as Excel and Google Sheets. You can connect to a whole host of 3rd party programs by simply writing a Google Sheet formula. You can post Tweets from a spreadsheet, look to see who your followers are following as well as connecting to AWS, Import.io and Tableau to name a few.
Blockspring is free to use, but they also have an organization package that allows you to create and share private functions, add custom tags for easy search and discovery and set API tokens for your whole organization at once. It’s a great fit for any enterprise.
Getting started: Blockspring has some really good help documentation to get you up and running
Pentaho offers big data integration with little to no coding required. Using a simple drag and drop UI, you can integrate a number of tools. They also offer embedded analytics and business analytics services.
Getting started: You can check out the help documentation to learn more and get a better feel for how it works
There will be times in your data career when a tool simply won’t cut it. While today’s big data tools are becoming more powerful and easier to use, sometimes it is just better to code it yourself. Even if you’re not a programmer, understanding the basics of how these languages work will give you a better understanding of how many of these tools function and how best to use them. Here are some of the most common languages associated with big data analytics software.
R is a language for statistical computing and graphics. If the data mining and statistical software listed above doesn’t quite do what you want it to, learning R is the way forward. In fact, if you’re planning on being a data scientist, knowing R is a requirement.
It can run on Linux, Windows and MacOS and you can download R at this page. There is a huge community of statisticians using R nowadays and its popularity is always growing.
Getting started: Once downloaded, you can check out the documentation to learn more
Another language that is gaining popularity in the data community is Python. Conceived in the late 1980s, first released in 1991 and named after Monty Python’s Flying Circus, it has consistently ranked in the top ten most popular programming languages in the world. Many journalists use Python to write custom scrapers if data collection tools fail to get the data that they need.
People like it because of the similarities with the English language. It uses words such as ‘if’ and ‘in’ meaning you can read a script very easily. It offers a wide range of libraries designed for different types of tasks.
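As a small illustration of that readability, the sketch below uses ‘if’ and ‘in’ almost as you would in an English sentence. The list of tools and the `describe` function are hypothetical names chosen for this example.

```python
# Hypothetical list of tools; the names are taken from this article.
tools = ["Hadoop", "MongoDB", "Tableau"]


def describe(tool):
    """Say whether a tool appears in the list above."""
    if tool in tools:
        return f"{tool} is on the list"
    return f"{tool} is not on the list"


print(describe("Tableau"))
print(describe("Excel"))
```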
Getting started: Check out the homepage to learn more about Python
RegEx, or regular expressions, are sequences of characters that define a search pattern. They are used mainly for pattern matching within strings, which in turn lets you find, replace or remove text. At Import.io, you can use RegEx while extracting data to delete parts of a string or keep particular parts of a string.
It is an incredibly useful tool for data extraction, because you can get exactly what you want at extraction time, meaning you don’t need to rely on the data manipulation companies mentioned above!
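Here is a minimal sketch of both ideas, keeping one part of a string and deleting another, using Python’s built-in `re` module. The sample text and the pattern are assumptions for illustration and say nothing about how Import.io applies RegEx internally.

```python
import re

# Hypothetical scraped value; the format is an assumption for illustration.
raw_price = "Price: $1,299.00 (incl. tax)"

# Keep a particular part of the string: the numeric amount after the "$".
match = re.search(r"\$([\d,]+\.\d{2})", raw_price)
amount = match.group(1) if match else None

# Delete part of a string: remove the thousands separator before converting.
number = float(re.sub(r",", "", amount)) if amount else None

print(amount)
print(number)
```

The first pattern captures only the digits, commas and decimals following the dollar sign; the second step strips the comma so the value can be treated as a number.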
Getting started: There are many cool tutorials to help you learn RegEx online
XPath is a query language used for selecting certain nodes from an XML document. Whereas RegEx manipulates and changes the data makeup, XPath will extract the raw data ready for RegEx.
XPath is most commonly used in data extraction. Import.io actually automatically creates XPaths every time you click on a piece of data – you just don’t see them! It is also possible to insert your own XPath to get data from drop down menus and data that is in tabs on a webpage. Put simply, an XPath is a path, a set of directions to a certain part of the HTML of a webpage.
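To show what such a “set of directions” looks like, here is a short sketch using Python’s standard `xml.etree.ElementTree` module, which supports a limited subset of XPath. The page fragment, the class names and the expression are all made up for this example; they are not what Import.io generates.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment. Real web pages are often not
# valid XML and usually need an HTML-aware parser first.
PAGE = """
<html>
  <body>
    <div id="tabs">
      <ul class="products">
        <li><span class="name">Widget</span><span class="price">9.99</span></li>
        <li><span class="name">Gadget</span><span class="price">24.50</span></li>
      </ul>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# This expression reads as: "any span whose class is 'price',
# anywhere below the root of the document".
prices = [node.text for node in root.findall(".//span[@class='price']")]

print(prices)
```

Each match is still raw text at this point, which is where a regular expression or a type conversion would typically take over.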
Before you can store, analyze or visualize your data, you’ve got to have some. Data extraction is all about taking something that is unstructured, like a webpage, and turning it into a structured table. Once you’ve got it structured, you can manipulate it in all sorts of ways, using the tools we’ve covered, to find insights.
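The idea of turning a webpage into a structured table can be sketched with Python’s standard `html.parser` module. The HTML fragment and the `TableExtractor` class are hypothetical and deliberately simple; real pages are messier, which is exactly why dedicated extraction tools exist.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a real webpage.
HTML = """
<table>
  <tr><th>Tool</th><th>Category</th></tr>
  <tr><td>Hadoop</td><td>Storage</td></tr>
  <tr><td>Tableau</td><td>Visualization</td></tr>
</table>
"""


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only text inside a cell is kept; whitespace between tags is ignored.
        if self._in_cell:
            self._row[-1] += data.strip()


parser = TableExtractor()
parser.feed(HTML)

# Use the first row as column names and the rest as records.
header, *records = parser.rows
table = [dict(zip(header, record)) for record in records]
print(table)
```

Once the data is in this shape, a list of records with named columns, it can be handed to any of the storage, cleaning or visualization tools covered above.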
Import.io is the number one tool for data extraction. Import.io enables users to convert websites into structured, machine readable data with no coding required. Using a simple point and click UI, we take a webpage and transform it into an easy to use spreadsheet that you can then analyze, visualize, and use to make data-driven decisions. Features include Authenticated Extractions behind a login, flexible scheduling, and fully documented public APIs. Customers use the data for machine learning, market and academic research, lead generation, app development, and price monitoring.