Originally posted in 2015, this article was updated on April 18th, 2018.
There are thousands of Big Data tools out there. They all promise to save you time and money and to help you uncover never-before-seen business insights. And while this may all be true, the sheer number of tools can make it tricky to navigate your options. In addition, if you don’t know what differentiates each of the big data tools, picking one becomes an even bigger challenge. To discover which big data tool is right for you, answer these two questions:
Which one is right for your skill set?
Which one is right for your project?
Choosing the right tool the first time around will save you time and reduce hiccups, and this decision doesn’t have to be made blindly. Keep in mind, there is no “best” big data platform. Each of these programs caters to different needs, so it is important that you choose the big data tool that best fits your situation (the above questions can help with this). To make your choice easier, we’ve compiled some common big data tools that are used to improve extraction, storage, cleaning, mining, visualization, analysis and integration processes. Check them out below.
Data Storage and Management
If you’re going to be working with Big Data, you need to think about how you store it. Part of how Big Data got classified as “Big” is that it became too much for traditional systems to handle. What once required gigabytes now scales to terabytes and beyond. Big data analytics tools are the programs that make gathering and extracting insights from big data easier. A good data storage provider should offer you an infrastructure to run all of your various big data tools, as well as a place to store, query, and analyze your data.
The name Hadoop has become synonymous with big data. Hadoop is an open-source big data analytics software framework, used for distributed storage of very large datasets on computer clusters. All that means you can scale your data up and down without having to worry about hardware failures. Hadoop provides massive amounts of storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Hadoop is not for the data beginner. To truly harness the software’s power, you really need to know Java. It might be a commitment, but Hadoop is certainly worth the effort – since tons of other companies and technologies run or integrate it. Getting familiar with the Hadoop ecosystem can prove valuable in more ways than one.
Get started with Hadoop: Cloudera has some great Hadoop training courses.
Speaking of which, Cloudera is essentially a brand name for Hadoop with some extra services added on. They can help your business build an enterprise data hub that gives people in your organization better access to stored data.
While it does have an open source element, Cloudera is mostly an enterprise solution to help businesses manage their Hadoop ecosystem. Essentially, they do a lot of the hard work of administering Hadoop for you. They will also deliver a certain amount of data security, which is highly important if you’re storing any sensitive or personal data.
Get started: Cloudera has a long list of webinars covering all different types of use.
MongoDB is the modern, start-up approach to databases. Think of it as an alternative to relational databases. It’s good for managing data that changes frequently or data that is unstructured or semi-structured.
Common use cases include storing data for mobile apps, product catalogs, real-time personalization, content management, and applications that deliver a single view across multiple systems. Again, MongoDB is not for the data newbie. As with any database, you do need to know how to query it using a programming language.
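To make the idea of semi-structured data more concrete, here is a minimal sketch in plain Python. It does not use a MongoDB server or driver. The product records and the `find` helper are hypothetical illustrations: the helper only mimics the idea of matching documents against a filter and is not MongoDB’s actual API.

```python
# Hypothetical product catalog: each record is a JSON-like document,
# and documents in the same collection may have different fields.
products = [
    {"name": "Laptop", "price": 899, "specs": {"ram_gb": 16}},
    {"name": "T-shirt", "price": 15, "sizes": ["S", "M", "L"]},
    {"name": "Novel", "price": 12, "author": "A. Writer"},
]


def find(collection, criteria):
    """Return documents whose top-level fields equal every value in criteria."""
    return [
        doc for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


cheap_shirts = find(products, {"name": "T-shirt", "price": 15})
print(cheap_shirts)
```

Note that no shared schema is declared anywhere: the laptop has a nested `specs` field, the T-shirt has a list of sizes, and both live in the same collection. A relational table would need a fixed set of columns for all three.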
Getting started: MongoDB has their own “University” where you can learn to use their services and even get a certification.
Talend is another great open source company that offers a number of data products. Here we’re focusing on their Master Data Management (MDM) offering, which combines real-time data, applications, and process integration with embedded data quality and stewardship.
Because it’s open source, Talend is completely free, making it a good option no matter what stage of business you are in. It also saves you from having to build and maintain your own data management system, which is a tremendously complex and difficult task.
Get started: Talend has a good (if slightly 90’s looking) set of tutorials for getting started.
Start from the beginning
Whether you’re an enterprise-sized company or a small business, if you’re completely new to Big Data, databases might not be the best place to start. They are relatively complex and do require a certain amount of coding knowledge to operate (unlike many of the other tools mentioned below).
However, if you really want to work in or with Big Data analytics, knowing the basics of databases and being able to talk intelligently about them is a must. This General Assembly Class is a great place to start. You’ll get a comprehensive overview of the technologies powering big data, including the history of databases and storage, the difference between relational and document databases, the challenges of big data, the tools it necessitates, and an introduction to Hadoop. With this knowledge at your disposal, you’ll be able to tackle other challenges and move on to learning about other big data platforms, data analytics software, big data systems and more.
Big Data Cleaning Tools
Before you can really mine your business data for insights, you need to clean it up. When first collected, data can appear quite disorganized and difficult to interpret. In other words, it’s messy. Even though it’s always good practice to create a clean, well-structured data set, this isn’t always possible, regardless of the types of big data you have. Data sets can come in all shapes and sizes (some good, some not so good!), especially when you’re getting them from the web. The companies below will help you refine and reshape the data into a data set your business can use.
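As a rough illustration of what “cleaning” means in practice, here is a minimal Python sketch. The sample records and the helper names `clean_row` and `clean_rows` are hypothetical, and the rules shown (trimming whitespace, normalizing capitalization, dropping duplicates) are only a few of the steps a dedicated cleaning tool would handle.

```python
# Hypothetical messy records, as they might arrive from a web form.
raw_rows = [
    {"name": "  Alice Smith ", "city": "new york"},
    {"name": "alice smith", "city": "New York "},
    {"name": "Bob Jones", "city": ""},
]


def clean_row(row):
    """Trim whitespace, normalize capitalization, and mark empty values as None."""
    cleaned = {}
    for key, value in row.items():
        value = " ".join(value.split())  # collapse and trim whitespace
        cleaned[key] = value.title() if value else None
    return cleaned


def clean_rows(rows):
    """Clean every row and drop exact duplicates, keeping first occurrences."""
    seen = set()
    result = []
    for row in rows:
        cleaned = clean_row(row)
        # Sort by field name only, so None values are never compared.
        fingerprint = tuple(sorted(cleaned.items(), key=lambda item: item[0]))
        if fingerprint not in seen:
            seen.add(fingerprint)
            result.append(cleaned)
    return result


clean = clean_rows(raw_rows)
print(clean)
```

The first two records only turn out to be duplicates after normalization, which is why cleaning usually has to happen before any counting or analysis.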
OpenRefine (formerly Google Refine) is an open source big data tool that is dedicated to cleaning messy data. You can explore huge data sets quickly and easily, even if the data from your business is a little unstructured.
As far as big data analytics software goes, OpenRefine is pretty user-friendly, though a good knowledge of data cleaning principles certainly helps you get the most out of it. The nice thing about OpenRefine is that it has a huge community with lots of contributors. This means that the software is constantly improving, and the helpful, patient community can answer questions whenever you get stuck. You can check out their GitHub repository, where you can also find the OpenRefine wiki.
DataCleaner recognizes that data manipulation is a long, drawn-out task. Data visualization tools can only read nicely structured, “clean” data sets. DataCleaner does the hard work for you and transforms messy, semi-structured data sets into clean, readable data sets that all of the visualization companies can read.
DataCleaner also offers data warehousing and data management services for your business. The company offers a 30-day free trial, followed by a monthly subscription fee. You can find out more about their plans here.
Getting started: DataCleaner has a thorough set of documentation and videos. For their commercial plans, they also offer in-person or webinar training.
Not to be confused with data extraction (which will be covered later), data mining is the process of discovering insights within a database as opposed to extracting data from web pages into databases. The aim of data mining is to make predictions and decisions on the data your business has at hand.
IBM SPSS Modeler
IBM SPSS Modeler offers a whole suite of solutions dedicated to data mining. Its five products provide a range of advanced algorithms and techniques that include text analytics, entity analytics, decision management and optimization.
SPSS Modeler is a heavy-duty solution that is well suited to the needs of big businesses. It can run on virtually any type of database, and you can integrate it with other IBM SPSS products such as SPSS Collaboration and Deployment Services and SPSS Analytic Server.
Oracle Data Mining
Another big hitter in the data mining sphere is Oracle. As part of their Advanced Analytics Database option, Oracle Data Mining allows its users to discover insights, make predictions and leverage their Oracle data. You can build models to discover customer behavior, target your best customers and develop profiles.
The Oracle Data Miner GUI enables data analysts, business analysts and data scientists to work with data inside a database using a rather elegant drag-and-drop solution. It can also create SQL and PL/SQL scripts for automation, scheduling and deployment throughout the enterprise.
Getting started: For all the resources you could ever need, head to their support page.
Teradata recognizes the fact that, although big data is awesome, if you don’t actually know how to analyze and use it, it’s worthless. Imagine having millions upon millions of data points without the skills to query them. That’s where Teradata comes in. They provide end-to-end solutions and services in data warehousing, big data and analytics and marketing applications. This all means that you can truly become a data-driven business, which will lead to new levels of success.
Teradata also offers a whole host of services including implementation, business consulting, training, and support.
Getting started: Have a look at their support documentation.
If you’re stuck on a data mining problem or want to try solving the world’s toughest problems, check out Kaggle. Kaggle is the world’s largest data science community. Companies and researchers post their data, and statisticians and data miners from all over the world compete to produce the best models.
While data mining is all about sifting through your data in search of previously unrecognized patterns, data analysis is about breaking that data down and assessing the impact of those patterns over time. Analytics is about asking specific questions and finding the answers in big data. You can even ask questions about what will happen in the future! This is the step where big data shows just how valuable it can be.
Qubole is a big data platform that simplifies, speeds up and scales big data analytics workloads against data stored on AWS, Google, or Azure clouds. They take the hassle out of infrastructure wrangling. Once the IT policies are in place, any number of data analysts can be set free to collaboratively “click to query” with the power of Hive, Spark, Presto and many others in a growing list of data processing engines.
Qubole is an enterprise-level solution. They offer a free trial that you can sign up for at this page. The flexibility of the program really does set it apart from the rest, and it is also the most accessible of the big data platforms.
Getting started: Learn more about Qubole on their resources page.
BigML is attempting to simplify machine learning. They offer a powerful Machine Learning service with an easy-to-use interface for you to import your data and get predictions out of it. You can even use their models for predictive analytics.
A good understanding of modeling is certainly helpful, but not essential, if you want to get the most from BigML. They have a free version of the tool that allows you to create tasks under 16 MB, as well as a pay-as-you-go plan with a virtual private cloud that meets enterprise-grade requirements.
Getting started: You can learn how BigML works in this quick four-video series.
Statwing takes data analysis to a new level, providing everything from beautiful visuals to complex analysis. They have a particularly cool blog post on NFL data! It’s so simple to use that you can actually get started with Statwing in under 5 minutes.
Although it isn’t free to use, the pricing plan is rather elegant. The basic package is $50 a month, which you can cancel at any time. This allows you to use unlimited datasets of up to 50 MB each. There are other enterprise plans as well that give you the ability to upload bigger datasets.
Getting started: There are lots of cool tutorial videos on their homepage.
Data visualization companies will make your data come to life. Part of the challenge for any data scientist is conveying the insights from that data to the rest of their business. For most of your colleagues, MySQL databases and spreadsheets aren’t going to cut it. Visualizations are a bright and easy way to convey complex data insights. And the best part is that most of them require no coding whatsoever!
Tableau is a data visualization tool with a primary focus on business intelligence. You can create maps, bar charts, scatter plots, and more without the need for programming. They recently released a web connector that allows you to connect to a database or API, giving you the ability to get live data into a visualization.
Tableau has five products available with varying degrees of support and functionality. If you’re new to vizzing (as they call it) we recommend Tableau Public, the free version of their visualization tooling. Exploring that tool should give you an idea of which of the other Tableau products you’d rather pay for.
Get started: Tableau has a lot of functionality, so definitely check out their tutorials before diving in.
CartoDB is a data visualization tool that specializes in making maps. They make it easy for anyone to visualize location data without the need for any coding. CartoDB can manage a myriad of data files and types, and it even has sample datasets that you can play around with while you’re getting the hang of it.
If you have location data, CartoDB is definitely worth taking a look at. It may not be the easiest system to use, but it is incredibly powerful once you get the hang of it. They offer an enterprise package which allows for collaboration on projects as well as controlled access.
Getting started: They have an extensive library of documentation to help you become a mapping expert.
Chartio allows you to combine data sources and execute queries in-browser. You can create powerful dashboards in just a few clicks. Chartio’s visual query language allows anyone to grab data from anywhere, without having to know SQL or other complicated model languages. They also let you schedule PDF reports so you can export and email your dashboard as a PDF file to anyone you want.
The other cool thing about Chartio is that it often doesn’t require a data warehouse. This means that you’ll get up and running faster, and your cost of implementation is going to be both lower and more predictable.
Getting started: Check out the Chartio tutorials to get started.
If you want to build a graph, Plot.ly is the place to go. This handy platform allows you to create stunning 2D and 3D charts (you really need to see it to believe it!), again without needing programming knowledge.
The free version allows you to create one private chart and unlimited public charts. Alternatively, you can upgrade to the enterprise packages to make unlimited private and public charts, with the added options of vector exports and saving custom themes.
Getting started: You’ll find everything you need to get started in Plotly’s full range of tutorials.
Our final visualization tool is Datawrapper. It’s an open source tool that creates embeddable charts in minutes. Because it’s open source, it will be constantly evolving as anyone can contribute to it. They have an awesome chart gallery where you can check out the kind of stuff people are doing with Datawrapper.
Similar to many of the other companies in this section, it has both a free tool and a paid option, with the paid option being a preinstalled, customized package of Datawrapper.
Data integration platforms are the glue between each program. Whether you want to connect the data you’ve extracted using Import.io with Twitter, or you want to automatically share your Tableau visualization on Facebook, the integration services below are the tools for you.
Blockspring is unique in the way it harnesses the power of services such as IFTTT and Zapier inside familiar platforms such as Excel and Google Sheets. You can connect to a whole host of third-party programs by simply writing a Google Sheets formula. You can post Tweets from a spreadsheet, see who your followers are following, and connect to AWS, Import.io and Tableau, to name a few.
Blockspring is free to use, but they also have an organization package that allows you to create and share private functions, add custom tags for easy search and discovery, and set API tokens for your whole organization at once. It’s a great fit for any enterprise.
Getting started: Blockspring has some really good help documentation to get you up and running.
Pentaho offers big data integration with little to no coding required. Using a simple drag-and-drop UI, you can integrate a number of tools. They also offer embedded analytics and business analytics services.
Getting started: You can check out the help documentation to get a better feel for how it works.
There will be times in your data career when a tool simply won’t cut it. While today’s big data tools are becoming more powerful and easier to use, sometimes it is just better to code it yourself. Even if you’re not a programmer, understanding the basics of how these languages work will give you a better understanding of how many of these tools function and how best to use them. Here are some of the most common languages associated with big data analytics software.
R is a language for statistical computing and graphics. If the data mining and statistical software listed above doesn’t quite do what you want it to, learning R is the way forward. In fact, if you’re planning on being a data scientist, knowing R is a requirement.
It runs on Linux, Windows and macOS, and you can download R at this page. There is a huge community of statisticians using R nowadays, and its popularity keeps growing.
Getting started: Once downloaded, you can check out the documentation to learn more.
Another language that is gaining popularity in the data community is Python. Conceived in the late 1980s, first released in 1991 and named after Monty Python’s Flying Circus, it has consistently ranked in the top ten most popular programming languages in the world. Many journalists use Python to write custom scrapers if data collection tools fail to get the data that they need.
People like it because of the similarities with the English language. It uses words such as ‘if’ and ‘in’ meaning you can read a script very easily. It offers a wide range of libraries designed for different types of tasks.
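To show how close Python can read to plain English, here is a small sketch. The list of tools and the set of already known tools are made-up sample data; only the keywords `for`, `if`, `not` and `in` are the point of the example.

```python
# Hypothetical list of tools and a set of ones we already know.
tools = ["Hadoop", "MongoDB", "Tableau", "OpenRefine"]
known = {"MongoDB", "Tableau"}

to_learn = []
for tool in tools:
    if tool not in known:
        to_learn.append(tool)

print(to_learn)
```

Read aloud, the loop says almost exactly what it does: for each tool in the list, if the tool is not in the known set, add it to the list of things to learn.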
Getting started: Check out the homepage to learn more about Python.
RegEx, short for regular expression, is a sequence of characters that defines a search pattern. It is used mainly for pattern matching within strings, and the matches can then be used to manipulate and change data. At Import.io, you can use RegEx while extracting data to delete parts of a string or keep particular parts of a string.
It is an incredibly useful tool when doing data extraction, as you don’t need to rely on those data manipulation companies mentioned above. Instead, you get exactly what you want when extracting data.
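The two operations mentioned above, keeping a particular part of a string and deleting part of a string, can be sketched with Python’s standard `re` module. The sample string and both patterns are assumptions made up for illustration; the exact RegEx syntax a given extraction tool accepts may differ slightly.

```python
import re

# Hypothetical raw string, as it might come out of a web page.
raw_price = "Price: $1,299.00 (incl. tax)"

# Keep a particular part: capture only the numeric amount after the "$".
match = re.search(r"\$([\d,]+\.\d{2})", raw_price)
amount = match.group(1) if match else None

# Delete part of a string: strip the parenthesized note and the space before it.
without_note = re.sub(r"\s*\(.*?\)", "", raw_price)

print(amount)
print(without_note)
```

The parentheses in the first pattern form a capture group, which is how a single match can be narrowed down to just the part you want to keep.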
Getting started: There are many cool tutorials to help you learn RegEx online.
XPath is a query language used for selecting certain nodes from an XML document. Whereas RegEx manipulates and changes the data makeup, XPath will extract the raw data ready for RegEx.
XPath is most commonly used in data extraction. Import.io actually automatically creates XPaths every time you click on a piece of data – you just don’t see them! It is also possible to insert your own XPath to get data from drop-down menus and data that is in tabs on a webpage. Put simply, an XPath is a path, a set of directions to a certain part of the HTML of a webpage.
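Here is a minimal sketch of an XPath acting as “a set of directions”, using Python’s standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath. The page fragment, class names and values are hypothetical, and this is not how Import.io generates or runs its own XPaths.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment. Real-world HTML is often not
# valid XML and would need an HTML-aware parser instead.
page = """
<html>
  <body>
    <div class="product">
      <span class="name">Widget</span>
      <span class="price">9.99</span>
    </div>
    <div class="product">
      <span class="name">Gadget</span>
      <span class="price">24.50</span>
    </div>
  </body>
</html>
"""

root = ET.fromstring(page.strip())

# The path reads as directions: any div with class "product",
# then the span with class "price" directly inside it.
prices = [
    node.text
    for node in root.findall(".//div[@class='product']/span[@class='price']")
]

print(prices)
```

The values come back as raw text, which is where a regular expression or a type conversion would typically take over.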
Before you can store, analyze, or visualize your data, you’ve got to have some. Data collection is the process of gathering relevant unstructured information, to then be followed up by data extraction, allowing you to turn this data into a structured table. Once it has been structured, you can then manipulate it in all sorts of ways, using the tools we’ve covered to find insights.
Import.io is the number one tool for data extraction. Import.io enables users to convert websites into structured, machine-readable data with no coding required. Using a simple point-and-click UI, we take a webpage and transform it into an easy-to-use spreadsheet that you can then analyze, visualize, and use to make data-driven decisions. Features include authenticated extractions behind a login, flexible scheduling, and fully documented public APIs. Customers use the data for machine learning, market and academic research, lead generation, app development, and price monitoring.