There are thousands of Big Data tools out there, all of them promising to save you time and money and to help you uncover never-before-seen business insights. And while all that may be true, navigating the sheer number of options can be tricky.
Which one is right for your skill set?
Which one is right for your project?
To save you some time and help you pick the right tool the first time, we’ve compiled a list of a few of our favorite data tools in the areas of extraction, storage, cleaning, mining, visualizing, analyzing and integrating.
Data Storage and Management
If you’re going to be working with Big Data, you need to be thinking about how you store it. Part of how Big Data got the distinction as “Big” is that it became too much for traditional systems to handle. A good data storage provider should offer you an infrastructure on which to run all your other analytics tools as well as a place to store and query your data.
The name Hadoop has become synonymous with big data. It’s an open-source software framework for distributed storage of very large datasets on computer clusters. In practice, that means you can scale your data up and down without worrying about individual hardware failures. Hadoop provides massive amounts of storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs.
Hadoop is not for the data beginner. To truly harness its power, you really need to know Java. It might be a commitment, but Hadoop is certainly worth the effort – since tons of other companies and technologies run off of it or integrate with it.
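To get a feel for the style of computation Hadoop distributes across a cluster, here is a toy map-reduce word count run locally in Python. The documents are made-up sample data; real Hadoop jobs are typically written in Java and spread the map and reduce steps over many machines, but the shape of the computation is the same.

```python
from collections import Counter
from itertools import chain

# Tiny stand-in corpus; in Hadoop each document would live on a cluster node.
documents = ["big data big insights", "big tools"]

# Map step: each document emits (word, 1) pairs.
mapped = chain.from_iterable(
    ((word, 1) for word in doc.split()) for doc in documents
)

# Reduce step: sum the counts for each word.
counts = Counter()
for word, one in mapped:
    counts[word] += one

print(counts["big"])  # → 3
```

The point is that both steps are embarrassingly parallel, which is exactly what lets Hadoop scale them out.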
Getting started: Cloudera has some great Hadoop training courses.
Speaking of which, Cloudera is essentially an enterprise distribution of Hadoop with extra services added on. They can help your business build an enterprise data hub to give people in your organization better access to the data you are storing.
While it does have an open source element, Cloudera is mostly an enterprise solution to help businesses manage their Hadoop ecosystem. Essentially, they do a lot of the hard work of administering Hadoop for you. They will also deliver a certain amount of data security, which is highly important if you’re storing any sensitive or personal data.
MongoDB is the modern, start-up approach to databases. Think of it as an alternative to relational databases. It’s good for managing data that changes frequently, or data that is unstructured or semi-structured.
Common use cases include storing data for mobile apps, product catalogs, real-time personalization, content management and applications delivering a single view across multiple systems. Again, MongoDB is not for the data newbie. As with any database, you do need to know how to query it using a programming language.
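To illustrate why a document database suits semi-structured data, here is a plain-Python sketch of the document model: records in the same collection need not share a schema. The products are made-up sample data; in MongoDB itself the equivalent query would look like `db.products.find({"price": {"$lt": 13}})`.

```python
# Each "document" is a JSON-like record; unlike a relational row,
# documents in the same collection need not share the same fields.
products = [
    {"name": "T-shirt", "price": 15, "sizes": ["S", "M", "L"]},
    {"name": "Mug", "price": 8, "color": "blue"},       # no "sizes" field
    {"name": "Poster", "price": 12, "tags": ["art"]},   # no "color" field
]

def find(collection, predicate):
    """Crude stand-in for a document-database query."""
    return [doc for doc in collection if predicate(doc)]

cheap = find(products, lambda d: d["price"] < 13)
print([d["name"] for d in cheap])  # → ['Mug', 'Poster']
```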
Talend is another great open source company that offers a number of data products. Here we’re focusing on their Master Data Management (MDM) offering, which combines real-time data, applications, and process integration with embedded data quality and stewardship.
Because it’s open source, Talend is completely free, making it a good option no matter what stage of business you are in. And it saves you having to build and maintain your own data management system – which is a tremendously complex and difficult task.
Learn the ins and outs of how to create jobs, job components, and tasks in Talend’s data center.
Quantum, one of the oldest data storage companies in the U.S., has overhauled their business model for the internet era. Check out their robust tiered, scalable cloud storage plans for businesses that need automated, secure data management solutions. If you have complex data storage and management needs, their StorNext Data Management software can make these time-sensitive tasks easier to navigate.
Getting started: Contact the sales team to see if StorNext Data Management is a good fit for your company.
Start from the beginning
If you’re completely new to Big Data, databases might not be the best place to start. They are relatively complex and do require a certain amount of coding knowledge to operate (unlike many of the other tools mentioned below).
However, if you really want to work in or with Big Data, knowing the basics of databases and being able to talk intelligently about them is a must. This General Assembly Class is a great place to start. You’ll get a comprehensive overview of the technologies powering big data, including the history of databases and storage, the difference between relational and document databases, and the challenges of big data and the tools it necessitates, as well as an introduction to Hadoop.
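If you want a hands-on taste of the relational side before taking a class, Python ships with SQLite, which lets you try a fixed-schema database and SQL queries with nothing to install. The table and rows below are made-up sample data.

```python
import sqlite3

# An in-memory relational database: fixed schema, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (name, city) VALUES (?, ?)",
    [("Ada", "London"), ("Grace", "New York"), ("Alan", "London")],
)

# Asking a question of the data is a SELECT statement.
rows = conn.execute(
    "SELECT city, COUNT(*) FROM users GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # → [('London', 2), ('New York', 1)]
```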
Before you can really mine your data for insights, you need to clean it up. While it’s always good practice to create clean, well-structured data sets, that isn’t always possible. Data sets come in all shapes and sizes (some good, some not so good!), especially when you’re getting them from the web. The companies below will help you refine and reshape your data into a usable data set.
As far as data software goes, OpenRefine is pretty user-friendly, though a good knowledge of data cleaning principles certainly helps. The nice thing about OpenRefine is that it has a huge community with lots of contributors, meaning that the software is constantly getting better and better. And you can ask the (very helpful and patient) community questions if you get stuck. You can check out their Github repository where you can also find the OpenRefine wiki.
DataCleaner recognizes that data manipulation is a long and drawn-out task. Data visualization tools can only read nicely structured, “clean” data sets. DataCleaner does the hard work for you, transforming messy semi-structured data sets into clean, readable data sets that any visualization tool can read.
DataCleaner also offers data warehousing and data management services. The company offers a 30-day free trial, after which you’ll need to pay a monthly subscription fee. You can find out more about their plans on their website.
Price: Free trial; prices for other services and products vary.
Getting started: DataCleaner has a thorough set of documentation and videos. For their commercial plans, they also offer in-person or webinar training.
Trifacta’s Wrangler makes it easier for you to zip through multiple datasets from multiple sources. Spend less time searching for and cleaning data, and more time analyzing with their clean, easy-to-use interface. They’ve even made it easy for users to avoid making errors that will throw off insights, and to publish results. It also integrates with Tableau.
Price: Free download.
Getting started: Sift through the company’s comprehensive video library for tutorials and insights.
Read More: Brush up on data cleaning with this course from Johns Hopkins.
Not to be confused with data extraction (covered later), data mining is the process of discovering insights within a database as opposed to extracting data from web pages into databases. The aim of data mining is to make predictions and decisions on the data you have at hand.
With a hefty client list that includes PayPal, Deloitte, eBay and Cisco, RapidMiner is a fantastic tool for predictive analysis. It’s powerful, easy to use and has a great open source community behind it. You can even integrate your own specialized algorithms into RapidMiner through their APIs.
Their graphical interface (reminiscent of Yahoo! Pipes) means that you don’t need to know how to code, or have a PhD, to operate any of their four analytics products.
Price: Varies from free to $10,000/year, depending on your needs. A small package starts at $2,500/year.
Getting started: Check out the documentation, forum and support community to learn how to get started.
IBM SPSS Modeler
The IBM SPSS Modeler offers a whole suite of solutions dedicated to data mining. Its five products provide a range of advanced algorithms and techniques, including text analytics, entity analytics, decision management and optimization.
SPSS Modeler is a heavy-duty solution that is well suited for the needs of big companies. It can run on virtually any type of database and you can integrate it with other IBM SPSS products such as SPSS collaboration and deployment services, and the SPSS Analytic server.
Price: Starts at $4,670/year.
Getting started: Being IBM, the support documentation is second to none.
Oracle Data Mining
Another big hitter in the data mining sphere is Oracle. As part of their Advanced Analytics Database option, Oracle Data Mining allows users to discover insights, make predictions and leverage their Oracle data. You can build models to discover customer behavior, target your best customers and develop profiles.
The Oracle Data Miner GUI enables data analysts, business analysts and data scientists to work with data inside a database using a rather elegant drag-and-drop solution. It can also create SQL and PL/SQL scripts for automation, scheduling and deployment throughout the enterprise.
Teradata recognizes that, although big data is awesome, it’s worthless if you don’t know how to analyze and use it. Imagine having millions upon millions of data points without the skills to query them. That’s where Teradata comes in. They provide end-to-end solutions and services in data warehousing, big data and analytics, and marketing applications. All of this means you can truly become a data-driven business.
Teradata also offers a whole host of services including implementation, business consulting, training and support.
Written in R (more about that programming language in a moment), this open-source program can help you mine the data you need to build statistical software or run sophisticated data analysis. It also has plenty of modeling tools, including graphs, classification, clustering, and more.
Getting started: Read through their manuals for a play-by-play to set up and start mining.
If you’re after a specific type of data mining, there are a bunch of startups that specialize in helping businesses answer tough questions with data. If you’re worried about user churn, we recommend OptiMove, a startup that analyzes your customer data and tells you how to predict customer behavior. (Not a bad replacement for FramedData, which was recently acquired by Square.)
It’s a fully managed solution which means you don’t need to do anything but sit back and wait for insights.
Getting started: if you’re interested, the best thing to do is request a demo.
If you’re stuck on a data mining problem or want to try solving the world’s toughest problems, check out Kaggle. Kaggle is the world’s largest data science community. Companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models.
While data mining is all about sifting through your data in search of previously unrecognized patterns, data analysis is about breaking that data down and assessing the impact of those patterns over time. Analytics is about asking specific questions and finding the answers in data. You can even ask questions about what will happen in the future!
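At its simplest, that question-asking looks like aggregating data over time. Here is a minimal Python sketch over made-up quarterly sales data, answering “how do sales trend by quarter?” – the kind of query the platforms below run at scale.

```python
from collections import defaultdict

# Sample data: (quarter, sale amount) pairs.
sales = [("2016-Q1", 120), ("2016-Q1", 80), ("2016-Q2", 150), ("2016-Q2", 90)]

# Group-and-sum: total sales per quarter.
totals = defaultdict(int)
for quarter, amount in sales:
    totals[quarter] += amount

print(dict(totals))  # → {'2016-Q1': 200, '2016-Q2': 240}
```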
Qubole simplifies, speeds and scales big data analytics workloads against data stored on AWS, Google, or Azure clouds. They take the hassle out of infrastructure wrangling. Once the IT policies are in place, any number of data analysts can be set free to collaboratively “click to query” with the power of Hive, Spark, Presto and many others in a growing list of data processing engines.
Qubole is an enterprise-level solution. They offer a free trial that you can sign up for on their website. The flexibility of the platform really sets it apart, and it’s among the most accessible of the enterprise options.
Get a basic overview of how your data teams can use Qubole to automate tasks and make their lives easier.
BigML is attempting to simplify machine learning. They offer a powerful Machine Learning service with an easy-to-use interface for you to import your data and get predictions out of it. You can even use their models for predictive analytics.
A good understanding of modeling is certainly helpful, but not essential, if you want to get the most from BigML. They have a free version of the tool that lets you create tasks under 16MB, as well as a pay-as-you-go plan and a virtual private cloud option that meets enterprise-grade requirements.
Statwing takes data analysis to a new level, providing everything from beautiful visuals to complex analysis. They have a particularly cool blog post on NFL data! It’s so simple to use that you can actually get started with Statwing in under 5 minutes.
Although it isn’t free to use, the pricing plan is straightforward. The basic package is $50 a month, which you can cancel at any time, and allows you to use unlimited datasets of up to 50MB each. Enterprise plans let you upload bigger datasets.
Price: Starts at $50/month.
Getting started: There are lots of cool tutorial videos on their homepage.
Learn the ins and outs of Statwing’s multi-variable analysis platform.
Domo helps you put your data in one place, so you have access to the numbers you need to generate analysis, visualize changes, and make decisions. With plenty of visualization and collaboration tools, Domo is trying to make spreadsheets a thing of the past – great news for everyone who’s tired of tracking down and replicating data from multiple sources.
Getting started: Crack open Domo’s Resource Library for articles, webinars, and more.
With their easy-to-navigate interface, ThoughtSpot wants to make it simpler for the other members of your team to search and analyze data – without waiting ages for complex reports from data analysts.
By using the power of relational search – similar to the search technology behind Google – anyone can search for terms and instantly find the data they need. ThoughtSpot will even help a user visualize and share that data to inform decision making.
Price: Varies. Ask for a demo.
Getting started: Use ThoughtSpot’s resources, from webinars, videos, and white papers, to maximize their service.
Related Post: 5 Ways to Accelerate Data Wrangling
Data visualization companies will make your data come to life. Part of the challenge for any data scientist is conveying the insights from that data to the rest of your company. For most of your colleagues, MySQL databases and spreadsheets aren’t going to cut it. Visualizations are a bright and easy way to convey complex data insights. And the best part is that most of them require no coding whatsoever!
Tableau is a data visualization tool with a primary focus on business intelligence. You can create maps, bar charts, scatter plots and more without the need for programming. They recently released a web connector that allows you to connect to a database or API, thus giving you the ability to get live data in a visualization.
Tableau has five products available with varying degrees of support and functionality. If you’re new to vizzing (as they call it) we recommend Tableau Public, the free version of their visualization tooling. Exploring that tool should give you an idea of which of the other Tableau products you’d rather pay for.
Price: Starts at $35/month.
Getting started: Tableau has a lot of functionality, so definitely check out their tutorials before diving in.
Use the Get Started tutorial – including provided practice data sets – to learn the ins and outs of all their tools.
CartoDB is a data visualization tool that specializes in making maps. They make it easy for anyone to visualize location data – without the need for any coding. CartoDB can manage a myriad of data files and types; they even have sample data sets that you can play around with while you’re getting the hang of it.
If you have location data, CartoDB is definitely worth a look. It may not be the easiest system to use, but once you get the hang of it, it is incredibly powerful. They offer an enterprise package which allows for collaboration on projects as well as controlled access.
Learn how to import your geospatial data and get started with your map.
Chartio allows you to combine data sources and execute queries in-browser. You can create powerful dashboards in just a few clicks. Chartio’s visual query language lets anyone grab data from anywhere without having to know SQL or other complicated query languages. They also let you schedule PDF reports, so you can export and email your dashboard as a PDF file to anyone you want.
The other cool thing about Chartio is that it often doesn’t require a data warehouse. This means that you’re going to get up and running faster and that your cost of implementation is going to be lower and more predictable.
Use Chartio’s tutorials to build your chart and analyze your data.
If you want to build a graph, Plot.ly is the place to go. This handy platform allows you to create stunning 2D and 3D charts (you really need to see them to believe it!) – again, all without needing programming knowledge.
The free version allows you to create one private chart and unlimited public charts, or you can upgrade to an enterprise package to make unlimited private and public charts. You also get vector exports and the ability to save custom themes.
Getting started: You’ll find everything you need to get started in Plotly’s full range of tutorials.
Enter your data into Plot.ly to make stunning visualizations and charts of your data.
Datawrapper is an open source tool that creates embeddable charts in minutes. Because it’s open source, it will be constantly evolving, as anyone can contribute to it. They have an awesome chart gallery where you can check out the kind of stuff people are doing with Datawrapper.
Similar to many of the other companies in this section, it has both a free tool as well as a paid option, with the paid option being a pre-set up, customized package of Datawrapper.
Price: Between €19 and €499/month.
Getting started: Check out these awesome tutorials to get started with Datawrapper.
Upload data from Excel right into Datawrapper to make your chart.
With easy-to-use free trials, Ideata offers solutions for preparing, analyzing, and visualizing data. As long as you know the ins and outs of the data you need to research, Ideata makes the rest easy, with drag-and-drop tools that make compelling graphs and charts.
Price: Varies. Free trial.
Getting Started: Check out Ideata’s extensive library of video tutorials for more information.
One of the most powerful free visualization tools around, Google Charts allows you to create everything from a basic line chart or pie graph to a customizable tree map or timeline. Because the service is from Google, your charts are easy to publish on the web and sync with other services, like Google Maps. There’s a built-in data table tool to make it even easier to identify and sort data and customize the chart’s appearance.
Getting started: Check out Google’s extensive documentation on making and using charts.
Related Post: How to Choose the Right Visualization for Your Data
Data integration platforms are the glue between programs. If you want to connect the data you’ve extracted using Import.io with Twitter, or automatically share on Facebook the visualization you’ve made with Tableau or Silk, then the integration services below are the tools for you.
Blockspring is a unique program in that it harnesses the power of services such as IFTTT and Zapier inside familiar platforms like Excel and Google Sheets. You can connect to a whole host of third-party programs simply by writing a Google Sheets formula. You can post tweets from a spreadsheet, see who your followers are following, and connect to AWS, Import.io and Tableau, to name a few.
Blockspring is free to use, but they also have an organization package that allows you to create and share private functions, add custom tags for easy search and discovery and set API tokens for your whole organization at once.
Getting started: Blockspring has some really good help documentation to get you up and running.
Pentaho offers big data integration with little to no coding required. Using a simple drag-and-drop UI, you can integrate a number of tools. They also offer embedded analytics and business analytics services.
Pentaho is an enterprise solution. You can request a free trial of the data integration product, after which a payment will be required.
Price: Free trial; varied pricing depending on needs.
Getting started: You can check out the help documentation to get a better feel for how it works.
Stitch’s team started in analytics, so the service is compatible with analytics tools like Tableau, too. They also provide support around identifying the right data warehouse for your business.
Try it out for free for 14 days, and inquire about their customized packages if you’re hooked.
Price: Varies; up to $1,000/month.
Getting started: Check out these in-depth tutorials for managing your data pipeline.
Write your own script with Stitch’s open-source ETL, Singer.
Depend on Salesforce to research, track, and interact with leads and customers? Use Magento to track sales data? The Magento Salesforce Integration will help you automate tasks to integrate data and eliminate repetition.
Set up customized automation triggers using the simple “if:then” logic, so you can update and track all of your data instantly across both platforms.
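The “if:then” pattern itself is simple enough to sketch in a few lines of Python. Everything below – the trigger function, the lead fields, the action name – is hypothetical, purely to show the shape of the logic, not the integration’s actual API.

```python
# Hypothetical "if:then" automation trigger, sketched in Python.
def on_new_lead(lead, actions):
    # if: a new lead came from the online store...
    if lead.get("source") == "magento":
        # ...then: queue a matching record on the other platform.
        actions.append(("create_order_record", lead["email"]))

actions = []
on_new_lead({"source": "magento", "email": "a@b.com"}, actions)
print(actions)  # → [('create_order_record', 'a@b.com')]
```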
Getting started: View their online demo or check out their documentation to make Magento work for you.
Informatica does a lot more than integrate your data into an easy-to-use dashboard. It also offers management, analytics, and administration services. This means you can search, organize, and map data from common platforms like Salesforce, Amazon Web Services, and Google Cloud – all in one place.
Their point-and-click interface eliminates the need for complex code, making it easier for multiple users to get the data they need. Data integration developers and IT can also instantly share data to increase efficiency and minimize wait times.
Price: Free trial. Varies by service.
Getting started: Since Informatica offers a whole host of products, it’s only right they should have their own “university” to get you up and running.
There will be times in your data career when a tool simply won’t cut it. While today’s tools are becoming more powerful and easier to use, sometimes it is just better to code it yourself. Even if you’re not a programmer, understanding the basics of how these languages work will give you a better understanding of how many of these tools function and how best to use them.
R is a language for statistical computing and graphics. If the data mining and statistical software listed above doesn’t quite do what you want it to, learning R is the way forward. In fact, if you’re planning on being a data scientist, knowing R is a requirement.
It runs on Linux, Windows and macOS, and you can download it from CRAN. There is a huge community of statisticians using R, and its popularity keeps growing.
Getting started: Once downloaded, you can check out the documentation.
Another language that is gaining popularity in the data community is Python. Created in the late 1980s and named after Monty Python’s Flying Circus, it has consistently ranked in the top ten most popular programming languages in the world. Many journalists use Python to write custom scrapers when data collection tools fail to get the data that they need.
People like it because of the similarities with the English language. It uses words such as ‘if’ and ‘in’ meaning you can read a script very easily. It offers a wide range of libraries designed for different types of tasks.
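That readability is easy to demonstrate with a few lines over made-up temperature data; even without knowing Python, you can probably read what this does.

```python
# 'if' and 'in' make the intent read almost like English.
temperatures = [71, 65, 80, 68, 74]  # sample data

hot_days = [t for t in temperatures if t > 70]

if 80 in temperatures:
    print("At least one very hot day")

print(hot_days)  # → [71, 80, 74]
```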
Getting started: Check out the homepage to learn more about Python.
RegEx, or regular expressions, are sequences of characters that define search patterns, used mainly for matching and manipulating strings. At Import.io, you can use RegEx while extracting data to delete parts of a string or keep particular parts of a string.
It is an incredibly useful tool to use when doing data extraction, as you can get exactly what you want when you extract data (meaning you don’t need to rely on those data manipulation companies mentioned above)!
Getting started: There are many cool tutorials for RegEx online.
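Here is a small example using Python’s built-in `re` module on a made-up price string: one pattern keeps just the part you want, and another deletes the characters you don’t.

```python
import re

price_text = "Price: $1,299.00 (was $1,499.00)"  # sample data

# Keep: capture the digits, commas and cents after the first '$'.
match = re.search(r"\$([\d,]+\.\d{2})", price_text)
print(match.group(1))  # → 1,299.00

# Delete: strip every character that isn't a digit or a dot.
print(float(re.sub(r"[^\d.]", "", match.group(1))))  # → 1299.0
```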
XPath is most commonly used in data extraction. Import.io actually automatically creates XPaths every time you click on a piece of data – you just don’t see them! It is also possible to insert your own XPath to get data from drop-down menus and data that is in tabs on a webpage. Put simply, an XPath is a path, a set of directions to a certain part of the HTML of a webpage.
Getting started: The best XPath tutorial is the w3schools tutorial.
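You can try XPath without any tooling using Python’s built-in `xml.etree.ElementTree`, which supports a limited XPath subset. The markup below is a made-up product list; the path reads as “every span with class 'price', anywhere under the root.”

```python
import xml.etree.ElementTree as ET

html = """
<ul>
  <li><span class="name">Mug</span><span class="price">8</span></li>
  <li><span class="name">Poster</span><span class="price">12</span></li>
</ul>
"""

root = ET.fromstring(html)
# An XPath is a set of directions through the markup tree.
prices = [el.text for el in root.findall(".//span[@class='price']")]
print(prices)  # → ['8', '12']
```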
Before you can store, analyze or visualize your data, you’ve got to have some. Data extraction is all about taking something that is unstructured, like a webpage, and turning it into a structured table. Once you’ve got it structured, you can manipulate it in all sorts of ways, using the tools we’ve covered, to find insights.
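In miniature, that unstructured-to-structured step looks like this: Python’s built-in `html.parser` walks a made-up HTML snippet and emits (name, price) rows. Tools like Import.io do the same job at scale, without the code.

```python
from html.parser import HTMLParser

# Toy webpage snippet (sample data) to be turned into structured rows.
page = '<div class="item">Mug $8</div><div class="item">Poster $12</div>'

class ItemParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_data(self, data):
        # Split each "Name $price" text node into a structured row.
        if "$" in data:
            name, price = data.rsplit("$", 1)
            self.rows.append((name.strip(), float(price)))

parser = ItemParser()
parser.feed(page)
print(parser.rows)  # → [('Mug', 8.0), ('Poster', 12.0)]
```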
Import.io is the number one tool for data extraction. Import.io enables users to convert websites into structured, machine-readable data with no coding required. Using a simple point-and-click UI, we take a webpage and transform it into an easy-to-use spreadsheet that you can then analyze, visualize, and use to make data-driven decisions. Features include authenticated extractions behind a login, flexible scheduling, and fully documented public APIs. Customers use the data for machine learning, market and academic research, lead generation, app development, and price monitoring.
Getting started: Check out our knowledge base to learn exactly how to use the tool or contact our data experts to get a tailored data solution for your business.
Hope you enjoyed our round-up of big data tools!