We recently wrote a post introducing Meltano, an open source tool that will help data teams version control everything from raw data to visualization. We were blown away by the level of interest it received, including all sorts of comments on Hacker News that gave us a ton of feedback we’re excited to wrestle with and work towards. Special thanks to commenter slap_shot, whose comments prompted us to hop on YouTube for a live conversation. We learned that in real life, slap_shot is a data and analytics engineer and founder named Brett, and you can watch our live chat here:
Brett told us something we suspected after our own experience of assembling our analytics stack – that pretty much every data team he encounters is using a "multitude of internal processes that are broken and cobbled together for data integration, or they're not comfortable with the pricing and sales process for some of these products."
When we started researching tools for our team, the goal was to use only open source. Unfortunately, the best open source tools we could find weren't up to the task for us, and changing the code proved cumbersome due to licensing issues. We settled on Looker, a fantastic (but proprietary) solution for visualization, and reluctantly began building out other parts ourselves. Brett told us the idea of an open source version of Looker could be really promising, since Looker is too expensive for many teams, including, to some extent, our own. We think it doesn't make sense to build a dashboard and not be able to share it with the whole team.
Sid shares, "We spent months assembling our data pipeline… but all these choices were so hard, and I think there's room for a convention over configuration framework, where you type in your Salesforce API keys and you get the proper Salesforce graphs. We want to get as close as possible to that experience."
Issues and next steps
- The Meltano team is building a set of very common core extractors, including Salesforce, Marketo, Zendesk, etc. This way we hope to support a few of the most important sources out of the box and deliver substantial value from the start. Then, because Meltano is open source, we hope others will contribute and increase the breadth of support.
- The data team is going to try to apply Meltano to a machine learning project, probably around predicting the probability of winning a sales opportunity, so we can incorporate any requirements specific to ML.
Give me the short and sweet version – what does Meltano do?
Meltano helps companies consolidate, organize, and analyze their data to make better business decisions.
Can the BI tool and integration library be used outside of GitLab?
We're not sure yet. For now, the integration part (which we call orchestration) is GitLab CI-based. We recently had the idea to have a frontend "production mode," where you can at least see everything, and maybe we'll have a "development mode" where you can run different pipelines inside a Python Flask app.
Embulk and Singer built the core foundation, and they allow people to build their own integrations. Do we envision a similar model?
Yes. Right now we are prioritizing getting the architecture and tooling correct, to make it easy for us and others to build additional extractors.
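To make the plug-in idea a little more concrete, here is a minimal Python sketch of what a shared contract between extractors and loaders could look like. This is not Meltano's actual API: `Extractor`, `Loader`, `StaticExtractor`, `InMemoryLoader`, and `run_pipeline` are all hypothetical names invented for illustration, and the canned rows stand in for a real source such as Salesforce.

```python
from typing import Any, Dict, Iterable, Iterator, List, Protocol

Record = Dict[str, Any]


class Extractor(Protocol):
    """Hypothetical contract: anything that can yield records from a source."""

    def extract(self) -> Iterator[Record]: ...


class Loader(Protocol):
    """Hypothetical contract: anything that can persist records."""

    def load(self, records: Iterable[Record]) -> int: ...


class StaticExtractor:
    """Stand-in for a real source; yields the canned rows it was given."""

    def __init__(self, rows: List[Record]) -> None:
        self._rows = rows

    def extract(self) -> Iterator[Record]:
        yield from self._rows


class InMemoryLoader:
    """Stand-in for a warehouse loader; keeps rows in a plain list."""

    def __init__(self) -> None:
        self.rows: List[Record] = []

    def load(self, records: Iterable[Record]) -> int:
        before = len(self.rows)
        self.rows.extend(records)
        return len(self.rows) - before  # number of rows added by this call


def run_pipeline(extractor: Extractor, loader: Loader) -> int:
    """Move every record from the extractor to the loader; return the count."""
    return loader.load(extractor.extract())


# Usage: any object with a matching extract() or load() method plugs in.
source = StaticExtractor([{"id": 1, "stage": "won"}, {"id": 2, "stage": "open"}])
warehouse = InMemoryLoader()
loaded = run_pipeline(source, warehouse)
```

The point of the sketch is that a contributor would only need to implement one small method to add a new source, which is the kind of low barrier we are aiming for.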
What's the vision for the monorepo and what are the benefits?
We consolidated all of the code for Meltano in a single project, to make it easier to develop and contribute to. We then provide two Meltano Docker images, similar to Jupyter notebook layering: a standard image which contains all of the default extractors and loaders, as well as a base image so users can customize it to contain only what they need.
meltano/analytics is both a prototypical Meltano implementation and the repo for GitLab Analytics.
Would I have to use Meltano for everything?
No! We know teams have different needs and preferences, so you would be able to pick and choose the features that you use.
I'd like to see GitLab CI have a clean API for others to plug into. Do you see that happening?
The Data team is committed to using GitLab CI as our orchestration platform. Airflow is state of the art right now, but we think we can have similar or better features within CI. If appropriate, the Meltano team will contribute back to CI to make it better too. Some features we're excited about would be better statistics across jobs, sub-pipelines and directed acyclic graphs of jobs, and intelligent data backfill support.
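For readers unfamiliar with the term, a directed acyclic graph of jobs simply means each job declares which jobs must finish before it can start. The Python sketch below illustrates the idea with the standard library's `graphlib` module (Python 3.9+). The job names are hypothetical, and nothing here reflects how GitLab CI is or will be implemented.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical job names; each job maps to the set of jobs it depends on.
JOBS: Dict[str, Set[str]] = {
    "extract_salesforce": set(),
    "extract_zendesk": set(),
    "load_warehouse": {"extract_salesforce", "extract_zendesk"},
    "transform": {"load_warehouse"},
    "publish_dashboard": {"transform"},
}


def execution_order(jobs: Dict[str, Set[str]]) -> List[str]:
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(jobs).static_order())


order = execution_order(JOBS)
```

In this example the two extract jobs have no dependencies, so they could run in parallel, while `load_warehouse` has to wait for both of them.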
This sounds really ambitious, and there are a lot of companies in the data integration space.
You're completely right! But there isn't an open source tool that checks all these boxes. It might sound a bit ludicrous, but as Sid says, "When I saw GitLab for the first time, it made sense that something you collaborate on is also something you contribute to… it makes sense to me that it's not an individual burden, it's a shared burden." We think that the shared nature of the problem will make for a great open source community, and without that community, this won't really get off the ground.
Photo by Ludovic Toinel on Unsplash
“Learn more about @meltanodata, an open source tool for the data science lifecycle from GitLab” – Jacob Schatz and Taylor A. Murphy, PhD