
· 2 min read

JAN 23, 2023

Our Story

In my last startup, every day we received CSV files from various digital out-of-home screen owners in a specific template defined by us. Even though the template was well-defined, the CSV files we were sent contained a lot of errors, and getting them cleaned up with each data provider was a nightmare.

Each day we received approximately 500,000 screen data updates covering price changes and dimension data. Cleaning and formatting the data to fit my database schema and then uploading it was a pain that consumed a lot of engineering hours, so I was looking to automate the complete importing workflow.

[Figure: Location data importing]

I searched the open-source community for a full-stack solution, but most of what I found existed in silos and provided no end-to-end functionality that could integrate seamlessly with my product. Most open-source CSV importers are limited to browser-level data import.

After diving deeper into enterprise solutions, I discovered that the products currently on the market pose a serious risk of data sharing, which can lead to privacy lapses and vendor lock-in.

So, focused on this problem area, I envisioned building "YoBulk": a scalable, open-source tool that can be embedded in your product or SaaS and automate the complete data importing workflow.

At my second company, LocTruth, I faced a new issue: I was importing location data, and I could not find a tool that validates geospatial data, or even one that parses and understands geo-specific formats like KML and GeoJSON. It is a massive problem to validate a CSV file with geometry-like attributes and import it into PostGIS.

Hence, I became convinced there has to be a generic CSV importer for SaaS applications, built by an open-source community, that caters to data from different domains like adtech, martech, finance, retail, and geospatial.

In the first week of November 2022, we conceptualized this idea and started development. We chose Next.js and MongoDB as our tech stack, and within 12 weeks we were ready.

YoBulk, as the name suggests, aims to speed up the CSV importing process, and it can be a path-breaker for developers and non-technical users alike.

· 2 min read

Scaling a CSV Importer

Spreadsheet programs like Google Sheets, Excel, and Numbers, along with a host of other tools, can generate CSV-formatted files. It is also commonplace for third-party email, social, or marketing tools to let you export and download CSV reports. These are all perfect candidates for generating CSV data to load into a database, data warehouse, or data lake for further analysis.

In some cases, the CSV files can be really big. Yes, really big, with sizes in the gigabytes. Scaling a CSV importer is a deep engineering problem: once a CSV file runs into the GBs, it becomes a genuine big-data problem. But we tried to solve this big-data problem with a completely different approach.

While building YoBulk's backend, we encountered multiple challenges when the CSV file size runs into gigabytes and it is not possible to upload and hold the file in the browser. You can always choose a serverless approach, or an EMR/Databricks-style solution that scales automatically.

In any distributed computing system (even beyond Apache Spark), there are well-known scaling trends relating runtime to the number of nodes. These trends are universal and fundamental to computer science.

As more and more nodes are added, the runtime of the job decreases, but the cost also increases; the bottom line is that you can scale anything, but you have to pay the price to your cloud provider. At some point, adding more nodes has diminishing returns: the job stops running faster, while cloud costs keep rising (since more nodes are being added).
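To make the trade-off concrete, here is a toy illustration, not from our measurements: it assumes Amdahl's law with a made-up 5% serial fraction and a 100-minute single-node runtime, and uses node-hours as a rough proxy for cloud cost:

```typescript
// Toy model of the scaling trend: runtime T(N) = T1 * (s + (1 - s) / N)
// (Amdahl's law), with node-hours as a rough proxy for cloud cost.
const T1 = 100; // minutes on a single node (assumed)
const s = 0.05; // serial fraction of the job (assumed)

for (const nodes of [1, 2, 4, 8, 16, 32, 64]) {
  const runtime = T1 * (s + (1 - s) / nodes); // minutes
  const nodeHours = (nodes * runtime) / 60; // cost proxy
  console.log(
    `${nodes} nodes: ${runtime.toFixed(1)} min, ~${nodeHours.toFixed(1)} node-hours`
  );
}
```

Under these assumed numbers, 8 nodes already run the job roughly six times faster, but going from 32 to 64 nodes shaves less than 20% off the runtime while increasing node-hours by over 60%.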

We could have easily chosen an architecture like the one below, but it would not solve the real problem; it would merely pass the problem on to cloud infrastructure.

[Figure: a serverless, cloud-based import architecture]

But YoBulk does all the scaling on-premises, through a highly optimized Node.js streaming architecture. YoBulk uses MongoDB for storage and handles all concurrency and bulk importing without any serverless cloud or big-data infrastructure.
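As a minimal sketch of what a streaming import like this can look like (this is not YoBulk's actual code; it assumes the `csv-parse` and official `mongodb` npm packages, and the file path, connection string, and collection names are placeholders):

```typescript
// Minimal sketch: stream a large CSV into MongoDB in batches,
// without ever holding the whole file in memory.
import { createReadStream } from "node:fs";
import { parse } from "csv-parse";
import { MongoClient } from "mongodb";

const BATCH_SIZE = 1000; // rows buffered per bulk insert

async function importCsv(path: string, mongoUrl: string): Promise<void> {
  const client = new MongoClient(mongoUrl);
  await client.connect();
  const collection = client.db("imports").collection("rows");

  // csv-parse returns a Transform stream; with `columns: true`
  // each record arrives as an object keyed by the header row.
  const parser = createReadStream(path).pipe(parse({ columns: true }));

  let batch: Record<string, unknown>[] = [];
  for await (const record of parser) {
    batch.push(record);
    if (batch.length >= BATCH_SIZE) {
      await collection.insertMany(batch, { ordered: false });
      batch = [];
    }
  }
  if (batch.length > 0) {
    await collection.insertMany(batch, { ordered: false });
  }
  await client.close();
}

importCsv("./screens.csv", "mongodb://localhost:27017").catch(console.error);
```

Because the parser stream is consumed with `for await`, backpressure keeps memory flat: the file is read only as fast as MongoDB can ingest the batches, so even multi-GB files never have to fit in memory at once.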

Our intention: businesses should not have to make large tech investments (EMR, Databricks, ETL tools) to solve the CSV importing problem.

· 2 min read

Collaboration is the Key

When importing customer data using a CSV file, several challenges can arise during the onboarding process.

Customer data onboarding needs a lot of collaboration. The customer success/business team works very closely with customers. In the ideal case, the customer provides 100% correctly formatted data, but in reality the data they provide needs a lot of cleaning. In a typical scenario, the customer success team in charge of this activity has to work back and forth with the customer, and the customer has to resolve manual (unintended) errors. This workflow demands constant collaboration.

It is the end customer or data provider who bites the bullet and does the time-consuming data cleaning. The customer should know about the errors, duplicates, PII data, and inconsistencies in the data, and has to be properly guided to clean it in the best possible manner.
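As an illustrative sketch of the kind of guidance we mean (not YoBulk's actual validation logic; the required fields and the `id` column here are assumptions), a validator can turn a raw file into actionable, row-level messages:

```typescript
// Sketch: flag missing required fields and duplicate IDs per row,
// so the customer sees exactly what to fix instead of a silent failure.
type Row = Record<string, string>;

function validateRows(rows: Row[], requiredFields: string[]): string[] {
  const errors: string[] = [];
  const seenIds = new Set<string>();

  rows.forEach((row, i) => {
    for (const field of requiredFields) {
      if (!row[field] || row[field].trim() === "") {
        errors.push(`Row ${i + 1}: missing value for "${field}"`);
      }
    }
    const id = row["id"]; // assumes an "id" column identifies each record
    if (id && seenIds.has(id)) {
      errors.push(`Row ${i + 1}: duplicate id "${id}"`);
    }
    if (id) seenIds.add(id);
  });

  return errors;
}

// Example: the second row triggers both a missing-field and a duplicate error.
console.log(
  validateRows(
    [
      { id: "1", name: "Screen A", price: "120" },
      { id: "1", name: "", price: "95" },
    ],
    ["id", "name", "price"]
  )
);
```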

[Figure: back-and-forth interaction between the customer and the Ops/Customer Success team]

YoBulk's way of solving the data cleaning workflow:

[Figure: YoBulk's collaborative data cleaning workflow]

YoBulk is on a mission to solve the complex data importing problem by creating workspaces and taking collaborative data cleaning to the next level.