SCALING A CSV IMPORTER
Spreadsheet programs like Google Sheets, Excel, and Numbers, along with a host of other tools, can generate CSV-formatted files. It is also commonplace for third-party email, social, or marketing tools to let you export and download CSV reports. These are all perfect candidates to generate CSV data for loading into a database, data warehouse, or data lake for further analysis.
In some cases, the CSV files can be genuinely large, running to several gigabytes. Scaling a CSV importer at that size is a deep engineering problem: a multi-GB CSV file is effectively a big data problem. We set out to solve this big data problem with a completely different approach.
In any distributed computing system (not just Apache Spark), there are well-known scaling trends of runtime versus number of nodes, as illustrated in the images below. These trends are universal and fundamental to distributed computing.
As more nodes are added, the job's runtime decreases, but the cost also increases; the bottom line is that you can scale almost anything, provided you are willing to pay the price to your cloud provider. At some point, adding more nodes yields diminishing returns: the job stops running meaningfully faster, while cloud costs keep rising with every node added.
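To make the trade-off concrete, here is a back-of-the-envelope model of that trend. All numbers (serial fraction, parallel work, hourly price) are illustrative assumptions, not measurements from any real cluster.

```ts
// Hypothetical Amdahl's-law-style model of runtime vs. node count,
// showing diminishing runtime gains while cost keeps climbing.
const serialSeconds = 120;      // assumed portion of the job that cannot be parallelized
const parallelSeconds = 3600;   // assumed portion that splits evenly across nodes
const dollarsPerNodeHour = 0.5; // assumed hourly price per node

for (const nodes of [1, 2, 4, 8, 16, 32, 64]) {
  // Runtime = serial part + each node's share of the parallel part.
  const runtime = serialSeconds + parallelSeconds / nodes;
  // Cost grows with total node-hours even as runtime gains flatten out.
  const cost = (nodes * runtime / 3600) * dollarsPerNodeHour;
  console.log(
    `${nodes} nodes -> runtime ${(runtime / 60).toFixed(1)} min, cost $${cost.toFixed(2)}`
  );
}
```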
We could easily have chosen an architecture like the one below, but that does not solve the real problem; it merely passes the problem on to cloud infrastructure.
But YoBulk does all the scaling on-premises, through a highly optimized Node.js streaming architecture. YoBulk uses MongoDB for storage and handles all concurrency and bulk importing without any serverless cloud or big data infrastructure, as sketched below.
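The sketch below illustrates the general streaming approach, not YoBulk's actual code: it assumes the `csv-parse` parser and the official `mongodb` Node.js driver, and the database, collection, and batch-size names are made up for the example. The key idea is that the file is read, parsed, and inserted as a stream, so memory use stays flat whether the CSV is 10 MB or 10 GB.

```ts
// Minimal sketch: stream a CSV file into MongoDB in batches.
import { createReadStream } from 'node:fs';
import { parse } from 'csv-parse';
import { MongoClient } from 'mongodb';

const BATCH_SIZE = 5000; // assumed batch size; tune for your row width

async function importCsv(filePath: string, mongoUri: string) {
  const client = new MongoClient(mongoUri);
  await client.connect();
  const collection = client.db('imports').collection('rows'); // hypothetical names

  // Read and parse the file as a stream; rows become objects keyed by header.
  const parser = createReadStream(filePath).pipe(
    parse({ columns: true, skip_empty_lines: true })
  );

  let batch: Record<string, unknown>[] = [];
  try {
    // Async iteration applies backpressure: the file is only read as fast
    // as MongoDB accepts the batched inserts.
    for await (const record of parser) {
      batch.push(record);
      if (batch.length >= BATCH_SIZE) {
        await collection.insertMany(batch, { ordered: false });
        batch = [];
      }
    }
    // Flush any remaining rows.
    if (batch.length > 0) {
      await collection.insertMany(batch, { ordered: false });
    }
  } finally {
    await client.close();
  }
}

// Example usage (path and URI are placeholders):
// importCsv('./export.csv', 'mongodb://localhost:27017').catch(console.error);
```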
Our intention: businesses should not have to make large technology investments (EMR, Databricks, ETL tools) to solve the CSV importing problem.