SCALING A CSV IMPORTER
Spreadsheet programs like Google Sheets, Excel, and Numbers, along with a host of other tools, can generate CSV-formatted files. It is also commonplace for third-party email, social, or marketing tools to let you export and download CSV reports. These are all perfect candidates to generate CSV data for loading into a database, data warehouse, or data lake for further analysis.
In some cases, the CSV files can be genuinely large, running to several gigabytes. Scaling a CSV importer at that size is a deep engineering problem: a multi-GB CSV file is effectively a big data problem. We set out to solve this big data problem with a completely different approach.
In any distributed computing system (not just Apache Spark), there are well-known scaling trends of runtime versus number of nodes, as illustrated in the images below. These trends are universal and fundamental to distributed computing.
As more nodes are added, the job's runtime decreases, but the cost also increases; the bottom line is that you can scale almost anything, provided you are willing to pay the price to your cloud provider. At some point, adding more nodes yields diminishing returns: the job stops running meaningfully faster, while cloud costs keep rising with every node added.
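To make the trade-off concrete, here is a back-of-the-envelope model of that trend. All numbers (serial fraction, parallel work, hourly price) are illustrative assumptions, not measurements from any real cluster.

```ts
// Hypothetical Amdahl's-law-style model of runtime vs. node count,
// showing diminishing runtime gains while cost keeps climbing.
const serialSeconds = 120;      // assumed portion of the job that cannot be parallelized
const parallelSeconds = 3600;   // assumed portion that splits evenly across nodes
const dollarsPerNodeHour = 0.5; // assumed hourly price per node

for (const nodes of [1, 2, 4, 8, 16, 32, 64]) {
  // Runtime = serial part + each node's share of the parallel part.
  const runtime = serialSeconds + parallelSeconds / nodes;
  // Cost grows with total node-hours even as runtime gains flatten out.
  const cost = (nodes * runtime / 3600) * dollarsPerNodeHour;
  console.log(
    `${nodes} nodes -> runtime ${(runtime / 60).toFixed(1)} min, cost $${cost.toFixed(2)}`
  );
}
```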
We could easily have chosen an architecture like the one below, but that does not solve the real problem; it merely passes the problem on to cloud infrastructure.
But YoBulk does all the scaling on-premises, through a highly optimized Node.js streaming architecture. YoBulk uses MongoDB for storage and handles all concurrency and bulk importing without any serverless cloud or big data infrastructure, as sketched below.
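The sketch below illustrates the general streaming approach, not YoBulk's actual code: it assumes the `csv-parse` parser and the official `mongodb` Node.js driver, and the database, collection, and batch-size names are made up for the example. The key idea is that the file is read, parsed, and inserted as a stream, so memory use stays flat whether the CSV is 10 MB or 10 GB.

```ts
// Minimal sketch: stream a CSV file into MongoDB in batches.
import { createReadStream } from 'node:fs';
import { parse } from 'csv-parse';
import { MongoClient } from 'mongodb';

const BATCH_SIZE = 5000; // assumed batch size; tune for your row width

async function importCsv(filePath: string, mongoUri: string) {
  const client = new MongoClient(mongoUri);
  await client.connect();
  const collection = client.db('imports').collection('rows'); // hypothetical names

  // Read and parse the file as a stream; rows become objects keyed by header.
  const parser = createReadStream(filePath).pipe(
    parse({ columns: true, skip_empty_lines: true })
  );

  let batch: Record<string, unknown>[] = [];
  try {
    // Async iteration applies backpressure: the file is only read as fast
    // as MongoDB accepts the batched inserts.
    for await (const record of parser) {
      batch.push(record);
      if (batch.length >= BATCH_SIZE) {
        await collection.insertMany(batch, { ordered: false });
        batch = [];
      }
    }
    // Flush any remaining rows.
    if (batch.length > 0) {
      await collection.insertMany(batch, { ordered: false });
    }
  } finally {
    await client.close();
  }
}

// Example usage (path and URI are placeholders):
// importCsv('./export.csv', 'mongodb://localhost:27017').catch(console.error);
```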
Our intention: businesses should not have to make large technology investments (EMR, Databricks, ETL tools) to solve the CSV importing problem.