This post discusses a nightmare activity that people working with data face every day: getting the data into a format that is ready to be processed. Working with data sets has undoubtedly become a vital part of everyday work, not only for developers but also for mathematicians, statisticians, and many others. The problem is that real data sets contain inconsistencies; for example, variability in date formats (Oct. 7, 2015 vs. 10/07/2015 vs. October, 7 2015, etc.). Problems like misspellings, extra spaces, random punctuation, or weird capitalization plague your data's consistency and accuracy. These problems arise when we retrieve data from, or work with data drawn from, multiple sources that are not interoperable.
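As a minimal sketch of taming that date-format variability, the snippet below tries a small list of known formats until one parses. The format list and the `normalize_date` helper are my own illustrative assumptions, not something from a particular library; you would extend the list to match whatever shows up in your data.

```python
from datetime import datetime

# Hypothetical list of formats observed in the data -- an assumption,
# extend it as new variants turn up.
KNOWN_FORMATS = ["%b. %d, %Y", "%m/%d/%Y", "%B, %d %Y"]

def normalize_date(raw):
    """Parse a date written in any known format into ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # this format didn't match; try the next one
    raise ValueError(f"Unrecognized date format: {raw!r}")

print(normalize_date("Oct. 7, 2015"))     # -> 2015-10-07
print(normalize_date("10/07/2015"))       # -> 2015-10-07
print(normalize_date("October, 7 2015"))  # -> 2015-10-07
```

Normalizing everything to one canonical representation up front means every later step can assume a single format instead of re-handling the variants.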
I am a newbie to the field of data science and big data. Actually, I started learning it gradually, following the saying "learn to analyze data first, then do it big". I have been running algorithms on small data sets in various contexts to solve different problems or draw conclusions. Now, the data sets I work on are becoming so large (on the order of GBs per file) that I cannot process them quickly. The problem gets worse when I debug: each run takes minutes to produce a result that I probably won't even use! So, I had to move to a platform that satisfies my needs. Here my journey with Apache Spark begins. I started learning it from this edX course, whose material uses a virtual machine that comes with the environment pre-installed and everything ready for learning.
It's always said, "Do not try to re-invent the wheel!" When working with Docker, it is good practice to search Docker Hub for ready-to-use images before building your own. It is very powerful to have your software architecture distributed across a set of containers, each doing one job. And the best building blocks for your distributed application are official images from Docker Hub, whose functionality you can trust.
Docker is one of the trending technology platforms that has gained community interest in a short time. In the simplest terms, it enables developers and system admins to build, ship, and run their distributed applications easily. The ecosystem around Docker is large, and A LOT of tools work with it. One of the most useful of these is Docker Compose. It lets you define a multi-container application in a single file and then spin the whole application up with a single command.
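As a minimal sketch of what that single file looks like, here is a hypothetical `docker-compose.yml` for a two-container application; the service names, the local build context, and the Redis backend are illustrative assumptions, not something from this post.

```yaml
# docker-compose.yml -- hypothetical two-container app:
# a web service built locally, backed by the official Redis image.
version: "3.8"
services:
  web:
    build: .           # build the app image from the local Dockerfile
    ports:
      - "8000:8000"    # expose the app on localhost:8000
    depends_on:
      - redis          # start redis before the web service
  redis:
    image: redis:7     # official image pulled from Docker Hub
```

With this file in the project directory, `docker compose up` starts both containers together, and `docker compose down` tears them down again.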