My Unforgettable Learner Journey
This phase marks the start of my journey toward learning more about data science and machine learning, but this time I am better equipped, with the knowledge and practice I gained from our Labs unit. In Labs, I learned to work cross-functionally with teams such as the frontend and backend engineers, and to recognize the thought processes of other data scientists. In this blog, I detail the most memorable processes I worked on and the challenges I faced as a data scientist working on a very promising project.
Labs is the final unit that a student at Lambda needs to complete. At this point, we need to be able to apply everything we’ve learned, from the core curriculum (Data Science in my case) through Computer Science. Each team was assigned a project; our team was tasked with completing the CitySpire website.
CitySpire is an app that analyzes city data such as jobs, rental rates, population, crime rate, walk score (park), and other factors that can influence a decision to move.
As one of the data scientists on the team, I took responsibility for the jobs data that we scraped from Indeed. I had to develop features based on user stories and deploy the application via Elastic Beanstalk.
Handling the jobs data meant working on user stories such as: “As a user, I would like to get the available jobs count of my chosen city” and “As a user, I would like to see a preview list of available jobs of my chosen city”.
I worked on two features based on those user stories. The first is Jobs Count, a function that returns the number of available jobs in the selected city. The second is Available Jobs, a function that returns a list of available jobs (job title, salary if available, and job description) for the selected city.
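To make the Jobs Count feature concrete, here is a minimal sketch of what such a function computes. The post does not show the actual query, so the table name `jobs`, its columns, and the sample rows are all assumptions, and Python's built-in sqlite3 module stands in for Postgres so the example runs without a database server.

```python
import sqlite3


def get_jobs_count(conn, city):
    """Return the number of job postings stored for the given city."""
    row = conn.execute(
        "SELECT COUNT(*) FROM jobs WHERE city = ?", (city,)
    ).fetchone()
    return row[0]


# Hypothetical in-memory table standing in for the scraped Indeed data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (city TEXT, title TEXT, salary TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [
        ("Austin", "Data Analyst", None),
        ("Austin", "ML Engineer", "120000"),
        ("Denver", "Data Scientist", None),
    ],
)

print(get_jobs_count(conn, "Austin"))  # 2
```

Passing the city as a query parameter, rather than formatting it into the SQL string, keeps user input from being interpreted as SQL.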
Initially, I doubted that I would be able to complete these. I felt that I might not be capable of producing the dataset I needed, and that I might not have enough time to wrangle and clean the data. I was overwhelmed by the thought of using multiple tools like Python, Postgres, Docker, and AWS, of which I had very little knowledge.
Architecture and Design Choices
Because we scraped jobs data per city, we ended up with multiple CSV files. I decided to house the datasets in a Postgres schema and joined and arranged the data using ElephantSQL.
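As a rough illustration of the idea of bringing per-city files together, the sketch below merges several CSV files into one list of rows, tagging each row with its city. It is not the code used in the project (the joining there was done in SQL); the function name, column names, and sample data are hypothetical, and in-memory strings stand in for the real files.

```python
import csv
import io


def combine_city_csvs(files_by_city):
    """Merge per-city CSV files into one list of rows tagged with the city."""
    combined = []
    for city, handle in files_by_city.items():
        for row in csv.DictReader(handle):
            row["city"] = city  # record which file the row came from
            combined.append(row)
    return combined


# Hypothetical in-memory files standing in for the scraped per-city CSVs.
austin = io.StringIO("title,salary\nData Analyst,\nML Engineer,120000\n")
denver = io.StringIO("title,salary\nData Scientist,\n")

rows = combine_city_csvs({"Austin": austin, "Denver": denver})
print(len(rows))  # 3
```

With a single combined table, one query per feature can filter by city instead of each city needing its own table.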
Using FastAPI, I ran the app locally to test it, and I used AWS RDS to generate credentials for a Postgres database.
How did I do it?
I collected my thoughts and organized them. With the help of a teammate, I was able to start working. A BeautifulSoup script was written to scrape the jobs data, and the rest is history.
I used AWS RDS to build a schema for our database from the multiple CSV files of jobs data from different cities. I used TablePlus to run the SQL queries and connected the repository to Postgres using the credentials from the AWS RDS instance that I set up.
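One common way to connect a repository to a database without committing credentials is to read them from environment variables. The post does not say how the credentials were loaded, so the sketch below is only an assumption: the variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_NAME`, `DB_PORT`) and the helper function are hypothetical. It fails early with a clear message when a variable is missing, which is easier to diagnose than a failed connection after deployment.

```python
import os
from urllib.parse import quote_plus

# Hypothetical variable names; use whatever names your deployment defines.
REQUIRED_VARS = ("DB_USER", "DB_PASSWORD", "DB_HOST", "DB_NAME")


def build_database_url(env=os.environ):
    """Build a Postgres connection URL from environment variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    port = env.get("DB_PORT", "5432")  # 5432 is the default Postgres port
    password = quote_plus(env["DB_PASSWORD"])  # escape special characters
    return (
        f"postgresql://{env['DB_USER']}:{password}"
        f"@{env['DB_HOST']}:{port}/{env['DB_NAME']}"
    )


# Usage with made-up values; a real app would rely on os.environ.
example_env = {
    "DB_USER": "cityspire",
    "DB_PASSWORD": "p@ss",
    "DB_HOST": "example.rds.amazonaws.com",
    "DB_NAME": "jobs",
}
print(build_database_url(example_env))
# postgresql://cityspire:p%40ss@example.rds.amazonaws.com:5432/jobs
```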
The first feature I worked on was getting the list of available jobs for the selected city. I accomplished this by selecting the columns I needed for the output and using the pandas .loc indexer to filter the data by the selected city, adding .head at the end to limit the list to just 10 rows (a preview). To get the output as a JSON-ready object, I added .to_dict at the end.
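The chain described above can be sketched roughly as follows. This is an illustration and not the project's actual code: the column names, the sample DataFrame, and the `orient="records"` argument to `to_dict` are assumptions, since the post does not state them.

```python
import pandas as pd


def get_jobs(df, city, limit=10):
    """Return a preview of up to `limit` jobs for the given city."""
    columns = ["title", "salary", "description"]  # assumed column names
    # Filter rows by city and keep only the output columns, then truncate.
    preview = df.loc[df["city"] == city, columns].head(limit)
    # One dict per row, which serializes naturally to a JSON array.
    return preview.to_dict(orient="records")


# Hypothetical sample data standing in for the scraped jobs table.
df = pd.DataFrame(
    {
        "city": ["Austin", "Austin", "Denver", "Austin"],
        "title": ["Data Analyst", "ML Engineer", "Data Scientist", "BI Developer"],
        "salary": ["90000", "120000", "110000", "95000"],
        "description": ["Build reports", "Ship models", "Run experiments", "Own dashboards"],
    }
)

for job in get_jobs(df, "Austin", limit=2):
    print(job["title"])
```

Doing the row filter and column selection in a single `.loc` call avoids chained indexing, and `limit` keeps the response small enough to serve as a preview.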
We were given about 8 weeks to complete the project. Most of my time was spent not on coding and building, but on troubleshooting one particular problem: deploying to Elastic Beanstalk.
Reference: AWS Elastic Beanstalk
Following the guide linked above, I attempted to deploy my completed features after successfully testing them locally with FastAPI. On my first attempt, I ran into a snag: Error 502 Bad Gateway. I thought I might have missed a step, so I terminated the instance and tried again, and realized that I had not entered the environment variables in the Elastic Beanstalk console. I deployed again but, unfortunately, received the same error. After multiple attempts following the same steps, I sought help from our Data Science Manager. He first tried to reproduce the error by following the same procedure with the same repository I was working on, but it worked on his end. I tried again, and finally, success!
However, that success was short-lived. My deployment lasted only a few hours, and so did my subsequent attempts. We then realized that the DS Manager’s deployment had also died at some point. After many hours of troubleshooting, we decided to move forward with another procedure: deploying via EC2 (Amazon Elastic Compute Cloud).
This process was not considered initially because it requires many steps, from creating an AWS instance, to creating and downloading a PEM key, to cloning the branch or repo for Docker. The Docker process alone is lengthy: if you encounter errors along the way, you need to start the cloning process over.
The EC2 deployment was not smooth either. Initially, I encountered an outdated Pipfile.lock error, which required me to delete the cloned repository in Docker, then remove and reinstall the Pipfile.lock in my local copy of the repository. That fixed the problem, but I came across another error: when we checked the list of running Docker containers, the container I had built showed a blank port. We used the docker logs command to investigate, which showed that a package needed to run the application was missing. I repeated the same process as above, but instead of only the Pipfile.lock, I installed all the packages needed to run the application. Finally, a successful deployment!
Here’s the link to the data science API, which was deployed via EC2: CitySpire-C. The /get_jobs and /get_jobs_count endpoints are the features I worked on for this app.
I have documented the EC2 process I went through; click here to access it.
We inherited existing features from the previous cohort. The data science members of our team were able to complete 3 features: get available jobs list, get available jobs count, and rental price prediction.
If I had more time, I would have added a visualization feature using NLP and WordCloud, showing popular job titles for the selected city. I have already completed the notebook for it but was not able to include it in the app due to time constraints.
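The notebook itself is not shown here, but the counting step that sits underneath a word cloud of job titles can be sketched with the standard library alone. Everything in this example is illustrative: the function name, the tiny stopword list, and the sample titles are assumptions, and a word cloud library would take counts like these as its input for drawing.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"and", "of", "the", "for", "in", "a"}


def top_title_words(titles, n=5):
    """Count the most frequent words across a list of job titles."""
    words = []
    for title in titles:
        for word in re.findall(r"[a-z]+", title.lower()):
            if word not in STOPWORDS:
                words.append(word)
    return Counter(words).most_common(n)


# Hypothetical titles standing in for one city's scraped postings.
titles = [
    "Senior Data Scientist",
    "Data Analyst",
    "Data Engineer",
    "Director of Engineering",
]
print(top_title_words(titles, 1))  # [('data', 3)]
```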
Another great addition would be a salary prediction model, which could be built with a regression model (or logistic regression, if salaries are grouped into bands) using NLP features from job titles or job descriptions.
The issue of deploying via Elastic Beanstalk using eb commands has not been resolved yet. It would be worth resolving, as doing so could uncover hidden issues in the repository that may cause other technical problems in the future.
This project brought me to the point of feeling ready to take on an entry-level Data Scientist or Machine Learning Engineer job. I feel that thorough research, teamwork, continuous education, perseverance in digging for the root cause, exploring other options for resolution, and willingness to learn will help me deliver on the responsibilities that will be placed on my shoulders as a data science and machine learning professional.