My uncle is a physicist, and was recently asked to speak with a girl who had just graduated with a Bachelor’s degree in Engineering Physics. It sounds super impressive, but she had no idea what sort of jobs to look for, given her background. It made me think: what if there were a tool that let you upload your resume and then recommended jobs you should apply to, based on your qualifications? This would definitely help someone like the girl my uncle talked to. (This may already exist, but it’s a great project regardless!)
The idea of this is two-fold:
First, an application like this could help those on their job hunt, in particular young university graduates who aren’t sure how to make use of their degree. It could help them know what sort of role to look into, and also where to look, geographically.
Second, this could help hiring managers, who know how to describe the role they want to fill, but don’t know the name of the role. This would avoid companies asking for a software engineer, for example, when the job description is really asking for a data analyst, or something along those lines.
This is preliminary work, where I explore the first part of the idea. I use web scraping to create a recommendation system for data scientists seeking new employment. I do this because I’m an aspiring data scientist, so it’s interesting for me to see what jobs “match” my resume best. I’ll use term frequency-inverse document frequency (TF-IDF) to represent the documents, with cosine similarity as my similarity measure.
So, what I’ll do is: scrape Glassdoor.ca to create a dataset of data science related jobs, and use R Shiny to allow a user (in this case, me) to upload their resume. The program, written in Python, will compute the similarity between the resume and each job posting, and return an ordered ranking of the top jobs that the job seeker could apply to.
Extending this preliminary work to the full project should be simple. It would require getting a corpus of all the job postings on Glassdoor.ca for the entire country for a given time period. The corpus could be updated hourly, daily, or weekly, as needed. I suspect daily would suffice, perhaps even weekly. I’d also need to expand the functionality to include the second part of my idea, which would allow employers to enter their job description and get back an appropriate job title for the posting. The idea is just to give students and employers a sense of what roles are available, and what job titles are typically assigned to certain roles.
For this project, I scrape Glassdoor.ca, run by a well-known California-based company that provides a database of job postings along with company reviews, salary information, and interview tips. Since this is just a prototype, I only gather the first page of job postings for “Data Analyst” and “Data Scientist” in both Toronto and Vancouver.
The web scraping and the main body of the project are done in Python, with R Shiny used for deployment and for building the web app. In Python, I use Selenium to web scrape, which drives a real browser so that Glassdoor treats my requests as coming from a human, not a robot. I also use scikit-learn (sklearn) to compute term frequency-inverse document frequency and cosine similarity, to measure closeness between the job postings and a resume. I also use the usual numpy, pandas, csv, re, and pickle packages, as well as time and collections. In R, I use shiny, PythonInR (very useful!), dplyr, tidyr, and pdftools.
Scraping Glassdoor turned out to be a bit of work. I was hoping I could completely follow Diego De Lazzari’s work from 2016, but it seems like Glassdoor has changed the class names of the elements on the site, so I had to manually go in and look at the source. I save the elements into two dictionaries. The first contains the job information, i.e. the job title, company, location, and a link to the posting. The second contains the job description (JD). I compare the text of each job description to the uploaded resume, and from there compute the cosine similarity between the resume and each JD. The Python code is below.
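To make the comparison step concrete, here is a minimal sketch of ranking job descriptions against a resume with TF-IDF and cosine similarity. It is not the project's actual code: it uses only the standard library instead of sklearn, the names tokenize, tfidf_vectors, cosine and rank_jobs are hypothetical, the sample postings are made up, and fitting TF-IDF on the resume together with the postings is an assumption.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document.

    Weight = raw term count * smoothed IDF, ln((1 + n) / (1 + df)) + 1.
    """
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_jobs(resume_text, job_descriptions):
    """Rank {job_id: description} entries by similarity to the resume, best first."""
    job_ids = list(job_descriptions)
    vectors = tfidf_vectors([resume_text] + [job_descriptions[j] for j in job_ids])
    resume_vec, job_vecs = vectors[0], vectors[1:]
    scores = [(job_id, cosine(resume_vec, vec)) for job_id, vec in zip(job_ids, job_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Made-up postings and resume, purely for illustration.
jobs = {
    "data_scientist": "Build machine learning models in Python and communicate results",
    "data_analyst": "Create SQL reports and dashboards for business stakeholders",
    "barista": "Prepare coffee and serve customers",
}
resume = "Python developer with machine learning and statistics experience"
ranking = rank_jobs(resume, jobs)
for job_id, score in ranking:
    print(f"{job_id}: {score:.3f}")
```

In practice, sklearn's TfidfVectorizer and cosine_similarity cover the same two steps, with their own tokenization and normalization defaults, so the scores will not match this sketch exactly.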
Let’s review cosine similarity quickly.
First, recall the dot product of two vectors x and y: x·y = |x||y| cos(θ), where |·| is the L2 norm, or the length of the vector, and θ is the angle between the two. The term cos(θ) is what we’re interested in. Rearranging gives cos(θ) = (x·y) / (|x||y|). Because we divide by both lengths, this measures how similar the two vectors’ directions are, regardless of their lengths. A small angle between the vectors corresponds to high similarity, with cos(θ) ~ 1. Vectors at right angles have cos(θ) = 0, and vectors pointing in opposite directions have cos(θ) = -1. Since word counts and TF-IDF weights are never negative, the similarity between two documents always falls between 0 (no terms in common) and 1.
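As a quick check of the formula, here is a small sketch that computes cos(θ) directly from the definition. The function name cosine_similarity and the example vectors are my own choices for illustration.

```python
import math


def cosine_similarity(x, y):
    """Return cos(theta) = (x . y) / (|x| |y|) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Same direction, different lengths: similarity close to 1.
same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
# No overlapping components: similarity is 0.
orthogonal = cosine_similarity([1, 0, 0], [0, 3, 0])
# Opposite directions: similarity close to -1 (cannot occur with word counts).
opposite = cosine_similarity([1, 2], [-1, -2])

print(same_direction, orthogonal, opposite)
```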
This sort of approach, using vectors, works because we represent our data as a matrix, where each row is a vector describing one document. In our particular case, since we’re comparing two documents (a resume and a job posting), we have a 2 x N matrix, where N is the number of unique words that appear in the two.
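Here is a rough sketch of what that 2 x N matrix looks like when built from raw word counts. The helper names tokenize and count_matrix and the two sample sentences are hypothetical, and the tokenization is deliberately simplistic.

```python
import re


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def count_matrix(doc_a, doc_b):
    """Return (vocabulary, 2 x N matrix of word counts) for two documents."""
    tokens_a, tokens_b = tokenize(doc_a), tokenize(doc_b)
    # N = number of unique words across both documents.
    vocabulary = sorted(set(tokens_a) | set(tokens_b))
    rows = [
        [tokens.count(word) for word in vocabulary]
        for tokens in (tokens_a, tokens_b)
    ]
    return vocabulary, rows


resume = "Python and SQL data analysis"
posting = "Data scientist with Python experience"
vocabulary, matrix = count_matrix(resume, posting)

print(vocabulary)
# ['analysis', 'and', 'data', 'experience', 'python', 'scientist', 'sql', 'with']
print(matrix[0])  # resume row:  [1, 1, 1, 0, 1, 0, 1, 0]
print(matrix[1])  # posting row: [0, 0, 1, 1, 1, 1, 0, 1]
```

Each column corresponds to one word, so the two rows can be compared position by position. TF-IDF keeps the same shape and only reweights the entries.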
When comparing documents with this metric, what we really care about is the direction the vectors point in. For example, if the word “data” appears 200 times in document A and 10 times in document B, the angle between the documents’ vector representations will be small. If we were using Euclidean distance instead, the two would appear far apart. Conversely, if document A contained “data” 2 times and document B contained “dog” 2 times, the vector representations would point in different directions, giving low cosine similarity, while Euclidean distance could lead us to falsely believe these documents are more similar than they actually are.
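The two examples above can be worked through numerically. This sketch assumes a two-word vocabulary, ("data", "dog"), so each document becomes a vector of two counts. The function names are my own.

```python
import math


def cosine_similarity(x, y):
    """cos(theta) between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def euclidean_distance(x, y):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Vector layout: [count of "data", count of "dog"]
doc_a, doc_b = [200, 0], [10, 0]   # both only about "data"
doc_c, doc_d = [2, 0], [0, 2]      # "data" twice vs "dog" twice

same_topic_cos = cosine_similarity(doc_a, doc_b)    # 1.0: same direction
same_topic_dist = euclidean_distance(doc_a, doc_b)  # 190.0: looks far apart
diff_topic_cos = cosine_similarity(doc_c, doc_d)    # 0.0: nothing in common
diff_topic_dist = euclidean_distance(doc_c, doc_d)  # about 2.83: looks close

print(same_topic_cos, same_topic_dist)
print(diff_topic_cos, round(diff_topic_dist, 2))
```

Euclidean distance ranks the unrelated pair as much closer than the related pair, while cosine similarity gets the ordering right.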
Deployment using R Shiny
I use R Shiny to deploy the project, with the library PythonInR, which allows me to call the Python function get_best_csv from R. That function does the computations and comparisons for me. The R code and a screenshot of the resulting R Shiny dashboard are below.
Conclusion and Future Work
Developed in Python and deployed as a web application using R, this mini project allows a user to upload their resume and match it to available job postings on Glassdoor.ca. In the future, I would extend the functionality to allow an employer to upload a job description, and match the job description to existing job titles. This would help the employer to better advertise their job posting. I’d also extend the data set to include all job postings on Glassdoor for a given time period, not just those related to data science and data analytics. Additionally, I’d like to play around with similarity metrics other than cosine similarity, or obtain some labels of successful resumes matched to corresponding job descriptions. Overall, this was a nice exercise for me to practice web scraping and computing document similarity, and to brush up on R.
The code for this project can be found on my GitHub.