diff --git a/README.md b/README.md
index 4d5a093f..e92e448e 100644
--- a/README.md
+++ b/README.md
@@ -1,28 +1,189 @@
-# Extract Van Gogh Paintings Code Challenge
+# Google artworks carousel extractor
 
-Goal is to extract a list of Van Gogh paintings from the attached Google search results page.
+Small Python script that pulls the artworks carousel (the row of paintings, or
+people, that Google shows at the top of some searches) out of a saved results
+page and gives you back an array. It only reads the local HTML file, it doesn't
+make any requests.
 
-
+Output looks like this:
 
-## Instructions
+```json
+{ "artworks": [ { "name": "The Starry Night", "extensions": ["1889"], "link": "https://www.google.com/search?...", "image": "data:image/jpeg;base64,..." } ] }
+```
 
-This is already fully supported on SerpApi. ([relevant test], [html file], [sample json], and [expected array].)
-Try to come up with your own solution and your own test.
-Extract the painting `name`, `extensions` array (date), and Google `link` in an array.
+## Install
 
-Fork this repository and make a PR when ready.
+```bash
+pip install -r requirements.txt
+```
 
-Programming language wise, Ruby (with RSpec tests) is strongly suggested but feel free to use whatever you feel like.
+You really only need beautifulsoup4. It'll use lxml if you've got it (bit
+faster), and if not it just falls back to Python's built-in html.parser. pytest
+is only for the tests.
 
-Parse directly the HTML result page ([html file]) in this repository. No extra HTTP requests should be needed for anything.
+## Run
 
-[relevant test]: https://github.com/serpapi/test-knowledge-graph-desktop/blob/master/spec/knowledge_graph_claude_monet_paintings_spec.rb
-[sample json]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.json
-[html file]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.html
-[expected array]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/expected-array.json
+```bash
+python carousel_parser.py files/van-gogh-paintings.html
+```
 
-Add also to your array the painting thumbnails present in the result page file (not the ones where extra requests are needed).
+That prints the `{"artworks": [...]}` JSON to the terminal.
 
-Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carrousel. Don't necessarily have to be paintings.)
+## Test
 
-The suggested time for this challenge is 4 hours. But, you can take your time and work more on it if you want.
+```bash
+pytest
+```
+
+## What the task wanted
+
+Just so I didn't miss anything, here's what the
+[instructions](instructions/README.md) asked for and what I did:
+
+- Get the `name`, the `extensions` (the date), and the Google `link` for each one.
+- Also grab the thumbnails that are actually in the page, and skip the ones that
+  would need another request. Those come out as `null`.
+- Read the HTML file directly, no extra HTTP calls.
+- Write my own test. There's a fair few in `test_parser.py`.
+- Try it on a couple of other similar pages. I saved a few real ones, see below.
+- They suggest Ruby but say use whatever. I went with Python as that's what I'm
+  most comfortable in.
+
+The output matches their `files/expected-array.json` exactly, base64 images and
+all.
+
+## How it works
+
+Couple of things about the HTML that aren't obvious until you actually look at it.
+
+### Finding the items
+
+Each item in the carousel is the same little block repeated:
+
+```html
+
+```
+
+The class names (`iELo6`, `pgNMRc` and so on) are just random-looking hashes and
+Google changes them between pages, so I didn't want to rely on those. What seems
+stable is the shape: a link pointing at `/search`, with a `stick=` bit in it,
+that has an image inside it. On the sample page that gets you exactly the 47 items
+and leaves out the "More"/"See more" links (they have `stick=` too but no image,
+so there's actually 56 of those links and only 47 real items).
+
+One thing that caught me out: I first checked `href.startswith("/search")`, but
+when you save a page the browser rewrites the links to the full
+`https://www.google.com/search?...`, so that missed everything. I switched to
+checking the URL path is `/search` and now both work.
+
+If there's a `