195 changes: 178 additions & 17 deletions README.md
@@ -1,28 +1,189 @@
# Extract Van Gogh Paintings Code Challenge
# Google artworks carousel extractor

Goal is to extract a list of Van Gogh paintings from the attached Google search results page.
A small Python script that pulls the artworks carousel (the row of paintings, or
people, that Google shows at the top of some searches) out of a saved results
page and gives you back an array. It only reads the local HTML file and doesn't
make any requests.

![Van Gogh paintings](https://github.com/serpapi/code-challenge/blob/master/files/van-gogh-paintings.png?raw=true "Van Gogh paintings")
Output looks like this:

## Instructions
```json
{ "artworks": [ { "name": "The Starry Night", "extensions": ["1889"], "link": "https://www.google.com/search?...", "image": "data:image/jpeg;base64,..." } ] }
```

This is already fully supported on SerpApi. ([relevant test], [html file], [sample json], and [expected array].)
Try to come up with your own solution and your own test.
Extract the painting `name`, `extensions` array (date), and Google `link` in an array.
## Install

Fork this repository and make a PR when ready.
```bash
pip install -r requirements.txt
```

Programming language wise, Ruby (with RSpec tests) is strongly suggested but feel free to use whatever you feel like.
You really only need beautifulsoup4. It'll use lxml if you've got it (bit
faster), and if not it just falls back to Python's built-in html.parser. pytest
is only for the tests.
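
A minimal sketch of how that kind of optional-lxml fallback can be written. The function name `pick_parser` is hypothetical and not necessarily what `carousel_parser.py` uses; the returned string is the parser name you would pass as the second argument to `BeautifulSoup(...)`.

```python
def pick_parser() -> str:
    """Return "lxml" when it is importable, otherwise the built-in parser name."""
    try:
        import lxml  # noqa: F401  (only checking that it is installed)
    except ImportError:
        return "html.parser"
    return "lxml"


# Intended use (assumes beautifulsoup4 is installed):
#   soup = BeautifulSoup(html_text, pick_parser())
print(pick_parser())
```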

Parse directly the HTML result page ([html file]) in this repository. No extra HTTP requests should be needed for anything.
## Run

[relevant test]: https://github.com/serpapi/test-knowledge-graph-desktop/blob/master/spec/knowledge_graph_claude_monet_paintings_spec.rb
[sample json]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.json
[html file]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/van-gogh-paintings.html
[expected array]: https://raw.githubusercontent.com/serpapi/code-challenge/master/files/expected-array.json
```bash
python carousel_parser.py files/van-gogh-paintings.html
```

Add also to your array the painting thumbnails present in the result page file (not the ones where extra requests are needed).
That prints the `{"artworks": [...]}` JSON to the terminal.

Test against 2 other similar result pages to make sure it works against different layouts. (Pages that contain the same kind of carousel. Don't necessarily have to be paintings.)
## Test

The suggested time for this challenge is 4 hours. But, you can take your time and work more on it if you want.
```bash
pytest
```

## What the task wanted

Just so I didn't miss anything, here's what the
[instructions](instructions/README.md) asked for and what I did:

- Get the `name`, the `extensions` (the date), and the Google `link` for each one.
- Also grab the thumbnails that are actually in the page, and skip the ones that
would need another request. Those come out as `null`.
- Read the HTML file directly, no extra HTTP calls.
- Write my own test. There's a fair few in `test_parser.py`.
- Try it on a couple of other similar pages. I saved a few real ones, see below.
- They suggest Ruby but say use whatever. I went with Python as that's what I'm
most comfortable in.

The output matches their `files/expected-array.json` exactly, base64 images and
all.

## How it works

A couple of things about the HTML aren't obvious until you actually look at it.

### Finding the items

Each item in the carousel is the same little block repeated:

```html
<div class="iELo6">
<a href="/search?...&q=The+Starry+Night&stick=...">
<img class="taFZJe" alt="The Starry Night" id="_L_..." src="<1x1 gif>">
<div class="KHK6lb">
<div class="pgNMRc">The Starry Night</div> <!-- name -->
<div class="cxzHyb">1889</div> <!-- year -->
</div>
</a>
</div>
```

The class names (`iELo6`, `pgNMRc` and so on) are just random-looking hashes and
Google changes them between pages, so I didn't want to rely on those. What seems
stable is the shape: a link pointing at `/search`, with a `stick=` parameter in
its query string, that has an image inside it. On the sample page that gets you
exactly the 47 items and leaves out the "More"/"See more" links (they have
`stick=` too but no image, so there are actually 56 of those links and only 47
real items).

One thing that caught me out: I first checked `href.startswith("/search")`, but
when you save a page the browser rewrites the links to the full
`https://www.google.com/search?...`, so that missed everything. I switched to
checking the URL path is `/search` and now both work.

If there's a `<g-scrolling-carousel>` wrapper on the page I only look inside that,
so stuff like "People also search for" doesn't sneak in. If there isn't one I
just look at the whole page.
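
The "shape" check described above can be sketched with only the standard library. This is an illustration, not the code in `carousel_parser.py` (which uses BeautifulSoup): the names `is_carousel_href` and `CarouselLinkFinder` are made up here, and the `<g-scrolling-carousel>` scoping is left out to keep it short.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlparse


def is_carousel_href(href: str) -> bool:
    """True when the URL path is /search and the query has a stick parameter.

    Comparing the parsed path (rather than href.startswith("/search")) accepts
    both relative links and the absolute ones a browser writes when saving.
    """
    parts = urlparse(href)
    if parts.path != "/search":
        return False
    return "stick" in parse_qs(parts.query, keep_blank_values=True)


class CarouselLinkFinder(HTMLParser):
    """Collects hrefs of qualifying <a> tags that also contain an <img>."""

    def __init__(self):
        super().__init__()
        self.items = []        # hrefs of links that passed both checks
        self._href = None      # href of the candidate link we are inside
        self._has_img = False  # whether that candidate has an <img> so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self._href = href if is_carousel_href(href) else None
            self._has_img = False
        elif tag == "img" and self._href is not None:
            self._has_img = True

    def handle_endtag(self, tag):
        if tag == "a":
            # "More"/"See more" links have stick= but no image, so they drop out here.
            if self._href is not None and self._has_img:
                self.items.append(self._href)
            self._href = None
```

Feeding a page's HTML to `CarouselLinkFinder().feed(...)` leaves the matching hrefs in `.items`, in document order.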

### The name is in two different spots

Google does this carousel two different ways depending on the search, which was a
bit annoying:

- Sometimes the full name is just in the image's `alt` (the Van Gogh file,
Picasso and Monet are like this).
- Other times the `alt` is empty, the visible name is cut short and split over two
lines ("The Starry" then "Night"), and the actual full name is in the link's
`aria-label` (or `data-entityname`).

So I look at `aria-label`, then `data-entityname`, then the `alt`, then the first
bit of text, and use the first one that's actually there. The extensions are then
whatever text is left over after the name. The first type of page doesn't have
those attributes so it isn't affected and still matches exactly.
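
As a rough sketch, the lookup order for the name could be written like this. The function name and its arguments (`link_attrs`, `img_alt`, `texts`) are hypothetical, and the extension splitting is deliberately left out because the exact rule isn't spelled out here.

```python
def pick_name(link_attrs, img_alt, texts):
    """Return the first non-empty candidate, in the order described above.

    link_attrs: attributes of the <a> tag as a dict
    img_alt:    the alt text of the <img> inside the link (may be empty)
    texts:      visible text pieces inside the link, in document order
    """
    candidates = (
        link_attrs.get("aria-label"),
        link_attrs.get("data-entityname"),
        img_alt,
        texts[0] if texts else None,
    )
    for candidate in candidates:
        if candidate and candidate.strip():
            return candidate.strip()
    return None


# First layout: no aria-label, the full name is in alt.
print(pick_name({}, "The Starry Night", ["The Starry Night", "1889"]))
# Second layout: empty alt, visible text split over two lines, full name in aria-label.
print(pick_name({"aria-label": "The Starry Night"}, "", ["The Starry", "Night", "1889"]))
```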

### Thumbnails (the img src is never the real image)

The `src` on the img is always a 1x1 transparent gif placeholder, so the real
image is somewhere else. In order:

1. `data-src="https://encrypted-tbn..."`, which is just the URL (39 of the 47).
2. A base64 one set by a little inline script:
`var s='data:image/jpeg;base64,...';var ii=['<img id>',...]` then
`_setImagesSrc` (8 of them). Two gotchas here. The `=` at the end of the base64
is written as `\x3d` in the script (four characters instead of one), so you
have to un-escape it (mine came out 3 characters too long until I twigged). Also,
the same id can get set more than once, so you keep the first one.
3. If neither of those, an actual `http` src if there is one.

If there's nothing, the image would need another request, so I just set it to
`null` like the task says.
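
Here is a small sketch of that lookup order, including the `\x3d` un-escaping and the "first one wins" rule. The regex assumes the script text looks exactly like the snippet quoted above (real pages may have different spacing), and all function names are made up for illustration.

```python
import re

# Assumed script shape: var s='data:image/...';var ii=['id1','id2'];
SCRIPT_RE = re.compile(r"var s='(data:image/[^']+)';\s*var ii=\[([^\]]*)\]")
HEX_ESCAPE_RE = re.compile(r"\\x([0-9a-fA-F]{2})")


def decode_js_escapes(value):
    """Turn JS hex escapes such as \\x3d back into the character they stand for."""
    return HEX_ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), value)


def inline_images(html):
    """Map img id -> data URI from the inline scripts, keeping the first per id."""
    images = {}
    for data_uri, id_list in SCRIPT_RE.findall(html):
        uri = decode_js_escapes(data_uri)
        for img_id in re.findall(r"'([^']+)'", id_list):
            images.setdefault(img_id, uri)  # a later script never overwrites
    return images


def pick_image(img_attrs, images):
    """Apply the order: data-src, then inline base64 by id, then a real http src."""
    if img_attrs.get("data-src"):
        return img_attrs["data-src"]
    by_id = images.get(img_attrs.get("id"))
    if by_id:
        return by_id
    src = img_attrs.get("src") or ""
    if src.startswith("http"):
        return src
    return None  # would need another request
```

Returning `None` at the end is what becomes `null` once the result is dumped to JSON.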

## Output format

- The keys come out in the order `name`, `extensions`, `link`, `image`.
- If there's no date I leave `extensions` off completely rather than putting an
empty list.
- `image` is either a `data:` URI, an `https://` URL, or `null`.
- The whole thing is wrapped in `{"artworks": [...]}` to match their file.
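
Those rules can be illustrated with a couple of small helpers. `build_item` and `to_json` are hypothetical names, and the `indent` and `ensure_ascii` settings are only assumptions about how the JSON is printed.

```python
import json


def build_item(name, extensions, link, image):
    """Build one entry with keys in the order name, extensions, link, image."""
    item = {"name": name}
    if extensions:  # leave the key off entirely when there is no date
        item["extensions"] = list(extensions)
    item["link"] = link
    item["image"] = image  # a data: URI, an https:// URL, or None (null in JSON)
    return item


def to_json(items):
    """Wrap the items in the {"artworks": [...]} envelope."""
    return json.dumps({"artworks": items}, ensure_ascii=False, indent=2)


print(to_json([
    build_item("The Starry Night", ["1889"], "https://www.google.com/search?...", None),
]))
```

Plain dicts keep insertion order in Python 3.7+, so the key order falls out of the order of the assignments.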

## Tests

`test_parser.py` is split into a few layers, so if something breaks you can tell which part:

1. The main one: the output has to match `expected-array.json` exactly.
2. Some checks on the shape (key order, links being absolute, the split of base64
vs URL images, the first and last item).
3. A couple of little unit tests for the fiddly bits (the escape decoding, and
keeping the first image when an id repeats).
4. The command line: it prints the right JSON, errors properly if you give it no
file, and the lxml/html.parser fallback gives the same answer.
5. The other pages: it picks up every `.html` + `.expected.json` pair in
`tests/fixtures/`, so I can drop a new page in and it gets tested automatically.
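
The fixture pick-up in layer 5 could look roughly like this. The naming convention (`page.html` next to `page.expected.json`) and the function name `fixture_pairs` are assumptions made for the sketch.

```python
from pathlib import Path


def fixture_pairs(fixtures_dir):
    """Return (html_path, expected_json_path) for every complete pair in a folder."""
    pairs = []
    for html_path in sorted(Path(fixtures_dir).glob("*.html")):
        expected = html_path.with_name(html_path.stem + ".expected.json")
        if expected.exists():  # an .html without its .expected.json is skipped
            pairs.append((html_path, expected))
    return pairs
```

A list like this can be handed to `pytest.mark.parametrize`, so dropping a new pair into the folder adds a test case without touching the test code.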

## Fixtures

Mostly real saved pages. I also wrote three small HTML pages by hand for three
things my real pages don't happen to include: a missing thumbnail (so `image`
comes out `null`), an image sitting on the plain `src` rather than `data-src`,
and the other name layout where the name is in `aria-label` instead of `alt`.

Real ones:

- `files/van-gogh-paintings` (the one they gave): 47 items, matches exactly.
- `picasso_paintings` (US): a real saved page with the absolute links, 45 items.
- `monet_paintings_fr` (French): searched with `gl=fr&hl=fr`, 49 items, just to
check the accents and a different language don't break it.
- `power_cast` (UK): this one's a grid of cast members rather than a carousel, 51
people, and the "extension" is their character name. Same code handles it.

Hand-written:

- `vangogh_aria_layout`: the second name layout (empty alt, name in aria-label).
- `monet_carousel`: the g-scrolling-carousel wrapper, all three image types plus a
null one, a no-date item, and some decoy links that should be ignored.
- `star_wars_movies`: a non-painting one with no wrapper, and an image on the
actual src.

## A few notes

- I didn't use a proper JS engine to run Google's scripts, just a regex to pull
the base64 out. A JS engine felt like overkill for this, and the regex is easier
to follow.
- The `\x3d` thing genuinely took me a while, the images kept coming out slightly
wrong and I couldn't work out why at first.
- The absolute vs relative link thing only showed up once I tried a real page I'd
saved myself, which I suppose is exactly why they ask you to test on other pages.
- Kept lxml optional so you don't strictly need it installed.

## Stuff I didn't do

- The other kind of grid/mosaic. The grids that use the same `/search?stick=`
links work fine (that's the power_cast one), but Google also does grids with
`wp-grid-tile` and no stick links, and I haven't handled those. That would want
its own separate bit of code and I'd want a real example to build it against.
- There's a thing where Google loads a carousel in afterwards with
`window.jsl.dh('id','<html>')`. I had a go at it and it works, but none of my
actual test pages needed it so I didn't leave it in.
- If a page had more than one carousel it'd currently just lump them together.