TypeError (str/bytes) in warc.py error path #23

@bnewbold

In production at IA, I got the following exception and stack trace, probably triggered by petabox downtime or a network error:

TypeError: sequence item 0: expected str instance, bytes found
  File "extraction_ungrobided.py", line 272, in <module>
    MRExtractUnGrobided.run()
  File "mrjob/job.py", line 424, in run
    mr_job.execute()
  File "mrjob/job.py", line 433, in execute
    self.run_mapper(self.options.step_num)
  File "mrjob/job.py", line 517, in run_mapper
    for out_key, out_value in mapper(key, value) or ():
  File "extraction_ungrobided.py", line 228, in mapper
    info, status = self.extract(info)
  File "extraction_ungrobided.py", line 143, in extract
    info['file:cdx']['c_size'])
  File "extraction_ungrobided.py", line 126, in fetch_warc_content
    gwb_record = rstore.load_resource(warc_uri, offset, c_size)
  File "wayback/resourcestore.py", line 65, in load_resource
    return create_resource(loader.load_block(bstart, blen))
  File "wayback/resource.py", line 583, in create_resource
    record, errors, offset = parser.parse(rs, 0, line)
  File "hanzo/warctools/warc.py", line 223, in parse
    % (",".join(self.KNOWN_VERSIONS)),

self.KNOWN_VERSIONS is defined as bytes at https://github.com/internetarchive/warctools/blob/master/hanzo/warctools/warc.py#L177, but is being joined with a string.

One fix, though I'm not sure it would work in Python 2.7, would be:

(",".join([s.decode('utf-8') for s in self.KNOWN_VERSIONS]))

There's probably a more idiomatic way, but I can submit a patch for that.

While we're at it, might want to make it a join on ", ", not ","?
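
For reference, here is a minimal sketch of the failure and one possible fix, outside of warctools. The `KNOWN_VERSIONS` values and the `join_versions` helper are stand-ins I made up for illustration, not the library's actual code. Joining as bytes first and decoding once should sidestep the str/bytes mix on Python 3, and I'd expect it to behave on Python 2.7 as well, though I haven't tested that there:

```python
# Stand-in values for illustration; the real definition lives in
# hanzo/warctools/warc.py and may differ.
KNOWN_VERSIONS = (b"1.0", b"0.18", b"0.17")


def join_versions(versions, sep=", "):
    """Join a sequence of bytes version strings into one text string.

    Hypothetical helper: joins as bytes, then decodes once, so str and
    bytes are never mixed in a single join() call.
    """
    return sep.encode("ascii").join(versions).decode("utf-8")


# Reproduce the reported failure: on Python 3 a str separator cannot
# join bytes items. (On Python 2.7 this join would not raise.)
error_message = None
try:
    ",".join(KNOWN_VERSIONS)
except TypeError as exc:
    error_message = str(exc)

# The suggested alternative, using ", " as the separator.
joined = join_versions(KNOWN_VERSIONS)
print(error_message)
print(joined)
```

The list-comprehension version above should work just as well; this variant only differs in decoding once instead of per item.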
