Introduction
A tokenizer breaks a stream of text into tokens, usually by looking for whitespace (tabs, spaces, newlines).
A lexer is basically a tokenizer, but it usually attaches extra context to the tokens: this token is a number, that token is a string literal, this other token is an equality operator.
A parser takes the stream of tokens from the lexer and turns it into an abstract syntax tree representing the program (usually) described by the original text.
The best book on the subject for many decades was Compilers: Principles, Techniques, and Tools, aka "The Dragon Book".
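As a rough illustration of the tokenizer/lexer distinction, here is a minimal sketch (the function name toy_lex and the token categories are invented for this example and are not part of the exercises): it splits on whitespace like a tokenizer, then tags each token the way a lexer would.

import re

def toy_lex(text):
    # Tokenize: split on whitespace (tabs, spaces, newlines).
    tokens = text.split()
    lexed = []
    for tok in tokens:
        # Lex: attach a coarse category to each token.
        if re.fullmatch(r"\d+", tok):
            kind = "NUMBER"
        elif tok == "==":
            kind = "EQUALITY_OPERATOR"
        elif re.fullmatch(r'"[^"]*"', tok):
            kind = "STRING_LITERAL"
        else:
            kind = "IDENTIFIER"
        lexed.append((kind, tok))
    return lexed

>>> toy_lex('count == 42')
[('IDENTIFIER', 'count'), ('EQUALITY_OPERATOR', '=='), ('NUMBER', '42')]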
1. Tokenizer
Write a function that takes a string target, a prefix, and a suffix, and returns a list of all tokens in target that start with prefix and end with suffix, inclusive.
Add tokenizer.py and add the function as stubbed below.
def tokenizer(target, prefix, suffix):
    return list_of_tokens_that_match
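One possible implementation, assuming the intent is to return every substring that begins with prefix and ends at the next occurrence of suffix (with both included), is sketched below; a real submission might treat overlapping or unterminated matches differently.

def tokenizer(target, prefix, suffix):
    # Scan left to right: each token runs from an occurrence of prefix
    # to the next occurrence of suffix, with both included.
    tokens = []
    start = target.find(prefix)
    while start != -1:
        end = target.find(suffix, start + len(prefix))
        if end == -1:
            break                       # unterminated token: stop scanning
        tokens.append(target[start:end + len(suffix)])
        start = target.find(prefix, end + len(suffix))
    return tokens

>>> tokenizer('<b>x</b> and <b>y</b>', '<b>', '</b>')
['<b>x</b>', '<b>y</b>']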
2. Better Tokenization
Can you describe three limitations of the function described above which, if addressed, would make this code more useful and reusable?
Write your answer in tokenizer.md.
3. Scraping a webpage
Write a function that takes a url and returns a list of all URLs that are referenced only as the href attribute of anchor (<a>) tags in the response text. Make sure to use the tokenizer function you wrote in Part 1.
def get_url_list(url):
    # do something
    url_list = tokenizer(webpage_source, prefix, suffix)
    # do something else
    return url_list
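One way to fill in the stub is sketched below. It assumes the standard-library urllib.request is acceptable for fetching the page and that double-quoted href="..." attributes are the pattern worth extracting; single-quoted or unquoted attributes, and the filtering down to URLs referenced only via anchor tags, are left out of this sketch.

from urllib.request import urlopen

from tokenizer import tokenizer   # the Part 1 function in tokenizer.py

def get_url_list(url):
    # Fetch the page; assume the body decodes reasonably as UTF-8.
    with urlopen(url) as response:
        webpage_source = response.read().decode("utf-8", errors="replace")

    # Use the Part 1 tokenizer to grab href="..." attributes, then strip
    # the prefix and suffix so only the URL itself remains.
    prefix, suffix = 'href="', '"'
    matches = tokenizer(webpage_source, prefix, suffix)
    url_list = [m[len(prefix):-len(suffix)] for m in matches]
    return url_list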
4. Infix to postfix
Take an infix expression such as (a + b) ^ c - d / q and render it as a postfix expression.
The output must be a list that can be processed as postfix.
Make no assumptions about spacing or other delimiters.
Add your function to the file tokenizer.py.
def infix_to_postfix(infix_expression: str):
    # do stuff
    return postfix_result
Example
>>> infix_to_postfix("a+b*c+d")
["a", "b", "c", "*", "+", "d", "+"]