73 changes: 70 additions & 3 deletions final_task/README.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,70 @@
# Your readme here
Some text.
Checkout how to write this file using *markdown*.
## Iteration 1
RSS reader is a command-line utility that receives an RSS feed URL and prints its news in a readable format.

The command-line interface is:

`rss_reader.py source [-h] [--version] [--verbose] [--json] [--limit LIMIT]`
````
positional arguments:
source - URL which provides an RSS feed
optional arguments:
-h - prints this help page
--version - prints the current version to stdout
--verbose - prints all logs to stdout
--json - prints news in JSON format
--limit LIMIT - limits the number of news entries in the output
````
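
The interface above maps naturally onto `argparse`. The following is only a minimal sketch, not the project's actual parser: `build_parser` is a hypothetical helper, the `VERSION` value is a placeholder, and the help strings are taken from the listing above.

```python
import argparse

VERSION = "1.4"  # placeholder for illustration; the real value lives in the package


def build_parser():
    """Build a parser mirroring the documented interface (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="rss_reader.py")
    parser.add_argument("source", help="URL which provides an RSS feed")
    parser.add_argument("--version", action="version", version=VERSION,
                        help="prints the current version to stdout")
    parser.add_argument("--verbose", action="store_true",
                        help="prints all logs to stdout")
    parser.add_argument("--json", action="store_true",
                        help="prints news in JSON format")
    parser.add_argument("--limit", type=int, default=None,
                        help="limits the number of news entries in the output")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["https://news.yahoo.com/rss", "--limit", "2", "--verbose"]
)
print(args.source, args.limit, args.verbose, args.json)
```

With `default=None`, a missing `--limit` can be told apart from an explicit number, which is one simple way to express "no limit".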
JSON structure:
```
[
{
"title": "A black man was put in handcuffs after a police officer stopped him on a trainplatform because he was eating",
"article": "Bay Area Rapid Transit police said Steve Foster, of Concord, California,violated state law by eating a sandwich on a BART station's platform. ",
"links": [
"https://news.yahoo.com/black-man-put-handcuffs-police-170516695.html",
"http://l.yimg.com/uu/api/res/1.2/iLcp4eQPeHI64PZ9LpeQcw--/YXBwaWQ9eXRhY2h5b247aD04Njt3PTEzMDs-/https://media.zenfs.com/en-US/insider_articles_922/e4254e78d7432dae4387d72624ee3086"
],
"link": "https://news.yahoo.com/black-man-put-handcuffs-police-170516695.html",
"date": "Mon, 11 Nov 2019 17:06:55 -0500"
},
{
...
},
...
]
```
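
A structure like the one above can be produced with the standard `json` module. This is a sketch under assumptions: `Entry` is a hypothetical stand-in for the project's news class, and the field values are placeholders rather than real feed data.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Entry:
    """Hypothetical news entry with the fields shown in the JSON structure."""
    title: str
    article: str
    link: str
    date: str
    links: List[str] = field(default_factory=list)


entries = [
    Entry(
        title="Example title",
        article="Example article text.",
        link="https://example.com/story.html",
        date="Mon, 11 Nov 2019 17:06:55 -0500",
        links=["https://example.com/story.html"],
    )
]

# ensure_ascii=False keeps non-ASCII characters readable in the output.
output = json.dumps([asdict(entry) for entry in entries], indent=4, ensure_ascii=False)
print(output)
```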

## Iteration 2
To run the RSS parser on your computer:
1) clone the repository from https://github.com/ElizabethUniverse/FinalTaskRssParser
2) `$cd final_task`
3) `$python setup.py sdist bdist_wheel`
4) `$cd dist`
5) `$pip install rss_reader-1.1.tar.gz`
6) run `$rss_reader https://news.yahoo.com/rss --limit 2 --verbose`


## Iteration 3
News is stored in a tab-delimited CSV cache in the following format:

`date title link article list_links`

Currently the cache is searched linearly, with O(n) complexity. We plan to optimize this in the near future.
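
One possible direction for that optimization is to group the cached rows by calendar date once, so that each later lookup is a dictionary access instead of a full scan. The sketch below is an assumption, not the project's code: `build_date_index` is a hypothetical helper, the column names follow the format line above, an in-memory string stands in for the cache file, and dates are parsed with the standard library's `email.utils.parsedate_to_datetime`.

```python
import csv
import io
from collections import defaultdict
from datetime import date
from email.utils import parsedate_to_datetime

# Column order assumed from the cache format described above.
FIELDNAMES = ["date", "title", "link", "article", "list_links"]


def build_date_index(csv_file):
    """Group cached rows by calendar date (hypothetical helper)."""
    index = defaultdict(list)
    reader = csv.DictReader(csv_file, FIELDNAMES, delimiter="\t")
    for row in reader:
        # The time of day and the UTC offset are dropped; only the date is the key.
        index[parsedate_to_datetime(row["date"]).date()].append(row)
    return index


# Placeholder rows standing in for the real cache file.
cache = io.StringIO(
    "Mon, 11 Nov 2019 17:06:55 -0500\tFirst\thttps://example.com/1\tText one\t[]\n"
    "Fri, 15 Nov 2019 09:00:00 +0000\tSecond\thttps://example.com/2\tText two\t[]\n"
)
index = build_date_index(cache)
matches = index.get(date(2019, 11, 15), [])
print([row["title"] for row in matches])
```

Building the index is still one pass over the file, so this only pays off when several dates are queried in one run or when the index itself is persisted.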

To get the news for 15/11/2019, run the following command:

`$python rss_reader.py https://news.yahoo.com/rss --date 20191115`

The `--date` argument works without an internet connection. The `--verbose`, `--json` and `--limit LIMIT` arguments behave the same way with it as they do for a live feed.
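
The `--date` value uses the `YYYYMMDD` form. A minimal sketch of validating such a value with the standard library follows; `parse_date_argument` is a hypothetical name and not part of the project.

```python
from datetime import date, datetime


def parse_date_argument(value):
    """Convert a YYYYMMDD string into a date, raising ValueError on bad input."""
    # strptime alone would accept shorter strings such as "2019111",
    # so the length and digit checks are done explicitly first.
    if len(value) != 8 or not value.isdigit():
        raise ValueError("expected a date in YYYYMMDD format, got %r" % value)
    return datetime.strptime(value, "%Y%m%d").date()


print(parse_date_argument("20191115"))  # 2019-11-15
```

A function of this shape can be passed to `argparse` as `type=`, in which case a `ValueError` is reported to the user as a usage error.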

## Iteration 4

News can be converted to PDF or HTML.

To convert news to PDF:

`$python rss_reader.py https://news.yahoo.com/rss --to-pdf path`

To convert news to HTML:

`$python rss_reader.py https://news.yahoo.com/rss --to-html path`
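
To illustrate what the HTML conversion amounts to, here is a standard-library-only sketch. It is an assumption for illustration and does not use the project's templating: `render_html` and `write_html` are hypothetical helpers, articles are plain dictionaries, and the `articles.html` file name mirrors the constant in `ToHTML.py`.

```python
import html
import os
import tempfile


def render_html(articles):
    """Render a list of {'title', 'link', 'article'} dicts as a minimal HTML page."""
    items = []
    for article in articles:
        # Escape every value so that feed text cannot break the markup.
        items.append(
            '<h2><a href="{link}">{title}</a></h2>\n<p>{body}</p>'.format(
                link=html.escape(article["link"], quote=True),
                title=html.escape(article["title"]),
                body=html.escape(article["article"]),
            )
        )
    return "<html><body>\n" + "\n".join(items) + "\n</body></html>\n"


def write_html(articles, directory, filename="articles.html"):
    """Write the rendered page into an existing directory and return its path."""
    if not os.path.isdir(directory):
        raise FileNotFoundError("Output directory does not exist: %s" % directory)
    target = os.path.join(directory, filename)
    with open(target, "w", encoding="utf-8") as stream:
        stream.write(render_html(articles))
    return target


sample = [{"title": "Tom & Jerry", "link": "https://example.com/1", "article": "Text"}]
with tempfile.TemporaryDirectory() as directory:
    path = write_html(sample, directory)
    print(os.path.basename(path))
```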
80 changes: 80 additions & 0 deletions final_task/rss_reader.egg-info/PKG-INFO
@@ -0,0 +1,80 @@
Metadata-Version: 1.2
Name: rss-reader
Version: 1.4
Summary: RSS parser
Home-page: https://github.com/ElizabethUniverse/FinalTaskRssParser
Author: Elizaveta Lapunova
Author-email: liza.lapunova99@gmail.com
License: BSD
Description: ## Iteration 1
RSS reader is a command utility, which receives RSS URL and prints the result in convenient output format

Input data has the following interface:

`rss_reader.py source [-h] [--version] [--verbose] [--json] [--limit LIMIT]`
````
positional arguments:
source - URL which provides a RSS feed
optional arguments:
-h - prints this help page
--version - prints in stdout current version
--verbose - prints all logs in stdout
--json - prints news in JSON format
--limit LIMIT - limits the amount of news entries in the output
````
JSON structure:
```
[
{
"title": "A black man was put in handcuffs after a police officer stopped him on a trainplatform because he was eating",
"article": "Bay Area Rapid Transit police said Steve Foster, of Concord, California,violated state law by eating a sandwich on a BART station's platform. ",
"links": [
"https://news.yahoo.com/black-man-put-handcuffs-police-170516695.html",
"http://l.yimg.com/uu/api/res/1.2/iLcp4eQPeHI64PZ9LpeQcw--/YXBwaWQ9eXRhY2h5b247aD04Njt3PTEzMDs-/https://media.zenfs.com/en-US/insider_articles_922/e4254e78d7432dae4387d72624ee3086"
],
"link": "https://news.yahoo.com/black-man-put-handcuffs-police-170516695.html",
"date": "Mon, 11 Nov 2019 17:06:55 -0500"
},
{
...
},
...
]
```

## Iteration 2
to run rss parser on your computer you need to:
1) clone repository from https://github.com/ElizabethUniverse/FinalTaskRssParser
2) `$cd final_task`
3) `$python setup.py sdist bdist_wheel`
4) `$cd dist`
3) `$pip install rss_reader-1.1.tar.gz`
4) run `$rss_reader https://news.yahoo.com/rss --limit 2 --verbose`


## Iteration 3
News is stored in the csv cache in following format and with tab delimiter.

`date title link article list_links`

Now we are searching for the news in the cache with O(n) complexity. But in the near future we plan to optimize this process.

If you want to receive news for the 15/11/2019, please enter the following command in the command line

`$python rss_reader.py https://news.yahoo.com/rss --date 20191115`

--date argument works without internet connection and with --verbose, --json, --limit LIMIT arguments the same way.

##Iteration 4

News can be converted to pdf or html.

If you want to convert news to pdf:

`$python rss_reader.py https://news.yahoo.com/rss --to-pdf path`

to html:

`$python rss_reader.py https://news.yahoo.com/rss --to-html path`
Platform: any
Requires-Python: >=3.7.0
19 changes: 19 additions & 0 deletions final_task/rss_reader.egg-info/SOURCES.txt
@@ -0,0 +1,19 @@
README.md
setup.py
rss_reader/CSVEntities.py
rss_reader/ClassNews.py
rss_reader/ToHTML.py
rss_reader/ToPDF.py
rss_reader/__init__.py
rss_reader/__main__.py
rss_reader/requirements.txt
rss_reader/rss_reader.py
rss_reader.egg-info/PKG-INFO
rss_reader.egg-info/SOURCES.txt
rss_reader.egg-info/dependency_links.txt
rss_reader.egg-info/entry_points.txt
rss_reader.egg-info/not-zip-safe
rss_reader.egg-info/requires.txt
rss_reader.egg-info/top_level.txt
test/RssUnitTest.py
test/__init__.py
1 change: 1 addition & 0 deletions final_task/rss_reader.egg-info/dependency_links.txt
@@ -0,0 +1 @@

3 changes: 3 additions & 0 deletions final_task/rss_reader.egg-info/entry_points.txt
@@ -0,0 +1,3 @@
[console_scripts]
rss_reader=rss_reader.rss_reader:main

1 change: 1 addition & 0 deletions final_task/rss_reader.egg-info/not-zip-safe
@@ -0,0 +1 @@

4 changes: 4 additions & 0 deletions final_task/rss_reader.egg-info/requires.txt
@@ -0,0 +1,4 @@
html2text==2019.9.26
python-dateutil==2.8.0
jinja2==2.10.1
fpdf==1.7.2
2 changes: 2 additions & 0 deletions final_task/rss_reader.egg-info/top_level.txt
@@ -0,0 +1,2 @@
rss_reader
test
56 changes: 56 additions & 0 deletions final_task/rss_reader/CSVEntities.py
@@ -0,0 +1,56 @@
import csv
from datetime import date
from dateutil.parser import parse
from dataclasses import asdict
import os

import ClassNews

FIELDNAMES = ['date', 'title', 'link', 'article', 'links']


def csv_to_python(articles_list, csv_file):
    """Despite its name, add to the CSV cache those articles that are not stored in it yet."""
    # Create an empty cache file on first use.
    if not os.path.exists(csv_file):
        open(csv_file, 'x', encoding='utf-8').close()

    articles_list_from_csv = []
    with open(csv_file, "r", encoding='utf-8', newline='') as file:
        reader = csv.DictReader(file, FIELDNAMES, delimiter='\t')
        for item in reader:
            articles_list_from_csv.append(ClassNews.Article(**item))

    # Keep the cached articles and append only the unseen ones.
    union_list = articles_list_from_csv[:]
    for item in articles_list:
        if item not in articles_list_from_csv:
            union_list.append(item)

    with open(csv_file, "w", encoding='utf-8', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=FIELDNAMES, delimiter='\t')
        for item in union_list:
            writer.writerow(asdict(item))
    return True


def return_news_to_date(input_date, csv_file, limit):
    """Read from the cache the articles published on input_date (a YYYYMMDD string)."""
    article_list_by_date = []
    datetime_input = date(int(input_date[0:4]), int(input_date[4:6]), int(input_date[6:8]))
    with open(csv_file, "r", encoding='utf-8', newline='') as file:
        reader = csv.DictReader(file, FIELDNAMES, delimiter='\t')
        for item in reader:
            article_from_file = ClassNews.Article(**item)

            # Compare calendar dates only; the time of day is ignored.
            date_from_file = parse(article_from_file.date).date()

            if date_from_file == datetime_input:
                article_list_by_date.append(article_from_file)

            # Stop early once the requested number of matches is collected.
            if limit == len(article_list_by_date):
                break

    return article_list_by_date
77 changes: 77 additions & 0 deletions final_task/rss_reader/ClassNews.py
@@ -0,0 +1,77 @@
import re
import html2text
from dataclasses import dataclass


# Raw string avoids the invalid "\w" escape; group 1 is the URL between double quotes.
LINKS_TEMPLATE = r'"(https?://.+?)"'


def xml_arguments_for_class(xml_string, limit):
    """Receive the parsed XML tree and the article limit; return a list of dictionaries.

    Despite its name, xml_string must be a parsed element that supports iter('item').
    """
    dict_article_list = []
    text = html2text.HTML2Text()
    text.ignore_images = True
    text.ignore_links = True
    text.ignore_emphasis = True
    for counter, xml_news in enumerate(xml_string.iter('item')):
        parser_dictionary = {}
        for xml_news_item in xml_news:
            # Here we create the article in the form of a dictionary
            if xml_news_item.tag == 'title':
                parser_dictionary['title'] = text.handle(xml_news_item.text).replace('\n', '')
            elif xml_news_item.tag == 'pubDate':
                parser_dictionary['date'] = xml_news_item.text
            elif xml_news_item.tag == 'link':
                parser_dictionary['link'] = xml_news_item.text
            elif xml_news_item.tag == 'description':
                # The raw description is kept as 'links'; URLs are extracted from it later.
                parser_dictionary['article'] = text.handle(xml_news_item.text).replace('\n', '')
                parser_dictionary['links'] = xml_news_item.text.replace('\n', '')

        dict_article_list.append(parser_dictionary)

        # Stop as soon as the requested number of articles has been collected.
        if limit == counter + 1:
            return dict_article_list
    return dict_article_list

def dicts_to_articles(dict_list):
    """Convert a list of dictionaries into a list of Article objects."""
    return [Article(**item) for item in dict_list]


def html_text_to_list_links(html_links):
    """Return every quoted http(s) URL found in an HTML fragment."""
    # Normalise single quotes so that one pattern covers both quoting styles.
    html_links = html_links.replace("'", '"')
    return [match.group(1) for match in re.finditer(LINKS_TEMPLATE, html_links)]

@dataclass
class Article:
    """A news article with title, date, link, article and links fields.

    links is passed in as raw HTML text and converted to a list of URLs in __post_init__.
    """
    title: str
    date: str
    link: str
    article: str
    links: str

    def __post_init__(self):
        self.links = html_text_to_list_links(self.links)

    def __str__(self):
        result_string_article = "\nTitle: %s\nDate: %s\nLink: %s\n\n%s\n\n" % (self.title, self.date, self.link,
                                                                             self.article)
        for link_idx, link in enumerate(self.links, start=1):
            result_string_article += "[%d]: %s\n" % (link_idx, link)
        result_string_article += '\n'
        return result_string_article

    def __eq__(self, other):
        # links is not part of the comparison.
        if not isinstance(other, Article):
            return NotImplemented
        return (self.title, self.date, self.link, self.article) == \
               (other.title, other.date, other.link, other.article)
19 changes: 19 additions & 0 deletions final_task/rss_reader/ToHTML.py
@@ -0,0 +1,19 @@
from jinja2 import Environment, FileSystemLoader
import os

FILENAME_HTML = "articles.html"


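def print_article_list_to_html(list_articles, path):
    """Render the articles and write them to FILENAME_HTML inside the existing path."""
    if not os.path.exists(path):
        raise FileNotFoundError("Output path does not exist: %s" % path)
    html_stream = print_article_list(list_articles)
    with open(os.path.join(path, FILENAME_HTML), "w", encoding='utf-8') as html:
        html.write(html_stream)


def print_article_list(list_articles):
    """Render the articles with template.html and return the resulting HTML string."""
    # The template is looked up in the current working directory.
    env = Environment(loader=FileSystemLoader('.'))
    template = env.get_template('template.html')
    return template.render(articles=list_articles)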
44 changes: 44 additions & 0 deletions final_task/rss_reader/ToPDF.py
@@ -0,0 +1,44 @@
import os
from fpdf import FPDF

FILENAME_PDF = "articles.pdf"


def conv_str(input_str):
    """Remove typographic characters (ellipsis, curly quotes, en dash) that have no Latin-1 encoding."""
    for char in ('\u2026', '\u2019', '\u201c', '\u201d', '\u2013', '\u2018'):
        input_str = input_str.replace(char, '')
    return input_str


class PDF(FPDF):

    # Page footer
    def footer(self):
        # Position at 1.5 cm from bottom
        self.set_y(-15)
        # Arial italic 8
        self.set_font('Arial', 'I', 8)
        # Page number
        self.cell(0, 10, 'Page ' + str(self.page_no()) + '/{nb}', 0, 0, 'C')


def print_article_list_to_pdf(list_articles, path):
    """Write the articles to FILENAME_PDF inside the existing path."""
    if not os.path.exists(path):
        raise FileNotFoundError("Output path does not exist: %s" % path)
    path = os.path.join(path, FILENAME_PDF)

    pdf = PDF()
    pdf.alias_nb_pages()
    pdf.add_page()
    pdf.set_font('Arial', '', 12)

    for item in list_articles:
        pdf.cell(0, 10, "Title: %s" % (conv_str(item.title)), 0, 1)
        pdf.cell(0, 10, "Date: %s" % (conv_str(item.date)), 0, 1)
        pdf.cell(0, 10, "Link: %s" % (conv_str(item.link)), 0, 1)
        # The fifth positional argument of multi_cell is the alignment, not a line-break flag.
        pdf.multi_cell(0, 10, conv_str(item.article), 0, 'L')
        # Number links from 1, matching the console output of Article.__str__.
        for idx, link in enumerate(item.links, start=1):
            pdf.multi_cell(0, 10, "[%d]:%s" % (idx, conv_str(link)), 0, 'L')
        pdf.cell(0, 10, "", 0, 1)
    pdf.output(path, 'F')
    return True
1 change: 1 addition & 0 deletions final_task/rss_reader/__init__.py
@@ -0,0 +1 @@
