~akspecs/numbeo-scraping-dev

refactoring

Message ID: <CFTHU1U8XTSS.1QGYQFRQA1R8C@nano>
there are a number of things that need to be done to refactor this code.
here are some of my thoughts based on our recent conversation.

the end result will be a more organized repo where our scripts' code is
readable and easily maintainable, and every file has its place.

here's what we need to do in order to achieve this goal (feel free to
add your feedback/comments on what i missed or what should/shouldn't be
prioritized):

json2sqlite.py:
1)
break the code up into many functions, and guard it.
i.e. 
...
def main():
    one_of_many_functions()

if __name__ == '__main__':
    main()

to reiterate, all of the code that is run every time to create/update the
database should be split up into as many different functions as
necessary.
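for illustration, the guarded layout could look something like this
(load_json and update_countries are placeholder names for whatever we
actually split the code into, not the script's current API):

```python
import json
import sqlite3


def load_json(path):
    """read one scraped json file into a list of records."""
    with open(path) as f:
        return json.load(f)


def update_countries(conn, rows):
    """create/update the countries table from parsed rows."""
    conn.execute('CREATE TABLE IF NOT EXISTS countries (name TEXT)')
    conn.executemany('INSERT INTO countries (name) VALUES (?)',
                     [(r['name'],) for r in rows])
    conn.commit()


def main():
    conn = sqlite3.connect('numbeo.db')
    update_countries(conn, load_json('countries.json'))
    conn.close()


if __name__ == '__main__':
    main()
```

the point is just that nothing runs at import time except under the
guard, so each piece can be tested (or reused) on its own.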

2)
implement argument parsing.  the default behaviour should be to look for
each json file in the current working directory under their assumed
names, but in the event they are not there or named something else, the
script should be able to take their name in the form of a command line
argument.
e.g.

python3 json2sqlite.py --countries countries.json [...]

this passes countries.json to the script, which tells the script where
to find said json file to create/update a table with.  this will be its
default behaviour, but in the event the script can't find the required
json files to update the database with, we can just pass the relative
path of the needed json file and voila!

a couple more command line parameters come to mind:
 --all           update all tables in the database
 --db <file>     specify which database file to use

again, this isn't an all encompassing list, just food for thought.
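a rough argparse sketch along those lines (the flag names follow the
examples above; the default filenames are just assumptions):

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description='create/update the sqlite database from scraped json')
    # default: look for each json file in the cwd under its assumed name
    parser.add_argument('--countries', default='countries.json',
                        help='path to the countries json file')
    parser.add_argument('--db', default='numbeo.db',
                        help='specify which database file to use')
    parser.add_argument('--all', action='store_true',
                        help='update all tables in the database')
    return parser.parse_args(argv)


if __name__ == '__main__':
    print(parse_args())
```

each json file we scrape would get its own flag in the same pattern.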


scrape2db.sh may also benefit from an argument parser, e.g. for pointing
to the root of the numbeo scrapy directory, and for choosing whether to
overwrite or back up the json files with scrapy's results.
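something like this getopts loop could cover that (the flag names and
the example path are made up, not what the script does today):

```shell
#!/bin/sh
# hypothetical flags for scrape2db.sh:
#   -r <dir>  root of the numbeo scrapy directory
#   -b        back up existing json files instead of overwriting
parse_opts() {
    OPTIND=1
    scrapy_root="."
    backup="no"
    while getopts "r:b" opt; do
        case "$opt" in
            r) scrapy_root="$OPTARG" ;;
            b) backup="yes" ;;
            *) echo "usage: scrape2db.sh [-r scrapy_root] [-b]" >&2
               return 1 ;;
        esac
    done
}

parse_opts -r /path/to/numbeo -b
echo "root=$scrapy_root backup=$backup"
```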

otherwise, both scrape2db.sh and json2sqlite.py should be moved to the
scripts directory in the root of the project's repo, leaving the
numbeo directory to only contain scrapy related files.

let me know what you think!

ak
Rebecca Medrano <rmedran4@mail.ccsf.edu>
Message ID: <CFUF37FQNEI8.3QT2IBBGMK4JC@manjaro-t480>
In-Reply-To: <CFTHU1U8XTSS.1QGYQFRQA1R8C@nano>
On Thu Nov 18, 2021 at 9:06 PM PST, Andrei K. wrote:
> the end result will be a more organized repo where our scripts' code
> is readable and easily maintainable, and every file has its place.
Agreed.

> json2sqlite.py:
> break the code up into many functions, and guard it.
> ...
> to reiterate, all of the code that is run every time to create/update
> the database should be split up into as many different functions as
> necessary.
This will be especially important as we scrape even more data.

> implement argument parsing.
I agree that this will be nice to have, but having a working
implementation (albeit hacked together) should be the priority as our
deadlines approach.


Here are some more TODOs:

 - Add # of contributors to existing data (e.g. qol, climate)
   This will help us filter for data with many contributors, which we
   can assume is more accurate.
 - I would like to ensure that the data scraping methods are consistent
   throughout the different spiders.

 - Finish scraping the remaining numbeo pages:
   - safety
   - healthcare
   - commute time
   - food prices
   - property prices
 
 - Convert all empty strings and `?' to NULL values
 - If time allows, make a region table which includes full names, and
   not just abbreviations
 - Also if time allows, add a language table with each location's
   official language(s)
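The `?'-to-NULL conversion could be as simple as a small normalizing
helper applied before insertion (the table and column names here are
made up for illustration):

```python
import sqlite3


def normalize(value):
    """Map empty strings and '?' placeholders to None (NULL in SQLite)."""
    return None if value in ('', '?') else value


# Demo with a made-up table; the real columns come from the scraped data.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE qol (city TEXT, score TEXT)')
rows = [('Tbilisi', '?'), ('Quito', ''), ('Lima', '130.2')]
conn.executemany('INSERT INTO qol VALUES (?, ?)',
                 [tuple(normalize(v) for v in row) for row in rows])
conn.commit()
```

Running it through every value as rows are inserted keeps the fix in
one place instead of scattering checks across queries.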
 

Again, I agree that refactoring json2sqlite.py is a priority.


Rebecca