~akspecs/numbeo-scraping-dev

add scrape2db.sh v1 APPLIED

To import this patchset into git:

curl -s https://lists.sr.ht/~akspecs/numbeo-scraping-dev/patches/26650/mbox | git am -3

On Tue Nov 16, 2021 at 5:24 AM PST, Andrei K. wrote:

[PATCH] add scrape2db.sh

this commit contains scrape2db.sh, an all-in-one script that intends to:
 - scrape numbeo.com
 - create and/or update the sqlite database with the scraped data
---

Well, it works. (tested on the mac and the thinkpad nano)

You can call the script from any directory, and it will figure out how
and where to run the spiders.

e.g.

cd /tmp
git clone https://git.sr.ht/~akspecs/numbeo-scraping
/tmp/numbeo-scraping/scrape2db.sh

... scraping ensues ...

as opposed to having to cd into scrapy's numbeo directory.

Note my TODOs for this:
- use a more portable sed command that works with gnu/bsd/other (a rough
  sketch covering this and the next item is included further below)

- conditionally use mkdummyjson if and only if the desired json file
  does not already exist (to avoid overwriting existing data)

- check for dummy lines before removing them with rmdummylines to avoid
  accidentally damaging the json data

- output scrapy's json files to the current working directory

I plan to get to that last TODO after getting some form of an argument
parser for the json2sqlite.py script.  That way we can move these two
scripts into their own directory outside of scrapy's numbeo directory
and get things a little more organized.
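
For the first two TODOs, something along these lines might do (untested,
just a sketch; it keeps the existing function names and leans on
mktemp(1)):

rmdummylines() {
  # portable in-place delete: write to a temp file and move it back,
  # rather than relying on gnu sed's -i
  tmp=$(mktemp) || fail "mktemp failed"
  sed '1,2d' "$1" > "$tmp" && mv "$tmp" "$1"
}

mkdummyjson() {
  # only seed the placeholder json when the file isn't already there,
  # so data from a previous run doesn't get clobbered
  [ -e "${1}.json" ] || printf '[\n]\n' > "${1}.json"
}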


 scrape2db.sh | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)
 create mode 100755 scrape2db.sh

diff --git a/scrape2db.sh b/scrape2db.sh
new file mode 100755
index 0000000..a527ba4
--- /dev/null
+++ b/scrape2db.sh
@@ -0,0 +1,73 @@
#!/bin/sh
#
# scrape2db.sh - an all in one script that scrapes numbeo.com and
#                stores the scraped data in a(n) sqlite3 database
#
# in a little more detail, this script does the following:
#  - uses scrapy to crawl data from numbeo.com
#  - stores the scraped data in a series of json files
#  - ...??
#  - .....?!??!?!
#  - profit(s)! (spiritually, literally, metaphysically, etc...) or
#  - creates/updates a sqlite3 database from the data in the json files
#
# dependencies:
#  - scrapy >= 2.5.1
#
# be advised, some of the methodology here is incredibly hacky,
# aka quick and dirty.  this is because our hacky approach during the
# web scraping process led us to this point, where we need yet another
# hacky way to glue everything together

# TODO: use a more portable sed command that works with gnu/bsd/other
#
# TODO: conditionally use mkdummyjson if and only if the desired json
#       file does not already exist (to avoid overwriting existing data)
#
# TODO: check for dummy lines before removing them with rmdummylines to
#       avoid accidentally damaging the json data
#
# TODO: output scrapy's json files to the current working directory

dirname_realpath() {
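  # print the directory containing the fully resolved (absolute) path of $1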
  python3 -c "import os.path; print(os.path.dirname(os.path.realpath('$1')))"
}

PROJECT_HOME=$(dirname_realpath "$0")

fail() {
  echo "error: $1" >&2
  exit 1
}

mkdummyjson() {
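  # seed ${1}.json with a two-line placeholder array ('[' and ']')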
  printf '[\n]\n' > "${1}.json"
}

rmdummylines() {
  # strip the two placeholder lines left behind by mkdummyjson
  # (sed -i in this form works with gnu sed)
  sed -i '1,2d' "$1"
}

scrapy_crawl() {
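  # run the named spider from the scrapy project directory, feeding its
  # items into OUTPUT.json; bail out if the crawl fails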
  SPIDER="$1"
  OUTPUT="$2"
  cd "${PROJECT_HOME}/numbeo"
  scrapy crawl "${SPIDER}" -o "${OUTPUT}.json" || fail "$SPIDER failed"
}

main() {
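  # seed each feed file, run every spider (stripping the placeholder
  # lines on success), then create/update the sqlite database from the
  # resulting json files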
  cd "${PROJECT_HOME}/numbeo"
  for dummy in countries cities qoli climate; do
    mkdummyjson "$dummy"
  done
  scrapy_crawl numbeo_countries countries && rmdummylines countries.json
  scrapy_crawl numbeo_cities cities && rmdummylines cities.json
  scrapy_crawl qol qoli && rmdummylines qoli.json
  scrapy_crawl climate climate && rmdummylines climate.json
  ./json2sqlite.py
}

main

# vim: set ft=sh tw=79 ts=2 sw=2 et:
-- 
2.33.0

Be advised, THE SCRIPT WILL DELETE / OVERWRITE your scraped json files,
so DON'T RUN scrape2db.sh if you DO NOT want to LOSE the json data.
Sorry about the caps, but I wanted to get my point across.

Feel free to try it in some temporary directory, though.

If you get to test it, time it.  Also, make note of the behaviour of the
db after numerous updates and tell me if you notice anything.
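
For timing it, something like this should do (reusing the /tmp checkout
from the example above):

time /tmp/numbeo-scraping/scrape2db.sh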

Let me know what you think.


ak