~ntietz/isabella-db-devel

This thread contains a patchset. You're looking at the original emails, but you may wish to use the patch review UI. Review patch
1

[PATCH] Create design document for initial work

Details
Message ID
<20221024213620.2399627-1-me@ntietz.com>
DKIM signature
missing
Download raw message
Patch: +191 -0
This design doc covers the basic design I'm trying to implement with
Isabella and is an intended plan for the initial work. I have a short
"roadmap" of things to do so that it's broken into smaller pieces.
---
 docs/0001-initial-design-and-plan.md | 191 +++++++++++++++++++++++++++
 1 file changed, 191 insertions(+)
 create mode 100644 docs/0001-initial-design-and-plan.md

diff --git a/docs/0001-initial-design-and-plan.md b/docs/0001-initial-design-and-plan.md
new file mode 100644
index 0000000..d0038c5
--- /dev/null
+++ b/docs/0001-initial-design-and-plan.md
@@ -0,0 +1,191 @@

This document is an initial design document. It's sparse in details in some
places and is meant only to serve as exploration for initial development. It
contains omissions and is light on details for things that are in a more
exploratory phase.

The main focus is ensuring that the overall direction is correct, as well as
discovering things that will need to be learned or implemented.


# Context

IsabellaDB is a chess database. It's meant to be used for analyzing past games
and patterns across them to uncover interesting facts or improve as a chess
player.

Some things which IsabellaDB will support:
- Search for games
    - Filter games on metadata (player, event, year, rating, etc.)
    - Filter games based on position matches
    - Filter games based on sequences of position matches
- Display a game (show move sequence and allow going forward/back through it)
- Opening tree
    - Explore from the beginning of a game
    - Find games which match a given position
    - See win/lose/draw percents for a given position
    - Apply a game search filter to the above
- (TODO) Opening repertoires
    - Discover a player's opening repertoire
    - Build an opening repertoire
- (TODO) Ingest new game data
    - Update as new master game databases are published
    - Pull in data for us amateurs (for openings especially)


# Game Storage and Retrieval

The fundamental functionality of a chess database is to be able to store and
retrieve games with various filters. Additional functionality, like analysis,
is layered on top of this. As such, game storage and retrieval is the critical
core of IsabellaDB.

We'll focus first on storage of games, then retrieval, then indexing.

## Storage

Each game has a set of metadata associated with it, as well as a list of moves.

The metadata contains:
- Player names
- Player ratings
- Game result
- Game date
- Event information (event name, site name, round)

The game information contains:
- Sequence of materialized positions representing the game
- The algebraic notation representation of the game (for display)

There will be a lot of repetition in metadata, so it will be converted to store
only once in memory in a separate names store, and then the game records will
point into that store.

Each game will be assigned a unique id when it is ingested into the system,
and these will be used as the lookup keys from the indexes.

The initial in-memory storage will be pretty basic and will look like:
```rust
type GameID = u64;

struct GameStore {
    games: HashMap<GameID, Game>,
    indexes: ...
}
```


## Retrieval

There are two types of filters that you can do: filtering based on metadata
about a game, and filtering based on what happened within the game.

Retrieval will be pretty straightforward. Given a list of filters, in order of
use (this order will be determined by the query planner), we can retrieve them
with this pseudocode:

```python
# get the games that match the first filter
games = db.retrieve(filters[0])

for filter in filters:
    games = games.filter(filter)
```

To do the retrieval, the `db.retrieve` call will utilize the index that
corresponds to that filter and retrieve the records that match it, either as
a range or an exact result list.


## Indexing

To start, we'll have some fairly simple indexes. Over time, we'll build up more
complicated indexes to handle more complicated search patterns.

The initial indexes will be:
- **Player name index**: How to implement this is TBD and may start as a prefix
  tree, but ideally would be able to match on any substring within the
  metadata, not just based on prefixes. Inverted indexes also seem relevant.
- **Event name index**: Same as player name index.
- **Rating index**: a B-tree index over ratings.
- **Date index**: a B-tree index over dates.
- **Position index**: We'll have a reverse lookup from position to the games which
  contained that position. Lookups into this index can use hashes of the
  positions for quick initial lookups, but will have to do an exact comparison
  comparison of the positions to avoid collisions.

Indexing on game results would not add any significant benefit, since there are
only three classes (win, lose, draw).

Eventually, we'll also have indexes based on features of positions, instead of
exact position matches. This includes things like the presence (or absence) of
pins, certain pieces being on the board or not, etc. Further development of
indexes will be driven by need to support certain types of queries.

Out of scope for the initial design:
- Persistence. We'll reconstruct these at runtime for now to keep things simple.
- Updates. We'll want to insert new games into the index at runtime eventually,
  but for now we'll assume the dataset is static.


# Querying

A user should be able to submit a query for a set of games which match certain
criteria. Here is an example of doing that with a few different filters on
metadata:

```
# Find all the games played by Magnus and Fabi after 2015
(search-games
    (match-metadata
        (either-name "Fabiano Caruana") # matches either player
        (either-name "Magnus Carlsen")  # matches either player
        (year > 2015)
    )
)
```

And here's another example based on positional matching:

```
# Find games which match the provided position (5 ply of the Italian) with an
# extra filter on metadata
(search-games
    (match-position
        (fen "r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3")
    )
    (match-metadata
        (white-rating > 2300)
        (black-rating > 2300)
    )
)
```

The query language will be defined in more detail over time. This is just a
sample to guide what capabilities it'll need to provide. This will be one of the
later pieces developed.

When a query is received, it will be parsed (as an S-expression), converted to
a sequence of operations that can be run to retrieve the data from the store,
and optimized (mostly around ordering of application of filters, and use of
indexes).


# Development Plan and Roadmap

The initial development for the first phase will focus on getting data ingested
and basic indexes constructed. Querying will be in the second phase.

v0.1.0:
- PGN parser to be able to read in the dataset I have available to me
- Index on metadata (player names, event names, ratings, dates)
- Index on positions
- Search page with basic metadata filters

v0.2.0:
- Query parser
- Opening tree page
- Search page to also include position matches



-- 
2.37.3
Details
Message ID
<325d9b36484f40b8c0d098be4769b65b8d8f36da.camel@ntietz.com>
In-Reply-To
<20221024213620.2399627-1-me@ntietz.com> (view parent)
DKIM signature
missing
Download raw message
I got some informal feedback from a friend on this and intend to merge
it today.

> +```python
> +# get the games that match the first filter
> +games = db.retrieve(filters[0])
> +
> +for filter in filters:
> +    games = games.filter(filter)
> +```
> +

The first filter is done twice. This should say `for filter in
filters[1:`.
Reply to thread Export thread (mbox)