11 lines
600 B
Text
[x] Refactor website table to generic document table (maybe using URN instead of URL?)
[x] Define tokens table FKed to document table
[x] Refactor index.py to tokenize input into tokens table
[x] Define N-Grams table
[x] Add N-Gram generation to index.py
[x] Add clustered index to document_ngrams table model
[x] Add clustered index to document_tokens table model
[ ] Add DDL command to create partition tables
[ ] Investigate whether robots.txt is as aggressive as I'm making it out to be
[ ] Instead of starting from a random page on the site, go to the root, find the sitemap, and crawl that