This document describes how you can build your own search index from any content using the SearchBase class. The first part shows how to create a new index, and the second part shows how to keep it up to date. Searching the index works the same way as a base search; the only difference is that you read the data and add it to the index yourself.

from debdata3 import getrepo, uninitprism9
from debsearch3.searchbase import SearchBase
from bs4 import BeautifulSoup
import os

updatedone = False  # becomes True as soon as anything is written to the index
searchroot = "C:\\path\\to\\your\\search\\index\\root"
searchname = "searchname"
try:
    # Open the repository that holds the content to index.
    repo = getrepo("C:\\path\\to\\your\\database\\to\\search", "YourAppName")
    search = SearchBase(searchname, searchroot)
    # The Toc table lists the pages of the book in reading order.
    table = repo.gettable("Toc", 'BookPrefix="searchname"', (('TocSort', 'in order'),))
    if os.path.exists(search.indexpath):
        # Existing index: only documents changed since the last update are touched.
        isupdate = True
        lastupdate, dateastxt, timeastxt = search.getlastupdate()
    else:
        # No index yet: create it and add every document.
        isupdate = False
        search.createindex()
    for rec in table:
        ps = rec["PageSlug"]
        cp = rec["ChapterPrefix"]
        bp = rec["BookPrefix"]
        if ps == "":
            # Toc entry without a page: nothing to index, just list it.
            print(rec["Title"])
        else:
            url = "/viewdoc.html?&ps=%s&cp=%s" % (ps, cp)
            dequery = 'PageSlug="%s"' % (ps)
            doctable = repo.gettable("Document", dequery)
            doctable.cleanup()
            # Assumption: the original listing never assigns `doc`. Here the
            # Document table is iterated the same way as the Toc table above.
            for doc in doctable:
                utime = doc["UpdatedTime"]
                udate = doc["UpdatedDate"]
                ctime = doc["CreatedTime"]
                cdate = doc["CreatedDate"]
                # Strip the HTML markup and keep only the text.
                soup = BeautifulSoup(doc["Body"], features="html.parser")
                title = doc["Title"] + '\n'
                txt = soup.get_text('\n')
                # Append the first 252 characters to the title, cut back to
                # the last line break or space so no word is split.
                extra = txt[:252]
                lret = extra.rfind('\n')
                lspace = extra.rfind(' ')
                if not (lret == -1 and lspace == -1):
                    if lret > lspace:
                        title += extra[:lret] + '...'
                    else:
                        title += extra[:lspace] + '...'
                content = title + '\n' + txt
                if not isupdate:
                    # New index: every document is added.
                    search.updateindex(title=title, url=url, content=content, isnew=True, autocommit=False)
                    updatedone = True
                else:
                    docdatetime = search.datetimefromfields(udate, utime)
                    if docdatetime > lastupdate:
                        created = search.datetimefromfields(cdate, ctime)
                        # Created before the last run: update the existing
                        # entry. Otherwise the document is new: add it.
                        isnew = not (created < lastupdate)
                        search.updateindex(title=title, url=url, content=content, isnew=isnew, autocommit=False)
                        updatedone = True
                print("\t%s (%s %s) -> %s" % (rec["Title"], udate, utime, url))
    print("Returned %d records" % table.rowcount())
    if updatedone:
        # Write all pending changes in one go.
        search.commit()
    uninitprism9()
    print("Done")
except Exception as e:
    print(e)



This code opens a repository to read the data. It then opens the index, creating it first if it does not exist. If the index already exists, we fetch the date and time of the last update and use them to decide whether each document should be added, updated, or left alone. For a new index, every document is simply added. We also use the bs4 library to extract the plain text from our HTML documents and append the first 252 characters (cut back to the last line break or space) to the title, so results can be presented a little like Google's.

The example above is based on indexing this book, which has a Toc table that gives the order and content of the book, and another table ("Document") with articles from several books. The important thing in a document table is to store both the creation date and time and the update date and time. You can then do as in the example: update only the documents that have changed and add any new ones. Without these fields you have to delete the index each time and add all documents again.
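
The two decisions in the example that do not depend on SearchBase can be tried on their own. The sketch below is only an illustration: the function names make_snippet and index_action are hypothetical and are not part of debsearch3. make_snippet also varies the example slightly, as an assumption about what is wanted: text shorter than the limit is returned unchanged, and text without any space or line break is cut at the limit.

```python
from datetime import datetime


def make_snippet(text, limit=252):
    """Return at most `limit` characters of `text` for display in a result.

    Hypothetical helper. Long text is cut back to the last line break or
    space inside the limit, so no word is split, and '...' is appended.
    """
    if len(text) <= limit:
        return text  # short text needs no trimming
    extra = text[:limit]
    cut = max(extra.rfind('\n'), extra.rfind(' '))
    if cut == -1:
        return extra + '...'  # no break found: cut at the limit itself
    return extra[:cut] + '...'


def index_action(created, updated, lastupdate):
    """Decide what to do with one document when refreshing an existing index.

    Hypothetical helper mirroring the date checks in the example:
    'skip'   - not changed since the index was last updated
    'update' - existed at the last update and has changed since
    'add'    - created after the last update
    """
    if not (updated > lastupdate):
        return "skip"
    if created < lastupdate:
        return "update"
    return "add"


if __name__ == "__main__":
    last = datetime(2024, 1, 10, 12, 0)
    print(index_action(datetime(2024, 1, 1), datetime(2024, 1, 11), last))
    print(make_snippet("aaa bbb ccc", limit=6))
```

In the full example, the 'update' case corresponds to calling updateindex with isnew=False and the 'add' case to isnew=True; a 'skip' means the document is not passed to the index at all.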