Spike elasticlunr in docs site #575

Draft
wants to merge 18 commits into
base: master

flyinggrizzly
Contributor

@flyinggrizzly flyinggrizzly commented Feb 28, 2021

This PR is a spike at adding client-side search using ElasticLunr.

This PR message, and the code, are still WIP. I'll keep this up to date as I go.

At the moment it is still an experiment, manually adding in scripts in the
doc/ site.

Because cache.json does not have the note bodies/content, we can't build in
full-text search yet.

I'm not sure if that's a problem at this point, since it may make sense to move
the search index generation out of the browser and into Neuron itself, so that
an index can be requested by the browser instead of being constructed on the
fly.

ElasticLunr appears to have some support for jsonifying its indices, though I
haven't looked into the process of hydrating these yet. It looks pretty
straightforward though.
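
A minimal sketch of what the jsonify/hydrate round trip could look like, assuming elasticlunr's `toJSON()` on an index and `elasticlunr.Index.load()` for loading a dump. The helper names `serializeIndex` and `hydrateIndex` are hypothetical, and the library is passed in as an argument so the sketch makes no assumption about how it is loaded.

```javascript
// Sketch only: `serializeIndex` and `hydrateIndex` are hypothetical helper names.

// Turn a built index into a JSON string that could be written to a file
// at site-generation time.
function serializeIndex(index) {
  // JSON.stringify calls index.toJSON() for us.
  return JSON.stringify(index);
}

// Rebuild a searchable index in the browser from that JSON string.
// `elasticlunr` is the library object, however it was loaded.
function hydrateIndex(elasticlunr, json) {
  return elasticlunr.Index.load(JSON.parse(json));
}
```

If this holds up, the browser would only need to fetch the serialized index and call `hydrateIndex`, rather than rebuilding the index from cache.json on every page load.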

To do:

  • display results in UI instead of logging to console
  • test jsonifying/rehydrating an index to see how it would behave if we were
    to build the index server-side
  • investigate "pipelines" (stop-word removal, stemming)
  • investigate plugins (is this the same as above?)
  • determine good/optimum search configs (and suppress default config warning)
  • spike "advanced search" functionality, possibly with #fieldSearch
    • may want its own search.html page, so that might want to be part of
      displaying results before moving too far ahead
  • Handle pretty-url configuration in search
  • Add date (and other field?) search
  • Enable advanced search options (tags, dates, etc) based on relevant plugin settings in Neuron

Once those checks are done, I think things'll be in a position to make a better
decision about whether this makes sense for Neuron.

References #568

For #567


Next immediate steps

  • use saved cache from https://lost-frequencies.eu/impulse.html to make sure we're not hitting performance issues
  • remove 3-char limit in search function
  • blindly copy elasticlunr settings from mdBook and test
    • investigate each one and remove any unnecessary options
    • if this doesn't fix the insta search issue, keep investigating
  • replace Rails-style syntax with Obsidian-style syntax in URL parsing
  • move search box out to footer/header and remove search.md

@flyinggrizzly changed the title from "Set up elasticlunr in docs site" to "Spike elasticlunr in docs site" on Feb 28, 2021
@srid
Owner

srid commented Mar 1, 2021

> Because cache.json does not have the note bodies/content, we can't build in
> full-text search yet.

At least we can prototype a working version of metadata search (i.e., search by title & tags), right?

@flyinggrizzly
Contributor Author

> At least we can prototype a working version of metadata search (i.e., search by title & tags), right?

That's what I was thinking, yea.

@flyinggrizzly
Contributor Author

@srid I've just pushed up some commits with a basic UI (screenshot added to PR description). This may be all I have time for this weekend, but my plan is to hit the plugins/pipeline next.

@flyinggrizzly
Contributor Author

Just pushed up a few more commits that add handling of URL params for search. This is a nice-to-have, but also saves me the trouble of adding in a UI for advanced search before I fully understand it.

Currently, I'm using this to attempt searching only the tags field, with /search.html?search[tags]=["root"].

I've got the simple general search working, with /search.html?search=Install, which returns the same results as if you navigated in and entered "Install" into the input, which is 👍.

Field-scoped search is going to take some more research on my part though.
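
For reference, a rough sketch of how the two URL forms above could be parsed, using the standard `URLSearchParams` API. The function names are hypothetical and this is not the code in the PR. It accepts a JSON-style tag list (`["root"]`) and falls back to a bare bracketed list.

```javascript
// Hypothetical helper: parse the spike's Rails-style query string into
// { general, tags }.
function parseSearchParams(queryString) {
  const params = new URLSearchParams(queryString);
  const general = params.get("search") || "";
  const rawTags = params.get("search[tags]");
  return { general, tags: rawTags === null ? [] : parseTagList(rawTags) };
}

// Accept either a JSON array (["root"]) or a bare bracketed list ([tag1,tag2]).
function parseTagList(raw) {
  try {
    const parsed = JSON.parse(raw);
    if (Array.isArray(parsed)) return parsed.map(String);
  } catch (_) {
    // Not valid JSON; fall through to the lenient parser below.
  }
  return raw
    .replace(/^\[|\]$/g, "") // strip one leading "[" and one trailing "]"
    .split(",")
    .map((tag) => tag.trim())
    .filter((tag) => tag.length > 0);
}
```

In the browser this would be called as `parseSearchParams(window.location.search)`.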

@srid
Owner

srid commented Mar 6, 2021

Cheers, I'll take a peek.

We should test the search perf on large Zettelkastens like https://lost-frequencies.eu/impulse.html

@zettelzottel Is the source of your public zettelkasten available somewhere?

@ghost

ghost commented Mar 6, 2021

@srid no the source is not public. it is hosted on private gitea instance. you can sign up on the instance and i give you access to the repo. https://git.lost-frequencies.eu

@srid
Owner

srid commented Mar 6, 2021

Maybe use https://lost-frequencies.eu/cache.json in ./doc's zk just for testing ...

@srid
Owner

srid commented Mar 7, 2021

@flyinggrizzly Search fails for me with this warning in the console

![image](https://user-images.githubusercontent.com/3998/110226469-a2ca0780-7ebd-11eb-98f2-ccb36727fbdc.png)

Anything in particular I should do when running neuron on ./doc? I access the search HTML using: http://localhost:8080/search.html

EDIT: It works (as in it displays the results) when I go to a URL with query like http://localhost:8080/search.html?search=Install (the console warning above still appears), but not when I type a query and press Enter (nothing happens then).

EDIT 2: Oh n/m, looks like, in some cases, the search query doesn't do substring match. It requires exact words.

@flyinggrizzly
Contributor Author

flyinggrizzly commented Mar 7, 2021

Yea, that search config warning is still something to look at--I'll add it to the to-do list. Related, I've added in a very basic and probably problematic config for the fieldSearch call that also needs to be actually looked at--so far I've just added enough to get a successful call of the method.

Weird about the substring though--I was having decent luck with "instal" (1 "l"). What was your query? It might be useful when working out the search configs.

Also, hitting return shouldn't be required for getting the search to fire--if you type in "install" do you get results, or have I used some browser APIs that work in Firefox but not your browser? (I've been trying not to worry about that so much while spiking, and leaving until later making sure that the JS syntax and implementation is general enough to support more browsers... but if it's not working for you I'll move it up the list)

@srid
Owner

srid commented Mar 10, 2021

"instal" works, but "insta" (or "inst") returns nothing. I'd expect search to work even if you type only 3 characters of a word. I suppose this is a matter of configuring elasticlunr? And yea search works without having to hit enter.

I think it would be good to see tag-based search; we can then create permalinks to search page that lists all zettels tagged with a tag (and link to it from tag links). Something like /search.html?q=tag:foo

Also, what kind of index file can neuron generate that would help the frontend JS maximally (in regards to perf)? Assuming metadata only. I can actually implement this in master if we have a clear idea of the data format ...

@flyinggrizzly
Contributor Author

Just checking in--this weekend ended up being overwhelming so I didn't get the time I planned to work on this. But, before it goes too long, here's my thoughts/plans:

Search by tag

Tag search is already configured in the URL params handling, using search[tags]=[tag1,tag2], which should allow linking by tag, and also linking to custom searches of multiple tags. I can't remember if elasticlunr defaults to AND or OR conjunctions, but it can be set either way.

I went with a Rails-esque format for the params, but can very easily switch to ?tag:[foo]/?tag=[foo]/?tag:foo etc.

At the moment, only 1 tag is ever searched at a time, but I thought that building in the option of multiple would help in the future, and doesn't really cost anything now.

Do you have a query-string syntax preference? At the moment, cases we need to handle are:

  • general (keyword) search. Currently ?search=search%20terms
  • tag-search. Currently ?search[tags]=[tag1]

Also, based on the current config, we could also have title search, slug search, or date search (which is a bigger can of worms, but I suspect elasticlunr might have some decent date logic if it's based on Lucene syntax. But not sure). Ideally these would have similar syntax to tags, at least in the naming of the query parameter (so search[title] or search[date]). I wasn't planning on working on these for now, since tags and general search seem the priorities, but worth considering when we look at how we want the query string formed.

NOTE: while tag search is set up in the URL params handling, it's not working--I still need to get my head around fieldSearch

Minimum search string length

That does look like an elasticlunr thing--mdBook search fires on any number of keystrokes.

I had also added a minimum search string length of 3 which I'll remove too, just in case that's contributing somehow.

Server generated support

For now, I still think it'd be best to keep everything in the browser--the cache is supplying everything we need still.

At some point the search index may need to have behavior configured based on plugins that are active, but those are already listed in the cache so that's still OK.

@srid
Owner

srid commented Mar 15, 2021

  • For search query format, see "Support for "or" and "not" in tag queries" #358 (comment), so perhaps using a generic ?q="foo tag:someTag" (which allows complex queries like ?q="foo title:Foo (tag:tag1 OR tag:tag2)") would be good for the long term. To begin with, we don't necessarily have to support URL queries other than singular tag filtering (for use in tag links), i.e., ?q=tag:someTag.
  • Re: server generated cache ... okay, if cache.json is good enough, let's use that. We can optimize index performance (esp. for larger notebooks) later anyway, once the general search implementation is in place.
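
To make the generic `?q=` format concrete, here is a small sketch of splitting a query such as `foo tag:someTag` into free-text terms and field-scoped filters. `parseQuery` and `KNOWN_FIELDS` are hypothetical names, and the sketch deliberately ignores `OR` and parentheses, which would need a real parser.

```javascript
// Fields recognised as `field:value` filters. Anything else, including OR and
// parentheses, is kept as a plain search term in this sketch.
const KNOWN_FIELDS = ["tag", "title"];

function parseQuery(q) {
  const result = { terms: [], tag: [], title: [] };
  for (const token of q.trim().split(/\s+/)) {
    if (token === "") continue; // empty query
    const colon = token.indexOf(":");
    const field = colon > 0 ? token.slice(0, colon) : "";
    const value = token.slice(colon + 1);
    if (KNOWN_FIELDS.includes(field) && value !== "") {
      result[field].push(value);
    } else {
      result.terms.push(token);
    }
  }
  return result;
}
```

With this shape, `parseQuery("tag:someTag")` yields only a tag filter, which covers the singular tag-link case, while `parseQuery("foo tag:someTag")` keeps `foo` as a general term.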

I see that you have a search.md zettel that contains the HTML. Ultimately, neuron could generate search.html with this HTML/JS in it, but a better approach is to put the search box in the footer action bar in all zettel pages, and do the search right there, and display the results inline ... just like mdbook does currently. Then we don't need a dedicated search page. I imagine I (or somebody) would do this in PureScript (or TypeScript or whatever), using your search.html and search.js as reference implementation (which this PR would provide as a proof of concept).

@srid
Owner

srid commented Mar 15, 2021

> I imagine I (or somebody) would do this in PureScript (or TypeScript or whatever)

Actually you know what. Let's just do this in plain JavaScript, to begin with. I could modify neuron to use the current search.js in all generated pages.

@flyinggrizzly
Contributor Author

flyinggrizzly commented Mar 16, 2021

> Actually you know what. Let's just do this in plain JavaScript, to begin with. I could modify neuron to use the current search.js in all generated pages.

That makes sense to me. Do you have a minimum browser target in mind? I've been working under the assumption that some compilation might be used so I've been using plenty of ES6--it wouldn't be a problem to manually roll it down to ES5 compatible syntax etc, but I'll just need to check out what works.

Also, that query syntax works--I'll update it to use that.

For search.md, this was just a fast way of getting up and running--once search behavior is working then I was thinking I'd move it to being present on every page and dropping down over page content. We'll probably want a dedicated (static) search page for linking out to tag queries etc, but that's also pretty easy.

And for the cache stuff, my guess is that it'll be faster/more performant to have a server generated search index as mdBook does, but I don't think the search function is stable enough (or even known to actually work for Neuron 😅) to warrant putting more of your time into it yet.

@srid
Owner

srid commented Mar 16, 2021

More comprehensive reference for search query syntax: https://help.obsidian.md/Plugins/Search

Would be nice if neuron supported the same syntax as that of Obsidian. Assuming the current z:zettels?.. URI syntax is deprecated in favour of Obsidian's syntax (see #407 (comment)), then tag:foo would work for tag links.

This new syntax allows for more complex queries - but the frontend JS search (this PR) doesn't have to support all of them of course, even if neuron's new z-queries do. It is just something to keep in mind as a distant possibility.

This appears to be what is required to get the term "ins" to begin
returning results for "install" etc. ("in" is a stop word, so does not
trigger the search).

Searching for just "ins" returns more than just "install" results,
because it could expand to other words. It appears that it also expands
to "integration", which makes enough sense to me since that could be a
typo. "integration" drops out of the results once the query term is
expanded to "inst".

This also highlights a potentially better and simpler way to handle
tag-search--using the field boost/suppression in the search options to
limit tag search to only the tag field.
@flyinggrizzly
Contributor Author

flyinggrizzly commented Apr 5, 2021

"ins" as search term

This wasn't working because of a search settings issue--the searchOptions parameter for index searching takes an optional expand attribute, which when set to true causes the search terms to be expanded. The example is that "micro" expands to "microscope" and "microwave", which is the behavior that we want.
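
As a toy illustration of what expansion means for the "micro" example, the sketch below treats it as a prefix match over a known vocabulary. This is not elasticlunr's implementation, which may score expanded terms differently. `expandTerm` and the sample vocabulary are made up for the example.

```javascript
// Illustration only: mimic term expansion as a case-insensitive prefix match.
function expandTerm(term, vocabulary) {
  const prefix = term.toLowerCase();
  return vocabulary.filter((word) => word.toLowerCase().startsWith(prefix));
}

// Made-up vocabulary for the example.
const vocabulary = ["microscope", "microwave", "macro", "install"];

console.log(expandTerm("micro", vocabulary)); // microscope and microwave
console.log(expandTerm("ins", vocabulary)); // install
```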

Pipeline

On closer inspection, mdBook is using the default pipeline settings, and so are we--I'm making no changes until we determine we need to.

Tag-search

Fixing the term-expansion also made me realize that perhaps #fieldSearch is not the tool we want for tag-search--we might be better off with a general search and using the field boost attributes to suppress everything except the tags:

const searchOptions = {
  bool: "AND",
  expand: true,
  fields: {
    title: { boost: 0 }, // set to 1 on general search
    tags: { boost: 1 }, // also set to 1 on general search, but now the only field with a boost > 0
  }
}
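
A sketch of how the two configurations could be produced from one place, so general search and tag search only differ in the title boost. `buildSearchOptions` and its `mode` argument are hypothetical. Whether a boost of 0 drops title matches from the results entirely, rather than just scoring them at 0, is assumed here from the comment above and would need checking against elasticlunr.

```javascript
// Hypothetical helper: build the options object passed to index.search().
// mode is "tags" for tag-only search; anything else means general search.
function buildSearchOptions(mode) {
  const tagOnly = mode === "tags";
  return {
    bool: "AND",
    expand: true,
    fields: {
      // Assumption: a boost of 0 suppresses title matches in tag-only mode.
      title: { boost: tagOnly ? 0 : 1 },
      tags: { boost: 1 },
    },
  };
}
```

Usage would then be something like `index.search("root", buildSearchOptions("tags"))` for a tag link and `index.search("Install", buildSearchOptions("general"))` for the search box.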

I've still got some time in me today for this so I'll keep going (and editing this comment), but wanted to document progress today so far.

performance on large zettels

I've saved the Lost Frequencies cache in static as large-cache.json, and it's noticeably slower, but not in a way that I would find problematic as a user. The searcher is configured to use that cachefile at the moment (so links are broken), but it would be good to see how others feel about this performance.

search on every page

So I've got the search.js script inserting an additional content block at the top of every page now, that shows just a searchbox to start. On input, this content block expands to show all results.

For now, it's just pushing actual page content down (instead of expanding over it), though this could be adjusted.

For the moment, I've left in the search.md note, but made it entirely empty. This is acting in place of what Neuron would ideally provide--a search page that could handle things like tag linking, so that <z:zettels?tag=foo> (or the replacement syntax) would generate a link like <a href="/search.html?q=tag:foo">#foo</a>. This could also target the generated site root if we don't want to effectively prevent people from using search as a zettel ID.

Right now, in order to avoid eliminating search as a usable zettel ID for users, I think link queries should target the generated site's root: a query <z:zettels?tag=foo> (or the replacement syntax) would generate a link like <a href="/?q=tag:foo">#foo</a>. This is good for two reasons:

  1. we preserve search as a usable ID
  2. we don't need to worry about handling pretty URLs in search
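
A sketch of the link generation described above, written in JavaScript only for consistency with the rest of the spike. Neuron would presumably do this in its own generator. `tagLink` and `escapeHtml` are hypothetical names. The tag is percent-encoded so that hierarchical tags containing `/` stay inside the query value.

```javascript
// Escape text for safe use inside HTML element content or attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render a tag link that targets the site root, e.g. /?q=tag:foo.
function tagLink(tag) {
  const href = "/?q=tag:" + encodeURIComponent(tag);
  return '<a href="' + escapeHtml(href) + '">#' + escapeHtml(tag) + "</a>";
}
```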

If we do target the site root, it would probably be best to move the searchbox and results into a modal over the page content--pushing actual page content down will be potentially confusing I think.

It would be good to have a better idea of what the final implementation will look like, roughly, before embarking on that though--I've been avoiding external libraries like the plague, but using the DOM APIs for HTML manipulation is definitely less ergonomic than say, JSX. (If you look at the creation of the search bar you'll see what I mean).

I think it's worth it to avoid external libraries, but because it takes more care to build up, I'd like to know that I'm heading in the right direction before I put some more work into it.

Advanced search

If Neuron were to generate a page with UI for constructing advanced search queries, that could also double as a dedicated search results listing page for linking.

Right now, I'm planning on adding date search, and thinking about looking into scoping subqueries to certain fields (so something like title:link tag:root/Walkthrough date:2021-01-01)
