Spike elasticlunr in docs site #575

flyinggrizzly · 2021-02-28T00:02:20Z

This PR is a spike at adding client-side search using
ElasticLunr.

This PR message, and the code, are still WIP. I'll keep this up to date as I go.

At the moment it is still an experiment, manually adding in scripts in the
doc/ site.

Because cache.json does not have the note bodies/content, we can't build in
full-text search yet.
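
A metadata-only index could still be fed from the cache. The sketch below flattens cache entries into documents with `id`, `title`, and `tags` fields. The cache shape used here (`vertices` keyed by zettel ID, each with `Title` and `Tags`) is an assumption for illustration, not the actual cache.json schema, and `cacheToDocuments` is a hypothetical helper name.

```javascript
// Hypothetical cache shape: { vertices: { "<id>": { Title: "...", Tags: ["..."] } } }.
// The real cache.json layout may differ; adjust the property names accordingly.
function cacheToDocuments(cache) {
  const vertices = (cache && cache.vertices) || {};
  return Object.keys(vertices).map((id) => {
    const zettel = vertices[id] || {};
    return {
      id,
      title: zettel.Title || "",
      // Join tags into one string so a text index can tokenize them.
      tags: (zettel.Tags || []).join(" "),
    };
  });
}

// Made-up sample data, only to show the resulting document shape.
const sampleCache = {
  vertices: {
    install: { Title: "Installing Neuron", Tags: ["root", "walkthrough"] },
    search: { Title: "Search" },
  },
};
const docs = cacheToDocuments(sampleCache);
```

Each resulting document could then be handed to the index (elasticlunr's `addDoc`), with `id` as the ref and `title`/`tags` as fields.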

I'm not sure if that's a problem at this point, since it may make sense to move
the search index generation out of the browser and into Neuron itself, so that
an index can be requested by the browser instead of being constructed on the
fly.

ElasticLunr appears to have some support for jsonifying its indices, though I
haven't looked into the process of hydrating these yet. It looks pretty
straightforward though.
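
A minimal sketch of the dump/rehydrate round trip, assuming elasticlunr's documented `Index.load` and the index's `toJSON` support. `dumpIndex` and `restoreIndex` are hypothetical helper names, and the library is passed in as an argument so the sketch does not assume how it is loaded. This has not been tried against Neuron's data.

```javascript
// Serialize an index to a string. JSON.stringify calls the index's
// toJSON() method if it defines one.
function dumpIndex(index) {
  return JSON.stringify(index);
}

// Rebuild an index from a serialized dump. `elasticlunrLib` is the
// elasticlunr object (from a script tag, bundler, etc.).
function restoreIndex(elasticlunrLib, serialized) {
  return elasticlunrLib.Index.load(JSON.parse(serialized));
}
```

If the index were built server-side, the browser would only need to fetch the dump and call something like `restoreIndex(elasticlunr, text)` instead of indexing every zettel on page load.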

To do:

- display results in UI instead of logging to console
- test jsonifying/rehydrating an index to see how it would behave if we were to build the index server-side
- investigate "pipelines" (stop-word removal, stemming)
- investigate plugins (is this the same as above?)
- ~~determine good/optimum search configs (and suppress default config warning)~~
- spike "advanced search" functionality, possibly with #fieldSearch
  - may want its own search.html page, so that might want to be part of displaying results before moving too far ahead
- handle pretty-url configuration in search
- add date (and other field?) search
- enable advanced search options (tags, dates, etc.) based on relevant plugin settings in Neuron

Once those checks are done, I think things'll be in a position to make a better
decision about whether this makes sense for Neuron.

References #568

For #567

Next immediate steps

- use saved cache from https://lost-frequencies.eu/impulse.html to make sure we're not hitting performance issues
- remove 3-char limit in search function
- blindly copy elasticlunr settings from mdBook and test
  - investigate each one and remove any unnecessary options
  - if this doesn't fix the insta search issue, keep investigating
- replace Rails-style syntax with Obsidian-style syntax in URL parsing
- move search box out to footer/header and remove search.md

srid · 2021-03-01T00:58:58Z

> Because cache.json does not have the note bodies/content, we can't build in
> full-text search yet.

At least we can prototype a working version of metadata search (i.e., search by title & tags), right?

flyinggrizzly · 2021-03-01T09:50:32Z

> At least we can prototype a working version of metadata search (i.e., search by title & tags), right?

That's what I was thinking, yea.

flyinggrizzly · 2021-03-06T15:51:24Z

@srid I've just pushed up some commits with a basic UI (screenshot added to PR description). This may be all I have time for this weekend, but my plan is to hit the plugins/pipeline next.

Use JSON format for tags in URL search params

flyinggrizzly · 2021-03-06T19:16:14Z

Just pushed up a few more commits that add handling of URL params for search. This is a nice-to-have, but also saves me the trouble of adding in a UI for advanced search before I fully understand it.

Currently, I'm using this to attempt searching only the tags field, with /search.html?search[tags]=["root"].

I've got the simple general search working, with /search.html?search=Install, which returns the same results as if you navigated in and entered "Install" into the input, which is 👍.

Field-scoped search is going to take some more research on my part though.
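
For reference, a minimal sketch of how the two URL forms above could be read with the standard `URLSearchParams` API. `parseSearchParams` and the shape of its return value are hypothetical and not taken from the PR's search.js.

```javascript
// Parses the Rails-style params described above, e.g.
//   ?search=Install            -> general keyword search
//   ?search[tags]=["root"]     -> tag-scoped search (JSON array of tags)
function parseSearchParams(queryString) {
  const params = new URLSearchParams(queryString);
  const result = { general: null, tags: [] };

  if (params.has("search")) {
    result.general = params.get("search");
  }

  const rawTags = params.get("search[tags]");
  if (rawTags !== null) {
    try {
      const parsed = JSON.parse(rawTags);
      if (Array.isArray(parsed)) {
        result.tags = parsed.map(String);
      }
    } catch (err) {
      // Malformed JSON: treat as "no tag filter" rather than throwing.
    }
  }
  return result;
}
```

In the browser this would be called as `parseSearchParams(window.location.search)`.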

srid · 2021-03-06T19:36:50Z

Cheers, I'll take a peek.

We should test the search perf on large Zettelkastens like https://lost-frequencies.eu/impulse.html

@zettelzottel Is the source of your public zettelkasten available somewhere?

ghost · 2021-03-06T20:25:03Z

@srid no the source is not public. it is hosted on private gitea instance. you can sign up on the instance and i give you access to the repo. https://git.lost-frequencies.eu

srid · 2021-03-06T21:00:01Z

Maybe use https://lost-frequencies.eu/cache.json in ./doc's zk just for testing ...

srid · 2021-03-07T01:51:12Z

@flyinggrizzly ~~Search fails for me~~ with this warning in the console

![image](https://user-images.githubusercontent.com/3998/110226469-a2ca0780-7ebd-11eb-98f2-ccb36727fbdc.png)

Anything in particular I should do when running neuron on ./doc? I access the search HTML using: http://localhost:8080/search.html

EDIT: It works (as in it displays the results) when I go to a URL with query like http://localhost:8080/search.html?search=Install (the console warning above still appears), but not when I type a query and press Enter (nothing happens then).

EDIT 2: Oh n/m, looks like, in some cases, the search query doesn't do substring match. It requires exact words.

flyinggrizzly · 2021-03-07T21:00:44Z

Yea, that search config warning is still something to look at--I'll add it to the to-do list. Related, I've added in a very basic and probably problematic config for the fieldSearch call that also needs to be actually looked at--so far I've just added enough to get a successful call of the method.

Weird about the substring though--I was having decent luck with "instal" (1 "l"). What was your query? It might be useful when working out the search configs.

Also, hitting return shouldn't be required for getting the search to fire--if you type in "install" do you get results, or have I used some browser APIs that work in Firefox but not your browser? (I've been trying not to worry about that so much while spiking, and leaving until later making sure that the JS syntax and implementation is general enough to support more browsers... but if it's not working for you I'll move it up the list)

srid · 2021-03-10T02:53:03Z

"instal" works, but "insta" (or "inst") returns nothing. I'd expect search to work even if you type only 3 characters of a word. I suppose this is a matter of configuring elasticlunr? And yea search works without having to hit enter.

I think it would be good to see tag-based search; we can then create permalinks to search page that lists all zettels tagged with a tag (and link to it from tag links). Something like /search.html?q=tag:foo

Also, what kind of index file can neuron generate that would help the frontend JS maximally (in regards to perf)? Assuming metadata only. I can actually implement this in master if we have a clear idea of the data format ...

flyinggrizzly · 2021-03-14T18:34:03Z

Just checking in--this weekend ended up being overwhelming so I didn't get the time I planned to work on this. But, before it goes too long, here's my thoughts/plans:

Search by tag

Tag search is already configured in the URL params handling, using search[tags]=[tag1,tag2], which should allow linking by tag, and also linking to custom searches of multiple tags. I can't remember if elasticlunr defaults to AND or OR conjunctions, but it can be set either way.

I went with a Rails-esque format for the params, but can very easily switch to ?tag:[foo]/?tag=[foo]/?tag:foo etc.

At the moment, only 1 tag is ever searched at a time, but I thought that building in the option of multiple would help in the future, and doesn't really cost anything now.

Do you have a query-string syntax preference? At the moment, cases we need to handle are:

- general (keyword) search. Currently ?search=search%20terms
- tag-search. Currently ?search[tags]=[tag1]

Also, based on the current config, we could also have title search, slug search, or date search (which is a bigger can of worms, but I suspect elasticlunr might have some decent date logic if it's based on Lucene syntax. But not sure). Ideally these would have similar syntax to tags, at least in the naming of the query parameter (so search[title] or search[date]). I wasn't planning on working on these for now, since tags and general search seem the priorities, but worth considering when we look at how we want the query string formed.

NOTE: while tag search is set up in the URL params handling, it's not working--I still need to get my head around fieldSearch

Minimum search string length

That does look like an elasticlunr thing--mdBook search fires on any number of keystrokes.

I had also added a minimum search string length of 3 which I'll remove too, just in case that's contributing somehow.

Server generated support

For now, I still think it'd be best to keep everything in the browser--the cache is supplying everything we need still.

At some point the search index may need to have behavior configured based on plugins that are active, but those are already listed in the cache so that's still OK.

srid · 2021-03-15T00:26:32Z

For search query format, see Support for "or" and "not" in tag queries #358 (comment). Perhaps using a generic ?q="foo tag:someTag" (which allows complex queries like ?q="foo title:Foo (tag:tag1 OR tag:tag2)") would be good for the long term. To begin with, we don't necessarily have to support URL queries other than a single tag filter (for use in tag links), i.e., ?q=tag:someTag.
Re: server generated cache ... okay, if cache.json is good enough, let's use that. We can optimize index performance (esp. for larger notebooks) later anyway, once the general search implementation is in place.
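
A rough sketch of parsing the simple end of that syntax: free-text terms plus `field:value` filters. Grouping and boolean operators (parentheses, `OR`, `NOT`) are deliberately left out, the `q` value is assumed to be unquoted, and `KNOWN_FIELDS`, `parseQuery`, and `parseQueryFromUrl` are hypothetical names.

```javascript
// Field prefixes this sketch recognizes (an assumption, not a fixed list).
const KNOWN_FIELDS = ["tag", "title", "date"];

// Splits a query such as `foo title:Foo tag:tag1` into free-text terms and
// per-field filter values.
function parseQuery(q) {
  const parsed = { terms: [], fields: {} };
  const tokens = q.trim().split(/\s+/).filter(Boolean);
  for (const token of tokens) {
    const sep = token.indexOf(":");
    const field = sep > 0 ? token.slice(0, sep) : null;
    const value = sep > 0 ? token.slice(sep + 1) : "";
    if (field && KNOWN_FIELDS.includes(field) && value) {
      (parsed.fields[field] = parsed.fields[field] || []).push(value);
    } else {
      // Unknown prefixes and plain words are kept as free-text terms.
      parsed.terms.push(token);
    }
  }
  return parsed;
}

// Reads the `q` parameter from a URL search string such as "?q=tag:someTag".
function parseQueryFromUrl(search) {
  const q = new URLSearchParams(search).get("q") || "";
  return parseQuery(q);
}
```

A tag link such as `?q=tag:someTag` then yields no free-text terms and a single `tag` filter, which is all the first iteration would need to support.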

I see that you have a search.md zettel that contains the HTML. Ultimately, neuron could generate search.html with this HTML/JS in it, but a better approach is to put the search box in the footer action bar in all zettel pages, and do the search right there, and display the results inline ... just like mdbook does currently. Then we don't need a dedicated search page. I imagine I (or somebody) would do this in PureScript (or TypeScript or whatever), using your search.html and search.js as reference implementation (which this PR would provide as a proof of concept).

srid · 2021-03-15T19:40:44Z

> I imagine I (or somebody) would do this in PureScript (or TypeScript or whatever)

Actually you know what. Let's just do this in plain JavaScript, to begin with. I could modify neuron to use the current search.js in all generated pages.

flyinggrizzly · 2021-03-16T11:20:07Z

> Actually you know what. Let's just do this in plain JavaScript, to begin with. I could modify neuron to use the current search.js in all generated pages.

That makes sense to me. Do you have a minimum browser target in mind? I've been working under the assumption that some compilation might be used so I've been using plenty of ES6--it wouldn't be a problem to manually roll it down to ES5 compatible syntax etc, but I'll just need to check out what works.

Also, that query syntax works--I'll update it to use that.

For search.md, this was just a fast way of getting up and running--once search behavior is working then I was thinking I'd move it to being present on every page and dropping down over page content. We'll probably want a dedicated (static) search page for linking out to tag queries etc, but that's also pretty easy.

And for the cache stuff, my guess is that it'll be faster/more performant to have a server generated search index as mdBook does, but I don't think the search function is stable enough (or even known to actually work for Neuron 😅) to warrant putting more of your time into it yet.

srid · 2021-03-16T17:14:41Z

More comprehensive reference for search query syntax: https://help.obsidian.md/Plugins/Search

Would be nice if neuron supported the same syntax as that of Obsidian. Assuming the current z:zettels?.. URI syntax is deprecated in favour of Obsidian's syntax (see #407 (comment)), then tag:foo would work for tag links.

This new syntax allows for more complex queries - but the frontend JS search (this PR) doesn't have to support all of them of course, even if neuron's new z-queries do. It is just something to keep in mind as a distant possibility.

This appears to be what is required to get the term "ins" to begin returning results for "install" etc. ("in" is a stop word, so does not trigger the search). Searching for just "ins" returns more than just "install" results, because it could expand to other words. It appears that it also expands to "integration", which makes enough sense to me since that could be a typo. "integration" drops out of the results once the query term is expanded to "inst". This also highlights a potentially better and simpler way to handle tag-search--using the field boost/suppression in the search options to limit tag search to only the tag field.

flyinggrizzly · 2021-04-05T12:48:55Z

"ins" as search term

This wasn't working because of a search settings issue--the searchOptions parameter for index searching takes an optional expand attribute, which when set to true causes the search terms to be expanded. The example is that "micro" expands to "microscope" and "microwave", which is the behavior that we want.

Pipeline

On closer inspection, mdBook is using the default pipeline settings, and so are we--I'm making no changes until we determine we need to.

Tag-search

Fixing the term-expansion also made me realize that perhaps #fieldSearch is not the tool we want for tag-search--we might be better off with a general search and using the field boost attributes to suppress everything except the tags:

const searchOptions = {
  bool: "AND",
  expand: true,
  fields: {
    title: { boost: 0 }, // set to 1 on general search
    tags: { boost: 1 }, // also set to 1 on general search, but now the only field with a boost > 0
  }
}
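
One way to keep the general and tag-only configurations in sync is a small factory like the sketch below. The `mode` argument and the name `buildSearchOptions` are hypothetical; the option keys mirror the snippet above. It would also be worth checking whether a zero boost actually removes title-only matches from the results or just scores them at 0.

```javascript
// Returns elasticlunr-style search options for either a general search or a
// tag-only search.
function buildSearchOptions(mode) {
  const tagOnly = mode === "tag";
  return {
    bool: "AND",
    expand: true, // lets partial terms such as "ins" match longer tokens
    fields: {
      title: { boost: tagOnly ? 0 : 1 },
      tags: { boost: 1 },
    },
  };
}
```

Usage would look like `index.search("root", buildSearchOptions("tag"))` for a tag link and `index.search("install", buildSearchOptions("general"))` for the search box.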

I've still got some time in me today for this so I'll keep going (and editing this comment), but wanted to document progress today so far.

performance on large zettels

I've saved the Lost Frequencies cache in static as large-cache.json, and it's noticeably slower, but not in a way that I would find problematic as a user. The searcher is configured to use that cachefile at the moment (so links are broken), but it would be good to see how others feel about this performance.
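
To put a number on "noticeably slower", a timing helper along these lines could be run against both caches. This is only a sketch: `medianSearchTime` is a hypothetical name, and `searchFn` stands in for whatever wraps the index search call.

```javascript
// Prefer the high-resolution timer where available, else fall back to Date.now().
const now = typeof performance !== "undefined"
  ? () => performance.now()
  : () => Date.now();

// Runs `searchFn(query)` several times and returns the median duration in ms.
function medianSearchTime(searchFn, query, runs = 20) {
  const durations = [];
  for (let i = 0; i < runs; i++) {
    const start = now();
    searchFn(query);
    durations.push(now() - start);
  }
  durations.sort((a, b) => a - b);
  const mid = Math.floor(durations.length / 2);
  return durations.length % 2
    ? durations[mid]
    : (durations[mid - 1] + durations[mid]) / 2;
}
```

Comparing the median for the docs cache against the median for large-cache.json would give a rough per-keystroke cost for each.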

search on every page

So I've got the search.js script inserting an additional content block at the top of every page now, that shows just a searchbox to start. On input, this content block expands to show all results.

For now, it's just pushing actual page content down (instead of expanding over it), though this could be adjusted.

For the moment, I've left in the search.md note, but made it entirely empty. This is acting in place of what Neuron would ideally provide--a search page that could handle things like tag linking (so that <z:zettels?tag=foo> (or the replacement syntax) would generate a link like <a href="/search.html?q=tag:foo">#foo</a>). This could also target the generated site root if we don't want to effectively prevent people from using search as a zettel ID.

Right now, in order to avoid eliminating search as a usable zettel ID for users, I think link queries should target the generated site's root: a query <z:zettels?tag=foo> (or the replacement syntax) would generate a link like <a href="/?q=tag:foo">#foo</a>. This is good for two reasons:

- we preserve search as a usable ID
- we don't need to worry about handling pretty URLs in search
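
To illustrate the target format only (Neuron itself would generate these links), here is a small sketch that builds a root-targeted tag link. `tagSearchHref`, `escapeHtml`, and `tagLinkHtml` are hypothetical helpers, not part of Neuron or this PR.

```javascript
// Builds the root-relative search URL for a tag.
function tagSearchHref(tag) {
  // Keep the `tag:` prefix readable; only the tag itself is percent-encoded.
  return "/?q=tag:" + encodeURIComponent(tag);
}

// Minimal HTML escaping for text and double-quoted attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Renders the anchor element for a tag link.
function tagLinkHtml(tag) {
  return '<a href="' + escapeHtml(tagSearchHref(tag)) + '">#' + escapeHtml(tag) + "</a>";
}
```

Because the link points at the site root, nothing here depends on whether pretty URLs are enabled.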

If we do target the site root, it would probably be best to move the searchbox and results into a modal over the page content--pushing actual page content down will be potentially confusing I think.

It would be good to have a better idea of what the final implementation will look like, roughly, before embarking on that though--I've been avoiding external libraries like the plague, but using the DOM APIs for HTML manipulation is definitely less ergonomic than say, JSX. (If you look at the creation of the search bar you'll see what I mean).

I think it's worth it to avoid external libraries, but because it takes more care to build up, I'd like to know that I'm heading in the right direction before I put some more work into it.

Advanced search

If Neuron were to generate a page with UI for constructing advanced search queries, that could also double as a dedicated search results listing page for linking.

Right now, I'm planning on adding date search, and thinking about looking into scoping subqueries to certain fields (so something like title:link tag:root/Walkthrough date:2021-01-01)


flyinggrizzly added 2 commits February 27, 2021 23:50

Set up elasticlunr in docs site

7643441

Initial spike at search script

b8fbd15

flyinggrizzly changed the title ~~Set up elasticlunr in docs site~~ Spike elasticlunr in docs site Feb 28, 2021

flyinggrizzly added 4 commits March 6, 2021 14:39

Simplify extraction of cache values

3490da1

Render search results to page

7aba6f1

Add Simple search page

0f82eb8

Improve styling on search results

bb36250

flyinggrizzly added 6 commits March 6, 2021 19:13

Add links to tags in search results

0dc0097

Use JSON format for tags in URL search params

Uncouple handleSearch interface from events

ab7476e

Add funtions for handling url param search values

15c07f1

Handle URL search params

a1e6099

Drop console.log call

39be6ca

Attempt fieldSearch

cb11352

flyinggrizzly added 2 commits April 5, 2021 13:27

Remove 3 char min in search

e8682b8

flyinggrizzly added 3 commits April 5, 2021 14:10

Set up tag-search

9dc74c8

- using tag:foo syntax

DROP Temporarily test search performance on large zettelkasten

4bec148

Move search to a topnav element

7605003

flyinggrizzly force-pushed the elastic-lunr-search-spike branch from c22d577 to 7605003 Compare April 5, 2021 13:54

Drop search.md

d38990f
