Regex search in trees!
h
tangert (49)

hi everyone - this is a work in progress, but here's a library im making for easily searching for regex patterns in tree data structures. I'm building this to understand how you can search for data in the DOM (the tree data structure for web pages). right now it's just based on a "mock" DOM where the tree looks something like this:

tree = [
    { text: "hi replit!",
      children: [
      { text: "im posting on talk" },
      { text: "and learning about trees" },
    ]
]

basically, it's a list of objects that either contain text or a list of children. the main challenge of building this was figuring out how to search for text/regex patterns that go across several nodes in the tree. for example, if you you had:

text that looked like this

my program would still be able to go through all of the nested bold / italic elements and return back the search matches.

If you go and inspect the DOM, the actual HTML that represents the wacky text above looks like this:

<strong>te</strong>xt that <em>lo<strong>o</strong></em> ked like <em>th</em>is

When you break it down into it's tree, it would look like:

<strong>te</strong>
"xt that"
<em>lo
    <strong>o</strong>
 </em>
"ked like"
<em>th</em>
"is"

And represented as JSON it would look something like:

const tree = [
    { text: "te", tag: "strong" },
    { text: "xt that" },
    { text: "lo", 
      tag: "em" 
      children: [
        { text: "o", tag: "strong" } 
      ] }, 
     { text: "ked like" },
     { text: "th", tag: "em" },
     { text: "is" }
]

the basic algorithm is actually just three steps:

1) walk through the tree and make all of its text into one string.
2) go through that string and record all of the character locations (start / end) of regex matches.
3) go back through the tree and keep a count of which character you're currently on. if you get to a location with a recorded match, congrats! you found a node in the tree with you're text.

  • if the length of the regex match is longer than the length of the text in the node, that means that your match crosses node boundaries.
  • record the starting point of the next portion of your match, which is going to be the length of the last portion you found. so you step through your match portion by portion across nodes.

i couldn't have done this without the guidance of this tutorial which goes over the technique! it was originally used for finding AND replacing elements in the DOM, but i just used it as a basis for finding.

the next step is going to be parsing the DOM from an actual HTML file, extracting the content, turning it into a JSON-like data structure, and adjusting the algorithm to fit that.

after that, I'm planning on adding support to search for NLP data (people, places, things, parts of speech, etc) in the same way, and will create a visualization of it.

next step after that is making a chrome extension that lets you search for NLP data on any web page you're on 🎉

the code isn't as commented / clean as i'd like but it's a start! lmk if you have questions

You are viewing a single comment. View All
tangert (49)

Here's a forked version doing search on the DOM of a hacker news html file: https://repl.it/@tangert/tree-text-search-himalaya