New Module: Apache Tika & `RAW_DATA` events #1434

domwhewell-sage · 2024-06-03T11:34:41Z

Description
As discussed in #717 and #907, it is probably a good time to start investigating Apache Tika and creating RAW_DATA events.

Apache Tika is a toolkit that extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). The idea would be a module that consumes FILESYSTEM events, sends the document to Apache Tika, and then produces a RAW_DATA event containing a big blob of text that excavate can consume to pull out URLs and the like.
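To make the last step concrete, here is a minimal sketch of pulling URLs out of a text blob like the one a RAW_DATA event would carry. The regex and the `find_urls` name are assumptions made for illustration; excavate has its own extraction logic, which this does not reproduce.

```python
import re

# Hypothetical pattern for illustration only; not the pattern excavate uses.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")


def find_urls(raw_text: str) -> list[str]:
    """Return unique URLs found in a blob of extracted text, in order of appearance."""
    seen = []
    for match in URL_RE.findall(raw_text):
        # Drop punctuation that commonly trails a URL at the end of a clause.
        url = match.rstrip(".,;")
        if url not in seen:
            seen.append(url)
    return seen


blob = "Report by ACME. See https://example.com/docs, or http://intranet.example.com/login."
print(find_urls(blob))
```

Text extracted from documents tends to carry sentence punctuation right after a link, which is why the sketch trims trailing `.`, `,` and `;` before de-duplicating.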

I've only ever seen Apache Tika run as an API endpoint that a file can be uploaded to, but I will do some more research into how it's used and then start work on a module for it.

domwhewell-sage · 2024-06-03T11:58:17Z

Something like...

import requests

async def setup(self):
    # Assumes Docker is installed; a container still has to be started from this image.
    await self.run_process("systemctl", "start", "docker", sudo=True)
    await self.run_process("docker", "pull", "apache/tika:latest", sudo=True)
    self.tika_url = "http://127.0.0.1:8889"
    return True

def extract_text(self, file_path):
    """
    Extracts plaintext from a document path using Tika.

    :param file_path: The path of the file to extract text from.
    :return: ASCII-encoded plaintext extracted from the document, or None on failure.
    """
    with open(file_path, "rb") as f:
        # Tika server extracts text on its /tika endpoint
        resp = requests.put(f"{self.tika_url}/tika", data=f, headers={"Accept": "text/plain"})
    if resp.status_code == 200:
        return resp.text.strip().encode("ascii", "ignore").decode()
    return None

domwhewell-sage · 2024-06-03T13:46:43Z

Or, instead of using Docker, there is a pip package: https://pypi.org/project/tika/

from tika import parser

def extract_text(file_path):
    """
    Extracts plaintext from a document path using Tika.

    :param file_path: The path of the file to extract text from.
    :return: Plaintext extracted from the document.
    """
    # from_file returns a dict with "metadata" and "content" keys
    parsed = parser.from_file(file_path)
    return parsed["content"]

It appears this is a wrapper around the (Java) REST server, so this method would probably involve managing a Java install 🤢

To use this library, you need to have Java 7+ installed on your system as tika-python starts up the Tika REST server in the background.

TheTechromancer · 2024-06-03T14:31:16Z

you need to have Java 7+ installed on your system

TheTechromancer · 2024-06-03T14:41:17Z

A possible alternative: https://github.com/Unstructured-IO/unstructured.

Probably would also need to run in a container, but still worth testing.

domwhewell-sage · 2024-06-03T18:22:20Z

Whilst I investigate unstructured

https://github.com/domwhewell-sage/bbot/blob/fileparser/bbot/modules/fileparser.py

domwhewell-sage · 2024-06-04T11:45:59Z

unstructured

Install is either via a container or via a pip package plus a few apt dependencies. The pip install failed for me a few times with the error `error: command 'x86_64-linux-gnu-gcc' failed with exit status 1`, so I opted to use the container instead. I'm not sure how you would add a file to the container without a REST endpoint available, but these are the results I got with a PDF generated by Python reportlab inside the container:

>>> from unstructured.partition.auto import partition
>>> elements = partition(filename="simple_pdf.pdf")
>>> for element in elements:
...  print(element)
...
Hello, I am a PDF document created with Python!
>>> for element in elements:
...  print(element.metadata.to_dict())
...
{'coordinates': {'points': ((100.0, 82.37380000000007), (100.0, 94.37380000000007), (362.77600000000007, 94.37380000000007), (362.77600000000007, 82.37380000000007)), 'system': 'PixelSpace', 'layout_width': 595.2756, 'layout_height': 841.8898}, 'filename': 'simple_pdf.pdf', 'languages': ['eng'], 'last_modified': '2024-06-04T11:16:09', 'page_number': 1, 'filetype': 'application/pdf'}
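For a RAW_DATA-style blob, the per-element output above would need to be flattened into one string. A rough sketch, assuming only what the session above shows (that `partition(filename=...)` returns elements whose string form is their text); the function names are hypothetical:

```python
def elements_to_text(elements) -> str:
    """Join the string form of each element into one newline-separated blob."""
    lines = (str(element).strip() for element in elements)
    return "\n".join(line for line in lines if line)


def extract_text(file_path: str) -> str:
    """Partition a document with unstructured and return its text as one blob."""
    # Imported lazily so elements_to_text stays usable without unstructured installed.
    from unstructured.partition.auto import partition

    return elements_to_text(partition(filename=file_path))
```

Keeping the join separate from the `partition` call means the flattening logic can be exercised with plain strings, without the heavier dependency.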

Apache Tika

Install is either via a container or a Java file. Managing a Java install could be painful, so I opted for the container. The container exposes a REST endpoint where you can upload files, and the response is returned in JSON format. Below is the response I got back from a similar PDF generated using reportlab.

{
        "pdf:unmappedUnicodeCharsPerPage": "0",
        "pdf:PDFVersion": "1.3",
        "pdf:docinfo:title": "untitled",
        "xmp:CreatorTool": "ReportLab PDF Library - www.reportlab.com",
        "pdf:hasXFA": "false",
        "access_permission:modify_annotations": "true",
        "access_permission:can_print_degraded": "true",
        "X-TIKA:Parsed-By-Full-Set": ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"],
        "dc:creator": "anonymous",
        "pdf:num3DAnnotations": "0",
        "dcterms:created": "2024-06-03T18:58:16Z",
        "dcterms:modified": "2024-06-03T18:58:16Z",
        "dc:format": "application/pdf; version=1.3",
        "pdf:docinfo:creator_tool": "ReportLab PDF Library - www.reportlab.com",
        "pdf:overallPercentageUnmappedUnicodeChars": "0.0",
        "access_permission:fill_in_form": "true",
        "pdf:docinfo:modified": "2024-06-03T18:58:16Z",
        "pdf:hasCollection": "false",
        "pdf:encrypted": "false",
        "dc:title": "untitled",
        "pdf:containsNonEmbeddedFont": "true",
        "Content-Length": "1426",
        "pdf:docinfo:subject": "unspecified",
        "pdf:hasMarkedContent": "false",
        "Content-Type": "application/pdf",
        "pdf:docinfo:creator": "anonymous",
        "pdf:producer": "ReportLab PDF Library - www.reportlab.com",
        "dc:subject": "unspecified",
        "pdf:totalUnmappedUnicodeChars": "0",
        "access_permission:extract_for_accessibility": "true",
        "access_permission:assemble_document": "true",
        "xmpTPg:NPages": "1",
        "pdf:hasXMP": "false",
        "pdf:charsPerPage": "13",
        "access_permission:extract_content": "true",
        "access_permission:can_print": "true",
        "pdf:docinfo:trapped": "False",
        "X-TIKA:Parsed-By": ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"],
        "X-TIKA:content": '<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n<meta name="pdf:PDFVersion" content="1.3" />\n<meta name="pdf:docinfo:title" content="untitled" />\n<meta name="xmp:CreatorTool" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="pdf:hasXFA" content="false" />\n<meta name="access_permission:modify_annotations" content="true" />\n<meta name="access_permission:can_print_degraded" content="true" />\n<meta name="dc:creator" content="anonymous" />\n<meta name="dcterms:created" content="2024-06-03T18:58:16Z" />\n<meta name="dcterms:modified" content="2024-06-03T18:58:16Z" />\n<meta name="dc:format" content="application/pdf; version=1.3" />\n<meta name="pdf:docinfo:creator_tool" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="access_permission:fill_in_form" content="true" />\n<meta name="pdf:docinfo:modified" content="2024-06-03T18:58:16Z" />\n<meta name="pdf:hasCollection" content="false" />\n<meta name="pdf:encrypted" content="false" />\n<meta name="dc:title" content="untitled" />\n<meta name="Content-Length" content="1426" />\n<meta name="pdf:docinfo:subject" content="unspecified" />\n<meta name="pdf:hasMarkedContent" content="false" />\n<meta name="Content-Type" content="application/pdf" />\n<meta name="pdf:docinfo:creator" content="anonymous" />\n<meta name="pdf:producer" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="dc:subject" content="unspecified" />\n<meta name="access_permission:extract_for_accessibility" content="true" />\n<meta name="access_permission:assemble_document" content="true" />\n<meta name="xmpTPg:NPages" content="1" />\n<meta name="pdf:hasXMP" content="false" />\n<meta name="access_permission:extract_content" content="true" />\n<meta name="access_permission:can_print" content="true" />\n<meta name="pdf:docinfo:trapped" content="False" />\n<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />\n<meta name="X-TIKA:Parsed-By" 
content="org.apache.tika.parser.pdf.PDFParser" />\n<meta name="access_permission:can_modify" content="true" />\n<meta name="pdf:docinfo:producer" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="pdf:docinfo:created" content="2024-06-03T18:58:16Z" />\n<title>untitled</title>\n</head>\n<body><div class="page"><p />\n<p>Hello, World!</p>\n<p />\n</div>\n</body></html>',
        "access_permission:can_modify": "true",
        "pdf:docinfo:producer": "ReportLab PDF Library - www.reportlab.com",
        "pdf:docinfo:created": "2024-06-03T18:58:16Z",
        "pdf:containsDamagedFont": "false",
    }
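In a response like this, the document text sits inside the XHTML stored under `X-TIKA:content`, alongside a copy of the metadata in `<head>`. One way to get a plain text blob out of it, sketched with the standard library only; the helper names are hypothetical and the sample is a trimmed copy of the response above:

```python
from html.parser import HTMLParser


class _BodyTextCollector(HTMLParser):
    """Collect text nodes found inside <body>, skipping the <head> metadata."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


def tika_content_to_text(metadata: dict) -> str:
    """Return the body text of the XHTML stored under 'X-TIKA:content'."""
    collector = _BodyTextCollector()
    collector.feed(metadata.get("X-TIKA:content", ""))
    collector.close()
    return "\n".join(collector.chunks)


sample = {
    "Content-Type": "application/pdf",
    "X-TIKA:content": '<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n'
    '<title>untitled</title>\n</head>\n<body><div class="page"><p />\n'
    "<p>Hello, World!</p>\n<p />\n</div>\n</body></html>",
}
print(tika_content_to_text(sample))
```

The remaining keys of the dictionary can be kept as-is if the metadata is wanted separately from the text.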

Apache Tika seemed to obtain more metadata, such as the links in the PDF's author info, and it also seems to support more file types than unstructured. (Although I'm thinking of implementing an extensions list of files that will be parsed, like https://github.com/blacklanternsecurity/bbot/blob/stable/bbot/modules/filedownload.py#L24, just to avoid potentially incorrect strings being extracted from source code files.)
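The extensions list could be as simple as the check below. This is only a sketch: the set of extensions and the `should_parse` name are placeholders, not the list used by filedownload.

```python
from pathlib import Path

# Placeholder allowlist for illustration; the real list would be a module option.
PARSEABLE_EXTENSIONS = {"pdf", "docx", "xlsx", "pptx", "odt"}


def should_parse(file_path: str, extensions=PARSEABLE_EXTENSIONS) -> bool:
    """Return True if the file's extension is on the allowlist (case-insensitive)."""
    suffix = Path(file_path).suffix.lower().lstrip(".")
    return suffix in extensions


for candidate in ("/tmp/Report.PDF", "src/main.py", "README"):
    print(candidate, should_parse(candidate))
```

Files without an extension are skipped as well, since an empty suffix is never on the list.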

Installing either library via a container seems to be the way forward to avoid installation issues. I'm leaning towards Apache Tika, as the REST endpoint is very easy to use and it seems ready to go out of the box. It's a shame it requires a separate Docker container, but that seems unavoidable. If BBOT itself is running in a Docker container, would it be able to spawn this Apache Tika Docker container?

TheTechromancer · 2024-06-04T13:01:19Z

Okay I see the appeal of Apache Tika. But you make a good point about the docker container. I hadn't thought about the scenario of BBOT itself being in a container, which would make spawning another container unfeasible.

I guess dastardly is already affected by this, although I'm less concerned about that one since text extraction should be a core feature of BBOT. It's really important we get this right.

In this case, we need to support all possible architectures and installation methods, so I'm afraid docker is out of the picture. I think you might also agree that adding a java dependency to BBOT is not ideal.

What I'm wondering is if we can find a middle ground, maybe a golang or rust binary, that we can call similar to what we're currently doing with httpx and gowitness.

domwhewell-sage · 2024-06-04T15:38:46Z

I have managed to get the unstructured Python package working now in a fresh environment; it was probably some conflicting packages that stopped it working for me before.

But from the results, it didn't find as much metadata as Tika. From the unstructured documentation:

Unstructured metadata tracks general document information, like filename and file type, and more detailed document-specific information, such as element type.

They both obtained the contents of the file, which is probably most valuable for us. I don't have any objections to using unstructured instead, as long as we are OK with potentially missing out on some document metadata.

TheTechromancer · 2024-06-04T16:09:14Z

Okay, I think when it comes to metadata vs text extraction, it might be best to treat these as two separate tasks.

I'm not opposed to having an Apache Tika module. This would be pretty convenient and provide high-tier metadata and text extraction, at the cost of complexity. If we do that, I think it would make sense to have the docker setup, but allow the user to set their own URL if needed.
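A user-supplied URL with a local fallback might look roughly like this. The `tika_url` option name is hypothetical, and the fallback assumes a local Tika server on its usual port 9998 (the earlier snippet in this thread used 8889 instead); the request is only built here, not sent.

```python
import urllib.request

# Assumed fallback: a Tika server started locally on its default port.
DEFAULT_TIKA_URL = "http://127.0.0.1:9998"


def build_tika_request(data: bytes, config: dict) -> urllib.request.Request:
    """Build a PUT request for plaintext extraction against the configured Tika server."""
    # "tika_url" is a hypothetical option name; fall back to the local default if unset.
    base_url = (config.get("tika_url") or DEFAULT_TIKA_URL).rstrip("/")
    return urllib.request.Request(
        f"{base_url}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


request = build_tika_request(b"%PDF-1.3 ...", {"tika_url": "http://tika.internal:9998/"})
print(request.get_method(), request.full_url)
```

With this shape, the Docker setup only needs to run when the option is left empty, so users who already host Tika elsewhere never touch Docker.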

Eventually of course I would like BBOT to have a high-quality text extraction module, which doesn't require docker or Java. Since this is a CPU-intensive task, it would make sense to offload it into its own script. Whether that be a rust/golang/c++ binary, or a python script written by us (we could easily cover 95% of cases just by handling PDF + MS Office), this is the approach we should be using for most modules going forward.

Since BBOT has so many modules and CPU is so scarce in the main process, to get the max performance, it makes sense to use a simple binary or python script with parseable (i.e. JSON) output.
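As a rough illustration of that pattern: the main process launches a separate program and parses a single JSON object from its stdout. The worker below is a stand-in one-liner run with the current Python interpreter, not a real extractor, and `run_extractor` is a hypothetical name.

```python
import json
import subprocess
import sys

# Stand-in for a real extractor binary or script: prints one JSON object to stdout.
WORKER_SOURCE = (
    "import json, sys; "
    "print(json.dumps({'path': sys.argv[1], 'text': 'Hello, World!'}))"
)


def run_extractor(file_path: str, timeout: float = 30.0) -> dict:
    """Run the extractor in its own process and parse its JSON output."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE, file_path],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the worker exits non-zero
    )
    return json.loads(proc.stdout)


result = run_extractor("simple_pdf.pdf")
print(result["text"])
```

Because the contract is just "one JSON object on stdout", the worker could later be swapped for a Rust, Go, or C++ binary without changing the calling side.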

So yeah, to summarize: the ultimate goal is to have native functionality for metadata and text extraction, probably in separate modules. But since Tika is easier to implement, I'm open to using it in the meantime.

@domwhewell-sage which way are you leaning?

domwhewell-sage · 2024-06-05T08:30:04Z

If it's OK to have metadata extraction and text extraction as separate modules, I think unstructured might be the best fit for text extraction. I have created the module in my fork and will be testing it some more.

https://github.com/domwhewell-sage/bbot/blob/unstructured/bbot/modules/unstructured.py

TheTechromancer · 2024-06-05T11:04:24Z

Sounds good, let's not forget to set SCARF_NO_ANALYTICS=true before we publish.
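One possible way to do that from inside the module, sketched here under the assumption that the variable has to be in the environment before unstructured is imported; the helper name is hypothetical:

```python
import os


def disable_unstructured_analytics() -> None:
    """Opt out of unstructured's Scarf analytics for this process and its children."""
    os.environ["SCARF_NO_ANALYTICS"] = "true"


disable_unstructured_analytics()
# Only import unstructured after the variable is set, for example:
# from unstructured.partition.auto import partition
```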

domwhewell-sage added the enhancement New feature or request label Jun 3, 2024
