New Module: Apache Tika & RAW_DATA events #1434

Open
domwhewell-sage opened this issue Jun 3, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@domwhewell-sage
Contributor

Description
As discussed in #717 and #907, it is probably a good time to start investigating Apache Tika and to create RAW_DATA events.

Apache Tika is a toolkit for extracting metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). The idea would be a module that consumes FILESYSTEM events, sends the document to Apache Tika, and then produces a RAW_DATA event containing a big blob of text that excavate can consume to pull out URLs and the like.

I've only ever seen Apache Tika run as an API endpoint that a file can be uploaded to, but I will do some more research into how it's used and then start work on a module for it.
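To make the proposed flow concrete, here is a minimal sketch of the FILESYSTEM to RAW_DATA step. The `Event` class, the `handle_filesystem_event` function, and the `"path"` key are hypothetical stand-ins for illustration only and are not BBOT's real API; the extractor is passed in as a plain callable so that Tika (or anything else) could sit behind it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Event:
    """Hypothetical stand-in for a BBOT event; not the real class."""
    type: str
    data: object


def handle_filesystem_event(
    event: Event, extract_text: Callable[[str], Optional[str]]
) -> Optional[Event]:
    """Turn a FILESYSTEM event into a RAW_DATA event, or None if nothing was extracted."""
    if event.type != "FILESYSTEM":
        return None
    # Assumption: the event data carries the local path of the downloaded file.
    path = event.data["path"]
    text = extract_text(path)
    if not text:
        return None
    return Event(type="RAW_DATA", data=text)


# Usage with a fake extractor standing in for Tika:
fake_extractor = lambda path: "See https://example.com for details"
raw = handle_filesystem_event(Event("FILESYSTEM", {"path": "/tmp/report.pdf"}), fake_extractor)
```

Keeping the extractor behind a callable means the text-extraction backend can be swapped without touching the event handling.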

@domwhewell-sage domwhewell-sage added the enhancement New feature or request label Jun 3, 2024
@domwhewell-sage
Contributor Author

Something like...

import requests

async def setup(self):
    # Start Docker and pull the Tika image (assumes systemd and sudo are available)
    await self.run_process("systemctl", "start", "docker", sudo=True)
    await self.run_process("docker", "pull", "apache/tika:latest", sudo=True)
    self.tika_url = "http://127.0.0.1:8889"
    return True

def extract_text(self, file_path):
    """
    extract_text Extracts plaintext from a document path using Tika.

    :param file_path: The path of the file to extract text from.
    :return: ASCII-encoded plaintext extracted from the document, or None on failure.
    """
    with open(file_path, "rb") as f:
        # tika-server extracts text from documents PUT to its /tika resource
        resp = requests.put(f"{self.tika_url}/tika", f, headers={"Accept": "text/plain"})
    if resp.status_code == 200:
        return resp.text.strip().encode("ascii", "ignore").decode()
    return None

@domwhewell-sage
Contributor Author

domwhewell-sage commented Jun 3, 2024

Or, instead of using Docker, there is a pip package: https://pypi.org/project/tika/

from tika import parser

def extract_text(file_path):
    """
    extract_text Extracts text and metadata from a document path using tika-python.

    :param file_path: The path of the file to extract text from.
    :return: The dict returned by tika-python, with "metadata" and "content" keys.
    """
    parsed = parser.from_file(file_path)
    return parsed

It appears this is a wrapper around the (Java) REST server, so this method would probably involve managing a Java install 🤢

To use this library, you need to have Java 7+ installed on your system as tika-python starts up the Tika REST server in the background.

@TheTechromancer
Collaborator

you need to have Java 7+ installed on your system

ew

@TheTechromancer
Collaborator

A possible alternative: https://github.com/Unstructured-IO/unstructured.

Probably would also need to run in a container, but still worth testing.

@domwhewell-sage
Contributor Author

unstructured

Install is either via a container or via a pip package plus a few apt dependencies. The pip install failed for me a few times with the error "error: command 'x86_64-linux-gnu-gcc' failed with exit status 1", so I opted to use the container instead. I'm not sure how you would add a file to the container without a REST endpoint available, but these are the results I got with a PDF generated by Python reportlab inside the container:

>>> from unstructured.partition.auto import partition
>>> elements = partition(filename="simple_pdf.pdf")
>>> for element in elements:
...  print(element)
...
Hello, I am a PDF document created with Python!
>>> for element in elements:
...  print(element.metadata.to_dict())
...
{'coordinates': {'points': ((100.0, 82.37380000000007), (100.0, 94.37380000000007), (362.77600000000007, 94.37380000000007), (362.77600000000007, 82.37380000000007)), 'system': 'PixelSpace', 'layout_width': 595.2756, 'layout_height': 841.8898}, 'filename': 'simple_pdf.pdf', 'languages': ['eng'], 'last_modified': '2024-06-04T11:16:09', 'page_number': 1, 'filetype': 'application/pdf'}
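Since printing each element in the session above yields its text, one way to build a single blob for a RAW_DATA event would be to join the string form of every element. This is only a sketch: `elements_to_text` and the `FakeElement` stand-in are hypothetical names, and the function relies only on `str(element)` behaving as it does in the session above.

```python
def elements_to_text(elements) -> str:
    """Join the string form of each element into one blob, skipping empty ones."""
    parts = (str(element).strip() for element in elements)
    return "\n\n".join(part for part in parts if part)


class FakeElement:
    """Stand-in for an unstructured element so the sketch runs without the package."""

    def __init__(self, text: str):
        self.text = text

    def __str__(self) -> str:
        return self.text


blob = elements_to_text([FakeElement("Hello, I am a PDF document created with Python!"), FakeElement("  ")])
```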

Apache Tika

Install is either via a container or a Java file. Managing a Java install could be painful, so I opted for the container. The container exposes a REST endpoint where you can upload files, and the response is returned in JSON format. Below is the response I got back from a similar PDF generated using reportlab.

{
        "pdf:unmappedUnicodeCharsPerPage": "0",
        "pdf:PDFVersion": "1.3",
        "pdf:docinfo:title": "untitled",
        "xmp:CreatorTool": "ReportLab PDF Library - www.reportlab.com",
        "pdf:hasXFA": "false",
        "access_permission:modify_annotations": "true",
        "access_permission:can_print_degraded": "true",
        "X-TIKA:Parsed-By-Full-Set": ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"],
        "dc:creator": "anonymous",
        "pdf:num3DAnnotations": "0",
        "dcterms:created": "2024-06-03T18:58:16Z",
        "dcterms:modified": "2024-06-03T18:58:16Z",
        "dc:format": "application/pdf; version=1.3",
        "pdf:docinfo:creator_tool": "ReportLab PDF Library - www.reportlab.com",
        "pdf:overallPercentageUnmappedUnicodeChars": "0.0",
        "access_permission:fill_in_form": "true",
        "pdf:docinfo:modified": "2024-06-03T18:58:16Z",
        "pdf:hasCollection": "false",
        "pdf:encrypted": "false",
        "dc:title": "untitled",
        "pdf:containsNonEmbeddedFont": "true",
        "Content-Length": "1426",
        "pdf:docinfo:subject": "unspecified",
        "pdf:hasMarkedContent": "false",
        "Content-Type": "application/pdf",
        "pdf:docinfo:creator": "anonymous",
        "pdf:producer": "ReportLab PDF Library - www.reportlab.com",
        "dc:subject": "unspecified",
        "pdf:totalUnmappedUnicodeChars": "0",
        "access_permission:extract_for_accessibility": "true",
        "access_permission:assemble_document": "true",
        "xmpTPg:NPages": "1",
        "pdf:hasXMP": "false",
        "pdf:charsPerPage": "13",
        "access_permission:extract_content": "true",
        "access_permission:can_print": "true",
        "pdf:docinfo:trapped": "False",
        "X-TIKA:Parsed-By": ["org.apache.tika.parser.DefaultParser", "org.apache.tika.parser.pdf.PDFParser"],
        "X-TIKA:content": '<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n<meta name="pdf:PDFVersion" content="1.3" />\n<meta name="pdf:docinfo:title" content="untitled" />\n<meta name="xmp:CreatorTool" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="pdf:hasXFA" content="false" />\n<meta name="access_permission:modify_annotations" content="true" />\n<meta name="access_permission:can_print_degraded" content="true" />\n<meta name="dc:creator" content="anonymous" />\n<meta name="dcterms:created" content="2024-06-03T18:58:16Z" />\n<meta name="dcterms:modified" content="2024-06-03T18:58:16Z" />\n<meta name="dc:format" content="application/pdf; version=1.3" />\n<meta name="pdf:docinfo:creator_tool" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="access_permission:fill_in_form" content="true" />\n<meta name="pdf:docinfo:modified" content="2024-06-03T18:58:16Z" />\n<meta name="pdf:hasCollection" content="false" />\n<meta name="pdf:encrypted" content="false" />\n<meta name="dc:title" content="untitled" />\n<meta name="Content-Length" content="1426" />\n<meta name="pdf:docinfo:subject" content="unspecified" />\n<meta name="pdf:hasMarkedContent" content="false" />\n<meta name="Content-Type" content="application/pdf" />\n<meta name="pdf:docinfo:creator" content="anonymous" />\n<meta name="pdf:producer" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="dc:subject" content="unspecified" />\n<meta name="access_permission:extract_for_accessibility" content="true" />\n<meta name="access_permission:assemble_document" content="true" />\n<meta name="xmpTPg:NPages" content="1" />\n<meta name="pdf:hasXMP" content="false" />\n<meta name="access_permission:extract_content" content="true" />\n<meta name="access_permission:can_print" content="true" />\n<meta name="pdf:docinfo:trapped" content="False" />\n<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />\n<meta name="X-TIKA:Parsed-By" 
content="org.apache.tika.parser.pdf.PDFParser" />\n<meta name="access_permission:can_modify" content="true" />\n<meta name="pdf:docinfo:producer" content="ReportLab PDF Library - www.reportlab.com" />\n<meta name="pdf:docinfo:created" content="2024-06-03T18:58:16Z" />\n<title>untitled</title>\n</head>\n<body><div class="page"><p />\n<p>Hello, World!</p>\n<p />\n</div>\n</body></html>',
        "access_permission:can_modify": "true",
        "pdf:docinfo:producer": "ReportLab PDF Library - www.reportlab.com",
        "pdf:docinfo:created": "2024-06-03T18:58:16Z",
        "pdf:containsDamagedFont": "false",
    }
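Note that the "X-TIKA:content" value in the response above is XHTML rather than plain text, so a consumer would need to strip the markup before handing the text on. A minimal sketch using only the standard library follows; `BodyTextExtractor` and `xhtml_to_text` are illustrative names, not part of Tika or BBOT.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text that appears inside <body>, ignoring the <head> metadata."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep only non-blank text found between <body> and </body>.
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


def xhtml_to_text(xhtml: str) -> str:
    """Return the visible body text of an XHTML string, one chunk per line."""
    parser = BodyTextExtractor()
    parser.feed(xhtml)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n<title>untitled</title>\n</head>\n'
    '<body><div class="page"><p />\n<p>Hello, World!</p>\n<p />\n</div>\n</body></html>'
)
text = xhtml_to_text(sample)
```

Requesting the document with an "Accept: text/plain" header, as in the earlier sketch in this thread, would avoid this step when the metadata is not needed.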

Apache Tika seemed to obtain more metadata, such as the links in the PDF's author info, and it also seems to support more file types than unstructured. (Although I'm thinking of implementing an extensions list of files that will be parsed, like https://github.com/blacklanternsecurity/bbot/blob/stable/bbot/modules/filedownload.py#L24, just to avoid potentially incorrect strings being extracted from source code files.)
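The extensions list could be as simple as an allowlist check on the file suffix. The sketch below is an illustration only: `PARSEABLE_EXTENSIONS` is a made-up example set, not the list from filedownload.py, and `should_parse` is a hypothetical helper name.

```python
from pathlib import Path

# Hypothetical allowlist for illustration; the real list would come from the module's config.
PARSEABLE_EXTENSIONS = frozenset({"pdf", "docx", "xlsx", "pptx"})


def should_parse(path: str, extensions=PARSEABLE_EXTENSIONS) -> bool:
    """Return True if the file's extension (case-insensitive) is in the allowlist."""
    suffix = Path(path).suffix.lower().lstrip(".")
    return suffix in extensions
```

With this in place, a source file such as main.py would be skipped, while Report.PDF would be sent on for extraction.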

Installing either library via a container seems to be the way forward to avoid installation issues. I'm leaning towards Apache Tika, as the REST endpoint is very easy to use and it seems ready to go out of the box. It's a shame it requires a separate Docker container, but that seems unavoidable. If BBOT itself is running in a Docker container, would it be able to spawn this Apache Tika container?

@TheTechromancer
Collaborator

TheTechromancer commented Jun 4, 2024

Okay I see the appeal of Apache Tika. But you make a good point about the docker container. I hadn't thought about the scenario of BBOT itself being in a container, which would make spawning another container unfeasible.

I guess dastardly is already affected by this, although I'm less concerned about that one since text extraction should be a core feature of BBOT. It's really important we get this right.

In this case, we need to support all possible architectures and installation methods, so I'm afraid docker is out of the picture. I think you might also agree that adding a java dependency to BBOT is not ideal.

What I'm wondering is if we can find a middle ground, maybe a golang or rust binary, that we can call similar to what we're currently doing with httpx and gowitness.

@domwhewell-sage
Contributor Author

I have managed to get the unstructured Python package working now in a fresh environment; the earlier failures were probably caused by some conflicting packages.

However, from the results, it didn't find as much metadata as Tika. From the unstructured documentation:

Unstructured metadata tracks general document information, like filename and file type, and more detailed document-specific information, such as element type.

They both obtained the contents of the file, which is probably the most valuable part for us. I don't have any objections to using unstructured instead, as long as we are OK with potentially missing out on some document metadata.

@TheTechromancer
Collaborator

TheTechromancer commented Jun 4, 2024

Okay, I think when it comes to metadata vs text extraction, it might be best to treat these as two separate tasks.

I'm not opposed to having an Apache Tika module. This would be pretty convenient and provide high-tier metadata and text extraction, at the cost of complexity. If we do that, I think it would make sense to have the docker setup, but allow the user to set their own URL if needed.
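Letting the user set their own URL might look roughly like the sketch below. Everything here is an assumption for illustration: the `tika_url` option name and `build_tika_request` are hypothetical, the default URL is simply the one from the sketch earlier in this thread, and the request targets tika-server's `/tika` resource.

```python
import urllib.request

# Default taken from the earlier sketch in this thread; a user-supplied URL overrides it.
DEFAULT_TIKA_URL = "http://127.0.0.1:8889"


def build_tika_request(file_bytes: bytes, config: dict) -> urllib.request.Request:
    """Build (but do not send) a PUT request that asks Tika for plain text."""
    base_url = config.get("tika_url", DEFAULT_TIKA_URL).rstrip("/")
    return urllib.request.Request(
        f"{base_url}/tika",
        data=file_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


# A user pointing the module at their own Tika server:
request = build_tika_request(b"%PDF-1.3 ...", {"tika_url": "http://tika.internal:9998/"})
```

Sending the request (for example with `urllib.request.urlopen`) is left out so the sketch runs without a server.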

Eventually of course I would like BBOT to have a high-quality text extraction module, which doesn't require docker or Java. Since this is a CPU-intensive task, it would make sense to offload it into its own script. Whether that be a rust/golang/c++ binary, or a python script written by us (we could easily cover 95% of cases just by handling PDF + MS Office), this is the approach we should be using for most modules going forward.

Since BBOT has so many modules and CPU is so scarce in the main process, to get the max performance, it makes sense to use a simple binary or python script with parseable (i.e. JSON) output.
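As a rough illustration of that pattern, the sketch below runs a helper process and parses one JSON object per line of its output. `run_json_helper` is a hypothetical name, and the "extractor" is just a Python one-liner standing in for a real binary; an actual BBOT module would presumably use its own async process helpers rather than a blocking `subprocess.run`.

```python
import json
import subprocess
import sys


def run_json_helper(command: list[str], timeout: float = 60.0) -> list[dict]:
    """Run a helper process and parse one JSON object per line of its stdout."""
    proc = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout, check=True
    )
    results = []
    for line in proc.stdout.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the helper's output
        results.append(json.loads(line))
    return results


# Stand-in for a real extractor binary: a tiny Python one-liner that prints JSON.
FAKE_EXTRACTOR = [
    sys.executable,
    "-c",
    'import json; print(json.dumps({"path": "a.pdf", "text": "Hello, World!"}))',
]

records = run_json_helper(FAKE_EXTRACTOR)
```

Because the heavy parsing happens in the child process, the main process only pays for reading and decoding a few lines of JSON.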

So yeah to summarize the ultimate goal is to have native functionality for metadata and text extraction, probably in separate modules. But since Tika is easier to implement, I'm open to using it in the meantime.

@domwhewell-sage which way are you leaning?

@domwhewell-sage
Contributor Author

If it's OK to have metadata extraction and text extraction as separate modules, I think unstructured might be the best choice for text extraction. I have created the module in my fork and will be testing it some more.

https://github.com/domwhewell-sage/bbot/blob/unstructured/bbot/modules/unstructured.py

@TheTechromancer
Collaborator

Sounds good, let's not forget to set SCARF_NO_ANALYTICS=true before we publish.
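One possible way to do that from Python is to set the variable in the process environment before the package is loaded. This is a sketch under the assumption that unstructured reads the variable when it is imported; `disable_scarf_analytics` is a hypothetical helper name.

```python
import os


def disable_scarf_analytics(env=os.environ) -> None:
    """Opt out of analytics by setting SCARF_NO_ANALYTICS in the given environment mapping."""
    # Assumption: this runs before `unstructured` is imported, so the package sees the value.
    env["SCARF_NO_ANALYTICS"] = "true"


disable_scarf_analytics()
# from unstructured.partition.auto import partition  # import only after the opt-out is set
```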
