tikatree

I made tikatree after getting tired of manually digging through directories holding tens of thousands of files, looking for anything interesting. tikatree collects information about the files in a directory and outputs that information in a couple of different formats. It gets metadata for each file, creates file lists and directory trees, and calculates checksums. The idea is to make it easy to see what’s in a directory and to use that information to look for duplicates or compare directories.

tikatree is a command-line tool written in Python that uses Apache Tika to parse metadata and largely avoids external libraries for performance and simplicity’s sake.

Example metadata:

"tikatree.py": {
    "Content-Encoding": "UTF-8",
    "Content-Type": "application/x-sh; charset=UTF-8",
    "X-Parsed-By": [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.csv.TextAndCSVParser"
    ],
    "X-TIKA:content_handler": "ToTextContentHandler",
    "X-TIKA:embedded_depth": "0",
    "X-TIKA:parse_time_millis": "2",
    "resourceName": "b'tikatree.py'"
}

Example from file tree:

"tikatree.py": {
    "modified": "2020-07-26 05:42:12.124221",
    "size": "13.27KB",
    "md5": "cbb4f179676cbaba1a0a4ea9affee724",
    "sha256": "13a96980342cd7deebba290c00e4c1f1bb7058b3adafb9a7a537a5abc176b21e"
}
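To make the file tree entry above a bit more concrete, here is a rough sketch of how an entry like it could be built and then used to spot duplicates. This is not tikatree’s actual code: the function names `file_entry` and `find_duplicates` are made up for illustration, and the size formatting assumes 1 KB = 1024 bytes, which may not match what tikatree does.

```python
import hashlib
import os
from collections import defaultdict
from datetime import datetime


def file_entry(path, chunk_size=1024 * 1024):
    """Build a file-tree style entry for one file (hypothetical helper)."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    # Read the file once, in chunks, and feed both hashes from the same data.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    info = os.stat(path)
    return {
        "modified": str(datetime.fromtimestamp(info.st_mtime)),
        # Assumption: sizes are reported in units of 1024 bytes.
        "size": f"{info.st_size / 1024:.2f}KB",
        "md5": md5.hexdigest(),
        "sha256": sha256.hexdigest(),
    }


def find_duplicates(tree):
    """Group file names that share a SHA-256 checksum (hypothetical helper)."""
    by_hash = defaultdict(list)
    for name, entry in tree.items():
        by_hash[entry["sha256"]].append(name)
    return [sorted(names) for names in by_hash.values() if len(names) > 1]
```

Calling `file_entry` on every file and passing the resulting dictionary to `find_duplicates` returns lists of names whose contents are identical. The same idea works for comparing two directories by checksum.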

Apache Tika

Tika is a cool project. Here’s a snippet from their website: “The Apache Tika™ toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF). All of these file types can be parsed through a single interface, making Tika useful for search engine indexing, content analysis, translation, and much more.”

The metadata it can collect isn’t always super useful, but it can reveal some really interesting information about a file. One downside to Tika is that it runs as a server that you access over HTTP, so it can’t read files directly. Each file has to be sent to it over HTTP, and Tika then stores the file in RAM before processing it, which can be a problem with large files. I’m not entirely sure whether it has an issue with memory leaks or garbage collection, but the Java VM’s RAM usage can balloon to the point where it needs to be restarted.
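For a sense of what sending a file to Tika over HTTP looks like, here is a minimal sketch using only the standard library. It assumes a Tika server is already running at its default address of `http://localhost:9998` and that its `/meta` endpoint is used with a `PUT` request. The helper names are hypothetical and this is not necessarily how tikatree talks to the server.

```python
import json
import os
import urllib.request

# Assumption: a Tika server is listening on its default port.
TIKA_URL = "http://localhost:9998"


def build_meta_request(body, size, server=TIKA_URL):
    """Prepare a PUT request asking Tika for metadata as JSON."""
    return urllib.request.Request(
        f"{server}/meta",
        data=body,
        method="PUT",
        headers={
            "Accept": "application/json",
            # A generic type, so the client does not give Tika a misleading hint.
            "Content-Type": "application/octet-stream",
            "Content-Length": str(size),
        },
    )


def fetch_metadata(path, server=TIKA_URL):
    """Stream one file to the Tika server and return its metadata as a dict."""
    with open(path, "rb") as body:
        request = build_meta_request(body, os.path.getsize(path), server)
        with urllib.request.urlopen(request) as response:
            return json.load(response)
```

Passing the open file object as the request body lets the client stream the file instead of loading it into memory first. That keeps the client’s footprint small, but it does nothing about the server holding the file in RAM.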

Speed

When I was first prototyping tikatree I didn’t exactly think through what I was doing. I was testing it on rather large sets of files to try to catch edge cases, but it rescanned the directory for each output file. This put a bunch of extra wear on the hard drive I was using and killed it prematurely. Now tikatree scans the directory once or twice, depending on the options selected, and caches information about the files it has scanned.
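The scan-once-and-cache idea can be sketched roughly like this. It only illustrates the approach and is not tikatree’s actual implementation: the function names and the output file names `example_filelist.txt` and `example_filetree.json` are made up for the example.

```python
import json
import os


def scan_once(root):
    """Walk the directory a single time and cache basic info per file."""
    cache = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                info = os.stat(full)
            except OSError:
                continue  # skip files that vanished or cannot be read
            cache[os.path.relpath(full, root)] = {
                "size": info.st_size,
                "mtime": info.st_mtime,
            }
    return cache


def write_outputs(cache, out_dir):
    """Write every output from the cache without touching the scanned disk."""
    # Hypothetical output names, used only for this sketch.
    list_path = os.path.join(out_dir, "example_filelist.txt")
    tree_path = os.path.join(out_dir, "example_filetree.json")
    with open(list_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(sorted(cache)) + "\n")
    with open(tree_path, "w", encoding="utf-8") as handle:
        json.dump(cache, handle, indent=4, sort_keys=True)
```

Each extra output format then costs a pass over an in-memory dictionary rather than another walk over the disk.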

After improving the performance of tikatree I started running into networking problems on Windows, which wasn’t able to keep up with the number of connections tikatree was making to the Tika server. Adjusting some registry keys largely resolved this problem.

I was curious what the performance would be if I had written tikatree in Rust or Go, so I made a little proof of concept in each that created the _metadata.json file. Both Rust and Go were almost 3x faster than my Python implementation. It’s not a complete apples-to-apples comparison, but it was pretty close.

Plans

Apache Tika 2.0.0 is on the horizon, so I imagine I’ll need to make some changes to support that. I’d also like to rewrite tikatree in Rust for better performance, and I’m interested in creating a better system for actually viewing and comparing file information.

Phabricator