AutoExIf: Automated OSINT Metadata Extraction from Websites
This post covers AutoExIf, a Python command-line tool for automated metadata extraction during OSINT reconnaissance. It discovers documents across target websites using three input modes (web crawling, DuckDuckGo dorking, and URL lists), downloads every file it finds, runs exiftool against each one, and produces structured CSV and JSON reports containing every metadata field. The tool surfaces author names, software versions, GPS coordinates, and other identifying data that organizations often leave embedded in publicly hosted files.
Why Document Metadata Matters for OSINT
Files published on the web frequently contain metadata their authors never intended to expose. A PDF uploaded to a corporate website might embed the Author field with an employee’s full name, the Producer field revealing the exact version of Adobe Acrobat used, and a CreateDate timestamp placing the document’s creation in a specific timezone [1]. Images are worse: a JPEG taken on a phone and uploaded without stripping can carry GPS coordinates accurate to a few meters, the phone’s Make and Model, and the Software that last processed it [2].
This metadata leakage is a well-documented reconnaissance vector. Tools like FOCA [3] pioneered automated metadata extraction from public documents. AutoExIf takes this further by combining three discovery methods into a single pipeline: it can spider entire websites, dork search engines for specific file types, or consume a pre-built URL list. All three feed into the same download and extraction pipeline.
Three Input Modes, One Shared Pipeline
The architecture separates URL discovery from processing. Each mode produces a flat list of URLs, which then flows into a shared download, extraction, and output pipeline.
CLI (argparse)
  |-- --dork mode   -> DuckDuckGo search -> URL list
  |-- --crawl mode  -> Scrapy spider     -> URL list
  |-- --urls-file   -> read from file    -> URL list
  |
Download pipeline
  |
Exiftool extraction
  |
Output (CSV + JSON)
The three input flags are mutually exclusive. From cli.py:
# From: autoexif/cli.py
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument(
    "--dork", "-d", help="Search dork query (e.g. 'site:example.com filetype:pdf')"
)
group.add_argument(
    "--crawl", "-c", nargs="+", help="URL(s) to crawl for documents"
)
group.add_argument(
    "--urls-file", "-u", help="File containing URLs (one per line)"
)
This design means adding a new discovery mode in the future only requires producing a list[str] of URLs. The download, extraction, and reporting code does not change.
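The separation described above can be sketched as a small dispatch table. This is a minimal illustration, not the tool's code: `discover_from_lines`, `discover_from_csv`, `run_pipeline`, and `MODES` are hypothetical names, and the pipeline stand-in only de-duplicates URLs instead of downloading anything.

```python
from typing import Callable

# A discovery mode is any callable that turns its input into a flat URL list.
Discoverer = Callable[[str], list[str]]


def discover_from_lines(text: str) -> list[str]:
    """Hypothetical stand-in for --urls-file: one URL per line, blanks skipped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def discover_from_csv(text: str) -> list[str]:
    """Hypothetical new mode: comma-separated URLs."""
    return [part.strip() for part in text.split(",") if part.strip()]


def run_pipeline(urls: list[str]) -> list[dict]:
    """Placeholder for download + exiftool + reporting; it only de-duplicates."""
    unique = list(dict.fromkeys(urls))  # keeps first-seen order
    return [{"URL": u} for u in unique]


MODES: dict[str, Discoverer] = {
    "lines": discover_from_lines,
    "csv": discover_from_csv,
}


def main(mode: str, raw_input: str) -> list[dict]:
    # The pipeline never needs to know which mode produced the URLs.
    return run_pipeline(MODES[mode](raw_input))
```

Adding a mode in this sketch means registering one more function in `MODES`; `run_pipeline` stays untouched.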
Scrapy Spider: Crawling for Documents
The crawl mode uses a Scrapy spider that follows same-domain HTML links while collecting document URLs from any origin. This distinction is important: many organizations host their web pages on example.com but serve downloadable files from CDNs, S3 buckets, or subdomains like cdn.example.com. The spider follows links only on the target domain but collects document links regardless of where they point.
# From: autoexif/spider.py
def parse(self, response):
    content_type = response.headers.get("Content-Type", b"").decode("utf-8", errors="ignore")
    mime = content_type.split(";")[0].strip().lower()
    # Non-text response: treat as a discovered document
    if not mime.startswith("text/") and mime != "application/xhtml+xml":
        url = response.url
        if url not in self.found_urls:
            self.found_urls.add(url)
            yield {"url": url}
        return
    # Text/HTML response: extract links
    for href in response.css("a::attr(href)").getall():
        url = response.urljoin(href)
        if is_document_url(url):
            # Document link: collect it (any origin allowed)
            if url not in self.found_urls:
                self.found_urls.add(url)
                yield {"url": url}
        elif is_same_domain(url, self._allowed_domains):
            # Same-domain HTML page: follow it
            # Scrapy's DUPEFILTER handles already-visited URLs
            yield response.follow(url, callback=self.parse, errback=self.on_error)
The domain matching logic strips www. prefixes so that www.example.com and example.com are treated as the same domain:
# From: autoexif/spider.py
def is_same_domain(url: str, allowed_domains: list[str]) -> bool:
    """Check if a URL's domain matches any allowed domain (exact match, www allowed)."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host in allowed_domains
Subdomains like cdn.example.com are intentionally treated as different domains for link-following purposes. The spider will not recursively crawl a CDN, but it will download a document linked from there.
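To make the matching rule concrete, the following sketch copies `is_same_domain` from the excerpt above and runs it against a few sample URLs. The URLs and the `allowed` list are illustrative, and the sketch assumes the allowed domains are stored without a `www.` prefix.

```python
from urllib.parse import urlparse


def is_same_domain(url: str, allowed_domains: list[str]) -> bool:
    """Copy of the excerpt: exact host match after stripping a leading 'www.'."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host in allowed_domains


allowed = ["example.com"]  # assumed to be stored without the www. prefix

followed = is_same_domain("https://www.example.com/about", allowed)
bare = is_same_domain("https://example.com/team", allowed)
cdn = is_same_domain("https://cdn.example.com/files/", allowed)
no_host = is_same_domain("mailto:someone@example.com", allowed)
```

Here `followed` and `bare` are true, while `cdn` and `no_host` are false: the CDN page is never crawled, and URLs without a hostname fall through to the empty-string case.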
File Type Detection by Extension and MIME Type
The tool recognizes 32 file extensions across four categories, defined as frozensets for O(1) lookup:
# From: autoexif/filetypes.py
DOCUMENT_EXTENSIONS = frozenset({
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",
    ".odt", ".ods", ".odp", ".rtf",
})
IMAGE_EXTENSIONS = frozenset({
    ".jpg", ".jpeg", ".png", ".tiff", ".tif", ".gif", ".bmp", ".webp", ".svg",
})
AUDIO_EXTENSIONS = frozenset({
    ".mp3", ".wav", ".flac", ".ogg", ".aac", ".m4a",
})
VIDEO_EXTENSIONS = frozenset({
    ".mp4", ".avi", ".mov", ".mkv", ".wmv", ".webm",
})
ALL_EXTENSIONS = DOCUMENT_EXTENSIONS | IMAGE_EXTENSIONS | AUDIO_EXTENSIONS | VIDEO_EXTENSIONS
URL-based detection parses the path component and checks the suffix. This avoids false positives from query parameters or fragments:
# From: autoexif/filetypes.py
def is_document_url(url: str) -> bool:
    """Check if a URL points to a known document/media file by extension."""
    try:
        path = urlparse(url).path
    except ValueError:
        return False
    ext = PurePosixPath(path).suffix.lower()
    return ext in ALL_EXTENSIONS
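The effect of checking only the path component is easiest to see with a few inputs. The sketch below reuses the function from the excerpt with a deliberately small, illustrative `ALL_EXTENSIONS` set; the sample URLs are made up.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Illustrative subset; the tool's real set is larger.
ALL_EXTENSIONS = frozenset({".pdf", ".docx", ".jpg"})


def is_document_url(url: str) -> bool:
    """Same logic as the excerpt: inspect only the suffix of the URL path."""
    try:
        path = urlparse(url).path
    except ValueError:
        return False
    ext = PurePosixPath(path).suffix.lower()
    return ext in ALL_EXTENSIONS


# Query string and fragment are ignored; the extension check is case-insensitive.
with_query = is_document_url("https://example.com/report.PDF?v=2#page=3")
# An extension that appears only inside a query parameter does not count.
in_query_only = is_document_url("https://example.com/view?file=report.pdf")
# Malformed URLs that make urlparse raise ValueError are rejected.
malformed = is_document_url("http://[invalid")
```

The second case also shows the cost of the approach: download endpoints that name the file only in a query parameter are missed by extension matching, which is where a content-type fallback becomes useful.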
The module also maintains a MIME_TO_CATEGORY dictionary with 23 content type mappings, used as a fallback when the spider encounters non-text responses without a recognizable file extension.
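A fallback of this kind might look like the sketch below. The four entries and the helper name `category_from_content_type` are assumptions made for illustration; the actual `MIME_TO_CATEGORY` table and the category labels it uses are not shown in the excerpts.

```python
from typing import Optional

# Illustrative subset only; the real table is described as having 23 entries.
MIME_TO_CATEGORY = {
    "application/pdf": "document",
    "image/jpeg": "image",
    "audio/mpeg": "audio",
    "video/mp4": "video",
}


def category_from_content_type(header: str) -> Optional[str]:
    """Map a raw Content-Type header to a category, ignoring parameters and case."""
    mime = header.split(";")[0].strip().lower()
    return MIME_TO_CATEGORY.get(mime)


pdf_category = category_from_content_type("Application/PDF; charset=binary")
html_category = category_from_content_type("text/html; charset=utf-8")
```

With this subset, `pdf_category` is `"document"` and `html_category` is `None`, so an extensionless URL such as a download endpoint can still be classified from its response headers.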
DuckDuckGo Dorking: Scraping Search Results
The dork mode submits queries to DuckDuckGo’s HTML interface and extracts result URLs from the response. This approach avoids any API dependency. The tool uses the html.duckduckgo.com/html/ endpoint, which returns a simpler page structure than the JavaScript-heavy main site.
# From: autoexif/dork.py
def duckduckgo_search(dork: str, limit: int) -> list[str]:
    """Scrape DuckDuckGo HTML search results for the given dork query."""
    urls: list[str] = []
    session = requests.Session()
    session.verify = False
    session.mount("https://", _LenientHTTPSAdapter())
    session.headers.update({"User-Agent": random.choice(USER_AGENTS)})
    page = 0
    max_pages = (limit // 10) + 3
    # ...
DuckDuckGo wraps result URLs in redirect links. The tool extracts the actual target URL from the uddg query parameter:
# From: autoexif/dork.py
href = a_tag["href"]
match = re.search(r"[?&]uddg=([^&]+)", href)
if match:
    url = unquote(match.group(1))
elif href.startswith("http"):
    url = href
else:
    continue
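The unwrapping step can be tried in isolation. The sketch below wraps the excerpt's regex in a hypothetical helper, `unwrap_with_regex`, and compares it with an alternative built on the standard library's `parse_qs`; the sample redirect link is constructed for illustration rather than captured from a live results page.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, unquote, urlparse


def unwrap_with_regex(href: str) -> Optional[str]:
    """Same logic as the excerpt, returning None where the excerpt uses continue."""
    match = re.search(r"[?&]uddg=([^&]+)", href)
    if match:
        return unquote(match.group(1))
    if href.startswith("http"):
        return href
    return None


def unwrap_with_parse_qs(href: str) -> Optional[str]:
    """Alternative sketch: let parse_qs split and decode the query string."""
    values = parse_qs(urlparse(href).query).get("uddg")
    if values:
        return values[0]
    return href if href.startswith("http") else None


wrapped = "//duckduckgo.com/l/?uddg=https%3A%2F%2Fexample.com%2Freport.pdf&rut=abc"
target = unwrap_with_regex(wrapped)
```

Both helpers return `https://example.com/report.pdf` for this input. They can differ on values containing a literal `+`, which `parse_qs` decodes as a space and `unquote` leaves alone.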
The dorking module handles pagination by extracting the hidden form fields from DuckDuckGo’s “Next” button and submitting them as POST data. It implements rate limit handling: if DuckDuckGo returns HTTP 202, the tool backs off with exponential delays up to two retries before giving up.
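The retry behavior can be sketched independently of the network. In the example below, `fetch_with_backoff`, the 2-second base delay, and the injectable `fetch` and `sleep` callables are all assumptions for illustration; only the "HTTP 202, exponential delays, two retries" rule comes from the description above.

```python
import time
from typing import Callable


def fetch_with_backoff(
    fetch: Callable[[], int],
    max_retries: int = 2,
    base_delay: float = 2.0,  # assumed value, not taken from the tool
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, list[float]]:
    """Call fetch(); while it returns 202, wait with doubling delays and retry."""
    delays: list[float] = []
    status = fetch()
    for attempt in range(max_retries):
        if status != 202:
            break
        delay = base_delay * (2 ** attempt)  # 2s, then 4s with the defaults
        delays.append(delay)
        sleep(delay)
        status = fetch()
    return status, delays


# Simulated responses: rate-limited twice, then a normal page.
codes = iter([202, 202, 200])
status, waited = fetch_with_backoff(lambda: next(codes), sleep=lambda s: None)
```

With the simulated sequence, `status` ends as 200 after delays of 2 and 4 seconds. If the third response were also 202, the function would return 202 and the caller would give up.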
Search engine ads and tracker redirects are filtered out by checking the result hostname against a blocklist:
# From: autoexif/dork.py
SKIP_DOMAINS = {"duckduckgo.com", "bing.com", "google.com", "google.de"}

def is_ad_url(url: str) -> bool:
    """Check if a URL is an ad/tracker redirect rather than an organic result."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname or ""
    except ValueError:
        return True
    if not host:
        return True
    return any(d in host for d in SKIP_DOMAINS)
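One property of the check above is that `d in host` is a substring test, so any hostname that merely contains a blocklisted string is also skipped. The sketch below contrasts it with a stricter suffix match; `substring_match` and `suffix_match` are hypothetical helper names, and whether the stricter rule is worth adopting depends on how often such hostnames show up in real results.

```python
SKIP_DOMAINS = {"duckduckgo.com", "bing.com", "google.com", "google.de"}


def substring_match(host: str) -> bool:
    """The excerpt's rule: any blocklisted string appearing anywhere in the host."""
    return any(d in host for d in SKIP_DOMAINS)


def suffix_match(host: str) -> bool:
    """Stricter sketch: the host is a blocklisted domain or one of its subdomains."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in SKIP_DOMAINS)


hosts = ["www.google.com", "mybing.com", "google.com.example.org", "example.com"]
comparison = {h: (substring_match(h), suffix_match(h)) for h in hosts}
```

In `comparison`, both rules skip `www.google.com` and keep `example.com`, but only the substring rule discards `mybing.com` and `google.com.example.org`.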
The Extraction Pipeline: exiftool as a Subprocess
The core extraction runs exiftool with the -json flag on each downloaded file and parses the JSON output. This is the right approach: exiftool supports over 20,000 tag types across hundreds of file formats [4], and trying to reimplement even a fraction of that in Python would be impractical.
# From: autoexif/pipeline.py
def run_exiftool(filepath: Path) -> dict:
    """Run exiftool on a file and return parsed metadata dict."""
    try:
        result = subprocess.run(
            ["exiftool", "-json", str(filepath)],
            capture_output=True,
            text=True,
            timeout=30,
        )
        if result.returncode != 0:
            print(f" [!] exiftool error for {filepath.name}: {result.stderr.strip()}")
            return {}
        data = json.loads(result.stdout)
        return data[0] if data else {}
    except FileNotFoundError:
        print("[!] exiftool not found. Please install it: https://exiftool.org/")
        sys.exit(1)
    except (subprocess.TimeoutExpired, json.JSONDecodeError) as e:
        print(f" [!] exiftool failed for {filepath.name}: {e}")
        return {}
Two details are worth noting. First, the 30-second timeout per file prevents the tool from hanging on malformed inputs. Second, if exiftool is not installed on the system at all, the tool exits immediately with install instructions rather than silently producing empty results.
The download function handles filename collisions by appending an incrementing counter. This is necessary because multiple pages on a target site might link to files with the same name:
# From: autoexif/pipeline.py
dest = download_dir / filename
counter = 1
while dest.exists():
    stem = Path(filename).stem
    suffix = Path(filename).suffix
    dest = download_dir / f"{stem}_{counter}{suffix}"
    counter += 1
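The collision loop can be exercised on its own in a temporary directory. The wrapper name `unique_destination` is hypothetical; the loop inside it follows the excerpt.

```python
import tempfile
from pathlib import Path


def unique_destination(download_dir: Path, filename: str) -> Path:
    """Return a path in download_dir that does not exist yet (excerpt's loop)."""
    dest = download_dir / filename
    stem = Path(filename).stem
    suffix = Path(filename).suffix
    counter = 1
    while dest.exists():
        dest = download_dir / f"{stem}_{counter}{suffix}"
        counter += 1
    return dest


# Simulate three different pages that all link to a file named report.pdf.
with tempfile.TemporaryDirectory() as tmp:
    names = []
    for _ in range(3):
        dest = unique_destination(Path(tmp), "report.pdf")
        dest.write_bytes(b"")  # stand-in for the downloaded content
        names.append(dest.name)
```

After the loop, `names` holds `report.pdf`, `report_1.pdf`, and `report_2.pdf`, so no download overwrites an earlier one.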
TLS Tolerance for Real-World Targets
Many target websites, especially older government or corporate sites, run outdated TLS configurations. The tool includes a custom HTTPS adapter that disables certificate verification and hostname checking, and lowers the cipher suite and protocol version requirements:
# From: autoexif/pipeline.py
class _LenientHTTPSAdapter(requests.adapters.HTTPAdapter):
    """HTTPS adapter that tolerates servers with broken/legacy TLS configs."""
    def init_poolmanager(self, *args, **kwargs):
        ctx = create_urllib3_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        ctx.set_ciphers("DEFAULT:@SECLEVEL=1")
        ctx.minimum_version = ssl.TLSVersion.TLSv1
        kwargs["ssl_context"] = ctx
        return super().init_poolmanager(*args, **kwargs)
Setting SECLEVEL=1 and allowing TLSv1 as the minimum version means the tool can connect to servers that modern browsers would refuse. This is a deliberate tradeoff for an OSINT tool: the goal is to extract metadata from whatever the target serves, not to enforce modern TLS standards.
Output: Dynamic CSV Columns and Full JSON Dump
The output format is designed around the fact that different file types have completely different metadata fields. A PDF might have Author, Producer, and PageCount while a JPEG has Make, Model, and GPSLatitude. Rather than defining a fixed schema, the CSV writer dynamically computes columns from the union of all keys across all extracted files:
# From: autoexif/pipeline.py
def build_csv_columns(rows: list[dict]) -> list[str]:
    """Build CSV column list: URL, Filename first, then remaining sorted."""
    all_keys: set[str] = set()
    for row in rows:
        all_keys.update(row.keys())
    all_keys.discard("URL")
    all_keys.discard("Filename")
    return ["URL", "Filename"] + sorted(all_keys)
URL and Filename are always the first two columns, with all other fields sorted alphabetically after them. The DictWriter uses restval="" so that missing fields produce empty cells rather than errors. The result is a wide, sparse CSV that can contain hundreds of columns if the downloaded files span many types.
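The sketch below combines `build_csv_columns` from the excerpt with `csv.DictWriter` and `restval=""` to show what the sparse output looks like. The two sample rows are invented, and the in-memory `io.StringIO` buffer stands in for the real output file.

```python
import csv
import io


def build_csv_columns(rows: list[dict]) -> list[str]:
    """Copy of the excerpt: URL and Filename first, remaining keys sorted."""
    all_keys: set[str] = set()
    for row in rows:
        all_keys.update(row.keys())
    all_keys.discard("URL")
    all_keys.discard("Filename")
    return ["URL", "Filename"] + sorted(all_keys)


# Invented sample rows: a PDF and a JPEG share only URL and Filename.
rows = [
    {"URL": "https://example.com/a.pdf", "Filename": "a.pdf",
     "Author": "Jane Doe", "Producer": "Adobe PDF Library 17.0"},
    {"URL": "https://example.com/b.jpg", "Filename": "b.jpg",
     "Make": "ExampleCam", "GPSLatitude": "52.52"},
]

columns = build_csv_columns(rows)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
writer.writeheader()
writer.writerows(rows)  # fields missing from a row become empty cells
csv_lines = buffer.getvalue().splitlines()
```

The header comes out as `URL,Filename,Author,GPSLatitude,Make,Producer`, and each row leaves the columns it lacks empty.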
The JSON output preserves every field exactly as exiftool returned it, serving as the lossless record for programmatic processing.
Terminal Summary: What the Analyst Cares About
After extraction, the tool prints a summary highlighting the fields most useful for OSINT: file type counts, unique author names, software used to create the files, and any GPS-tagged files with their coordinates.
# From: autoexif/pipeline.py
def format_summary(rows: list[dict]) -> str:
    """Build a human-readable summary string."""
    # ...
    # Unique authors
    authors = {r["Author"] for r in rows if r.get("Author")}
    if authors:
        lines.append(f"[+] Authors: {', '.join(sorted(authors))}")
    # Unique tools (Creator + Producer)
    tools: set[str] = set()
    for r in rows:
        if r.get("Creator"):
            tools.add(str(r["Creator"]))
        if r.get("Producer"):
            tools.add(str(r["Producer"]))
    if tools:
        lines.append(f"[+] Tools/Software: {', '.join(sorted(tools))}")
    # GPS coordinates
    gps_files: list[str] = []
    for r in rows:
        if r.get("GPSLatitude") and r.get("GPSLongitude"):
            gps_files.append(
                f" {r['Filename']}: {r['GPSLatitude']}, {r['GPSLongitude']}"
            )
    if gps_files:
        lines.append("[+] Files with GPS coordinates:")
        lines.extend(gps_files)
The Author and Creator/Producer fields are extracted separately because exiftool treats them as distinct: Author typically comes from the document’s own metadata, while Creator and Producer identify the software that generated it [4]. A single PDF might list “Jane Doe” as the Author and “Microsoft Word” as the Creator with “Adobe PDF Library 17.0” as the Producer.
Polite Defaults with Override Options
The tool ships with conservative defaults: depth-3 crawling, 1-second delay between requests, concurrency of 2, and robots.txt respected. Each of these can be overridden:
# Aggressive crawl (depth 5, fast, ignore robots.txt)
python autoexif.py --crawl https://example.com --depth 5 --delay 0.2 --concurrency 8 --ignore-robots
# Conservative dork search (10 results max)
python autoexif.py --dork "site:example.com filetype:pdf" --limit 10
The --keep flag preserves downloaded files after extraction. By default, the downloads directory is cleaned up after the CSV and JSON are written.
Domain-Aware Output Naming
When scanning multiple targets, output files include a domain slug to prevent overwriting:
# From: autoexif/cli.py
def _derive_slug(args, urls: list[str]) -> str:
    """Return a short, filename-safe slug identifying the target(s)."""
    # ...
    if len(domains) == 1:
        slug = domains[0]
    elif len(domains) <= 3:
        slug = "-".join(domains)
    else:
        slug = f"{domains[0]}-plus{len(domains) - 1}"
    return re.sub(r"[^A-Za-z0-9._-]", "_", slug)
Scanning example.com produces results_example.com.csv. Scanning three sites produces results_a.com-b.com-c.com.csv. Scanning more than three names the first domain and counts the rest, so five sites produce results_a.com-plus4.csv. This keeps output files distinguishable across multiple runs without requiring the operator to specify output paths manually.
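The naming rule can be checked with a small stand-alone function. `slug_for_domains` is a hypothetical helper that covers only the branch logic visible in the excerpt; how the tool derives the `domains` list is elided there and is not reconstructed here.

```python
import re


def slug_for_domains(domains: list[str]) -> str:
    """Branch logic from the excerpt, applied to an already-built domain list."""
    if len(domains) == 1:
        slug = domains[0]
    elif len(domains) <= 3:
        slug = "-".join(domains)
    else:
        slug = f"{domains[0]}-plus{len(domains) - 1}"
    # Anything outside letters, digits, dot, underscore, and hyphen becomes "_".
    return re.sub(r"[^A-Za-z0-9._-]", "_", slug)


one = slug_for_domains(["example.com"])
three = slug_for_domains(["a.com", "b.com", "c.com"])
five = slug_for_domains(["a.com", "b.com", "c.com", "d.com", "e.com"])
with_port = slug_for_domains(["example.com:8080"])
```

These evaluate to `example.com`, `a.com-b.com-c.com`, `a.com-plus4`, and `example.com_8080` respectively; the last case shows the final substitution keeping characters such as `:` out of filenames.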
Limitations and Honest Assessment
AutoExIf does not strip metadata; it extracts it. The name references EXIF, but the tool’s purpose is reconnaissance, not sanitization. It reads metadata from target files but does not modify them.
The DuckDuckGo dorking depends on HTML scraping. If DuckDuckGo changes the structure of their HTML results page, the parsing logic in dork.py will break. There is no API key or official SDK involved.
TLS verification is disabled globally for downloads. The _LenientHTTPSAdapter disables certificate verification and hostname checking. This is appropriate for an OSINT tool that needs to reach poorly configured servers, but it means the tool is vulnerable to MITM attacks on the download path. The metadata extracted from intercepted files would be the attacker’s metadata, not the target’s.
Exiftool invocation is per-file. The tool spawns a new exiftool subprocess for each downloaded file. For large crawls with hundreds of files, this creates significant process spawning overhead. Exiftool supports batch mode (-json .) which processes an entire directory in one invocation; using that would be faster.
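A batch variant could look roughly like the sketch below. The function names, the 300-second timeout, and the decision to parse stdout regardless of the exit code are assumptions, not the tool's behavior. It relies on exiftool's JSON output being an array with one object per file, each carrying a `SourceFile` key.

```python
import json
import subprocess
from pathlib import Path


def parse_batch_output(stdout: str) -> dict[str, dict]:
    """Index exiftool's JSON array by the SourceFile path of each entry."""
    return {entry["SourceFile"]: entry for entry in json.loads(stdout)}


def run_exiftool_batch(download_dir: Path, timeout: int = 300) -> dict[str, dict]:
    """Run exiftool once over a whole directory instead of once per file."""
    result = subprocess.run(
        ["exiftool", "-json", str(download_dir)],
        capture_output=True,
        text=True,
        timeout=timeout,  # assumed value; one budget for the entire batch
    )
    # Assumption: one unreadable file should not discard the rest of the
    # batch, so stdout is parsed whenever it is non-empty.
    if not result.stdout.strip():
        return {}
    return parse_batch_output(result.stdout)


# Parsing can be checked without exiftool installed, using invented output.
sample = (
    '[{"SourceFile": "downloads/a.pdf", "Author": "Jane Doe"},'
    ' {"SourceFile": "downloads/b.jpg", "Make": "ExampleCam"}]'
)
by_file = parse_batch_output(sample)
```

The tradeoff is that the per-file timeout from `run_exiftool` no longer applies: one pathological file can consume the whole batch's time budget.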
The spider does not handle JavaScript-rendered pages. Sites that load document links dynamically via JavaScript will yield no results from the crawl mode. The spider only sees links present in the initial HTML response.
This post was generated by an LLM based on code from AutoExIf. All code snippets are from the actual repository.
References
[1] Phil Harvey, “ExifTool by Phil Harvey,” exiftool.org. [Online]. Available: https://exiftool.org/
[2] CIPA DC-008-2023, “Exchangeable image file format for digital still cameras: Exif Version 3.0,” Camera & Imaging Products Association, 2023.
[3] ElevenPaths, “FOCA - Fingerprinting Organizations with Collected Archives,” GitHub. [Online]. Available: https://github.com/ElevenPaths/FOCA
[4] Phil Harvey, “ExifTool Tag Names,” exiftool.org. [Online]. Available: https://exiftool.org/TagNames/