EDGAR Web Logs

CS320: Data Science Programming II

Project Overview

This project analyzes the data generated by the EDGAR (Electronic Data Gathering, Analysis, and Retrieval) web logs, which record web requests made to the SEC's EDGAR database by users worldwide. These logs are extensive, often amounting to 2 GB of uncompressed data per day, and contain anonymized but valuable user behavior data.

The primary goal is to develop Python tools that extract and process information from these logs to better understand the behavior of users, particularly investment firms. This analysis can provide insights into which companies or industries hedge funds may be considering for investment, and whether they rely more on automated or manual research for their trading decisions.

Project Components

main.ipynb: A Jupyter notebook that analyzes user behavior based on the extracted data.
: A Python module for extracting data from the EDGAR filings.

Motivation

Understanding the behavior of users who access the EDGAR database can offer predictive insights into market trends and investment patterns. This project aims to uncover these patterns by analyzing how different documents are accessed and how that access correlates with investment success.

Usage

Open the main.ipynb notebook in a Jupyter environment to view the analysis:

jupyter notebook main.ipynb

Data

The data used in this project consists of anonymized logs from the EDGAR database, which include user interactions such as the types of filings accessed. These logs are structured to provide insights into user demographics and behavior without compromising individual privacy.
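
As a rough illustration of working with such logs, the sketch below counts requests per anonymized IP address using only the Python standard library. The column names (ip, date, cik), the sample rows, and the helper requests_per_ip are assumptions made up for this example; they are not taken from this project's code, and real log files may use a different layout.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only. The column names (ip, date, cik)
# are assumptions and may not match the real log layout.
SAMPLE_LOG = """ip,date,cik
101.0.0.abc,2017-01-01,1288776
101.0.0.abc,2017-01-01,320193
202.1.1.xyz,2017-01-01,320193
"""


def requests_per_ip(log_file):
    """Count how many requests each anonymized IP address made.

    log_file is any text file object containing CSV data with an
    'ip' column (hypothetical helper, not part of this project).
    """
    reader = csv.DictReader(log_file)
    return Counter(row["ip"] for row in reader)


# Read from an in-memory string here; a real run would pass an open file.
counts = requests_per_ip(io.StringIO(SAMPLE_LOG))
print(counts.most_common(1))
```

Because the reader streams one row at a time, the same approach should cope with multi-gigabyte daily files without loading them fully into memory.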

Acknowledgments

SEC for providing the EDGAR logs.
All contributors who have worked on analyzing public data for academic and professional purposes.

License

Distributed under the MIT License. See LICENSE.txt for more information.