Analyzing S3 Download Logs with Athena
Analysis on package download logs can provide valuable insights into your users usage patterns.
In this guide, we'll explore how to export package download logs to S3 and use AWS Athena to extract valuable insights as well as preparing the data for consumption by other applications.
Why Download Logs are Important
Download logs are a record of every package downloaded by your users, along with information about the download such as the date and time, the user agent, and the IP address.
By analyzing these logs, you can gain valuable insights into how your users are interacting with your product.
For example, you can identify the most popular packages, track download trends by country, and monitor usage by package versions.
This information can be used to optimize your product development, improve your marketing efforts, and identify potential security threats.
Exporting Download Logs to S3
Head over to our help documentation on access log exports to S3
Once configured to export its download logs to S3, you will start receiving new log files at the specified interval. These log files will contain detailed information about each download event that takes place in your repositories.
Analysing Download Logs with AWS Athena
Athena is an interactive query service that makes it easy to analyze data in S3 using standard SQL.
Here's how to use Athena to analyze your download logs:
Create an Athena table that matches the schema of your download logs.
This can be done using a CREATE TABLE statement that specifies the location of the log files in S3 and the table properties. The exact SERDE settings and properties will depend on the format your logs are exported in, the below shows an example for CSV-formatted logs.
CREATE EXTERNAL TABLE IF NOT EXISTS my_repository_downloads (
`datetime` STRING,
`repository` STRING,
`status` INT,
`method` STRING,
`uri` STRING,
`host` STRING,
`ip_address` STRING,
`bytes` INT,
`city` STRING,
`country` STRING,
`edge` STRING,
`format` STRING,
`package` STRING,
`package_tags` STRING,
`recorded` STRING,
`referer` STRING,
`request_id` STRING,
`token` STRING,
`user` STRING,
`user_agent` STRING,
`eula` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
)
LOCATION 's3://bucketname/repositoryname/'
TBLPROPERTIES ("skip.header.line.count"="1", "serialization.encoding"="UTF-8")
Once the table is created, you can start querying the data using standard SQL statements. For example, you can write a query to identify your most active tokens:
SELECT count(*) as num_downloads, token
FROM my_repository_downloads
GROUP BY token
ORDER BY count(*) desc
LIMIT 10
Examples of Insights
Here are some examples of the types of insights that can be gained from analyzing download logs:
Most popular packages: By counting the number of downloads for each package, you can identify the most popular packages and prioritize their development and marketing.
Download trends by country: By filtering the data by country, you can identify where your product is most popular and adjust your marketing efforts accordingly.
Usage by package versions: By filtering the data by package version, you can identify which versions are most widely used and prioritize bug fixes and feature improvements.
Aggregation and export
While it's great to have all of your download logs in one place on S3, sometimes you want to be able to quickly answer high-level questions about usage.
This is where rolled-up views come in. By aggregating the data in different ways, you can create summary tables that can be queried or exported into other systems more quickly and easily than the raw logs.
Here's an example of how you could use Athena to create a summary table that shows the total number of downloads per IP Address.
CREATE TABLE package_downloads_per_ip_address
WITH (
format = 'PARQUET',
external_location = 's3://my-bucket/package_downloads_per_ip_address/'
) AS
SELECT
ip_address,
COUNT(*) AS downloads
FROM
my_repository_downloads
GROUP BY
ip_address
In this example, we're creating a new table called package_downloads_per_ip_address. The table is stored in Parquet format and partitioned by package to make querying more efficient. The SELECT statement does the actual aggregation using the GROUP BY clause to group downloads by IP Address.
Once you have this rolled-up view, you can ETL it into your application database to make these insights available to end users. You can also create additional views that roll up the data in different ways, depending on the questions you want to answer.
Conclusion
Whether you're a developer or a product manager, understanding how your users interact with your product is essential for success.
By taking advantage of the rich data provided by download logs, you can gain a deeper understanding of your users and optimize your product accordingly.
We hope this guide has been helpful in showing you how to generate insights and track usage with download logs exported to S3. If you have any questions or feedback, please let us know in the comments below.
Updated 23 days ago