Amazon Athena data federation

Project at a glance

Project goal: Support an external Hive metastore outside of AWS Glue for Amazon Athena. Support data federation to query data sources outside of Amazon S3.

Timeline: September - November 2019 (public preview at re:Invent)

Console team: Product manager, UX designer (me), research coordinator, front-end developer, writer, editor

Solutions: Customer feedback helped guide how to present and name options.

User research: 6 data architects and cloud engineers

Outcome: Athena Hive metastore and data federation launched at re:Invent 2019

Design problem

Athena Hive metastore

Athena users may only use AWS Glue to store metadata about their datasets. However, some data admins use their own external Hive metastore to access schema metadata in their private VPC. This is an adoption blocker for Athena. Migrating metadata from Hive to AWS Glue requires time and effort. Users may have workflows or use services like EMR or Ranger that depend on the Hive metastore. No migration tool exists, just guidance. Supporting Hive means users don’t have to migrate metadata and can query legacy data where it sits.

Athena data federation

Athena users may only query data stored in Amazon S3. Users may have data stored in other AWS services, such as CloudWatch or Redshift, that they wish to query, but doing so requires moving the data into an S3 bucket. They may also have data stored in third-party databases like MongoDB or Redis. With the data federation project, a user may connect to a data source outside of S3 and run queries on the data where it sits. This allows them to avoid paying additional storage fees and building pipelines to move data in and out of S3 just to make it available to query.

Design process

In this project, I immediately started talking to customer-facing AWS big data architects to understand the use cases and possible pain points for Athena users. I worked with the product manager to review requirements and technical constraints. I successfully argued for a change to an API parameter name to make it more user-friendly in the console, which in turn affected users of the Athena CLI (command line interface) and SDK (software development kit) as well.

I reached out to our research team to request help recruiting users for a feedback study. I spoke with 6 users and shared early versions of the workflows using a clickable prototype over WebEx. I incorporated their feedback, along with UX review and product team feedback, to arrive at the final designs. I worked with the writer and editor to use familiar terminology and avoid creating any new jargon. After the feature launched at the re:Invent 2019 conference, we got feedback about the Lambda connectors and the data source connection, which I incorporated into subsequent revisions.

User research

I organized a customer site visit with AWS employees and users to hear feedback on the Athena console.

Formative research

I conducted a site visit with our top Athena console users for feedback on their experiences. I learned that many of them are in the console all day, every day. They identified changes to micro-interactions that would make their repetitive processes more efficient. I learned about a variety of user types at the same company using Athena for data science, data engineering, software development and business analysis. This session helped strengthen my understanding of the user personas and user journey for Athena customers on a team.

Usability testing

In early versions of the federated query prototype, users expressed concern about creating a Lambda function. They thought they might need help from a more experienced AWS user to complete that step. In subsequent versions, we de-emphasized the Lambda connectors in the initial decision-making.

I discovered that users expected to connect to external data sources in much the same way as they do in business intelligence tools and database servers. However, Athena would be using Lambda functions to serve as middleware between the Athena service and the external data source. These functions would be templates that the user would need to configure and deploy in order to secure a connection. Although I argued that we should attempt to conceal the Lambda function step and collect the needed information from the user in Athena, ultimately we were not able to do so for security reasons. Therefore, we had to guide the user on a round trip from Athena to Lambda and back to Athena. It was also important to expose the user to the Lambda connector because they would be creating a resource that they would pay for every time they connected to the data source.

An early wireframe for Athena query federation.

A wireframe guiding the user to connect to data sources in S3 (the existing workflow) or an external data source (the new workflow). At the time, there was little guidance for the user on connecting to their data in S3 from the console. This project was an opportunity to clarify those steps for the user.

Findings in action

When adding the data federation and external metastore features, I uncovered an opportunity to clarify the existing steps to connect to data in S3 with the AWS Glue Data Catalog. I created wizards to guide the user over to Glue to create crawlers and a table schema, since that guidance didn’t exist before.

I collaborated closely with the Athena writer and editor to work on the language used in the interface around data. Data source, data store, data catalog and database can be used interchangeably at times, or have more nuanced differences. Ultimately, we used data source for the higher-level concept and data catalog for the resource we connected with. We also changed our verb from “Add” to “Connect” in order to reinforce the concept that Athena does not move or load data.

Personas

The two primary personas for Athena are the admin and the analyst. The admin works like a reference librarian: sourcing data, assigning permissions, and troubleshooting issues with query slowness or errors. Their primary concerns are managing costs and granting access without oversharing. The analyst works to answer business questions with the selected data.

User journey map

Because this project focused on the needs of the admin user, I created a journey map for their process. I learned that admins often whittle down the amount of data in the pipeline before handing it over to analyst end-users so that they don’t run expensive queries on extraneous data. The admin controls access upstream and downstream, requesting data from teams and carefully scrubbing it of personal identifiers or other sensitive information before sharing.

Design solution

Connect to S3 data sources

This workflow allows the user to connect to an external Hive metastore or the AWS Glue Data Catalog. This flow is designed to help admins understand the various options available for connecting to data in Amazon S3. Schema information can be created implicitly or explicitly, or the user can connect to an existing schema. Previously, the user would have had to read through documentation to understand these options.

Connect to external data sources

In this workflow, the user sees the full variety of external data sources that Athena can connect to using Lambda. The inspiration for this gallery format came from various business intelligence tools I researched early in the project, which present the user with a visual assortment of familiar logos to demonstrate their capabilities. Users responded positively to the variety of sources they could now connect to. We moved away from wireframe, outline-style logos for the AWS products and used color-block logos in order to be consistent with the color logos for open-source and competitor data servers.

re:Invent 2019 launch

To celebrate and promote this new functionality at the re:Invent user conference, I designed t-shirts and stickers for our AWS product and engineering teams to wear and distribute.