Road to re:Invent - Amazon Athena and Quicksight

Amazon Athena lets you treat files in Amazon S3 like they were a database. That doesn’t sound too cool until you realize the possibilities: data stays in place, storage is low cost, data management gets easier. Now add in Amazon Quicksight. This service allows you to quick and easily visualize that data and share your insights. This one-two punch is key to the AWS data pipeline.

Reasonably Accurate 🤖🧠 Transcript

All right. Good morning, everybody. How you doing? Today? I’m back for another episode Another stream of road to reinvent 2019 and we’ve been doing a few of these. I already threw up the link in the comments here on LinkedIn where you can go back and check all of the previous episodes posting their YouTube along with the transcript.

I’m so you can kind of go through on your own pace. I’m in Asia can review them here on LinkedIn if you can manage to find him in my feed, that’s one of the frustrations with LinkedIn great way to connect activity outside of the normal course of the day got any questions while we’re going through this as always fire them up here in a LinkedIn chat.

I am monitoring it. I have the systems were down at this point. I am where I can see them in the life. I was fond of them live on the stream here if you have suggested for future content go for it and the next episode. I will probably be Friday maybe Monday depending on scheduling going to double-check on that.

A little higher level right now. We’ve been diving in the specific Services. We’re going to talk a little bit about strategy how to build how to pick what you’re going to build. I am a next episode have but more on that later. So little bit housekeeping today shockingly enough is already October 2nd October the to October 15th is when reserved seating goes live for AWS reinvent and that means that if you are interested in getting a seat in a particular session, then that’s what you need to be looking for on your calendar response the website here in the background and they don’t actually have it up on the hope you have a do.

Okay. So let me flip over to that right away and show you this as it goes. Let me just make sure I just saw that one window session catalog is already live for reinvent. So if you have a ticket, you can log in and start starring sessions that you’re interested in this check through what One on October 15th.

Are they going to say that on the main page here? And now they’re going to go back to the main page. You’ll see here October 15th. Can’t highlight it cuz image but October 15th is when not I’m reserved seating actually opens up, which means if you want to actually get a seat, if you want to see these sessions live you got to be on the ball on October 15th, 10% 15 to 10:10 to 15% of all the seats in each session are held back for a walk UPS in 10 minutes before session time.

If you as a reserved seat person and not showed up your seat gets released to the Walk of Life. I have the biggest disappointment. Most people have to reinvent is not actually getting the decisions they want so remember October 15th at we are going to be coming back to that several times to make sure that you do not forget because it’s absolutely critical that you get into those seats.

I’m actually giving a couple talks. I’ll give you that information on the website in the coming days, but go to the session catalog now flag at I’m interested in this and it’ll create a little short list that makes it easier to Poke your sessions. So I’m today back to today.

We’re going to go for about 25 minutes or so. We’re going to go around 10:30. We’re going to talk into two of my favorite services and they probably say that way too often on Amazon, Athena and Amazon quicksight at these are sort of lesser-known services and but I think they’re extremely important.

So let’s put up some time in the clock and you see that in the bottom right corner and we’re going to switch back over to the browser. We’re going to go through we’re going to see what’s going on. Alright. So Amazon, Athena. First of all is a query service.

It’s the way that you can treat data in S3. Like it was in a database. Why would you care? Why do we need to put your data in a database? Well as 3 is to be really inexpensive to store large amounts of data and there’s a great life cycle behind S3.

We covered that in the previous stream when we talked about us three and put the thing about Athena is that inquiries all of your data in place. So you don’t have this process where you’re putting data in s When you’re pulling it out to either in ec2 instance or locally or pushing it into another database to do work on it at Fina lets you work on it where it is.

Now that I can have cost savings that definitely far more efficient from an operational perspective from a security and privacy perspective. I like it cuz the data stays where it is, you don’t need to pull the data out. So we’re going to dive into Athena and we’re going to see what happens because a lot of people think is really complicated.

I’m and it can be Do you need to take a sip of my way to hot tea the and I should have known that the Little Bear shows up in the when the Bear shows up driving. It’s too hot. Should I look for that? So with Athena we’re going to dive in and do a very simple basic thing to connect to Quick seitan will taco quicksight and as the second part of this, so I’ve already cheated a little bit.

Sorry, but bear with me I’m basically already uploaded a big file cuz I don’t think you guys would want to watch me upload a file into a story. So we have our bucket that we made previously road to reinvent and we’ve created a folder called you names. We’ve uploaded this 1.4 gigabyte text file 1.4 gig in one text file that is has the least optimized way to do this, but I wanted to demonstrate the power to 1.4 gig of pure uncompressed text is a lot of texts.

Okay, what is in here is actually a significant amount of information about places. So it’s from a site called you names. Org, they freely license this out for for research on my belief potentially for commercial use as well. But we’re doing it for educational research purposes right now in our 1.4 KGB file.

We have a whole bunch of lines at every single line has these properties to its there’s 19 different properties for every single line. We have a name. We’ve alternative names with the longitude and latitude of where I whatever it is. We’re talking about. I’m the features of it. So there’s some some mapping information here and population elevation and the time zone that it’s in and then the date of the records are so say like a hundred fifty thousand people we have that over and over and over and over again, and I’m in fact if I just double-check this I’m just running their ears out.

111 million / 12 million lines in a text file. Okay, just shy of 12 million lines. So obviously that’s a lot of data. And what we’re going to do is we are going to use Fina to query that data. So we’re going to say we have 11 million lines or 12 million lines of city data in this is 3 bucket.

What we are going to do is go over to Athena now because I’ve already used Athena I get jumped right into the default setup, which is basically about databases on the left hand side. And we’ve queries on the right. What we’re going to do is create a new database I believe or create a new table.

Create a table from an S3 bucket. That gives us the new Wizard so we can walk through this. So we’re going to create a new database. We’re going to call this road to reinvent. Geonames, that’s our database name table is places its road to reinvent / Geo names of the interesting thing here is that so we’ve given our database and name just in this case road to reinvent geonames.

We’re going to call this table places because of the info for putting in there is just a bunch of places and what we have now is an S3 folder. So you need a folder because by default Athena will treat everything in that folder as one big database. So we have 1/4 Mile in her folder and that’s okay.

What would be more efficient for both of Fina and for pricing which will cover in a second would be if instead of one big file. We had those files of broken up by at let’s say the first letter so some way that I feel could index them better. But in this case, we’re just going to give it one big giant file.

It’s not an encrypted data set if it was you can actually passing the key and say use this key to decrypt it. Click the blue next button and I forgot lower. We need a little underscores. Not the hyphens believe. Let me double-check some quick little things. Next question is data format.

What’s that text? What are those text files in that bucket look like in this case. We have a bunch of default format. You can see there’s things like a Json documents and parquet is an optimized format for Athena, which is great and it’s used and some other stuff like Hive but we have a TSB which is tab separated values.

You can also see us, separated values in our case. There’s a tab between every fields of name tab next field tab. That’s what we’re going to use tsv. And then homes what are a if we separate each line up by tabs. What do we have? And in this case? I’ve already pre-populated this to make it easier.

I’m going to say bulk, and I’m going to just Pace this in but you can see we have the name of the the name of the column in this case the Jew Name ID and then the type it’s an integer next columns a place name. Little Bear still driving, so this is hot The next is a place name in that the string so very simple you can do this by hand as well.

But because we have 19 field. I just pre-populated it. I’m going to click add and you’ll see it populated into Athena’s very simple call Amanda type. We wanted to change it. We could select a different type here and what we’ve already knocked us out. So we scroll down to the bottom and now we’re going to click the blue next button again.

Now, I know this is kind of an in-depth is got a little more technical than we’ve gone. And I think that’s okay. I think this is just, you know, this introduces some of the the power of Athena some of the the interesting facts for the next part is going to be what we’re we’re setting up to which is a quick side part sew week with the blue crate table.

And you see this is Chase takes a second. It actually gives us the query that it has created to build a stable. So if you want to save this weekend if we want to rebuild this in any time, but we’re just going to click on the new On the new query button here on the new query tab, you can see on the left hand side under tables.

We have our places table with all of the things in here. Now, if we want to preview this we click on the three. What’s the click preview table that this is going to give us a default query, which is Select * from our database places table. Give me the 10 records limit of 10.

If you scroll down here, you’ll see that it’s already pulled these out. I’m going to go out on a limb and say these are probably in Iceland or Greenland there somewhere in the Nordic region based on the language. But if we want to what we can do is write our own query so we can say we’re place name.

like New York It’s what I’ve done here is we say select star or select everything from our database name our table name where the place names of the name of the place is like New York. Now the little modulus little percentage sign means New York just needs to be in there somewhere.

It can be there could be stuff before it. There can be stuffed after write such a great way to search through of a fuzzy search for text and if we click run query going to see at the bottom. It’s running the query takes took about a second to go over those 12 million lines, which is pretty amazing and then we have all over these things.

So we have placed name New York. We have all of these items in New York. So see there’s a lot of hotels in the state of Base, New York, New York, New York Palace Fairfield it. So there’s all these a different places with in New York, but we’re getting a ton of information here.

So what we’re actually going to do is we’re going to get a little pickier. So we’re going to say give me place name. Give me latitude longitude. elevation population in time zone so that if we click the blue run query button again is going to give us just a little better filter, right? So here we’ve gotten rid of a bunch of the junk is standard ask you to write.

This is standard database query and that’s really the key takeaway from Athena’s it lets you treat text files or Json documents are parking photos. All these files that are sitting in S3. You can treat them just like a normal database. Okay. Now you guys might be faking. Okay. That’s way too technical.

Why are we doing this? Does this really matter? What is one last step in Athena? And then we’re going to get you to the really cool stuff. So if I click create I can actually create a new table or a new view from this query. So why don’t you do is create a new table.

I’m going to keep it into my current database. guess a New York places places in New York and then my output bucket is going to be the same. Okay, so my output location and then it tells you which format is going to put them out. If you’re going to just leave it in Athena stick with the default app RK cuz that’s going to be the least expensive now.

I would say it’s going to write this in we say create and now it’s actually written another database for us another table for us and more. It’s still crunching on that and watch it finishes that will should see another table over here before we were very expensive if we look at Athena pricing and you’re going to see that the price is actually run per query.

So it’s $5 in u.s. East per terabyte of data scan. So every time works running if we’re right now, we’re scanning 1.5 gig of data so we can only run Just shy of what we can run 900 or so queries before we’re charge $5. That’s because we just dumped all of our data into one big text file if we have broken our data up or put it in a better format.

Then we wouldn’t be going across all of it all the time and we would be able to reduce that query cost down significantly. So in this case are adding failed because the place already exists because I was running tests previously. Sorry, but you get the point is that you can say about the results of queries as smaller tables and then that makes it cheaper to query things repeatedly.

So we have our big data set then we we know we want to do a lot of work with the stuff in New York that we can reduce the cost by outputting just the New York results as a whole new table then we only query that because now we’re doing way less information we go for my 1.4 gig down to a couple hundred kilobytes and that would give us way cheaper queries as we move forward.

Is far more interesting because quicksight is all about visualizing your data. It is a data visualization service. It is interesting enough. It was the very first date of your service to use its own top-level. Domain. AWS quicksight. Adriana to get pushed into that since you have a bunch of data use Quick site to look at it.

You make interactive dashboards email reports. You can do analytics. So we’re going to do that right now. We’re going to switch over to Kwik site as if you go to Services we go to Kwik site is going to pull this is the first time this account has ever been using quick site.

So it’s going to ask us to sign up for it, which is just enabling it on this. On this side Temple and you can see your own this account. You can see there’s actually charges for Kwik site, right? So I’m very clearly if we start this at UniFirst author with a gig of data is free team.

Trial is 60 days. I’m a knee after that you’re paying per additional author. So be aware of the cost of for one author with one gig of data. It’s actually free. So we’re just going to click continue. We’re going to give ourselves a unique name. We’re going to give ourselves our email notification address that we are going to automatically discover some data.

We are going to let quick sites access RS3 bucket. So this you’ll see here is the data sources for quick side cuz it needs to pull data to visualize from somewhere. We’re going to give it access to our S3 bucket that we are using for this asshole seminar for all these things and we just click finish and it’s going to create our account in the background.

It’s a setting up all the permissions that we need and you notice I didn’t say give all access test 3. I just said that one bucket, which is a really really good practice. You want to restrict it to Brian’s got a great question here asking whether quicksight is similar to power bi it is but not as mature.

So quit site is very much about visualizing data and making some interactive dashboards and there is some really cool powerful stuff here and depending on what you’re using a power bi for in Microsoft. You have you may find what you need here. I think this is far more powerful and flexible, but you got to do a lot more work to get stuff out of it and what you’ll see as we go here by and keep up the question if you have a specific one.

I’m around power bi but I think I would say just maturity-wise quick sites not there yet and but it definitely has the potential. So we got this little welcome dialogue cuz it’s the first time we’ve ever been here. Hey it lets us quickly visualize analyze my data connect to your data super fast performance visualizing analyze right share insights with anyone and that’s really the key here so we can set up a 10 sites and then send them along.

So we have some default analyses if we’d like to and let’s just clicking that for a second. This is an analysis of website and social media its pre-made this is just a sample database and you can see here that we’ve got website visits graft out overtime with website page views mailing list subscription data, so somebody’s going through and actually made this dashboard of a free vs.

Paid signups Twitter mentions and follows pretty straightforward standard analytical stuff. Right? So you can experiment around and you can play around with this and see what you want to do in our case. What we’re actually going to do is we are going to go back to the main and we’re going to start a new analysis alysis.

We are going to create a new dataset and then just increase the size of this a little bit so you can upload files directly. We can connect to a number of services including at Salesforce at 3 raw Athena. So that means we can take our data from S3. We could carve it up better in Athena and Process it there.

I’m so what we’re actually going to do is work going to connect to Athena because we’ve already told Athena how our data is stored. So we can also go to Twitter with a new Twitter analytics, which is pretty cool. I have organized say Athena the data source name is going to be geonames.

Play create a new data source. So it’s loading my table and now I need to connect to my Athena thing. You notice is the exact same road to reinvent geonames is what we made in Afena is loading that up. It says there’s one table which is places. That’s the one we want to click select.

So if we have more than 25 million rows or over 25 gig this could fail So if it fails we’re going to see what’s what’s going on so we can import it or we could directly query in our case. We know we’re over the limit. So we’re not going to import it directly into quicksight, even though that would be faster.

We’re going to directly query the data which is a little more expensive but it’s going to be more useful for us. So one of the things we have is in our day that we have longitude and latitude which gives us the ability to do geospatial analysis. Which means where is it on a map is may or may not work.

We’ll see but hey, look it worked. That’s always good. You never know with this stuff when you’re doing it live weather is going to work or not. And this is a little slower because we are doing it. Out of we’re not doing it in Athena itself. So it’s maximized that there we go.

So we because we’re pulling right for my Athena. We’re not pulling it from website where you can see every one of these dots is actually a place in our data sets of rental. We had New York and there’s a bunch of items in New York. This is where it has mapped all these things out so you can see each of his.

The stock has four places in their right. Quickly. All we did was point in our data and said hey, by the way, there are there is geospatial data in here, right so we can give us eyes as well which we have population which gives us our size. So now we have not only are we mapping it on the on the On the map but we’ve also giving a size of.

So if we scroll this a little bit UCR. Sad bunch of them changed now, it’s really tricky to kind of see let me zoom in here. So let’s get there we go. That’s getting better. So you can see that the dots are changing transparency based on how big they are right so we can see now we’ve got population.

So we have New York tag with population. We’ve got Newark tag with his population and that’s how we very simply are doing a geospatial analysis right to all we did was we dragged from this list of our data and we put it into the right Fields very simply done.

Right? So if we rename this, this is a map that’s really handy. Let’s create a second analysis. So let’s do something like let’s do a scatter plot. So quick a scatter plot, and now we have x-axis and y-axis. So let’s go by population to elevation. And then we’re going to group by country code.

So we’ll see how big things are based on population on the x-axis and elevation on the y-axis and then we’ll see the country. So we see way up here. The US has the US have the number of u.s. Items of the population is obviously a sum of all the population figures.

So it’s not the actual country population. I’m in a some of the elevation. So there’s far more entries in the US rather than somewhere like empty empty has a bunch of stuff with no elevation tag needs the India here has low elevation but a bunch of entries included. So the interesting thing here is that we’ve spotted that our data is not rolling up the way we would expect right? What we want to see is instead of a total of everything grouped by country code.

We want to see just the spread of population versus elevation for items. So because we were grouping it, we were actually changing the date. Rolling up everything in that country code, which is not accurate. There’s not 26 billion people in the United States, but here we now start to get a little bit more if we group this.

Sorry. Let me add one more thing we Group by asking name. So by name What we should see is a little more readable data. Right. So now we’re still graphing by population and elevation is a population along the X elevation along the Y and in this case were grouping it together.

So now we’ve got a different distribution. So if we look here we got 40 million people in this particular area, which looks to be let’s see if that’s going to load up a second. So we get these different ways of playing around with our data right in the case of are there we go in the case of our particular data set.

That is not fast at all in the case of her particular. They decided to Simply. It’s a grouping of places with some data about those places. So in that case, what makes the most sense is that map that first map we did but play the scatterplot. Maybe we find something interesting a cute little fact that we can use for something else, but it’s very simple to and did this just kill everything.

This might have killed it everything. That’s a first down the stream. I’ve never had Chrome lock up on me and then again chromes not my normal browser, but So the point is qwiksite makes it really really simple to visualize our data no matter where it is from. There we go.

All right, it’s responding again where it’s from in this case. What we did was we took a bunch of data that was sitting in one big honking file in S3. We created in Athena query to make that a little more understandable and then we pulled that Athena query into Amazon quicksight and with two or three clicks, we made a map that mapped out all of our data.

Here we go again, right if we look at this so we maximize it right we could look at all of our data and we can see very clearly, you know, sort of how things break up our clocks these places if you wanted to dive into to the nail salons and see what’s going on at what places are in Amsterdam, we’ve got that right.

I’m so very very simply put you can still walk to the sample stuff is a ton of different chart types of the ton of different ways to connect data into a quick site in the back. Here’s what’s really really interesting. This date is still just sitting in that one file in S3.

We have moved it, right? We did move it at all. We sit in at 3 week from Athena we visualizing quick. So we did another 20 minutes or under 30 minutes, right? I’m very simply made some mistakes along the way we’re going to keep making some mistakes and but I think this is just kind of gives you some of the idea of what is possible with these tools in the beauty is that you know, you’re Athena queries you need to make sure that you’re understand how it queries to keep the price down so I can we start with a big data set we could chuck that up with other queries a thing like you just give us the New York stuff.

Give us the US. Whatever the case maybe save those queries back into S3 is optimized files which will reduce the cost because now we’ve gone from 1.5 gig down to like a couple hundred kilobytes. So are Costco’s way lower because remember it’s $5 for every terabyte you query with Athena that cannot very very quickly.

So cut up at data and then we pulled it in. Quicksight to start looking at what’s going on at Kwik site has a pretty generous free tier and then it’s $9 a month or $12 a month for additional authors. And then you pay per gig as well and it can start to add up but it’s also extremely powerful now think about pulling in sales date instead of just like an open data set that we did you’re pulling in at your sales data or log data.

If you go back to these samples, right? So we go people overview analysis. This is just a breakdown of employees by region hiring by region. You can very easily create these kind of visualizations and then click the share button publish this out or share it. You can also capture this right into a storyboard which is you can download or you can just view a full-scale like this very very simple to do very very easy to do and if you go to the sharing options you have the ability was just click back in here.

You have the ability to publish this dashboard and then give people the ability to You play around with it or not, depending on what you want and you can send a whole bunch of permissions around that so you don’t need to make it public to share it with in your team.

So that was a quick glance at 2. Very very powerful AWS Services. The key here is that they leave the data where it sits in S3. You need to be aware that pricing like I’ve said repeatedly cuz it is really important and I want to get hit by a big Bill unexpectedly, but from a privacy and data perspective your data only ever stays in that one bucket or in those areas in S3 where you have control over it and then you make sense of it with Athena and with quicksight at and you can share it out.

I also the interesting thing with Tina is that it then is a standard database output is a standard database connectivity, which means if you don’t want you to excel you can go to silence a new data source, it’s a relational database it’s here and give it to the details for thin as long as you set up a promotions correctly and all of a sudden now from Excel your query in your day today.

Meaning of three that’s really really powerful stuff. It’s also the beginning of the AWS data pipeline. You’re going to see that come up at reinvent. The last two years burner has spent a good chunk of his keynote talking about the AWS data lifecycle in the data pipeline. You probably heard the term data links that all starts with S 3 at Fina and quicksight are extremely powerful tools on top of that.

Okay, and hopefully this is giving you a bit of a taste. Thank you very much for your comments or appreciate the feedback. Thank you Brian for the feedback in the great question and I will continue to monitor This Thread as always to see if there is additional questions. I’m publishing this video over on YouTube and then again on my website, so there’s an additional questions or suggestions and let me know we’re looking for future topics as well.

We’ve got 60 days before being bad for the next to Big major timeline, October 15th. Make sure you are on board for October 15th so that you can get your seat in the sessions that you want the way to start doing that is log into the session catalog right now and start fly.

The ones you’re interested in as I’m interested the little star that creates a quick list over on the side. I’ll be giving a couple talks at the show up with that info up on my website shortly. But thank you for joining me today. I really appreciate it. I hope you are set up.

Hope you learned a little bit here and I hope you are set up for a fantastic. Wednesday will talk to you soon. As I launch my pennant you not aggressively hopefully. All right. Good morning, everybody. How you doing? Today? I’m back for another episode Another stream of road to reinvent 2019 and we’ve been doing a few of these.

I already threw up the link in the comments here on LinkedIn where you can go back and check all of the previous episodes posting their YouTube along with the transcript. I’m so you can kind of go through on your own pace. I’m in Asia can review them here on LinkedIn if you can manage to find him in my feed, that’s one of the frustrations with LinkedIn great way to connect activity outside of the normal course of the day got any questions while we’re going through this as always fire them up here in a LinkedIn chat.

Hope you learned a little bit here and I hope you are set up for a fantastic. Wednesday will talk to you soon. As I launch my pennant you not aggressively hopefully.

Road to re:Invent - Amazon Athena and Quicksight

Reasonably Accurate 🤖🧠 Transcript

Read next

ChatGPT Delivers Ideas and Answers on Demand, If You Know How To Ask

Road to re:Invent - Amazon Athena and Quicksight

Reasonably Accurate 🤖🧠 Transcript

Read next

ChatGPT Delivers Ideas and Answers on Demand, If You Know How To Ask

AWS re:Invent 2022 Attendee Guide: Security

Accelerating innovation at AWS Security