Amazon Athena lets you treat files in Amazon S3 like they were a database. That doesn鈥檛 sound too cool until you realize the possibilities: data stays in place, storage is low cost, data management gets easier. Now add in Amazon Quicksight. This service allows you to quick and easily visualize that data and share your insights. This one-two punch is key to the AWS data pipeline.
Reasonably Accurate 馃馃 Transcript
All right. Good morning, everybody. How you doing? Today? I鈥檓 back for another episode Another stream of road to reinvent 2019 and we鈥檝e been doing a few of these. I already threw up the link in the comments here on LinkedIn where you can go back and check all of the previous episodes posting their YouTube along with the transcript.
I鈥檓 so you can kind of go through on your own pace. I鈥檓 in Asia can review them here on LinkedIn if you can manage to find him in my feed, that鈥檚 one of the frustrations with LinkedIn great way to connect activity outside of the normal course of the day got any questions while we鈥檙e going through this as always fire them up here in a LinkedIn chat.
I am monitoring it. I have the systems were down at this point. I am where I can see them in the life. I was fond of them live on the stream here if you have suggested for future content go for it and the next episode. I will probably be Friday maybe Monday depending on scheduling going to double-check on that.
A little higher level right now. We鈥檝e been diving in the specific Services. We鈥檙e going to talk a little bit about strategy how to build how to pick what you鈥檙e going to build. I am a next episode have but more on that later. So little bit housekeeping today shockingly enough is already October 2nd October the to October 15th is when reserved seating goes live for AWS reinvent and that means that if you are interested in getting a seat in a particular session, then that鈥檚 what you need to be looking for on your calendar response the website here in the background and they don鈥檛 actually have it up on the hope you have a do.
Okay. So let me flip over to that right away and show you this as it goes. Let me just make sure I just saw that one window session catalog is already live for reinvent. So if you have a ticket, you can log in and start starring sessions that you鈥檙e interested in this check through what One on October 15th.
Are they going to say that on the main page here? And now they鈥檙e going to go back to the main page. You鈥檒l see here October 15th. Can鈥檛 highlight it cuz image but October 15th is when not I鈥檓 reserved seating actually opens up, which means if you want to actually get a seat, if you want to see these sessions live you got to be on the ball on October 15th, 10% 15 to 10:10 to 15% of all the seats in each session are held back for a walk UPS in 10 minutes before session time.
If you as a reserved seat person and not showed up your seat gets released to the Walk of Life. I have the biggest disappointment. Most people have to reinvent is not actually getting the decisions they want so remember October 15th at we are going to be coming back to that several times to make sure that you do not forget because it鈥檚 absolutely critical that you get into those seats.
I鈥檓 actually giving a couple talks. I鈥檒l give you that information on the website in the coming days, but go to the session catalog now flag at I鈥檓 interested in this and it鈥檒l create a little short list that makes it easier to Poke your sessions. So I鈥檓 today back to today.
We鈥檙e going to go for about 25 minutes or so. We鈥檙e going to go around 10:30. We鈥檙e going to talk into two of my favorite services and they probably say that way too often on Amazon, Athena and Amazon quicksight at these are sort of lesser-known services and but I think they鈥檙e extremely important.
So let鈥檚 put up some time in the clock and you see that in the bottom right corner and we鈥檙e going to switch back over to the browser. We鈥檙e going to go through we鈥檙e going to see what鈥檚 going on. Alright. So Amazon, Athena. First of all is a query service.
It鈥檚 the way that you can treat data in S3. Like it was in a database. Why would you care? Why do we need to put your data in a database? Well as 3 is to be really inexpensive to store large amounts of data and there鈥檚 a great life cycle behind S3.
We covered that in the previous stream when we talked about us three and put the thing about Athena is that inquiries all of your data in place. So you don鈥檛 have this process where you鈥檙e putting data in s When you鈥檙e pulling it out to either in ec2 instance or locally or pushing it into another database to do work on it at Fina lets you work on it where it is.
Now that I can have cost savings that definitely far more efficient from an operational perspective from a security and privacy perspective. I like it cuz the data stays where it is, you don鈥檛 need to pull the data out. So we鈥檙e going to dive into Athena and we鈥檙e going to see what happens because a lot of people think is really complicated.
I鈥檓 and it can be Do you need to take a sip of my way to hot tea the and I should have known that the Little Bear shows up in the when the Bear shows up driving. It鈥檚 too hot. Should I look for that? So with Athena we鈥檙e going to dive in and do a very simple basic thing to connect to Quick seitan will taco quicksight and as the second part of this, so I鈥檝e already cheated a little bit.
Sorry, but bear with me I鈥檓 basically already uploaded a big file cuz I don鈥檛 think you guys would want to watch me upload a file into a story. So we have our bucket that we made previously road to reinvent and we鈥檝e created a folder called you names. We鈥檝e uploaded this 1.4 gigabyte text file 1.4 gig in one text file that is has the least optimized way to do this, but I wanted to demonstrate the power to 1.4 gig of pure uncompressed text is a lot of texts.
Okay, what is in here is actually a significant amount of information about places. So it鈥檚 from a site called you names. Org, they freely license this out for for research on my belief potentially for commercial use as well. But we鈥檙e doing it for educational research purposes right now in our 1.4 KGB file.
We have a whole bunch of lines at every single line has these properties to its there鈥檚 19 different properties for every single line. We have a name. We鈥檝e alternative names with the longitude and latitude of where I whatever it is. We鈥檙e talking about. I鈥檓 the features of it. So there鈥檚 some some mapping information here and population elevation and the time zone that it鈥檚 in and then the date of the records are so say like a hundred fifty thousand people we have that over and over and over and over again, and I鈥檓 in fact if I just double-check this I鈥檓 just running their ears out.
111 million / 12 million lines in a text file. Okay, just shy of 12 million lines. So obviously that鈥檚 a lot of data. And what we鈥檙e going to do is we are going to use Fina to query that data. So we鈥檙e going to say we have 11 million lines or 12 million lines of city data in this is 3 bucket.
What we are going to do is go over to Athena now because I鈥檝e already used Athena I get jumped right into the default setup, which is basically about databases on the left hand side. And we鈥檝e queries on the right. What we鈥檙e going to do is create a new database I believe or create a new table.
Create a table from an S3 bucket. That gives us the new Wizard so we can walk through this. So we鈥檙e going to create a new database. We鈥檙e going to call this road to reinvent. Geonames, that鈥檚 our database name table is places its road to reinvent / Geo names of the interesting thing here is that so we鈥檝e given our database and name just in this case road to reinvent geonames.
We鈥檙e going to call this table places because of the info for putting in there is just a bunch of places and what we have now is an S3 folder. So you need a folder because by default Athena will treat everything in that folder as one big database. So we have 1/4 Mile in her folder and that鈥檚 okay.
What would be more efficient for both of Fina and for pricing which will cover in a second would be if instead of one big file. We had those files of broken up by at let鈥檚 say the first letter so some way that I feel could index them better. But in this case, we鈥檙e just going to give it one big giant file.
It鈥檚 not an encrypted data set if it was you can actually passing the key and say use this key to decrypt it. Click the blue next button and I forgot lower. We need a little underscores. Not the hyphens believe. Let me double-check some quick little things. Next question is data format.
What鈥檚 that text? What are those text files in that bucket look like in this case. We have a bunch of default format. You can see there鈥檚 things like a Json documents and parquet is an optimized format for Athena, which is great and it鈥檚 used and some other stuff like Hive but we have a TSB which is tab separated values.
You can also see us, separated values in our case. There鈥檚 a tab between every fields of name tab next field tab. That鈥檚 what we鈥檙e going to use tsv. And then homes what are a if we separate each line up by tabs. What do we have? And in this case? I鈥檝e already pre-populated this to make it easier.
I鈥檓 going to say bulk, and I鈥檓 going to just Pace this in but you can see we have the name of the the name of the column in this case the Jew Name ID and then the type it鈥檚 an integer next columns a place name. Little Bear still driving, so this is hot The next is a place name in that the string so very simple you can do this by hand as well.
But because we have 19 field. I just pre-populated it. I鈥檓 going to click add and you鈥檒l see it populated into Athena鈥檚 very simple call Amanda type. We wanted to change it. We could select a different type here and what we鈥檝e already knocked us out. So we scroll down to the bottom and now we鈥檙e going to click the blue next button again.
Now, I know this is kind of an in-depth is got a little more technical than we鈥檝e gone. And I think that鈥檚 okay. I think this is just, you know, this introduces some of the the power of Athena some of the the interesting facts for the next part is going to be what we鈥檙e we鈥檙e setting up to which is a quick side part sew week with the blue crate table.
And you see this is Chase takes a second. It actually gives us the query that it has created to build a stable. So if you want to save this weekend if we want to rebuild this in any time, but we鈥檙e just going to click on the new On the new query button here on the new query tab, you can see on the left hand side under tables.
We have our places table with all of the things in here. Now, if we want to preview this we click on the three. What鈥檚 the click preview table that this is going to give us a default query, which is Select * from our database places table. Give me the 10 records limit of 10.
If you scroll down here, you鈥檒l see that it鈥檚 already pulled these out. I鈥檓 going to go out on a limb and say these are probably in Iceland or Greenland there somewhere in the Nordic region based on the language. But if we want to what we can do is write our own query so we can say we鈥檙e place name.
like New York It鈥檚 what I鈥檝e done here is we say select star or select everything from our database name our table name where the place names of the name of the place is like New York. Now the little modulus little percentage sign means New York just needs to be in there somewhere.
It can be there could be stuff before it. There can be stuffed after write such a great way to search through of a fuzzy search for text and if we click run query going to see at the bottom. It鈥檚 running the query takes took about a second to go over those 12 million lines, which is pretty amazing and then we have all over these things.
So we have placed name New York. We have all of these items in New York. So see there鈥檚 a lot of hotels in the state of Base, New York, New York, New York Palace Fairfield it. So there鈥檚 all these a different places with in New York, but we鈥檙e getting a ton of information here.
So what we鈥檙e actually going to do is we鈥檙e going to get a little pickier. So we鈥檙e going to say give me place name. Give me latitude longitude. elevation population in time zone so that if we click the blue run query button again is going to give us just a little better filter, right? So here we鈥檝e gotten rid of a bunch of the junk is standard ask you to write.
This is standard database query and that鈥檚 really the key takeaway from Athena鈥檚 it lets you treat text files or Json documents are parking photos. All these files that are sitting in S3. You can treat them just like a normal database. Okay. Now you guys might be faking. Okay. That鈥檚 way too technical.
Why are we doing this? Does this really matter? What is one last step in Athena? And then we鈥檙e going to get you to the really cool stuff. So if I click create I can actually create a new table or a new view from this query. So why don鈥檛 you do is create a new table.
I鈥檓 going to keep it into my current database. guess a New York places places in New York and then my output bucket is going to be the same. Okay, so my output location and then it tells you which format is going to put them out. If you鈥檙e going to just leave it in Athena stick with the default app RK cuz that鈥檚 going to be the least expensive now.
I would say it鈥檚 going to write this in we say create and now it鈥檚 actually written another database for us another table for us and more. It鈥檚 still crunching on that and watch it finishes that will should see another table over here before we were very expensive if we look at Athena pricing and you鈥檙e going to see that the price is actually run per query.
So it鈥檚 $5 in u.s. East per terabyte of data scan. So every time works running if we鈥檙e right now, we鈥檙e scanning 1.5 gig of data so we can only run Just shy of what we can run 900 or so queries before we鈥檙e charge $5. That鈥檚 because we just dumped all of our data into one big text file if we have broken our data up or put it in a better format.
Then we wouldn鈥檛 be going across all of it all the time and we would be able to reduce that query cost down significantly. So in this case are adding failed because the place already exists because I was running tests previously. Sorry, but you get the point is that you can say about the results of queries as smaller tables and then that makes it cheaper to query things repeatedly.
So we have our big data set then we we know we want to do a lot of work with the stuff in New York that we can reduce the cost by outputting just the New York results as a whole new table then we only query that because now we鈥檙e doing way less information we go for my 1.4 gig down to a couple hundred kilobytes and that would give us way cheaper queries as we move forward.
Is far more interesting because quicksight is all about visualizing your data. It is a data visualization service. It is interesting enough. It was the very first date of your service to use its own top-level. Domain. AWS quicksight. Adriana to get pushed into that since you have a bunch of data use Quick site to look at it.
You make interactive dashboards email reports. You can do analytics. So we鈥檙e going to do that right now. We鈥檙e going to switch over to Kwik site as if you go to Services we go to Kwik site is going to pull this is the first time this account has ever been using quick site.
So it鈥檚 going to ask us to sign up for it, which is just enabling it on this. On this side Temple and you can see your own this account. You can see there鈥檚 actually charges for Kwik site, right? So I鈥檓 very clearly if we start this at UniFirst author with a gig of data is free team.
Trial is 60 days. I鈥檓 a knee after that you鈥檙e paying per additional author. So be aware of the cost of for one author with one gig of data. It鈥檚 actually free. So we鈥檙e just going to click continue. We鈥檙e going to give ourselves a unique name. We鈥檙e going to give ourselves our email notification address that we are going to automatically discover some data.
We are going to let quick sites access RS3 bucket. So this you鈥檒l see here is the data sources for quick side cuz it needs to pull data to visualize from somewhere. We鈥檙e going to give it access to our S3 bucket that we are using for this asshole seminar for all these things and we just click finish and it鈥檚 going to create our account in the background.
It鈥檚 a setting up all the permissions that we need and you notice I didn鈥檛 say give all access test 3. I just said that one bucket, which is a really really good practice. You want to restrict it to Brian鈥檚 got a great question here asking whether quicksight is similar to power bi it is but not as mature.
So quit site is very much about visualizing data and making some interactive dashboards and there is some really cool powerful stuff here and depending on what you鈥檙e using a power bi for in Microsoft. You have you may find what you need here. I think this is far more powerful and flexible, but you got to do a lot more work to get stuff out of it and what you鈥檒l see as we go here by and keep up the question if you have a specific one.
I鈥檓 around power bi but I think I would say just maturity-wise quick sites not there yet and but it definitely has the potential. So we got this little welcome dialogue cuz it鈥檚 the first time we鈥檝e ever been here. Hey it lets us quickly visualize analyze my data connect to your data super fast performance visualizing analyze right share insights with anyone and that鈥檚 really the key here so we can set up a 10 sites and then send them along.
So we have some default analyses if we鈥檇 like to and let鈥檚 just clicking that for a second. This is an analysis of website and social media its pre-made this is just a sample database and you can see here that we鈥檝e got website visits graft out overtime with website page views mailing list subscription data, so somebody鈥檚 going through and actually made this dashboard of a free vs.
Paid signups Twitter mentions and follows pretty straightforward standard analytical stuff. Right? So you can experiment around and you can play around with this and see what you want to do in our case. What we鈥檙e actually going to do is we are going to go back to the main and we鈥檙e going to start a new analysis alysis.
We are going to create a new dataset and then just increase the size of this a little bit so you can upload files directly. We can connect to a number of services including at Salesforce at 3 raw Athena. So that means we can take our data from S3. We could carve it up better in Athena and Process it there.
I鈥檓 so what we鈥檙e actually going to do is work going to connect to Athena because we鈥檝e already told Athena how our data is stored. So we can also go to Twitter with a new Twitter analytics, which is pretty cool. I have organized say Athena the data source name is going to be geonames.
Play create a new data source. So it鈥檚 loading my table and now I need to connect to my Athena thing. You notice is the exact same road to reinvent geonames is what we made in Afena is loading that up. It says there鈥檚 one table which is places. That鈥檚 the one we want to click select.
So if we have more than 25 million rows or over 25 gig this could fail So if it fails we鈥檙e going to see what鈥檚 what鈥檚 going on so we can import it or we could directly query in our case. We know we鈥檙e over the limit. So we鈥檙e not going to import it directly into quicksight, even though that would be faster.
We鈥檙e going to directly query the data which is a little more expensive but it鈥檚 going to be more useful for us. So one of the things we have is in our day that we have longitude and latitude which gives us the ability to do geospatial analysis. Which means where is it on a map is may or may not work.
We鈥檒l see but hey, look it worked. That鈥檚 always good. You never know with this stuff when you鈥檙e doing it live weather is going to work or not. And this is a little slower because we are doing it. Out of we鈥檙e not doing it in Athena itself. So it鈥檚 maximized that there we go.
So we because we鈥檙e pulling right for my Athena. We鈥檙e not pulling it from website where you can see every one of these dots is actually a place in our data sets of rental. We had New York and there鈥檚 a bunch of items in New York. This is where it has mapped all these things out so you can see each of his.
The stock has four places in their right. Quickly. All we did was point in our data and said hey, by the way, there are there is geospatial data in here, right so we can give us eyes as well which we have population which gives us our size. So now we have not only are we mapping it on the on the On the map but we鈥檝e also giving a size of.
So if we scroll this a little bit UCR. Sad bunch of them changed now, it鈥檚 really tricky to kind of see let me zoom in here. So let鈥檚 get there we go. That鈥檚 getting better. So you can see that the dots are changing transparency based on how big they are right so we can see now we鈥檝e got population.
So we have New York tag with population. We鈥檝e got Newark tag with his population and that鈥檚 how we very simply are doing a geospatial analysis right to all we did was we dragged from this list of our data and we put it into the right Fields very simply done.
Right? So if we rename this, this is a map that鈥檚 really handy. Let鈥檚 create a second analysis. So let鈥檚 do something like let鈥檚 do a scatter plot. So quick a scatter plot, and now we have x-axis and y-axis. So let鈥檚 go by population to elevation. And then we鈥檙e going to group by country code.
So we鈥檒l see how big things are based on population on the x-axis and elevation on the y-axis and then we鈥檒l see the country. So we see way up here. The US has the US have the number of u.s. Items of the population is obviously a sum of all the population figures.
So it鈥檚 not the actual country population. I鈥檓 in a some of the elevation. So there鈥檚 far more entries in the US rather than somewhere like empty empty has a bunch of stuff with no elevation tag needs the India here has low elevation but a bunch of entries included. So the interesting thing here is that we鈥檝e spotted that our data is not rolling up the way we would expect right? What we want to see is instead of a total of everything grouped by country code.
We want to see just the spread of population versus elevation for items. So because we were grouping it, we were actually changing the date. Rolling up everything in that country code, which is not accurate. There鈥檚 not 26 billion people in the United States, but here we now start to get a little bit more if we group this.
Sorry. Let me add one more thing we Group by asking name. So by name What we should see is a little more readable data. Right. So now we鈥檙e still graphing by population and elevation is a population along the X elevation along the Y and in this case were grouping it together.
So now we鈥檝e got a different distribution. So if we look here we got 40 million people in this particular area, which looks to be let鈥檚 see if that鈥檚 going to load up a second. So we get these different ways of playing around with our data right in the case of are there we go in the case of our particular data set.
That is not fast at all in the case of her particular. They decided to Simply. It鈥檚 a grouping of places with some data about those places. So in that case, what makes the most sense is that map that first map we did but play the scatterplot. Maybe we find something interesting a cute little fact that we can use for something else, but it鈥檚 very simple to and did this just kill everything.
This might have killed it everything. That鈥檚 a first down the stream. I鈥檝e never had Chrome lock up on me and then again chromes not my normal browser, but So the point is qwiksite makes it really really simple to visualize our data no matter where it is from. There we go.
All right, it鈥檚 responding again where it鈥檚 from in this case. What we did was we took a bunch of data that was sitting in one big honking file in S3. We created in Athena query to make that a little more understandable and then we pulled that Athena query into Amazon quicksight and with two or three clicks, we made a map that mapped out all of our data.
Here we go again, right if we look at this so we maximize it right we could look at all of our data and we can see very clearly, you know, sort of how things break up our clocks these places if you wanted to dive into to the nail salons and see what鈥檚 going on at what places are in Amsterdam, we鈥檝e got that right.
I鈥檓 so very very simply put you can still walk to the sample stuff is a ton of different chart types of the ton of different ways to connect data into a quick site in the back. Here鈥檚 what鈥檚 really really interesting. This date is still just sitting in that one file in S3.
We have moved it, right? We did move it at all. We sit in at 3 week from Athena we visualizing quick. So we did another 20 minutes or under 30 minutes, right? I鈥檓 very simply made some mistakes along the way we鈥檙e going to keep making some mistakes and but I think this is just kind of gives you some of the idea of what is possible with these tools in the beauty is that you know, you鈥檙e Athena queries you need to make sure that you鈥檙e understand how it queries to keep the price down so I can we start with a big data set we could chuck that up with other queries a thing like you just give us the New York stuff.
Give us the US. Whatever the case maybe save those queries back into S3 is optimized files which will reduce the cost because now we鈥檝e gone from 1.5 gig down to like a couple hundred kilobytes. So are Costco鈥檚 way lower because remember it鈥檚 $5 for every terabyte you query with Athena that cannot very very quickly.
So cut up at data and then we pulled it in. Quicksight to start looking at what鈥檚 going on at Kwik site has a pretty generous free tier and then it鈥檚 $9 a month or $12 a month for additional authors. And then you pay per gig as well and it can start to add up but it鈥檚 also extremely powerful now think about pulling in sales date instead of just like an open data set that we did you鈥檙e pulling in at your sales data or log data.
If you go back to these samples, right? So we go people overview analysis. This is just a breakdown of employees by region hiring by region. You can very easily create these kind of visualizations and then click the share button publish this out or share it. You can also capture this right into a storyboard which is you can download or you can just view a full-scale like this very very simple to do very very easy to do and if you go to the sharing options you have the ability was just click back in here.
You have the ability to publish this dashboard and then give people the ability to You play around with it or not, depending on what you want and you can send a whole bunch of permissions around that so you don鈥檛 need to make it public to share it with in your team.
So that was a quick glance at 2. Very very powerful AWS Services. The key here is that they leave the data where it sits in S3. You need to be aware that pricing like I鈥檝e said repeatedly cuz it is really important and I want to get hit by a big Bill unexpectedly, but from a privacy and data perspective your data only ever stays in that one bucket or in those areas in S3 where you have control over it and then you make sense of it with Athena and with quicksight at and you can share it out.
I also the interesting thing with Tina is that it then is a standard database output is a standard database connectivity, which means if you don鈥檛 want you to excel you can go to silence a new data source, it鈥檚 a relational database it鈥檚 here and give it to the details for thin as long as you set up a promotions correctly and all of a sudden now from Excel your query in your day today.
Meaning of three that鈥檚 really really powerful stuff. It鈥檚 also the beginning of the AWS data pipeline. You鈥檙e going to see that come up at reinvent. The last two years burner has spent a good chunk of his keynote talking about the AWS data lifecycle in the data pipeline. You probably heard the term data links that all starts with S 3 at Fina and quicksight are extremely powerful tools on top of that.
Okay, and hopefully this is giving you a bit of a taste. Thank you very much for your comments or appreciate the feedback. Thank you Brian for the feedback in the great question and I will continue to monitor This Thread as always to see if there is additional questions. I鈥檓 publishing this video over on YouTube and then again on my website, so there鈥檚 an additional questions or suggestions and let me know we鈥檙e looking for future topics as well.
We鈥檝e got 60 days before being bad for the next to Big major timeline, October 15th. Make sure you are on board for October 15th so that you can get your seat in the sessions that you want the way to start doing that is log into the session catalog right now and start fly.
The ones you鈥檙e interested in as I鈥檓 interested the little star that creates a quick list over on the side. I鈥檒l be giving a couple talks at the show up with that info up on my website shortly. But thank you for joining me today. I really appreciate it. I hope you are set up.
Hope you learned a little bit here and I hope you are set up for a fantastic. Wednesday will talk to you soon. As I launch my pennant you not aggressively hopefully. All right. Good morning, everybody. How you doing? Today? I鈥檓 back for another episode Another stream of road to reinvent 2019 and we鈥檝e been doing a few of these.
I already threw up the link in the comments here on LinkedIn where you can go back and check all of the previous episodes posting their YouTube along with the transcript. I鈥檓 so you can kind of go through on your own pace. I鈥檓 in Asia can review them here on LinkedIn if you can manage to find him in my feed, that鈥檚 one of the frustrations with LinkedIn great way to connect activity outside of the normal course of the day got any questions while we鈥檙e going through this as always fire them up here in a LinkedIn chat.
I am monitoring it. I have the systems were down at this point. I am where I can see them in the life. I was fond of them live on the stream here if you have suggested for future content go for it and the next episode. I will probably be Friday maybe Monday depending on scheduling going to double-check on that.
A little higher level right now. We鈥檝e been diving in the specific Services. We鈥檙e going to talk a little bit about strategy how to build how to pick what you鈥檙e going to build. I am a next episode have but more on that later. So little bit housekeeping today shockingly enough is already October 2nd October the to October 15th is when reserved seating goes live for AWS reinvent and that means that if you are interested in getting a seat in a particular session, then that鈥檚 what you need to be looking for on your calendar response the website here in the background and they don鈥檛 actually have it up on the hope you have a do.
Okay. So let me flip over to that right away and show you this as it goes. Let me just make sure I just saw that one window session catalog is already live for reinvent. So if you have a ticket, you can log in and start starring sessions that you鈥檙e interested in this check through what One on October 15th.
Are they going to say that on the main page here? And now they鈥檙e going to go back to the main page. You鈥檒l see here October 15th. Can鈥檛 highlight it cuz image but October 15th is when not I鈥檓 reserved seating actually opens up, which means if you want to actually get a seat, if you want to see these sessions live you got to be on the ball on October 15th, 10% 15 to 10:10 to 15% of all the seats in each session are held back for a walk UPS in 10 minutes before session time.
If you as a reserved seat person and not showed up your seat gets released to the Walk of Life. I have the biggest disappointment. Most people have to reinvent is not actually getting the decisions they want so remember October 15th at we are going to be coming back to that several times to make sure that you do not forget because it鈥檚 absolutely critical that you get into those seats.
I鈥檓 actually giving a couple talks. I鈥檒l give you that information on the website in the coming days, but go to the session catalog now flag at I鈥檓 interested in this and it鈥檒l create a little short list that makes it easier to Poke your sessions. So I鈥檓 today back to today.
We鈥檙e going to go for about 25 minutes or so. We鈥檙e going to go around 10:30. We鈥檙e going to talk into two of my favorite services and they probably say that way too often on Amazon, Athena and Amazon quicksight at these are sort of lesser-known services and but I think they鈥檙e extremely important.
So let鈥檚 put up some time in the clock and you see that in the bottom right corner and we鈥檙e going to switch back over to the browser. We鈥檙e going to go through we鈥檙e going to see what鈥檚 going on. Alright. So Amazon, Athena. First of all is a query service.
It鈥檚 the way that you can treat data in S3. Like it was in a database. Why would you care? Why do we need to put your data in a database? Well as 3 is to be really inexpensive to store large amounts of data and there鈥檚 a great life cycle behind S3.
We covered that in the previous stream when we talked about us three and put the thing about Athena is that inquiries all of your data in place. So you don鈥檛 have this process where you鈥檙e putting data in s When you鈥檙e pulling it out to either in ec2 instance or locally or pushing it into another database to do work on it at Fina lets you work on it where it is.
Now that I can have cost savings that definitely far more efficient from an operational perspective from a security and privacy perspective. I like it cuz the data stays where it is, you don鈥檛 need to pull the data out. So we鈥檙e going to dive into Athena and we鈥檙e going to see what happens because a lot of people think is really complicated.
I鈥檓 and it can be Do you need to take a sip of my way to hot tea the and I should have known that the Little Bear shows up in the when the Bear shows up driving. It鈥檚 too hot. Should I look for that? So with Athena we鈥檙e going to dive in and do a very simple basic thing to connect to Quick seitan will taco quicksight and as the second part of this, so I鈥檝e already cheated a little bit.
Sorry, but bear with me I鈥檓 basically already uploaded a big file cuz I don鈥檛 think you guys would want to watch me upload a file into a story. So we have our bucket that we made previously road to reinvent and we鈥檝e created a folder called you names. We鈥檝e uploaded this 1.4 gigabyte text file 1.4 gig in one text file that is has the least optimized way to do this, but I wanted to demonstrate the power to 1.4 gig of pure uncompressed text is a lot of texts.
Okay, what is in here is actually a significant amount of information about places. So it鈥檚 from a site called you names. Org, they freely license this out for for research on my belief potentially for commercial use as well. But we鈥檙e doing it for educational research purposes right now in our 1.4 KGB file.
We have a whole bunch of lines at every single line has these properties to its there鈥檚 19 different properties for every single line. We have a name. We鈥檝e alternative names with the longitude and latitude of where I whatever it is. We鈥檙e talking about. I鈥檓 the features of it. So there鈥檚 some some mapping information here and population elevation and the time zone that it鈥檚 in and then the date of the records are so say like a hundred fifty thousand people we have that over and over and over and over again, and I鈥檓 in fact if I just double-check this I鈥檓 just running their ears out.
111 million / 12 million lines in a text file. Okay, just shy of 12 million lines. So obviously that鈥檚 a lot of data. And what we鈥檙e going to do is we are going to use Fina to query that data. So we鈥檙e going to say we have 11 million lines or 12 million lines of city data in this is 3 bucket.
What we are going to do is go over to Athena now because I鈥檝e already used Athena I get jumped right into the default setup, which is basically about databases on the left hand side. And we鈥檝e queries on the right. What we鈥檙e going to do is create a new database I believe or create a new table.
Create a table from an S3 bucket. That gives us the new Wizard so we can walk through this. So we鈥檙e going to create a new database. We鈥檙e going to call this road to reinvent. Geonames, that鈥檚 our database name table is places its road to reinvent / Geo names of the interesting thing here is that so we鈥檝e given our database and name just in this case road to reinvent geonames.
We鈥檙e going to call this table places because of the info for putting in there is just a bunch of places and what we have now is an S3 folder. So you need a folder because by default Athena will treat everything in that folder as one big database. So we have 1/4 Mile in her folder and that鈥檚 okay.
What would be more efficient for both of Fina and for pricing which will cover in a second would be if instead of one big file. We had those files of broken up by at let鈥檚 say the first letter so some way that I feel could index them better. But in this case, we鈥檙e just going to give it one big giant file.
It鈥檚 not an encrypted data set if it was you can actually passing the key and say use this key to decrypt it. Click the blue next button and I forgot lower. We need a little underscores. Not the hyphens believe. Let me double-check some quick little things. Next question is data format.
What鈥檚 that text? What are those text files in that bucket look like in this case. We have a bunch of default format. You can see there鈥檚 things like a Json documents and parquet is an optimized format for Athena, which is great and it鈥檚 used and some other stuff like Hive but we have a TSB which is tab separated values.
You can also see us, separated values in our case. There鈥檚 a tab between every fields of name tab next field tab. That鈥檚 what we鈥檙e going to use tsv. And then homes what are a if we separate each line up by tabs. What do we have? And in this case? I鈥檝e already pre-populated this to make it easier.
I鈥檓 going to say bulk, and I鈥檓 going to just Pace this in but you can see we have the name of the the name of the column in this case the Jew Name ID and then the type it鈥檚 an integer next columns a place name. Little Bear still driving, so this is hot The next is a place name in that the string so very simple you can do this by hand as well.
But because we have 19 field. I just pre-populated it. I鈥檓 going to click add and you鈥檒l see it populated into Athena鈥檚 very simple call Amanda type. We wanted to change it. We could select a different type here and what we鈥檝e already knocked us out. So we scroll down to the bottom and now we鈥檙e going to click the blue next button again.
Now, I know this is kind of an in-depth is got a little more technical than we鈥檝e gone. And I think that鈥檚 okay. I think this is just, you know, this introduces some of the the power of Athena some of the the interesting facts for the next part is going to be what we鈥檙e we鈥檙e setting up to which is a quick side part sew week with the blue crate table.
And you see this is Chase takes a second. It actually gives us the query that it has created to build a stable. So if you want to save this weekend if we want to rebuild this in any time, but we鈥檙e just going to click on the new On the new query button here on the new query tab, you can see on the left hand side under tables.
We have our places table with all of the things in here. Now, if we want to preview this we click on the three. What鈥檚 the click preview table that this is going to give us a default query, which is Select * from our database places table. Give me the 10 records limit of 10.
If you scroll down here, you鈥檒l see that it鈥檚 already pulled these out. I鈥檓 going to go out on a limb and say these are probably in Iceland or Greenland there somewhere in the Nordic region based on the language. But if we want to what we can do is write our own query so we can say we鈥檙e place name.
like New York It鈥檚 what I鈥檝e done here is we say select star or select everything from our database name our table name where the place names of the name of the place is like New York. Now the little modulus little percentage sign means New York just needs to be in there somewhere.
It can be there could be stuff before it. There can be stuffed after write such a great way to search through of a fuzzy search for text and if we click run query going to see at the bottom. It鈥檚 running the query takes took about a second to go over those 12 million lines, which is pretty amazing and then we have all over these things.
So we have placed name New York. We have all of these items in New York. So see there鈥檚 a lot of hotels in the state of Base, New York, New York, New York Palace Fairfield it. So there鈥檚 all these a different places with in New York, but we鈥檙e getting a ton of information here.
So what we鈥檙e actually going to do is we鈥檙e going to get a little pickier. So we鈥檙e going to say give me place name. Give me latitude longitude. elevation population in time zone so that if we click the blue run query button again is going to give us just a little better filter, right? So here we鈥檝e gotten rid of a bunch of the junk is standard ask you to write.
This is standard database query and that鈥檚 really the key takeaway from Athena鈥檚 it lets you treat text files or Json documents are parking photos. All these files that are sitting in S3. You can treat them just like a normal database. Okay. Now you guys might be faking. Okay. That鈥檚 way too technical.
Why are we doing this? Does this really matter? What is one last step in Athena? And then we鈥檙e going to get you to the really cool stuff. So if I click create I can actually create a new table or a new view from this query. So why don鈥檛 you do is create a new table.
I鈥檓 going to keep it into my current database. guess a New York places places in New York and then my output bucket is going to be the same. Okay, so my output location and then it tells you which format is going to put them out. If you鈥檙e going to just leave it in Athena stick with the default app RK cuz that鈥檚 going to be the least expensive now.
I would say it鈥檚 going to write this in we say create and now it鈥檚 actually written another database for us another table for us and more. It鈥檚 still crunching on that and watch it finishes that will should see another table over here before we were very expensive if we look at Athena pricing and you鈥檙e going to see that the price is actually run per query.
So it鈥檚 $5 in u.s. East per terabyte of data scan. So every time works running if we鈥檙e right now, we鈥檙e scanning 1.5 gig of data so we can only run Just shy of what we can run 900 or so queries before we鈥檙e charge $5. That鈥檚 because we just dumped all of our data into one big text file if we have broken our data up or put it in a better format.
Then we wouldn鈥檛 be going across all of it all the time and we would be able to reduce that query cost down significantly. So in this case are adding failed because the place already exists because I was running tests previously. Sorry, but you get the point is that you can say about the results of queries as smaller tables and then that makes it cheaper to query things repeatedly.
So we have our big data set then we we know we want to do a lot of work with the stuff in New York that we can reduce the cost by outputting just the New York results as a whole new table then we only query that because now we鈥檙e doing way less information we go for my 1.4 gig down to a couple hundred kilobytes and that would give us way cheaper queries as we move forward.
Is far more interesting because quicksight is all about visualizing your data. It is a data visualization service. It is interesting enough. It was the very first date of your service to use its own top-level. Domain. AWS quicksight. Adriana to get pushed into that since you have a bunch of data use Quick site to look at it.
You make interactive dashboards email reports. You can do analytics. So we鈥檙e going to do that right now. We鈥檙e going to switch over to Kwik site as if you go to Services we go to Kwik site is going to pull this is the first time this account has ever been using quick site.
So it鈥檚 going to ask us to sign up for it, which is just enabling it on this. On this side Temple and you can see your own this account. You can see there鈥檚 actually charges for Kwik site, right? So I鈥檓 very clearly if we start this at UniFirst author with a gig of data is free team.
Trial is 60 days. I鈥檓 a knee after that you鈥檙e paying per additional author. So be aware of the cost of for one author with one gig of data. It鈥檚 actually free. So we鈥檙e just going to click continue. We鈥檙e going to give ourselves a unique name. We鈥檙e going to give ourselves our email notification address that we are going to automatically discover some data.
We are going to let quick sites access RS3 bucket. So this you鈥檒l see here is the data sources for quick side cuz it needs to pull data to visualize from somewhere. We鈥檙e going to give it access to our S3 bucket that we are using for this asshole seminar for all these things and we just click finish and it鈥檚 going to create our account in the background.
It鈥檚 a setting up all the permissions that we need and you notice I didn鈥檛 say give all access test 3. I just said that one bucket, which is a really really good practice. You want to restrict it to Brian鈥檚 got a great question here asking whether quicksight is similar to power bi it is but not as mature.
So quit site is very much about visualizing data and making some interactive dashboards and there is some really cool powerful stuff here and depending on what you鈥檙e using a power bi for in Microsoft. You have you may find what you need here. I think this is far more powerful and flexible, but you got to do a lot more work to get stuff out of it and what you鈥檒l see as we go here by and keep up the question if you have a specific one.
I鈥檓 around power bi but I think I would say just maturity-wise quick sites not there yet and but it definitely has the potential. So we got this little welcome dialogue cuz it鈥檚 the first time we鈥檝e ever been here. Hey it lets us quickly visualize analyze my data connect to your data super fast performance visualizing analyze right share insights with anyone and that鈥檚 really the key here so we can set up a 10 sites and then send them along.
So we have some default analyses if we鈥檇 like to and let鈥檚 just clicking that for a second. This is an analysis of website and social media its pre-made this is just a sample database and you can see here that we鈥檝e got website visits graft out overtime with website page views mailing list subscription data, so somebody鈥檚 going through and actually made this dashboard of a free vs.
Paid signups Twitter mentions and follows pretty straightforward standard analytical stuff. Right? So you can experiment around and you can play around with this and see what you want to do in our case. What we鈥檙e actually going to do is we are going to go back to the main and we鈥檙e going to start a new analysis alysis.
We are going to create a new dataset and then just increase the size of this a little bit so you can upload files directly. We can connect to a number of services including at Salesforce at 3 raw Athena. So that means we can take our data from S3. We could carve it up better in Athena and Process it there.
I鈥檓 so what we鈥檙e actually going to do is work going to connect to Athena because we鈥檝e already told Athena how our data is stored. So we can also go to Twitter with a new Twitter analytics, which is pretty cool. I have organized say Athena the data source name is going to be geonames.
Play create a new data source. So it鈥檚 loading my table and now I need to connect to my Athena thing. You notice is the exact same road to reinvent geonames is what we made in Afena is loading that up. It says there鈥檚 one table which is places. That鈥檚 the one we want to click select.
So if we have more than 25 million rows or over 25 gig this could fail So if it fails we鈥檙e going to see what鈥檚 what鈥檚 going on so we can import it or we could directly query in our case. We know we鈥檙e over the limit. So we鈥檙e not going to import it directly into quicksight, even though that would be faster.
We鈥檙e going to directly query the data which is a little more expensive but it鈥檚 going to be more useful for us. So one of the things we have is in our day that we have longitude and latitude which gives us the ability to do geospatial analysis. Which means where is it on a map is may or may not work.
We鈥檒l see but hey, look it worked. That鈥檚 always good. You never know with this stuff when you鈥檙e doing it live weather is going to work or not. And this is a little slower because we are doing it. Out of we鈥檙e not doing it in Athena itself. So it鈥檚 maximized that there we go.
So we because we鈥檙e pulling right for my Athena. We鈥檙e not pulling it from website where you can see every one of these dots is actually a place in our data sets of rental. We had New York and there鈥檚 a bunch of items in New York. This is where it has mapped all these things out so you can see each of his.
The stock has four places in their right. Quickly. All we did was point in our data and said hey, by the way, there are there is geospatial data in here, right so we can give us eyes as well which we have population which gives us our size. So now we have not only are we mapping it on the on the On the map but we鈥檝e also giving a size of.
So if we scroll this a little bit UCR. Sad bunch of them changed now, it鈥檚 really tricky to kind of see let me zoom in here. So let鈥檚 get there we go. That鈥檚 getting better. So you can see that the dots are changing transparency based on how big they are right so we can see now we鈥檝e got population.
So we have New York tag with population. We鈥檝e got Newark tag with his population and that鈥檚 how we very simply are doing a geospatial analysis right to all we did was we dragged from this list of our data and we put it into the right Fields very simply done.
Right? So if we rename this, this is a map that鈥檚 really handy. Let鈥檚 create a second analysis. So let鈥檚 do something like let鈥檚 do a scatter plot. So quick a scatter plot, and now we have x-axis and y-axis. So let鈥檚 go by population to elevation. And then we鈥檙e going to group by country code.
So we鈥檒l see how big things are based on population on the x-axis and elevation on the y-axis and then we鈥檒l see the country. So we see way up here. The US has the US have the number of u.s. Items of the population is obviously a sum of all the population figures.
So it鈥檚 not the actual country population. I鈥檓 in a some of the elevation. So there鈥檚 far more entries in the US rather than somewhere like empty empty has a bunch of stuff with no elevation tag needs the India here has low elevation but a bunch of entries included. So the interesting thing here is that we鈥檝e spotted that our data is not rolling up the way we would expect right? What we want to see is instead of a total of everything grouped by country code.
We want to see just the spread of population versus elevation for items. So because we were grouping it, we were actually changing the date. Rolling up everything in that country code, which is not accurate. There鈥檚 not 26 billion people in the United States, but here we now start to get a little bit more if we group this.
Sorry. Let me add one more thing we Group by asking name. So by name What we should see is a little more readable data. Right. So now we鈥檙e still graphing by population and elevation is a population along the X elevation along the Y and in this case were grouping it together.
So now we鈥檝e got a different distribution. So if we look here we got 40 million people in this particular area, which looks to be let鈥檚 see if that鈥檚 going to load up a second. So we get these different ways of playing around with our data right in the case of are there we go in the case of our particular data set.
That is not fast at all in the case of her particular. They decided to Simply. It鈥檚 a grouping of places with some data about those places. So in that case, what makes the most sense is that map that first map we did but play the scatterplot. Maybe we find something interesting a cute little fact that we can use for something else, but it鈥檚 very simple to and did this just kill everything.
This might have killed it everything. That鈥檚 a first down the stream. I鈥檝e never had Chrome lock up on me and then again chromes not my normal browser, but So the point is qwiksite makes it really really simple to visualize our data no matter where it is from. There we go.
All right, it鈥檚 responding again where it鈥檚 from in this case. What we did was we took a bunch of data that was sitting in one big honking file in S3. We created in Athena query to make that a little more understandable and then we pulled that Athena query into Amazon quicksight and with two or three clicks, we made a map that mapped out all of our data.
Here we go again, right if we look at this so we maximize it right we could look at all of our data and we can see very clearly, you know, sort of how things break up our clocks these places if you wanted to dive into to the nail salons and see what鈥檚 going on at what places are in Amsterdam, we鈥檝e got that right.
I鈥檓 so very very simply put you can still walk to the sample stuff is a ton of different chart types of the ton of different ways to connect data into a quick site in the back. Here鈥檚 what鈥檚 really really interesting. This date is still just sitting in that one file in S3.
We have moved it, right? We did move it at all. We sit in at 3 week from Athena we visualizing quick. So we did another 20 minutes or under 30 minutes, right? I鈥檓 very simply made some mistakes along the way we鈥檙e going to keep making some mistakes and but I think this is just kind of gives you some of the idea of what is possible with these tools in the beauty is that you know, you鈥檙e Athena queries you need to make sure that you鈥檙e understand how it queries to keep the price down so I can we start with a big data set we could chuck that up with other queries a thing like you just give us the New York stuff.
Give us the US. Whatever the case maybe save those queries back into S3 is optimized files which will reduce the cost because now we鈥檝e gone from 1.5 gig down to like a couple hundred kilobytes. So are Costco鈥檚 way lower because remember it鈥檚 $5 for every terabyte you query with Athena that cannot very very quickly.
So cut up at data and then we pulled it in. Quicksight to start looking at what鈥檚 going on at Kwik site has a pretty generous free tier and then it鈥檚 $9 a month or $12 a month for additional authors. And then you pay per gig as well and it can start to add up but it鈥檚 also extremely powerful now think about pulling in sales date instead of just like an open data set that we did you鈥檙e pulling in at your sales data or log data.
If you go back to these samples, right? So we go people overview analysis. This is just a breakdown of employees by region hiring by region. You can very easily create these kind of visualizations and then click the share button publish this out or share it. You can also capture this right into a storyboard which is you can download or you can just view a full-scale like this very very simple to do very very easy to do and if you go to the sharing options you have the ability was just click back in here.
You have the ability to publish this dashboard and then give people the ability to You play around with it or not, depending on what you want and you can send a whole bunch of permissions around that so you don鈥檛 need to make it public to share it with in your team.
So that was a quick glance at 2. Very very powerful AWS Services. The key here is that they leave the data where it sits in S3. You need to be aware that pricing like I鈥檝e said repeatedly cuz it is really important and I want to get hit by a big Bill unexpectedly, but from a privacy and data perspective your data only ever stays in that one bucket or in those areas in S3 where you have control over it and then you make sense of it with Athena and with quicksight at and you can share it out.
I also the interesting thing with Tina is that it then is a standard database output is a standard database connectivity, which means if you don鈥檛 want you to excel you can go to silence a new data source, it鈥檚 a relational database it鈥檚 here and give it to the details for thin as long as you set up a promotions correctly and all of a sudden now from Excel your query in your day today.
Meaning of three that鈥檚 really really powerful stuff. It鈥檚 also the beginning of the AWS data pipeline. You鈥檙e going to see that come up at reinvent. The last two years burner has spent a good chunk of his keynote talking about the AWS data lifecycle in the data pipeline. You probably heard the term data links that all starts with S 3 at Fina and quicksight are extremely powerful tools on top of that.
Okay, and hopefully this is giving you a bit of a taste. Thank you very much for your comments or appreciate the feedback. Thank you Brian for the feedback in the great question and I will continue to monitor This Thread as always to see if there is additional questions. I鈥檓 publishing this video over on YouTube and then again on my website, so there鈥檚 an additional questions or suggestions and let me know we鈥檙e looking for future topics as well.
We鈥檝e got 60 days before being bad for the next to Big major timeline, October 15th. Make sure you are on board for October 15th so that you can get your seat in the sessions that you want the way to start doing that is log into the session catalog right now and start fly.
The ones you鈥檙e interested in as I鈥檓 interested the little star that creates a quick list over on the side. I鈥檒l be giving a couple talks at the show up with that info up on my website shortly. But thank you for joining me today. I really appreciate it. I hope you are set up.
Hope you learned a little bit here and I hope you are set up for a fantastic. Wednesday will talk to you soon. As I launch my pennant you not aggressively hopefully.