AWS DataSync
In this blog post, let us look at the architecture, use cases, and steps to set up AWS DataSync to replicate storage from on-premises to the AWS Cloud.
I will also share some notes and considerations from my recent experience on a cloud migration project. Although DataSync can be used in a variety of scenarios, this post only covers storage migration from on-premises to the AWS Cloud.
Introduction
AWS DataSync is an online data transfer service that simplifies, automates, and accelerates moving data between on-premises storage systems and AWS storage services, and also between AWS storage services. DataSync can copy data between Network File System (NFS), Server Message Block (SMB) file servers, self-managed object storage, AWS Snowcone, Amazon Simple Storage Service (Amazon S3) buckets, Amazon EFS file systems, and Amazon FSx for Windows File Server file systems.
Use Cases
- Data migration from an on-premises data center to AWS.
- Archiving cold data directly from on-premises storage to S3 and transitioning it to Glacier.
- Data transfer between AWS and on-premises for processing in hybrid cloud setups.
Architecture
Below is the high-level architecture of AWS DataSync for copying/migrating NFS data from on-premises to AWS.
The DataSync agent is deployed on a supported VM in the on-premises data center and is connected to both the storage server and the AWS DataSync service endpoint.
The service endpoint on the AWS side can be one of three types:
1. Private Service Endpoint or VPC Endpoint
With this type of service endpoint, all communication from the on-premises DataSync agent to AWS goes through the VPC endpoint, meaning it uses the private network (VPN/Direct Connect) established between the on-premises data center and AWS.
2. Public Service Endpoint
As the name suggests, all communication from the on-premises DataSync agent to AWS occurs over the public internet. In this scenario, you do not need to establish Direct Connect or a VPN from your on-premises data center to the AWS Cloud.
3. FIPS Service Endpoint
With this type of service endpoint, the DataSync agent communicates with a FIPS endpoint in the AWS GovCloud (US) or Canada (Central) Region.
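The endpoint type also matters when you fetch the agent's activation key without a browser. Below is a minimal sketch that builds the activation URL for each of the three endpoint types. The query parameters (gatewayType, activationRegion, privateLinkEndpoint, endpointType, no_redirect) follow my reading of the DataSync documentation and should be verified against the current docs; the function name and the IP addresses are hypothetical.

```python
from urllib.parse import urlencode


def activation_url(agent_ip, region, endpoint_type="PUBLIC", vpc_endpoint_ip=None):
    """Build the URL queried on the agent (port 80) to obtain an activation key.

    endpoint_type is one of "PUBLIC", "PRIVATE_LINK" (VPC endpoint) or "FIPS".
    Parameter names are an assumption based on the DataSync documentation.
    """
    params = {"gatewayType": "SYNC", "activationRegion": region}
    if endpoint_type == "PRIVATE_LINK":
        if not vpc_endpoint_ip:
            raise ValueError("a VPC endpoint IP is required for PRIVATE_LINK")
        params["privateLinkEndpoint"] = vpc_endpoint_ip
        params["endpointType"] = "PRIVATE_LINK"
    elif endpoint_type == "FIPS":
        params["endpointType"] = "FIPS"
    elif endpoint_type != "PUBLIC":
        raise ValueError(f"unknown endpoint type: {endpoint_type}")
    # no_redirect is a bare flag, so it is appended without a value.
    return f"http://{agent_ip}/?{urlencode(params)}&no_redirect"


if __name__ == "__main__":
    # Hypothetical agent and VPC endpoint addresses.
    print(activation_url("10.0.0.5", "eu-west-1", "PRIVATE_LINK", "10.1.2.3"))
```

Requesting this URL from a machine that can reach the agent on port 80 should return the activation key, which you can then enter manually in the console.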
Important Consideration
- If you share a single VPC across multiple AWS accounts using AWS Resource Access Manager (RAM), you can create the VPC endpoint only in the account that owns the VPC, and you cannot share the VPC endpoint with another account.
- Hence the DataSync agent must be set up in the same AWS account where the VPC was created. If you need to transfer data directly to a child account while sharing a single VPC across all accounts, you can spin up a separate new VPC in the child account for the transfer.
Prerequisites
Before starting the technical setup, let us go through the prerequisites you should have in place and the decisions you should make to get started with AWS DataSync.
- Decide the target AWS account where you want to host your EFS, S3, or FSx storage.
- Confirm the transfer method: private VPC endpoints over Direct Connect/VPN, or public endpoints over the internet.
- Establish a Direct Connect/VPN tunnel to AWS if you decide to use a VPC endpoint.
- Set up the DataSync agent on-premises as a virtual machine (VM) on one of the supported hypervisors:
  - VMware ESXi Hypervisor (version 6.5, 6.7, or 7.0)
  - Microsoft Hyper-V Hypervisor (version 2012 R2 or 2016)
  - Linux Kernel-based Virtual Machine (KVM) (not supported for AWS EC2)
- Open network ports 2049 (NFS), 139/445 (SMB), and 443/80 (self-managed object storage) between the DataSync agent VM and the storage server.
- Open network ports 1024–1064 for control traffic and 443 (HTTPS) for data transfer if you are using VPC endpoints.
- Open port 443 on your corporate firewall to the public endpoints below if you are using the public service endpoint:
  - activation.datasync.$region.amazonaws.com: DataSync activation
  - datasync.$region.amazonaws.com: API endpoints
  - $taskId.datasync-dp.$region.amazonaws.com, cp.datasync.$region.amazonaws.com: data transfer endpoints
  - repo.$region.amazonaws.com, repo.default.amazonaws.com, packages.$region.amazonaws.com: agent updates
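To hand the network team a concrete list, the port and endpoint prerequisites above can be expanded for a given region and spot-checked from the agent's network segment. This is a minimal sketch using only the Python standard library; the function names are my own, and the per-task hostname keeps its $taskId placeholder because it is only known once a task exists.

```python
import socket

# Ports between the agent VM and the storage server, as listed above.
STORAGE_PORTS = {"nfs": [2049], "smb": [139, 445], "object": [443, 80]}


def public_endpoint_hosts(region, task_id="$taskId"):
    """Return the public hostnames the agent reaches on port 443."""
    return [
        f"activation.datasync.{region}.amazonaws.com",  # activation
        f"datasync.{region}.amazonaws.com",             # API
        f"{task_id}.datasync-dp.{region}.amazonaws.com",  # data transfer
        f"cp.datasync.{region}.amazonaws.com",          # data transfer
        f"repo.{region}.amazonaws.com",                 # agent updates
        "repo.default.amazonaws.com",                   # agent updates
        f"packages.{region}.amazonaws.com",             # agent updates
    ]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_connectivity(storage_host, protocol, region):
    """Probe the storage server ports and the resolvable public endpoints."""
    targets = [(storage_host, port) for port in STORAGE_PORTS[protocol]]
    # Skip hostnames that still contain the $taskId placeholder.
    targets += [(h, 443) for h in public_endpoint_hosts(region) if "$" not in h]
    return {f"{host}:{port}": can_connect(host, port) for host, port in targets}


if __name__ == "__main__":
    # Hypothetical storage server address and region.
    for target, ok in check_connectivity("10.0.0.20", "nfs", "eu-west-1").items():
        print(f"{target}: {'open' if ok else 'blocked'}")
```

A successful TCP connection only shows that the firewall path is open; it does not prove that the NFS export or SMB share is readable by the agent.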
DataSync Setup
Below are the high-level steps to set up DataSync from the AWS Console for an on-premises to cloud storage migration. AWS DataSync also supports the CLI and CloudFormation.
1. Log on to the target AWS account console and navigate to the AWS DataSync service, select Agents, and click Create agent. Select the hypervisor and download the image from the link. The image needs to be deployed as a VM on-premises.
2. Once the DataSync agent is deployed, fill in the remaining options in the AWS console, such as the service endpoint. For a VPC endpoint, select the endpoint, subnets, and the default security group. Then activate the agent, either by connecting to the agent through the browser or by entering the key manually. Once the agent is activated, it should appear as online.
3. Now create locations. Locations are the source and destination for copying files; in our example, the source is the on-premises NFS storage server and the destination is EFS. Create the source location by selecting NFS as the location type, then choosing the agent, the storage server address (the agent must have access to the storage server, as mentioned in the Prerequisites section), and the mount path on the storage server that you want to copy. Similarly, create the destination location and select EFS as the location type.
4. Now you are ready to create the task that copies files from source to destination. Create a task from the Tasks tab, select the source location created in step 3 and click Next, then select the destination location created in step 3 and click Next. On the next page you can configure the task settings.
5. Once the task is created, it can be kicked off immediately by opening the task and starting it manually, or it can be scheduled for a certain time of day in the task settings. Once the task status becomes Available, you are ready to copy the files.
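For readers who prefer scripting over the console, steps 3 to 5 can be sketched with the DataSync API. The example below assumes an already activated agent and a boto3 DataSync client passed in by the caller; the helper names, ARNs, and hostnames are hypothetical, and the request fields reflect my understanding of the create_location_nfs, create_location_efs, create_task, describe_task, and start_task_execution calls.

```python
import time


def nfs_location_params(server_hostname, export_path, agent_arn):
    """Request fields for the on-premises NFS source location (step 3)."""
    return {
        "ServerHostname": server_hostname,
        "Subdirectory": export_path,
        "OnPremConfig": {"AgentArns": [agent_arn]},
    }


def efs_location_params(efs_arn, subnet_arn, security_group_arn, subdirectory="/"):
    """Request fields for the EFS destination location (step 3)."""
    return {
        "EfsFilesystemArn": efs_arn,
        "Subdirectory": subdirectory,
        "Ec2Config": {
            "SubnetArn": subnet_arn,
            "SecurityGroupArns": [security_group_arn],
        },
    }


def task_params(source_arn, destination_arn, name, schedule=None):
    """Request fields for the task (step 4), with an optional schedule (step 5)."""
    params = {
        "SourceLocationArn": source_arn,
        "DestinationLocationArn": destination_arn,
        "Name": name,
    }
    if schedule:
        # Example expression: "cron(0 2 * * ? *)" for 02:00 UTC every day.
        params["Schedule"] = {"ScheduleExpression": schedule}
    return params


def wait_until_available(client, task_arn, poll_seconds=5, max_polls=60):
    """Poll the task until its status is AVAILABLE, or give up."""
    for _ in range(max_polls):
        if client.describe_task(TaskArn=task_arn)["Status"] == "AVAILABLE":
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"task {task_arn} did not become AVAILABLE")


def migrate(client, nfs_params, efs_params, task_name, schedule=None):
    """Create both locations and the task, then start one execution."""
    source = client.create_location_nfs(**nfs_params)["LocationArn"]
    destination = client.create_location_efs(**efs_params)["LocationArn"]
    task = client.create_task(
        **task_params(source, destination, task_name, schedule)
    )["TaskArn"]
    wait_until_available(client, task)
    return client.start_task_execution(TaskArn=task)["TaskExecutionArn"]
```

With boto3 installed, the client would be created with boto3.client("datasync", region_name=...) and passed to migrate() along with the two parameter dictionaries. Keeping the parameter builders separate from the API calls makes it easy to review what will be sent before anything is created in the account.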
Conclusion
AWS DataSync is an online data transfer service that automates and simplifies transferring large amounts of data from on-premises to AWS cloud storage services.
Using DataSync is more beneficial than writing a custom transfer script, as it saves script development time. DataSync is easy to set up, uses a parallel, multithreaded architecture, and encrypts and validates the integrity of transferred files. It also has an option to use VPC endpoints over Direct Connect/VPN tunnels, so that your data never traverses the public internet.
It also has an option to limit bandwidth usage on Direct Connect/VPN, which is useful if other workloads in the cloud use the same connection.
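As a small illustration of the bandwidth limit, the task option is expressed in bytes per second, while link capacity is usually quoted in megabits per second. The sketch below converts between the two; the helper names are hypothetical, and it assumes the BytesPerSecond task option, where -1 means no limit.

```python
def mbps_to_bytes_per_second(mbps):
    """Convert megabits per second (decimal) to bytes per second."""
    if mbps <= 0:
        raise ValueError("bandwidth limit must be positive")
    return int(mbps * 1_000_000 / 8)


def bandwidth_options(limit_mbps=None):
    """Build the task Options fragment; None means use all available bandwidth."""
    if limit_mbps is None:
        return {"BytesPerSecond": -1}
    return {"BytesPerSecond": mbps_to_bytes_per_second(limit_mbps)}


if __name__ == "__main__":
    # Cap DataSync at 100 Mbps of a shared Direct Connect link.
    print(bandwidth_options(100))
```

For example, capping DataSync at 100 Mbps on a shared 1 Gbps link leaves the remaining capacity for other workloads; the resulting dictionary would be passed as the task's Options.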
Consider latency: if you are only migrating shared storage to EFS and presenting it to on-premises servers, test thoroughly in all possible scenarios. It’s always a good idea to migrate the storage and servers together to achieve low latency to shared storage.
Hope you like this blog. Feel free to comment and I will respond as soon as I can. You can connect with me on LinkedIn at www.linkedin.com/in/donthinenis