The Case for a Data Science Lab
By Mark Sellors, Technical Architect – Mango Solutions
As more and more Data Science moves from individuals working alone with small data sets on their laptops to more productionised, analytically mature settings, an increasing number of restrictions are being placed on Data Scientists in the workplace.
Perhaps your organisation has standardised on a particular version of Python or R, or perhaps you’re using a limited subset of all available big data tools. This sort of standardisation can be incredibly empowering for the business. It ensures all analysts are working with a common set of tools and allows analyses to be run anywhere across the organisation. Whether it’s a laptop, a server, or a large-scale cluster, Data Scientists and the wider business can be safe in the knowledge that the versions of their analytic tools are the same in each environment.
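As a rough illustration of what "the same versions in each environment" can mean in practice, the sketch below compares the packages installed in the current Python environment against a pinned list. The function name `check_versions` and the pinned versions are hypothetical and chosen only for illustration; a real organisation would keep its own list.

```python
from importlib import metadata


def check_versions(pinned):
    """Return {package: (expected, installed)} for every mismatch.

    `installed` is None when the package is missing altogether.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


# Hypothetical "official" versions, for illustration only.
OFFICIAL_PINS = {"numpy": "1.26.4", "scikit-learn": "1.4.2"}

for package, (expected, installed) in check_versions(OFFICIAL_PINS).items():
    print(f"{package}: expected {expected}, found {installed}")
```

Running the same check on a laptop, a server, and a cluster node gives a quick indication of whether the environments have drifted apart.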
While incredibly useful for the business, this can at times feel very restricting for the individual Data Scientist. Maybe you want to try a new package that isn’t available for your ‘official’ version of R, or you want to try a new tool or technique that hasn’t made it into your officially supported environment yet. In all of these instances, a Data Science Lab or Analytic Lab environment can prove invaluable for keeping pace with the fast-moving data science world outside your organisation.
An effective lab environment should be designed from the ground up to support innovation, with new tools as well as new techniques and approaches. It’s rare that any two labs would be the same from one organisation to the next; however, the principles behind their implementation and operation are universal. The lab should provide a sandbox of sorts, where Data Scientists can work to improve what they do currently, as well as prepare for the challenges of tomorrow. A well-implemented lab can be a source of immense value to its users, as it can be a space for continual professional development. The benefits to the business, however, can be even greater. By giving your Data Scientists the opportunity to help drive the requirements for your future analytic solutions, and by basing those solutions on solid foundations derived from experiments and testing performed in the lab, the business can achieve and maintain true analytic maturity and meet new analytic challenges head-on.
To successfully implement a lab in your business, you must first establish the need. If your Data Scientists are using whatever tools are handy, and nobody has a decent grasp of what tools are used, with what additional libraries, and at what versions, then you have bigger fish to fry right now and should come back when that’s sorted out!
If your business analytic landscape is well understood and documented, start by distilling your existing tool set into a set of core tools. As these tools constitute the day-to-day analytic workhorses of your business, they will form the backbone of the lab. In a lot of cases, this may be a particular Hadoop distribution and version, or perhaps a particular version of Python with scikit-learn and NumPy, or a combination.
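One possible way to approach that distillation, sketched here under assumptions that are not part of the original article: suppose each analyst's environment has been recorded as a simple mapping of tool name to version. A tool then counts as "core" if it appears in at least a chosen share of those inventories, and its core version is the one most commonly in use. The function name `core_tools`, the `min_share` threshold, and the sample data are all hypothetical.

```python
from collections import Counter


def core_tools(inventories, min_share=0.5):
    """Return {tool: most common version} for widely used tools.

    `inventories` is a list of {tool: version} dicts, one per analyst.
    A tool qualifies when at least `min_share` of inventories include it.
    """
    total = len(inventories)
    usage = Counter(tool for inv in inventories for tool in inv)
    core = {}
    for tool, count in usage.items():
        if count / total >= min_share:
            versions = Counter(inv[tool] for inv in inventories if tool in inv)
            core[tool] = versions.most_common(1)[0][0]
    return core


# Made-up inventories for three analysts, for illustration only.
sample = [
    {"python": "3.11", "numpy": "1.26", "scikit-learn": "1.4"},
    {"python": "3.11", "numpy": "1.26", "tensorflow": "2.15"},
    {"python": "3.10", "numpy": "1.26", "scikit-learn": "1.4"},
]

print(core_tools(sample, min_share=0.6))
```

In this made-up sample, `tensorflow` is used by only one analyst in three, so it would be a candidate for the lab rather than for the core tool set.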
The next step can often be the most challenging, as it usually requires moving outside the Data Science or Advanced Analytics team and working closely with your IT department to provision the environments on which the lab will be based. Naturally, if you’re lucky enough to have a suitable Data Engineer or DataOps professional on your team, you may avoid this requirement. Much of that will depend on how agile your business is and how reliant it is on strict silos.
Ideally, any environments provisioned at this stage should be capable of being rapidly re-provisioned and re-purposed as needs arise, so working with a modern infrastructure is a high priority. It’s often wise at this stage to consider some form of image management for containers or VMs, to speed deployment and ensure environments are properly managed. You need to be able to adapt the environment to the changing needs of the user base with the minimum of effort and fuss.
Once you have rapidly deployable environments at your disposal, you’re ready to start work. What form that work takes should be left largely up to your Data Science team, but broadly speaking they should be free to use and evaluate new tools or approaches. Remember, the lab is not a place where production work is done with ad hoc tools; it’s a safe space for experimentation and innovation, just like a real laboratory environment. The knowledge gained from running tests or trials in the lab, however, can and should inform the evolution of your production tools and techniques.
A final word of warning for the business: a successful lab environment can’t be achieved through lip service. The business must set aside time for Analysts or Data Scientists to develop the future analytic solutions that are increasingly becoming central to the success of the modern business.
Translated from: https://www.pybloggers.com/2015/12/the-case-for-a-data-science-lab/