Converting Data Types with Python (Part 1)

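A CSV file carries no type information, so everything we read from it with Python comes back as a string. Clearly not every field really is a string, so we need to convert the values; otherwise arithmetic on them will go wrong. Having the right types also makes later debugging easier.
Let's first look at the fields in enrollments: account_key (the account ID), cancel_date (the cancellation date), days_to_cancel (the number of days before the student canceled), is_canceled (whether the enrollment was canceled), is_udacity (whether the account belongs to a Udacity employee), join_date (the enrollment date), and status.
account_key and days_to_cancel are clearly numeric (integers). cancel_date and join_date are dates. is_canceled and is_udacity are Booleans.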

{u'account_key': u'448',
 u'cancel_date': u'2015-01-14',
 u'days_to_cancel': u'65',
 u'is_canceled': u'True',
 u'is_udacity': u'True',
 u'join_date': u'2014-11-10',
 u'status': u'canceled'}
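To see the "everything is a string" behavior in isolation, here is a minimal sketch. It assumes Python 3 and the standard-library csv module rather than the unicodecsv package used in this post, and the two-row sample is made up for illustration; it is not taken from the real enrollments.csv.

```python
import csv
import io

# Hypothetical two-row sample standing in for enrollments.csv.
sample = io.StringIO(
    "account_key,days_to_cancel,is_canceled\n"
    "448,65,True\n"
    "700,,False\n"
)

rows = list(csv.DictReader(sample))
first = rows[0]

# Every value comes back as a string, whatever it looks like.
value_types = {key: type(value).__name__ for key, value in first.items()}
print(value_types)

# "*" on the raw string repeats the text instead of multiplying a number.
repeated = first["days_to_cancel"] * 2
print(repeated)

# After an explicit conversion the arithmetic behaves numerically.
doubled = int(first["days_to_cancel"]) * 2
print(doubled)
```

Note that the second sample row has an empty days_to_cancel field, which is read as an empty string rather than as a missing value.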

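Next, the fields in engagement: acct is the account ID, lessons_completed is the number of lessons completed, num_courses_visited is the number of courses visited, projects_completed is the number of projects completed, total_minutes_visited is the total time spent in minutes, and utc_date is presumably the date on which these figures were recorded.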

{u'acct': u'0',
 u'lessons_completed': u'0.0',
 u'num_courses_visited': u'1.0',
 u'projects_completed': u'0.0',
 u'total_minutes_visited': u'11.6793745',
 u'utc_date': u'2015-01-09'}

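Finally, the submissions data: account_key is the account ID, assigned_rating is the rating the project received, completion_date is when the project was completed, creation_date is when the submission was created, lesson_key is the lesson code, and processing_state is the evaluation state of the submission.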

{u'account_key': u'256',
 u'assigned_rating': u'UNGRADED',
 u'completion_date': u'2015-01-16',
 u'creation_date': u'2015-01-14',
 u'lesson_key': u'3176718735',
 u'processing_state': u'EVALUATED'}
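
To summarize: numeric, date, and Boolean values are all currently stored as strings, and we need to convert them. Let's take them one at a time.

account_key is an account ID. It never takes part in arithmetic, so converting it makes no difference, and we leave it as is. days_to_cancel can be converted directly with int. total_minutes_visited is a decimal number and can be converted directly with float. is_canceled should become a Boolean: compare it with the string 'True'. If they are equal, the result is the Boolean True; otherwise it is the Boolean False. Those are the only two cases, because is_canceled is always either 'True' or 'False'.

That leaves the dates. I didn't know how to convert these, so I turned to the official documentation (remember that we are using Python 2.7, not 3.5). Searching for "date" returns many results, and the first one, "Basic date and time types", looks promising.

Opening it reveals a lot of material. In short, datetime is a module that provides several classes. datetime.date is a plain calendar date with the attributes year, month, and day. datetime.time is a time of day with the attributes hour, minute, second, microsecond, and tzinfo (time zone information). datetime.datetime combines a date and a time, with the attributes year, month, day, hour, minute, second, microsecond, and tzinfo. Also worth noting is datetime.timedelta, which represents the difference between two date, time, or datetime instances, to microsecond resolution.

Reading further down the (very long) page, we find a class method: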
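The classes described above can be tried out directly. The following is a small sketch rather than anything from the course material; the dates are the join_date and cancel_date of the sample enrollment shown earlier, and the 16:30 time of day is an arbitrary value chosen for illustration.

```python
from datetime import date, datetime, time, timedelta

join_day = date(2014, 11, 10)
cancel_day = date(2015, 1, 14)

# Subtracting two dates yields a timedelta.
elapsed = cancel_day - join_day
print(type(elapsed).__name__, elapsed.days)

# A datetime is a date and a time of day combined.
moment = datetime.combine(join_day, time(16, 30))
print(moment)

# Adding a timedelta to a date gives another date.
later = join_day + timedelta(days=65)
print(later)
```

The difference comes out as 65 days, which agrees with the days_to_cancel value of that sample record.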

classmethod datetime.strptime(date_string, format)
Return a datetime corresponding to date_string, parsed according to format. This is equivalent to datetime(*(time.strptime(date_string, format)[0:6])). ValueError is raised if the date_string and format can’t be parsed by time.strptime() or if it returns a value which isn’t a time tuple. For a complete list of formatting directives, see section strftime() and strptime() Behavior.

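That pretty much covers how the class method works. It takes two arguments: date_string and format. date_string is easy to understand: it is the date in string form, for example '2015-01-09'. But what is format? It is the layout of the date string. Dates can be written in many ways, so how do we describe the one we have? The official documentation gives an example.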

>>> dt = datetime.strptime("21/11/06 16:30", "%d/%m/%y %H:%M")
>>> dt
datetime.datetime(2006, 11, 21, 16, 30)

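So the format is built from prescribed directives. The last sentence of the datetime.strptime(date_string, format) entry links to the full list: "For a complete list of formatting directives, see section strftime() and strptime() Behavior." Following that link, we find that the directives are case-sensitive: %Y is a four-digit year while %y is a two-digit year, %m is the month while %M is the minute, and %d is the day of the month.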
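The effect of upper and lower case in the directives is easy to check. This is a short sketch using only datetime.strptime from the standard library; the date strings are illustrative.

```python
from datetime import datetime

# %Y expects a four-digit year, %y a two-digit year.
four_digit = datetime.strptime("2015-01-14", "%Y-%m-%d")
two_digit = datetime.strptime("15-01-14", "%y-%m-%d")
print(four_digit, two_digit)

# %m is the month, %M is the minute.
with_time = datetime.strptime("2015-01-14 16:30", "%Y-%m-%d %H:%M")
print(with_time.month, with_time.minute)

# Using %y on a four-digit year does not match and raises ValueError.
try:
    datetime.strptime("2015-01-14", "%y-%m-%d")
    mismatch = None
except ValueError as error:
    mismatch = str(error)
print(mismatch)
```

For the dates in these CSV files, which look like 2015-01-14, the matching format is therefore '%Y-%m-%d'.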
Now let's try converting the data in enrollments:

from datetime import datetime

# First attempt: convert every field without checking for empty values
for enrollment in enrollments:
    enrollment['days_to_cancel'] = int(enrollment['days_to_cancel'])
    enrollment['is_canceled'] = (enrollment['is_canceled'] == 'True')
    enrollment['is_udacity'] = (enrollment['is_udacity'] == 'True')
    enrollment['cancel_date'] = datetime.strptime(enrollment['cancel_date'], '%Y-%m-%d')
    enrollment['join_date'] = datetime.strptime(enrollment['join_date'], '%Y-%m-%d')

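It fails immediately with ValueError: invalid literal for int() with base 10: ''. Time to debug, trying one line at a time.

How do we try one line at a time? Keep the file-reading code below (referred to from here on as the "reading code") and test the new code step by step.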

import unicodecsv

def read_csv(filename):
    # unicodecsv expects the file to be opened in binary mode
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)

enrollments = read_csv('enrollments.csv')
daily_engagement = read_csv('daily_engagement.csv')
project_submissions = read_csv('project_submissions.csv')

Let's first check whether the Boolean conversion works:

from datetime import datetime

for enrollment in enrollments:
    enrollment['is_canceled'] = (enrollment['is_canceled'] == 'True')
    enrollment['is_udacity'] = (enrollment['is_udacity'] == 'True')

enrollments[0]

It runs without problems. The result is below, and is_canceled and is_udacity have been converted to Booleans:
{u'account_key': u'448',
 u'cancel_date': u'2015-01-14',
 u'days_to_cancel': u'65',
 u'is_canceled': True,
 u'is_udacity': True,
 u'join_date': u'2014-11-10',
 u'status': u'canceled'}
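Why compare with 'True' instead of calling bool()? The sketch below illustrates the difference. The helper name to_bool and the sample record are hypothetical and only serve the illustration.

```python
def to_bool(text):
    """Hypothetical helper: True only when the field is exactly 'True'."""
    return text == "True"

# bool() only checks for emptiness, so any non-empty string is truthy.
naive = bool("False")

record = {"is_canceled": "True", "is_udacity": "False"}
record["is_canceled"] = to_bool(record["is_canceled"])
record["is_udacity"] = to_bool(record["is_udacity"])
print(naive, record)

# A second pass over already-converted data: True == "True" is False.
second_pass = to_bool(record["is_canceled"])
print(second_pass)
```

The comparison handles both 'True' and 'False' correctly on the first pass, but it is not safe to apply twice to the same record.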
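Next, add the int conversion and try again. Remember to run the "reading code" again before trying new code. Otherwise enrollment['is_canceled'] has already been converted to a Boolean, and a Boolean never equals the string 'True', so every value would end up as False.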

from datetime import datetime

for enrollment in enrollments:
    enrollment['is_canceled'] = (enrollment['is_canceled'] == 'True')
    enrollment['is_udacity'] = (enrollment['is_udacity'] == 'True')
    enrollment['days_to_cancel'] = int(enrollment['days_to_cancel'])

enrollments[0]

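It fails with the same error as the first run: ValueError: invalid literal for int() with base 10: ''. What does it mean? int() was given an argument it cannot parse. Checking the raw data shows that days_to_cancel is empty for some rows, meaning the account is still active and the course has not been canceled. Let's adjust: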

from datetime import datetime

for enrollment in enrollments:
    enrollment['is_canceled'] = (enrollment['is_canceled'] == 'True')
    enrollment['is_udacity'] = (enrollment['is_udacity'] == 'True')

    if enrollment['days_to_cancel'] != '':
        enrollment['days_to_cancel'] = int(enrollment['days_to_cancel'])

enrollments[0]

The result is below, and days_to_cancel has been converted as well:
{u'account_key': u'448',
 u'cancel_date': u'2015-01-14',
 u'days_to_cancel': 65,
 u'is_canceled': True,
 u'is_udacity': True,
 u'join_date': u'2014-11-10',
 u'status': u'canceled'}
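The empty-value check can also be wrapped in a small function. This is only a sketch: the name parse_maybe_int is made up here, and returning None for an empty field is a design choice that differs from the loop above, which leaves the empty string in place.

```python
def parse_maybe_int(text):
    """Return None for an empty field, otherwise the integer value."""
    if text == "":
        return None
    return int(text)

# int() on an empty string is what raised the ValueError earlier.
try:
    int("")
    error_message = None
except ValueError as error:
    error_message = str(error)
print(error_message)

converted = [parse_maybe_int(value) for value in ["65", "", "0"]]
print(converted)
```

Using None makes a missing value easy to tell apart from a real 0, at the cost of having to handle None wherever the field is used later.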
Next comes the date conversion:

from datetime import datetime

for enrollment in enrollments:
    enrollment['is_canceled'] = (enrollment['is_canceled'] == 'True')
    enrollment['is_udacity'] = (enrollment['is_udacity'] == 'True')

    if enrollment['days_to_cancel'] != '':
        enrollment['days_to_cancel'] = int(enrollment['days_to_cancel'])

    enrollment['cancel_date'] = datetime.strptime(enrollment['cancel_date'], '%Y-%m-%d')
    enrollment['join_date'] = datetime.strptime(enrollment['join_date'], '%Y-%m-%d')

enrollments[0]

This fails with: time data '' does not match format '%Y-%m-%d'. It looks like empty values again, so we need another adjustment:

from datetime import datetime

for enrollment in enrollments:
    enrollment['is_canceled'] = (enrollment['is_canceled'] == 'True')
    enrollment['is_udacity'] = (enrollment['is_udacity'] == 'True')

    if enrollment['days_to_cancel'] != '':
        enrollment['days_to_cancel'] = int(enrollment['days_to_cancel'])
    if enrollment['cancel_date'] != '':
        enrollment['cancel_date'] = datetime.strptime(enrollment['cancel_date'], '%Y-%m-%d')
    if enrollment['join_date'] != '':
        enrollment['join_date'] = datetime.strptime(enrollment['join_date'], '%Y-%m-%d')

enrollments[0]

This time the program runs cleanly, with the following result:
{u'account_key': u'448',
 u'cancel_date': datetime.datetime(2015, 1, 14, 0, 0),
 u'days_to_cancel': 65,
 u'is_canceled': True,
 u'is_udacity': True,
 u'join_date': datetime.datetime(2014, 11, 10, 0, 0),
 u'status': u'canceled'}
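The date handling can be sketched as a helper too. The function name parse_maybe_date is hypothetical, and the extra isinstance check is an assumption of this sketch: it lets the conversion run a second time on already-converted data without raising, which the loop above does not allow.

```python
from datetime import datetime

def parse_maybe_date(value, date_format="%Y-%m-%d"):
    """Parse a date string; pass through datetimes, map '' to None."""
    if isinstance(value, datetime):
        return value          # already converted on an earlier run
    if value == "":
        return None           # empty field, e.g. no cancel_date yet
    return datetime.strptime(value, date_format)

first_pass = parse_maybe_date("2015-01-14")
second_pass = parse_maybe_date(first_pass)
empty = parse_maybe_date("")
print(first_pass, second_pass, empty)

# Without the guard, strptime rejects the empty string.
try:
    datetime.strptime("", "%Y-%m-%d")
    raised = False
except ValueError:
    raised = True
print(raised)
```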

Now let's handle the other two files, daily_engagement and project_submissions, starting with daily_engagement.

for engagement in daily_engagement:
    engagement['lessons_completed'] = int(engagement['lessons_completed'])
    engagement['num_courses_visited'] = int(engagement['num_courses_visited'])
    engagement['projects_completed'] = int(engagement['projects_completed'])
    engagement['total_minutes_visited'] = int(engagement['total_minutes_visited'])

    if engagement['utc_date'] != '':
        engagement['utc_date'] = int(engagement['utc_date'])

daily_engagement[0]

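Another error: ValueError: invalid literal for int() with base 10: '0.0'. It turns out the count fields (lessons_completed, num_courses_visited, and projects_completed) have to go through float first. In addition, total_minutes_visited should simply be a float, and utc_date is a date rather than an integer. Let's adjust again.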

for engagement in daily_engagement:
    engagement['lessons_completed'] = int(float(engagement['lessons_completed']))
    engagement['num_courses_visited'] = int(float(engagement['num_courses_visited']))
    engagement['projects_completed'] = int(float(engagement['projects_completed']))
    engagement['total_minutes_visited'] = float(engagement['total_minutes_visited'])
    engagement['utc_date'] = datetime.strptime(engagement['utc_date'], '%Y-%m-%d')

daily_engagement[0]

The program runs normally, with the following result:
{u'acct': u'0',
 u'lessons_completed': 0,
 u'num_courses_visited': 1,
 u'projects_completed': 0,
 u'total_minutes_visited': 11.6793745,
 u'utc_date': datetime.datetime(2015, 1, 9, 0, 0)}
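The int(float(...)) step can be checked on its own. A brief sketch with the sample values from the record above:

```python
# int() refuses a string that contains a decimal point.
try:
    int("0.0")
    direct_error = None
except ValueError as error:
    direct_error = str(error)
print(direct_error)

# Going through float first works for whole-number strings such as "1.0".
courses_visited = int(float("1.0"))
print(courses_visited)

# int() truncates, so a genuinely fractional field should stay a float.
minutes = float("11.6793745")
truncated = int(minutes)
print(minutes, truncated)
```

The truncation in the last step is why total_minutes_visited is kept as a float rather than converted to an integer.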
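
Finally, the project_submissions data can be handled the same way; it makes a good exercise.
The code so far is not very concise and can be tidied up later.
To wrap up, here is all of the code: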

import unicodecsv
from datetime import datetime

def read_csv(filename):
    # unicodecsv expects the file to be opened in binary mode
    with open(filename, 'rb') as f:
        reader = unicodecsv.DictReader(f)
        return list(reader)

enrollments = read_csv('enrollments.csv')
daily_engagement = read_csv('daily_engagement.csv')
project_submissions = read_csv('project_submissions.csv')

for enrollment in enrollments:
    # Booleans: compare with the string 'True'
    enrollment['is_canceled'] = (enrollment['is_canceled'] == 'True')
    enrollment['is_udacity'] = (enrollment['is_udacity'] == 'True')

    # Skip empty fields, which int() and strptime() cannot parse
    if enrollment['days_to_cancel'] != '':
        enrollment['days_to_cancel'] = int(enrollment['days_to_cancel'])
    if enrollment['cancel_date'] != '':
        enrollment['cancel_date'] = datetime.strptime(enrollment['cancel_date'], '%Y-%m-%d')
    if enrollment['join_date'] != '':
        enrollment['join_date'] = datetime.strptime(enrollment['join_date'], '%Y-%m-%d')

enrollments[0]

for engagement in daily_engagement:
    # Counts are stored like '1.0', so go through float before int
    engagement['lessons_completed'] = int(float(engagement['lessons_completed']))
    engagement['num_courses_visited'] = int(float(engagement['num_courses_visited']))
    engagement['projects_completed'] = int(float(engagement['projects_completed']))
    engagement['total_minutes_visited'] = float(engagement['total_minutes_visited'])
    engagement['utc_date'] = datetime.strptime(engagement['utc_date'], '%Y-%m-%d')

daily_engagement[0]

That looks like a lot; we will simplify it in the next post.
