Iterating over and filtering XML files in Python: iterate through all sub-tags and strings in an XML tag in Python, without specifying the sub-tag names...
My question is an add on from here, but I'm not meant to use the answer section for add-on questions.
If I have part of an XML file like this:
<eligibility>
  <criteria>
    <textblock>
      Inclusion Criteria:
      - women undergoing cesarean section for any indication
      - literate in german language
      Exclusion Criteria:
      - history of keloids
      - previous transversal suprapubic scars
      - known patient hypersensitivity to any of the suture materials used in the protocol
      - a medical disorder that could affect wound healing (eg, diabetes mellitus, chronic
      corticosteroid use)
    </textblock>
  </criteria>
  <gender>Female</gender>
  <minimum_age>18 Years</minimum_age>
  <maximum_age>45 Years</maximum_age>
  <healthy_volunteers>No</healthy_volunteers>
</eligibility>
I want to pull out all of the strings in this eligibility section (i.e. the string in the textblock section and the gender, minimum age, maximum age and healthy volunteers sections).
Using the code from the linked question, I did this:
import sys
from bs4 import BeautifulSoup

# Parse the XML file given on the command line
with open(sys.argv[1], 'r') as f:
    soup = BeautifulSoup(f, 'lxml')

eligibi = []
for eligibility in soup.find_all('eligibility'):
    # Every sub-tag has to be named explicitly here
    d = {'other_name': eligibility.criteria.textblock.string,
         'gender': eligibility.gender.string}
    eligibi.append(d)
print(eligibi)
My problem is I have many files. Sometimes the structure of the XML file might be:
eligibility -> criteria -> textblock -> text
eligibility -> other things (e.g. gender as above) -> text
eligibility -> text
e.g.
Is there a way to just say 'take all of the sub-headings and their texts'?
So in the above example, the list/dictionary would contain:
{criteria textblock: inclusion and exclusion criteria, gender: xxx, minimum_age: xxx, maximum_age: xxx, healthy_volunteers: xxx}
My problem is that, in reality, I am not going to know all the specific sub-tags of the eligibility tag, as each experiment could be different (e.g. maybe some say 'pregnant women accepted', 'drug history of XXX accepted', etc.).
So what I want is this: if I give it a tag name, it gives me all the sub-tags of that tag and the text of those sub-tags in a dictionary.
Extended XML for comment:
Subcutaneous Adaption and Cosmetic Outcome Following Caesarean Delivery
Klinikum Klagenfurt am Wörthersee
...and then the eligibility XML section above.
Solution
Since you have lxml installed, you can try the following (this code assumes that the tag names of the leaf elements within a given element, i.e. eligibility, are unique; if a tag name repeats, the later value overwrites the earlier one):
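import sys
from lxml import etree

# Parse the XML file given on the command line
tree = etree.parse(sys.argv[1])
root = tree.getroot()

eligibi = []
for eligibility in root.xpath('//eligibility'):
    d = {}
    # Select every leaf element (one with no child elements) at any depth
    for e in eligibility.xpath('.//*[not(*)]'):
        d[e.tag] = e.text
    eligibi.append(d)
print(eligibi)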
XPath explanation:
.//* : find all elements within the current eligibility element, at any depth (//) and with any tag name (*)
[not(*)] : filter the elements found by the previous step down to those that have no child elements, i.e. leaf elements
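The same leaf-element idea can also be sketched with only the standard library, which may help when trying it out without a file on disk. This is a minimal sketch, not the answer's own code: it swaps lxml's XPath for xml.etree.ElementTree and a len(e) == 0 check (ElementTree's limited XPath support does not include the [not(*)] predicate). The helper name leaf_texts and the inline SAMPLE document are hypothetical, modelled on the eligibility section from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline sample, modelled on the eligibility section in the question
SAMPLE = """
<clinical_study>
  <eligibility>
    <criteria>
      <textblock>Inclusion Criteria: - women undergoing cesarean section</textblock>
    </criteria>
    <gender>Female</gender>
    <minimum_age>18 Years</minimum_age>
    <maximum_age>45 Years</maximum_age>
    <healthy_volunteers>No</healthy_volunteers>
  </eligibility>
</clinical_study>
"""


def leaf_texts(root, tag_name):
    """Return one dict per <tag_name> element, mapping leaf tag -> stripped text.

    Assumes leaf tag names are unique within each <tag_name> element;
    a repeated tag name overwrites the earlier value.
    """
    results = []
    for parent in root.iter(tag_name):
        d = {}
        # parent.iter() yields the parent itself and then all descendants
        for e in parent.iter():
            # len(e) == 0 means "no child elements", i.e. a leaf element.
            # This also covers the "eligibility -> text" case, where the
            # parent itself is a leaf that holds the text directly.
            if len(e) == 0:
                d[e.tag] = (e.text or '').strip()
        results.append(d)
    return results


root = ET.fromstring(SAMPLE)
result = leaf_texts(root, 'eligibility')
print(result)
```

Because the helper takes the tag name as an argument, the same call works for any other section of the file, and no sub-tag names need to be known in advance.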